
by bsinriclawd

pymupdf-pdf – OpenClaw Skill

pymupdf-pdf is an OpenClaw Skills integration for coding workflows. Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders.

1.1k stars, 3.7k forks, Security L1
Updated Feb 7, 2026, Created Feb 7, 2026, coding

Skill Snapshot

name: pymupdf-pdf
description: Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders. OpenClaw Skills integration.
owner: bsinriclawd
repository: bsinriclawd/hvac-estimate-takeoff (path: pymupdf-pdf-parser-clawdbot-skill)
language: Markdown
license: MIT
topics:
security: L1
install: openclaw add @bsinriclawd/hvac-estimate-takeoff:pymupdf-pdf-parser-clawdbot-skill
last updated: Feb 7, 2026

Maintainer

bsinriclawd

Maintains pymupdf-pdf in the OpenClaw Skills directory.

File Explorer
6 files

pymupdf-pdf-parser-clawdbot-skill/
├── references/
│   └── pymupdf-notes.md    # 595 B
├── scripts/
│   └── pymupdf_parse.py    # 3.4 KB
├── README.md               # 3.4 KB
└── SKILL.md                # 1.4 KB

SKILL.md

name: pymupdf-pdf
description: Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders.

PyMuPDF PDF

Overview

Parse PDFs locally using PyMuPDF for fast, lightweight extraction into Markdown by default, with optional JSON and image/table outputs in a per-document directory.

Prereqs / when to read references

If you hit import errors (PyMuPDF not installed) or Nix libstdc++ issues, read:

  • references/pymupdf-notes.md

Quick start (single PDF)

# Run from the skill directory
./scripts/pymupdf_parse.py /path/to/file.pdf \
  --format md \
  --outroot ./pymupdf-output

Options

  • --format md|json|both (default: md)
  • --images to extract images
  • --tables to extract a simple line-based table JSON (quick/rough)
  • --outroot DIR to change output root
  • --lang adds a language hint into JSON output metadata
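
The options above could be wired up roughly as follows. This is only a sketch: the actual contents of scripts/pymupdf_parse.py are not shown here, so the function name build_parser and the help strings are assumptions. The flag names and defaults are taken from the option lists in this document.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real script)."""
    parser = argparse.ArgumentParser(description="Parse a PDF with PyMuPDF")
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--format", choices=["md", "json", "both"], default="md",
                        help="output format")
    parser.add_argument("--outroot", default="./pymupdf-output",
                        help="output root directory")
    parser.add_argument("--images", action="store_true",
                        help="extract embedded images")
    parser.add_argument("--tables", action="store_true",
                        help="extract a rough line-based table JSON")
    parser.add_argument("--lang", default="en",
                        help="language hint stored in JSON metadata")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["report.pdf", "--format", "both", "--images"])
print(args.format, args.images, args.tables, args.outroot)
```

Using choices for --format means an unsupported value is rejected by argparse before any parsing work starts.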

Output conventions

  • Create ./pymupdf-output/<pdf-basename>/ by default.
  • Markdown output: output.md
  • JSON output: output.json (includes lang)
  • Images: images/ subdir
  • Tables: tables.json (rough line-based)
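
As an illustration of these conventions, the per-document layout can be computed with pathlib. The helper name output_paths is hypothetical, and it assumes that <pdf-basename> means the file name without its extension.

```python
from pathlib import Path


def output_paths(pdf_path: str, outroot: str = "./pymupdf-output") -> dict:
    """Hypothetical helper: map a PDF path to the documented output locations."""
    # Assumption: <pdf-basename> is the file name with the .pdf suffix removed.
    doc_dir = Path(outroot) / Path(pdf_path).stem
    return {
        "dir": doc_dir,
        "markdown": doc_dir / "output.md",
        "json": doc_dir / "output.json",
        "images": doc_dir / "images",
        "tables": doc_dir / "tables.json",
    }


paths = output_paths("/path/to/file.pdf")
for name, location in paths.items():
    print(f"{name}: {location}")
```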

Notes

  • PyMuPDF is fast but less robust on complex PDFs.
  • For more robust parsing, use a heavy-duty OCR parser (e.g., MinerU) if installed.
README.md

PyMuPDF PDF Parser - Clawdbot Skill

A Clawdbot skill for fast, lightweight PDF parsing using PyMuPDF (fitz). Ideal for quick text extraction when speed matters.

Features

  • Fast processing — Parses PDFs in ~1 second per page
  • Lightweight — Single pip dependency, no heavy models
  • Markdown output — Clean text extraction with page markers
  • JSON output — Simple structured text per page
  • Image extraction — Optional embedded image extraction
  • NixOS compatible — Includes notes for libstdc++ issues
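
A minimal sketch of text extraction with page markers is shown below. The "## Page N" marker format and both function names are assumptions, since the skill's real marker style is not documented here. The PyMuPDF calls used (fitz.open and Page.get_text) are part of the library's public API, and the import is deferred so the rendering helper can run without PyMuPDF installed.

```python
def render_markdown(page_texts):
    """Join per-page text into Markdown, one marker heading per page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # Assumed marker format; the skill may use a different one.
        parts.append(f"## Page {number}\n\n{text.strip()}\n")
    return "\n".join(parts)


def extract_page_texts(pdf_path):
    """Read plain text from each page of a PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so render_markdown works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text("text") for page in doc]


print(render_markdown(["Hello\n", "World"]))
```

With PyMuPDF installed, render_markdown(extract_page_texts("/path/to/document.pdf")) would produce the Markdown for a real file.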

Installation

Prerequisites

  1. Python 3.8+
  2. PyMuPDF: pip install pymupdf
  3. Clawdbot installed

Install the skill

# Clone the repo
git clone https://github.com/kesslerio/PyMuPDF-PDF-Parser-Clawdbot-Skill.git

# Or copy the pymupdf-pdf/ folder to your Clawdbot skills directory
cp -r PyMuPDF-PDF-Parser-Clawdbot-Skill/pymupdf-pdf ~/.clawdbot/skills/

# Install dependency
pip install pymupdf

NixOS users

If you hit libstdc++ import errors:

export LD_LIBRARY_PATH=/nix/store/<your-gcc-lib-path>/lib

See pymupdf-pdf/references/pymupdf-notes.md for details.
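
Before running the parser, a quick import check can tell the two failure modes apart. The sketch below is an assumption about how such a check might look and is not part of the skill. It relies on the fact that both a missing package and a shared library that fails to load surface as ImportError.

```python
import importlib


def diagnose(error_message: str) -> str:
    """Turn an ImportError message into a hint based on the notes above."""
    if "libstdc++" in error_message:
        return ("libstdc++ could not be loaded; on NixOS, try setting "
                "LD_LIBRARY_PATH to your gcc lib directory.")
    return "PyMuPDF does not appear to be importable; try: pip install pymupdf"


def check_pymupdf(module_name: str = "fitz"):
    """Return (ok, message) describing whether PyMuPDF can be imported."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        return False, diagnose(str(exc))
    return True, "PyMuPDF imported successfully"


ok, message = check_pymupdf()
print(ok, message)
```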

Usage

Quick start

# Run from the skill directory
./scripts/pymupdf_parse.py /path/to/document.pdf

Options

./scripts/pymupdf_parse.py /path/to/document.pdf --format json
./scripts/pymupdf_parse.py /path/to/document.pdf --format both --images
./scripts/pymupdf_parse.py /path/to/document.pdf --outroot ./my-output

| Option | Default | Description |
| --- | --- | --- |
| --format | md | Output format: md, json, or both |
| --outroot | ./pymupdf-output | Output root directory |
| --images | off | Extract embedded images |
| --tables | off | Extract line-based table approximation |
| --lang | en | Language hint (stored in JSON metadata) |
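
To give a sense of what a "line-based table approximation" can mean, here is one possible heuristic: treat any text line that splits into several cells on tabs or runs of two or more spaces as a table row. This is an illustrative assumption and not the algorithm used by --tables, which is not documented here.

```python
import re


def rough_table(text: str, min_columns: int = 2):
    """Hypothetical heuristic: lines with wide gaps between tokens become rows."""
    rows = []
    for line in text.splitlines():
        # Split on a tab or on two or more consecutive whitespace characters.
        cells = [cell for cell in re.split(r"\s{2,}|\t", line.strip()) if cell]
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


sample = "Item   Qty   Price\nDuct  4  12.50\nA plain sentence."
print(rough_table(sample))
```

A heuristic like this is quick but fragile: prose with stray double spaces is misread as a row, and cells containing wide internal gaps are split apart, which matches the "quick/rough" caveat.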

Output

Creates a per-document folder under the output root:

./pymupdf-output/
└── document-name/
    ├── output.md      # Markdown with page markers
    ├── output.json    # Simple JSON (~1KB, text per page)
    ├── images/        # Extracted images (if --images)
    └── tables.json    # Line-based tables (if --tables)

Output quality

PyMuPDF produces fast, minimal output:

  • Plain text extraction (no layout preservation)
  • Simple JSON with text per page
  • Optional image extraction
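
For orientation, a "simple JSON with text per page" payload might be assembled as in the sketch below. The key names lang, pages, page, and text are assumptions; the document only states that output.json holds text per page and includes the language hint.

```python
import json


def build_json(page_texts, lang="en"):
    """Hypothetical payload shape: language hint plus one entry per page."""
    return {
        "lang": lang,
        "pages": [
            {"page": number, "text": text}
            for number, text in enumerate(page_texts, start=1)
        ],
    }


payload = build_json(["First page text", "Second page text"])
print(json.dumps(payload, indent=2))
```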

Best for: Quick text extraction, batch processing, or when speed matters.

Comparison with MinerU

| Aspect | PyMuPDF | MinerU |
| --- | --- | --- |
| Speed | Fast (~1s/page) | Slower (~15-30s/page) |
| JSON output | Minimal (~1KB, text only) | Rich (~50KB+, layout data) |
| Image extraction | Optional | Automatic |
| Layout preservation | Basic | Excellent |
| Dependencies | Light (pip install) | Heavy (~20GB models) |

Use PyMuPDF when: Speed matters or for simple text extraction.
Use MinerU when: Quality and structure matter more than speed.
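
The fallback idea can be sketched as a small selection step that prefers the heavier parser when its command is on PATH and otherwise uses PyMuPDF. The command name "mineru" is an assumption about how MinerU is installed on a given system, and pick_parser is a hypothetical helper, not part of this skill.

```python
import shutil


def pick_parser(heavy_cmd: str = "mineru") -> str:
    """Return 'mineru' if the assumed command is on PATH, else 'pymupdf'."""
    # shutil.which returns None when the executable cannot be found.
    return "mineru" if shutil.which(heavy_cmd) else "pymupdf"


print(pick_parser())
```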

License

Apache 2.0

Contributing

Issues and PRs welcome. Please test changes with various PDF types before submitting.

Permissions & Security

Security level L1: Low-risk skills with minimal permissions. Review inputs and outputs before running in production.

Requirements

  • OpenClaw CLI installed and configured.
  • Language: Markdown
  • License: MIT
  • Topics:

FAQ

How do I install pymupdf-pdf?

Run openclaw add @bsinriclawd/hvac-estimate-takeoff:pymupdf-pdf-parser-clawdbot-skill in your terminal. This installs pymupdf-pdf into your OpenClaw Skills catalog.

Does this skill run locally or in the cloud?

OpenClaw Skills execute locally by default. Review the SKILL.md and permissions before running any skill.

Where can I verify the source code?

The source repository is available at https://github.com/openclaw/skills/tree/main/skills/bsinriclawd/hvac-estimate-takeoff. Review commits and README documentation before installing.