5.5k★ by kesslerio
pymupdf-pdf – OpenClaw Skill
pymupdf-pdf is an OpenClaw Skills integration for coding workflows. Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders.
Skill Snapshot
| name | pymupdf-pdf |
| description | Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders. OpenClaw Skills integration. |
| owner | kesslerio |
| repository | kesslerio/pymupdf-pdf-parser-clawdbot-skill |
| language | Markdown |
| license | MIT |
| security | L1 |
| install | openclaw add @kesslerio/pymupdf-pdf-parser-clawdbot-skill |
| last updated | Feb 7, 2026 |
name: pymupdf-pdf
description: Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders.
PyMuPDF PDF
Overview
Parse PDFs locally using PyMuPDF for fast, lightweight extraction into Markdown by default, with optional JSON and image/table outputs in a per-document directory.
Prereqs / when to read references
If you hit import errors (PyMuPDF not installed) or Nix libstdc++ issues, read:
references/pymupdf-notes.md
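Before reaching for the reference notes, it can help to confirm whether PyMuPDF is installed at all. The following is a minimal sketch, not part of the skill: the helper name `pymupdf_available` is hypothetical, and the check only locates the package without importing it, so a libstdc++ loading failure would still surface only on a real import.

```python
import importlib.util


def pymupdf_available() -> bool:
    """Return True if a module named `pymupdf` or `fitz` can be located."""
    # PyMuPDF has traditionally been imported as `fitz`; newer releases also
    # provide a `pymupdf` module. find_spec only locates the module, it does
    # not load the compiled extension.
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("pymupdf", "fitz")
    )


if __name__ == "__main__":
    if pymupdf_available():
        print("PyMuPDF found")
    else:
        print("PyMuPDF missing: try `pip install pymupdf`")
```

If the package is found but the import still fails, the cause is more likely the shared-library issue covered in the notes than a missing install.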
Quick start (single PDF)
# Run from the skill directory
./scripts/pymupdf_parse.py /path/to/file.pdf \
--format md \
--outroot ./pymupdf-output
Options
- --format md|json|both (default: md)
- --images to extract images
- --tables to extract a simple line-based table JSON (quick/rough)
- --outroot DIR to change output root
- --lang adds a language hint into JSON output metadata
Output conventions
- Create ./pymupdf-output/<pdf-basename>/ by default.
- Markdown output: output.md
- JSON output: output.json (includes lang)
- Images: images/ subdir
- Tables: tables.json (rough line-based)
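The conventions above can be sketched as a small path helper. This is an illustration only: the function name `output_paths` is hypothetical, and it assumes that `<pdf-basename>` means the file name without its extension.

```python
from pathlib import Path


def output_paths(pdf_path: str, outroot: str = "./pymupdf-output") -> dict:
    """Map a PDF path to the per-document output locations listed above."""
    # Assumption: <pdf-basename> is the file name without the .pdf suffix.
    doc_dir = Path(outroot) / Path(pdf_path).stem
    return {
        "dir": doc_dir,
        "markdown": doc_dir / "output.md",
        "json": doc_dir / "output.json",
        "images": doc_dir / "images",
        "tables": doc_dir / "tables.json",
    }


paths = output_paths("/path/to/file.pdf")
print(paths["markdown"])
```

Keeping every artifact under one per-document folder means two PDFs parsed into the same output root do not overwrite each other, provided their base names differ.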
Notes
- PyMuPDF is fast but less robust on complex PDFs.
- For more robust parsing, use a heavy-duty OCR parser (e.g., MinerU) if installed.
PyMuPDF PDF Parser - Clawdbot Skill
A Clawdbot skill for fast, lightweight PDF parsing using PyMuPDF (fitz). Ideal for quick text extraction when speed matters.
Features
- Fast processing — Parses PDFs in ~1 second per page
- Lightweight — Single pip dependency, no heavy models
- Markdown output — Clean text extraction with page markers
- JSON output — Simple structured text per page
- Image extraction — Optional embedded image extraction
- NixOS compatible — Includes notes for libstdc++ issues
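To make the Markdown feature concrete, here is a rough sketch of per-page text extraction followed by joining with page markers. It is not the skill's actual script: the function names are hypothetical and the "## Page N" marker format is an assumption. The PyMuPDF calls used (`fitz.open`, iterating a document, `page.get_text()`) are part of the library's public API.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the plain text of each page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so to_markdown works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def to_markdown(page_texts: List[str]) -> str:
    """Join page texts into one Markdown string, one marker per page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # Assumed marker format; the real script may use a different one.
        parts.append(f"## Page {number}\n\n{text.strip()}\n")
    return "\n".join(parts)
```

A typical call would be `to_markdown(extract_page_texts("/path/to/file.pdf"))`, with the result written to output.md.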
Installation
Prerequisites
- Python 3.8+
- PyMuPDF: pip install pymupdf
- Clawdbot installed
Install the skill
# Clone the repo
git clone https://github.com/kesslerio/PyMuPDF-PDF-Parser-Clawdbot-Skill.git
# Or copy the pymupdf-pdf/ folder to your Clawdbot skills directory
cp -r PyMuPDF-PDF-Parser-Clawdbot-Skill/pymupdf-pdf ~/.clawdbot/skills/
# Install dependency
pip install pymupdf
NixOS users
If you hit libstdc++ import errors:
export LD_LIBRARY_PATH=/nix/store/<your-gcc-lib-path>/lib
See pymupdf-pdf/references/pymupdf-notes.md for details.
Usage
Quick start
# Run from the skill directory
./scripts/pymupdf_parse.py /path/to/document.pdf
Options
./scripts/pymupdf_parse.py /path/to/document.pdf --format json
./scripts/pymupdf_parse.py /path/to/document.pdf --format both --images
./scripts/pymupdf_parse.py /path/to/document.pdf --outroot ./my-output
| Option | Default | Description |
|---|---|---|
| --format | md | Output format: md, json, or both |
| --outroot | ./pymupdf-output | Output root directory |
| --images | off | Extract embedded images |
| --tables | off | Extract line-based table approximation |
| --lang | en | Language hint (stored in JSON metadata) |
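The option table maps naturally onto an `argparse` definition. The sketch below mirrors the documented flags and defaults; it is an assumption about how such a command line could be declared, not a copy of scripts/pymupdf_parse.py, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(description="Parse a PDF with PyMuPDF")
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--format", choices=["md", "json", "both"], default="md")
    parser.add_argument("--outroot", default="./pymupdf-output")
    parser.add_argument("--images", action="store_true")  # off unless given
    parser.add_argument("--tables", action="store_true")  # off unless given
    parser.add_argument("--lang", default="en")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["/path/to/document.pdf", "--format", "both", "--images"]
)
print(args.format, args.images, args.outroot)
```

Using `store_true` for --images and --tables matches the "off" defaults in the table: the extra outputs are produced only when the flag is present.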
Output
Creates a per-document folder under the output root:
./pymupdf-output/
└── document-name/
    ├── output.md      # Markdown with page markers
    ├── output.json    # Simple JSON (~1KB, text per page)
    ├── images/        # Extracted images (if --images)
    └── tables.json    # Line-based tables (if --tables)
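The "line-based" table output is described only as a rough approximation. One way such a heuristic could work, offered purely as an assumption and not as the skill's actual logic, is to treat any text line that splits into several cells on tabs or runs of two or more spaces as a table row. The name `rough_table_rows` is hypothetical.

```python
import re
from typing import List


def rough_table_rows(page_text: str, min_columns: int = 2) -> List[List[str]]:
    """Collect lines that look like table rows, split into cells."""
    rows = []
    for line in page_text.splitlines():
        # Assumed heuristic: columns are separated by a tab or 2+ whitespace.
        cells = [c for c in re.split(r"\s{2,}|\t", line.strip()) if c]
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


sample = "Name   Qty  Price\nJust a sentence here.\nWidget  2   9.99"
print(rough_table_rows(sample))
```

A heuristic like this misses tables whose columns are separated by single spaces and can misread aligned prose as rows, which fits the "quick/rough" caveat.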
Output quality
PyMuPDF produces fast, minimal output:
- Plain text extraction (no layout preservation)
- Simple JSON with text per page
- Optional image extraction
Best for: Quick text extraction, batch processing, or when speed matters.
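As an illustration of "simple JSON with text per page", the sketch below builds such a payload from a list of page texts. The key names `lang`, `pages`, `page`, and `text` are assumptions; the source states only that the JSON holds text per page and includes the language hint. `to_json` is a hypothetical helper.

```python
import json
from typing import List


def to_json(page_texts: List[str], lang: str = "en") -> str:
    """Serialize per-page text plus a language hint as a JSON string."""
    payload = {
        "lang": lang,  # language hint stored as metadata
        "pages": [
            {"page": number, "text": text}
            for number, text in enumerate(page_texts, start=1)
        ],
    }
    # ensure_ascii=False keeps non-ASCII text readable in the output file.
    return json.dumps(payload, ensure_ascii=False, indent=2)


print(to_json(["First page text", "Second page text"]))
```

A flat structure like this stays small because it carries no layout or bounding-box data, only the extracted text.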
Comparison with MinerU
| Aspect | PyMuPDF | MinerU |
|---|---|---|
| Speed | Fast (~1s/page) | Slower (~15-30s/page) |
| JSON output | Minimal (~1KB, text only) | Rich (~50KB+, layout data) |
| Image extraction | Optional | Automatic |
| Layout preservation | Basic | Excellent |
| Dependencies | Light (pip install) | Heavy (~20GB models) |
Use PyMuPDF when: Speed matters or for simple text extraction.
Use MinerU when: Quality and structure matter more than speed.
License
Apache 2.0
Contributing
Issues and PRs welcome. Please test changes with various PDF types before submitting.
Related
- MinerU PDF Parser Skill — Rich, layout-aware alternative
- PyMuPDF — The underlying PDF library
- Clawdbot — The AI agent framework
Permissions & Security
Security level L1: Low-risk skills with minimal permissions. Review inputs and outputs before running in production.
Requirements
- OpenClaw CLI installed and configured.
- Language: Markdown
- License: MIT
FAQ
How do I install pymupdf-pdf?
Run openclaw add @kesslerio/pymupdf-pdf-parser-clawdbot-skill in your terminal. This installs pymupdf-pdf into your OpenClaw Skills catalog.
Does this skill run locally or in the cloud?
OpenClaw Skills execute locally by default. Review the SKILL.md and permissions before running any skill.
Where can I verify the source code?
The source repository is available at https://github.com/openclaw/skills/tree/main/skills/kesslerio/pymupdf-pdf-parser-clawdbot-skill. Review commits and README documentation before installing.
