# PDF Splitter
A split-process-merge pipeline for converting large PDFs into Docling documents. It is useful for projects such as soliplex that need to break large PDF files apart before processing.
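The three stages of the pipeline can be sketched in plain Python. Everything below is an illustrative stand-in, not this library's API: pages are modeled as strings, and `process` fakes the per-chunk conversion that the real tool distributes across worker processes.

```python
def split(pages: list[str], max_pages: int) -> list[list[str]]:
    """Stage 1: break the page list into fixed-size chunks."""
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]

def process(chunk: list[str]) -> list[str]:
    """Stage 2: convert one chunk (stand-in for the real conversion step).
    The real pipeline runs this stage in parallel worker processes."""
    return [page.upper() for page in chunk]

def merge(results: list[list[str]]) -> list[str]:
    """Stage 3: reassemble per-chunk results in original page order."""
    return [page for chunk in results for page in chunk]

pages = [f"page-{i}" for i in range(1, 8)]
chunks = split(pages, max_pages=3)   # three chunks: 3 + 3 + 1 pages
merged = merge([process(c) for c in chunks])
```

Because each chunk is processed independently, the merge step only has to concatenate results in chunk order.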
## Installation

From source:

```shell
git clone https://github.com/soliplex/pdf-splitter.git
cd pdf-splitter
```

With pip:

```shell
pip install -e .          # core only
pip install -e ".[dev]"   # with dev tools
```

With uv:

```shell
uv sync               # core only
uv sync --group dev   # with dev tools
```
Requires Python 3.12+.
## Usage

```shell
pdf-splitter analyze doc.pdf                # analyze structure
pdf-splitter chunk doc.pdf -o ./chunks      # split into chunks
pdf-splitter convert ./chunks -o out.json   # process & merge
pdf-splitter validate out.json ./chunks     # validate output
```
### Options

| Option | Description |
|---|---|
| `-v` | Verbose logging |
| `-s <strategy>` | Force a splitting strategy: `fixed`, `hybrid`, or `enhanced` |
| `--max-pages N` | Maximum pages per chunk (default: 100) |
| `-w N` | Number of worker processes |
| `--keep-parts` | Keep the individual chunk outputs |
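Of the strategies above, `fixed` is presumably the simplest: it cuts on page count alone. Assuming that reading of the name, the boundary arithmetic for `--max-pages` looks like this (`fixed_boundaries` is an illustrative helper, not part of the tool):

```python
def fixed_boundaries(total_pages: int, max_pages: int = 100) -> list[tuple[int, int]]:
    """Return inclusive 1-based (start, end) page ranges for each chunk."""
    return [(start, min(start + max_pages - 1, total_pages))
            for start in range(1, total_pages + 1, max_pages)]

fixed_boundaries(250)  # [(1, 100), (101, 200), (201, 250)]
```

The last chunk simply absorbs the remainder, which is why a 250-page document at the default `--max-pages 100` yields chunks of 100, 100, and 50 pages.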
## Python API

```python
from pdf_splitter.segmentation_enhanced import smart_split_to_files
from pdf_splitter.processor import BatchProcessor
from pdf_splitter.reassembly import merge_from_results

chunks, _ = smart_split_to_files("doc.pdf", output_dir="./chunks")
results = BatchProcessor(max_workers=4).execute_parallel(chunks)
merged = merge_from_results(results)
merged.export_to_json("output.json")
```