Initial commit
This commit is contained in:
284
skills/pdf/briefs/process-advanced.md
Executable file
284
skills/pdf/briefs/process-advanced.md
Executable file
@@ -0,0 +1,284 @@
|
||||
# Process Brief: Advanced Reference
|
||||
|
||||
Edge-case tools and techniques. **Load only when the main `process.md` doesn't cover the task.**
|
||||
|
||||
```
|
||||
Edge case
|
||||
├─ No text extracted from PDF → §OCR Fallback
|
||||
├─ PDF is encrypted/locked → §Encrypted PDFs
|
||||
├─ PDF is corrupted/damaged → §Corrupted PDFs
|
||||
├─ Need precise text coordinates → §pdfplumber Advanced
|
||||
├─ Need fast rendering → §pypdfium2
|
||||
├─ Advanced page extraction → §poppler-utils Advanced
|
||||
├─ Complex page ranges / merge → §qpdf Page Manipulation
|
||||
├─ Optimize file size → §qpdf Optimization
|
||||
├─ Batch process many PDFs → §Batch Processing
|
||||
└─ Memory issues with large PDF → §Performance Optimization
|
||||
```
|
||||
|
||||
For basic operations (extract, merge, split, fill forms, convert), go back to `process.md`.
|
||||
|
||||
---
|
||||
|
||||
## §pypdfium2 (Apache/BSD License)
|
||||
|
||||
A Python binding for PDFium (Chromium's PDF library). Excellent for fast rendering and text extraction — serves as a PyMuPDF replacement.
|
||||
|
||||
#### Render PDF to Images
|
||||
```python
|
||||
import pypdfium2 as pdfium
|
||||
|
||||
pdf = pdfium.PdfDocument("document.pdf")
|
||||
|
||||
# Render single page
|
||||
page = pdf[0]
|
||||
bitmap = page.render(scale=2.0, rotation=0)
|
||||
img = bitmap.to_pil()
|
||||
img.save("page_1.png", "PNG")
|
||||
|
||||
# Batch render all pages
|
||||
for i, page in enumerate(pdf):
|
||||
bitmap = page.render(scale=1.5)
|
||||
img = bitmap.to_pil()
|
||||
img.save(f"page_{i+1}.jpg", "JPEG", quality=90)
|
||||
```
|
||||
|
||||
#### Extract Text with pypdfium2
|
||||
```python
|
||||
import pypdfium2 as pdfium
|
||||
|
||||
pdf = pdfium.PdfDocument("document.pdf")
|
||||
for i, page in enumerate(pdf):
|
||||
text = page.get_text()
|
||||
print(f"Page {i+1}: {len(text)} chars")
|
||||
```
|
||||
|
||||
## §pdfplumber Advanced Features
|
||||
|
||||
#### Extract Text with Precise Coordinates
|
||||
```python
|
||||
import pdfplumber
|
||||
|
||||
with pdfplumber.open("document.pdf") as pdf:
|
||||
page = pdf.pages[0]
|
||||
|
||||
# All characters with coordinates
|
||||
for char in page.chars[:10]:
|
||||
print(f"'{char['text']}' at x:{char['x0']:.1f} y:{char['y0']:.1f}")
|
||||
|
||||
# Extract text within a specific bounding box (left, top, right, bottom)
|
||||
bbox_text = page.within_bbox((100, 100, 400, 200)).extract_text()
|
||||
```
|
||||
|
||||
#### Advanced Table Extraction with Custom Settings
|
||||
```python
|
||||
import pdfplumber
|
||||
|
||||
with pdfplumber.open("complex_table.pdf") as pdf:
|
||||
page = pdf.pages[0]
|
||||
|
||||
table_settings = {
|
||||
"vertical_strategy": "lines",
|
||||
"horizontal_strategy": "lines",
|
||||
"snap_tolerance": 3,
|
||||
"intersection_tolerance": 15
|
||||
}
|
||||
tables = page.extract_tables(table_settings)
|
||||
|
||||
# Visual debugging
|
||||
img = page.to_image(resolution=150)
|
||||
img.save("debug_layout.png")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## §poppler-utils Advanced Features
|
||||
|
||||
### Extract Text with Bounding Box Coordinates
|
||||
```bash
|
||||
pdftotext -bbox-layout document.pdf output.xml
|
||||
```
|
||||
|
||||
### Advanced Image Conversion
|
||||
```bash
|
||||
# High-resolution PNG
|
||||
pdftoppm -png -r 300 document.pdf output_prefix
|
||||
|
||||
# Specific page range
|
||||
pdftoppm -png -r 600 -f 1 -l 3 document.pdf high_res_pages
|
||||
|
||||
# JPEG with quality setting
|
||||
pdftoppm -jpeg -jpegopt quality=85 -r 200 document.pdf jpeg_output
|
||||
```
|
||||
|
||||
### Extract Embedded Images
|
||||
```bash
|
||||
pdfimages -all document.pdf images/img
|
||||
pdfimages -list document.pdf
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## §qpdf Advanced Features
|
||||
|
||||
### Complex Page Manipulation
|
||||
```bash
|
||||
# Split into groups of N pages
|
||||
qpdf --split-pages=3 input.pdf output_group_%02d.pdf
|
||||
|
||||
# Extract complex page ranges
|
||||
qpdf input.pdf --pages input.pdf 1,3-5,8,10-end -- extracted.pdf
|
||||
|
||||
# Merge specific pages from multiple PDFs
|
||||
qpdf --empty --pages doc1.pdf 1-3 doc2.pdf 5-7 doc3.pdf 2,4 -- combined.pdf
|
||||
```
|
||||
|
||||
### PDF Optimization and Repair
|
||||
```bash
|
||||
qpdf --linearize input.pdf optimized.pdf
|
||||
qpdf --optimize-level=all input.pdf compressed.pdf
|
||||
qpdf --check input.pdf
|
||||
qpdf --fix-qdf damaged.pdf repaired.pdf
|
||||
```
|
||||
|
||||
### Encryption and Decryption
|
||||
```bash
|
||||
qpdf --encrypt user_pass owner_pass 256 --print=none --modify=none -- input.pdf encrypted.pdf
|
||||
qpdf --show-encryption encrypted.pdf
|
||||
qpdf --password=secret123 --decrypt encrypted.pdf decrypted.pdf
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## §Encrypted PDFs
|
||||
|
||||
```python
|
||||
from pypdf import PdfReader
|
||||
|
||||
try:
|
||||
reader = PdfReader("encrypted.pdf")
|
||||
if reader.is_encrypted:
|
||||
reader.decrypt("password")
|
||||
except Exception as e:
|
||||
print(f"Failed to decrypt: {e}")
|
||||
```
|
||||
|
||||
Or via qpdf:
|
||||
```bash
|
||||
qpdf --password=secret123 --decrypt encrypted.pdf decrypted.pdf
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## §Corrupted PDFs
|
||||
|
||||
```bash
|
||||
qpdf --check corrupted.pdf
|
||||
qpdf --replace-input corrupted.pdf
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## §OCR Fallback for Scanned PDFs
|
||||
|
||||
When `pdfplumber` extracts 0 characters, the PDF is likely a scanned image:
|
||||
|
||||
```python
|
||||
import pytesseract
|
||||
from pdf2image import convert_from_path
|
||||
|
||||
def extract_text_with_ocr(pdf_path):
|
||||
images = convert_from_path(pdf_path)
|
||||
text = ""
|
||||
for i, image in enumerate(images):
|
||||
text += pytesseract.image_to_string(image)
|
||||
return text
|
||||
```
|
||||
|
||||
Prerequisites: `pip install pdf2image pytesseract` + install Tesseract OCR and poppler.
|
||||
|
||||
---
|
||||
|
||||
## §Batch Processing with Error Handling
|
||||
|
||||
```python
|
||||
import os, glob
|
||||
from pypdf import PdfReader, PdfWriter
|
||||
import logging
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def batch_process_pdfs(input_dir, operation='merge'):
|
||||
pdf_files = glob.glob(os.path.join(input_dir, "*.pdf"))
|
||||
|
||||
if operation == 'merge':
|
||||
writer = PdfWriter()
|
||||
for pdf_file in pdf_files:
|
||||
try:
|
||||
reader = PdfReader(pdf_file)
|
||||
for page in reader.pages:
|
||||
writer.add_page(page)
|
||||
logger.info(f"Processed: {pdf_file}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed: {pdf_file}: {e}")
|
||||
continue
|
||||
with open("batch_merged.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
|
||||
elif operation == 'extract_text':
|
||||
for pdf_file in pdf_files:
|
||||
try:
|
||||
reader = PdfReader(pdf_file)
|
||||
text = "".join(page.extract_text() for page in reader.pages)
|
||||
output_file = pdf_file.replace('.pdf', '.txt')
|
||||
with open(output_file, 'w', encoding='utf-8') as f:
|
||||
f.write(text)
|
||||
logger.info(f"Extracted: {pdf_file}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed: {pdf_file}: {e}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## §Performance Optimization
|
||||
|
||||
### Text Extraction
|
||||
- `pdftotext -bbox-layout` is fastest for plain text
|
||||
- Use pdfplumber for structured data and tables
|
||||
- Avoid `pypdf.extract_text()` for very large documents
|
||||
|
||||
### Image Extraction
|
||||
- `pdfimages` is much faster than rendering entire pages
|
||||
- Use low resolution for previews, high resolution for final output
|
||||
|
||||
### Memory Management for Large PDFs
|
||||
```python
|
||||
def process_large_pdf(pdf_path, chunk_size=10):
|
||||
reader = PdfReader(pdf_path)
|
||||
total_pages = len(reader.pages)
|
||||
|
||||
for start_idx in range(0, total_pages, chunk_size):
|
||||
end_idx = min(start_idx + chunk_size, total_pages)
|
||||
writer = PdfWriter()
|
||||
for i in range(start_idx, end_idx):
|
||||
writer.add_page(reader.pages[i])
|
||||
with open(f"chunk_{start_idx//chunk_size}.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Extended Tooling Inventory
|
||||
|
||||
| Library / Tool | Role | Licence |
|
||||
|----------------|------|---------|
|
||||
| pikepdf | Low-level PDF manipulation (forms, pages, metadata) | MPL-2.0 |
|
||||
| pdfplumber | Content extraction (text, tables) | MIT |
|
||||
| pypdfium2 | Fast rendering, text extraction (PyMuPDF alternative) | Apache/BSD |
|
||||
| pypdf | Merge, split, crop, metadata, encryption | BSD |
|
||||
| poppler-utils | CLI text/image extraction, rendering | GPL-2 |
|
||||
| qpdf | Page manipulation, optimization, encryption, repair | Apache |
|
||||
| pytesseract | OCR for scanned PDFs | Apache |
|
||||
| pdf2image | PDF-to-image conversion via poppler | MIT |
|
||||
| LibreOffice | Office format conversion engine | MPL-2.0 |
|
||||
Reference in New Issue
Block a user