Initial commit

2026-06-06 05:21:10 +00:00
commit 6664758a6d
493 changed files with 135653 additions and 0 deletions
--- a/skills/pdf/briefs/process-advanced.md
+++ b/skills/pdf/briefs/process-advanced.md
@@ -0,0 +1,284 @@
+# Process Brief: Advanced Reference
+
+Edge-case tools and techniques. **Load only when the main `process.md` doesn't cover the task.**
+
+```
+Edge case
+  ├─ No text extracted from PDF    → §OCR Fallback
+  ├─ PDF is encrypted/locked       → §Encrypted PDFs
+  ├─ PDF is corrupted/damaged      → §Corrupted PDFs
+  ├─ Need precise text coordinates → §pdfplumber Advanced
+  ├─ Need fast rendering           → §pypdfium2
+  ├─ Advanced page extraction      → §poppler-utils Advanced
+  ├─ Complex page ranges / merge   → §qpdf Page Manipulation
+  ├─ Optimize file size            → §qpdf Optimization
+  ├─ Batch process many PDFs       → §Batch Processing
+  └─ Memory issues with large PDF  → §Performance Optimization
+```
+
+For basic operations (extract, merge, split, fill forms, convert), go back to `process.md`.
+
+---
+
+## §pypdfium2 (Apache/BSD License)
+
+A Python binding for PDFium (Chromium's PDF library). Excellent for fast rendering and text extraction — serves as a PyMuPDF replacement.
+
+#### Render PDF to Images
+```python
+import pypdfium2 as pdfium
+
+pdf = pdfium.PdfDocument("document.pdf")
+
+# Render single page
+page = pdf[0]
+bitmap = page.render(scale=2.0, rotation=0)
+img = bitmap.to_pil()
+img.save("page_1.png", "PNG")
+
+# Batch render all pages
+for i, page in enumerate(pdf):
+    bitmap = page.render(scale=1.5)
+    img = bitmap.to_pil()
+    img.save(f"page_{i+1}.jpg", "JPEG", quality=90)
+```
+
+#### Extract Text with pypdfium2
+```python
+import pypdfium2 as pdfium
+
+pdf = pdfium.PdfDocument("document.pdf")
+for i, page in enumerate(pdf):
+    text = page.get_text()
+    print(f"Page {i+1}: {len(text)} chars")
+```
+
+## §pdfplumber Advanced Features
+
+#### Extract Text with Precise Coordinates
+```python
+import pdfplumber
+
+with pdfplumber.open("document.pdf") as pdf:
+    page = pdf.pages[0]
+
+    # All characters with coordinates
+    for char in page.chars[:10]:
+        print(f"'{char['text']}' at x:{char['x0']:.1f} y:{char['y0']:.1f}")
+
+    # Extract text within a specific bounding box (left, top, right, bottom)
+    bbox_text = page.within_bbox((100, 100, 400, 200)).extract_text()
+```
+
+#### Advanced Table Extraction with Custom Settings
+```python
+import pdfplumber
+
+with pdfplumber.open("complex_table.pdf") as pdf:
+    page = pdf.pages[0]
+
+    table_settings = {
+        "vertical_strategy": "lines",
+        "horizontal_strategy": "lines",
+        "snap_tolerance": 3,
+        "intersection_tolerance": 15
+    }
+    tables = page.extract_tables(table_settings)
+
+    # Visual debugging
+    img = page.to_image(resolution=150)
+    img.save("debug_layout.png")
+```
+
+---
+
+## §poppler-utils Advanced Features
+
+### Extract Text with Bounding Box Coordinates
+```bash
+pdftotext -bbox-layout document.pdf output.xml
+```
+
+### Advanced Image Conversion
+```bash
+# High-resolution PNG
+pdftoppm -png -r 300 document.pdf output_prefix
+
+# Specific page range
+pdftoppm -png -r 600 -f 1 -l 3 document.pdf high_res_pages
+
+# JPEG with quality setting
+pdftoppm -jpeg -jpegopt quality=85 -r 200 document.pdf jpeg_output
+```
+
+### Extract Embedded Images
+```bash
+pdfimages -all document.pdf images/img
+pdfimages -list document.pdf
+```
+
+---
+
+## §qpdf Advanced Features
+
+### Complex Page Manipulation
+```bash
+# Split into groups of N pages
+qpdf --split-pages=3 input.pdf output_group_%02d.pdf
+
+# Extract complex page ranges
+qpdf input.pdf --pages input.pdf 1,3-5,8,10-end -- extracted.pdf
+
+# Merge specific pages from multiple PDFs
+qpdf --empty --pages doc1.pdf 1-3 doc2.pdf 5-7 doc3.pdf 2,4 -- combined.pdf
+```
+
+### PDF Optimization and Repair
+```bash
+qpdf --linearize input.pdf optimized.pdf
+qpdf --optimize-level=all input.pdf compressed.pdf
+qpdf --check input.pdf
+qpdf --fix-qdf damaged.pdf repaired.pdf
+```
+
+### Encryption and Decryption
+```bash
+qpdf --encrypt user_pass owner_pass 256 --print=none --modify=none -- input.pdf encrypted.pdf
+qpdf --show-encryption encrypted.pdf
+qpdf --password=secret123 --decrypt encrypted.pdf decrypted.pdf
+```
+
+---
+
+## §Encrypted PDFs
+
+```python
+from pypdf import PdfReader
+
+try:
+    reader = PdfReader("encrypted.pdf")
+    if reader.is_encrypted:
+        reader.decrypt("password")
+except Exception as e:
+    print(f"Failed to decrypt: {e}")
+```
+
+Or via qpdf:
+```bash
+qpdf --password=secret123 --decrypt encrypted.pdf decrypted.pdf
+```
+
+---
+
+## §Corrupted PDFs
+
+```bash
+qpdf --check corrupted.pdf
+qpdf --replace-input corrupted.pdf
+```
+
+---
+
+## §OCR Fallback for Scanned PDFs
+
+When `pdfplumber` extracts 0 characters, the PDF is likely a scanned image:
+
+```python
+import pytesseract
+from pdf2image import convert_from_path
+
+def extract_text_with_ocr(pdf_path):
+    images = convert_from_path(pdf_path)
+    text = ""
+    for i, image in enumerate(images):
+        text += pytesseract.image_to_string(image)
+    return text
+```
+
+Prerequisites: `pip install pdf2image pytesseract` + install Tesseract OCR and poppler.
+
+---
+
+## §Batch Processing with Error Handling
+
+```python
+import os, glob
+from pypdf import PdfReader, PdfWriter
+import logging
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+def batch_process_pdfs(input_dir, operation='merge'):
+    pdf_files = glob.glob(os.path.join(input_dir, "*.pdf"))
+
+    if operation == 'merge':
+        writer = PdfWriter()
+        for pdf_file in pdf_files:
+            try:
+                reader = PdfReader(pdf_file)
+                for page in reader.pages:
+                    writer.add_page(page)
+                logger.info(f"Processed: {pdf_file}")
+            except Exception as e:
+                logger.error(f"Failed: {pdf_file}: {e}")
+                continue
+        with open("batch_merged.pdf", "wb") as output:
+            writer.write(output)
+
+    elif operation == 'extract_text':
+        for pdf_file in pdf_files:
+            try:
+                reader = PdfReader(pdf_file)
+                text = "".join(page.extract_text() for page in reader.pages)
+                output_file = pdf_file.replace('.pdf', '.txt')
+                with open(output_file, 'w', encoding='utf-8') as f:
+                    f.write(text)
+                logger.info(f"Extracted: {pdf_file}")
+            except Exception as e:
+                logger.error(f"Failed: {pdf_file}: {e}")
+```
+
+---
+
+## §Performance Optimization
+
+### Text Extraction
+- `pdftotext -bbox-layout` is fastest for plain text
+- Use pdfplumber for structured data and tables
+- Avoid `pypdf.extract_text()` for very large documents
+
+### Image Extraction
+- `pdfimages` is much faster than rendering entire pages
+- Use low resolution for previews, high resolution for final output
+
+### Memory Management for Large PDFs
+```python
+def process_large_pdf(pdf_path, chunk_size=10):
+    reader = PdfReader(pdf_path)
+    total_pages = len(reader.pages)
+
+    for start_idx in range(0, total_pages, chunk_size):
+        end_idx = min(start_idx + chunk_size, total_pages)
+        writer = PdfWriter()
+        for i in range(start_idx, end_idx):
+            writer.add_page(reader.pages[i])
+        with open(f"chunk_{start_idx//chunk_size}.pdf", "wb") as output:
+            writer.write(output)
+```
+
+---
+
+## Extended Tooling Inventory
+
+| Library / Tool | Role | Licence |
+|----------------|------|---------|
+| pikepdf | Low-level PDF manipulation (forms, pages, metadata) | MPL-2.0 |
+| pdfplumber | Content extraction (text, tables) | MIT |
+| pypdfium2 | Fast rendering, text extraction (PyMuPDF alternative) | Apache/BSD |
+| pypdf | Merge, split, crop, metadata, encryption | BSD |
+| poppler-utils | CLI text/image extraction, rendering | GPL-2 |
+| qpdf | Page manipulation, optimization, encryption, repair | Apache |
+| pytesseract | OCR for scanned PDFs | Apache |
+| pdf2image | PDF-to-image conversion via poppler | MIT |
+| LibreOffice | Office format conversion engine | MPL-2.0 |