feat: Add complete Agentic Compaction & Pipeline System
- Context Compaction System with token counting and summarization - Deterministic State Machine for flow control (no LLM decisions) - Parallel Execution Engine (up to 12 concurrent sessions) - Event-Driven Coordination via Event Bus - Agent Workspace Isolation (tools, memory, identity, files) - YAML Workflow Integration (OpenClaw/Lobster compatible) - Claude Code integration layer - Complete demo UI with real-time visualization - Comprehensive documentation and README Components: - agent-system/: Context management, token counting, subagent spawning - pipeline-system/: State machine, parallel executor, event bus, workflows - skills/: AI capabilities (LLM, ASR, TTS, VLM, image generation, etc.) - src/app/: Next.js demo application Total: ~100KB of production-ready TypeScript code
This commit is contained in:
30
skills/pdf/LICENSE.txt
Executable file
30
skills/pdf/LICENSE.txt
Executable file
@@ -0,0 +1,30 @@
|
||||
© 2025 Anthropic, PBC. All rights reserved.
|
||||
|
||||
LICENSE: Use of these materials (including all code, prompts, assets, files,
|
||||
and other components of this Skill) is governed by your agreement with
|
||||
Anthropic regarding use of Anthropic's services. If no separate agreement
|
||||
exists, use is governed by Anthropic's Consumer Terms of Service or
|
||||
Commercial Terms of Service, as applicable:
|
||||
https://www.anthropic.com/legal/consumer-terms
|
||||
https://www.anthropic.com/legal/commercial-terms
|
||||
Your applicable agreement is referred to as the "Agreement." "Services" are
|
||||
as defined in the Agreement.
|
||||
|
||||
ADDITIONAL RESTRICTIONS: Notwithstanding anything in the Agreement to the
|
||||
contrary, users may not:
|
||||
|
||||
- Extract these materials from the Services or retain copies of these
|
||||
materials outside the Services
|
||||
- Reproduce or copy these materials, except for temporary copies created
|
||||
automatically during authorized use of the Services
|
||||
- Create derivative works based on these materials
|
||||
- Distribute, sublicense, or transfer these materials to any third party
|
||||
- Make, offer to sell, sell, or import any inventions embodied in these
|
||||
materials
|
||||
- Reverse engineer, decompile, or disassemble these materials
|
||||
|
||||
The receipt, viewing, or possession of these materials does not convey or
|
||||
imply any license or right beyond those expressly granted above.
|
||||
|
||||
Anthropic retains all right, title, and interest in these materials,
|
||||
including all copyrights, patents, and other intellectual property rights.
|
||||
1534
skills/pdf/SKILL.md
Executable file
1534
skills/pdf/SKILL.md
Executable file
File diff suppressed because it is too large
Load Diff
205
skills/pdf/forms.md
Executable file
205
skills/pdf/forms.md
Executable file
@@ -0,0 +1,205 @@
|
||||
**CRITICAL: You MUST complete these steps in order. Do not skip ahead to writing code.**
|
||||
|
||||
If you need to fill out a PDF form, first check to see if the PDF has fillable form fields. Run this script from this file's directory:
|
||||
`python scripts/check_fillable_fields <file.pdf>`, and depending on the result go to either the "Fillable fields" or "Non-fillable fields" and follow those instructions.
|
||||
|
||||
# Fillable fields
|
||||
If the PDF has fillable form fields:
|
||||
- Run this script from this file's directory: `python scripts/extract_form_field_info.py <input.pdf> <field_info.json>`. It will create a JSON file with a list of fields in this format:
|
||||
```
|
||||
[
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"rect": ([left, bottom, right, top] bounding box in PDF coordinates, y=0 is the bottom of the page),
|
||||
"type": ("text", "checkbox", "radio_group", or "choice"),
|
||||
},
|
||||
// Checkboxes have "checked_value" and "unchecked_value" properties:
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"type": "checkbox",
|
||||
"checked_value": (Set the field to this value to check the checkbox),
|
||||
"unchecked_value": (Set the field to this value to uncheck the checkbox),
|
||||
},
|
||||
// Radio groups have a "radio_options" list with the possible choices.
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"type": "radio_group",
|
||||
"radio_options": [
|
||||
{
|
||||
"value": (set the field to this value to select this radio option),
|
||||
"rect": (bounding box for the radio button for this option)
|
||||
},
|
||||
// Other radio options
|
||||
]
|
||||
},
|
||||
// Multiple choice fields have a "choice_options" list with the possible choices:
|
||||
{
|
||||
"field_id": (unique ID for the field),
|
||||
"page": (page number, 1-based),
|
||||
"type": "choice",
|
||||
"choice_options": [
|
||||
{
|
||||
"value": (set the field to this value to select this option),
|
||||
"text": (display text of the option)
|
||||
},
|
||||
// Other choice options
|
||||
],
|
||||
}
|
||||
]
|
||||
```
|
||||
- Convert the PDF to PNGs (one image for each page) with this script (run from this file's directory):
|
||||
`python scripts/convert_pdf_to_images.py <file.pdf> <output_directory>`
|
||||
Then analyze the images to determine the purpose of each form field (make sure to convert the bounding box PDF coordinates to image coordinates).
|
||||
- Create a `field_values.json` file in this format with the values to be entered for each field:
|
||||
```
|
||||
[
|
||||
{
|
||||
"field_id": "last_name", // Must match the field_id from `extract_form_field_info.py`
|
||||
"description": "The user's last name",
|
||||
"page": 1, // Must match the "page" value in field_info.json
|
||||
"value": "Simpson"
|
||||
},
|
||||
{
|
||||
"field_id": "Checkbox12",
|
||||
"description": "Checkbox to be checked if the user is 18 or over",
|
||||
"page": 1,
|
||||
"value": "/On" // If this is a checkbox, use its "checked_value" value to check it. If it's a radio button group, use one of the "value" values in "radio_options".
|
||||
},
|
||||
// more fields
|
||||
]
|
||||
```
|
||||
- Run the `fill_fillable_fields.py` script from this file's directory to create a filled-in PDF:
|
||||
`python scripts/fill_fillable_fields.py <input pdf> <field_values.json> <output pdf>`
|
||||
This script will verify that the field IDs and values you provide are valid; if it prints error messages, correct the appropriate fields and try again.
|
||||
|
||||
# Non-fillable fields
|
||||
If the PDF doesn't have fillable form fields, you'll need to visually determine where the data should be added and create text annotations. Follow the below steps *exactly*. You MUST perform all of these steps to ensure that the the form is accurately completed. Details for each step are below.
|
||||
- Convert the PDF to PNG images and determine field bounding boxes.
|
||||
- Create a JSON file with field information and validation images showing the bounding boxes.
|
||||
- Validate the the bounding boxes.
|
||||
- Use the bounding boxes to fill in the form.
|
||||
|
||||
## Step 1: Visual Analysis (REQUIRED)
|
||||
- Convert the PDF to PNG images. Run this script from this file's directory:
|
||||
`python scripts/convert_pdf_to_images.py <file.pdf> <output_directory>`
|
||||
The script will create a PNG image for each page in the PDF.
|
||||
- Carefully examine each PNG image and identify all form fields and areas where the user should enter data. For each form field where the user should enter text, determine bounding boxes for both the form field label, and the area where the user should enter text. The label and entry bounding boxes MUST NOT INTERSECT; the text entry box should only include the area where data should be entered. Usually this area will be immediately to the side, above, or below its label. Entry bounding boxes must be tall and wide enough to contain their text.
|
||||
|
||||
These are some examples of form structures that you might see:
|
||||
|
||||
*Label inside box*
|
||||
```
|
||||
┌────────────────────────┐
|
||||
│ Name: │
|
||||
└────────────────────────┘
|
||||
```
|
||||
The input area should be to the right of the "Name" label and extend to the edge of the box.
|
||||
|
||||
*Label before line*
|
||||
```
|
||||
Email: _______________________
|
||||
```
|
||||
The input area should be above the line and include its entire width.
|
||||
|
||||
*Label under line*
|
||||
```
|
||||
_________________________
|
||||
Name
|
||||
```
|
||||
The input area should be above the line and include the entire width of the line. This is common for signature and date fields.
|
||||
|
||||
*Label above line*
|
||||
```
|
||||
Please enter any special requests:
|
||||
________________________________________________
|
||||
```
|
||||
The input area should extend from the bottom of the label to the line, and should include the entire width of the line.
|
||||
|
||||
*Checkboxes*
|
||||
```
|
||||
Are you a US citizen? Yes □ No □
|
||||
```
|
||||
For checkboxes:
|
||||
- Look for small square boxes (□) - these are the actual checkboxes to target. They may be to the left or right of their labels.
|
||||
- Distinguish between label text ("Yes", "No") and the clickable checkbox squares.
|
||||
- The entry bounding box should cover ONLY the small square, not the text label.
|
||||
|
||||
### Step 2: Create fields.json and validation images (REQUIRED)
|
||||
- Create a file named `fields.json` with information for the form fields and bounding boxes in this format:
|
||||
```
|
||||
{
|
||||
"pages": [
|
||||
{
|
||||
"page_number": 1,
|
||||
"image_width": (first page image width in pixels),
|
||||
"image_height": (first page image height in pixels),
|
||||
},
|
||||
{
|
||||
"page_number": 2,
|
||||
"image_width": (second page image width in pixels),
|
||||
"image_height": (second page image height in pixels),
|
||||
}
|
||||
// additional pages
|
||||
],
|
||||
"form_fields": [
|
||||
// Example for a text field.
|
||||
{
|
||||
"page_number": 1,
|
||||
"description": "The user's last name should be entered here",
|
||||
// Bounding boxes are [left, top, right, bottom]. The bounding boxes for the label and text entry should not overlap.
|
||||
"field_label": "Last name",
|
||||
"label_bounding_box": [30, 125, 95, 142],
|
||||
"entry_bounding_box": [100, 125, 280, 142],
|
||||
"entry_text": {
|
||||
"text": "Johnson", // This text will be added as an annotation at the entry_bounding_box location
|
||||
"font_size": 14, // optional, defaults to 14
|
||||
"font_color": "000000", // optional, RRGGBB format, defaults to 000000 (black)
|
||||
}
|
||||
},
|
||||
// Example for a checkbox. TARGET THE SQUARE for the entry bounding box, NOT THE TEXT
|
||||
{
|
||||
"page_number": 2,
|
||||
"description": "Checkbox that should be checked if the user is over 18",
|
||||
"entry_bounding_box": [140, 525, 155, 540], // Small box over checkbox square
|
||||
"field_label": "Yes",
|
||||
"label_bounding_box": [100, 525, 132, 540], // Box containing "Yes" text
|
||||
// Use "X" to check a checkbox.
|
||||
"entry_text": {
|
||||
"text": "X",
|
||||
}
|
||||
}
|
||||
// additional form field entries
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Create validation images by running this script from this file's directory for each page:
|
||||
`python scripts/create_validation_image.py <page_number> <path_to_fields.json> <input_image_path> <output_image_path>
|
||||
|
||||
The validation images will have red rectangles where text should be entered, and blue rectangles covering label text.
|
||||
|
||||
### Step 3: Validate Bounding Boxes (REQUIRED)
|
||||
#### Automated intersection check
|
||||
- Verify that none of bounding boxes intersect and that the entry bounding boxes are tall enough by checking the fields.json file with the `check_bounding_boxes.py` script (run from this file's directory):
|
||||
`python scripts/check_bounding_boxes.py <JSON file>`
|
||||
|
||||
If there are errors, reanalyze the relevant fields, adjust the bounding boxes, and iterate until there are no remaining errors. Remember: label (blue) bounding boxes should contain text labels, entry (red) boxes should not.
|
||||
|
||||
#### Manual image inspection
|
||||
**CRITICAL: Do not proceed without visually inspecting validation images**
|
||||
- Red rectangles must ONLY cover input areas
|
||||
- Red rectangles MUST NOT contain any text
|
||||
- Blue rectangles should contain label text
|
||||
- For checkboxes:
|
||||
- Red rectangle MUST be centered on the checkbox square
|
||||
- Blue rectangle should cover the text label for the checkbox
|
||||
|
||||
- If any rectangles look wrong, fix fields.json, regenerate the validation images, and verify again. Repeat this process until the bounding boxes are fully accurate.
|
||||
|
||||
|
||||
### Step 4: Add annotations to the PDF
|
||||
Run this script from this file's directory to create a filled-out PDF using the information in fields.json:
|
||||
`python scripts/fill_pdf_form_with_annotations.py <input_pdf_path> <path_to_fields.json> <output_pdf_path>
|
||||
765
skills/pdf/reference.md
Executable file
765
skills/pdf/reference.md
Executable file
@@ -0,0 +1,765 @@
|
||||
# PDF Processing Advanced Reference
|
||||
|
||||
This document contains advanced PDF processing features, detailed examples, and additional libraries not covered in the main skill instructions.
|
||||
|
||||
## pypdfium2 Library (Apache/BSD License)
|
||||
|
||||
### Overview
|
||||
pypdfium2 is a Python binding for PDFium (Chromium's PDF library). It's excellent for fast PDF rendering, image generation, and serves as a PyMuPDF replacement.
|
||||
|
||||
### Render PDF to Images
|
||||
```python
|
||||
import pypdfium2 as pdfium
|
||||
from PIL import Image
|
||||
|
||||
# Load PDF
|
||||
pdf = pdfium.PdfDocument("document.pdf")
|
||||
|
||||
# Render page to image
|
||||
page = pdf[0] # First page
|
||||
bitmap = page.render(
|
||||
scale=2.0, # Higher resolution
|
||||
rotation=0 # No rotation
|
||||
)
|
||||
|
||||
# Convert to PIL Image
|
||||
img = bitmap.to_pil()
|
||||
img.save("page_1.png", "PNG")
|
||||
|
||||
# Process multiple pages
|
||||
for i, page in enumerate(pdf):
|
||||
bitmap = page.render(scale=1.5)
|
||||
img = bitmap.to_pil()
|
||||
img.save(f"page_{i+1}.jpg", "JPEG", quality=90)
|
||||
```
|
||||
|
||||
### Extract Text with pypdfium2
|
||||
```python
|
||||
import pypdfium2 as pdfium
|
||||
|
||||
pdf = pdfium.PdfDocument("document.pdf")
|
||||
for i, page in enumerate(pdf):
|
||||
text = page.get_text()
|
||||
print(f"Page {i+1} text length: {len(text)} chars")
|
||||
```
|
||||
|
||||
## JavaScript Libraries
|
||||
|
||||
### pdf-lib (MIT License)
|
||||
|
||||
pdf-lib is a powerful JavaScript library for creating and modifying PDF documents in any JavaScript environment.
|
||||
|
||||
#### Load and Manipulate Existing PDF
|
||||
```javascript
|
||||
import { PDFDocument } from 'pdf-lib';
|
||||
import fs from 'fs';
|
||||
|
||||
async function manipulatePDF() {
|
||||
// Load existing PDF
|
||||
const existingPdfBytes = fs.readFileSync('input.pdf');
|
||||
const pdfDoc = await PDFDocument.load(existingPdfBytes);
|
||||
|
||||
// Get page count
|
||||
const pageCount = pdfDoc.getPageCount();
|
||||
console.log(`Document has ${pageCount} pages`);
|
||||
|
||||
// Add new page
|
||||
const newPage = pdfDoc.addPage([600, 400]);
|
||||
newPage.drawText('Added by pdf-lib', {
|
||||
x: 100,
|
||||
y: 300,
|
||||
size: 16
|
||||
});
|
||||
|
||||
// Save modified PDF
|
||||
const pdfBytes = await pdfDoc.save();
|
||||
fs.writeFileSync('modified.pdf', pdfBytes);
|
||||
}
|
||||
```
|
||||
|
||||
#### Create Complex PDFs from Scratch
|
||||
|
||||
**Note**: This JavaScript example uses pdf-lib's built-in StandardFonts. For Python/reportlab, always use the six registered fonts defined in SKILL.md (SimHei, Microsoft YaHei, SarasaMonoSC, Times New Roman, Calibri, DejaVuSans).
|
||||
|
||||
```javascript
|
||||
import { PDFDocument, rgb, StandardFonts } from 'pdf-lib';
|
||||
import fs from 'fs';
|
||||
|
||||
async function createPDF() {
|
||||
const pdfDoc = await PDFDocument.create();
|
||||
|
||||
// Add fonts
|
||||
const helveticaFont = await pdfDoc.embedFont(StandardFonts.Helvetica);
|
||||
const helveticaBold = await pdfDoc.embedFont(StandardFonts.HelveticaBold);
|
||||
|
||||
// Add page
|
||||
const page = pdfDoc.addPage([595, 842]); // A4 size
|
||||
const { width, height } = page.getSize();
|
||||
|
||||
// Add text with styling
|
||||
page.drawText('Invoice #12345', {
|
||||
x: 50,
|
||||
y: height - 50,
|
||||
size: 18,
|
||||
font: helveticaBold,
|
||||
color: rgb(0.2, 0.2, 0.8)
|
||||
});
|
||||
|
||||
// Add rectangle (header background)
|
||||
page.drawRectangle({
|
||||
x: 40,
|
||||
y: height - 100,
|
||||
width: width - 80,
|
||||
height: 30,
|
||||
color: rgb(0.9, 0.9, 0.9)
|
||||
});
|
||||
|
||||
// Add table-like content
|
||||
const items = [
|
||||
['Item', 'Qty', 'Price', 'Total'],
|
||||
['Widget', '2', '$50', '$100'],
|
||||
['Gadget', '1', '$75', '$75']
|
||||
];
|
||||
|
||||
let yPos = height - 150;
|
||||
items.forEach(row => {
|
||||
let xPos = 50;
|
||||
row.forEach(cell => {
|
||||
page.drawText(cell, {
|
||||
x: xPos,
|
||||
y: yPos,
|
||||
size: 12,
|
||||
font: helveticaFont
|
||||
});
|
||||
xPos += 120;
|
||||
});
|
||||
yPos -= 25;
|
||||
});
|
||||
|
||||
const pdfBytes = await pdfDoc.save();
|
||||
fs.writeFileSync('created.pdf', pdfBytes);
|
||||
}
|
||||
```
|
||||
|
||||
#### Advanced Merge and Split Operations
|
||||
```javascript
|
||||
import { PDFDocument } from 'pdf-lib';
|
||||
import fs from 'fs';
|
||||
|
||||
async function mergePDFs() {
|
||||
// Create new document
|
||||
const mergedPdf = await PDFDocument.create();
|
||||
|
||||
// Load source PDFs
|
||||
const pdf1Bytes = fs.readFileSync('doc1.pdf');
|
||||
const pdf2Bytes = fs.readFileSync('doc2.pdf');
|
||||
|
||||
const pdf1 = await PDFDocument.load(pdf1Bytes);
|
||||
const pdf2 = await PDFDocument.load(pdf2Bytes);
|
||||
|
||||
// Copy pages from first PDF
|
||||
const pdf1Pages = await mergedPdf.copyPages(pdf1, pdf1.getPageIndices());
|
||||
pdf1Pages.forEach(page => mergedPdf.addPage(page));
|
||||
|
||||
// Copy specific pages from second PDF (pages 0, 2, 4)
|
||||
const pdf2Pages = await mergedPdf.copyPages(pdf2, [0, 2, 4]);
|
||||
pdf2Pages.forEach(page => mergedPdf.addPage(page));
|
||||
|
||||
const mergedPdfBytes = await mergedPdf.save();
|
||||
fs.writeFileSync('merged.pdf', mergedPdfBytes);
|
||||
}
|
||||
```
|
||||
|
||||
### pdfjs-dist (Apache License)
|
||||
|
||||
PDF.js is Mozilla's JavaScript library for rendering PDFs in the browser.
|
||||
|
||||
#### Basic PDF Loading and Rendering
|
||||
```javascript
|
||||
import * as pdfjsLib from 'pdfjs-dist';
|
||||
|
||||
// Configure worker (important for performance)
|
||||
pdfjsLib.GlobalWorkerOptions.workerSrc = './pdf.worker.js';
|
||||
|
||||
async function renderPDF() {
|
||||
// Load PDF
|
||||
const loadingTask = pdfjsLib.getDocument('document.pdf');
|
||||
const pdf = await loadingTask.promise;
|
||||
|
||||
console.log(`Loaded PDF with ${pdf.numPages} pages`);
|
||||
|
||||
// Get first page
|
||||
const page = await pdf.getPage(1);
|
||||
const viewport = page.getViewport({ scale: 1.5 });
|
||||
|
||||
// Render to canvas
|
||||
const canvas = document.createElement('canvas');
|
||||
const context = canvas.getContext('2d');
|
||||
canvas.height = viewport.height;
|
||||
canvas.width = viewport.width;
|
||||
|
||||
const renderContext = {
|
||||
canvasContext: context,
|
||||
viewport: viewport
|
||||
};
|
||||
|
||||
await page.render(renderContext).promise;
|
||||
document.body.appendChild(canvas);
|
||||
}
|
||||
```
|
||||
|
||||
#### Extract Text with Coordinates
|
||||
```javascript
|
||||
import * as pdfjsLib from 'pdfjs-dist';
|
||||
|
||||
async function extractText() {
|
||||
const loadingTask = pdfjsLib.getDocument('document.pdf');
|
||||
const pdf = await loadingTask.promise;
|
||||
|
||||
let fullText = '';
|
||||
|
||||
// Extract text from all pages
|
||||
for (let i = 1; i <= pdf.numPages; i++) {
|
||||
const page = await pdf.getPage(i);
|
||||
const textContent = await page.getTextContent();
|
||||
|
||||
const pageText = textContent.items
|
||||
.map(item => item.str)
|
||||
.join(' ');
|
||||
|
||||
fullText += `\n--- Page ${i} ---\n${pageText}`;
|
||||
|
||||
// Get text with coordinates for advanced processing
|
||||
const textWithCoords = textContent.items.map(item => ({
|
||||
text: item.str,
|
||||
x: item.transform[4],
|
||||
y: item.transform[5],
|
||||
width: item.width,
|
||||
height: item.height
|
||||
}));
|
||||
}
|
||||
|
||||
console.log(fullText);
|
||||
return fullText;
|
||||
}
|
||||
```
|
||||
|
||||
#### Extract Annotations and Forms
|
||||
```javascript
|
||||
import * as pdfjsLib from 'pdfjs-dist';
|
||||
|
||||
async function extractAnnotations() {
|
||||
const loadingTask = pdfjsLib.getDocument('annotated.pdf');
|
||||
const pdf = await loadingTask.promise;
|
||||
|
||||
for (let i = 1; i <= pdf.numPages; i++) {
|
||||
const page = await pdf.getPage(i);
|
||||
const annotations = await page.getAnnotations();
|
||||
|
||||
annotations.forEach(annotation => {
|
||||
console.log(`Annotation type: ${annotation.subtype}`);
|
||||
console.log(`Content: ${annotation.contents}`);
|
||||
console.log(`Coordinates: ${JSON.stringify(annotation.rect)}`);
|
||||
});
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Advanced Command-Line Operations
|
||||
|
||||
### poppler-utils Advanced Features
|
||||
|
||||
#### Extract Text with Bounding Box Coordinates
|
||||
```bash
|
||||
# Extract text with bounding box coordinates (essential for structured data)
|
||||
pdftotext -bbox-layout document.pdf output.xml
|
||||
|
||||
# The XML output contains precise coordinates for each text element
|
||||
```
|
||||
|
||||
#### Advanced Image Conversion
|
||||
```bash
|
||||
# Convert to PNG images with specific resolution
|
||||
pdftoppm -png -r 300 document.pdf output_prefix
|
||||
|
||||
# Convert specific page range with high resolution
|
||||
pdftoppm -png -r 600 -f 1 -l 3 document.pdf high_res_pages
|
||||
|
||||
# Convert to JPEG with quality setting
|
||||
pdftoppm -jpeg -jpegopt quality=85 -r 200 document.pdf jpeg_output
|
||||
```
|
||||
|
||||
#### Extract Embedded Images
|
||||
```bash
|
||||
# Extract all embedded images with metadata
|
||||
pdfimages -j -p document.pdf page_images
|
||||
|
||||
# List image info without extracting
|
||||
pdfimages -list document.pdf
|
||||
|
||||
# Extract images in their original format
|
||||
pdfimages -all document.pdf images/img
|
||||
```
|
||||
|
||||
### qpdf Advanced Features
|
||||
|
||||
#### Complex Page Manipulation
|
||||
```bash
|
||||
# Split PDF into groups of pages
|
||||
qpdf --split-pages=3 input.pdf output_group_%02d.pdf
|
||||
|
||||
# Extract specific pages with complex ranges
|
||||
qpdf input.pdf --pages input.pdf 1,3-5,8,10-end -- extracted.pdf
|
||||
|
||||
# Merge specific pages from multiple PDFs
|
||||
qpdf --empty --pages doc1.pdf 1-3 doc2.pdf 5-7 doc3.pdf 2,4 -- combined.pdf
|
||||
```
|
||||
|
||||
#### PDF Optimization and Repair
|
||||
```bash
|
||||
# Optimize PDF for web (linearize for streaming)
|
||||
qpdf --linearize input.pdf optimized.pdf
|
||||
|
||||
# Remove unused objects and compress
|
||||
qpdf --optimize-level=all input.pdf compressed.pdf
|
||||
|
||||
# Attempt to repair corrupted PDF structure
|
||||
qpdf --check input.pdf
|
||||
qpdf --fix-qdf damaged.pdf repaired.pdf
|
||||
|
||||
# Show detailed PDF structure for debugging
|
||||
qpdf --show-all-pages input.pdf > structure.txt
|
||||
```
|
||||
|
||||
#### Advanced Encryption
|
||||
```bash
|
||||
# Add password protection with specific permissions
|
||||
qpdf --encrypt user_pass owner_pass 256 --print=none --modify=none -- input.pdf encrypted.pdf
|
||||
|
||||
# Check encryption status
|
||||
qpdf --show-encryption encrypted.pdf
|
||||
|
||||
# Remove password protection (requires password)
|
||||
qpdf --password=secret123 --decrypt encrypted.pdf decrypted.pdf
|
||||
```
|
||||
|
||||
## Advanced Python Techniques
|
||||
|
||||
### pdfplumber Advanced Features
|
||||
|
||||
#### Extract Text with Precise Coordinates
|
||||
```python
|
||||
import pdfplumber
|
||||
|
||||
with pdfplumber.open("document.pdf") as pdf:
|
||||
page = pdf.pages[0]
|
||||
|
||||
# Extract all text with coordinates
|
||||
chars = page.chars
|
||||
for char in chars[:10]: # First 10 characters
|
||||
print(f"Char: '{char['text']}' at x:{char['x0']:.1f} y:{char['y0']:.1f}")
|
||||
|
||||
# Extract text by bounding box (left, top, right, bottom)
|
||||
bbox_text = page.within_bbox((100, 100, 400, 200)).extract_text()
|
||||
```
|
||||
|
||||
#### Advanced Table Extraction with Custom Settings
|
||||
```python
|
||||
import pdfplumber
|
||||
import pandas as pd
|
||||
|
||||
with pdfplumber.open("complex_table.pdf") as pdf:
|
||||
page = pdf.pages[0]
|
||||
|
||||
# Extract tables with custom settings for complex layouts
|
||||
table_settings = {
|
||||
"vertical_strategy": "lines",
|
||||
"horizontal_strategy": "lines",
|
||||
"snap_tolerance": 3,
|
||||
"intersection_tolerance": 15
|
||||
}
|
||||
tables = page.extract_tables(table_settings)
|
||||
|
||||
# Visual debugging for table extraction
|
||||
img = page.to_image(resolution=150)
|
||||
img.save("debug_layout.png")
|
||||
```
|
||||
|
||||
### reportlab Advanced Features
|
||||
|
||||
#### Quick TOC Template (Copy-Paste Ready)
|
||||
|
||||
```python
|
||||
from reportlab.lib.pagesizes import A4
|
||||
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, PageBreak
|
||||
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
|
||||
from reportlab.lib import colors
|
||||
from reportlab.lib.units import inch
|
||||
from reportlab.pdfbase import pdfmetrics
|
||||
from reportlab.pdfbase.ttfonts import TTFont
|
||||
from reportlab.pdfbase.pdfmetrics import registerFontFamily
|
||||
|
||||
# Register fonts first
|
||||
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
|
||||
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')
|
||||
|
||||
# Setup
|
||||
doc = SimpleDocTemplate("report.pdf", pagesize=A4,
|
||||
leftMargin=0.75*inch, rightMargin=0.75*inch)
|
||||
styles = getSampleStyleSheet()
|
||||
|
||||
# Configure heading style
|
||||
styles['Heading1'].fontName = 'Times New Roman'
|
||||
styles['Heading1'].textColor = colors.black # Titles must be black
|
||||
|
||||
story = []
|
||||
|
||||
# Calculate dimensions
|
||||
page_width = A4[0]
|
||||
available_width = page_width - 1.5*inch
|
||||
page_num_width = 50 # Fixed width for page numbers (enough for 3-4 digits)
|
||||
|
||||
# Calculate dots: fill space from title to page number
|
||||
dots_column_width = available_width - 200 - page_num_width # Reserve space for title + page
|
||||
optimal_dot_count = int(dots_column_width / 4.5) # ~4.5pt per dot at 7pt font
|
||||
|
||||
# Define styles
|
||||
toc_style = ParagraphStyle('TOCEntry', parent=styles['Normal'],
|
||||
fontName='Times New Roman', fontSize=11, leading=16)
|
||||
dots_style = ParagraphStyle('LeaderDots', parent=styles['Normal'],
|
||||
fontName='Times New Roman', fontSize=7, leading=16) # Smaller font for more dots
|
||||
|
||||
# Build TOC (use Paragraph with <b></b> for bold heading)
|
||||
toc_data = [
|
||||
[Paragraph('<b>Table of Contents</b>', styles['Heading1']), '', ''],
|
||||
['', '', ''],
|
||||
]
|
||||
|
||||
entries = [('Section 1', '5'), ('Section 2', '10')]
|
||||
for title, page in entries:
|
||||
toc_data.append([
|
||||
Paragraph(title, toc_style),
|
||||
Paragraph('.' * optimal_dot_count, dots_style),
|
||||
Paragraph(page, toc_style)
|
||||
])
|
||||
|
||||
# Use None for title column (auto-expand), fixed for others
|
||||
toc_table = Table(toc_data, colWidths=[None, dots_column_width, page_num_width])
|
||||
toc_table.setStyle(TableStyle([
|
||||
('GRID', (0, 0), (-1, -1), 0, colors.white),
|
||||
('LINEBELOW', (0, 0), (0, 0), 1.5, colors.black),
|
||||
('ALIGN', (0, 0), (0, -1), 'LEFT'),
|
||||
('ALIGN', (1, 0), (1, -1), 'LEFT'),
|
||||
('ALIGN', (2, 0), (2, -1), 'RIGHT'),
|
||||
('VALIGN', (0, 0), (-1, -1), 'TOP'),
|
||||
('LEFTPADDING', (0, 0), (-1, -1), 0),
|
||||
('RIGHTPADDING', (0, 0), (-1, -1), 0),
|
||||
('TOPPADDING', (0, 2), (-1, -1), 3),
|
||||
('BOTTOMPADDING', (0, 2), (-1, -1), 3),
|
||||
('TEXTCOLOR', (1, 2), (1, -1), colors.HexColor('#888888')),
|
||||
]))
|
||||
|
||||
story.append(toc_table)
|
||||
story.append(PageBreak())
|
||||
|
||||
doc.build(story)
|
||||
```
|
||||
|
||||
#### Advanced: Table of Contents with Leader Dots
|
||||
|
||||
**Critical Rules for TOC with Leader Dots:**
|
||||
|
||||
1. **Three-column structure**: [Title, Dots, Page Number] for leader dot style
|
||||
2. **Column width strategy**:
|
||||
- Title: `None` (auto-expands to content)
|
||||
- Dots: Calculated width = `available_width - 200 - 50` (reserves space for title + page)
|
||||
- Page number: Fixed `50pt` (enough for 3-4 digit numbers, ensures right alignment)
|
||||
3. **Dynamic dot count**: `int(dots_column_width / 4.5)` for 7pt font (adjust based on font size)
|
||||
4. **Dot styling**: Small font (7-8pt) and gray color (#888888) for professional look
|
||||
5. **Alignment sequence**: LEFT (title) → LEFT (dots flow from title) → RIGHT (page numbers)
|
||||
6. **Zero padding**: Essential for seamless visual connection between columns
|
||||
7. **Indentation**: Use leading spaces in title text for hierarchy (e.g., " 1.1 Subsection")
|
||||
|
||||
**MANDATORY STYLE REQUIREMENTS:**
|
||||
- ✅ USE FIXED WIDTHS: Percentage-based widths are STRICTLY FORBIDDEN. You MUST use fixed values to guarantee alignment, especially for page numbers.
|
||||
- ✅ DYNAMIC LEADER DOTS: Hard-coded dot counts are STRICTLY FORBIDDEN. You MUST calculate the number of dots dynamically based on the column width to prevent overflow or wrapping.
|
||||
- ✅ MINIMUM COLUMN WIDTH: The page number column MUST be at least 40pt wide. Anything less will prevent proper right alignment.
|
||||
- ✅ DOT FONT SIZE: Leader dot font size MUST NOT EXCEED 8pt. Larger sizes will ruin the dot density and are unacceptable.
|
||||
- ✅ DOT ALIGNMENT: Dots MUST remain left-aligned to maintain the visual flow from the title. Right-aligning dots is forbidden.
|
||||
- ✅ ZERO PADDING: Padding between columns MUST be set to exactly 0. Any gap will create a break in the dot line and is not allowed.
|
||||
- ✅ USE PARAGRAPH OBJECTS: Bold text MUST be wrapped in a Paragraph() object like `Paragraph('<b>Text</b>', style)`. Using plain strings like `'<b>Text</b>'` is strictly STRICTLY FORBIDDEN as styles will not render.
|
||||
|
||||
#### CRITICAL: Table Cell Content Must Use Paragraph
|
||||
|
||||
**ALL text content in table cells MUST be wrapped in `Paragraph()` objects.** This is essential for:
|
||||
- Rendering formatting tags (`<b>`, `<super>`, `<sub>`, `<i>`)
|
||||
- Proper font application
|
||||
- Correct text alignment within cells
|
||||
- Consistent styling across the table
|
||||
|
||||
**The ONLY exception**: `Image()` objects can be placed directly in table cells without Paragraph wrapping.
|
||||
|
||||
```python
|
||||
from reportlab.platypus import Table, TableStyle, Paragraph, Image
|
||||
from reportlab.lib.styles import ParagraphStyle
|
||||
from reportlab.lib import colors
|
||||
from reportlab.lib.enums import TA_CENTER, TA_LEFT, TA_RIGHT
|
||||
|
||||
# Define cell styles
|
||||
header_style = ParagraphStyle(
|
||||
name='TableHeader',
|
||||
fontName='Times New Roman',
|
||||
fontSize=11,
|
||||
textColor=colors.white,
|
||||
alignment=TA_CENTER
|
||||
)
|
||||
|
||||
cell_style = ParagraphStyle(
|
||||
name='TableCell',
|
||||
fontName='Times New Roman',
|
||||
fontSize=10,
|
||||
textColor=colors.black,
|
||||
alignment=TA_CENTER
|
||||
)
|
||||
|
||||
# ✅ CORRECT: All text wrapped in Paragraph()
|
||||
data = [
|
||||
[
|
||||
Paragraph('<b>Name</b>', header_style),
|
||||
Paragraph('<b>Formula</b>', header_style),
|
||||
Paragraph('<b>Value</b>', header_style)
|
||||
],
|
||||
[
|
||||
Paragraph('Water', cell_style),
|
||||
Paragraph('H<sub>2</sub>O', cell_style), # Subscript works
|
||||
Paragraph('18.015 g/mol', cell_style)
|
||||
],
|
||||
[
|
||||
Paragraph('Pressure', cell_style),
|
||||
Paragraph('1.01 x 10<super>5</super> Pa', cell_style), # Superscript works
|
||||
Paragraph('<b>Standard</b>', cell_style) # Bold works
|
||||
]
|
||||
]
|
||||
|
||||
# ❌ WRONG: Plain strings - NO formatting will render
|
||||
# data = [
|
||||
# ['<b>Name</b>', '<b>Formula</b>', '<b>Value</b>'], # Bold won't work!
|
||||
# ['Water', 'H<sub>2</sub>O', '18.015 g/mol'], # Subscript won't work!
|
||||
# ]
|
||||
|
||||
# Image exception - Image objects go directly, no Paragraph needed
|
||||
# data_with_image = [
|
||||
# [Paragraph('<b>Logo</b>', header_style), Paragraph('<b>Description</b>', header_style)],
|
||||
# [Image('logo.png', width=50, height=50), Paragraph('Company logo', cell_style)],
|
||||
# ]
|
||||
|
||||
table = Table(data, colWidths=[100, 150, 100])
|
||||
table.setStyle(TableStyle([
|
||||
('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
|
||||
('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
|
||||
('VALIGN', (0, 0), (-1, -1), 'MIDDLE'),
|
||||
]))
|
||||
```
|
||||
|
||||
#### Debug Tips for Layout Issues
|
||||
|
||||
```python
|
||||
from reportlab.platypus import HRFlowable
|
||||
from reportlab.lib.colors import red
|
||||
|
||||
# Visualize spacing during development
|
||||
story.append(table)
|
||||
story.append(HRFlowable(width="100%", color=red, thickness=0.5, spaceBefore=0, spaceAfter=0))
|
||||
story.append(Spacer(1, 6))
|
||||
story.append(HRFlowable(width="100%", color=red, thickness=0.5, spaceBefore=0, spaceAfter=0))
|
||||
story.append(caption)
|
||||
# This creates visual markers to see actual spacing
|
||||
```
|
||||
|
||||
## Complex Workflows
|
||||
|
||||
### Extract Figures/Images from PDF
|
||||
|
||||
#### Method 1: Using pdfimages (fastest)
|
||||
```bash
|
||||
# Extract all images with original quality
|
||||
pdfimages -all document.pdf images/img
|
||||
```
|
||||
|
||||
#### Method 2: Using pypdfium2 + Image Processing
|
||||
```python
|
||||
import pypdfium2 as pdfium
|
||||
from PIL import Image
|
||||
import numpy as np
|
||||
|
||||
def extract_figures(pdf_path, output_dir):
|
||||
pdf = pdfium.PdfDocument(pdf_path)
|
||||
|
||||
for page_num, page in enumerate(pdf):
|
||||
# Render high-resolution page
|
||||
bitmap = page.render(scale=3.0)
|
||||
img = bitmap.to_pil()
|
||||
|
||||
# Convert to numpy for processing
|
||||
img_array = np.array(img)
|
||||
|
||||
# Simple figure detection (non-white regions)
|
||||
mask = np.any(img_array != [255, 255, 255], axis=2)
|
||||
|
||||
# Find contours and extract bounding boxes
|
||||
# (This is simplified - real implementation would need more sophisticated detection)
|
||||
|
||||
# Save detected figures
|
||||
# ... implementation depends on specific needs
|
||||
```
|
||||
|
||||
### Batch PDF Processing with Error Handling
|
||||
```python
|
||||
import os
|
||||
import glob
|
||||
from pypdf import PdfReader, PdfWriter
|
||||
import logging
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def batch_process_pdfs(input_dir, operation='merge'):
|
||||
pdf_files = glob.glob(os.path.join(input_dir, "*.pdf"))
|
||||
|
||||
if operation == 'merge':
|
||||
writer = PdfWriter()
|
||||
for pdf_file in pdf_files:
|
||||
try:
|
||||
reader = PdfReader(pdf_file)
|
||||
for page in reader.pages:
|
||||
writer.add_page(page)
|
||||
logger.info(f"Processed: {pdf_file}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to process {pdf_file}: {e}")
|
||||
continue
|
||||
|
||||
with open("batch_merged.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
|
||||
elif operation == 'extract_text':
|
||||
for pdf_file in pdf_files:
|
||||
try:
|
||||
reader = PdfReader(pdf_file)
|
||||
text = ""
|
||||
for page in reader.pages:
|
||||
text += page.extract_text()
|
||||
|
||||
output_file = pdf_file.replace('.pdf', '.txt')
|
||||
with open(output_file, 'w', encoding='utf-8') as f:
|
||||
f.write(text)
|
||||
logger.info(f"Extracted text from: {pdf_file}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to extract text from {pdf_file}: {e}")
|
||||
continue
|
||||
```
|
||||
|
||||
### Advanced PDF Cropping
|
||||
```python
|
||||
from pypdf import PdfWriter, PdfReader
|
||||
|
||||
reader = PdfReader("input.pdf")
|
||||
writer = PdfWriter()
|
||||
|
||||
# Crop page (left, bottom, right, top in points)
|
||||
page = reader.pages[0]
|
||||
page.mediabox.left = 50
|
||||
page.mediabox.bottom = 50
|
||||
page.mediabox.right = 550
|
||||
page.mediabox.top = 750
|
||||
|
||||
writer.add_page(page)
|
||||
with open("cropped.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
```
|
||||
|
||||
## Performance Optimization Tips
|
||||
|
||||
### 1. For Large PDFs
|
||||
- Use streaming approaches instead of loading entire PDF in memory
|
||||
- Use `qpdf --split-pages` for splitting large files
|
||||
- Process pages individually with pypdfium2
|
||||
|
||||
### 2. For Text Extraction
|
||||
- `pdftotext -bbox-layout` is fastest for plain text extraction
|
||||
- Use pdfplumber for structured data and tables
|
||||
- Avoid `pypdf.extract_text()` for very large documents
|
||||
|
||||
### 3. For Image Extraction
|
||||
- `pdfimages` is much faster than rendering pages
|
||||
- Use low resolution for previews, high resolution for final output
|
||||
|
||||
### 4. For Form Filling
|
||||
- pdf-lib maintains form structure better than most alternatives
|
||||
- Pre-validate form fields before processing
|
||||
|
||||
### 5. Memory Management
|
||||
```python
|
||||
# Process PDFs in chunks
|
||||
def process_large_pdf(pdf_path, chunk_size=10):
|
||||
reader = PdfReader(pdf_path)
|
||||
total_pages = len(reader.pages)
|
||||
|
||||
for start_idx in range(0, total_pages, chunk_size):
|
||||
end_idx = min(start_idx + chunk_size, total_pages)
|
||||
writer = PdfWriter()
|
||||
|
||||
for i in range(start_idx, end_idx):
|
||||
writer.add_page(reader.pages[i])
|
||||
|
||||
# Process chunk
|
||||
with open(f"chunk_{start_idx//chunk_size}.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
```
|
||||
|
||||
## Troubleshooting Common Issues
|
||||
|
||||
### Encrypted PDFs
|
||||
```python
|
||||
# Handle password-protected PDFs
|
||||
from pypdf import PdfReader
|
||||
|
||||
try:
|
||||
reader = PdfReader("encrypted.pdf")
|
||||
if reader.is_encrypted:
|
||||
reader.decrypt("password")
|
||||
except Exception as e:
|
||||
print(f"Failed to decrypt: {e}")
|
||||
```
|
||||
|
||||
### Corrupted PDFs
|
||||
```bash
|
||||
# Use qpdf to repair
|
||||
qpdf --check corrupted.pdf
|
||||
qpdf --replace-input corrupted.pdf
|
||||
```
|
||||
|
||||
### Text Extraction Issues
|
||||
```python
|
||||
# Fallback to OCR for scanned PDFs
|
||||
import pytesseract
|
||||
from pdf2image import convert_from_path
|
||||
|
||||
def extract_text_with_ocr(pdf_path):
|
||||
images = convert_from_path(pdf_path)
|
||||
text = ""
|
||||
for i, image in enumerate(images):
|
||||
text += pytesseract.image_to_string(image)
|
||||
return text
|
||||
```
|
||||
|
||||
## License Information
|
||||
|
||||
- **pypdf**: BSD License
|
||||
- **pdfplumber**: MIT License
|
||||
- **pypdfium2**: Apache/BSD License
|
||||
- **reportlab**: BSD License
|
||||
- **poppler-utils**: GPL-2 License
|
||||
- **qpdf**: Apache License
|
||||
- **pdf-lib**: MIT License
|
||||
- **pdfjs-dist**: Apache License
|
||||
172
skills/pdf/scripts/add_zai_metadata.py
Executable file
172
skills/pdf/scripts/add_zai_metadata.py
Executable file
@@ -0,0 +1,172 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Add Z.ai branding metadata to PDF documents.
|
||||
|
||||
This script adds Z.ai metadata (Author, Creator, Producer) to PDF files.
|
||||
It can process single files or batch process multiple PDFs.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import argparse
|
||||
from pypdf import PdfReader, PdfWriter
|
||||
|
||||
|
||||
def add_zai_metadata(input_pdf_path, output_pdf_path=None, custom_title=None, verbose=True):
|
||||
"""
|
||||
Add Z.ai branding metadata to a PDF document.
|
||||
|
||||
Args:
|
||||
input_pdf_path: Path to input PDF
|
||||
output_pdf_path: Path to output PDF (default: overwrites input)
|
||||
custom_title: Custom title to use (default: preserves original or uses filename)
|
||||
verbose: Print status messages (default: True)
|
||||
|
||||
Sets:
|
||||
- Author: Z.ai
|
||||
- Creator: Z.ai
|
||||
- Producer: http://z.ai
|
||||
- Title: Custom title, original title, or filename (in that priority)
|
||||
|
||||
Returns:
|
||||
Path to the output PDF file
|
||||
"""
|
||||
# Validate input file exists
|
||||
if not os.path.exists(input_pdf_path):
|
||||
print(f"Error: Input file not found: {input_pdf_path}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Read the PDF
|
||||
try:
|
||||
reader = PdfReader(input_pdf_path)
|
||||
except Exception as e:
|
||||
print(f"Error: Cannot open PDF: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
writer = PdfWriter()
|
||||
|
||||
# Copy all pages
|
||||
for page in reader.pages:
|
||||
writer.add_page(page)
|
||||
|
||||
# Determine title
|
||||
if custom_title:
|
||||
title = custom_title
|
||||
else:
|
||||
original_meta = reader.metadata
|
||||
if original_meta and original_meta.title and original_meta.title not in ['(anonymous)', 'unspecified', None]:
|
||||
title = original_meta.title
|
||||
else:
|
||||
# Use filename without extension as title
|
||||
title = os.path.splitext(os.path.basename(input_pdf_path))[0]
|
||||
|
||||
# Add Z.ai metadata
|
||||
writer.add_metadata({
|
||||
'/Title': title,
|
||||
'/Author': 'Z.ai',
|
||||
'/Creator': 'Z.ai',
|
||||
'/Producer': 'http://z.ai',
|
||||
})
|
||||
|
||||
# Write output
|
||||
if output_pdf_path is None:
|
||||
output_pdf_path = input_pdf_path
|
||||
|
||||
try:
|
||||
with open(output_pdf_path, "wb") as output:
|
||||
writer.write(output)
|
||||
except Exception as e:
|
||||
print(f"Error: Cannot write output file: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Print status
|
||||
if verbose:
|
||||
print(f"✓ Updated metadata for: {os.path.basename(input_pdf_path)}")
|
||||
print(f" Title: {title}")
|
||||
print(f" Author: Z.ai")
|
||||
print(f" Creator: Z.ai")
|
||||
print(f" Producer: http://z.ai")
|
||||
if output_pdf_path != input_pdf_path:
|
||||
print(f" Output: {output_pdf_path}")
|
||||
|
||||
return output_pdf_path
|
||||
|
||||
|
||||
def main():
|
||||
"""Command-line interface for add_zai_metadata."""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Add Z.ai branding metadata to PDF documents',
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Add metadata to a single PDF (in-place)
|
||||
%(prog)s document.pdf
|
||||
|
||||
# Add metadata to a single PDF (create new file)
|
||||
%(prog)s input.pdf -o output.pdf
|
||||
|
||||
# Add metadata with custom title
|
||||
%(prog)s report.pdf -t "Q4 Financial Analysis"
|
||||
|
||||
# Batch process all PDFs in current directory
|
||||
%(prog)s *.pdf
|
||||
|
||||
# Quiet mode (no output)
|
||||
%(prog)s document.pdf -q
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'input',
|
||||
nargs='+',
|
||||
help='Input PDF file(s) to process'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-o', '--output',
|
||||
help='Output PDF path (only for single input file)'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-t', '--title',
|
||||
help='Custom title for the PDF'
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'-q', '--quiet',
|
||||
action='store_true',
|
||||
help='Quiet mode (no status messages)'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Check if output is specified for multiple files
|
||||
if args.output and len(args.input) > 1:
|
||||
print("Error: --output can only be used with a single input file", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Process each input file
|
||||
for input_path in args.input:
|
||||
# Determine output path
|
||||
if len(args.input) == 1 and args.output:
|
||||
output_path = args.output
|
||||
else:
|
||||
output_path = None # Overwrite in-place
|
||||
|
||||
# Determine title
|
||||
if args.title:
|
||||
custom_title = args.title
|
||||
else:
|
||||
custom_title = None
|
||||
|
||||
# Add metadata
|
||||
add_zai_metadata(
|
||||
input_path,
|
||||
output_pdf_path=output_path,
|
||||
custom_title=custom_title,
|
||||
verbose=not args.quiet
|
||||
)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
70
skills/pdf/scripts/check_bounding_boxes.py
Executable file
70
skills/pdf/scripts/check_bounding_boxes.py
Executable file
@@ -0,0 +1,70 @@
|
||||
from dataclasses import dataclass
|
||||
import json
|
||||
import sys
|
||||
|
||||
|
||||
# Script to check that the `fields.json` file that GLM creates when analyzing PDFs
|
||||
# does not have overlapping bounding boxes. See forms.md.
|
||||
|
||||
|
||||
@dataclass
|
||||
class RectAndField:
|
||||
rect: list[float]
|
||||
rect_type: str
|
||||
field: dict
|
||||
|
||||
|
||||
# Returns a list of messages that are printed to stdout for GLM to read.
|
||||
def get_bounding_box_messages(fields_json_stream) -> list[str]:
|
||||
messages = []
|
||||
fields = json.load(fields_json_stream)
|
||||
messages.append(f"Read {len(fields['form_fields'])} fields")
|
||||
|
||||
def rects_intersect(r1, r2):
|
||||
disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0]
|
||||
disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1]
|
||||
return not (disjoint_horizontal or disjoint_vertical)
|
||||
|
||||
rects_and_fields = []
|
||||
for f in fields["form_fields"]:
|
||||
rects_and_fields.append(RectAndField(f["label_bounding_box"], "label", f))
|
||||
rects_and_fields.append(RectAndField(f["entry_bounding_box"], "entry", f))
|
||||
|
||||
has_error = False
|
||||
for i, ri in enumerate(rects_and_fields):
|
||||
# This is O(N^2); we can optimize if it becomes a problem.
|
||||
for j in range(i + 1, len(rects_and_fields)):
|
||||
rj = rects_and_fields[j]
|
||||
if ri.field["page_number"] == rj.field["page_number"] and rects_intersect(ri.rect, rj.rect):
|
||||
has_error = True
|
||||
if ri.field is rj.field:
|
||||
messages.append(f"FAILURE: intersection between label and entry bounding boxes for `{ri.field['description']}` ({ri.rect}, {rj.rect})")
|
||||
else:
|
||||
messages.append(f"FAILURE: intersection between {ri.rect_type} bounding box for `{ri.field['description']}` ({ri.rect}) and {rj.rect_type} bounding box for `{rj.field['description']}` ({rj.rect})")
|
||||
if len(messages) >= 20:
|
||||
messages.append("Aborting further checks; fix bounding boxes and try again")
|
||||
return messages
|
||||
if ri.rect_type == "entry":
|
||||
if "entry_text" in ri.field:
|
||||
font_size = ri.field["entry_text"].get("font_size", 14)
|
||||
entry_height = ri.rect[3] - ri.rect[1]
|
||||
if entry_height < font_size:
|
||||
has_error = True
|
||||
messages.append(f"FAILURE: entry bounding box height ({entry_height}) for `{ri.field['description']}` is too short for the text content (font size: {font_size}). Increase the box height or decrease the font size.")
|
||||
if len(messages) >= 20:
|
||||
messages.append("Aborting further checks; fix bounding boxes and try again")
|
||||
return messages
|
||||
|
||||
if not has_error:
|
||||
messages.append("SUCCESS: All bounding boxes are valid")
|
||||
return messages
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) != 2:
|
||||
print("Usage: check_bounding_boxes.py [fields.json]")
|
||||
sys.exit(1)
|
||||
# Input file should be in the `fields.json` format described in forms.md.
|
||||
with open(sys.argv[1]) as f:
|
||||
messages = get_bounding_box_messages(f)
|
||||
for msg in messages:
|
||||
print(msg)
|
||||
226
skills/pdf/scripts/check_bounding_boxes_test.py
Executable file
226
skills/pdf/scripts/check_bounding_boxes_test.py
Executable file
@@ -0,0 +1,226 @@
|
||||
import unittest
|
||||
import json
|
||||
import io
|
||||
from check_bounding_boxes import get_bounding_box_messages
|
||||
|
||||
|
||||
# Currently this is not run automatically in CI; it's just for documentation and manual checking.
|
||||
class TestGetBoundingBoxMessages(unittest.TestCase):
|
||||
|
||||
def create_json_stream(self, data):
|
||||
"""Helper to create a JSON stream from data"""
|
||||
return io.StringIO(json.dumps(data))
|
||||
|
||||
def test_no_intersections(self):
|
||||
"""Test case with no bounding box intersections"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 30]
|
||||
},
|
||||
{
|
||||
"description": "Email",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 40, 50, 60],
|
||||
"entry_bounding_box": [60, 40, 150, 60]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
def test_label_entry_intersection_same_field(self):
|
||||
"""Test intersection between label and entry of the same field"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 60, 30],
|
||||
"entry_bounding_box": [50, 10, 150, 30] # Overlaps with label
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
|
||||
self.assertFalse(any("SUCCESS" in msg for msg in messages))
|
||||
|
||||
def test_intersection_between_different_fields(self):
|
||||
"""Test intersection between bounding boxes of different fields"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 30]
|
||||
},
|
||||
{
|
||||
"description": "Email",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [40, 20, 80, 40], # Overlaps with Name's boxes
|
||||
"entry_bounding_box": [160, 10, 250, 30]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
|
||||
self.assertFalse(any("SUCCESS" in msg for msg in messages))
|
||||
|
||||
def test_different_pages_no_intersection(self):
|
||||
"""Test that boxes on different pages don't count as intersecting"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 30]
|
||||
},
|
||||
{
|
||||
"description": "Email",
|
||||
"page_number": 2,
|
||||
"label_bounding_box": [10, 10, 50, 30], # Same coordinates but different page
|
||||
"entry_bounding_box": [60, 10, 150, 30]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
def test_entry_height_too_small(self):
|
||||
"""Test that entry box height is checked against font size"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 20], # Height is 10
|
||||
"entry_text": {
|
||||
"font_size": 14 # Font size larger than height
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
|
||||
self.assertFalse(any("SUCCESS" in msg for msg in messages))
|
||||
|
||||
def test_entry_height_adequate(self):
|
||||
"""Test that adequate entry box height passes"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 30], # Height is 20
|
||||
"entry_text": {
|
||||
"font_size": 14 # Font size smaller than height
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
def test_default_font_size(self):
|
||||
"""Test that default font size is used when not specified"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 20], # Height is 10
|
||||
"entry_text": {} # No font_size specified, should use default 14
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
|
||||
self.assertFalse(any("SUCCESS" in msg for msg in messages))
|
||||
|
||||
def test_no_entry_text(self):
|
||||
"""Test that missing entry_text doesn't cause height check"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [60, 10, 150, 20] # Small height but no entry_text
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
def test_multiple_errors_limit(self):
|
||||
"""Test that error messages are limited to prevent excessive output"""
|
||||
fields = []
|
||||
# Create many overlapping fields
|
||||
for i in range(25):
|
||||
fields.append({
|
||||
"description": f"Field{i}",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30], # All overlap
|
||||
"entry_bounding_box": [20, 15, 60, 35] # All overlap
|
||||
})
|
||||
|
||||
data = {"form_fields": fields}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
# Should abort after ~20 messages
|
||||
self.assertTrue(any("Aborting" in msg for msg in messages))
|
||||
# Should have some FAILURE messages but not hundreds
|
||||
failure_count = sum(1 for msg in messages if "FAILURE" in msg)
|
||||
self.assertGreater(failure_count, 0)
|
||||
self.assertLess(len(messages), 30) # Should be limited
|
||||
|
||||
def test_edge_touching_boxes(self):
|
||||
"""Test that boxes touching at edges don't count as intersecting"""
|
||||
data = {
|
||||
"form_fields": [
|
||||
{
|
||||
"description": "Name",
|
||||
"page_number": 1,
|
||||
"label_bounding_box": [10, 10, 50, 30],
|
||||
"entry_bounding_box": [50, 10, 150, 30] # Touches at x=50
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
stream = self.create_json_stream(data)
|
||||
messages = get_bounding_box_messages(stream)
|
||||
self.assertTrue(any("SUCCESS" in msg for msg in messages))
|
||||
self.assertFalse(any("FAILURE" in msg for msg in messages))
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
unittest.main()
|
||||
12
skills/pdf/scripts/check_fillable_fields.py
Executable file
12
skills/pdf/scripts/check_fillable_fields.py
Executable file
@@ -0,0 +1,12 @@
|
||||
import sys
|
||||
from pypdf import PdfReader
|
||||
|
||||
|
||||
# Script for GLM to run to determine whether a PDF has fillable form fields. See forms.md.
|
||||
|
||||
|
||||
reader = PdfReader(sys.argv[1])
|
||||
if (reader.get_fields()):
|
||||
print("This PDF has fillable form fields")
|
||||
else:
|
||||
print("This PDF does not have fillable form fields; you will need to visually determine where to enter data")
|
||||
35
skills/pdf/scripts/convert_pdf_to_images.py
Executable file
35
skills/pdf/scripts/convert_pdf_to_images.py
Executable file
@@ -0,0 +1,35 @@
|
||||
import os
|
||||
import sys
|
||||
|
||||
from pdf2image import convert_from_path
|
||||
|
||||
|
||||
# Converts each page of a PDF to a PNG image.
|
||||
|
||||
|
||||
def convert(pdf_path, output_dir, max_dim=1000):
|
||||
images = convert_from_path(pdf_path, dpi=200)
|
||||
|
||||
for i, image in enumerate(images):
|
||||
# Scale image if needed to keep width/height under `max_dim`
|
||||
width, height = image.size
|
||||
if width > max_dim or height > max_dim:
|
||||
scale_factor = min(max_dim / width, max_dim / height)
|
||||
new_width = int(width * scale_factor)
|
||||
new_height = int(height * scale_factor)
|
||||
image = image.resize((new_width, new_height))
|
||||
|
||||
image_path = os.path.join(output_dir, f"page_{i+1}.png")
|
||||
image.save(image_path)
|
||||
print(f"Saved page {i+1} as {image_path} (size: {image.size})")
|
||||
|
||||
print(f"Converted {len(images)} pages to PNG images")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) != 3:
|
||||
print("Usage: convert_pdf_to_images.py [input pdf] [output directory]")
|
||||
sys.exit(1)
|
||||
pdf_path = sys.argv[1]
|
||||
output_directory = sys.argv[2]
|
||||
convert(pdf_path, output_directory)
|
||||
41
skills/pdf/scripts/create_validation_image.py
Executable file
41
skills/pdf/scripts/create_validation_image.py
Executable file
@@ -0,0 +1,41 @@
|
||||
import json
|
||||
import sys
|
||||
|
||||
from PIL import Image, ImageDraw
|
||||
|
||||
|
||||
# Creates "validation" images with rectangles for the bounding box information that
|
||||
# GLM creates when determining where to add text annotations in PDFs. See forms.md.
|
||||
|
||||
|
||||
def create_validation_image(page_number, fields_json_path, input_path, output_path):
|
||||
# Input file should be in the `fields.json` format described in forms.md.
|
||||
with open(fields_json_path, 'r') as f:
|
||||
data = json.load(f)
|
||||
|
||||
img = Image.open(input_path)
|
||||
draw = ImageDraw.Draw(img)
|
||||
num_boxes = 0
|
||||
|
||||
for field in data["form_fields"]:
|
||||
if field["page_number"] == page_number:
|
||||
entry_box = field['entry_bounding_box']
|
||||
label_box = field['label_bounding_box']
|
||||
# Draw red rectangle over entry bounding box and blue rectangle over the label.
|
||||
draw.rectangle(entry_box, outline='red', width=2)
|
||||
draw.rectangle(label_box, outline='blue', width=2)
|
||||
num_boxes += 2
|
||||
|
||||
img.save(output_path)
|
||||
print(f"Created validation image at {output_path} with {num_boxes} bounding boxes")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) != 5:
|
||||
print("Usage: create_validation_image.py [page number] [fields.json file] [input image path] [output image path]")
|
||||
sys.exit(1)
|
||||
page_number = int(sys.argv[1])
|
||||
fields_json_path = sys.argv[2]
|
||||
input_image_path = sys.argv[3]
|
||||
output_image_path = sys.argv[4]
|
||||
create_validation_image(page_number, fields_json_path, input_image_path, output_image_path)
|
||||
152
skills/pdf/scripts/extract_form_field_info.py
Executable file
152
skills/pdf/scripts/extract_form_field_info.py
Executable file
@@ -0,0 +1,152 @@
|
||||
import json
|
||||
import sys
|
||||
|
||||
from pypdf import PdfReader
|
||||
|
||||
|
||||
# Extracts data for the fillable form fields in a PDF and outputs JSON that
|
||||
# GLM uses to fill the fields. See forms.md.
|
||||
|
||||
|
||||
# This matches the format used by PdfReader `get_fields` and `update_page_form_field_values` methods.
|
||||
def get_full_annotation_field_id(annotation):
|
||||
components = []
|
||||
while annotation:
|
||||
field_name = annotation.get('/T')
|
||||
if field_name:
|
||||
components.append(field_name)
|
||||
annotation = annotation.get('/Parent')
|
||||
return ".".join(reversed(components)) if components else None
|
||||
|
||||
|
||||
def make_field_dict(field, field_id):
|
||||
field_dict = {"field_id": field_id}
|
||||
ft = field.get('/FT')
|
||||
if ft == "/Tx":
|
||||
field_dict["type"] = "text"
|
||||
elif ft == "/Btn":
|
||||
field_dict["type"] = "checkbox" # radio groups handled separately
|
||||
states = field.get("/_States_", [])
|
||||
if len(states) == 2:
|
||||
# "/Off" seems to always be the unchecked value, as suggested by
|
||||
# https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#page=448
|
||||
# It can be either first or second in the "/_States_" list.
|
||||
if "/Off" in states:
|
||||
field_dict["checked_value"] = states[0] if states[0] != "/Off" else states[1]
|
||||
field_dict["unchecked_value"] = "/Off"
|
||||
else:
|
||||
print(f"Unexpected state values for checkbox `${field_id}`. Its checked and unchecked values may not be correct; if you're trying to check it, visually verify the results.")
|
||||
field_dict["checked_value"] = states[0]
|
||||
field_dict["unchecked_value"] = states[1]
|
||||
elif ft == "/Ch":
|
||||
field_dict["type"] = "choice"
|
||||
states = field.get("/_States_", [])
|
||||
field_dict["choice_options"] = [{
|
||||
"value": state[0],
|
||||
"text": state[1],
|
||||
} for state in states]
|
||||
else:
|
||||
field_dict["type"] = f"unknown ({ft})"
|
||||
return field_dict
|
||||
|
||||
|
||||
# Returns a list of fillable PDF fields:
|
||||
# [
|
||||
# {
|
||||
# "field_id": "name",
|
||||
# "page": 1,
|
||||
# "type": ("text", "checkbox", "radio_group", or "choice")
|
||||
# // Per-type additional fields described in forms.md
|
||||
# },
|
||||
# ]
|
||||
def get_field_info(reader: PdfReader):
|
||||
fields = reader.get_fields()
|
||||
|
||||
field_info_by_id = {}
|
||||
possible_radio_names = set()
|
||||
|
||||
for field_id, field in fields.items():
|
||||
# Skip if this is a container field with children, except that it might be
|
||||
# a parent group for radio button options.
|
||||
if field.get("/Kids"):
|
||||
if field.get("/FT") == "/Btn":
|
||||
possible_radio_names.add(field_id)
|
||||
continue
|
||||
field_info_by_id[field_id] = make_field_dict(field, field_id)
|
||||
|
||||
# Bounding rects are stored in annotations in page objects.
|
||||
|
||||
# Radio button options have a separate annotation for each choice;
|
||||
# all choices have the same field name.
|
||||
# See https://westhealth.github.io/exploring-fillable-forms-with-pdfrw.html
|
||||
radio_fields_by_id = {}
|
||||
|
||||
for page_index, page in enumerate(reader.pages):
|
||||
annotations = page.get('/Annots', [])
|
||||
for ann in annotations:
|
||||
field_id = get_full_annotation_field_id(ann)
|
||||
if field_id in field_info_by_id:
|
||||
field_info_by_id[field_id]["page"] = page_index + 1
|
||||
field_info_by_id[field_id]["rect"] = ann.get('/Rect')
|
||||
elif field_id in possible_radio_names:
|
||||
try:
|
||||
# ann['/AP']['/N'] should have two items. One of them is '/Off',
|
||||
# the other is the active value.
|
||||
on_values = [v for v in ann["/AP"]["/N"] if v != "/Off"]
|
||||
except KeyError:
|
||||
continue
|
||||
if len(on_values) == 1:
|
||||
rect = ann.get("/Rect")
|
||||
if field_id not in radio_fields_by_id:
|
||||
radio_fields_by_id[field_id] = {
|
||||
"field_id": field_id,
|
||||
"type": "radio_group",
|
||||
"page": page_index + 1,
|
||||
"radio_options": [],
|
||||
}
|
||||
# Note: at least on macOS 15.7, Preview.app doesn't show selected
|
||||
# radio buttons correctly. (It does if you remove the leading slash
|
||||
# from the value, but that causes them not to appear correctly in
|
||||
# Chrome/Firefox/Acrobat/etc).
|
||||
radio_fields_by_id[field_id]["radio_options"].append({
|
||||
"value": on_values[0],
|
||||
"rect": rect,
|
||||
})
|
||||
|
||||
# Some PDFs have form field definitions without corresponding annotations,
|
||||
# so we can't tell where they are. Ignore these fields for now.
|
||||
fields_with_location = []
|
||||
for field_info in field_info_by_id.values():
|
||||
if "page" in field_info:
|
||||
fields_with_location.append(field_info)
|
||||
else:
|
||||
print(f"Unable to determine location for field id: {field_info.get('field_id')}, ignoring")
|
||||
|
||||
# Sort by page number, then Y position (flipped in PDF coordinate system), then X.
|
||||
def sort_key(f):
|
||||
if "radio_options" in f:
|
||||
rect = f["radio_options"][0]["rect"] or [0, 0, 0, 0]
|
||||
else:
|
||||
rect = f.get("rect") or [0, 0, 0, 0]
|
||||
adjusted_position = [-rect[1], rect[0]]
|
||||
return [f.get("page"), adjusted_position]
|
||||
|
||||
sorted_fields = fields_with_location + list(radio_fields_by_id.values())
|
||||
sorted_fields.sort(key=sort_key)
|
||||
|
||||
return sorted_fields
|
||||
|
||||
|
||||
def write_field_info(pdf_path: str, json_output_path: str):
|
||||
reader = PdfReader(pdf_path)
|
||||
field_info = get_field_info(reader)
|
||||
with open(json_output_path, "w") as f:
|
||||
json.dump(field_info, f, indent=2)
|
||||
print(f"Wrote {len(field_info)} fields to {json_output_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) != 3:
|
||||
print("Usage: extract_form_field_info.py [input pdf] [output json]")
|
||||
sys.exit(1)
|
||||
write_field_info(sys.argv[1], sys.argv[2])
|
||||
114
skills/pdf/scripts/fill_fillable_fields.py
Executable file
114
skills/pdf/scripts/fill_fillable_fields.py
Executable file
@@ -0,0 +1,114 @@
|
||||
import json
|
||||
import sys
|
||||
|
||||
from pypdf import PdfReader, PdfWriter
|
||||
|
||||
from extract_form_field_info import get_field_info
|
||||
|
||||
|
||||
# Fills fillable form fields in a PDF. See forms.md.
|
||||
|
||||
|
||||
def fill_pdf_fields(input_pdf_path: str, fields_json_path: str, output_pdf_path: str):
|
||||
with open(fields_json_path) as f:
|
||||
fields = json.load(f)
|
||||
# Group by page number.
|
||||
fields_by_page = {}
|
||||
for field in fields:
|
||||
if "value" in field:
|
||||
field_id = field["field_id"]
|
||||
page = field["page"]
|
||||
if page not in fields_by_page:
|
||||
fields_by_page[page] = {}
|
||||
fields_by_page[page][field_id] = field["value"]
|
||||
|
||||
reader = PdfReader(input_pdf_path)
|
||||
|
||||
has_error = False
|
||||
field_info = get_field_info(reader)
|
||||
fields_by_ids = {f["field_id"]: f for f in field_info}
|
||||
for field in fields:
|
||||
existing_field = fields_by_ids.get(field["field_id"])
|
||||
if not existing_field:
|
||||
has_error = True
|
||||
print(f"ERROR: `{field['field_id']}` is not a valid field ID")
|
||||
elif field["page"] != existing_field["page"]:
|
||||
has_error = True
|
||||
print(f"ERROR: Incorrect page number for `{field['field_id']}` (got {field['page']}, expected {existing_field['page']})")
|
||||
else:
|
||||
if "value" in field:
|
||||
err = validation_error_for_field_value(existing_field, field["value"])
|
||||
if err:
|
||||
print(err)
|
||||
has_error = True
|
||||
if has_error:
|
||||
sys.exit(1)
|
||||
|
||||
writer = PdfWriter(clone_from=reader)
|
||||
for page, field_values in fields_by_page.items():
|
||||
writer.update_page_form_field_values(writer.pages[page - 1], field_values, auto_regenerate=False)
|
||||
|
||||
# This seems to be necessary for many PDF viewers to format the form values correctly.
|
||||
# It may cause the viewer to show a "save changes" dialog even if the user doesn't make any changes.
|
||||
writer.set_need_appearances_writer(True)
|
||||
|
||||
with open(output_pdf_path, "wb") as f:
|
||||
writer.write(f)
|
||||
|
||||
|
||||
def validation_error_for_field_value(field_info, field_value):
|
||||
field_type = field_info["type"]
|
||||
field_id = field_info["field_id"]
|
||||
if field_type == "checkbox":
|
||||
checked_val = field_info["checked_value"]
|
||||
unchecked_val = field_info["unchecked_value"]
|
||||
if field_value != checked_val and field_value != unchecked_val:
|
||||
return f'ERROR: Invalid value "{field_value}" for checkbox field "{field_id}". The checked value is "{checked_val}" and the unchecked value is "{unchecked_val}"'
|
||||
elif field_type == "radio_group":
|
||||
option_values = [opt["value"] for opt in field_info["radio_options"]]
|
||||
if field_value not in option_values:
|
||||
return f'ERROR: Invalid value "{field_value}" for radio group field "{field_id}". Valid values are: {option_values}'
|
||||
elif field_type == "choice":
|
||||
choice_values = [opt["value"] for opt in field_info["choice_options"]]
|
||||
if field_value not in choice_values:
|
||||
return f'ERROR: Invalid value "{field_value}" for choice field "{field_id}". Valid values are: {choice_values}'
|
||||
return None
|
||||
|
||||
|
||||
# pypdf (at least version 5.7.0) has a bug when setting the value for a selection list field.
|
||||
# In _writer.py around line 966:
|
||||
#
|
||||
# if field.get(FA.FT, "/Tx") == "/Ch" and field_flags & FA.FfBits.Combo == 0:
|
||||
# txt = "\n".join(annotation.get_inherited(FA.Opt, []))
|
||||
#
|
||||
# The problem is that for selection lists, `get_inherited` returns a list of two-element lists like
|
||||
# [["value1", "Text 1"], ["value2", "Text 2"], ...]
|
||||
# This causes `join` to throw a TypeError because it expects an iterable of strings.
|
||||
# The horrible workaround is to patch `get_inherited` to return a list of the value strings.
|
||||
# We call the original method and adjust the return value only if the argument to `get_inherited`
|
||||
# is `FA.Opt` and if the return value is a list of two-element lists.
|
||||
def monkeypatch_pydpf_method():
|
||||
from pypdf.generic import DictionaryObject
|
||||
from pypdf.constants import FieldDictionaryAttributes
|
||||
|
||||
original_get_inherited = DictionaryObject.get_inherited
|
||||
|
||||
def patched_get_inherited(self, key: str, default = None):
|
||||
result = original_get_inherited(self, key, default)
|
||||
if key == FieldDictionaryAttributes.Opt:
|
||||
if isinstance(result, list) and all(isinstance(v, list) and len(v) == 2 for v in result):
|
||||
result = [r[0] for r in result]
|
||||
return result
|
||||
|
||||
DictionaryObject.get_inherited = patched_get_inherited
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) != 4:
|
||||
print("Usage: fill_fillable_fields.py [input pdf] [field_values.json] [output pdf]")
|
||||
sys.exit(1)
|
||||
monkeypatch_pydpf_method()
|
||||
input_pdf = sys.argv[1]
|
||||
fields_json = sys.argv[2]
|
||||
output_pdf = sys.argv[3]
|
||||
fill_pdf_fields(input_pdf, fields_json, output_pdf)
|
||||
108
skills/pdf/scripts/fill_pdf_form_with_annotations.py
Executable file
108
skills/pdf/scripts/fill_pdf_form_with_annotations.py
Executable file
@@ -0,0 +1,108 @@
|
||||
import json
|
||||
import sys
|
||||
|
||||
from pypdf import PdfReader, PdfWriter
|
||||
from pypdf.annotations import FreeText
|
||||
|
||||
|
||||
# Fills a PDF by adding text annotations defined in `fields.json`. See forms.md.
|
||||
|
||||
|
||||
def transform_coordinates(bbox, image_width, image_height, pdf_width, pdf_height):
|
||||
"""Transform bounding box from image coordinates to PDF coordinates"""
|
||||
# Image coordinates: origin at top-left, y increases downward
|
||||
# PDF coordinates: origin at bottom-left, y increases upward
|
||||
x_scale = pdf_width / image_width
|
||||
y_scale = pdf_height / image_height
|
||||
|
||||
left = bbox[0] * x_scale
|
||||
right = bbox[2] * x_scale
|
||||
|
||||
# Flip Y coordinates for PDF
|
||||
top = pdf_height - (bbox[1] * y_scale)
|
||||
bottom = pdf_height - (bbox[3] * y_scale)
|
||||
|
||||
return left, bottom, right, top
|
||||
|
||||
|
||||
def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path):
|
||||
"""Fill the PDF form with data from fields.json"""
|
||||
|
||||
# `fields.json` format described in forms.md.
|
||||
with open(fields_json_path, "r") as f:
|
||||
fields_data = json.load(f)
|
||||
|
||||
# Open the PDF
|
||||
reader = PdfReader(input_pdf_path)
|
||||
writer = PdfWriter()
|
||||
|
||||
# Copy all pages to writer
|
||||
writer.append(reader)
|
||||
|
||||
# Get PDF dimensions for each page
|
||||
pdf_dimensions = {}
|
||||
for i, page in enumerate(reader.pages):
|
||||
mediabox = page.mediabox
|
||||
pdf_dimensions[i + 1] = [mediabox.width, mediabox.height]
|
||||
|
||||
# Process each form field
|
||||
annotations = []
|
||||
for field in fields_data["form_fields"]:
|
||||
page_num = field["page_number"]
|
||||
|
||||
# Get page dimensions and transform coordinates.
|
||||
page_info = next(p for p in fields_data["pages"] if p["page_number"] == page_num)
|
||||
image_width = page_info["image_width"]
|
||||
image_height = page_info["image_height"]
|
||||
pdf_width, pdf_height = pdf_dimensions[page_num]
|
||||
|
||||
transformed_entry_box = transform_coordinates(
|
||||
field["entry_bounding_box"],
|
||||
image_width, image_height,
|
||||
pdf_width, pdf_height
|
||||
)
|
||||
|
||||
# Skip empty fields
|
||||
if "entry_text" not in field or "text" not in field["entry_text"]:
|
||||
continue
|
||||
entry_text = field["entry_text"]
|
||||
text = entry_text["text"]
|
||||
if not text:
|
||||
continue
|
||||
|
||||
font_name = entry_text.get("font", "Arial")
|
||||
font_size = str(entry_text.get("font_size", 14)) + "pt"
|
||||
font_color = entry_text.get("font_color", "000000")
|
||||
|
||||
# Font size/color seems to not work reliably across viewers:
|
||||
# https://github.com/py-pdf/pypdf/issues/2084
|
||||
annotation = FreeText(
|
||||
text=text,
|
||||
rect=transformed_entry_box,
|
||||
font=font_name,
|
||||
font_size=font_size,
|
||||
font_color=font_color,
|
||||
border_color=None,
|
||||
background_color=None,
|
||||
)
|
||||
annotations.append(annotation)
|
||||
# page_number is 0-based for pypdf
|
||||
writer.add_annotation(page_number=page_num - 1, annotation=annotation)
|
||||
|
||||
# Save the filled PDF
|
||||
with open(output_pdf_path, "wb") as output:
|
||||
writer.write(output)
|
||||
|
||||
print(f"Successfully filled PDF form and saved to {output_pdf_path}")
|
||||
print(f"Added {len(annotations)} text annotations")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) != 4:
|
||||
print("Usage: fill_pdf_form_with_annotations.py [input pdf] [fields.json] [output pdf]")
|
||||
sys.exit(1)
|
||||
input_pdf = sys.argv[1]
|
||||
fields_json = sys.argv[2]
|
||||
output_pdf = sys.argv[3]
|
||||
|
||||
fill_pdf_form(input_pdf, fields_json, output_pdf)
|
||||
110
skills/pdf/scripts/sanitize_code.py
Executable file
110
skills/pdf/scripts/sanitize_code.py
Executable file
@@ -0,0 +1,110 @@
|
||||
import re
|
||||
import html
|
||||
import sys
|
||||
from typing import Dict
|
||||
|
||||
# ---------- Step 0: restore literal unicode escapes/entities to real chars ----------
|
||||
_RE_UNICODE_ESC = re.compile(r"(\\u[0-9a-fA-F]{4})|(\\U[0-9a-fA-F]{8})|(\\x[0-9a-fA-F]{2})")
|
||||
|
||||
def _restore_escapes(s: str) -> str:
|
||||
# HTML entities: ³ ≤ α ...
|
||||
s = html.unescape(s)
|
||||
|
||||
# Literal backslash escapes: "\\u00B3" -> "³"
|
||||
def _dec(m: re.Match) -> str:
|
||||
esc = m.group(0)
|
||||
try:
|
||||
if esc.startswith("\\u") or esc.startswith("\\U"):
|
||||
return chr(int(esc[2:], 16))
|
||||
if esc.startswith("\\x"):
|
||||
return chr(int(esc[2:], 16))
|
||||
except Exception:
|
||||
return esc
|
||||
return esc
|
||||
|
||||
return _RE_UNICODE_ESC.sub(_dec, s)
|
||||
|
||||
# ---------- Step 1: superscripts/subscripts -> <super>/<sub> ----------
|
||||
_SUPERSCRIPT_MAP: Dict[str, str] = {
|
||||
"⁰": "0", "¹": "1", "²": "2", "³": "3", "⁴": "4",
|
||||
"⁵": "5", "⁶": "6", "⁷": "7", "⁸": "8", "⁹": "9",
|
||||
"⁺": "+", "⁻": "-", "⁼": "=", "⁽": "(", "⁾": ")",
|
||||
"ⁿ": "n", "ᶦ": "i",
|
||||
}
|
||||
|
||||
_SUBSCRIPT_MAP: Dict[str, str] = {
|
||||
"₀": "0", "₁": "1", "₂": "2", "₃": "3", "₄": "4",
|
||||
"₅": "5", "₆": "6", "₇": "7", "₈": "8", "₉": "9",
|
||||
"₊": "+", "₋": "-", "₌": "=", "₍": "(", "₎": ")",
|
||||
"ₐ": "a", "ₑ": "e", "ₕ": "h", "ᵢ": "i", "ⱼ": "j",
|
||||
"ₖ": "k", "ₗ": "l", "ₘ": "m", "ₙ": "n", "ₒ": "o",
|
||||
"ₚ": "p", "ᵣ": "r", "ₛ": "s", "ₜ": "t", "ᵤ": "u",
|
||||
"ᵥ": "v", "ₓ": "x",
|
||||
}
|
||||
|
||||
def _replace_super_sub(s: str) -> str:
|
||||
out = []
|
||||
for ch in s:
|
||||
if ch in _SUPERSCRIPT_MAP:
|
||||
out.append(f"<super>{_SUPERSCRIPT_MAP[ch]}</super>")
|
||||
elif ch in _SUBSCRIPT_MAP:
|
||||
out.append(f"<sub>{_SUBSCRIPT_MAP[ch]}</sub>")
|
||||
else:
|
||||
out.append(ch)
|
||||
return "".join(out)
|
||||
|
||||
# ---------- Step 2: symbol fallback for SimHei (protect tags, then replace) ----------
|
||||
_SYMBOL_FALLBACK: Dict[str, str] = {
|
||||
# Currently empty - enable entries as needed for fonts missing specific glyphs
|
||||
# "±": "+/-",
|
||||
# "×": "*",
|
||||
# "÷": "/",
|
||||
# "≤": "<=",
|
||||
# "≥": ">=",
|
||||
# "≠": "!=",
|
||||
# "≈": "~=",
|
||||
# "∞": "inf",
|
||||
}
|
||||
|
||||
def _fallback_symbols(s: str) -> str:
|
||||
# Protect <super>/<sub> tags from being modified
|
||||
placeholders = {}
|
||||
def _protect_tag(m: re.Match) -> str:
|
||||
key = f"@@TAG{len(placeholders)}@@"
|
||||
placeholders[key] = m.group(0)
|
||||
return key
|
||||
|
||||
protected = re.sub(r"</?super>|</?sub>", _protect_tag, s)
|
||||
|
||||
# Replace symbols
|
||||
protected = "".join(_SYMBOL_FALLBACK.get(ch, ch) for ch in protected)
|
||||
|
||||
# Restore tags
|
||||
for k, v in placeholders.items():
|
||||
protected = protected.replace(k, v)
|
||||
|
||||
return protected
|
||||
|
||||
def sanitize_code(text: str) -> str:
|
||||
"""
|
||||
Full sanitization pipeline for PDF generation code.
|
||||
- Restore unicode escapes/entities to real characters
|
||||
- Replace superscript/subscript unicode with <super>/<sub>
|
||||
- Replace other risky symbols with ASCII/text fallbacks
|
||||
"""
|
||||
s = _restore_escapes(text)
|
||||
s = _replace_super_sub(s)
|
||||
s = _fallback_symbols(s)
|
||||
return s
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) < 2:
|
||||
print("Usage: python sanitize_code.py <target_script.py>")
|
||||
sys.exit(1)
|
||||
target = sys.argv[1]
|
||||
with open(target, "r", encoding="utf-8") as f:
|
||||
code = f.read()
|
||||
sanitized = sanitize_code(code)
|
||||
with open(target, "w", encoding="utf-8") as f:
|
||||
f.write(sanitized)
|
||||
print(f"Sanitized: {target}")
|
||||
Reference in New Issue
Block a user