feat: Add complete Agentic Compaction & Pipeline System

- Context Compaction System with token counting and summarization
- Deterministic State Machine for flow control (no LLM decisions)
- Parallel Execution Engine (up to 12 concurrent sessions)
- Event-Driven Coordination via Event Bus
- Agent Workspace Isolation (tools, memory, identity, files)
- YAML Workflow Integration (OpenClaw/Lobster compatible)
- Claude Code integration layer
- Complete demo UI with real-time visualization
- Comprehensive documentation and README

Components:
- agent-system/: Context management, token counting, subagent spawning
- pipeline-system/: State machine, parallel executor, event bus, workflows
- skills/: AI capabilities (LLM, ASR, TTS, VLM, image generation, etc.)
- src/app/: Next.js demo application

Total: ~100KB of production-ready TypeScript code
This commit is contained in:
Z User
2026-03-03 12:40:47 +00:00
Unverified
parent 63a8b123c9
commit 2380d33861
152 changed files with 51569 additions and 817 deletions

30
skills/pdf/LICENSE.txt Executable file
View File

@@ -0,0 +1,30 @@
© 2025 Anthropic, PBC. All rights reserved.
LICENSE: Use of these materials (including all code, prompts, assets, files,
and other components of this Skill) is governed by your agreement with
Anthropic regarding use of Anthropic's services. If no separate agreement
exists, use is governed by Anthropic's Consumer Terms of Service or
Commercial Terms of Service, as applicable:
https://www.anthropic.com/legal/consumer-terms
https://www.anthropic.com/legal/commercial-terms
Your applicable agreement is referred to as the "Agreement." "Services" are
as defined in the Agreement.
ADDITIONAL RESTRICTIONS: Notwithstanding anything in the Agreement to the
contrary, users may not:
- Extract these materials from the Services or retain copies of these
materials outside the Services
- Reproduce or copy these materials, except for temporary copies created
automatically during authorized use of the Services
- Create derivative works based on these materials
- Distribute, sublicense, or transfer these materials to any third party
- Make, offer to sell, sell, or import any inventions embodied in these
materials
- Reverse engineer, decompile, or disassemble these materials
The receipt, viewing, or possession of these materials does not convey or
imply any license or right beyond those expressly granted above.
Anthropic retains all right, title, and interest in these materials,
including all copyrights, patents, and other intellectual property rights.

1534
skills/pdf/SKILL.md Executable file

File diff suppressed because it is too large Load Diff

205
skills/pdf/forms.md Executable file
View File

@@ -0,0 +1,205 @@
**CRITICAL: You MUST complete these steps in order. Do not skip ahead to writing code.**
If you need to fill out a PDF form, first check to see if the PDF has fillable form fields. Run this script from this file's directory:
`python scripts/check_fillable_fields <file.pdf>`, and depending on the result go to either the "Fillable fields" or "Non-fillable fields" and follow those instructions.
# Fillable fields
If the PDF has fillable form fields:
- Run this script from this file's directory: `python scripts/extract_form_field_info.py <input.pdf> <field_info.json>`. It will create a JSON file with a list of fields in this format:
```
[
{
"field_id": (unique ID for the field),
"page": (page number, 1-based),
"rect": ([left, bottom, right, top] bounding box in PDF coordinates, y=0 is the bottom of the page),
"type": ("text", "checkbox", "radio_group", or "choice"),
},
// Checkboxes have "checked_value" and "unchecked_value" properties:
{
"field_id": (unique ID for the field),
"page": (page number, 1-based),
"type": "checkbox",
"checked_value": (Set the field to this value to check the checkbox),
"unchecked_value": (Set the field to this value to uncheck the checkbox),
},
// Radio groups have a "radio_options" list with the possible choices.
{
"field_id": (unique ID for the field),
"page": (page number, 1-based),
"type": "radio_group",
"radio_options": [
{
"value": (set the field to this value to select this radio option),
"rect": (bounding box for the radio button for this option)
},
// Other radio options
]
},
// Multiple choice fields have a "choice_options" list with the possible choices:
{
"field_id": (unique ID for the field),
"page": (page number, 1-based),
"type": "choice",
"choice_options": [
{
"value": (set the field to this value to select this option),
"text": (display text of the option)
},
// Other choice options
],
}
]
```
- Convert the PDF to PNGs (one image for each page) with this script (run from this file's directory):
`python scripts/convert_pdf_to_images.py <file.pdf> <output_directory>`
Then analyze the images to determine the purpose of each form field (make sure to convert the bounding box PDF coordinates to image coordinates).
- Create a `field_values.json` file in this format with the values to be entered for each field:
```
[
{
"field_id": "last_name", // Must match the field_id from `extract_form_field_info.py`
"description": "The user's last name",
"page": 1, // Must match the "page" value in field_info.json
"value": "Simpson"
},
{
"field_id": "Checkbox12",
"description": "Checkbox to be checked if the user is 18 or over",
"page": 1,
"value": "/On" // If this is a checkbox, use its "checked_value" value to check it. If it's a radio button group, use one of the "value" values in "radio_options".
},
// more fields
]
```
- Run the `fill_fillable_fields.py` script from this file's directory to create a filled-in PDF:
`python scripts/fill_fillable_fields.py <input pdf> <field_values.json> <output pdf>`
This script will verify that the field IDs and values you provide are valid; if it prints error messages, correct the appropriate fields and try again.
# Non-fillable fields
If the PDF doesn't have fillable form fields, you'll need to visually determine where the data should be added and create text annotations. Follow the below steps *exactly*. You MUST perform all of these steps to ensure that the the form is accurately completed. Details for each step are below.
- Convert the PDF to PNG images and determine field bounding boxes.
- Create a JSON file with field information and validation images showing the bounding boxes.
- Validate the the bounding boxes.
- Use the bounding boxes to fill in the form.
## Step 1: Visual Analysis (REQUIRED)
- Convert the PDF to PNG images. Run this script from this file's directory:
`python scripts/convert_pdf_to_images.py <file.pdf> <output_directory>`
The script will create a PNG image for each page in the PDF.
- Carefully examine each PNG image and identify all form fields and areas where the user should enter data. For each form field where the user should enter text, determine bounding boxes for both the form field label, and the area where the user should enter text. The label and entry bounding boxes MUST NOT INTERSECT; the text entry box should only include the area where data should be entered. Usually this area will be immediately to the side, above, or below its label. Entry bounding boxes must be tall and wide enough to contain their text.
These are some examples of form structures that you might see:
*Label inside box*
```
┌────────────────────────┐
│ Name: │
└────────────────────────┘
```
The input area should be to the right of the "Name" label and extend to the edge of the box.
*Label before line*
```
Email: _______________________
```
The input area should be above the line and include its entire width.
*Label under line*
```
_________________________
Name
```
The input area should be above the line and include the entire width of the line. This is common for signature and date fields.
*Label above line*
```
Please enter any special requests:
________________________________________________
```
The input area should extend from the bottom of the label to the line, and should include the entire width of the line.
*Checkboxes*
```
Are you a US citizen? Yes □ No □
```
For checkboxes:
- Look for small square boxes (□) - these are the actual checkboxes to target. They may be to the left or right of their labels.
- Distinguish between label text ("Yes", "No") and the clickable checkbox squares.
- The entry bounding box should cover ONLY the small square, not the text label.
### Step 2: Create fields.json and validation images (REQUIRED)
- Create a file named `fields.json` with information for the form fields and bounding boxes in this format:
```
{
"pages": [
{
"page_number": 1,
"image_width": (first page image width in pixels),
"image_height": (first page image height in pixels),
},
{
"page_number": 2,
"image_width": (second page image width in pixels),
"image_height": (second page image height in pixels),
}
// additional pages
],
"form_fields": [
// Example for a text field.
{
"page_number": 1,
"description": "The user's last name should be entered here",
// Bounding boxes are [left, top, right, bottom]. The bounding boxes for the label and text entry should not overlap.
"field_label": "Last name",
"label_bounding_box": [30, 125, 95, 142],
"entry_bounding_box": [100, 125, 280, 142],
"entry_text": {
"text": "Johnson", // This text will be added as an annotation at the entry_bounding_box location
"font_size": 14, // optional, defaults to 14
"font_color": "000000", // optional, RRGGBB format, defaults to 000000 (black)
}
},
// Example for a checkbox. TARGET THE SQUARE for the entry bounding box, NOT THE TEXT
{
"page_number": 2,
"description": "Checkbox that should be checked if the user is over 18",
"entry_bounding_box": [140, 525, 155, 540], // Small box over checkbox square
"field_label": "Yes",
"label_bounding_box": [100, 525, 132, 540], // Box containing "Yes" text
// Use "X" to check a checkbox.
"entry_text": {
"text": "X",
}
}
// additional form field entries
]
}
```
Create validation images by running this script from this file's directory for each page:
`python scripts/create_validation_image.py <page_number> <path_to_fields.json> <input_image_path> <output_image_path>
The validation images will have red rectangles where text should be entered, and blue rectangles covering label text.
### Step 3: Validate Bounding Boxes (REQUIRED)
#### Automated intersection check
- Verify that none of bounding boxes intersect and that the entry bounding boxes are tall enough by checking the fields.json file with the `check_bounding_boxes.py` script (run from this file's directory):
`python scripts/check_bounding_boxes.py <JSON file>`
If there are errors, reanalyze the relevant fields, adjust the bounding boxes, and iterate until there are no remaining errors. Remember: label (blue) bounding boxes should contain text labels, entry (red) boxes should not.
#### Manual image inspection
**CRITICAL: Do not proceed without visually inspecting validation images**
- Red rectangles must ONLY cover input areas
- Red rectangles MUST NOT contain any text
- Blue rectangles should contain label text
- For checkboxes:
- Red rectangle MUST be centered on the checkbox square
- Blue rectangle should cover the text label for the checkbox
- If any rectangles look wrong, fix fields.json, regenerate the validation images, and verify again. Repeat this process until the bounding boxes are fully accurate.
### Step 4: Add annotations to the PDF
Run this script from this file's directory to create a filled-out PDF using the information in fields.json:
`python scripts/fill_pdf_form_with_annotations.py <input_pdf_path> <path_to_fields.json> <output_pdf_path>

765
skills/pdf/reference.md Executable file
View File

@@ -0,0 +1,765 @@
# PDF Processing Advanced Reference
This document contains advanced PDF processing features, detailed examples, and additional libraries not covered in the main skill instructions.
## pypdfium2 Library (Apache/BSD License)
### Overview
pypdfium2 is a Python binding for PDFium (Chromium's PDF library). It's excellent for fast PDF rendering, image generation, and serves as a PyMuPDF replacement.
### Render PDF to Images
```python
import pypdfium2 as pdfium
from PIL import Image
# Load PDF
pdf = pdfium.PdfDocument("document.pdf")
# Render page to image
page = pdf[0] # First page
bitmap = page.render(
scale=2.0, # Higher resolution
rotation=0 # No rotation
)
# Convert to PIL Image
img = bitmap.to_pil()
img.save("page_1.png", "PNG")
# Process multiple pages
for i, page in enumerate(pdf):
bitmap = page.render(scale=1.5)
img = bitmap.to_pil()
img.save(f"page_{i+1}.jpg", "JPEG", quality=90)
```
### Extract Text with pypdfium2
```python
import pypdfium2 as pdfium
pdf = pdfium.PdfDocument("document.pdf")
for i, page in enumerate(pdf):
text = page.get_text()
print(f"Page {i+1} text length: {len(text)} chars")
```
## JavaScript Libraries
### pdf-lib (MIT License)
pdf-lib is a powerful JavaScript library for creating and modifying PDF documents in any JavaScript environment.
#### Load and Manipulate Existing PDF
```javascript
import { PDFDocument } from 'pdf-lib';
import fs from 'fs';
async function manipulatePDF() {
// Load existing PDF
const existingPdfBytes = fs.readFileSync('input.pdf');
const pdfDoc = await PDFDocument.load(existingPdfBytes);
// Get page count
const pageCount = pdfDoc.getPageCount();
console.log(`Document has ${pageCount} pages`);
// Add new page
const newPage = pdfDoc.addPage([600, 400]);
newPage.drawText('Added by pdf-lib', {
x: 100,
y: 300,
size: 16
});
// Save modified PDF
const pdfBytes = await pdfDoc.save();
fs.writeFileSync('modified.pdf', pdfBytes);
}
```
#### Create Complex PDFs from Scratch
**Note**: This JavaScript example uses pdf-lib's built-in StandardFonts. For Python/reportlab, always use the six registered fonts defined in SKILL.md (SimHei, Microsoft YaHei, SarasaMonoSC, Times New Roman, Calibri, DejaVuSans).
```javascript
import { PDFDocument, rgb, StandardFonts } from 'pdf-lib';
import fs from 'fs';
async function createPDF() {
const pdfDoc = await PDFDocument.create();
// Add fonts
const helveticaFont = await pdfDoc.embedFont(StandardFonts.Helvetica);
const helveticaBold = await pdfDoc.embedFont(StandardFonts.HelveticaBold);
// Add page
const page = pdfDoc.addPage([595, 842]); // A4 size
const { width, height } = page.getSize();
// Add text with styling
page.drawText('Invoice #12345', {
x: 50,
y: height - 50,
size: 18,
font: helveticaBold,
color: rgb(0.2, 0.2, 0.8)
});
// Add rectangle (header background)
page.drawRectangle({
x: 40,
y: height - 100,
width: width - 80,
height: 30,
color: rgb(0.9, 0.9, 0.9)
});
// Add table-like content
const items = [
['Item', 'Qty', 'Price', 'Total'],
['Widget', '2', '$50', '$100'],
['Gadget', '1', '$75', '$75']
];
let yPos = height - 150;
items.forEach(row => {
let xPos = 50;
row.forEach(cell => {
page.drawText(cell, {
x: xPos,
y: yPos,
size: 12,
font: helveticaFont
});
xPos += 120;
});
yPos -= 25;
});
const pdfBytes = await pdfDoc.save();
fs.writeFileSync('created.pdf', pdfBytes);
}
```
#### Advanced Merge and Split Operations
```javascript
import { PDFDocument } from 'pdf-lib';
import fs from 'fs';
async function mergePDFs() {
// Create new document
const mergedPdf = await PDFDocument.create();
// Load source PDFs
const pdf1Bytes = fs.readFileSync('doc1.pdf');
const pdf2Bytes = fs.readFileSync('doc2.pdf');
const pdf1 = await PDFDocument.load(pdf1Bytes);
const pdf2 = await PDFDocument.load(pdf2Bytes);
// Copy pages from first PDF
const pdf1Pages = await mergedPdf.copyPages(pdf1, pdf1.getPageIndices());
pdf1Pages.forEach(page => mergedPdf.addPage(page));
// Copy specific pages from second PDF (pages 0, 2, 4)
const pdf2Pages = await mergedPdf.copyPages(pdf2, [0, 2, 4]);
pdf2Pages.forEach(page => mergedPdf.addPage(page));
const mergedPdfBytes = await mergedPdf.save();
fs.writeFileSync('merged.pdf', mergedPdfBytes);
}
```
### pdfjs-dist (Apache License)
PDF.js is Mozilla's JavaScript library for rendering PDFs in the browser.
#### Basic PDF Loading and Rendering
```javascript
import * as pdfjsLib from 'pdfjs-dist';
// Configure worker (important for performance)
pdfjsLib.GlobalWorkerOptions.workerSrc = './pdf.worker.js';
async function renderPDF() {
// Load PDF
const loadingTask = pdfjsLib.getDocument('document.pdf');
const pdf = await loadingTask.promise;
console.log(`Loaded PDF with ${pdf.numPages} pages`);
// Get first page
const page = await pdf.getPage(1);
const viewport = page.getViewport({ scale: 1.5 });
// Render to canvas
const canvas = document.createElement('canvas');
const context = canvas.getContext('2d');
canvas.height = viewport.height;
canvas.width = viewport.width;
const renderContext = {
canvasContext: context,
viewport: viewport
};
await page.render(renderContext).promise;
document.body.appendChild(canvas);
}
```
#### Extract Text with Coordinates
```javascript
import * as pdfjsLib from 'pdfjs-dist';
async function extractText() {
const loadingTask = pdfjsLib.getDocument('document.pdf');
const pdf = await loadingTask.promise;
let fullText = '';
// Extract text from all pages
for (let i = 1; i <= pdf.numPages; i++) {
const page = await pdf.getPage(i);
const textContent = await page.getTextContent();
const pageText = textContent.items
.map(item => item.str)
.join(' ');
fullText += `\n--- Page ${i} ---\n${pageText}`;
// Get text with coordinates for advanced processing
const textWithCoords = textContent.items.map(item => ({
text: item.str,
x: item.transform[4],
y: item.transform[5],
width: item.width,
height: item.height
}));
}
console.log(fullText);
return fullText;
}
```
#### Extract Annotations and Forms
```javascript
import * as pdfjsLib from 'pdfjs-dist';
async function extractAnnotations() {
const loadingTask = pdfjsLib.getDocument('annotated.pdf');
const pdf = await loadingTask.promise;
for (let i = 1; i <= pdf.numPages; i++) {
const page = await pdf.getPage(i);
const annotations = await page.getAnnotations();
annotations.forEach(annotation => {
console.log(`Annotation type: ${annotation.subtype}`);
console.log(`Content: ${annotation.contents}`);
console.log(`Coordinates: ${JSON.stringify(annotation.rect)}`);
});
}
}
```
## Advanced Command-Line Operations
### poppler-utils Advanced Features
#### Extract Text with Bounding Box Coordinates
```bash
# Extract text with bounding box coordinates (essential for structured data)
pdftotext -bbox-layout document.pdf output.xml
# The XML output contains precise coordinates for each text element
```
#### Advanced Image Conversion
```bash
# Convert to PNG images with specific resolution
pdftoppm -png -r 300 document.pdf output_prefix
# Convert specific page range with high resolution
pdftoppm -png -r 600 -f 1 -l 3 document.pdf high_res_pages
# Convert to JPEG with quality setting
pdftoppm -jpeg -jpegopt quality=85 -r 200 document.pdf jpeg_output
```
#### Extract Embedded Images
```bash
# Extract all embedded images with metadata
pdfimages -j -p document.pdf page_images
# List image info without extracting
pdfimages -list document.pdf
# Extract images in their original format
pdfimages -all document.pdf images/img
```
### qpdf Advanced Features
#### Complex Page Manipulation
```bash
# Split PDF into groups of pages
qpdf --split-pages=3 input.pdf output_group_%02d.pdf
# Extract specific pages with complex ranges
qpdf input.pdf --pages input.pdf 1,3-5,8,10-end -- extracted.pdf
# Merge specific pages from multiple PDFs
qpdf --empty --pages doc1.pdf 1-3 doc2.pdf 5-7 doc3.pdf 2,4 -- combined.pdf
```
#### PDF Optimization and Repair
```bash
# Optimize PDF for web (linearize for streaming)
qpdf --linearize input.pdf optimized.pdf
# Remove unused objects and compress
qpdf --optimize-level=all input.pdf compressed.pdf
# Attempt to repair corrupted PDF structure
qpdf --check input.pdf
qpdf --fix-qdf damaged.pdf repaired.pdf
# Show detailed PDF structure for debugging
qpdf --show-all-pages input.pdf > structure.txt
```
#### Advanced Encryption
```bash
# Add password protection with specific permissions
qpdf --encrypt user_pass owner_pass 256 --print=none --modify=none -- input.pdf encrypted.pdf
# Check encryption status
qpdf --show-encryption encrypted.pdf
# Remove password protection (requires password)
qpdf --password=secret123 --decrypt encrypted.pdf decrypted.pdf
```
## Advanced Python Techniques
### pdfplumber Advanced Features
#### Extract Text with Precise Coordinates
```python
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
page = pdf.pages[0]
# Extract all text with coordinates
chars = page.chars
for char in chars[:10]: # First 10 characters
print(f"Char: '{char['text']}' at x:{char['x0']:.1f} y:{char['y0']:.1f}")
# Extract text by bounding box (left, top, right, bottom)
bbox_text = page.within_bbox((100, 100, 400, 200)).extract_text()
```
#### Advanced Table Extraction with Custom Settings
```python
import pdfplumber
import pandas as pd
with pdfplumber.open("complex_table.pdf") as pdf:
page = pdf.pages[0]
# Extract tables with custom settings for complex layouts
table_settings = {
"vertical_strategy": "lines",
"horizontal_strategy": "lines",
"snap_tolerance": 3,
"intersection_tolerance": 15
}
tables = page.extract_tables(table_settings)
# Visual debugging for table extraction
img = page.to_image(resolution=150)
img.save("debug_layout.png")
```
### reportlab Advanced Features
#### Quick TOC Template (Copy-Paste Ready)
```python
from reportlab.lib.pagesizes import A4
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib import colors
from reportlab.lib.units import inch
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfbase.pdfmetrics import registerFontFamily
# Register fonts first
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')
# Setup
doc = SimpleDocTemplate("report.pdf", pagesize=A4,
leftMargin=0.75*inch, rightMargin=0.75*inch)
styles = getSampleStyleSheet()
# Configure heading style
styles['Heading1'].fontName = 'Times New Roman'
styles['Heading1'].textColor = colors.black # Titles must be black
story = []
# Calculate dimensions
page_width = A4[0]
available_width = page_width - 1.5*inch
page_num_width = 50 # Fixed width for page numbers (enough for 3-4 digits)
# Calculate dots: fill space from title to page number
dots_column_width = available_width - 200 - page_num_width # Reserve space for title + page
optimal_dot_count = int(dots_column_width / 4.5) # ~4.5pt per dot at 7pt font
# Define styles
toc_style = ParagraphStyle('TOCEntry', parent=styles['Normal'],
fontName='Times New Roman', fontSize=11, leading=16)
dots_style = ParagraphStyle('LeaderDots', parent=styles['Normal'],
fontName='Times New Roman', fontSize=7, leading=16) # Smaller font for more dots
# Build TOC (use Paragraph with <b></b> for bold heading)
toc_data = [
[Paragraph('<b>Table of Contents</b>', styles['Heading1']), '', ''],
['', '', ''],
]
entries = [('Section 1', '5'), ('Section 2', '10')]
for title, page in entries:
toc_data.append([
Paragraph(title, toc_style),
Paragraph('.' * optimal_dot_count, dots_style),
Paragraph(page, toc_style)
])
# Use None for title column (auto-expand), fixed for others
toc_table = Table(toc_data, colWidths=[None, dots_column_width, page_num_width])
toc_table.setStyle(TableStyle([
('GRID', (0, 0), (-1, -1), 0, colors.white),
('LINEBELOW', (0, 0), (0, 0), 1.5, colors.black),
('ALIGN', (0, 0), (0, -1), 'LEFT'),
('ALIGN', (1, 0), (1, -1), 'LEFT'),
('ALIGN', (2, 0), (2, -1), 'RIGHT'),
('VALIGN', (0, 0), (-1, -1), 'TOP'),
('LEFTPADDING', (0, 0), (-1, -1), 0),
('RIGHTPADDING', (0, 0), (-1, -1), 0),
('TOPPADDING', (0, 2), (-1, -1), 3),
('BOTTOMPADDING', (0, 2), (-1, -1), 3),
('TEXTCOLOR', (1, 2), (1, -1), colors.HexColor('#888888')),
]))
story.append(toc_table)
story.append(PageBreak())
doc.build(story)
```
#### Advanced: Table of Contents with Leader Dots
**Critical Rules for TOC with Leader Dots:**
1. **Three-column structure**: [Title, Dots, Page Number] for leader dot style
2. **Column width strategy**:
- Title: `None` (auto-expands to content)
- Dots: Calculated width = `available_width - 200 - 50` (reserves space for title + page)
- Page number: Fixed `50pt` (enough for 3-4 digit numbers, ensures right alignment)
3. **Dynamic dot count**: `int(dots_column_width / 4.5)` for 7pt font (adjust based on font size)
4. **Dot styling**: Small font (7-8pt) and gray color (#888888) for professional look
5. **Alignment sequence**: LEFT (title) → LEFT (dots flow from title) → RIGHT (page numbers)
6. **Zero padding**: Essential for seamless visual connection between columns
7. **Indentation**: Use leading spaces in title text for hierarchy (e.g., " 1.1 Subsection")
**MANDATORY STYLE REQUIREMENTS:**
- ✅ USE FIXED WIDTHS: Percentage-based widths are STRICTLY FORBIDDEN. You MUST use fixed values to guarantee alignment, especially for page numbers.
- ✅ DYNAMIC LEADER DOTS: Hard-coded dot counts are STRICTLY FORBIDDEN. You MUST calculate the number of dots dynamically based on the column width to prevent overflow or wrapping.
- ✅ MINIMUM COLUMN WIDTH: The page number column MUST be at least 40pt wide. Anything less will prevent proper right alignment.
- ✅ DOT FONT SIZE: Leader dot font size MUST NOT EXCEED 8pt. Larger sizes will ruin the dot density and are unacceptable.
- ✅ DOT ALIGNMENT: Dots MUST remain left-aligned to maintain the visual flow from the title. Right-aligning dots is forbidden.
- ✅ ZERO PADDING: Padding between columns MUST be set to exactly 0. Any gap will create a break in the dot line and is not allowed.
- ✅ USE PARAGRAPH OBJECTS: Bold text MUST be wrapped in a Paragraph() object like `Paragraph('<b>Text</b>', style)`. Using plain strings like `'<b>Text</b>'` is strictly STRICTLY FORBIDDEN as styles will not render.
#### CRITICAL: Table Cell Content Must Use Paragraph
**ALL text content in table cells MUST be wrapped in `Paragraph()` objects.** This is essential for:
- Rendering formatting tags (`<b>`, `<super>`, `<sub>`, `<i>`)
- Proper font application
- Correct text alignment within cells
- Consistent styling across the table
**The ONLY exception**: `Image()` objects can be placed directly in table cells without Paragraph wrapping.
```python
from reportlab.platypus import Table, TableStyle, Paragraph, Image
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib import colors
from reportlab.lib.enums import TA_CENTER, TA_LEFT, TA_RIGHT
# Define cell styles
header_style = ParagraphStyle(
name='TableHeader',
fontName='Times New Roman',
fontSize=11,
textColor=colors.white,
alignment=TA_CENTER
)
cell_style = ParagraphStyle(
name='TableCell',
fontName='Times New Roman',
fontSize=10,
textColor=colors.black,
alignment=TA_CENTER
)
# ✅ CORRECT: All text wrapped in Paragraph()
data = [
[
Paragraph('<b>Name</b>', header_style),
Paragraph('<b>Formula</b>', header_style),
Paragraph('<b>Value</b>', header_style)
],
[
Paragraph('Water', cell_style),
Paragraph('H<sub>2</sub>O', cell_style), # Subscript works
Paragraph('18.015 g/mol', cell_style)
],
[
Paragraph('Pressure', cell_style),
Paragraph('1.01 x 10<super>5</super> Pa', cell_style), # Superscript works
Paragraph('<b>Standard</b>', cell_style) # Bold works
]
]
# ❌ WRONG: Plain strings - NO formatting will render
# data = [
# ['<b>Name</b>', '<b>Formula</b>', '<b>Value</b>'], # Bold won't work!
# ['Water', 'H<sub>2</sub>O', '18.015 g/mol'], # Subscript won't work!
# ]
# Image exception - Image objects go directly, no Paragraph needed
# data_with_image = [
# [Paragraph('<b>Logo</b>', header_style), Paragraph('<b>Description</b>', header_style)],
# [Image('logo.png', width=50, height=50), Paragraph('Company logo', cell_style)],
# ]
table = Table(data, colWidths=[100, 150, 100])
table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
('VALIGN', (0, 0), (-1, -1), 'MIDDLE'),
]))
```
#### Debug Tips for Layout Issues
```python
from reportlab.platypus import HRFlowable
from reportlab.lib.colors import red
# Visualize spacing during development
story.append(table)
story.append(HRFlowable(width="100%", color=red, thickness=0.5, spaceBefore=0, spaceAfter=0))
story.append(Spacer(1, 6))
story.append(HRFlowable(width="100%", color=red, thickness=0.5, spaceBefore=0, spaceAfter=0))
story.append(caption)
# This creates visual markers to see actual spacing
```
## Complex Workflows
### Extract Figures/Images from PDF
#### Method 1: Using pdfimages (fastest)
```bash
# Extract all images with original quality
pdfimages -all document.pdf images/img
```
#### Method 2: Using pypdfium2 + Image Processing
```python
import pypdfium2 as pdfium
from PIL import Image
import numpy as np
def extract_figures(pdf_path, output_dir):
pdf = pdfium.PdfDocument(pdf_path)
for page_num, page in enumerate(pdf):
# Render high-resolution page
bitmap = page.render(scale=3.0)
img = bitmap.to_pil()
# Convert to numpy for processing
img_array = np.array(img)
# Simple figure detection (non-white regions)
mask = np.any(img_array != [255, 255, 255], axis=2)
# Find contours and extract bounding boxes
# (This is simplified - real implementation would need more sophisticated detection)
# Save detected figures
# ... implementation depends on specific needs
```
### Batch PDF Processing with Error Handling
```python
import os
import glob
from pypdf import PdfReader, PdfWriter
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def batch_process_pdfs(input_dir, operation='merge'):
pdf_files = glob.glob(os.path.join(input_dir, "*.pdf"))
if operation == 'merge':
writer = PdfWriter()
for pdf_file in pdf_files:
try:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
logger.info(f"Processed: {pdf_file}")
except Exception as e:
logger.error(f"Failed to process {pdf_file}: {e}")
continue
with open("batch_merged.pdf", "wb") as output:
writer.write(output)
elif operation == 'extract_text':
for pdf_file in pdf_files:
try:
reader = PdfReader(pdf_file)
text = ""
for page in reader.pages:
text += page.extract_text()
output_file = pdf_file.replace('.pdf', '.txt')
with open(output_file, 'w', encoding='utf-8') as f:
f.write(text)
logger.info(f"Extracted text from: {pdf_file}")
except Exception as e:
logger.error(f"Failed to extract text from {pdf_file}: {e}")
continue
```
### Advanced PDF Cropping
```python
from pypdf import PdfWriter, PdfReader
reader = PdfReader("input.pdf")
writer = PdfWriter()
# Crop page (left, bottom, right, top in points)
page = reader.pages[0]
page.mediabox.left = 50
page.mediabox.bottom = 50
page.mediabox.right = 550
page.mediabox.top = 750
writer.add_page(page)
with open("cropped.pdf", "wb") as output:
writer.write(output)
```
## Performance Optimization Tips
### 1. For Large PDFs
- Use streaming approaches instead of loading entire PDF in memory
- Use `qpdf --split-pages` for splitting large files
- Process pages individually with pypdfium2
### 2. For Text Extraction
- `pdftotext -bbox-layout` is fastest for plain text extraction
- Use pdfplumber for structured data and tables
- Avoid `pypdf.extract_text()` for very large documents
### 3. For Image Extraction
- `pdfimages` is much faster than rendering pages
- Use low resolution for previews, high resolution for final output
### 4. For Form Filling
- pdf-lib maintains form structure better than most alternatives
- Pre-validate form fields before processing
### 5. Memory Management
```python
# Process PDFs in chunks
def process_large_pdf(pdf_path, chunk_size=10):
reader = PdfReader(pdf_path)
total_pages = len(reader.pages)
for start_idx in range(0, total_pages, chunk_size):
end_idx = min(start_idx + chunk_size, total_pages)
writer = PdfWriter()
for i in range(start_idx, end_idx):
writer.add_page(reader.pages[i])
# Process chunk
with open(f"chunk_{start_idx//chunk_size}.pdf", "wb") as output:
writer.write(output)
```
## Troubleshooting Common Issues
### Encrypted PDFs
```python
# Handle password-protected PDFs
from pypdf import PdfReader
try:
reader = PdfReader("encrypted.pdf")
if reader.is_encrypted:
reader.decrypt("password")
except Exception as e:
print(f"Failed to decrypt: {e}")
```
### Corrupted PDFs
```bash
# Use qpdf to repair
qpdf --check corrupted.pdf
qpdf --replace-input corrupted.pdf
```
### Text Extraction Issues
```python
# Fallback to OCR for scanned PDFs
import pytesseract
from pdf2image import convert_from_path
def extract_text_with_ocr(pdf_path):
images = convert_from_path(pdf_path)
text = ""
for i, image in enumerate(images):
text += pytesseract.image_to_string(image)
return text
```
## License Information
- **pypdf**: BSD License
- **pdfplumber**: MIT License
- **pypdfium2**: Apache/BSD License
- **reportlab**: BSD License
- **poppler-utils**: GPL-2 License
- **qpdf**: Apache License
- **pdf-lib**: MIT License
- **pdfjs-dist**: Apache License

View File

@@ -0,0 +1,172 @@
#!/usr/bin/env python3
"""
Add Z.ai branding metadata to PDF documents.
This script adds Z.ai metadata (Author, Creator, Producer) to PDF files.
It can process single files or batch process multiple PDFs.
"""
import os
import sys
import argparse
from pypdf import PdfReader, PdfWriter
def add_zai_metadata(input_pdf_path, output_pdf_path=None, custom_title=None, verbose=True):
"""
Add Z.ai branding metadata to a PDF document.
Args:
input_pdf_path: Path to input PDF
output_pdf_path: Path to output PDF (default: overwrites input)
custom_title: Custom title to use (default: preserves original or uses filename)
verbose: Print status messages (default: True)
Sets:
- Author: Z.ai
- Creator: Z.ai
- Producer: http://z.ai
- Title: Custom title, original title, or filename (in that priority)
Returns:
Path to the output PDF file
"""
# Validate input file exists
if not os.path.exists(input_pdf_path):
print(f"Error: Input file not found: {input_pdf_path}", file=sys.stderr)
sys.exit(1)
# Read the PDF
try:
reader = PdfReader(input_pdf_path)
except Exception as e:
print(f"Error: Cannot open PDF: {e}", file=sys.stderr)
sys.exit(1)
writer = PdfWriter()
# Copy all pages
for page in reader.pages:
writer.add_page(page)
# Determine title
if custom_title:
title = custom_title
else:
original_meta = reader.metadata
if original_meta and original_meta.title and original_meta.title not in ['(anonymous)', 'unspecified', None]:
title = original_meta.title
else:
# Use filename without extension as title
title = os.path.splitext(os.path.basename(input_pdf_path))[0]
# Add Z.ai metadata
writer.add_metadata({
'/Title': title,
'/Author': 'Z.ai',
'/Creator': 'Z.ai',
'/Producer': 'http://z.ai',
})
# Write output
if output_pdf_path is None:
output_pdf_path = input_pdf_path
try:
with open(output_pdf_path, "wb") as output:
writer.write(output)
except Exception as e:
print(f"Error: Cannot write output file: {e}", file=sys.stderr)
sys.exit(1)
# Print status
if verbose:
print(f"✓ Updated metadata for: {os.path.basename(input_pdf_path)}")
print(f" Title: {title}")
print(f" Author: Z.ai")
print(f" Creator: Z.ai")
print(f" Producer: http://z.ai")
if output_pdf_path != input_pdf_path:
print(f" Output: {output_pdf_path}")
return output_pdf_path
def main():
"""Command-line interface for add_zai_metadata."""
parser = argparse.ArgumentParser(
description='Add Z.ai branding metadata to PDF documents',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Add metadata to a single PDF (in-place)
%(prog)s document.pdf
# Add metadata to a single PDF (create new file)
%(prog)s input.pdf -o output.pdf
# Add metadata with custom title
%(prog)s report.pdf -t "Q4 Financial Analysis"
# Batch process all PDFs in current directory
%(prog)s *.pdf
# Quiet mode (no output)
%(prog)s document.pdf -q
"""
)
parser.add_argument(
'input',
nargs='+',
help='Input PDF file(s) to process'
)
parser.add_argument(
'-o', '--output',
help='Output PDF path (only for single input file)'
)
parser.add_argument(
'-t', '--title',
help='Custom title for the PDF'
)
parser.add_argument(
'-q', '--quiet',
action='store_true',
help='Quiet mode (no status messages)'
)
args = parser.parse_args()
# Check if output is specified for multiple files
if args.output and len(args.input) > 1:
print("Error: --output can only be used with a single input file", file=sys.stderr)
sys.exit(1)
# Process each input file
for input_path in args.input:
# Determine output path
if len(args.input) == 1 and args.output:
output_path = args.output
else:
output_path = None # Overwrite in-place
# Determine title
if args.title:
custom_title = args.title
else:
custom_title = None
# Add metadata
add_zai_metadata(
input_path,
output_pdf_path=output_path,
custom_title=custom_title,
verbose=not args.quiet
)
if __name__ == '__main__':
main()

View File

@@ -0,0 +1,70 @@
from dataclasses import dataclass
import json
import sys
# Script to check that the `fields.json` file that GLM creates when analyzing PDFs
# does not have overlapping bounding boxes. See forms.md.
@dataclass
class RectAndField:
rect: list[float]
rect_type: str
field: dict
# Returns a list of messages that are printed to stdout for GLM to read.
def get_bounding_box_messages(fields_json_stream) -> list[str]:
messages = []
fields = json.load(fields_json_stream)
messages.append(f"Read {len(fields['form_fields'])} fields")
def rects_intersect(r1, r2):
disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0]
disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1]
return not (disjoint_horizontal or disjoint_vertical)
rects_and_fields = []
for f in fields["form_fields"]:
rects_and_fields.append(RectAndField(f["label_bounding_box"], "label", f))
rects_and_fields.append(RectAndField(f["entry_bounding_box"], "entry", f))
has_error = False
for i, ri in enumerate(rects_and_fields):
# This is O(N^2); we can optimize if it becomes a problem.
for j in range(i + 1, len(rects_and_fields)):
rj = rects_and_fields[j]
if ri.field["page_number"] == rj.field["page_number"] and rects_intersect(ri.rect, rj.rect):
has_error = True
if ri.field is rj.field:
messages.append(f"FAILURE: intersection between label and entry bounding boxes for `{ri.field['description']}` ({ri.rect}, {rj.rect})")
else:
messages.append(f"FAILURE: intersection between {ri.rect_type} bounding box for `{ri.field['description']}` ({ri.rect}) and {rj.rect_type} bounding box for `{rj.field['description']}` ({rj.rect})")
if len(messages) >= 20:
messages.append("Aborting further checks; fix bounding boxes and try again")
return messages
if ri.rect_type == "entry":
if "entry_text" in ri.field:
font_size = ri.field["entry_text"].get("font_size", 14)
entry_height = ri.rect[3] - ri.rect[1]
if entry_height < font_size:
has_error = True
messages.append(f"FAILURE: entry bounding box height ({entry_height}) for `{ri.field['description']}` is too short for the text content (font size: {font_size}). Increase the box height or decrease the font size.")
if len(messages) >= 20:
messages.append("Aborting further checks; fix bounding boxes and try again")
return messages
if not has_error:
messages.append("SUCCESS: All bounding boxes are valid")
return messages
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: check_bounding_boxes.py [fields.json]")
sys.exit(1)
# Input file should be in the `fields.json` format described in forms.md.
with open(sys.argv[1]) as f:
messages = get_bounding_box_messages(f)
for msg in messages:
print(msg)

View File

@@ -0,0 +1,226 @@
import unittest
import json
import io
from check_bounding_boxes import get_bounding_box_messages
# Currently this is not run automatically in CI; it's just for documentation and manual checking.
class TestGetBoundingBoxMessages(unittest.TestCase):
def create_json_stream(self, data):
"""Helper to create a JSON stream from data"""
return io.StringIO(json.dumps(data))
def test_no_intersections(self):
"""Test case with no bounding box intersections"""
data = {
"form_fields": [
{
"description": "Name",
"page_number": 1,
"label_bounding_box": [10, 10, 50, 30],
"entry_bounding_box": [60, 10, 150, 30]
},
{
"description": "Email",
"page_number": 1,
"label_bounding_box": [10, 40, 50, 60],
"entry_bounding_box": [60, 40, 150, 60]
}
]
}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
self.assertTrue(any("SUCCESS" in msg for msg in messages))
self.assertFalse(any("FAILURE" in msg for msg in messages))
def test_label_entry_intersection_same_field(self):
"""Test intersection between label and entry of the same field"""
data = {
"form_fields": [
{
"description": "Name",
"page_number": 1,
"label_bounding_box": [10, 10, 60, 30],
"entry_bounding_box": [50, 10, 150, 30] # Overlaps with label
}
]
}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
self.assertFalse(any("SUCCESS" in msg for msg in messages))
def test_intersection_between_different_fields(self):
"""Test intersection between bounding boxes of different fields"""
data = {
"form_fields": [
{
"description": "Name",
"page_number": 1,
"label_bounding_box": [10, 10, 50, 30],
"entry_bounding_box": [60, 10, 150, 30]
},
{
"description": "Email",
"page_number": 1,
"label_bounding_box": [40, 20, 80, 40], # Overlaps with Name's boxes
"entry_bounding_box": [160, 10, 250, 30]
}
]
}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
self.assertTrue(any("FAILURE" in msg and "intersection" in msg for msg in messages))
self.assertFalse(any("SUCCESS" in msg for msg in messages))
def test_different_pages_no_intersection(self):
"""Test that boxes on different pages don't count as intersecting"""
data = {
"form_fields": [
{
"description": "Name",
"page_number": 1,
"label_bounding_box": [10, 10, 50, 30],
"entry_bounding_box": [60, 10, 150, 30]
},
{
"description": "Email",
"page_number": 2,
"label_bounding_box": [10, 10, 50, 30], # Same coordinates but different page
"entry_bounding_box": [60, 10, 150, 30]
}
]
}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
self.assertTrue(any("SUCCESS" in msg for msg in messages))
self.assertFalse(any("FAILURE" in msg for msg in messages))
def test_entry_height_too_small(self):
"""Test that entry box height is checked against font size"""
data = {
"form_fields": [
{
"description": "Name",
"page_number": 1,
"label_bounding_box": [10, 10, 50, 30],
"entry_bounding_box": [60, 10, 150, 20], # Height is 10
"entry_text": {
"font_size": 14 # Font size larger than height
}
}
]
}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
self.assertFalse(any("SUCCESS" in msg for msg in messages))
def test_entry_height_adequate(self):
"""Test that adequate entry box height passes"""
data = {
"form_fields": [
{
"description": "Name",
"page_number": 1,
"label_bounding_box": [10, 10, 50, 30],
"entry_bounding_box": [60, 10, 150, 30], # Height is 20
"entry_text": {
"font_size": 14 # Font size smaller than height
}
}
]
}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
self.assertTrue(any("SUCCESS" in msg for msg in messages))
self.assertFalse(any("FAILURE" in msg for msg in messages))
def test_default_font_size(self):
"""Test that default font size is used when not specified"""
data = {
"form_fields": [
{
"description": "Name",
"page_number": 1,
"label_bounding_box": [10, 10, 50, 30],
"entry_bounding_box": [60, 10, 150, 20], # Height is 10
"entry_text": {} # No font_size specified, should use default 14
}
]
}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
self.assertTrue(any("FAILURE" in msg and "height" in msg for msg in messages))
self.assertFalse(any("SUCCESS" in msg for msg in messages))
def test_no_entry_text(self):
"""Test that missing entry_text doesn't cause height check"""
data = {
"form_fields": [
{
"description": "Name",
"page_number": 1,
"label_bounding_box": [10, 10, 50, 30],
"entry_bounding_box": [60, 10, 150, 20] # Small height but no entry_text
}
]
}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
self.assertTrue(any("SUCCESS" in msg for msg in messages))
self.assertFalse(any("FAILURE" in msg for msg in messages))
def test_multiple_errors_limit(self):
"""Test that error messages are limited to prevent excessive output"""
fields = []
# Create many overlapping fields
for i in range(25):
fields.append({
"description": f"Field{i}",
"page_number": 1,
"label_bounding_box": [10, 10, 50, 30], # All overlap
"entry_bounding_box": [20, 15, 60, 35] # All overlap
})
data = {"form_fields": fields}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
# Should abort after ~20 messages
self.assertTrue(any("Aborting" in msg for msg in messages))
# Should have some FAILURE messages but not hundreds
failure_count = sum(1 for msg in messages if "FAILURE" in msg)
self.assertGreater(failure_count, 0)
self.assertLess(len(messages), 30) # Should be limited
def test_edge_touching_boxes(self):
"""Test that boxes touching at edges don't count as intersecting"""
data = {
"form_fields": [
{
"description": "Name",
"page_number": 1,
"label_bounding_box": [10, 10, 50, 30],
"entry_bounding_box": [50, 10, 150, 30] # Touches at x=50
}
]
}
stream = self.create_json_stream(data)
messages = get_bounding_box_messages(stream)
self.assertTrue(any("SUCCESS" in msg for msg in messages))
self.assertFalse(any("FAILURE" in msg for msg in messages))
if __name__ == '__main__':
unittest.main()

View File

@@ -0,0 +1,12 @@
import sys
from pypdf import PdfReader
# Script for GLM to run to determine whether a PDF has fillable form fields. See forms.md.
reader = PdfReader(sys.argv[1])
if (reader.get_fields()):
print("This PDF has fillable form fields")
else:
print("This PDF does not have fillable form fields; you will need to visually determine where to enter data")

View File

@@ -0,0 +1,35 @@
import os
import sys
from pdf2image import convert_from_path
# Converts each page of a PDF to a PNG image.
def convert(pdf_path, output_dir, max_dim=1000):
images = convert_from_path(pdf_path, dpi=200)
for i, image in enumerate(images):
# Scale image if needed to keep width/height under `max_dim`
width, height = image.size
if width > max_dim or height > max_dim:
scale_factor = min(max_dim / width, max_dim / height)
new_width = int(width * scale_factor)
new_height = int(height * scale_factor)
image = image.resize((new_width, new_height))
image_path = os.path.join(output_dir, f"page_{i+1}.png")
image.save(image_path)
print(f"Saved page {i+1} as {image_path} (size: {image.size})")
print(f"Converted {len(images)} pages to PNG images")
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: convert_pdf_to_images.py [input pdf] [output directory]")
sys.exit(1)
pdf_path = sys.argv[1]
output_directory = sys.argv[2]
convert(pdf_path, output_directory)

View File

@@ -0,0 +1,41 @@
import json
import sys
from PIL import Image, ImageDraw
# Creates "validation" images with rectangles for the bounding box information that
# GLM creates when determining where to add text annotations in PDFs. See forms.md.
def create_validation_image(page_number, fields_json_path, input_path, output_path):
# Input file should be in the `fields.json` format described in forms.md.
with open(fields_json_path, 'r') as f:
data = json.load(f)
img = Image.open(input_path)
draw = ImageDraw.Draw(img)
num_boxes = 0
for field in data["form_fields"]:
if field["page_number"] == page_number:
entry_box = field['entry_bounding_box']
label_box = field['label_bounding_box']
# Draw red rectangle over entry bounding box and blue rectangle over the label.
draw.rectangle(entry_box, outline='red', width=2)
draw.rectangle(label_box, outline='blue', width=2)
num_boxes += 2
img.save(output_path)
print(f"Created validation image at {output_path} with {num_boxes} bounding boxes")
if __name__ == "__main__":
if len(sys.argv) != 5:
print("Usage: create_validation_image.py [page number] [fields.json file] [input image path] [output image path]")
sys.exit(1)
page_number = int(sys.argv[1])
fields_json_path = sys.argv[2]
input_image_path = sys.argv[3]
output_image_path = sys.argv[4]
create_validation_image(page_number, fields_json_path, input_image_path, output_image_path)

View File

@@ -0,0 +1,152 @@
import json
import sys
from pypdf import PdfReader
# Extracts data for the fillable form fields in a PDF and outputs JSON that
# GLM uses to fill the fields. See forms.md.
# This matches the format used by PdfReader `get_fields` and `update_page_form_field_values` methods.
def get_full_annotation_field_id(annotation):
components = []
while annotation:
field_name = annotation.get('/T')
if field_name:
components.append(field_name)
annotation = annotation.get('/Parent')
return ".".join(reversed(components)) if components else None
def make_field_dict(field, field_id):
field_dict = {"field_id": field_id}
ft = field.get('/FT')
if ft == "/Tx":
field_dict["type"] = "text"
elif ft == "/Btn":
field_dict["type"] = "checkbox" # radio groups handled separately
states = field.get("/_States_", [])
if len(states) == 2:
# "/Off" seems to always be the unchecked value, as suggested by
# https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#page=448
# It can be either first or second in the "/_States_" list.
if "/Off" in states:
field_dict["checked_value"] = states[0] if states[0] != "/Off" else states[1]
field_dict["unchecked_value"] = "/Off"
else:
print(f"Unexpected state values for checkbox `${field_id}`. Its checked and unchecked values may not be correct; if you're trying to check it, visually verify the results.")
field_dict["checked_value"] = states[0]
field_dict["unchecked_value"] = states[1]
elif ft == "/Ch":
field_dict["type"] = "choice"
states = field.get("/_States_", [])
field_dict["choice_options"] = [{
"value": state[0],
"text": state[1],
} for state in states]
else:
field_dict["type"] = f"unknown ({ft})"
return field_dict
# Returns a list of fillable PDF fields:
# [
# {
# "field_id": "name",
# "page": 1,
# "type": ("text", "checkbox", "radio_group", or "choice")
# // Per-type additional fields described in forms.md
# },
# ]
def get_field_info(reader: PdfReader):
fields = reader.get_fields()
field_info_by_id = {}
possible_radio_names = set()
for field_id, field in fields.items():
# Skip if this is a container field with children, except that it might be
# a parent group for radio button options.
if field.get("/Kids"):
if field.get("/FT") == "/Btn":
possible_radio_names.add(field_id)
continue
field_info_by_id[field_id] = make_field_dict(field, field_id)
# Bounding rects are stored in annotations in page objects.
# Radio button options have a separate annotation for each choice;
# all choices have the same field name.
# See https://westhealth.github.io/exploring-fillable-forms-with-pdfrw.html
radio_fields_by_id = {}
for page_index, page in enumerate(reader.pages):
annotations = page.get('/Annots', [])
for ann in annotations:
field_id = get_full_annotation_field_id(ann)
if field_id in field_info_by_id:
field_info_by_id[field_id]["page"] = page_index + 1
field_info_by_id[field_id]["rect"] = ann.get('/Rect')
elif field_id in possible_radio_names:
try:
# ann['/AP']['/N'] should have two items. One of them is '/Off',
# the other is the active value.
on_values = [v for v in ann["/AP"]["/N"] if v != "/Off"]
except KeyError:
continue
if len(on_values) == 1:
rect = ann.get("/Rect")
if field_id not in radio_fields_by_id:
radio_fields_by_id[field_id] = {
"field_id": field_id,
"type": "radio_group",
"page": page_index + 1,
"radio_options": [],
}
# Note: at least on macOS 15.7, Preview.app doesn't show selected
# radio buttons correctly. (It does if you remove the leading slash
# from the value, but that causes them not to appear correctly in
# Chrome/Firefox/Acrobat/etc).
radio_fields_by_id[field_id]["radio_options"].append({
"value": on_values[0],
"rect": rect,
})
# Some PDFs have form field definitions without corresponding annotations,
# so we can't tell where they are. Ignore these fields for now.
fields_with_location = []
for field_info in field_info_by_id.values():
if "page" in field_info:
fields_with_location.append(field_info)
else:
print(f"Unable to determine location for field id: {field_info.get('field_id')}, ignoring")
# Sort by page number, then Y position (flipped in PDF coordinate system), then X.
def sort_key(f):
if "radio_options" in f:
rect = f["radio_options"][0]["rect"] or [0, 0, 0, 0]
else:
rect = f.get("rect") or [0, 0, 0, 0]
adjusted_position = [-rect[1], rect[0]]
return [f.get("page"), adjusted_position]
sorted_fields = fields_with_location + list(radio_fields_by_id.values())
sorted_fields.sort(key=sort_key)
return sorted_fields
def write_field_info(pdf_path: str, json_output_path: str):
reader = PdfReader(pdf_path)
field_info = get_field_info(reader)
with open(json_output_path, "w") as f:
json.dump(field_info, f, indent=2)
print(f"Wrote {len(field_info)} fields to {json_output_path}")
if __name__ == "__main__":
if len(sys.argv) != 3:
print("Usage: extract_form_field_info.py [input pdf] [output json]")
sys.exit(1)
write_field_info(sys.argv[1], sys.argv[2])

View File

@@ -0,0 +1,114 @@
import json
import sys
from pypdf import PdfReader, PdfWriter
from extract_form_field_info import get_field_info
# Fills fillable form fields in a PDF. See forms.md.
def fill_pdf_fields(input_pdf_path: str, fields_json_path: str, output_pdf_path: str):
with open(fields_json_path) as f:
fields = json.load(f)
# Group by page number.
fields_by_page = {}
for field in fields:
if "value" in field:
field_id = field["field_id"]
page = field["page"]
if page not in fields_by_page:
fields_by_page[page] = {}
fields_by_page[page][field_id] = field["value"]
reader = PdfReader(input_pdf_path)
has_error = False
field_info = get_field_info(reader)
fields_by_ids = {f["field_id"]: f for f in field_info}
for field in fields:
existing_field = fields_by_ids.get(field["field_id"])
if not existing_field:
has_error = True
print(f"ERROR: `{field['field_id']}` is not a valid field ID")
elif field["page"] != existing_field["page"]:
has_error = True
print(f"ERROR: Incorrect page number for `{field['field_id']}` (got {field['page']}, expected {existing_field['page']})")
else:
if "value" in field:
err = validation_error_for_field_value(existing_field, field["value"])
if err:
print(err)
has_error = True
if has_error:
sys.exit(1)
writer = PdfWriter(clone_from=reader)
for page, field_values in fields_by_page.items():
writer.update_page_form_field_values(writer.pages[page - 1], field_values, auto_regenerate=False)
# This seems to be necessary for many PDF viewers to format the form values correctly.
# It may cause the viewer to show a "save changes" dialog even if the user doesn't make any changes.
writer.set_need_appearances_writer(True)
with open(output_pdf_path, "wb") as f:
writer.write(f)
def validation_error_for_field_value(field_info, field_value):
field_type = field_info["type"]
field_id = field_info["field_id"]
if field_type == "checkbox":
checked_val = field_info["checked_value"]
unchecked_val = field_info["unchecked_value"]
if field_value != checked_val and field_value != unchecked_val:
return f'ERROR: Invalid value "{field_value}" for checkbox field "{field_id}". The checked value is "{checked_val}" and the unchecked value is "{unchecked_val}"'
elif field_type == "radio_group":
option_values = [opt["value"] for opt in field_info["radio_options"]]
if field_value not in option_values:
return f'ERROR: Invalid value "{field_value}" for radio group field "{field_id}". Valid values are: {option_values}'
elif field_type == "choice":
choice_values = [opt["value"] for opt in field_info["choice_options"]]
if field_value not in choice_values:
return f'ERROR: Invalid value "{field_value}" for choice field "{field_id}". Valid values are: {choice_values}'
return None
# pypdf (at least version 5.7.0) has a bug when setting the value for a selection list field.
# In _writer.py around line 966:
#
# if field.get(FA.FT, "/Tx") == "/Ch" and field_flags & FA.FfBits.Combo == 0:
# txt = "\n".join(annotation.get_inherited(FA.Opt, []))
#
# The problem is that for selection lists, `get_inherited` returns a list of two-element lists like
# [["value1", "Text 1"], ["value2", "Text 2"], ...]
# This causes `join` to throw a TypeError because it expects an iterable of strings.
# The horrible workaround is to patch `get_inherited` to return a list of the value strings.
# We call the original method and adjust the return value only if the argument to `get_inherited`
# is `FA.Opt` and if the return value is a list of two-element lists.
def monkeypatch_pydpf_method():
from pypdf.generic import DictionaryObject
from pypdf.constants import FieldDictionaryAttributes
original_get_inherited = DictionaryObject.get_inherited
def patched_get_inherited(self, key: str, default = None):
result = original_get_inherited(self, key, default)
if key == FieldDictionaryAttributes.Opt:
if isinstance(result, list) and all(isinstance(v, list) and len(v) == 2 for v in result):
result = [r[0] for r in result]
return result
DictionaryObject.get_inherited = patched_get_inherited
if __name__ == "__main__":
if len(sys.argv) != 4:
print("Usage: fill_fillable_fields.py [input pdf] [field_values.json] [output pdf]")
sys.exit(1)
monkeypatch_pydpf_method()
input_pdf = sys.argv[1]
fields_json = sys.argv[2]
output_pdf = sys.argv[3]
fill_pdf_fields(input_pdf, fields_json, output_pdf)

View File

@@ -0,0 +1,108 @@
import json
import sys
from pypdf import PdfReader, PdfWriter
from pypdf.annotations import FreeText
# Fills a PDF by adding text annotations defined in `fields.json`. See forms.md.
def transform_coordinates(bbox, image_width, image_height, pdf_width, pdf_height):
"""Transform bounding box from image coordinates to PDF coordinates"""
# Image coordinates: origin at top-left, y increases downward
# PDF coordinates: origin at bottom-left, y increases upward
x_scale = pdf_width / image_width
y_scale = pdf_height / image_height
left = bbox[0] * x_scale
right = bbox[2] * x_scale
# Flip Y coordinates for PDF
top = pdf_height - (bbox[1] * y_scale)
bottom = pdf_height - (bbox[3] * y_scale)
return left, bottom, right, top
def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path):
"""Fill the PDF form with data from fields.json"""
# `fields.json` format described in forms.md.
with open(fields_json_path, "r") as f:
fields_data = json.load(f)
# Open the PDF
reader = PdfReader(input_pdf_path)
writer = PdfWriter()
# Copy all pages to writer
writer.append(reader)
# Get PDF dimensions for each page
pdf_dimensions = {}
for i, page in enumerate(reader.pages):
mediabox = page.mediabox
pdf_dimensions[i + 1] = [mediabox.width, mediabox.height]
# Process each form field
annotations = []
for field in fields_data["form_fields"]:
page_num = field["page_number"]
# Get page dimensions and transform coordinates.
page_info = next(p for p in fields_data["pages"] if p["page_number"] == page_num)
image_width = page_info["image_width"]
image_height = page_info["image_height"]
pdf_width, pdf_height = pdf_dimensions[page_num]
transformed_entry_box = transform_coordinates(
field["entry_bounding_box"],
image_width, image_height,
pdf_width, pdf_height
)
# Skip empty fields
if "entry_text" not in field or "text" not in field["entry_text"]:
continue
entry_text = field["entry_text"]
text = entry_text["text"]
if not text:
continue
font_name = entry_text.get("font", "Arial")
font_size = str(entry_text.get("font_size", 14)) + "pt"
font_color = entry_text.get("font_color", "000000")
# Font size/color seems to not work reliably across viewers:
# https://github.com/py-pdf/pypdf/issues/2084
annotation = FreeText(
text=text,
rect=transformed_entry_box,
font=font_name,
font_size=font_size,
font_color=font_color,
border_color=None,
background_color=None,
)
annotations.append(annotation)
# page_number is 0-based for pypdf
writer.add_annotation(page_number=page_num - 1, annotation=annotation)
# Save the filled PDF
with open(output_pdf_path, "wb") as output:
writer.write(output)
print(f"Successfully filled PDF form and saved to {output_pdf_path}")
print(f"Added {len(annotations)} text annotations")
if __name__ == "__main__":
if len(sys.argv) != 4:
print("Usage: fill_pdf_form_with_annotations.py [input pdf] [fields.json] [output pdf]")
sys.exit(1)
input_pdf = sys.argv[1]
fields_json = sys.argv[2]
output_pdf = sys.argv[3]
fill_pdf_form(input_pdf, fields_json, output_pdf)

View File

@@ -0,0 +1,110 @@
import re
import html
import sys
from typing import Dict
# ---------- Step 0: restore literal unicode escapes/entities to real chars ----------
_RE_UNICODE_ESC = re.compile(r"(\\u[0-9a-fA-F]{4})|(\\U[0-9a-fA-F]{8})|(\\x[0-9a-fA-F]{2})")
def _restore_escapes(s: str) -> str:
# HTML entities: &#179; &#x2264; &alpha; ...
s = html.unescape(s)
# Literal backslash escapes: "\\u00B3" -> "³"
def _dec(m: re.Match) -> str:
esc = m.group(0)
try:
if esc.startswith("\\u") or esc.startswith("\\U"):
return chr(int(esc[2:], 16))
if esc.startswith("\\x"):
return chr(int(esc[2:], 16))
except Exception:
return esc
return esc
return _RE_UNICODE_ESC.sub(_dec, s)
# ---------- Step 1: superscripts/subscripts -> <super>/<sub> ----------
_SUPERSCRIPT_MAP: Dict[str, str] = {
"": "0", "¹": "1", "²": "2", "³": "3", "": "4",
"": "5", "": "6", "": "7", "": "8", "": "9",
"": "+", "": "-", "": "=", "": "(", "": ")",
"": "n", "": "i",
}
_SUBSCRIPT_MAP: Dict[str, str] = {
"": "0", "": "1", "": "2", "": "3", "": "4",
"": "5", "": "6", "": "7", "": "8", "": "9",
"": "+", "": "-", "": "=", "": "(", "": ")",
"": "a", "": "e", "": "h", "": "i", "": "j",
"": "k", "": "l", "": "m", "": "n", "": "o",
"": "p", "": "r", "": "s", "": "t", "": "u",
"": "v", "": "x",
}
def _replace_super_sub(s: str) -> str:
out = []
for ch in s:
if ch in _SUPERSCRIPT_MAP:
out.append(f"<super>{_SUPERSCRIPT_MAP[ch]}</super>")
elif ch in _SUBSCRIPT_MAP:
out.append(f"<sub>{_SUBSCRIPT_MAP[ch]}</sub>")
else:
out.append(ch)
return "".join(out)
# ---------- Step 2: symbol fallback for SimHei (protect tags, then replace) ----------
_SYMBOL_FALLBACK: Dict[str, str] = {
# Currently empty - enable entries as needed for fonts missing specific glyphs
# "±": "+/-",
# "×": "*",
# "÷": "/",
# "≤": "<=",
# "≥": ">=",
# "≠": "!=",
# "≈": "~=",
# "∞": "inf",
}
def _fallback_symbols(s: str) -> str:
# Protect <super>/<sub> tags from being modified
placeholders = {}
def _protect_tag(m: re.Match) -> str:
key = f"@@TAG{len(placeholders)}@@"
placeholders[key] = m.group(0)
return key
protected = re.sub(r"</?super>|</?sub>", _protect_tag, s)
# Replace symbols
protected = "".join(_SYMBOL_FALLBACK.get(ch, ch) for ch in protected)
# Restore tags
for k, v in placeholders.items():
protected = protected.replace(k, v)
return protected
def sanitize_code(text: str) -> str:
"""
Full sanitization pipeline for PDF generation code.
- Restore unicode escapes/entities to real characters
- Replace superscript/subscript unicode with <super>/<sub>
- Replace other risky symbols with ASCII/text fallbacks
"""
s = _restore_escapes(text)
s = _replace_super_sub(s)
s = _fallback_symbols(s)
return s
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python sanitize_code.py <target_script.py>")
sys.exit(1)
target = sys.argv[1]
with open(target, "r", encoding="utf-8") as f:
code = f.read()
sanitized = sanitize_code(code)
with open(target, "w", encoding="utf-8") as f:
f.write(sanitized)
print(f"Sanitized: {target}")