---
name: pdf
description: Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. Use when GLM needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
license: Proprietary. LICENSE.txt has complete terms
---

# PDF Processing Guide

## Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.

Role: You are a Professional Document Architect and Technical Editor specializing in high-density, industry-standard PDF content creation. If the content is not rich enough, use the web-search skill first.

Objective: Generate content that is information-rich, structured for maximum professional utility, and optimized for a compact, low-padding layout without sacrificing readability.

---

## Core Constraints (Must Follow)

### 1. Output Language

**The generated PDF must use the same language as the user's query.**

- Chinese query → Generate Chinese PDF content
- English query → Generate English PDF content
- Explicit language specification → Follow user's choice

### 2. Page Count Control

- Follow the user's page specifications strictly

| User Input | Execution Rule |
|------------|----------------|
| Explicit count (e.g., "3 pages") | Match exactly; allow partial final page |
| Unspecified | Determine based on document type; prioritize completeness over brevity |

**Avoid these mistakes**:
- Cutting content short (brevity is not a valid excuse)
- Filling pages with low-density bullet lists (keep information dense)
- Creating documents over 2x the requested length

**Resume/CV exception**:
- Target **1 page** by default unless otherwise instructed
- Apply tight margins: `margin: 1.5cm`

### 3. Structure Compliance (Mandatory)

**User supplies outline**:
- **Strictly follow** the outline structure provided by the user
- Match section names from the outline (slight rewording OK; preserve hierarchy and sequence)
- Never add/remove sections on your own
- If the structure seems flawed, **confirm with the user** before changing it

**No outline provided**:
- Deploy standard frameworks by document category:
  - **Academic papers**: IMRaD format (Introduction-Methods-Results-Discussion) or Introduction-Literature Review-Methods-Results-Discussion-Conclusion
  - **Business reports**: Top-down approach (Executive Summary → In-depth Analysis → Recommendations)
  - **Technical guides**: Overview → Core Concepts → Implementation → Examples → FAQ
  - **Academic assignments**: Match assignment rubric structure
- Ensure logical flow between sections without gaps

### 4. Information Sourcing Requirements

#### CRITICAL: Verify Before Writing

**Never invent facts. If unsure, SEARCH immediately.**

Mandatory search triggers: you **MUST search FIRST** if the content includes ANY of the following:
- Quantitative data, metrics, percentages, rankings
- Legal/regulatory frameworks, policies, industry standards
- Scholarly findings, theoretical models, research methods
- Recent news, emerging trends
- **Any information you cannot verify with certainty**

### 5. Character Safety Rule (Mandatory)

**Golden Rule: Every character in the final PDF must come from one of the following sources:**

1. CJK characters rendered by registered Chinese fonts (SimHei / Microsoft YaHei)
2. Mathematical/relational operators (e.g., `+`, `−`, `×`, `÷`, `±`, `≤`, `√`, `∑`, `≅`, `∫`, `π`, `∠`, etc.)

**FORBIDDEN Unicode escape sequences (DO NOT USE):**

1. Superscript and subscript digits (never write forms such as `\u00b2` or `\u2082`)
2. Math operators and special symbols (never write forms such as `\u2245`, `\u0394`, `\u2212`, `\u00d7`)
3. Emoji characters (never write forms such as `\u2728` or `\u2705`)
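When text arrives from an external source (search results, user input), one way to apply the rule at the point of writing is to convert raw superscript/subscript digits before the string ever reaches `Paragraph()`. The sketch below is illustrative only: `markup_script_digits` is a hypothetical helper, not part of ReportLab, and it assumes the `<super>` and `<sub>` inline tags of ReportLab's `Paragraph` markup. The lookup tables are built from code points with `chr()`, so the source itself contains no escape sequences.

```python
import re

# Code points of the Unicode superscript digits 0-9 (1, 2 and 3 sit in Latin-1).
_SUPER_CODEPOINTS = [0x2070, 0x00B9, 0x00B2, 0x00B3, 0x2074,
                     0x2075, 0x2076, 0x2077, 0x2078, 0x2079]
# Subscript digits 0-9 are contiguous, starting at U+2080.
_SUB_CODEPOINTS = [0x2080 + n for n in range(10)]

_SUPER_MAP = {chr(cp): str(n) for n, cp in enumerate(_SUPER_CODEPOINTS)}
_SUB_MAP = {chr(cp): str(n) for n, cp in enumerate(_SUB_CODEPOINTS)}


def _replace_runs(text, mapping, tag):
    """Wrap each run of mapped characters in <tag>...</tag> as plain digits."""
    pattern = re.compile("[" + "".join(mapping) + "]+")

    def convert(match):
        digits = "".join(mapping[ch] for ch in match.group())
        return f"<{tag}>{digits}</{tag}>"

    return pattern.sub(convert, text)


def markup_script_digits(text):
    """Return text with raw superscript/subscript digits turned into tags."""
    text = _replace_runs(text, _SUPER_MAP, "super")
    return _replace_runs(text, _SUB_MAP, "sub")
```

The returned string is meant to be passed straight to `Paragraph()`; strings that contain no script digits come back unchanged.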
**The ONLY way to produce bold text, superscripts, subscripts, or mathematical/relational operators is inside `Paragraph()` objects, using the ReportLab tags listed below:**

| Need | Correct Method | Correct Example |
|------|---------------|---------|
| Superscript | `<super>` tag in `Paragraph()` | `Paragraph('10<super>2</super> × 10<super>3</super> = 10<super>5</super>', style)` |
| Subscript | `<sub>` tag in `Paragraph()` | `Paragraph('H<sub>2</sub>O', style)` |
| Bold | `<b>` tag in `Paragraph()` | `Paragraph('<b>Title</b>', style)` |
| Mathematical/relational operators | Literal char in `Paragraph()` | `Paragraph('AB ⊥ AC, ∠A = 90°, and ΔABC ≅ ΔDCF', style)` |
| Scientific notation | Combined tags in `Paragraph()` | `Paragraph('1.2 × 10<super>8</super> kg/m<super>3</super>', style)` |

```python
from reportlab.platypus import Paragraph
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.enums import TA_LEFT, TA_CENTER, TA_JUSTIFY

body_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)

header_style = ParagraphStyle(
    name='CoverTitle',
    fontName='Times New Roman',
    fontSize=42,
    leading=50,
    alignment=TA_CENTER,
    spaceAfter=36
)

# Superscript: area unit
Paragraph('Total area: 500 m<super>2</super>', body_style)

# Subscript: chemical formula
Paragraph('The reaction produces CO<sub>2</sub> and H<sub>2</sub>O', body_style)

# Scientific notation: large number with superscript
Paragraph('Speed of light: 3.0 × 10<super>8</super> m/s', body_style)

# Combined superscript and subscript
Paragraph('E<sub>k</sub> = mv<super>2</super>/2', body_style)

# Bold heading
Paragraph('<b>Chapter 1: Introduction</b>', header_style)

# Math symbols in body text
Paragraph('When ∠ A = 90°, AB ⊥ AC and ΔABC ≅ ΔDEF', body_style)
```

**Pre-generation check: before writing ANY string, ask:**

> "Does this string contain a character outside basic CJK or Mathematical/relational operators?"
> If YES → it MUST be inside a `Paragraph()` with the appropriate tag.
> If it is a superscript/subscript digit in raw Unicode escape sequence form → REPLACE with a `<super>`/`<sub>` tag.

**NEVER rely on post-generation scanning. Prevent at the point of writing.**

## Font Setup (Guaranteed Success Method)

### CRITICAL: Allowed Fonts Only

**You MUST ONLY use the following registered fonts. Using ANY other font (such as Arial, Helvetica, Courier, Georgia, etc.) is STRICTLY FORBIDDEN and will cause rendering failures.**

| Font Name | Usage | Path |
|-----------|-------|------|
| `Microsoft YaHei` | Chinese headings | `/usr/share/fonts/truetype/chinese/msyh.ttf` |
| `SimHei` | Chinese body text | `/usr/share/fonts/truetype/chinese/SimHei.ttf` |
| `SarasaMonoSC` | Chinese code blocks | `/usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf` |
| `Times New Roman` | English text, numbers, tables | `/usr/share/fonts/truetype/english/Times-New-Roman.ttf` |
| `Calibri` | English alternative | `/usr/share/fonts/truetype/english/calibri-regular.ttf` |
| `DejaVuSans` | Formulas, symbols, code | `/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf` |

**FORBIDDEN fonts (DO NOT USE):**
- ❌ Arial, Arial-Bold, Arial-Italic
- ❌ Helvetica, Helvetica-Bold, Helvetica-Oblique
- ❌ Courier, Courier-Bold
- ❌ Any font not listed in the table above

**For bold text and superscript/subscript:**
- Must call `registerFontFamily()` after registering fonts
- Then use `<b>`, `<super>`, `<sub>` tags in Paragraph
- **CRITICAL**: These tags ONLY work inside `Paragraph()` objects, NOT in plain strings

### Font Registration Template

```python
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfbase.pdfmetrics import registerFontFamily

# Chinese fonts
pdfmetrics.registerFont(TTFont('Microsoft YaHei', '/usr/share/fonts/truetype/chinese/msyh.ttf'))
pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont("SarasaMonoSC", '/usr/share/fonts/truetype/chinese/SarasaMonoSC-Regular.ttf'))

# English fonts
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))
pdfmetrics.registerFont(TTFont('Calibri', '/usr/share/fonts/truetype/english/calibri-regular.ttf'))

# Symbol/Formula font
pdfmetrics.registerFont(TTFont("DejaVuSans", '/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf'))

# CRITICAL: Register font families to enable <b>, <super>, <sub> tags
registerFontFamily('Microsoft YaHei', normal='Microsoft YaHei', bold='Microsoft YaHei')
registerFontFamily('SimHei', normal='SimHei', bold='SimHei')
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')
registerFontFamily('Calibri', normal='Calibri', bold='Calibri')
registerFontFamily('DejaVuSans', normal='DejaVuSans', bold='DejaVuSans')
```

### Font Configuration by Document Type

**For Chinese PDFs:**
- Body text: `SimHei` or `Microsoft YaHei`
- Headings: `Microsoft YaHei` (MUST use for Chinese headings)
- Code blocks: `SarasaMonoSC`
- Formulas/symbols: `DejaVuSans`
- **In tables: ALL Chinese content and numbers MUST use `SimHei`**

**For English PDFs:**
- Body text: `Times New Roman`
- Headings: `Times New Roman` (MUST use for English headings)
- Code blocks: `DejaVuSans`
- **In tables: ALL English content and numbers MUST use `Times New Roman`**

**For Mixed Chinese-English PDFs (CRITICAL):**
- Chinese text and numbers: Use `SimHei`
- English text: Use `Times New Roman`
- **ALWAYS apply this rule when generating PDFs containing both Chinese and English text**
- **In tables: ALL Chinese content and numbers MUST use `SimHei`, ALL English content MUST use `Times New Roman`**
- **Mixed Chinese-English Text Font Handling**: When a single string contains **both Chinese and English characters (e.g., "My name is Lei Shen (沈磊)")**, you MUST split the string by language and apply a different font to each part using ReportLab's inline `<font>` tags within `Paragraph` objects. English fonts (e.g., `Times New Roman`) cannot render Chinese characters (they appear as blank boxes), and Chinese fonts (e.g., `SimHei`) render English with poor spacing.
Set `ParagraphStyle.fontName` to your **base font**, then wrap segments of the other language in inline `<font>` tags.

```python
from reportlab.lib.enums import TA_JUSTIFY, TA_LEFT
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

story = []  # flowables collected for the document

# Base font is English; wrap Chinese parts:
enbody_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",  # Base font for English
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)

# Wrap Chinese segments with <font> tag (one concatenated string, then the style)
story.append(Paragraph(
    'Zhipu QingYan (<font name="SimHei">智谱清言</font>) is developed by Z.ai. '
    'My name is Lei Shen (<font name="SimHei">沈磊</font>). '
    '<font name="SimHei">文心一言</font> (ERNIE Bot) is by Baidu.',
    enbody_style
))

# Base font is Chinese; wrap English parts:
cnbody_style = ParagraphStyle(
    name="CNBodyStyle",
    fontName="SimHei",  # Base font for Chinese
    fontSize=10.5,
    leading=18,
    alignment=TA_LEFT,  # CJK text defaults to TA_LEFT (see Alignment Selection Rule)
    wordWrap='CJK',
)

# Wrap English segments with <font> tag
story.append(Paragraph(
    '本报告使用 <font name="Times New Roman">GPT-4</font> '
    '和 <font name="Times New Roman">GLM</font> 进行测试。',
    cnbody_style
))
```

### Chinese Plot PNG Method

If using Python to generate PNGs containing Chinese characters:

```python
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
```

### Available Font Paths

Run `fc-list` to list the installed fonts. Font files are typically located under:
- `/usr/share/fonts/truetype/chinese/`
- `/usr/share/fonts/truetype/english/`
- `/usr/share/fonts/`

## Guidelines for Output

1. **Information Density**: Prioritize depth and conciseness. Avoid fluff or excessive introductory filler. Use professional, precise terminology.
2. **Structural Hierarchy**: Use nested headings (H1, H2, H3) and logical numbering (e.g., 1.1, 1.1.1) to organize complex data.
3. **Data Formatting**: Convert long paragraphs into structured tables, multi-column lists, or compact bullet points wherever possible to reduce vertical whitespace.
4. **Visual Rhythm**: Use horizontal rules (---) to separate major sections. Ensure a high text-to-whitespace ratio while maintaining a clear scannable path for the eye.
5. **Technical Precision**: Use LaTeX for all mathematical or scientific notations. Ensure all tables are formatted with clear headers.
6. **Tone**: Academic, corporate, and authoritative. Adapt to the specific professional field (e.g., Legal, Engineering, Financial) as requested.
7. **Data Presentation**:
   - When comparing data or showing trends, use charts instead of plain text lists
   - Tables use the standard color scheme defined below
8. **Links & References**:
   - URLs must be clickable hyperlinks
   - When a document has multiple figures/tables, number them and add cross-references ("see Figure 1", "as shown in Table 2")
   - For academic, legal, and data-analysis documents with citations, implement in-text click-to-jump references with corresponding footnotes/endnotes

## Layout & Spacing Control

### Page Breaks

- NEVER insert page breaks between sections (H1, H2, H3) or within chapters
- Let content flow naturally; avoid forcing new pages
- **Specific allowed locations**:
  * Between the cover page and table of contents (if TOC exists)
  * Between the cover page and main content (if NO TOC exists)
  * Between the table of contents and main content (if TOC exists)
  * Between the main content and back cover page (if back cover page exists)

### Vertical Spacing Standards

* **Before tables**: `Spacer(1, 18)` after preceding text content (symmetric with table+caption block bottom spacing)
* After tables: `Spacer(1, 6)` before table caption
* After table captions: `Spacer(1, 18)` before next content (larger gap for table+caption blocks)
* Between paragraphs: `Spacer(1, 12)` (approximately 1 line)
* Between H3 subsections: `Spacer(1, 12)`
* Between H2 sections: `Spacer(1, 18)` (approximately 1.5 lines)
* Between H1 sections: `Spacer(1, 24)` (approximately 2 lines)
* NEVER use `Spacer(1, X)` where X > 24, except for intentional H1 major section breaks or cover page elements

### Cover Page Specifications

When creating PDFs with cover pages, use the following enlarged specifications:

**Title Formatting:**
- Main title font size: `36-48pt` (vs normal heading 18-20pt)
- Subtitle font size: `18-24pt`
- Author/date font size: `14-16pt`
- ALL titles MUST be bold: Use `<b>` tags in Paragraph (requires `registerFontFamily()` call first)

**Cover Page Spacing:**
- Top margin to title: `Spacer(1, 120)` or more (push title to upper-middle area)
- After main title: `Spacer(1, 36)` before subtitle
- After subtitle: `Spacer(1, 48)` before author/institution info
- Between author lines: `Spacer(1, 18)`
- After author block: `Spacer(1, 60)` before date
- Use `PageBreak()` after cover page content

**Alignment:**
- All text and images on the cover page must use `TA_CENTER`

**Cover Page Style Example:**

```python
from reportlab.lib.enums import TA_CENTER
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph, Spacer, PageBreak

# Cover page styles
cover_title_style = ParagraphStyle(
    name='CoverTitle',
    fontName='Microsoft YaHei',  # or 'Times New Roman' for English
    fontSize=42,
    leading=50,
    alignment=TA_CENTER,
    spaceAfter=36
)

cover_subtitle_style = ParagraphStyle(
    name='CoverSubtitle',
    fontName='SimHei',  # or 'Times New Roman' for English
    fontSize=20,
    leading=28,
    alignment=TA_CENTER,
    spaceAfter=48
)

cover_author_style = ParagraphStyle(
    name='CoverAuthor',
    fontName='SimHei',  # or 'Times New Roman' for English
    fontSize=14,
    leading=22,
    alignment=TA_CENTER,
    spaceAfter=18
)

# Cover page construction (`story` is the document's list of flowables)
story.append(Spacer(1, 120))  # Push down from top
story.append(Paragraph("<b>报告主标题</b>", cover_title_style))  # "Report Main Title"
story.append(Spacer(1, 36))
story.append(Paragraph("副标题或说明文字", cover_subtitle_style))  # "Subtitle or description"
story.append(Spacer(1, 48))
story.append(Paragraph("作者姓名", cover_author_style))  # "Author name"
story.append(Paragraph("所属机构", cover_author_style))  # "Institution"
story.append(Spacer(1, 60))
story.append(Paragraph("2025年2月", cover_author_style))  # "February 2025"
story.append(PageBreak())  # Always page break after cover
```

### Table & Content Flow

* Standard sequence: `Spacer(1, 18)` → Table → `Spacer(1, 6)` → Caption (centered) → `Spacer(1, 18)` → Next content
* Keep related content together: table + caption + immediate analysis
* Avoid orphan headings at page bottom

### Alignment and Typography

- **CJK body**: Use `TA_LEFT` + 2-char indent. Headings: no indent.
- **Font sizes**: Body 11pt, subheadings 14pt, headings 18-20pt
- **Line height**: 1.5-1.6 (keep line leading at 1.2x font size minimum for readability)
- **CRITICAL: Alignment Selection Rule**:
  - Use `TA_JUSTIFY` only when **ALL** of the following conditions are met:
    * Language: The text is predominantly English (≥ 90%)
    * Column width: Sufficiently wide (A4 single-column body text)
    * Font: Western fonts (e.g. Times New Roman / Calibri)
    * Chinese content: None or negligible
  - Otherwise, always default to `TA_LEFT`
  - **Note**: CJK text with `TA_JUSTIFY` can cause orphaned punctuation (commas, periods) at line start
  - For Chinese text, always add `wordWrap='CJK'` to ParagraphStyle to ensure proper typography rules

### Style Configuration

* Normal paragraph: `spaceBefore=0`, `spaceAfter=6-12`
* Headings: `spaceBefore=12-18`, `spaceAfter=6-12`
* **Headings must be bold**: Use `<b>` tags in Paragraph (requires `registerFontFamily()` call first)
* Table captions: `spaceBefore=3`, `spaceAfter=6`, `alignment=TA_CENTER`
* **CRITICAL**: For Chinese text, always add `wordWrap='CJK'` to ParagraphStyle
  - Prevents closing punctuation from appearing at line start
  - Prevents opening brackets from appearing at line end
  - Ensures proper Chinese typography rules

### Table Formatting

#### Standard Table Color Scheme (MUST USE for ALL tables)

```python
from reportlab.lib import colors

# Define standard colors for consistent table styling
TABLE_HEADER_COLOR = colors.HexColor('#1F4E79')  # Dark blue for header
TABLE_HEADER_TEXT = colors.white                 # White text for header
TABLE_ROW_EVEN = colors.white                    # White for even rows
TABLE_ROW_ODD = colors.HexColor('#F5F5F5')       # Light gray for odd rows
```

- A table caption must be added immediately after the table (centered)
- The entire table must be centered on the page
- **Header Row Formatting (CRITICAL)**:
  - Background: Dark blue (#1F4E79)
  - Text color: White (set via ParagraphStyle with `textColor=colors.white`)
  - Font weight: **Bold** (use `<b>` tags in Paragraph after calling `registerFontFamily()`)
  - **IMPORTANT**: Bold tags ONLY work inside `Paragraph()` objects. Plain strings like `'<b>Text</b>'` will NOT render bold.
- **Cell Formatting (Inside the Table)**:
  - Left/Right Cell Margin: Set to at least 120-200 twips (approximately the width of one character)
  - Text Alignment: All body cells within the same table must use the same alignment.
  - **Font**: In Chinese PDFs, ALL Chinese text and numbers in tables MUST use `SimHei`. In English PDFs, ALL English text and numbers in tables MUST use `Times New Roman`. In mixed Chinese-English PDFs, ALL Chinese content and numbers MUST use `SimHei` and ALL English content MUST use `Times New Roman`.
- **Units with Exponents (CRITICAL)**:
  - PROHIBITED: `W/m2`, `kg/m3`, `m/s2` (plain text exponents)
  - RIGHT: `Paragraph('W/m<super>2</super>', style)`, `Paragraph('kg/m<super>3</super>', style)` (proper superscript in Paragraph)
  - Always use `<super>` tags inside Paragraph objects for unit exponents in table cells
- **Numeric Values in Tables (CRITICAL)**:
  - Large numbers MUST use scientific notation: `Paragraph('-1.246 × 10<super>8</super>', style)` not `-124600000`
  - Small decimals MUST use scientific notation: `Paragraph('2.5 × 10<super>-4</super>', style)` not `0.00025`
  - Threshold: Use scientific notation when |value| ≥ 10000 or |value| ≤ 0.001
  - Format: `Paragraph('coefficient × 10<super>exponent</super>', style)` (e.g., `Paragraph('-1.246 × 10<super>8</super>', style)`)

#### Table Cell Paragraph Wrapping (MANDATORY - REVIEW BEFORE EVERY TABLE)

**STOP AND CHECK**: Before creating ANY table, verify that ALL text cells use `Paragraph()`.
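When cell values arrive as Python numbers, the scientific-notation threshold stated under Numeric Values in Tables can be applied mechanically before the value is wrapped in `Paragraph()`. The following is a minimal sketch rather than part of ReportLab: `sci_markup` and its `sig_digits` parameter are hypothetical names, and the sketch assumes the `<super>` inline tag of ReportLab's `Paragraph` markup.

```python
def sci_markup(value, sig_digits=4):
    """Format a number as Paragraph markup.

    Scientific notation is used when |value| >= 10000 or 0 < |value| <= 0.001;
    anything in between is returned as a plain decimal string.
    """
    magnitude = abs(value)
    if magnitude == 0 or 0.001 < magnitude < 10000:
        return str(value)
    # Python's 'e' format yields e.g. '-1.246e+08'; split mantissa and exponent.
    mantissa, exponent = f"{value:.{sig_digits - 1}e}".split("e")
    return f"{mantissa} × 10<super>{int(exponent)}</super>"
```

A table cell would then be built as `Paragraph(sci_markup(-124600000), cell_style)`, where `cell_style` stands for whatever `ParagraphStyle` the table uses.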
```python
from reportlab.lib.enums import TA_CENTER
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.units import cm
from reportlab.platypus import Paragraph, Table

# 1) Key point for Chinese text: wordWrap="CJK"
tbl_center = ParagraphStyle(
    "tbl_center",
    fontName="SimHei",
    fontSize=9,
    leading=12,
    alignment=TA_CENTER,
    wordWrap="CJK",
)

# 2) ALL content MUST be wrapped in Paragraph - NO EXCEPTIONS for text
# (`findings` is an iterable of 3-tuples of strings supplied by the caller)
findings_data = []
for a, b, c in findings:
    findings_data.append([
        Paragraph(a, tbl_center),
        Paragraph(b, tbl_center),
        Paragraph(c, tbl_center),  # ALL content MUST be wrapped in Paragraph
    ])

findings_table = Table(findings_data, colWidths=[1.8*cm, 3*cm, 9*cm])
```

**Complete Table Example:**

```python
from reportlab.platypus import Table, TableStyle, Paragraph, Image
from reportlab.lib.styles import ParagraphStyle
from reportlab.lib import colors
from reportlab.lib.enums import TA_CENTER, TA_LEFT, TA_RIGHT, TA_JUSTIFY

# Define styles for table cells
header_style = ParagraphStyle(
    name='TableHeader',
    fontName='Times New Roman',
    fontSize=11,
    textColor=colors.white,
    alignment=TA_CENTER
)

cell_style = ParagraphStyle(
    name='TableCell',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_CENTER
)

cell_style_jus = ParagraphStyle(
    name='TableCellLeft',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_JUSTIFY
)

cell_style_right = ParagraphStyle(
    name='TableCellRight',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_RIGHT
)

# ✅ CORRECT: All text content wrapped in Paragraph()
data = [
    # Header row - bold text with Paragraph
    [
        Paragraph('<b>Parameter</b>', header_style),
        Paragraph('<b>Unit</b>', header_style),
        Paragraph('<b>Value</b>', header_style),
        Paragraph('<b>Note</b>', header_style)
    ],
    # Data rows - all text in Paragraph
    [
        Paragraph('Temperature', cell_style_jus),
        Paragraph('°C', cell_style),
        Paragraph('25.5', cell_style_jus),
        Paragraph('Ambient', cell_style)
    ],
    [
        Paragraph('Pressure', cell_style_jus),
        Paragraph('Pa', cell_style),
        Paragraph('1.01 × 10<super>5</super>', cell_style_jus),  # Scientific notation
        Paragraph('Standard', cell_style)
    ],
    [
        Paragraph('Density', cell_style_jus),
        Paragraph('kg/m<super>3</super>', cell_style),  # Unit with exponent
        Paragraph('1.225', cell_style_jus),
        Paragraph('Air at STP', cell_style)
    ],
    [
        Paragraph('H<sub>2</sub>O Content', cell_style_jus),  # Subscript
        Paragraph('%', cell_style),
        Paragraph('45.2', cell_style_jus),
        Paragraph('Relative humidity', cell_style)
    ]
]

# ❌ PROHIBITED: Plain strings - NEVER DO THIS
# data = [
#     ['<b>Parameter</b>', '<b>Unit</b>', '<b>Value</b>'],  # Bold won't work!
#     ['Pressure', 'Pa', '1.01 × 10<super>5</super>'],      # Superscript won't work!
# ]

# Create table
table = Table(data, colWidths=[120, 80, 100, 120])
table.setStyle(TableStyle([
    # Header styling
    ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
    # Alternating row colors
    ('BACKGROUND', (0, 1), (-1, 1), colors.white),
    ('BACKGROUND', (0, 2), (-1, 2), colors.HexColor('#F5F5F5')),
    ('BACKGROUND', (0, 3), (-1, 3), colors.white),
    ('BACKGROUND', (0, 4), (-1, 4), colors.HexColor('#F5F5F5')),
    # Grid and alignment
    ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ('VALIGN', (0, 0), (-1, -1), 'MIDDLE'),
    ('LEFTPADDING', (0, 0), (-1, -1), 8),
    ('RIGHTPADDING', (0, 0), (-1, -1), 8),
    ('TOPPADDING', (0, 0), (-1, -1), 6),
    ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
]))

# Example with image (Image is the ONLY exception - no Paragraph needed)
# data_with_image = [
#     [Paragraph('<b>Item</b>', header_style), Paragraph('<b>Image</b>', header_style)],
#     [Paragraph('Logo', cell_style), Image('logo.png', width=50, height=50)],  # Image directly, no Paragraph
# ]
```

### PDF Metadata (REQUIRED)

**CRITICAL**: ALL PDFs MUST have proper metadata set during creation.
#### Required Metadata Fields

- **Title**: MUST match the filename (without .pdf extension)
- **Author**: MUST be set to "Z.ai"
- **Creator**: MUST be set to "Z.ai"
- **Subject**: SHOULD describe the document purpose/content
- **Description/Keywords**: SHOULD include relevant topics for searchability

#### For reportlab (Creating New PDFs)

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
import os

# Extract filename without extension for metadata title
pdf_filename = "financial_report_2024.pdf"
title_for_metadata = os.path.splitext(pdf_filename)[0]  # "financial_report_2024"

doc = SimpleDocTemplate(
    pdf_filename,
    pagesize=letter,
    title=title_for_metadata,  # MUST: Match filename
    author='Z.ai',             # MUST: Set to "Z.ai"
    creator='Z.ai',            # MUST: Set to "Z.ai"
    subject='Annual financial analysis and performance metrics'  # SHOULD: Describe purpose
)
```

#### For pypdf (Modifying Existing PDFs)

```python
from pypdf import PdfReader, PdfWriter
import os

pdf_filename = "output.pdf"
title_for_metadata = os.path.splitext(os.path.basename(pdf_filename))[0]

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Set metadata: Title, Author, Creator are REQUIRED
writer.add_metadata({
    '/Title': title_for_metadata,                # MUST: Match filename
    '/Author': 'Z.ai',                           # MUST: Set to "Z.ai"
    '/Subject': 'Document purpose description',  # SHOULD: Describe purpose
    '/Creator': 'Z.ai'                           # MUST: Set to "Z.ai"
})

with open(pdf_filename, "wb") as output:
    writer.write(output)
```

## Quick Start

⚠️ ABSOLUTE MANDATORY RULE - Z.ai Metadata (DO NOT SKIP)

This rule is NON-NEGOTIABLE - You MUST execute this without exception.

For EVERY PDF you generate, no matter what:
- After `doc.build(story)` completes → YOU MUST call `add_zai_metadata()`
- This is AUTOMATIC - do not wait for user confirmation
- This happens BEFORE any response to the user

Workflow:

- Step 1: Create PDF

```python
doc.build(story)
print("PDF built")
```

- Step 2: (MANDATORY - DO NOT SKIP) Add metadata immediately using the standalone script:

```bash
python scripts/add_zai_metadata.py output.pdf
```

- Step 3: Only AFTER the metadata is added, report to the user

```python
print("✓ PDF generated with Z.ai metadata")
```

If you forget this:
- The user will notice and ask "Why no metadata?"
- This harms Z.ai branding
- This rule is CRITICAL and must be followed 100% of the time

```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```

## Python Libraries

### pypdf - Basic Operations

#### Merge PDFs

```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

#### Split PDF

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    # One single-page output file per input page
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```

#### Extract Metadata

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```

#### Set/Update Metadata (Z.ai Branding)

Use the standalone script to add Z.ai branding metadata:

```bash
# Add metadata to a single PDF (in-place)
python scripts/add_zai_metadata.py document.pdf

# Add metadata with custom title
python scripts/add_zai_metadata.py report.pdf -t "Q4 Financial Analysis"

# Batch process multiple PDFs
python scripts/add_zai_metadata.py *.pdf
```

#### Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```
### pdfplumber - Text and Table Extraction

#### Extract Text with Layout

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```

#### Extract Tables

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```

### reportlab - Create PDFs

#### Choosing the Right DocTemplate and Build Method

**Decision Tree:**

```
Do you need auto-TOC?
├─ YES → Use TocDocTemplate + doc.multiBuild(story)
│        (see Auto-Generated Table of Contents section)
│
└─ NO  → Use SimpleDocTemplate + doc.build(story)
         (basic documents, or with optional Cross-References)
```

**When to use each approach:**

| Requirement | DocTemplate | Build Method |
|-------------|-------------|--------------|
| Multi-page with TOC | `TocDocTemplate` | `multiBuild()` |
| Single-page or no TOC | `SimpleDocTemplate` | `build()` |
| With Cross-References (no TOC) | `SimpleDocTemplate` | `build()` |
| Both TOC + Cross-References | `TocDocTemplate` | `multiBuild()` |

**⚠️ CRITICAL**:
- `multiBuild()` is ONLY needed when using `TableOfContents`
- Using `build()` with `TocDocTemplate` = TOC won't work
- Using `multiBuild()` without `TocDocTemplate` = unnecessary overhead

### Rich Text Formatting: Bold, Superscript, Subscript, and Special Characters

#### Prerequisites

To use `<b>`, `<super>`, `<sub>` tags, you **must**:

1. Register your fonts via `registerFont()`
2. Call `registerFontFamily()` to link normal/bold/italic variants
3. Wrap all tagged text in `Paragraph()` objects

**CRITICAL**: These tags ONLY work inside `Paragraph()` objects. Plain strings like `'<b>Text</b>'` will NOT render correctly.

#### Character Handling (see Core Constraint #5)

All rules for superscripts, subscripts, and mathematical/relational operators are defined in **Core Constraint #5: Character Safety Rule**.
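For chemical formulas typed in plain ASCII, the subscript rule from Core Constraint #5 can be applied with a small regular expression before the string is handed to `Paragraph()`. This is a rough sketch, not a ReportLab feature: `chem_markup` is a hypothetical helper, and it assumes that every digit run directly following a letter or a closing parenthesis is a subscript. Formulas with charges, isotopes, or hydrates would need extra handling.

```python
import re

# A digit run counts as a subscript when it directly follows an element
# symbol or a closing parenthesis; leading coefficients are left alone.
_SUBSCRIPT_RE = re.compile(r"(?<=[A-Za-z)])(\d+)")


def chem_markup(formula):
    """Wrap subscript digit runs of an ASCII formula in <sub> tags."""
    return _SUBSCRIPT_RE.sub(r"<sub>\1</sub>", formula)
```

Stoichiometric coefficients such as the leading `2` in `2H2O` are not preceded by a letter, so they stay at normal size while the digit after `H` becomes a subscript.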
**Quick reminder when writing Rich Text**:
- `<b>`, `<super>`, `<sub>` tags ONLY work inside `Paragraph()` objects
- Must call `registerFontFamily()` first to enable these tags
- Plain strings like `'<b>Text</b>'` will NOT render; always use `Paragraph()`
- For scientific notation: `Paragraph('coefficient × 10<super>exponent</super>', style)`
- For chemical formulas: `Paragraph('H<sub>2</sub>O', style)`

Do NOT use any Unicode escape sequence (e.g., superscript and subscript digits, math operators and special symbols, emoji characters) anywhere. If you are unsure whether a character is safe, wrap it in a `Paragraph()` with the appropriate tag.

#### Complete Python Example

```python
from reportlab.lib import colors
from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.styles import ParagraphStyle
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.pdfmetrics import registerFontFamily
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.platypus import Paragraph

# --- Register fonts and font family ---
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

# CRITICAL: Must call registerFontFamily() to enable <b>, <super> and <sub> tags
registerFontFamily('Times New Roman', normal='Times New Roman', bold='Times New Roman')

# --- Define styles ---
body_style = ParagraphStyle(
    name='BodyStyle',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_JUSTIFY,
)

bold_style = ParagraphStyle(
    name='BoldStyle',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.black,
    alignment=TA_JUSTIFY,
)

header_style = ParagraphStyle(
    name='HeaderStyle',
    fontName='Times New Roman',
    fontSize=10,
    textColor=colors.white,
    alignment=TA_JUSTIFY,
)

# --- Body text examples ---
# Bold title
title = Paragraph('<b>Scientific Formulas and Chemical Expressions</b>', bold_style)

# Math formulas with superscripts, subscripts and the mathematical symbol ×
math_text = Paragraph(
    'The Einstein mass-energy equivalence is expressed as E = mc<super>2</super>. '
    'In applied physics, the gravitational force is F = 6.674 × 10<super>-11</super> × '
    'm<sub>1</sub>m<sub>2</sub>/r<super>2</super>, '
    'and the Pythagorean theorem states a<super>2</super> + b<super>2</super> = c<super>2</super>.',
    body_style,
)

# Chemical expressions with subscript
chem_text = Paragraph(
    'The combustion of methane: CH<sub>4</sub> + 2O<sub>2</sub> '
    '= CO<sub>2</sub> + 2H<sub>2</sub>O. '
    'Sulfuric acid (H<sub>2</sub>SO<sub>4</sub>) reacts with sodium hydroxide to produce '
    'Na<sub>2</sub>SO<sub>4</sub> and water.',
    body_style,
)
```

#### Preventing Unwanted Line Breaks

**Problem 1: English names broken at awkward positions**

```python
# PROHIBITED: "K.G. Palepu" may break after "K.G."
text = Paragraph("Professors (K.G. Palepu) proposed...", style)

# RIGHT: Use non-breaking space (U+00A0) to prevent breaking
text = Paragraph("Professors (K.G.\u00A0Palepu) proposed...", style)
```

**Problem 2: Punctuation at line start**

```python
# RIGHT: Add wordWrap='CJK' for proper typography
styles.add(ParagraphStyle(
    name='BodyStyle',
    fontName='SimHei',
    fontSize=10.5,
    leading=18,
    alignment=TA_LEFT,
    wordWrap='CJK'  # Prevents orphaned punctuation
))
```

**Problem 3: Creating intentional line breaks**

```python
# PROHIBITED: Normal newline character does NOT create line breaks
text = Paragraph("Line 1\nLine 2\nLine 3", style)  # Will render as single line!

# RIGHT: Use <br/> tag for line breaks
text = Paragraph("Line 1<br/>Line 2<br/>Line 3", style)

# Alternative: Split into multiple Paragraph objects
story.append(Paragraph("Line 1", style))
story.append(Paragraph("Line 2", style))
story.append(Paragraph("Line 3", style))
```

#### Basic PDF Creation

```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```

#### Auto-Generated Table of Contents

## ⚠️ CRITICAL WARNINGS

### ❌ FORBIDDEN: Manual Table of Contents

**NEVER manually create a TOC like this:**

```python
# ❌ PROHIBITED - DO NOT USE
toc_entries = [("1. Title", "5"), ("2. Section", "10")]
for entry, page in toc_entries:
    story.append(Paragraph(f"{entry} {'.'*50} {page}", style))
```

**Why it is prohibited:**
- Hardcoded page numbers become incorrect when content changes
- No clickable hyperlinks
- Manual leader dots are fragile
- Must be manually updated with every document change

**✅ ALWAYS use an auto-generated TOC:**

**Key Implementation Requirements:**
- **Custom `TocDocTemplate` class**: Override `afterFlowable()` to capture TOC entries
- **Bookmark attributes**: Set `bookmark_name`, `bookmark_level`, `bookmark_text` on each heading
- **Use `doc.multiBuild(story)`**: NOT `doc.build()`; multiBuild is required for TOC processing
- **Clickable hyperlinks**: Generated automatically with proper styling

**Helper Function Pattern:**

```python
def add_heading(text, style, level=0):
    """Create heading with bookmark for auto-TOC"""
    p = Paragraph(text, style)
    p.bookmark_name = text
    p.bookmark_level = level
    p.bookmark_text = text
    return p

# Usage:
story.append(add_heading("1. Introduction", styles['Heading1'], 0))
story.append(Paragraph('Content...', styles['Normal']))
```

#### Complete TOC Implementation Example

Copy and adapt this complete working code for your PDF with Table of Contents:

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, PageBreak, Spacer
from reportlab.platypus.tableofcontents import TableOfContents
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch

class TocDocTemplate(SimpleDocTemplate):
    def __init__(self, *args, **kwargs):
        SimpleDocTemplate.__init__(self, *args, **kwargs)

    def afterFlowable(self, flowable):
        """Capture TOC entries after each flowable is rendered"""
        if hasattr(flowable, 'bookmark_name'):
            level = getattr(flowable, 'bookmark_level', 0)
            text = getattr(flowable, 'bookmark_text', '')
            self.notify('TOCEntry', (level, text, self.page))

# Create document
doc = TocDocTemplate("document.pdf", pagesize=letter)
story = []
styles = getSampleStyleSheet()

# Create Table of Contents
toc = TableOfContents()
toc.levelStyles = [
    ParagraphStyle(name='TOCHeading1', fontSize=14, leftIndent=20, fontName='Times New Roman'),
    ParagraphStyle(name='TOCHeading2', fontSize=12, leftIndent=40, fontName='Times New Roman'),
]
story.append(Paragraph("<b>Table of Contents</b>", styles['Title']))
story.append(Spacer(1, 0.2*inch))
story.append(toc)
story.append(PageBreak())

# Helper function: Create heading with TOC bookmark
def add_heading(text, style, level=0):
    p = Paragraph(text, style)
    p.bookmark_name = text
    p.bookmark_level = level
    p.bookmark_text = text
    return p

# Chapter 1: Introduction
story.append(add_heading("Chapter 1: Introduction", styles['Heading1'], 0))
story.append(Paragraph("This is the introduction chapter with some example content.", styles['Normal']))
story.append(Spacer(1, 0.2*inch))

story.append(add_heading("1.1 Background", styles['Heading2'], 1))
story.append(Paragraph("Background information goes here.",
                       styles['Normal']))

# Chapter 2: Conclusion
story.append(add_heading("Chapter 2: Conclusion", styles['Heading1'], 0))
story.append(Paragraph("This concludes our document.", styles['Normal']))
story.append(Spacer(1, 0.2*inch))

story.append(add_heading("2.1 Summary", styles['Heading2'], 1))
story.append(Paragraph("Summary of the document.", styles['Normal']))

# Build the document (must use multiBuild for TOC to work)
doc.multiBuild(story)

print("PDF with Table of Contents created successfully!")
```

#### Cross-References (Figures, Tables, Bibliography)

**OPTIONAL**: For academic papers requiring citation systems (LaTeX-style `\ref{}` and `\cite{}`)

**Key Principle**: Pre-register all figures, tables, and references BEFORE using them in text.

**Simple Implementation Pattern:**

```python
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.enums import TA_CENTER
from reportlab.lib.units import inch
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib import colors
from reportlab.platypus import Table, TableStyle
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

# The table styles below use this font, so it must be registered first
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

class CrossReferenceDocument:
    """Manages cross-references throughout the document"""

    def __init__(self):
        self.figures = {}
        self.tables = {}
        self.refs = {}
        self.figure_counter = 0
        self.table_counter = 0
        self.ref_counter = 0

    def add_figure(self, name):
        """Add a figure and return its number"""
        if name not in self.figures:
            self.figure_counter += 1
            self.figures[name] = self.figure_counter
        return self.figures[name]

    def add_table(self, name):
        """Add a table and return its number"""
        if name not in self.tables:
            self.table_counter += 1
            self.tables[name] = self.table_counter
        return self.tables[name]

    def add_reference(self, name):
        """Add a reference and return its number"""
        if name not in self.refs:
            self.ref_counter += 1
            self.refs[name] = self.ref_counter
        return self.refs[name]

def build_document():
    doc = SimpleDocTemplate("cross_ref.pdf", pagesize=letter)
    xref = CrossReferenceDocument()
    styles = getSampleStyleSheet()

    # Caption style
    styles.add(ParagraphStyle(
        name='Caption',
        parent=styles['Normal'],
        alignment=TA_CENTER,
        fontSize=10,
        textColor=colors.HexColor('#333333')
    ))

    story = []

    # Step 1: Register all figures, tables, and references FIRST
    fig1 = xref.add_figure('sample')
    table1 = xref.add_table('data')
    ref1 = xref.add_reference('author2024')

    # Step 2: Use them in text
    intro = f"""
    See Figure {fig1} for details and Table {table1} for data[{ref1}].
    """
    story.append(Paragraph(intro, styles['Normal']))
    story.append(Spacer(1, 0.2*inch))

    # Step 3: Create figures and tables with numbered captions
    story.append(Paragraph(f"Figure {fig1}. Sample Figure Caption", styles['Caption']))

    # Table example
    header_style = ParagraphStyle(
        name='TableHeader',
        fontName='Times New Roman',
        fontSize=11,
        textColor=colors.white,
        alignment=TA_CENTER
    )
    cell_style = ParagraphStyle(
        name='TableCell',
        fontName='Times New Roman',
        fontSize=10,
        textColor=colors.black,
        alignment=TA_CENTER
    )

    # All text content wrapped in Paragraph()
    data = [
        [Paragraph('Item', header_style), Paragraph('Value', header_style)],
        [Paragraph('A', cell_style), Paragraph('10', cell_style)],
        [Paragraph('B', cell_style), Paragraph('20', cell_style)],
    ]
    t = Table(data, colWidths=[2*inch, 2*inch])
    t.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#1F4E79')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ]))
    story.append(t)
    story.append(Spacer(1, 6))
    story.append(Paragraph(f"Table {table1}. Sample Data Table", styles['Caption']))
    story.append(PageBreak())

    # Step 4: Reference again in discussion
    discussion = f"""
    As shown in Figure {fig1} and Table {table1}, results are clear[{ref1}].
    """
    story.append(Paragraph(discussion, styles['Normal']))

    # Step 5: Bibliography section
    story.append(PageBreak())
    story.append(Paragraph("References", styles['Heading1']))
    story.append(Paragraph(
        f"[{ref1}] Author, A. (2024). Example Reference. Journal Name.",
        styles['Normal']
    ))

    doc.build(story)
    print("PDF with cross-references created!")

if __name__ == '__main__':
    build_document()
```

**Usage Notes:**
- **Pre-registration is critical**: Call `add_figure()`/`add_table()`/`add_reference()` at the START of your document
- **Citation format**: Use `Paragraph('[{ref_num}]')` for inline citations
- **Caption format**: Use `Paragraph('Figure {num}.')` or `Paragraph('Table {num}.')` with centered caption style
- **Combine with TOC**: Use `TocDocTemplate` + `doc.multiBuild(story)` if both cross-refs and auto-TOC are needed

## Command-Line Tools

### pdftotext (poppler-utils)

```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```

### qpdf

```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . \
  6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```

## Common Tasks

### Brand PDFs with Z.ai Metadata

⚠️ CRITICAL MANDATORY RULE - PDF Metadata MUST be Added After Every PDF Generation

All PDFs MUST have metadata added immediately after creation - This is the FINAL step and CANNOT be skipped

**Usage - Standalone Script:**

```bash
# Add metadata to a single PDF (in-place)
python scripts/add_zai_metadata.py document.pdf

# Add metadata to a single PDF (create new file)
python scripts/add_zai_metadata.py input.pdf -o output.pdf

# Add metadata with custom title
python scripts/add_zai_metadata.py report.pdf -t "Q4 Financial Analysis"

# Batch process all PDFs in current directory
python scripts/add_zai_metadata.py *.pdf

# Quiet mode (no output)
python scripts/add_zai_metadata.py document.pdf -q

# Show help
python scripts/add_zai_metadata.py --help
```

**Requirements:**
- After `doc.build(story)` completes → immediately call the script
- Do NOT wait for a user reminder and do NOT check the task description; execute automatically
- Confirm the metadata info to the user after adding it
- Memory phrase: PDF build done, metadata must add, no need to remind

### Extract Text from Scanned PDFs

```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```

### Add Watermark

```python
from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```

### Password Protection

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```

## Critical Reminders (MUST Follow)

### Font Rules
- **FONT RESTRICTION**: ONLY use the six registered fonts. NEVER use Arial, Helvetica, Courier, or any unregistered fonts.
- **In tables**: In a Chinese PDF, ALL Chinese text and numbers MUST use `SimHei`. In an English PDF, ALL English text and numbers MUST use `Times New Roman`. In a mixed Chinese-English PDF, ALL Chinese content and numbers MUST use `SimHei` and ALL English content MUST use `Times New Roman`.
- **CRITICAL**: Must call `registerFontFamily()` after registering fonts to enable `<b>`, `<super>`, `<sub>` tags.
- **Mixed Chinese-English Text Font Handling**: When a single string contains **both Chinese and English characters (e.g., "My name is Lei Shen (沈磊)")**, you MUST split the string by language and apply a different font to each part using ReportLab's inline `<font>` tags within `Paragraph` objects. English fonts (e.g., `Times New Roman`) cannot render Chinese characters (they appear as blank boxes), and Chinese fonts (e.g., `SimHei`) render English with poor spacing. Set `ParagraphStyle.fontName` to your **base font**, then wrap segments of the other language with `<font>` inline tags.
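
The splitting rule in the last bullet can be automated when the base font is English. The following is a minimal sketch, not part of the toolkit: `is_cjk` and `wrap_cjk` are hypothetical helper names, the code-point ranges are an assumption about what counts as Chinese text, and the input is assumed to contain no existing markup or characters that need XML escaping. It uses `ord()` comparisons rather than `\uXXXX` escapes so that it stays within the character safety rules.

```python
from itertools import groupby


def is_cjk(ch):
    """Return True for CJK ideographs, CJK punctuation, and full-width forms."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3000 <= cp <= 0x303F   # CJK symbols and punctuation
        or 0xFF00 <= cp <= 0xFFEF   # half-width and full-width forms
    )


def wrap_cjk(text, cjk_font="SimHei"):
    """Wrap each run of CJK characters in a <font> tag; leave other text as is."""
    parts = []
    for cjk, run in groupby(text, key=is_cjk):
        segment = "".join(run)
        if cjk:
            parts.append(f'<font name="{cjk_font}">{segment}</font>')
        else:
            parts.append(segment)
    return "".join(parts)


markup = wrap_cjk("My name is Lei Shen (沈磊)")
print(markup)  # My name is Lei Shen (<font name="SimHei">沈磊</font>)
```

The returned string is intended to be passed to `Paragraph(markup, style)` with an English base font. For a Chinese base font, the same grouping could be used with the condition inverted.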
```python
from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont('SimHei', '/usr/share/fonts/truetype/chinese/SimHei.ttf'))
pdfmetrics.registerFont(TTFont('Times New Roman', '/usr/share/fonts/truetype/english/Times-New-Roman.ttf'))

story = []

# Base font is English; wrap Chinese parts:
enbody_style = ParagraphStyle(
    name="ENBodyStyle",
    fontName="Times New Roman",  # Base font for English
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
# Wrap Chinese segments with <font> tag
# (the three sentences are concatenated into ONE string; the style is the second argument)
story.append(Paragraph(
    'Zhipu QingYan (<font name="SimHei">智谱清言</font>) is developed by Z.ai. '
    'My name is Lei Shen (<font name="SimHei">沈磊</font>). '
    '<font name="SimHei">文心一言</font> (ERNIE Bot) is by Baidu.',
    enbody_style
))

# Base font is Chinese; wrap English parts:
cnbody_style = ParagraphStyle(
    name="CNBodyStyle",
    fontName="SimHei",  # Base font for Chinese
    fontSize=10.5,
    leading=18,
    alignment=TA_JUSTIFY,
)
# Wrap English segments with <font> tag
story.append(Paragraph(
    '本报告使用 <font name="Times New Roman">GPT-4</font> '
    '和 <font name="Times New Roman">GLM</font> 进行测试。',
    cnbody_style
))
```

### Rich Text Tags (`<b>`, `<super>`, `<sub>`)
- These tags ONLY work inside `Paragraph()` objects; plain strings will NOT render them.
- **Character Safety**: Follow **Core Constraint #5** strictly. Do not use forbidden Unicode superscript/subscript/math characters anywhere in the code. Always use `<b>`, `<super>`, `<sub>` tags inside `Paragraph()`.
- **Scientific Notation in Tables**: `Paragraph('1.246 × 10<super>8</super>', style)`. Never write large numbers as plain digits.

### Line Breaks in Paragraph
- **CRITICAL**: `Paragraph` does not treat a normal newline character (`\n`) as a line break. To create line breaks, you must use `<br/>` (or split the content into multiple `Paragraph` objects).

```python
# PROHIBITED: the newlines in this string are ignored and the text renders as one block
sms3 = """Hi [FIRST_NAME]

You're invited! Join us for an exclusive first look at the Carolina Herrera Resort 2025 collection, before it opens to the public.

[DATE] | [TIME]
[Boutique Name]

_private champagne reception included_

Can I save you a spot? Just let me know!

[Your Name]"""
sms3_box = Table([[Paragraph(sms3, sms1_style)]], colWidths=[400])

# IMPORTANT:
# Paragraph does NOT treat '\n' as a line break.
# Use <br/> to force line breaks.
sms3 = """Hi [FIRST_NAME]<br/>
<br/>
You're invited! Join us for an exclusive first look at the Carolina Herrera Resort 2025 collection, before it opens to the public.<br/>
<br/>
[DATE] | [TIME]<br/>
[Boutique Name]<br/>
<br/>
private champagne reception included<br/>
<br/>
Can I save you a spot? Just let me know!<br/>
<br/>
[Your Name]"""
sms3_box = Table([[Paragraph(sms3, sms1_style)]], colWidths=[400])
```

### Body Title & Heading Styles
- **All titles and sub-titles (except for Table headers)**: Must be bold with black text - use `Paragraph('<b>Title</b>', style)` + `textColor=colors.black`.

### Table Cell Content Rule (MANDATORY)

**ALL text content in table cells MUST be wrapped in `Paragraph()`. This is NON-NEGOTIABLE.**

❌ **PROHIBITED** - Plain strings in table cells:

```python
# NEVER DO THIS - formatting will NOT work
data = [
    ['<b>Header</b>', 'Value'],                 # Bold won't render
    ['Temperature', '25°C'],                    # No style control
    ['Pressure', '1.01 × 10<super>5</super>'],  # Superscript won't work
]
```

✅ **REQUIRED** - All table text MUST be wrapped in `Paragraph()`:

```python
# ALWAYS DO THIS
data = [
    [Paragraph('<b>Header</b>', header_style), Paragraph('<b>Value</b>', header_style)],
    [Paragraph('Temperature', cell_style), Paragraph('25°C', cell_style)],
    [Paragraph('Pressure', cell_style), Paragraph('1.01 × 10<super>5</super>', cell_style)],
]
```

**Why this is mandatory:**
- Rendering formatting tags (`<b>`, `<super>`, `<sub>`, `<font>`)
- Proper font application
- Correct text alignment within cells
- Consistent styling across the table

**The ONLY exception**: `Image()` objects can be placed directly in table cells without Paragraph wrapping.

### Table Style Specifications
- **Header style**: Must be bold with white text on dark blue background - use `Paragraph('<b>Header</b>', header_style)` + `textColor=colors.white`.
- **Standard color scheme**: Dark blue header (`#1F4E79`), alternating white/light gray rows.
- **Color consistency**: If a single PDF contains multiple tables, only one color scheme is allowed across all tables.
- **Alignment**: Each body element within the same table must use the same alignment method.
- **Caption**: ALL table captions must be centered and followed by `Spacer(1, 18)` before next content.
- **Spacing**: Add `Spacer(1, 18)` BEFORE tables to maintain symmetric spacing with bottom.
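
The table specifications above could be collected into one reusable helper. This is a sketch under stated assumptions rather than the toolkit's own code: `table_style_commands` and `build_table_block` are hypothetical names, the light gray value `#F2F2F2` and the grid color are assumptions (the spec only says "light gray"), and the 6-point gap between table and caption is borrowed from the cross-reference example. The ReportLab imports are kept inside `build_table_block` so the command list can be inspected without ReportLab installed.

```python
HEADER_BG = "#1F4E79"             # dark blue header, from the spec
ROW_BGS = ("#FFFFFF", "#F2F2F2")  # white / light gray (gray value is an assumption)
GRID_COLOR = "#808080"            # assumption: medium gray grid lines


def table_style_commands(n_rows):
    """Return TableStyle command tuples for n_rows rows (header row included)."""
    cmds = [
        ("BACKGROUND", (0, 0), (-1, 0), HEADER_BG),
        ("ALIGN", (0, 0), (-1, -1), "CENTER"),
        ("GRID", (0, 0), (-1, -1), 0.5, GRID_COLOR),
    ]
    # Alternate body row backgrounds, starting with white on the first body row
    for row in range(1, n_rows):
        cmds.append(("BACKGROUND", (0, row), (-1, row), ROW_BGS[(row - 1) % 2]))
    return cmds


def build_table_block(data, caption):
    """Return flowables: spacer, styled table, centered caption, spacer.

    `data` is a list of rows whose cells are already Paragraph objects, and
    `caption` is a Paragraph that already uses a centered style.
    """
    from reportlab.lib import colors
    from reportlab.platypus import Spacer, Table, TableStyle

    def to_color(value):
        # Convert '#RRGGBB' strings to ReportLab colors; pass other values through
        if isinstance(value, str) and value.startswith("#"):
            return colors.HexColor(value)
        return value

    cmds = [tuple(to_color(v) for v in cmd) for cmd in table_style_commands(len(data))]
    table = Table(data)
    table.setStyle(TableStyle(cmds))
    return [Spacer(1, 18), table, Spacer(1, 6), caption, Spacer(1, 18)]
```

A call such as `story.extend(build_table_block(data, caption))` would then apply the same color scheme and spacing to every table in the document, which is one way to satisfy the color consistency rule.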

### Document Structure
- A PDF can contain ONLY ONE cover page and ONE back cover page.
- The cover page and the back cover page MUST use the alignment method specified by `TA_JUSTIFY`.
- **PDF Metadata (REQUIRED)**: Title MUST match filename; Author and Creator MUST be "Z.ai"; Subject SHOULD describe purpose.

### Image Handling
- **Preserve aspect ratio**: Never adjust image aspect ratio. Must insert according to the original ratio.

```python
from PIL import Image as PILImage
from reportlab.platypus import Image

# Get original dimensions
pil_img = PILImage.open('image.png')
orig_w, orig_h = pil_img.size

# Scale to fit width while preserving aspect ratio
target_width = 400
scale = target_width / orig_w
img = Image('image.png', width=target_width, height=orig_h * scale)
```

## Final Code Check
- Verify function parameter order against documentation.
- Confirm list/array element type consistency; test-run immediately.
- Use `Paragraph` (not `Preformatted`) for body text and formulas.

### MANDATORY: Post-Generation Forbidden Character Sanitization

**After the complete Python code is written and BEFORE executing it**, you MUST sanitize the code using the pre-built script located at:

```
scripts/sanitize_code.py
```

This script catches any forbidden Unicode characters (superscript/subscript digits, math operators, emoji, HTML entities, literal `\uXXXX` escapes) that may have slipped through despite the prevention rules. It converts them to safe ReportLab `<super>`/`<sub>` tags or ASCII equivalents.

**⚠️ CRITICAL RULE**: You MUST ALWAYS write PDF generation code to a `.py` file first, then sanitize it, then execute it. **NEVER use `python -c "..."` or heredoc (`python3 << 'EOF'`) to run PDF generation code directly**: these patterns bypass the sanitization step and risk forbidden characters reaching the final PDF.

**Mandatory workflow (NO EXCEPTIONS):**

```bash
# Step 1: ALWAYS write code to a .py file first
cat > generate_pdf.py << 'PYEOF'
# ... your PDF generation code here ...
PYEOF

# Step 2: Sanitize forbidden characters (MUST run before execution)
python scripts/sanitize_code.py generate_pdf.py

# Step 3: Execute the sanitized code
python generate_pdf.py
```

**Forbidden patterns: NEVER do any of the following:**

```bash
# ❌ PROHIBITED: python -c with inline code (cannot be sanitized)
python -c "from reportlab... doc.build(story)"

# ❌ PROHIBITED: heredoc without saving to file first (cannot be sanitized)
python3 << 'EOF'
from reportlab...
EOF

# ❌ PROHIBITED: executing the .py file WITHOUT sanitizing first
python generate_pdf.py  # Missing sanitization step!
```

**✅ CORRECT: The ONLY allowed execution pattern:**

```bash
# 1. Write to file → 2. Sanitize → 3. Execute
cat > generate_pdf.py << 'PYEOF'
...code...
PYEOF
python scripts/sanitize_code.py generate_pdf.py
python generate_pdf.py
```

**⚠️ This sanitization step is NON-OPTIONAL.** Even if you believe the code contains no forbidden characters, you MUST still run the sanitization script. It serves as a safety net to catch any characters that bypassed prevention rules.

## Quick Reference

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | `qpdf --empty --pages ...` |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see forms.md) | See forms.md |

## Next Steps

- For advanced pypdfium2 usage, see reference.md
- For JavaScript libraries (pdf-lib), see reference.md
- If you need to fill out a PDF form, follow the instructions in forms.md
- For troubleshooting guides, see reference.md
- For an advanced table of contents template, see reference.md
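
As a closing illustration, the mandatory order from Final Code Check (sanitize, then execute) followed by the metadata step from Common Tasks could be driven from a single Python entry point. This is only a sketch: `pipeline_commands` and `run_pipeline` are hypothetical names, the script paths are the ones quoted in this guide, and using `sys.executable` instead of a bare `python` is an assumption made so that all three steps run under the same interpreter.

```python
import subprocess
import sys


def pipeline_commands(script_path, pdf_path):
    """Return the three commands in their mandatory order."""
    py = sys.executable
    return [
        [py, "scripts/sanitize_code.py", script_path],   # 1. sanitize the generator
        [py, script_path],                               # 2. run the sanitized generator
        [py, "scripts/add_zai_metadata.py", pdf_path],   # 3. add metadata to the PDF
    ]


def run_pipeline(script_path, pdf_path):
    """Run each step in order; check=True stops the pipeline at the first failure."""
    for cmd in pipeline_commands(script_path, pdf_path):
        subprocess.run(cmd, check=True)
```

For example, `run_pipeline("generate_pdf.py", "document.pdf")` would sanitize and run the generator and then add metadata to its output, assuming the generator writes `document.pdf`.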