Initial commit
88
skills/docx/routes/comment.md
Executable file
@@ -0,0 +1,88 @@
# Route: Add Comments

## Method 1: python-docx (Simple, Limited)

```python
from docx import Document
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
from datetime import datetime

def add_comment(paragraph, comment_text, author="GLM", initials="G"):
    """Add a comment to an entire paragraph (stub: not implemented)."""
    # Create comment reference.
    # NOTE: hash() on str is salted per process and can collide, so this is only
    # a placeholder; w:id must be unique within word/comments.xml.
    comment_id = str(hash(comment_text) % 10000)

    # Add to comments.xml (need to create if not exists)
    # ... complex XML manipulation required
    pass

# Simpler approach: use python-docx-ng or manipulate XML directly
```

**Note**: python-docx has limited native comment support. For reliable results, use the OOXML method.
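
Because `w:id` values must be unique within `word/comments.xml`, a hash of the comment text is a fragile way to pick one. A minimal sketch of an alternative, assuming the existing `comments.xml` content is available as a string (the helper name `next_comment_id` is hypothetical, and the standard-library parser stands in for `defusedxml`, which is preferable for untrusted files):

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def next_comment_id(comments_xml: str) -> int:
    """Return an unused w:id: highest existing id + 1, or 0 if there are no comments."""
    root = ET.fromstring(comments_xml)
    ids = [int(c.get(f"{{{W_NS}}}id")) for c in root.iter(f"{{{W_NS}}}comment")]
    return max(ids, default=-1) + 1

# Illustrative input only: two existing comments with ids 1 and 4.
sample = (
    f'<w:comments xmlns:w="{W_NS}">'
    '<w:comment w:id="1"/><w:comment w:id="4"/>'
    "</w:comments>"
)
print(next_comment_id(sample))
```

The result is deterministic across runs, unlike `hash()`, and cannot clash with an id already present in the file.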

## Method 2: OOXML Direct Manipulation (Reliable)

### Step 1: Unpack

```bash
mkdir work && cd work && unzip ../input.docx
```

### Step 2: Create/update word/comments.xml

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:comments xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:comment w:id="1" w:author="Reviewer" w:date="2024-01-15T10:30:00Z" w:initials="R">
    <w:p>
      <w:r>
        <w:t>This section needs more detail.</w:t>
      </w:r>
    </w:p>
  </w:comment>
</w:comments>
```
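
For batch work, the same structure can be generated rather than typed by hand. The following is a sketch only, using the standard library; `append_comment` and `w` are hypothetical helper names, and the serialized declaration will not carry `standalone="yes"`:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # serialize with the conventional "w:" prefix

def w(tag: str) -> str:
    """Qualify a local tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"

def append_comment(comments_root, comment_id, text, author="Reviewer", initials="R"):
    """Append one <w:comment> holding a single paragraph, run, and text node."""
    comment = ET.SubElement(comments_root, w("comment"), {
        w("id"): str(comment_id),
        w("author"): author,
        w("date"): datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        w("initials"): initials,
    })
    run = ET.SubElement(ET.SubElement(comment, w("p")), w("r"))
    ET.SubElement(run, w("t")).text = text  # special characters are escaped on output
    return comment

root = ET.Element(w("comments"))
append_comment(root, 1, "This section needs more detail.")
xml_bytes = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
```

`xml_bytes` can then be written to `word/comments.xml` in the unpacked directory.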

### Step 3: Mark comment range in document.xml

```xml
<w:commentRangeStart w:id="1"/>
<w:r><w:t>Text being commented on</w:t></w:r>
<w:commentRangeEnd w:id="1"/>
<w:r>
  <w:rPr><w:rStyle w:val="CommentReference"/></w:rPr>
  <w:commentReference w:id="1"/>
</w:r>
```
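
The range markup above can be applied to a whole paragraph programmatically. This sketch assumes the paragraph is an already-parsed `<w:p>` element; `anchor_comment` is a hypothetical helper, not part of any library:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)

def w(tag: str) -> str:
    return f"{{{W_NS}}}{tag}"

def anchor_comment(paragraph, comment_id):
    """Wrap all runs of a <w:p> in a comment range and append the reference run."""
    cid = {w("id"): str(comment_id)}
    # <w:pPr> must stay the first child, so the range starts right after it.
    start = 1 if len(paragraph) and paragraph[0].tag == w("pPr") else 0
    paragraph.insert(start, ET.Element(w("commentRangeStart"), cid))
    paragraph.append(ET.Element(w("commentRangeEnd"), cid))
    ref_run = ET.SubElement(paragraph, w("r"))
    rpr = ET.SubElement(ref_run, w("rPr"))
    ET.SubElement(rpr, w("rStyle"), {w("val"): "CommentReference"})
    ET.SubElement(ref_run, w("commentReference"), cid)

para = ET.fromstring(
    f'<w:p xmlns:w="{W_NS}"><w:pPr/><w:r><w:t>Text being commented on</w:t></w:r></w:p>'
)
anchor_comment(para, 1)
child_tags = [child.tag.split("}")[1] for child in para]
print(child_tags)
```

To comment on specific words instead of the whole paragraph, the start and end markers would be inserted around the relevant runs only, which may require splitting a run first.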

### Step 4: Update relationships

In `word/_rels/document.xml.rels`, add:
```xml
<Relationship Id="rIdComments" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments" Target="comments.xml"/>
```

### Step 5: Update Content_Types

In `[Content_Types].xml`, ensure:
```xml
<Override PartName="/word/comments.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"/>
```
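
Steps 4 and 5 should be idempotent: a document that already has comments already has both entries, and adding them twice produces an invalid package. A sketch of "add only if missing", with hypothetical helper names and empty stand-in documents as input:

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
COMMENTS_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments"
COMMENTS_CT = "application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"

def ensure_comments_relationship(rels_root):
    """Return the comments relationship, creating it only if it is missing."""
    for rel in rels_root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Type") == COMMENTS_REL:
            return rel
    return ET.SubElement(rels_root, f"{{{REL_NS}}}Relationship", {
        "Id": "rIdComments", "Type": COMMENTS_REL, "Target": "comments.xml"})

def ensure_comments_override(types_root):
    """Return the content-type override for comments.xml, creating it if missing."""
    for override in types_root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName") == "/word/comments.xml":
            return override
    return ET.SubElement(types_root, f"{{{CT_NS}}}Override", {
        "PartName": "/word/comments.xml", "ContentType": COMMENTS_CT})

# Stand-ins for the parsed document.xml.rels and [Content_Types].xml roots.
rels = ET.fromstring(f'<Relationships xmlns="{REL_NS}"/>')
types = ET.fromstring(f'<Types xmlns="{CT_NS}"/>')
for _ in range(2):  # the second pass must not add duplicates
    ensure_comments_relationship(rels)
    ensure_comments_override(types)
```

When writing these trees back, passing `default_namespace=` to `ET.tostring` keeps the elements unprefixed, as in the original files.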

### Step 6: Pack

```bash
zip -r ../output.docx . -x ".*"
```

## When to Use Each Method

| Scenario | Method |
|----------|--------|
| Add 1-2 simple comments | OOXML |
| Batch review (many comments) | OOXML with Python script |
| Comment on specific words | OOXML (precise range control) |
| Quick annotation | python-docx if available |
207
skills/docx/routes/create.md
Executable file
@@ -0,0 +1,207 @@
# Route: Create New Document

## Workflow

```
0. Check if user provided a reference template (PDF/docx) → if yes, use Template-Following Mode below
1. Load `references/design-system.md` → select palette and cover recipe
2. Load `references/common-rules.md` → shared layout, font, placeholder rules
3. Check user keywords → load scene file if applicable
4. Load `references/docx-js-core.md`
5. If complex → also load `references/docx-js-advanced.md`
6. Plan document structure (outline)
7. Write JS/TS using docx library
   ⚠️ **BEFORE writing any string**: scan ALL Chinese text for curly quotes `“”‘’` and replace with `\u201c \u201d \u2018 \u2019`; bare curly quotes break JS syntax (see docx-js-advanced.md § Quotes Escaping)
8. Run with `bun run generate.js` (or `node generate.js`)
9. If TOC → run `python3 "$DOCX_SCRIPTS/add_toc_placeholders.py" output.docx --auto`
10. Run post-generation checklist (see SKILL.md)
```
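
The quote-escaping rule in step 7 can be applied mechanically before text is pasted into the generator script. A minimal sketch, assuming the content is prepared in Python first; `escape_curly_quotes` is a hypothetical helper, not something the skill ships:

```python
# Map each literal curly quote to the escape sequence that is safe inside a JS string.
CURLY_QUOTES = {
    "\u201c": "\\u201c",  # left double quote
    "\u201d": "\\u201d",  # right double quote
    "\u2018": "\\u2018",  # left single quote
    "\u2019": "\\u2019",  # right single quote
}

def escape_curly_quotes(text: str) -> str:
    """Replace literal curly quotes with their backslash-u escape sequences."""
    return "".join(CURLY_QUOTES.get(ch, ch) for ch in text)

escaped = escape_curly_quotes("他说\u201c你好\u201d")
print(escaped)
```

Straight quotes and all other characters pass through unchanged, so the function is safe to run over every string in the document.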

## Template-Following Mode

When the user provides a reference document (PDF/docx) as a **formatting template** (e.g., "generate following this template format", "refer to this sample"), switch to template-following mode instead of the standard recipe-based workflow:

1. **Extract the template's structure**: cover layout, section order, heading hierarchy, page breaks, special pages (e.g., advisor comments page, approval form)
2. **Replicate structure exactly**: every major structural unit becomes a **separate section** (cover, body, appendix/form pages) with appropriate margins and page breaks
3. **Fill content** from the user's content source, or generate per user instructions
4. **Preserve template-specific elements**: school-specific forms, signature areas, stamp placeholders, advisor comment blocks → reproduce as-is with placeholder text (e.g., "Advisor (signature):")
5. **Maintain formatting fidelity**: font choices, table layouts, spacing, and alignment should match the template, not the standard design-system palettes

⚠️ **Do NOT apply standard cover recipes (R1–R7) when a user-provided template defines its own cover format.** Follow the template's cover layout instead. Standard `common-rules.md` constraints (e.g., `WidthType.PERCENTAGE`, `allNoBorders` for cover wrapper, `Rule 8` line spacing) still apply for cross-engine compatibility.

⚠️ **Each distinct page type = separate section.** Cover section (margin: 0), body section (standard margins), appendix/form pages (may need different margins or orientation). Never place cover + body + appendix in a single section.

---

## Decision Tree

### Cover Page?
- **YES**: Reports, theses, proposals, plans, or 3+ page docs with clear title/author
- **NO**: Resumes, contracts, official documents, exam papers, short memos

### Cover Style Selector: Recipe Router

Covers use **7 validated layout recipes (R1–R7)**, auto-selected by `selectCoverRecipe()` in `references/design-system.md` (the **authoritative source**; do NOT duplicate the function).

**Quick Reference:**

| docType | Recipe | Default Palette |
|---------|--------|-----------------|
| contract / official / exam / resume | null (no cover) | n/a |
| academic | R5 (Clean White) | ACADEMIC |
| proposal_report (thesis proposal) | R5 (Clean White) | ACADEMIC |
| lesson_plan (STEM) | R4 (Top Color Block) | DM-1 |
| lesson_plan (arts/general) | R6 (Editorial Warm) | ED-1 |
| creative / branding / design | R3 (Centered Card Frame) | SN-2 |
| cultural / newsletter / internal | R6 (Editorial Warm) | ED-1 |
| activity / event | R6 (Editorial Warm) | ED-1 |
| trend/research (cultural/creative/brand) | R7 (Swiss Tech) | ST-1 |
| whitepaper | R2 (Double-Rule Frame) | IG-1 / CM-2 |
| consulting | R2 (Double-Rule Frame) | MIN-1 |
| proposal / plan | R4 (Top Color Block) | GO-1 |
| report | R1 (Pure Paragraph Left) | by industry |
| default | R1 (Pure Paragraph Left) | DS-1 |

⚠️ **Long title routing:** After selecting recipe, apply `applyLongTitleOverride(result, titleLength)`. Titles >20 chars on R3/R4/R6 → fall back to R1. Titles >30 chars on R2 → fall back to R1. R5 is never overridden.
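
The thresholds in the long-title rule can be stated as a small lookup. This Python sketch only illustrates the rule as written here; it is not the real `applyLongTitleOverride()` (which lives in the JS references and takes a result object), and it assumes R1 and R7, which the rule does not mention, are left unchanged:

```python
FALLBACK_RECIPE = "R1"
# Maximum title length each recipe tolerates before falling back to R1.
# R5 is deliberately absent: it is never overridden.
TITLE_LIMITS = {"R3": 20, "R4": 20, "R6": 20, "R2": 30}

def apply_long_title_override(recipe, title_length):
    """Return the recipe to use after applying the long-title fallback rule."""
    if recipe is None:  # document types without a cover stay without one
        return None
    limit = TITLE_LIMITS.get(recipe)
    if limit is not None and title_length > limit:
        return FALLBACK_RECIPE
    return recipe

for recipe, length in [("R3", 21), ("R3", 20), ("R2", 25), ("R2", 31), ("R5", 100)]:
    print(recipe, length, "->", apply_long_title_override(recipe, length))
```

The comparison is strict (`>`), so a title of exactly 20 characters keeps R3/R4/R6 and one of exactly 30 keeps R2.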

⚠️ **Academic thesis cover:** Use `buildAcademicCover()` from `scenes/academic.md`.

⚠️ **Thesis proposal report (开题报告):** Use `buildProposalCover()` from `scenes/academic.md`. Cover MUST be an independent section. Keywords: "开题报告" (Chinese), "thesis proposal", "research proposal"; these are NOT the same as business proposals (which use R4).

### Table of Contents?
- **YES**: 3+ major sections (H1 headings)
- **NO**: Resumes, exam papers, short docs, contracts (<20 clauses)

→ See `references/toc.md` for the complete TOC reference (3-step process, code examples, common bugs).

### Headers/Footers?
- **YES** by default (page numbers minimum)
- **NO**: cover page section, official docs (special format)

### Load Math Formulas?
When: exam papers, academic papers, physics/math/chemistry → load `references/math-formulas.md`

### Load Chart Templates?
When: data visualization, reports with charts → load `references/chart-templates.md`

## Outline Rules

**User provides outline** → Follow EXACTLY. No additions, deletions, or reordering.

**No outline** → Create from scene template:
- **Academic:** Abstract → TOC → Body → References
- **Report:** Use `selectReportType()` to determine type, then follow template A–F:
  - analysis → Template A (Executive Summary → Background → Scope & Method → Findings → Diagnosis → Conclusions)
  - experiment → Template B (Abstract → Objective & Hypothesis → Environment → Procedure → Results → Error Analysis → Conclusions)
  - testing → Template C (Overview → Scope & Environment → Test Plan → Results → Defects → Risks → Conclusions)
  - research → Template D (Summary → Background → Subjects & Method → Sample → Findings → Synthesis → Recommendations)
  - review → Template E (Overview → Goals → Review → Results → Issues → Lessons → Action Plan)
  - proposal → Template F (Summary → Status → Goals → Solution → Roadmap → Resources → Risks → Benefits)
- **Contract:** Use `selectContractType()` then follow template A–E:
  - bilateral → Template A (Header → Parties → Recitals → Definitions → Subject → Price → Rights → Delivery → Tax → IP → Breach → Force Majeure → Termination → Notices → Dispute → Miscellaneous → Signature)
  - transfer → Template B (Header → Recitals → Definitions → Subject → Consideration → Closing → Representations → Tax → Breach → Dispute → Signature)
  - nda → Template C (Header → Recitals → Definition → Obligations → Use Restrictions → Return/Destroy → Exceptions → Duration → Breach → Dispute → Signature)
  - framework → Template D (Header → Recitals → Purpose → Scope → Division → Mechanism → Commercial → Confidentiality → Term → Breach → Dispute → Signature)
  - terms → Template E (Title → Definitions → Services → Rights → Liability → Fees → IP → Termination → Notices → Dispute → Miscellaneous)
- **Official:** Use `selectOfficialType()` + `needsRedHeader()`:
  - notice → Template A ([Red header] → [Doc number] → Title → Addressee → Reason → Items → Requirements → [Attachments] → [Signature] → [Date] → [Colophon])
  - letter → Template B ([Red header] → [Doc number] → Title → Addressee → Reason → Negotiation/Reply → Closing → [Signature] → [Date])
  - reply → Template C ([Red header] → [Doc number] → Title → Addressee → Reference → Reply → "This is the reply." → Signature → Date)
  - minutes → Template D (Title → Meeting Overview → Agreed Items → Responsibilities → [Distribution]), typically no red header
- Present outline to user before generating when possible

## Scene Completeness

Include ALL elements a scene specifies:
- **Academic thesis:** Cover (`buildAcademicCover()` in its own section), abstract, TOC, references
- **Thesis proposal report (开题报告):** Cover (`buildProposalCover()` in its own section), body sections per proposal template. Cover MUST be a separate section.
- **Report:** Cover, executive summary, conclusions
- **Contract:** Party info, recitals, complete clause closure, signature block, uniform `【】` placeholders
- **Official:** Correct document type, specific title, closing phrase matching type, proper numbering hierarchy, red header only when requested
- **Exam:** Student info area, scoring criteria

Generate complete, substantive content, not skeletons.

## Content Guidelines

- **Length**: "detailed report" = 3000+ words. "brief summary" = 500–1000 words.
- **Data**: Use user's data, or generate realistic placeholders
- **Charts**: Use `references/chart-templates.md` matplotlib templates → PNG → embed
- **Math**: Use `references/math-formulas.md` LaTeX → docx-js Math mapping
- **Tables**: For structured data, not layout
- **Numbering**: Figures, tables numbered sequentially with cross-references

## Code Architecture

### Heading Style Rule (Mandatory)

**All body chapter headings MUST use `heading: HeadingLevel.HEADING_X`**; never simulate with bold + large font (TOC cannot detect simulated headings).

**Exception:** The cover title and the TOC title ("目录") MUST NOT use a Heading style.

### Blank Page Prevention

→ See SKILL.md § Post-Generation checklist for the full set of rules.

Key rules:
1. No double page breaks (SectionType.NEXT_PAGE + PageBreak = blank page)
2. PageBreak paragraphs should have visible text content
3. No more than 3 consecutive empty paragraphs
4. Cover section: ≤2 trailing empty paragraphs, no trailing PageBreak
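
Rule 3 is easy to check on the generated file. The sketch below is not `postcheck.py` (whose logic is not shown here); it is a hypothetical standalone check that treats a paragraph as empty when it has no text, break, or drawing, and reports the longest run of such paragraphs among the direct children of `<w:body>`:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def w(tag: str) -> str:
    return f"{{{W_NS}}}{tag}"

def is_empty(paragraph) -> bool:
    """True if the paragraph has no visible text, no break, and no drawing."""
    has_text = any((t.text or "").strip() for t in paragraph.iter(w("t")))
    has_break = paragraph.find(f".//{w('br')}") is not None
    has_drawing = paragraph.find(f".//{w('drawing')}") is not None
    return not (has_text or has_break or has_drawing)

def longest_empty_run(body) -> int:
    """Length of the longest sequence of consecutive empty top-level paragraphs."""
    longest = current = 0
    for child in body:
        if child.tag == w("p") and is_empty(child):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

# Illustrative document: a title, four empty paragraphs, then body text.
doc_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Title</w:t></w:r></w:p>"
    + "<w:p/>" * 4
    + "<w:p><w:r><w:t>Body</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
body = ET.fromstring(doc_xml).find(w("body"))
print(longest_empty_run(body))
```

A result above 3 violates the rule; paragraphs inside table cells are not inspected by this version.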

### Builder Pattern Example

```js
const { Document, Packer, Paragraph, TextRun, Header, Footer,
        AlignmentType, HeadingLevel, PageNumber } = require("docx");
const fs = require("fs");

// 1. Palette
const P = { primary: "#101820", body: "#182030", secondary: "#506070", accent: "#8090A0" };
const c = (hex) => hex.replace("#", "");

// 2. Component builders
function heading(text, level = HeadingLevel.HEADING_1) {
  return new Paragraph({
    heading: level,
    spacing: { before: level === HeadingLevel.HEADING_1 ? 360 : 240, after: 120 },
    children: [new TextRun({ text, bold: true, color: c(P.primary), font: { ascii: "Calibri", eastAsia: "SimHei" } })]
  });
}

function body(text) {
  return new Paragraph({
    alignment: AlignmentType.JUSTIFIED,
    indent: { firstLine: 480 },
    spacing: { line: 312 },
    children: [new TextRun({ text, size: 24, color: c(P.body) })],
  });
}

// 3. Assembly: cover + body in separate sections
const doc = new Document({
  styles: { default: { document: {
    run: { font: { ascii: "Calibri", eastAsia: "Microsoft YaHei" }, size: 24, color: c(P.body) },
    paragraph: { spacing: { line: 312 } },
  }}},
  sections: [
    { properties: { page: { margin: { top: 0, bottom: 0, left: 0, right: 0 } } },
      children: buildCoverR1(config) },  // ← use recipe from design-system.md
    { properties: { page: { margin: { top: 1440, bottom: 1440, left: 1701, right: 1417 } } },
      footers: { default: new Footer({ children: [new Paragraph({ alignment: AlignmentType.CENTER,
        children: [new TextRun({ children: [PageNumber.CURRENT], size: 18 })] })] }) },
      children: [heading("Chapter 1"), body("Content...")] },
  ],
});

Packer.toBuffer(doc).then(buf => { fs.writeFileSync("output.docx", buf); });
```

## Post-Generation

→ See SKILL.md § Post-Generation for the complete two-layer verification checklist.

```bash
python3 "$DOCX_SCRIPTS/postcheck.py" output.docx
```
⚠️ **Running postcheck.py is MANDATORY.** Fix all ❌ errors before delivering.
115
skills/docx/routes/edit.md
Executable file
@@ -0,0 +1,115 @@
# Route: Edit Existing Document

## Workflow Overview

```
1. Receive .docx (or .doc → convert)
2. Unpack → working directory
3. Analyze structure (document.xml, styles.xml)
4. Plan changes → batch by type
5. Implement via Document library (Python)
6. Pack → output.docx
7. Verify (pandoc or visual)
```

## Step 0: Format Conversion

```bash
# .doc → .docx
libreoffice --headless --convert-to docx input.doc
```

## Step 1: Unpack

```bash
mkdir -p work_dir && cd work_dir && unzip ../input.docx
```

Key files: `word/document.xml` (content), `word/styles.xml` (styles), `word/numbering.xml` (lists), `word/media/` (images), `[Content_Types].xml`, `word/_rels/document.xml.rels`

## Step 2: Plan Changes

Group changes into batches, process in order:

1. **Structural**: Add/remove sections, reorder paragraphs
2. **Style**: Font, size, color modifications
3. **Text**: Find/replace, fix typos
4. **Table**: Add/remove rows/columns, update data
5. **Image**: Replace/add images

## Step 3: Implement

Load `references/ooxml.md` for the full Document library API. Key patterns:

```python
from scripts.document import Document

doc = Document('work_dir')

# Text replacement with tracked changes
node = doc["word/document.xml"].get_node(tag="w:r", contains="old text")
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
replacement = f'<w:del><w:r>{rpr}<w:delText>old text</w:delText></w:r></w:del><w:ins><w:r>{rpr}<w:t>new text</w:t></w:r></w:ins>'
doc["word/document.xml"].replace_node(node, replacement)

doc.save()
```
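
When the old or new text comes from the user, it may contain `&`, `<`, or `>`, which would make the replacement string malformed XML. A small sketch of building the same `<w:del>`/`<w:ins>` pair with escaping; `tracked_replacement` is a hypothetical helper, not part of the Document library referenced above:

```python
from xml.sax.saxutils import escape

def tracked_replacement(old_text: str, new_text: str, rpr: str = "") -> str:
    """Build a <w:del>/<w:ins> pair, XML-escaping both text values.

    `rpr` is an already-serialized <w:rPr> element (or "") and is inserted as-is.
    """
    return (
        f"<w:del><w:r>{rpr}<w:delText>{escape(old_text)}</w:delText></w:r></w:del>"
        f"<w:ins><w:r>{rpr}<w:t>{escape(new_text)}</w:t></w:r></w:ins>"
    )

replacement = tracked_replacement("R&D < 5%", "R&D >= 5%")
print(replacement)
```

The returned string can be passed wherever the hand-written `replacement` string is used in the pattern above.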

## Step 4: Pack

```bash
cd work_dir && zip -r ../output.docx . -x ".*"
```

## Step 5: Verify

```bash
# pandoc writes to stdout when no -o is given
pandoc output.docx -t plain | head -50
# or visual
libreoffice --headless --convert-to pdf output.docx
```
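
Before the pandoc or visual check, a quick structural check catches the most common packing mistakes (a missing part, or XML left malformed by a string edit). This is a sketch with a hypothetical `check_docx` helper; the list of required parts is a minimal assumption, not a full validation:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REQUIRED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")

def check_docx(source) -> list:
    """Return a list of problems found in a .docx package (empty list means OK)."""
    problems = []
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
        for part in REQUIRED_PARTS:
            if part not in names:
                problems.append(f"missing part: {part}")
        for name in sorted(names):
            if name.endswith((".xml", ".rels")):
                try:
                    ET.fromstring(zf.read(name))
                except ET.ParseError as exc:
                    problems.append(f"{name}: not well-formed ({exc})")
    return problems

# Demonstration on a deliberately broken in-memory package.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<w:document>")  # unclosed, unbound prefix
problems = check_docx(buf)
for problem in problems:
    print(problem)
```

For a real file, pass the path instead of a buffer, e.g. `check_docx("output.docx")`.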

---

## Template Matching Workflow

When user says "use this format" or provides a template:

1. Unpack template, extract `styles.xml`, `numbering.xml`
2. Analyze font/size/spacing/margins
3. Copy `styles.xml` into target document
4. Match heading hierarchy and spacing

## Multi-File Merge

1. Use first document as base
2. Extract content from additional documents
3. Insert with page breaks between sections
4. Merge styles (prefer base document's)
5. Re-number figures/tables sequentially

## Redlining (Tracked Changes): Default for Revisions

When user asks for revisions, **default to tracked changes** so they can review:

```python
doc = Document('work_dir', track_revisions=True)
# ... make changes using replace_node with <w:del>/<w:ins>
doc.save()
```

Ask whether the user wants clean output or tracked changes only when the request is ambiguous.

## Common Operations Quick Reference

| Operation | Approach |
|-----------|----------|
| Replace text | `get_node` + `replace_node` with tracked changes |
| Change font | Modify `<w:rFonts>` in run properties |
| Add paragraph | `insert_after` with `<w:p>` element |
| Delete paragraph | `suggest_deletion` on `<w:p>` |
| Add table row | Clone `<w:tr>`, modify cells |
| Update header | Edit `word/headerN.xml` |
| Change margins | Edit `<w:pgMar>` in `<w:sectPr>` |
| Add image | See `references/ooxml.md` image insertion pattern |
| Add comment | `doc.add_comment(start, end, text)` |
120
skills/docx/routes/format.md
Executable file
@@ -0,0 +1,120 @@
# Route: Format / Layout

## Workflow

```
1. Read current document (pandoc for content, unpack for structure)
2. Identify format requirements from user
3. Use unit conversion table (see SKILL.md)
4. Apply formatting via OOXML manipulation or python-docx
5. Pack and verify
```

## Quick Formatting via python-docx

For simple formatting tasks, python-docx is often faster than raw XML:

```python
from docx import Document as PythonDocument
from docx.oxml.ns import qn
from docx.shared import Pt, Cm, Twips
from docx.enum.text import WD_ALIGN_PARAGRAPH

doc = PythonDocument("input.docx")

# Change all body paragraph formatting
for para in doc.paragraphs:
    if para.style.name.startswith("Heading"):
        continue
    para.paragraph_format.first_line_indent = Twips(480)  # 2 characters at 12pt
    para.paragraph_format.line_spacing = 1.5
    para.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
    for run in para.runs:
        run.font.name = "宋体"
        # font.name only sets the Latin font slots; CJK text uses w:eastAsia
        run._element.rPr.rFonts.set(qn("w:eastAsia"), "宋体")
        run.font.size = Pt(12)  # Xiao Si 小四

doc.save("output.docx")
```

## Common Format Request Patterns

### University Thesis Formatting

Typical Chinese university thesis requirements:

```python
from docx.shared import Cm, Pt, Twips

# Margins
for section in doc.sections:
    section.top_margin = Cm(2.5)
    section.bottom_margin = Cm(2.5)
    section.left_margin = Cm(3.0)
    section.right_margin = Cm(2.5)

# Fonts
# Body: SimSun 宋体 Xiao Si 小四 (12pt)
# H1: SimHei 黑体 San Hao 三号 (16pt) centered
# H2: SimHei 黑体 Si Hao 四号 (14pt)
# H3: SimHei 黑体 Xiao Si 小四 (12pt)
# English: Times New Roman, same sizes
```

### Page Numbers Starting from Specific Page

Use multi-section approach:
```python
# Section 1: Front matter (Roman numerals)
# Section 2: Main content (Arabic, starting from 1)
# This requires OOXML manipulation; see routes/edit.md for unpack/pack workflow
```

In raw XML (`word/document.xml`):
```xml
<!-- Front matter: a non-final sectPr sits inside the <w:pPr> of the section's last paragraph -->
<w:sectPr>
  <w:pgNumType w:fmt="upperRoman" w:start="1"/>
</w:sectPr>
<!-- New section: the final sectPr is the last child of <w:body> -->
<w:sectPr>
  <w:pgNumType w:fmt="decimal" w:start="1"/>
</w:sectPr>
```

### Different Headers Per Section

Each section in a .docx can have its own header/footer. See `references/docx-js-advanced.md` for the multi-section approach.

For existing documents, modify `word/document.xml` to split `<w:sectPr>` and create separate `headerN.xml` files.

### Font Size Conversion

When user requests a Chinese font size name:

| Request | Action |
|---------|--------|
| "Change to Wu Hao (5th) size" | `font.size = Pt(10.5)` or `size: 21` in docx-js |
| "Title in San Hao SimHei" | `font.size = Pt(16)`, `font.name = "SimHei"` |
| "Body in Xiao Si SimSun" | `font.size = Pt(12)`, `font.name = "SimSun"` |
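
The conversion above generalizes to a lookup table. The sketch below lists the common named sizes with their point values and derives the half-point value that docx-js `size` expects; the dictionary and helper names are illustrative, not part of the skill's scripts:

```python
# Named Chinese font sizes and their point equivalents.
CN_FONT_SIZES_PT = {
    "初号": 42, "小初": 36,
    "一号": 26, "小一": 24,
    "二号": 22, "小二": 18,
    "三号": 16, "小三": 15,
    "四号": 14, "小四": 12,
    "五号": 10.5, "小五": 9,
}

def to_points(name: str) -> float:
    """Point size for python-docx, e.g. Pt(to_points("小四"))."""
    return CN_FONT_SIZES_PT[name]

def to_half_points(name: str) -> int:
    """Half-point size for docx-js `size` and for OOXML w:sz."""
    return round(CN_FONT_SIZES_PT[name] * 2)

for name in ("五号", "三号", "小四"):
    print(name, to_points(name), to_half_points(name))
```

An unknown name raises `KeyError`, which is preferable to silently falling back to a default size.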

### Line Spacing Adjustment

```python
from docx.enum.text import WD_LINE_SPACING
from docx.shared import Pt

# 1.0x spacing
para.paragraph_format.line_spacing_rule = WD_LINE_SPACING.MULTIPLE
para.paragraph_format.line_spacing = 1.0

# 1.3x spacing (our default)
para.paragraph_format.line_spacing = 1.3

# Fixed spacing (e.g., 28pt)
para.paragraph_format.line_spacing_rule = WD_LINE_SPACING.EXACTLY
para.paragraph_format.line_spacing = Pt(28)
```
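
When the same spacing has to be written in raw XML or in docx-js (`spacing: { line: ... }`), the value is expressed differently: multiples are stored in 240ths of a line, exact values in twentieths of a point. A sketch of the conversion, with a hypothetical helper name:

```python
def line_value(multiple=None, exact_pt=None):
    """Return (w:line value, w:lineRule) for a spacing multiple or an exact size in points."""
    if (multiple is None) == (exact_pt is None):
        raise ValueError("pass exactly one of multiple or exact_pt")
    if multiple is not None:
        return round(multiple * 240), "auto"   # 240 = single spacing
    return round(exact_pt * 20), "exact"       # twentieths of a point

print(line_value(multiple=1.3))
print(line_value(multiple=1.5))
print(line_value(exact_pt=28))
```

The 1.3x case gives 312, the same `line: 312` value used for body text in the docx-js examples.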

## Verification

After formatting changes:
1. Open in LibreOffice or convert to PDF for visual check
2. Extract text with pandoc to ensure content unchanged
3. Compare file sizes (formatting-only changes shouldn't dramatically change size)
114
skills/docx/routes/read.md
Executable file
@@ -0,0 +1,114 @@
# Route: Read / Analyze / Extract

## Method 1: Text Extraction via pandoc (Fastest)

```bash
# Plain text
pandoc input.docx -t plain -o output.txt

# Markdown (preserves structure)
pandoc input.docx -t markdown -o output.md

# Extract with metadata
pandoc input.docx -t markdown --standalone -o output.md
```

**Best for**: Quick content reading, text analysis, word count, searching.

## Method 2: Raw XML Access (Detailed)

```bash
mkdir work && cd work && unzip ../input.docx

# Read main content
cat word/document.xml

# Read styles
cat word/styles.xml

# List embedded media
ls word/media/

# Read headers/footers
cat word/header1.xml
cat word/footer1.xml
```

**Best for**: Analyzing formatting, extracting styles, inspecting document structure, debugging layout issues.

### Quick XML Parsing

```python
import defusedxml.ElementTree as ET

tree = ET.parse("word/document.xml")
ns = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
W = f"{{{ns['w']}}}"  # Clark-notation prefix, so f"{W}t" is the qualified w:t tag

# Extract all text
texts = []
for t in tree.iter(f"{W}t"):
    if t.text:
        texts.append(t.text)
full_text = "".join(texts)

# Count paragraphs
paras = tree.findall(".//w:p", ns)
print(f"Paragraphs: {len(paras)}")

# Find headings
for para in paras:
    pPr = para.find("w:pPr", ns)
    if pPr is not None:
        pStyle = pPr.find("w:pStyle", ns)
        # Read the style id once; nesting quotes inside the f-string fails before Python 3.12
        style = pStyle.get(f"{W}val", "") if pStyle is not None else ""
        if "Heading" in style:
            text = "".join(t.text for t in para.iter(f"{W}t") if t.text)
            print(f"  {style}: {text}")
```

## Method 3: Convert to Images (Visual Analysis)

```bash
# Convert to PDF first
libreoffice --headless --convert-to pdf input.docx

# Then to images
pdftoppm -png -r 200 input.pdf page

# Generates page-1.png, page-2.png, etc.
# (numbers are zero-padded for longer PDFs, e.g. page-01.png)
```

**Best for**: Visual layout analysis, comparing formatting, generating previews, when user asks "what does it look like".

## Method 4: python-docx Reading

```python
from docx import Document

doc = Document("input.docx")

# Read paragraphs
for para in doc.paragraphs:
    print(f"[{para.style.name}] {para.text}")

# Read tables
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])

# Document properties
print(f"Sections: {len(doc.sections)}")
print(f"Paragraphs: {len(doc.paragraphs)}")
print(f"Tables: {len(doc.tables)}")
```

## Choosing the Right Method

| Need | Method |
|------|--------|
| Quick text content | pandoc |
| Document structure/outline | pandoc → markdown |
| Formatting details | Raw XML |
| Table data extraction | python-docx |
| Visual appearance | Convert to images |
| Style analysis | Raw XML (styles.xml) |
| Word/character count | pandoc → plain → wc |
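
When pandoc is not available, a rough word and character count can be taken directly from the package without unpacking it. This is a standard-library sketch with hypothetical helper names; it counts whitespace-delimited words, which undercounts CJK text, and for untrusted files `defusedxml` is the safer parser:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def paragraph_texts(source) -> list:
    """Return the text of every <w:p> in word/document.xml, in document order."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return ["".join(t.text or "" for t in p.iter(f"{W}t")) for p in root.iter(f"{W}p")]

def count_words_and_chars(paragraphs) -> tuple:
    """Return (whitespace-delimited words, non-whitespace characters)."""
    text = "\n".join(paragraphs)
    return len(text.split()), sum(1 for ch in text if not ch.isspace())

# Demonstration on a tiny in-memory package containing only the main part.
document_xml = (
    f'<w:document xmlns:w="{W[1:-1]}"><w:body>'
    "<w:p><w:r><w:t>Hello brave </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>第二段</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", document_xml)

paragraphs = paragraph_texts(buf)
print(paragraphs, count_words_and_chars(paragraphs))
```

For a real document, pass the file path, e.g. `paragraph_texts("input.docx")`.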