2026-06-06 05:21:10 +00:00

Route: Read / Analyze / Extract

Method 1: Text Extraction via pandoc (Fastest)

# Plain text
pandoc input.docx -t plain -o output.txt

# Markdown (preserves structure)
pandoc input.docx -t markdown -o output.md

# Markdown with document metadata (title, author) in a leading metadata block
pandoc input.docx -t markdown --standalone -o output.md

Best for: Quick content reading, text analysis, word count, searching.
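For the word-count and search use cases, the pandoc call can be wrapped in a few lines of Python. This is a minimal sketch, assuming the `pandoc` executable is on the PATH; the helper names `docx_to_plain`, `count_words`, and `find_lines` are illustrative and not part of pandoc.

```python
import subprocess


def docx_to_plain(path: str) -> str:
    """Return the plain-text rendering of a .docx via pandoc.

    Assumes the `pandoc` executable is installed and on PATH.
    """
    result = subprocess.run(
        ["pandoc", path, "-t", "plain"],  # no -o, so pandoc writes to stdout
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pandoc fails
    )
    return result.stdout


def count_words(text: str) -> int:
    """Count whitespace-separated tokens, similar to `wc -w`."""
    return len(text.split())


def find_lines(text: str, needle: str) -> list[str]:
    """Return the lines that contain `needle`, ignoring case."""
    needle = needle.lower()
    return [line for line in text.splitlines() if needle in line.lower()]
```

Typical use would be `count_words(docx_to_plain("input.docx"))`, which mirrors piping the plain output through `wc -w`.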

Method 2: Raw XML Access (Detailed)

mkdir work && cd work && unzip ../input.docx

# Read main content
cat word/document.xml

# Read styles
cat word/styles.xml

# List embedded media
ls word/media/

# Read headers/footers
cat word/header1.xml
cat word/footer1.xml

Best for: Analyzing formatting, extracting styles, inspecting document structure, debugging layout issues.
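Because a .docx file is a ZIP archive, the same parts can also be inspected without unpacking anything to disk. A minimal sketch using only the standard-library `zipfile` module; `list_parts` and `read_part` are hypothetical helper names, and the decoding step assumes the part is UTF-8.

```python
import zipfile
from typing import Optional


def list_parts(docx_path: str, prefix: str = "") -> list[str]:
    """List archive members, optionally filtered by a prefix such as 'word/media/'."""
    with zipfile.ZipFile(docx_path) as zf:
        return sorted(name for name in zf.namelist() if name.startswith(prefix))


def read_part(docx_path: str, name: str) -> Optional[str]:
    """Return one XML part as text, or None if the archive has no such part."""
    with zipfile.ZipFile(docx_path) as zf:
        if name not in zf.namelist():
            return None
        return zf.read(name).decode("utf-8")  # assumption: part is UTF-8 encoded
```

Returning `None` for a missing part matters for `word/header1.xml` and `word/footer1.xml`, which are only present when the document actually defines a header or footer.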

Quick XML Parsing

import defusedxml.ElementTree as ET

tree = ET.parse("word/document.xml")
ns = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# Extract all text
texts = []
for t in tree.iter("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t"):
    if t.text:
        texts.append(t.text)
full_text = "".join(texts)

# Count paragraphs
paras = tree.findall(".//w:p", ns)
print(f"Paragraphs: {len(paras)}")

# Find headings (paragraphs whose style ID contains "Heading")
W_VAL = f"{{{ns['w']}}}val"  # fully qualified name of the w:val attribute
for para in paras:
    pPr = para.find("w:pPr", ns)
    if pPr is None:
        continue
    pStyle = pPr.find("w:pStyle", ns)
    style_id = pStyle.get(W_VAL, "") if pStyle is not None else ""
    if "Heading" in style_id:
        text = "".join(t.text for t in para.iter(f"{{{ns['w']}}}t") if t.text)
        print(f"  {style_id}: {text}")
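Joining every w:t run with an empty string drops paragraph boundaries, so the text of adjacent paragraphs runs together. A minimal sketch that keeps one string per paragraph instead; `paragraph_texts` is a hypothetical helper and the inline XML is a made-up sample. It uses the standard-library parser only because the sample is trusted; for real files, defusedxml remains the safer choice.

```python
import xml.etree.ElementTree as ET  # trusted inline sample only; prefer defusedxml for real files

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_texts(root) -> list[str]:
    """Return the text of each w:p element, in document order."""
    out = []
    for para in root.iter(f"{W}p"):
        # Concatenate the runs inside this paragraph only.
        out.append("".join(t.text for t in para.iter(f"{W}t") if t.text))
    return out


# Made-up sample: two paragraphs, the first split across two runs.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

root = ET.fromstring(SAMPLE)
print("\n".join(paragraph_texts(root)))
```

One known limitation of this sketch: paragraphs nested inside another paragraph (for example in a text box) would have their text reported twice, once for the outer paragraph and once for the inner one.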

Method 3: Convert to Images (Visual Analysis)

# Convert to PDF first
libreoffice --headless --convert-to pdf input.docx

# Then to images
pdftoppm -png -r 200 input.pdf page

# Generates page-1.png, page-2.png, etc. (numbers are zero-padded for documents of 10+ pages, e.g. page-01.png)

Best for: Visual layout analysis, comparing formatting, generating previews, and cases where the user asks "what does it look like?"
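The two-step conversion can be scripted. A minimal sketch, assuming `libreoffice` and `pdftoppm` are both on the PATH; the function names and the `out_dir` layout are illustrative choices, and `--outdir` is used so the intermediate PDF lands next to the images.

```python
import subprocess
from pathlib import Path


def build_commands(docx_path: str, out_dir: str, dpi: int = 200) -> list[list[str]]:
    """Build the LibreOffice and pdftoppm command lines without running them."""
    docx = Path(docx_path)
    out = Path(out_dir)
    pdf = out / (docx.stem + ".pdf")  # LibreOffice names the PDF after the input file
    return [
        ["libreoffice", "--headless", "--convert-to", "pdf", "--outdir", str(out), str(docx)],
        ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(out / "page")],
    ]


def docx_to_images(docx_path: str, out_dir: str, dpi: int = 200) -> list[Path]:
    """Run both steps and return the generated PNG paths in page order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(docx_path, out_dir, dpi):
        subprocess.run(cmd, check=True)  # stop at the first failing step
    pages = Path(out_dir).glob("page-*.png")
    # Sort numerically so the order is right with or without zero-padding.
    return sorted(pages, key=lambda p: int(p.stem.rsplit("-", 1)[1]))
```

Keeping command construction separate from execution makes it easy to log or inspect the exact commands before anything is run.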

Method 4: python-docx Reading

from docx import Document

doc = Document("input.docx")

# Read paragraphs
for para in doc.paragraphs:
    print(f"[{para.style.name}] {para.text}")

# Read tables
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])

# Document statistics (element counts)
print(f"Sections: {len(doc.sections)}")
print(f"Paragraphs: {len(doc.paragraphs)}")
print(f"Tables: {len(doc.tables)}")
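For table data extraction, rows are often easier to work with as records keyed by the header row. A minimal sketch, assuming the first row of the table holds the column names; `table_to_records` is a hypothetical helper that relies only on the `rows`, `cells`, and `text` attributes used above.

```python
def table_to_records(table) -> list[dict[str, str]]:
    """Convert a python-docx table into dicts keyed by the first row's cell text.

    Assumes the first row is a header row; a table with no rows yields [].
    """
    rows = [[cell.text.strip() for cell in row.cells] for row in table.rows]
    if not rows:
        return []
    header, *body = rows
    # Pair each data row with the header names, column by column.
    return [dict(zip(header, row)) for row in body]
```

With a document loaded as above, `table_to_records(doc.tables[0])` would return one dict per data row. Duplicate header names overwrite each other in the dict, so tables with repeated or merged header cells may need a different representation.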

Choosing the Right Method

| Need | Method |
| --- | --- |
| Quick text content | pandoc |
| Document structure/outline | pandoc → markdown |
| Formatting details | Raw XML |
| Table data extraction | python-docx |
| Visual appearance | Convert to images |
| Style analysis | Raw XML (styles.xml) |
| Word/character count | pandoc → plain → wc |