Spreadsheet Integrity Pipeline
Every xlsx deliverable is built and verified through a role-based workflow. Three roles collaborate in sequence: Blueprint Architect, Builder, and Inspector. Each role has explicit responsibilities and handoff criteria.
Tool Reference: xlsx.py
All commands: python3 "$XLSX_SKILL_DIR/xlsx.py" <command> [arguments]
| Command | Purpose | Called By |
|---|---|---|
| recalc <file> | Recalculate formulas via LibreOffice, scan for errors | Builder (self-check) |
| audit <file> | Deep formula error scan + zero-value + implicit array detection | Builder (self-check) |
| scan <file> | Detect out-of-range, header-included, small-aggregate, inconsistent patterns | Builder (self-check) |
| inspect <file> --pretty | Get sheet structure, data ranges, headers (JSON) | Blueprint Architect |
| pivot <in> <out> --source --values [--rows --cols --filters --style --chart] | Create PivotTable | Builder (final step only) |
| chart-verify <file> | Verify embedded charts have data | Builder (self-check) |
| validate <file> | Structural validation (release gate) | Inspector |
Role 1: Blueprint Architect
Before any code runs, the Architect produces a build plan:
- Decompose the request: separate explicit requirements from implicit business context
- Map every sheet: name, column structure, formula dependencies, cross-references
- Identify data flow: which sheets feed into which (source → derived → summary)
- Flag ambiguity: if the request is unclear, ask — don't guess
The Architect's output is a mental blueprint. No files are created yet.
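The data-flow step can be sketched in a few lines. Assuming the blueprint is noted down as a mapping from each sheet to the sheets its formulas read from, the standard-library graphlib module (Python 3.9+) gives a build order in which sources come before derived and summary sheets. The sheet names and the build_order helper are hypothetical illustrations, not part of xlsx.py.

```python
from graphlib import TopologicalSorter

# Hypothetical blueprint: each sheet maps to the sheets its formulas read from.
sheet_dependencies = {
    "Data": set(),
    "Detail": {"Data"},
    "Summary": {"Detail"},
    "Dashboard": {"Summary", "Detail"},
}


def build_order(dependencies):
    """Return sheet names so that every sheet follows the sheets it depends on."""
    # TopologicalSorter raises graphlib.CycleError if sheets reference each other in a loop.
    return list(TopologicalSorter(dependencies).static_order())


order = build_order(sheet_dependencies)
print(order)  # ['Data', 'Detail', 'Summary', 'Dashboard']
```

A cycle error at this stage is useful in itself: it signals a circular cross-sheet reference before any file exists.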
Role 2: Builder
The Builder writes code and produces the workbook. The Builder operates under a strict single-sheet discipline: complete one sheet fully, verify it, then move on.
Build Cycle (per sheet)
┌────────────────────────────────────────────────┐
│  Write sheet (data, formulas, styling, charts) │
│      ↓                                         │
│  Save workbook to disk                         │
│      ↓                                         │
│  Self-check chain:                             │
│    recalc → audit → scan                       │
│    + chart-verify (if sheet has charts)        │
│      ↓                                         │
│  All clear? ──Yes──→ Proceed to next sheet     │
│      │                                         │
│      No                                        │
│      ↓                                         │
│  Fix errors → re-save → re-run self-check      │
│  (loop until clean)                            │
└────────────────────────────────────────────────┘
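The cycle above could be driven from Python roughly as follows. This is a minimal sketch, not the skill's own driver: it assumes each self-check command exits non-zero when it finds a problem (the source only states this for validate), and the helper names self_check_commands and first_failing_step are hypothetical.

```python
import os
import subprocess
import sys

SELF_CHECK_STEPS = ["recalc", "audit", "scan"]


def self_check_commands(workbook, has_charts=False, skill_dir=None):
    """Build (step, argv) pairs for the self-check chain, in order."""
    skill_dir = skill_dir or os.environ.get("XLSX_SKILL_DIR", ".")
    script = os.path.join(skill_dir, "xlsx.py")
    steps = SELF_CHECK_STEPS + (["chart-verify"] if has_charts else [])
    return [(step, [sys.executable, script, step, workbook]) for step in steps]


def first_failing_step(workbook, has_charts=False, skill_dir=None, run=None):
    """Run the chain; return the first step that fails, or None if all pass."""
    # ASSUMPTION: a non-zero exit code means the step found a problem.
    run = run or (lambda argv: subprocess.run(argv).returncode)
    for step, argv in self_check_commands(workbook, has_charts, skill_dir):
        if run(argv) != 0:
            return step  # stop here: fix, re-save, then re-run the whole chain
    return None
```

Stopping at the first failing step mirrors the no-error-forwarding rule: nothing later in the chain runs against a sheet that is already known to be broken.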
Builder Constraints
- No batch-then-check: you cannot create all sheets first and verify later. Errors in early sheets propagate silently into later sheets.
- No error forwarding: a sheet with unresolved errors blocks all subsequent work.
- No silent delivery: a file that hasn't passed self-check is not a deliverable — it's a draft.
Pivot Tables — Special Sequencing
PivotTables depend on finalized source data. They are always the last data operation:
python3 "$XLSX_SKILL_DIR/xlsx.py" inspect input.xlsx --pretty # understand structure
python3 "$XLSX_SKILL_DIR/xlsx.py" pivot input.xlsx output.xlsx \
--source "Sheet!A1:F100" \
--values "Revenue:sum,Units:count" \
--rows "Product,Region" \
--cols "Quarter" \
--filters "Year" \
--location "Summary!A3" \
--style "finance" \
--chart "bar"
Aggregations: sum, count, average/avg, max, min
Chart types: bar (default), line, pie
Styles: monochrome (default), finance
Never modify pivot output with openpyxl afterward — it corrupts the pivotCache.
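Because a typo in --values only surfaces once the pivot command runs, a small pre-flight check on the argument string can help. The sketch below is a hypothetical helper, not the parser xlsx.py uses; it assumes every field carries an explicit aggregation, and whether the real command accepts a bare field name is not stated here.

```python
# Aggregation names listed above for the pivot command.
ALLOWED_AGGREGATIONS = {"sum", "count", "average", "avg", "max", "min"}


def parse_values_spec(spec):
    """Split 'Field:agg,Field:agg' into (field, aggregation) pairs."""
    pairs = []
    for item in spec.split(","):
        # rpartition keeps any earlier colons inside the field name.
        field, sep, agg = item.strip().rpartition(":")
        if not sep or not field.strip():
            raise ValueError(f"expected 'Field:aggregation', got {item!r}")
        agg = agg.strip().lower()
        if agg not in ALLOWED_AGGREGATIONS:
            raise ValueError(f"unknown aggregation {agg!r} in {item!r}")
        pairs.append((field.strip(), agg))
    return pairs


print(parse_values_spec("Revenue:sum,Units:count"))
# [('Revenue', 'sum'), ('Units', 'count')]
```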
Role 3: Inspector
The Inspector runs after all sheets are built. Two levels of inspection: Semantic and Structural.
Semantic Inspection (for edit/transform tasks)
When the task involves transforming existing data (not creating from scratch), verify the transformation didn't corrupt meaning:
| Check | Method |
|---|---|
| Row count | Does output have the expected number of rows? (e.g., grouping 15 rows by a key with 5 distinct values → expect 5 rows) |
| Column totals | Do numeric sums in output match source? (or expected transformation) |
| Spot-check | Compare 2-3 specific rows between source and output |
| Formula evaluability | Can formulas be verified in Python? If self-referencing or cross-sheet, verify computed values instead |
# Semantic verification template
# Placeholders to fill in: ws_src / ws_out (source and output worksheets),
# c (column index), start/end and out_start/out_end (row bounds).
source_total = sum(normalize_cell_value(ws_src.cell(row=r, column=c).value) or 0
                   for r in range(start, end + 1))
output_total = sum(normalize_cell_value(ws_out.cell(row=r, column=c).value) or 0
                   for r in range(out_start, out_end + 1))
assert abs(source_total - output_total) < 0.01, f"Total mismatch: {source_total} vs {output_total}"
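The template relies on normalize_cell_value, which is not defined in this section. One plausible shape, assuming its job is to map cells that only look blank (NBSP, zero-width space, whitespace-only) to None, is sketched below; the real helper may differ.

```python
def normalize_cell_value(value):
    """Map blank-looking text to None; return other values as-is (text is cleaned)."""
    if isinstance(value, str):
        # str.strip() already removes NBSP (\xa0) because Python treats it as whitespace;
        # the zero-width space (\u200b) is not whitespace, so it is removed explicitly.
        cleaned = value.replace("\u200b", "").strip()
        return cleaned or None
    return value


# Usage: blank-looking cells contribute 0 to a column total.
column = [10, "\xa0", None, 2.5, "\u200b"]
total = sum(normalize_cell_value(v) or 0 for v in column)
print(total)  # 12.5
```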
Structural Inspection (release gate)
python3 "$XLSX_SKILL_DIR/xlsx.py" validate output.xlsx
- Exit 0 → file is releasable
- Non-zero → Builder must regenerate from scratch with corrected code
Known Traps & Countermeasures
These are recurring failure modes. The Builder must internalize them.
| Trap | What Goes Wrong | Countermeasure |
|---|---|---|
| data_only=True then save | Formulas permanently replaced with cached values | Never save after opening with data_only=True |
| Column index miscalculation | col 64 ≠ "BK" (it is "BL") | Always use openpyxl.utils.get_column_letter() |
| Row offset confusion | DataFrame row 5 = Excel row 6 | Excel is 1-indexed, pandas is 0-indexed; add one more row for each header row |
| NaN leaks into formulas | =A1+nan → broken formula string | Check pd.notna() before referencing |
| Cross-sheet reference typo | Sheet1!A1 vs 'Sheet 1'!A1 | Quote sheet names containing spaces |
| Division by zero | #DIV/0! in Excel | Wrap with IFERROR() or IF(denom=0,...) |
| Text starting with = | #NAME? error | Prefix descriptive text with ' |
| Implicit array formula | #N/A in Excel | Avoid MATCH(TRUE(),range>0,0), use SUMPRODUCT |
| Chart renders blank | Formula cells have no cached values | Run recalc before creating charts |
| Hidden rows → empty chart | Chart skips hidden data | Set chart.plot_visible_only = False |
| Overlapping charts | Multiple charts stacked on same cells | Calculate anchor: ~15 rows per chart + 2 rows gap |
| Verifying newly-written formulas with data_only=True → get None | openpyxl doesn't evaluate formulas; data_only=True only reads Excel's cached values, which don't exist for new formulas | Compute expected values in Python and compare directly. For TOTAL rows needing verification, write computed values (see SKILL.md Design Principle #1 Exception) |
| Manual row sort breaks references | Value-swap sorting doesn't update formula references | After sorting by swapping data, regenerate all formula strings with updated row numbers |
| NBSP (\xa0) treated as non-empty | Cells containing \xa0 or \u200b look blank but fail an is None check | Normalize: \xa0, \u200b, whitespace-only → None before comparison or aggregation |
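Several of these countermeasures amount to building formula strings through small helpers instead of by hand. The sketch below uses only the standard library; all function names are hypothetical, and is_missing stands in for pd.notna() so the example runs without pandas.

```python
import math


def is_missing(value):
    """True for None and float NaN (stdlib stand-in for `not pd.notna(value)`)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def quote_sheet(name):
    """Quote a sheet name for a cross-sheet reference; Excel accepts quotes even when optional."""
    escaped = name.replace("'", "''")  # an apostrophe inside a name is doubled
    return f"'{escaped}'"


def cell_ref(sheet, column_letter, row):
    """Build a reference such as 'Sheet 1'!A1."""
    return f"{quote_sheet(sheet)}!{column_letter}{row}"


def safe_ratio_formula(numerator_ref, denominator_ref):
    """Division guarded against #DIV/0!: shows an empty string when the denominator is 0."""
    return f'=IF({denominator_ref}=0,"",{numerator_ref}/{denominator_ref})'


def excel_row(df_position, header_rows=1):
    """Convert a 0-based DataFrame position to a 1-based Excel row below the header rows."""
    return df_position + header_rows + 1


print(cell_ref("Sheet 1", "A", 1))     # 'Sheet 1'!A1
print(safe_ratio_formula("B2", "C2"))  # =IF(C2=0,"",B2/C2)
print(excel_row(5, header_rows=0))     # 6
print(excel_row(5))                    # 7
```

Routing every reference through one helper also makes the row-sort trap easier to handle, since regenerating formulas after a sort becomes a loop over the same functions with new row numbers.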
Cross-Validation Review Sheet
For analysis-heavy deliverables, embed a self-checking Review sheet in the workbook.
When Required
- Deliverables with computed metrics or aggregated data
- Financial models with cross-sheet references
- Data sourced from external APIs or web searches
Structure
# wb is the openpyxl Workbook being built
review_ws = wb.create_sheet("Review")
review_ws.sheet_properties.tabColor = "FFC000"  # amber tab
checks = [
    ["Check", "Expected", "Actual", "Status"],
    ["Total Revenue", "=SUM(Data!B2:B100)", "=Summary!B10", '=IF(B2=C2,"✓ PASS","✗ FAIL")'],
    ["Row Count", "=COUNTA(Data!A:A)-1", "=Summary!B3", '=IF(B3=C3,"✓ PASS","✗ FAIL")'],
    ["Grand Total Match", "=Detail!F50", "=Dashboard!C5", '=IF(B4=C4,"✓ PASS","✗ FAIL")'],
]
# Write the header and check rows starting at A1 (rows and columns are 1-indexed)
for i, row in enumerate(checks, 1):
    for j, val in enumerate(row, 1):
        review_ws.cell(row=i, column=j, value=val)
Rules
- Every Summary/Dashboard metric must have a cross-check formula back to source data
- Status column uses live formulas — green if correct, red if mismatch
- Review is the last sheet in the workbook, or the second-to-last when a Sources sheet follows it
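In a hand-written checks list, each Status formula hardcodes its own row number, which is easy to get wrong when a check is inserted or reordered. A minimal sketch of generating the rows instead, assuming the layout used above (Expected in column B, Actual in column C, header in row 1); build_review_rows is a hypothetical helper:

```python
def build_review_rows(checks):
    """Turn (label, expected_formula, actual_formula) triples into Review sheet rows."""
    rows = [["Check", "Expected", "Actual", "Status"]]
    # Data rows start at Excel row 2, directly under the header.
    for excel_row, (label, expected, actual) in enumerate(checks, start=2):
        status = f'=IF(B{excel_row}=C{excel_row},"✓ PASS","✗ FAIL")'
        rows.append([label, expected, actual, status])
    return rows


rows = build_review_rows([
    ("Total Revenue", "=SUM(Data!B2:B100)", "=Summary!B10"),
    ("Row Count", "=COUNTA(Data!A:A)-1", "=Summary!B3"),
])
print(rows[2][3])  # =IF(B3=C3,"✓ PASS","✗ FAIL")
```

The status formula compares with an exact =, as in the original structure. For metrics that pass through floating-point arithmetic, wrapping both sides in ROUND() may be the safer comparison.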
Release Checklist
Before handing the file to the user:
- Every sheet passed the Builder's self-check chain
- Semantic inspection passed (if applicable)
- validate returned exit code 0
- All temp files, drafts, and retry artifacts removed
- If multiple versions exist from retries, only the latest correct version remains
- Every remaining file in the output directory is an expected deliverable
- VBA check (if .xlsm): VBA modules preserved, no unintended macro removal
- VBA security (if VBA generated): passes security checklist in scenes/vba.md
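The file-hygiene items in this checklist can be checked mechanically. The sketch below lists anything in the output directory that is not a named deliverable; unexpected_files is a hypothetical helper, and it assumes deliverables sit directly in the directory rather than in subfolders.

```python
from pathlib import Path


def unexpected_files(output_dir, expected_names):
    """Return sorted names of files in output_dir that are not expected deliverables."""
    expected = set(expected_names)
    return sorted(
        entry.name
        for entry in Path(output_dir).iterdir()
        if entry.is_file() and entry.name not in expected
    )
```

An empty result means every remaining file is an expected deliverable; anything else is a leftover draft, temp file, or retry artifact to remove before release.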