Files
mantle-ai-trader/skills/xlsx/quality/pipeline.md
2026-06-06 05:21:10 +00:00

9.1 KiB
Executable File

Spreadsheet Integrity Pipeline

Every xlsx deliverable is built and verified through a role-based workflow. Three roles collaborate in sequence: Blueprint Architect, Builder, and Inspector. Each role has explicit responsibilities and handoff criteria.


Tool Reference: xlsx.py

All commands: python3 "$XLSX_SKILL_DIR/xlsx.py" <command> [arguments]

Command Purpose Called By
recalc <file> Recalculate formulas via LibreOffice, scan for errors Builder (self-check)
audit <file> Deep formula error scan + zero-value + implicit array detection Builder (self-check)
scan <file> Detect out-of-range, header-included, small-aggregate, inconsistent patterns Builder (self-check)
inspect <file> --pretty Get sheet structure, data ranges, headers (JSON) Blueprint Architect
pivot <in> <out> --source --values [--rows --cols --filters --style --chart] Create PivotTable Builder (final step only)
chart-verify <file> Verify embedded charts have data Builder (self-check)
validate <file> Structural validation (release gate) Inspector

Role 1: Blueprint Architect

Before any code runs, the Architect produces a build plan:

  • Decompose the request: separate explicit requirements from implicit business context
  • Map every sheet: name, column structure, formula dependencies, cross-references
  • Identify data flow: which sheets feed into which (source → derived → summary)
  • Flag ambiguity: if the request is unclear, ask — don't guess

The Architect's output is a mental blueprint. No files are created yet.


Role 2: Builder

The Builder writes code and produces the workbook. The Builder operates under a strict single-sheet discipline: complete one sheet fully, verify it, then move on.

Build Cycle (per sheet)

┌─────────────────────────────────────────────┐
│  Write sheet (data, formulas, styling, charts)  │
│                    ↓                            │
│  Save workbook to disk                          │
│                    ↓                            │
│  Self-check chain:                              │
│    recalc → audit → scan                        │
│    + chart-verify (if sheet has charts)          │
│                    ↓                            │
│  All clear? ──Yes──→ Proceed to next sheet      │
│       │                                         │
│      No                                         │
│       ↓                                         │
│  Fix errors → re-save → re-run self-check       │
│  (loop until clean)                             │
└─────────────────────────────────────────────┘

Builder Constraints

  • No batch-then-check: you cannot create all sheets first and verify later. Errors in early sheets propagate silently into later sheets.
  • No error forwarding: a sheet with unresolved errors blocks all subsequent work.
  • No silent delivery: a file that hasn't passed self-check is not a deliverable — it's a draft.

Pivot Tables — Special Sequencing

PivotTables depend on finalized source data. They are always the last data operation:

python3 "$XLSX_SKILL_DIR/xlsx.py" inspect input.xlsx --pretty   # understand structure
python3 "$XLSX_SKILL_DIR/xlsx.py" pivot input.xlsx output.xlsx \
    --source "Sheet!A1:F100" \
    --values "Revenue:sum,Units:count" \
    --rows "Product,Region" \
    --cols "Quarter" \
    --filters "Year" \
    --location "Summary!A3" \
    --style "finance" \
    --chart "bar"

Aggregations: sum, count, average/avg, max, min Chart types: bar (default), line, pie Styles: monochrome (default), finance

Never modify pivot output with openpyxl afterward — it corrupts the pivotCache.


Role 3: Inspector

The Inspector runs after all sheets are built. Two levels of inspection: Semantic and Structural.

Semantic Inspection (for edit/transform tasks)

When the task involves transforming existing data (not creating from scratch), verify the transformation didn't corrupt meaning:

Check Method
Row count Does output have the expected number of rows? (e.g., grouping 15 rows by 5 keys → expect 5 rows)
Column totals Do numeric sums in output match source? (or expected transformation)
Spot-check Compare 2-3 specific rows between source and output
Formula evaluability Can formulas be verified in Python? If self-referencing or cross-sheet, verify computed values instead
# Semantic verification template
source_total = sum(normalize_cell_value(ws_src.cell(row=r, column=c).value) or 0
                   for r in range(start, end + 1))
output_total = sum(normalize_cell_value(ws_out.cell(row=r, column=c).value) or 0
                   for r in range(out_start, out_end + 1))
assert abs(source_total - output_total) < 0.01, f"Total mismatch: {source_total} vs {output_total}"

Structural Inspection (release gate)

python3 "$XLSX_SKILL_DIR/xlsx.py" validate output.xlsx
  • Exit 0 → file is releasable
  • Non-zero → Builder must regenerate from scratch with corrected code

Known Traps & Countermeasures

These are recurring failure modes. The Builder must internalize them.

Trap What Goes Wrong Countermeasure
data_only=True then save Formulas permanently replaced with cached values Never save after opening with data_only=True
Column index miscalculation col 64 ≠ "BK" Always use openpyxl.utils.get_column_letter()
Row offset confusion DataFrame row 5 = Excel row 6 Excel is 1-indexed, pandas is 0-indexed
NaN leaks into formulas =A1+nan → broken formula string Check pd.notna() before referencing
Cross-sheet reference typo Sheet1!A1 vs 'Sheet 1'!A1 Quote sheet names containing spaces
Division by zero #DIV/0! in Excel Wrap with IFERROR() or IF(denom=0,...)
Text starting with = #NAME? error Prefix descriptive text with '
Implicit array formula #N/A in Excel Avoid MATCH(TRUE(),range>0,0), use SUMPRODUCT
Chart renders blank Formula cells have no cached values Run recalc before creating charts
Hidden rows → empty chart Chart skips hidden data Set chart.plot_visible_only = False
Overlapping charts Multiple charts stacked on same cells Calculate anchor: ~15 rows per chart + 2 rows gap
Verify newly-written formulas with data_only=True → get None openpyxl doesn't evaluate formulas; data_only=True only reads Excel's cached values which don't exist for new formulas Compute expected values in Python and compare directly. For TOTAL rows needing verification, write computed values (see SKILL.md Design Principle #1 Exception)
Manual row sort breaks references Value-swap sorting doesn't update formula references After sorting by swapping data, regenerate all formula strings with updated row numbers
NBSP (\xa0) treated as non-empty Cells containing \xa0 or \u200b look blank but fail is None Normalize: \xa0, \u200b, whitespace-only → None before comparison or aggregation

Cross-Validation Review Sheet

For analysis-heavy deliverables, embed a self-checking Review sheet in the workbook.

When Required

  • Deliverables with computed metrics or aggregated data
  • Financial models with cross-sheet references
  • Data sourced from external APIs or web searches

Structure

review_ws = wb.create_sheet("Review")
review_ws.sheet_properties.tabColor = "FFC000"  # amber tab

checks = [
    ["Check", "Expected", "Actual", "Status"],
    ["Total Revenue", "=SUM(Data!B2:B100)", "=Summary!B10", '=IF(B2=C2,"✓ PASS","✗ FAIL")'],
    ["Row Count", "=COUNTA(Data!A:A)-1", "=Summary!B3", '=IF(B3=C3,"✓ PASS","✗ FAIL")'],
    ["Grand Total Match", "=Detail!F50", "=Dashboard!C5", '=IF(B4=C4,"✓ PASS","✗ FAIL")'],
]
for i, row in enumerate(checks, 1):
    for j, val in enumerate(row, 1):
        review_ws.cell(row=i, column=j, value=val)

Rules

  • Every Summary/Dashboard metric must have a cross-check formula back to source data
  • Status column uses live formulas — green if correct, red if mismatch
  • Review is the last sheet in the workbook (before Sources, if present)

Release Checklist

Before handing the file to the user:

  • Every sheet passed the Builder's self-check chain
  • Semantic inspection passed (if applicable)
  • validate returned exit code 0
  • All temp files, drafts, and retry artifacts removed
  • If multiple versions exist from retries, only the latest correct version remains
  • Every remaining file in the output directory is an expected deliverable
  • VBA check (if .xlsm): VBA modules preserved, no unintended macro removal
  • VBA security (if VBA generated): passes security checklist in scenes/vba.md