197 lines
9.1 KiB
Markdown
Executable File
197 lines
9.1 KiB
Markdown
Executable File
# Spreadsheet Integrity Pipeline
|
|
|
|
Every xlsx deliverable is built and verified through a role-based workflow. Three roles collaborate in sequence: **Blueprint Architect**, **Builder**, and **Inspector**. Each role has explicit responsibilities and handoff criteria.
|
|
|
|
---
|
|
|
|
## Tool Reference: xlsx.py
|
|
|
|
All commands: `python3 "$XLSX_SKILL_DIR/xlsx.py" <command> [arguments]`
|
|
|
|
| Command | Purpose | Called By |
|
|
|---------|---------|-----------|
|
|
| `recalc <file>` | Recalculate formulas via LibreOffice, scan for errors | Builder (self-check) |
|
|
| `audit <file>` | Deep formula error scan + zero-value + implicit array detection | Builder (self-check) |
|
|
| `scan <file>` | Detect out-of-range, header-included, small-aggregate, inconsistent patterns | Builder (self-check) |
|
|
| `inspect <file> --pretty` | Get sheet structure, data ranges, headers (JSON) | Blueprint Architect |
|
|
| `pivot <in> <out> --source --values [--rows --cols --filters --style --chart]` | Create PivotTable | Builder (final step only) |
|
|
| `chart-verify <file>` | Verify embedded charts have data | Builder (self-check) |
|
|
| `validate <file>` | Structural validation (release gate) | Inspector |
|
|
|
|
---
|
|
|
|
## Role 1: Blueprint Architect
|
|
|
|
Before any code runs, the Architect produces a build plan:
|
|
|
|
- **Decompose the request**: separate explicit requirements from implicit business context
|
|
- **Map every sheet**: name, column structure, formula dependencies, cross-references
|
|
- **Identify data flow**: which sheets feed into which (source → derived → summary)
|
|
- **Flag ambiguity**: if the request is unclear, ask — don't guess
|
|
|
|
The Architect's output is a mental blueprint. No files are created yet.
|
|
|
|
---
|
|
|
|
## Role 2: Builder
|
|
|
|
The Builder writes code and produces the workbook. The Builder operates under a strict **single-sheet discipline**: complete one sheet fully, verify it, then move on.
|
|
|
|
### Build Cycle (per sheet)
|
|
|
|
```
|
|
┌─────────────────────────────────────────────┐
|
|
│ Write sheet (data, formulas, styling, charts) │
|
|
│ ↓ │
|
|
│ Save workbook to disk │
|
|
│ ↓ │
|
|
│ Self-check chain: │
|
|
│ recalc → audit → scan │
|
|
│ + chart-verify (if sheet has charts) │
|
|
│ ↓ │
|
|
│ All clear? ──Yes──→ Proceed to next sheet │
|
|
│ │ │
|
|
│ No │
|
|
│ ↓ │
|
|
│ Fix errors → re-save → re-run self-check │
|
|
│ (loop until clean) │
|
|
└─────────────────────────────────────────────┘
|
|
```
|
|
|
|
### Builder Constraints
|
|
|
|
- **No batch-then-check**: you cannot create all sheets first and verify later. Errors in early sheets propagate silently into later sheets.
|
|
- **No error forwarding**: a sheet with unresolved errors blocks all subsequent work.
|
|
- **No silent delivery**: a file that hasn't passed self-check is not a deliverable — it's a draft.
|
|
|
|
### Pivot Tables — Special Sequencing
|
|
|
|
PivotTables depend on finalized source data. They are always the **last data operation**:
|
|
|
|
```bash
|
|
python3 "$XLSX_SKILL_DIR/xlsx.py" inspect input.xlsx --pretty # understand structure
|
|
python3 "$XLSX_SKILL_DIR/xlsx.py" pivot input.xlsx output.xlsx \
|
|
--source "Sheet!A1:F100" \
|
|
--values "Revenue:sum,Units:count" \
|
|
--rows "Product,Region" \
|
|
--cols "Quarter" \
|
|
--filters "Year" \
|
|
--location "Summary!A3" \
|
|
--style "finance" \
|
|
--chart "bar"
|
|
```
|
|
|
|
Aggregations: `sum`, `count`, `average`/`avg`, `max`, `min`
|
|
Chart types: `bar` (default), `line`, `pie`
|
|
Styles: `monochrome` (default), `finance`
|
|
|
|
**Never modify pivot output with openpyxl afterward** — it corrupts the pivotCache.
|
|
|
|
---
|
|
|
|
## Role 3: Inspector
|
|
|
|
The Inspector runs after all sheets are built. Two levels of inspection: **Semantic** and **Structural**.
|
|
|
|
### Semantic Inspection (for edit/transform tasks)
|
|
|
|
When the task involves transforming existing data (not creating from scratch), verify the transformation didn't corrupt meaning:
|
|
|
|
| Check | Method |
|
|
|-------|--------|
|
|
| **Row count** | Does output have the expected number of rows? (e.g., grouping 15 rows by 5 keys → expect 5 rows) |
|
|
| **Column totals** | Do numeric sums in output match source? (or expected transformation) |
|
|
| **Spot-check** | Compare 2-3 specific rows between source and output |
|
|
| **Formula evaluability** | Can formulas be verified in Python? If self-referencing or cross-sheet, verify computed values instead |
|
|
|
|
```python
|
|
# Semantic verification template
|
|
source_total = sum(normalize_cell_value(ws_src.cell(row=r, column=c).value) or 0
|
|
for r in range(start, end + 1))
|
|
output_total = sum(normalize_cell_value(ws_out.cell(row=r, column=c).value) or 0
|
|
for r in range(out_start, out_end + 1))
|
|
assert abs(source_total - output_total) < 0.01, f"Total mismatch: {source_total} vs {output_total}"
|
|
```
|
|
|
|
### Structural Inspection (release gate)
|
|
|
|
```bash
|
|
python3 "$XLSX_SKILL_DIR/xlsx.py" validate output.xlsx
|
|
```
|
|
|
|
- Exit 0 → file is releasable
|
|
- Non-zero → Builder must regenerate from scratch with corrected code
|
|
|
|
---
|
|
|
|
## Known Traps & Countermeasures
|
|
|
|
These are recurring failure modes. The Builder must internalize them.
|
|
|
|
| Trap | What Goes Wrong | Countermeasure |
|
|
|------|----------------|----------------|
|
|
| `data_only=True` then save | Formulas permanently replaced with cached values | Never save after opening with `data_only=True` |
|
|
| Column index miscalculation | col 64 ≠ "BK" | Always use `openpyxl.utils.get_column_letter()` |
|
|
| Row offset confusion | DataFrame row 5 = Excel row 6 | Excel is 1-indexed, pandas is 0-indexed |
|
|
| NaN leaks into formulas | `=A1+nan` → broken formula string | Check `pd.notna()` before referencing |
|
|
| Cross-sheet reference typo | `Sheet1!A1` vs `'Sheet 1'!A1` | Quote sheet names containing spaces |
|
|
| Division by zero | `#DIV/0!` in Excel | Wrap with `IFERROR()` or `IF(denom=0,...)` |
|
|
| Text starting with `=` | `#NAME?` error | Prefix descriptive text with `'` |
|
|
| Implicit array formula | `#N/A` in Excel | Avoid `MATCH(TRUE(),range>0,0)`, use `SUMPRODUCT` |
|
|
| Chart renders blank | Formula cells have no cached values | Run `recalc` before creating charts |
|
|
| Hidden rows → empty chart | Chart skips hidden data | Set `chart.plot_visible_only = False` |
|
|
| Overlapping charts | Multiple charts stacked on same cells | Calculate anchor: ~15 rows per chart + 2 rows gap |
|
|
| Verify newly-written formulas with `data_only=True` → get `None` | openpyxl doesn't evaluate formulas; `data_only=True` only reads Excel's cached values which don't exist for new formulas | Compute expected values in Python and compare directly. For TOTAL rows needing verification, write computed values (see SKILL.md Design Principle #1 Exception) |
|
|
| Manual row sort breaks references | Value-swap sorting doesn't update formula references | After sorting by swapping data, regenerate all formula strings with updated row numbers |
|
|
| NBSP (`\xa0`) treated as non-empty | Cells containing `\xa0` or `\u200b` look blank but fail `is None` | Normalize: `\xa0`, `\u200b`, whitespace-only → `None` before comparison or aggregation |
|
|
|
|
---
|
|
|
|
## Cross-Validation Review Sheet
|
|
|
|
For analysis-heavy deliverables, embed a self-checking Review sheet in the workbook.
|
|
|
|
### When Required
|
|
|
|
- Deliverables with computed metrics or aggregated data
|
|
- Financial models with cross-sheet references
|
|
- Data sourced from external APIs or web searches
|
|
|
|
### Structure
|
|
|
|
```python
|
|
review_ws = wb.create_sheet("Review")
|
|
review_ws.sheet_properties.tabColor = "FFC000" # amber tab
|
|
|
|
checks = [
|
|
["Check", "Expected", "Actual", "Status"],
|
|
["Total Revenue", "=SUM(Data!B2:B100)", "=Summary!B10", '=IF(B2=C2,"✓ PASS","✗ FAIL")'],
|
|
["Row Count", "=COUNTA(Data!A:A)-1", "=Summary!B3", '=IF(B3=C3,"✓ PASS","✗ FAIL")'],
|
|
["Grand Total Match", "=Detail!F50", "=Dashboard!C5", '=IF(B4=C4,"✓ PASS","✗ FAIL")'],
|
|
]
|
|
for i, row in enumerate(checks, 1):
|
|
for j, val in enumerate(row, 1):
|
|
review_ws.cell(row=i, column=j, value=val)
|
|
```
|
|
|
|
### Rules
|
|
|
|
- Every Summary/Dashboard metric must have a cross-check formula back to source data
|
|
- Status column uses live formulas — green if correct, red if mismatch
|
|
- Review is the **last sheet** in the workbook (before Sources, if present)
|
|
|
|
---
|
|
|
|
## Release Checklist
|
|
|
|
Before handing the file to the user:
|
|
|
|
- [ ] Every sheet passed the Builder's self-check chain
|
|
- [ ] Semantic inspection passed (if applicable)
|
|
- [ ] `validate` returned exit code 0
|
|
- [ ] All temp files, drafts, and retry artifacts removed
|
|
- [ ] If multiple versions exist from retries, only the latest correct version remains
|
|
- [ ] Every remaining file in the output directory is an expected deliverable
|
|
- [ ] **VBA check** (if `.xlsm`): VBA modules preserved, no unintended macro removal
|
|
- [ ] **VBA security** (if VBA generated): passes security checklist in `scenes/vba.md`
|