Release v1.01 Enhanced: Vi Control, TUI Gen5, Core Stability
Documentation/iq_exchange_improvement_proposal.md
# IQ Exchange & Computer Use: Research & Improvement Proposal

## Executive Summary
The current IQ Exchange implementation in `opencode-ink.mjs` provides a basic retry loop but lacks a robust "Translation Layer" for converting natural language into precise computer actions. It currently relies on placeholder logic or simple string matching.

Research into state-of-the-art agents (Windows-Use, browser-use, OpenDevin) reveals that reliable agents use **structured translation layers** that map natural language to specific, hook-based APIs (Playwright, UIA) rather than fragile shell commands or pure vision.

This proposal outlines a plan to upgrade the IQ Exchange with a proper **AI Translation Layer** and a **Robust Execution Loop** inspired by these findings.

---

## 1. Analysis of Current Implementation

### Strengths
- **Retry Loop:** The `IQExchange` class has a solid retry mechanism with `maxRetries`.
- **Feedback Loop:** Captures stdout/stderr and feeds it back to the AI for self-healing.
- **Task Detection:** Simple regex-based detection for browser vs. desktop tasks.

### Weaknesses
- **Missing Translation Layer:** The `opencode-ink.mjs` file has a placeholder comment `// NEW: Computer Use Translation Layer` but no actual AI call to convert "Open Spotify and play jazz" into specific PowerShell/Playwright commands. It relies on the *main* chat response happening to contain the commands, which is unreliable.
- **Fragile Command Parsing:** `extractCommands` uses a regex to find fenced (\`\`\`) code blocks, which is hit-or-miss when the AI wraps the commands in extra prose.
- **No Structural Enforcement:** Nothing constrains the output to a known command set, so the AI is free to hallucinate commands or arguments.
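
One way to make extraction less sensitive to chatty replies is to scan fences line by line and keep only lines that match a known command prefix. The sketch below is illustrative only: `extractCommandLines` and `ALLOWED_PREFIX` are hypothetical names, not the existing `extractCommands` implementation, and the prefix is assumed from the example commands used elsewhere in this proposal. Anything outside a fence, or inside a fence but not matching the prefix, is dropped rather than executed.

```javascript
// Hypothetical allowlist: only lines that invoke the input script are accepted.
const ALLOWED_PREFIX = 'powershell bin/input.ps1 ';
// Built with repeat() so this example does not contain a literal fence.
const FENCE = '`'.repeat(3);

// Returns the command lines found inside fenced blocks of an AI reply.
function extractCommandLines(reply) {
  const commands = [];
  let inFence = false;
  for (const rawLine of reply.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith(FENCE)) {
      // An opening or closing fence toggles the state and is never a command.
      inFence = !inFence;
      continue;
    }
    if (inFence && line.startsWith(ALLOWED_PREFIX)) {
      commands.push(line);
    }
  }
  return commands;
}
```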

---

## 2. Research Findings & Inspiration

### A. Windows-Use (CursorTouch)
- **Key Insight:** Uses **native UI Automation (UIA)** hooks instead of just vision.
- **Relevance:** We should prefer `input.ps1` using UIA (via PowerShell .NET access) over blind mouse coordinates.
- **Takeaway:** The Translation Layer should map "Click X" to `uiclick "X"` (UIA) rather than `mouse x y`.

### B. browser-use
- **Key Insight:** **Separation of Concerns**.
    1. **Perception:** Get DOM/State.
    2. **Cognition (Planner):** Decide *next action* based on state.
    3. **Action:** Execute.
- **Relevance:** Our loop tries to do everything in one prompt.
- **Takeaway:** We should split the "Translation" step.
    1. User Request -> Translator AI (Specialized System Prompt) -> Standardized JSON/Script
    2. Execution Engine -> Runs Script
    3. Result -> Feedback

### C. Open-Interface
- **Key Insight:** **Continuous Course Correction**. Takes screenshots *during* execution to verify state.
- **Relevance:** Our current loop only checks return codes (exit code 0/1).
- **Takeaway:** We need "Verification Steps" in our commands (e.g., `waitfor "WindowName"`).
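
As a minimal sketch of what "Verification Steps" could look like on the JavaScript side, the function below inserts a `waitfor` step before every `uiclick` that is not already guarded by one. The step shape (`{ tool, args }`), the function name `addVerificationSteps`, and the 5-second default are all assumptions made for illustration; only the `waitfor` and `uiclick` command names come from this proposal.

```javascript
// Hypothetical step shape: { tool: 'uiclick', args: ['Search'] }.
// Returns a new plan in which each uiclick is preceded by a waitfor on the same target.
function addVerificationSteps(steps, timeoutSeconds = 5) {
  const verified = [];
  for (const step of steps) {
    const previous = verified[verified.length - 1];
    // Skip the extra waitfor if the plan already waits for this exact target.
    const alreadyGuarded =
      previous !== undefined &&
      previous.tool === 'waitfor' &&
      previous.args[0] === step.args[0];
    if (step.tool === 'uiclick' && !alreadyGuarded) {
      verified.push({ tool: 'waitfor', args: [step.args[0], timeoutSeconds] });
    }
    verified.push(step);
  }
  return verified;
}
```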

---

## 3. Proposed Improvements

### Phase 1: The "Translation Layer" (Immediate Fix)
Instead of relying on the main chat model to implicitly generate commands, we introduce a **dedicated translation step**.

**Workflow:**
1. **Detection:** Main Chat detects intent (e.g., "Computer Use").
2. **Translation:** System calls a fast, specialized model (or same model with focused prompt) with the *specific schema* of available tools.
    - **Input:** "Open Spotify and search for Jazz"
    - **System Prompt:** "You are a Command Translator. Available tools: `open(app)`, `click(text)`, `type(text)`. Output ONLY the plan."
    - **Output:**
        ```powershell
        powershell bin/input.ps1 open "Spotify"
        powershell bin/input.ps1 waitfor "Search" 5
        powershell bin/input.ps1 uiclick "Search"
        powershell bin/input.ps1 type "Jazz"
        ```
3. **Execution:** The existing `IQExchange` loop runs this reliable script.
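
If the translator emits a structured plan rather than raw text, the command lines in the workflow above can be rendered mechanically, which is where schema enforcement would happen. The sketch below assumes a hypothetical step shape (`{ tool, args }`) and a hypothetical `TOOL_SCHEMA` table; the tool names are taken from the example output, but the argument types are guesses. It rejects unknown tools and malformed arguments instead of passing them to the shell, and it deliberately refuses arguments containing double quotes rather than attempting full PowerShell escaping.

```javascript
// Hypothetical tool schema: tool name -> expected argument types (an assumption).
const TOOL_SCHEMA = new Map([
  ['open', ['string']],
  ['waitfor', ['string', 'number']],
  ['uiclick', ['string']],
  ['type', ['string']],
]);

// Renders one validated step as a command line for bin/input.ps1.
function renderStep(step) {
  const expected = TOOL_SCHEMA.get(step.tool);
  if (expected === undefined) {
    throw new Error(`Unknown tool: ${step.tool}`);
  }
  if (!Array.isArray(step.args) || step.args.length !== expected.length) {
    throw new Error(`Tool ${step.tool} expects ${expected.length} argument(s)`);
  }
  const rendered = step.args.map((arg, i) => {
    if (typeof arg !== expected[i]) {
      throw new Error(`Argument ${i} of ${step.tool} must be a ${expected[i]}`);
    }
    if (expected[i] === 'number') {
      if (!Number.isFinite(arg)) throw new Error('Numeric arguments must be finite');
      return String(arg);
    }
    // Not a complete escaping strategy: quoted arguments are simply refused.
    if (arg.includes('"')) throw new Error('Double quotes are not allowed in arguments');
    return `"${arg}"`;
  });
  return `powershell bin/input.ps1 ${step.tool} ${rendered.join(' ')}`;
}

// Renders a whole plan; throws on the first invalid step.
function renderPlan(steps) {
  return steps.map(renderStep);
}
```

A plan such as `[{ tool: 'open', args: ['Spotify'] }, { tool: 'waitfor', args: ['Search', 5] }]` would render to the first two lines of the example output, while a step with a tool name outside the schema throws before anything is executed.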

### Phase 2: Enhanced Tooling (Library Update)
Update `lib/computer-use.mjs` and `bin/input.ps1` to support **UIA-based robust actions**:
- `uiclick "Text"`: Finds element by text name via UIA (more robust than coordinates).
- `waitfor "Text"`: Polling loop to wait for UI state changes.
- `app_state "App"`: Returns detailed window state/focus.

### Phase 3: The "Cognitive Loop" (Architecture Shift)
Move from **"Plan -> Execute All"** to **"Observe -> Plan -> Act -> Observe"**.
- Instead of generating a full script at the start, the agent generates *one step*, executes it, observes the result (screenshot/output), then generates the next step.
- This handles dynamic popups and loading times much better.
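
The "Observe -> Plan -> Act -> Observe" cycle can be sketched as a small driver that asks for one step at a time. Everything here is hypothetical: `runCognitiveLoop`, the injected `observe`, `planNextStep`, and `act` callbacks, and the convention that the planner returns `null` when the task is finished. The `maxSteps` cap is an added safeguard so a confused planner cannot loop forever.

```javascript
// Drives one-step-at-a-time execution with injected (hypothetical) callbacks:
//   observe()                    -> current state (e.g. screenshot text or command output)
//   planNextStep(state, history) -> next step, or null when the task is complete
//   act(step)                    -> result of executing that single step
async function runCognitiveLoop({ observe, planNextStep, act, maxSteps = 10 }) {
  const history = [];
  for (let i = 0; i < maxSteps; i += 1) {
    const observation = await observe();
    const step = await planNextStep(observation, history);
    if (step === null) {
      return { done: true, history };
    }
    const result = await act(step);
    // The planner sees every earlier step and its result on the next iteration.
    history.push({ step, result });
  }
  // Step budget exhausted without the planner declaring completion.
  return { done: false, history };
}
```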

---

## 4. Implementation Plan (for Phase 1 & 2)

### Step 1: Implement Dedicated Translation Function
In `lib/iq-exchange.mjs` or `bin/opencode-ink.mjs`, create `translateToCommands(userRequest, context)`:
- Uses a strict system prompt defining the *exact* API.
- Enforces output format (e.g., JSON or a strict code block).
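
A minimal sketch of such a function is shown here, assuming the JSON option is chosen. Only the name `translateToCommands(userRequest, context)` comes from this plan; the `context.tools` and `context.callModel` fields, the helper names, and the `{ tool, args }` step shape are assumptions, and `callModel` stands in for whatever model client the TUI already uses.

```javascript
// Hypothetical tool list; the real set would be whatever bin/input.ps1 supports.
const TOOLS = ['open(app)', 'waitfor(text, seconds)', 'uiclick(text)', 'type(text)'];

// Builds the strict system prompt that defines the exact API.
function buildTranslatorPrompt(tools) {
  return [
    'You are a Command Translator.',
    `Available tools: ${tools.join(', ')}.`,
    'Output ONLY a JSON array of steps shaped like {"tool": "open", "args": ["Spotify"]}.',
  ].join('\n');
}

// Enforces the output format: anything that is not a JSON array of steps is rejected.
function parsePlan(rawReply) {
  let plan;
  try {
    plan = JSON.parse(rawReply.trim());
  } catch (err) {
    throw new Error(`Translator reply is not valid JSON: ${err.message}`);
  }
  if (!Array.isArray(plan)) {
    throw new Error('Translator reply must be a JSON array');
  }
  for (const step of plan) {
    const ok =
      step !== null &&
      typeof step === 'object' &&
      typeof step.tool === 'string' &&
      Array.isArray(step.args);
    if (!ok) throw new Error('Each step needs a string "tool" and an array "args"');
  }
  return plan;
}

// context.callModel is a hypothetical injected function:
//   (systemPrompt, userText) => Promise<string>
async function translateToCommands(userRequest, context) {
  const systemPrompt = buildTranslatorPrompt(context.tools);
  const rawReply = await context.callModel(systemPrompt, userRequest);
  return parsePlan(rawReply);
}
```

Because a chatty reply fails in `parsePlan` with a descriptive error, that error text could be fed back through the existing retry loop in the same way stdout/stderr already are.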

### Step 2: Integrate into `handleExecuteCommands`
- Detect if request is "Computer Use".
- If so, *pause* main chat generation.
- Call `translateToCommands`.
- Feed result into the `auto-heal` loop.

### Step 3: Upgrade `input.ps1`
- Ensure it supports the robust UIA methods discovered in Windows-Use (using .NET `System.Windows.Automation`).

## 5. User Review Required
- **Decision:** Do we want the full "Cognitive Loop" (slower, more tokens, highly reliable) or the "Batch Script" approach (faster, cheaper, less robust)?
- **Recommendation:** Start with **Batch Script + Translation Layer** (Phase 1). It fits the current TUI architecture best without a total rewrite.