IQ Exchange & Computer Use: Research & Improvement Proposal
Executive Summary
The current IQ Exchange implementation in `opencode-ink.mjs` provides a basic retry loop but lacks a robust "Translation Layer" for converting natural language into precise computer actions. It currently relies on placeholder logic or simple string matching.
Research into state-of-the-art agents (Windows-Use, browser-use, OpenDevin) reveals that reliable agents use structured translation layers that map natural language to specific, hook-based APIs (Playwright, UIA) rather than fragile shell commands or pure vision.
This proposal outlines a plan to upgrade the IQ Exchange with a proper AI Translation Layer and a Robust Execution Loop inspired by these findings.
1. Analysis of Current Implementation
Strengths
- Retry Loop: The `IQExchange` class has a solid retry mechanism with `maxRetries`.
- Feedback Loop: Captures stdout/stderr and feeds it back to the AI for self-healing.
- Task Detection: Simple regex-based detection for browser vs. desktop tasks.
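The retry-with-feedback shape described above can be sketched as a small loop. This is a hedged illustration, not the actual `IQExchange` code: `runCommands` and `requestFix` are stand-ins for the real execution and model-call functions.

```javascript
// Minimal sketch of a retry/self-heal loop: run the commands, and on
// failure hand stderr back to a fix function (the model, in the real
// IQExchange) to produce a corrected attempt.
async function retryWithFeedback(commands, runCommands, requestFix, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const { ok, stdout, stderr } = await runCommands(commands);
    if (ok) return stdout;                          // success: done
    commands = await requestFix(commands, stderr);  // feed errors back to the AI
  }
  throw new Error("max retries exceeded");
}
```

Injecting `runCommands` and `requestFix` keeps the loop itself testable without spawning processes or calling a model.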
Weaknesses
- Missing Translation Layer: The `opencode-ink.mjs` file has a placeholder comment `// NEW: Computer Use Translation Layer` but no actual AI call to convert "Open Spotify and play jazz" into specific PowerShell/Playwright commands. It relies on the main chat response happening to contain the commands, which is unreliable.
- Fragile Command Parsing: `extractCommands` uses a regex to find ``` code blocks, which is hit-or-miss when the AI is chatty.
- No Structural Enforcement: The AI is free to hallucinate commands or arguments.
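To make the parsing weakness concrete, here is a sketch of a stricter extraction: only fenced blocks explicitly tagged as executable are collected, so chatty prose and untagged example blocks are ignored. The `run` tag is an assumption for illustration, not part of the current code.

```javascript
// Stricter command extraction sketch: collect lines only from fenced blocks
// tagged as runnable, dropping blanks and comment lines.
function extractCommands(response) {
  // matches blocks of the form: three backticks + "run", body, three backticks
  const fence = new RegExp("`{3}run\\n([\\s\\S]*?)`{3}", "g");
  const commands = [];
  let match;
  while ((match = fence.exec(response)) !== null) {
    for (const line of match[1].split("\n")) {
      const trimmed = line.trim();
      if (trimmed && !trimmed.startsWith("#")) commands.push(trimmed);
    }
  }
  return commands;
}
```

This still does not validate *what* the commands are; structural enforcement is addressed by the Translation Layer proposed below.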
2. Research Findings & Inspiration
A. Windows-Use (CursorTouch)
- Key Insight: Uses native UI Automation (UIA) hooks instead of just vision.
- Relevance: We should prefer `Input.ps1` using UIA (via PowerShell .NET access) over blind mouse coordinates.
- Takeaway: The Translation Layer should map "Click X" to `uiclick "X"` (UIA) rather than `mouse x y`.
B. browser-use
- Key Insight: Separation of Concerns.
- Perception: Get DOM/State.
- Cognition (Planner): Decide next action based on state.
- Action: Execute.
- Relevance: Our loop tries to do everything in one prompt.
- Takeaway: We should split the "Translation" step.
- User Request -> Translator AI (Specialized System Prompt) -> Standardized JSON/Script
- Execution Engine -> Runs Script
- Result -> Feedback
C. Open-Interface
- Key Insight: Continuous Course Correction. Takes screenshots during execution to verify state.
- Relevance: Our current loop only checks return codes (exit code 0/1).
- Takeaway: We need "Verification Steps" in our commands (e.g., `waitfor "WindowName"`).
3. Proposed Improvements
Phase 1: The "Translation Layer" (Immediate Fix)
Instead of relying on the main chat model to implicitly generate commands, we introduce a dedicated translation step.
Workflow:
- Detection: Main Chat detects intent (e.g., "Computer Use").
- Translation: System calls a fast, specialized model (or same model with focused prompt) with the specific schema of available tools.
- Input: "Open Spotify and search for Jazz"
- System Prompt: "You are a Command Translator. Available tools: `open(app)`, `click(text)`, `type(text)`. Output ONLY the plan."
- Output:

  ```
  powershell bin/input.ps1 open "Spotify"
  powershell bin/input.ps1 waitfor "Search" 5
  powershell bin/input.ps1 uiclick "Search"
  powershell bin/input.ps1 type "Jazz"
  ```

- Execution: The existing `IQExchange` loop runs this reliable script.
Phase 2: Enhanced Tooling (Library Update)
Update `lib/computer-use.mjs` and `bin/input.ps1` to support UIA-based robust actions:
- `uiclick "Text"`: Finds an element by text name via UIA (more robust than coordinates).
- `waitfor "Text"`: Polling loop to wait for UI state changes.
- `app_state "App"`: Returns detailed window state/focus.
Phase 3: The "Cognitive Loop" (Architecture Shift)
Move from "Plan -> Execute All" to "Observe -> Plan -> Act -> Observe".
- Instead of generating a full script at start, the agent generates one step, executes it, observes the result (screenshot/output), then generates the next step.
- This handles dynamic popups and loading times much better.
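The Observe -> Plan -> Act -> Observe cycle can be sketched as a small driver loop. `plan`, `act`, and `observe` are injected stand-ins here: in the real agent, `plan` would call the model with the latest observation and `act` would execute a single `input.ps1` step.

```javascript
// One-step-at-a-time agent loop: re-observe after every action so the
// planner always works from current state (popups, loading, etc.).
async function cognitiveLoop({ plan, act, observe, maxSteps = 10 }) {
  let observation = await observe();
  for (let step = 0; step < maxSteps; step++) {
    const action = await plan(observation);    // decide exactly one next step
    if (action === null) return { done: true, steps: step }; // planner says finished
    await act(action);                         // execute the single step
    observation = await observe();             // fresh state before re-planning
  }
  return { done: false, steps: maxSteps };     // step budget exhausted
}
```

The `maxSteps` cap bounds token spend, which matters for the cost trade-off raised in section 5.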
4. Implementation Plan (for Phase 1 & 2)
Step 1: Implement Dedicated Translation Function
In `lib/iq-exchange.mjs` or `bin/opencode-ink.mjs`, create `translateToCommands(userRequest, context)`:
- Uses a strict system prompt defining the exact API.
- Enforces output format (e.g., JSON or strict Code Block).
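A possible shape for this function, assuming a JSON output format: the system prompt pins the tool schema, and a validator rejects anything outside it before execution. `callModel` is a stand-in for whatever chat API the project uses; the tool names and JSON shape are assumptions from this proposal, not existing code.

```javascript
// Strict system prompt defining the exact API the translator may use.
const SYSTEM_PROMPT = `You are a Command Translator.
Available tools: open(app), uiclick(text), type(text), waitfor(text).
Respond with ONLY a JSON array: [{"tool": "...", "arg": "..."}, ...]`;

const VALID_TOOLS = new Set(["open", "uiclick", "type", "waitfor"]);

// Validate the model's raw output against the schema; throws on violations,
// which the caller can feed back into a retry.
function parsePlan(raw) {
  const plan = JSON.parse(raw);                  // throws on non-JSON chatter
  if (!Array.isArray(plan)) throw new Error("plan must be an array");
  for (const step of plan) {
    if (!VALID_TOOLS.has(step.tool)) throw new Error(`bad tool: ${step.tool}`);
    if (typeof step.arg !== "string") throw new Error("arg must be a string");
  }
  return plan;
}

// Translate a natural-language request into input.ps1 command lines.
async function translateToCommands(userRequest, callModel) {
  const raw = await callModel(SYSTEM_PROMPT, userRequest);
  return parsePlan(raw).map((s) => `powershell bin/input.ps1 ${s.tool} "${s.arg}"`);
}
```

Because validation happens before execution, a hallucinated tool name becomes a parse error (cheap to retry) instead of a failed or dangerous shell command.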
Step 2: Integrate into handleExecuteCommands
- Detect if request is "Computer Use".
- If so, pause main chat generation.
- Call `translateToCommands`.
- Feed the result into the `auto-heal` loop.
Step 3: Upgrade input.ps1
- Ensure it supports the robust UIA methods discovered in Windows-Use (using .NET `System.Windows.Automation`).
5. User Review Required
- Decision: Do we want the full "Cognitive Loop" (slower, more tokens, highly reliable) or the "Batch Script" approach (faster, cheaper, less robust)?
- Recommendation: Start with Batch Script + Translation Layer (Phase 1). It fits the current TUI architecture best without a total rewrite.