IQ Exchange & Computer Use: Research & Improvement Proposal

Executive Summary

The current IQ Exchange implementation in opencode-ink.mjs provides a basic retry loop but lacks a robust "Translation Layer" for converting natural language into precise computer actions. It currently relies on placeholder logic or simple string matching.

Research into state-of-the-art agents (Windows-Use, browser-use, Open-Interface) reveals that reliable agents use structured translation layers that map natural language to specific, hook-based APIs (Playwright, UIA) rather than fragile shell commands or pure vision.

This proposal outlines a plan to upgrade the IQ Exchange with a proper AI Translation Layer and a Robust Execution Loop inspired by these findings.


1. Analysis of Current Implementation

Strengths

  • Retry Loop: The IQExchange class has a solid retry mechanism governed by maxRetries (sketched after this list).
  • Feedback Loop: Captures stdout/stderr and feeds them back to the AI for self-healing.
  • Task Detection: Simple regex-based detection for browser vs. desktop tasks.
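
For orientation, the loop has roughly this shape (an illustrative sketch: runWithRetries, execute, and askModelToFix paraphrase the mechanism and are not the actual IQExchange API):

    // Illustrative sketch of the existing retry-with-feedback shape.
    // runWithRetries, execute, and askModelToFix are paraphrases, not the real API.
    async function runWithRetries(script, { maxRetries = 3 } = {}) {
      let attempt = script;
      for (let i = 0; i < maxRetries; i++) {
        const { code, stdout, stderr } = await execute(attempt); // run the commands
        if (code === 0) return { ok: true, stdout };             // success: stop here
        // Self-healing: feed the failure output back to the model for a fix.
        attempt = await askModelToFix(attempt, stderr || stdout);
      }
      return { ok: false };
    }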

Weaknesses

  • Missing Translation Layer: opencode-ink.mjs contains the placeholder comment // NEW: Computer Use Translation Layer, but no actual AI call that converts "Open Spotify and play jazz" into specific PowerShell/Playwright commands. Instead, it relies on the main chat response happening to contain the commands, which is unreliable.
  • Fragile Command Parsing: extractCommands uses a regex to find ``` code blocks, which is hit-or-miss when the AI is chatty (see the sketch after this list).
  • No Structural Enforcement: Nothing constrains the model's output, so it is free to hallucinate commands or arguments.
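
To make the fragility concrete, a regex extractor along these lines (illustrative, not the exact implementation) picks up every fenced block, including examples the model merely mentions in passing:

    // Illustrative regex extraction: grabs every ``` fence in the reply.
    // If a chatty model shows an example block it does NOT want executed,
    // that block gets executed anyway.
    function extractCommands(reply) {
      const fences = [...reply.matchAll(/```(?:\w+)?\n([\s\S]*?)```/g)];
      return fences.flatMap((m) => m[1].split("\n").filter(Boolean));
    }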

2. Research Findings & Inspiration

A. Windows-Use (CursorTouch)

  • Key Insight: Uses native UI Automation (UIA) hooks instead of just vision.
  • Relevance: We should prefer input.ps1 using UIA (via PowerShell's .NET access) over blind mouse coordinates.
  • Takeaway: The Translation Layer should map "Click X" to uiclick "X" (UIA) rather than mouse x y.

B. browser-use

  • Key Insight: Separation of Concerns.
    1. Perception: Get DOM/State.
    2. Cognition (Planner): Decide next action based on state.
    3. Action: Execute.
  • Relevance: Our loop tries to do everything in one prompt.
  • Takeaway: We should split out the "Translation" step (sketched after this list):
    1. User Request -> Translator AI (Specialized System Prompt) -> Standardized JSON/Script
    2. Execution Engine -> Runs Script
    3. Result -> Feedback
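
A minimal sketch of that split, assuming hypothetical getState(), chat(), and runScript() helpers:

    // Minimal sketch of the Perception -> Cognition -> Action split.
    // getState(), chat(), and runScript() are assumed helpers, not existing APIs;
    // translatorPrompt is the focused system prompt described in Phase 1 below.
    async function handleRequest(userRequest, translatorPrompt) {
      const state = await getState();                // 1. Perception: DOM/UIA state
      const plan = await chat(translatorPrompt,      // 2. Cognition: plan from state
        `State:\n${state}\n\nRequest: ${userRequest}`);
      return runScript(plan);                        // 3. Action: execute, feed back
    }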

C. Open-Interface

  • Key Insight: Continuous Course Correction. It takes screenshots during execution to verify state.
  • Relevance: Our current loop only checks return codes (exit code 0/1).
  • Takeaway: We need "Verification Steps" in our commands (e.g., waitfor "WindowName").

3. Proposed Improvements

Phase 1: The "Translation Layer" (Immediate Fix)

Instead of relying on the main chat model to implicitly generate commands, we introduce a dedicated translation step.

Workflow:

  1. Detection: Main Chat detects intent (e.g., "Computer Use").
  2. Translation: The system calls a fast, specialized model (or the same model with a focused prompt) with the specific schema of available tools.
    • Input: "Open Spotify and search for Jazz"
    • System Prompt: "You are a Command Translator. Available tools: open(app), click(text), type(text). Output ONLY the plan."
    • Output:
      powershell bin/input.ps1 open "Spotify"
      powershell bin/input.ps1 waitfor "Search" 5
      powershell bin/input.ps1 uiclick "Search"
      powershell bin/input.ps1 type "Jazz"
      
  3. Execution: The existing IQExchange loop runs this now-reliable script (after the validation step sketched below).
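
Because the translator is still a language model, the translated plan should be validated against the tool schema before the loop runs it. A minimal sketch, assuming the verb set listed in Phase 2 below (validatePlan is a hypothetical helper name):

    // Structural enforcement: validate every translated step against an
    // allowlist of known verbs before the IQExchange loop executes it.
    const ALLOWED_VERBS = new Set(["open", "uiclick", "waitfor", "type", "app_state"]);

    function validatePlan(lines) {
      for (const line of lines) {
        const match = line.match(/^powershell bin\/input\.ps1 (\w+)/);
        if (!match || !ALLOWED_VERBS.has(match[1])) {
          throw new Error(`Rejected hallucinated or malformed step: ${line}`);
        }
      }
      return lines;
    }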

Phase 2: Enhanced Tooling (Library Update)

Update lib/computer-use.mjs and bin/input.ps1 to support UIA-based robust actions (a Node-side wrapper is sketched after the list):

  • uiclick "Text": Finds element by text name via UIA (more robust than coordinates).
  • waitfor "Text": Polling loop to wait for UI state changes.
  • app_state "App": Returns detailed window state/focus.
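
On the Node side, lib/computer-use.mjs only needs thin wrappers over these verbs. A minimal sketch using Node's child_process (the verb names match the list above; the export name and everything else are illustrative):

    // lib/computer-use.mjs: thin wrapper over the PowerShell verbs.
    // A nonzero exit code rejects the promise, which feeds the auto-heal loop.
    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    export async function input(verb, ...args) {
      const { stdout } = await run("powershell", [
        "-NoProfile", "-File", "bin/input.ps1", verb, ...args,
      ]);
      return stdout.trim();
    }

    // Usage: await input("open", "Spotify"); await input("waitfor", "Search", "5");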

Phase 3: The "Cognitive Loop" (Architecture Shift)

Move from "Plan -> Execute All" to "Observe -> Plan -> Act -> Observe".

  • Instead of generating a full script up front, the agent generates one step, executes it, observes the result (screenshot/output), then generates the next step (sketched below).
  • This handles dynamic popups and variable loading times much better.
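
A minimal sketch of that loop, assuming hypothetical observe(), planNextStep(), and act() helpers:

    // Observe -> Plan -> Act, one step at a time, until the planner reports done.
    // observe(), planNextStep(), and act() are assumed helpers, not existing APIs.
    const MAX_STEPS = 20; // guard against pages that never settle

    async function cognitiveLoop(goal) {
      let observation = await observe();              // screenshot / UIA state
      for (let step = 0; step < MAX_STEPS; step++) {
        const action = await planNextStep(goal, observation);
        if (action.done) return action.summary;       // planner declares success
        observation = await act(action);              // execute one step, re-observe
      }
      throw new Error("Step budget exhausted before reaching the goal");
    }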

4. Implementation Plan (for Phase 1 & 2)

Step 1: Implement Dedicated Translation Function

In lib/iq-exchange.mjs or bin/opencode-ink.mjs, create translateToCommands(userRequest, context):

  • Uses a strict system prompt defining the exact API.
  • Enforces the output format (e.g., JSON or a strict code block); a sketch follows.
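
A hedged sketch, reusing the chat() helper assumed earlier and the validatePlan check from Phase 1 (names illustrative, not existing APIs):

    // Sketch of translateToCommands: strict system prompt, JSON-only output,
    // reusing validatePlan (Phase 1) to reject hallucinated verbs.
    const TRANSLATOR_PROMPT = [
      "You are a Command Translator.",
      "Available tools: open(app), uiclick(text), waitfor(text, seconds), type(text).",
      'Respond with ONLY a JSON array of "powershell bin/input.ps1 ..." strings.',
    ].join("\n");

    async function translateToCommands(userRequest, context) {
      const reply = await chat(TRANSLATOR_PROMPT,
        `Context:\n${context}\n\nRequest: ${userRequest}`);
      const plan = JSON.parse(reply);   // throws on chatty, non-JSON output
      return validatePlan(plan);        // throws on unknown or malformed verbs
    }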

Step 2: Integrate into handleExecuteCommands

  • Detect whether the request is "Computer Use".
  • If so, pause the main chat generation.
  • Call translateToCommands.
  • Feed the result into the existing auto-heal loop (see the sketch below).
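
Wired together, the integration point stays small (isComputerUse paraphrases the existing detection; runWithRetries stands in for the IQExchange auto-heal loop):

    // Sketch of the integration inside handleExecuteCommands.
    async function handleExecuteCommands(userRequest, context) {
      if (!isComputerUse(userRequest)) return null;  // normal chat path continues
      // Pause main chat generation and translate the request instead.
      const plan = await translateToCommands(userRequest, context);
      // Hand the validated plan to the existing retry / auto-heal loop.
      return runWithRetries(plan.join("\n"));
    }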

Step 3: Upgrade input.ps1

  • Ensure it supports the robust UIA methods used by Windows-Use (via .NET System.Windows.Automation).

5. User Review Required

  • Decision: Do we want the full "Cognitive Loop" (slower, more tokens, highly reliable) or the "Batch Script" approach (faster, cheaper, less robust)?
  • Recommendation: Start with the Batch Script + Translation Layer approach (Phase 1). It fits the current TUI architecture without requiring a total rewrite.