IQ Exchange & Computer Use: Research & Improvement Proposal
Executive Summary
The current IQ Exchange implementation in `opencode-ink.mjs` provides a basic retry loop but lacks a robust "Translation Layer" for converting natural language into precise computer actions. It currently relies on placeholder logic or simple string matching.
Research into state-of-the-art agents (Windows-Use, browser-use, OpenDevin) reveals that reliable agents use structured translation layers that map natural language to specific, hook-based APIs (Playwright, UIA) rather than fragile shell commands or pure vision.
This proposal outlines a plan to upgrade the IQ Exchange with a proper AI Translation Layer and a Robust Execution Loop inspired by these findings.
1. Analysis of Current Implementation
Strengths
- Retry Loop: The `IQExchange` class has a solid retry mechanism with `maxRetries`.
- Feedback Loop: Captures stdout/stderr and feeds it back to the AI for self-healing.
- Task Detection: Simple regex-based detection for browser vs. desktop tasks.
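The retry-with-feedback shape described above can be sketched as a small loop. This is a hedged illustration, not the actual `IQExchange` code: `runCommands` and `requestFix` are stand-ins for the real execution and model-call functions.

```javascript
// Minimal sketch of a retry/self-heal loop: run the commands, and on
// failure hand stderr back to a fix function (the model, in the real
// IQExchange) to produce a corrected attempt.
async function retryWithFeedback(commands, runCommands, requestFix, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const { ok, stdout, stderr } = await runCommands(commands);
    if (ok) return stdout;                          // success: done
    commands = await requestFix(commands, stderr);  // feed errors back to the AI
  }
  throw new Error("max retries exceeded");
}
```

Injecting `runCommands` and `requestFix` keeps the loop itself testable without spawning processes or calling a model.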
Weaknesses
- Missing Translation Layer: The `opencode-ink.mjs` file has a placeholder comment `// NEW: Computer Use Translation Layer` but no actual AI call to convert "Open Spotify and play jazz" into specific PowerShell/Playwright commands. It relies on the main chat response happening to contain the commands, which is unreliable.
- Fragile Command Parsing: `extractCommands` uses a regex to find ``` code blocks, which is hit-or-miss when the AI is chatty.
- No Structural Enforcement: The AI is free to hallucinate commands or arguments.
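To make the parsing weakness concrete, here is a sketch of a stricter extraction: only fenced blocks explicitly tagged as executable are collected, so chatty prose and untagged example blocks are ignored. The `run` tag is an assumption for illustration, not part of the current code.

```javascript
// Stricter command extraction sketch: collect lines only from fenced blocks
// tagged as runnable, dropping blanks and comment lines.
function extractCommands(response) {
  // matches blocks of the form: three backticks + "run", body, three backticks
  const fence = new RegExp("`{3}run\\n([\\s\\S]*?)`{3}", "g");
  const commands = [];
  let match;
  while ((match = fence.exec(response)) !== null) {
    for (const line of match[1].split("\n")) {
      const trimmed = line.trim();
      if (trimmed && !trimmed.startsWith("#")) commands.push(trimmed);
    }
  }
  return commands;
}
```

This still does not validate *what* the commands are; structural enforcement is addressed by the Translation Layer proposed below.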
2. Research Findings & Inspiration
A. Windows-Use (CursorTouch)
- Key Insight: Uses native UI Automation (UIA) hooks instead of just vision.
- Relevance: We should prefer `Input.ps1` using UIA (via PowerShell .NET access) over blind mouse coordinates.
- Takeaway: The Translation Layer should map "Click X" to `uiclick "X"` (UIA) rather than `mouse x y`.
B. browser-use
- Key Insight: Separation of Concerns.
- Perception: Get DOM/State.
- Cognition (Planner): Decide next action based on state.
- Action: Execute.
- Relevance: Our loop tries to do everything in one prompt.
- Takeaway: We should split the "Translation" step.
- User Request -> Translator AI (Specialized System Prompt) -> Standardized JSON/Script
- Execution Engine -> Runs Script
- Result -> Feedback
C. Open-Interface
- Key Insight: Continuous Course Correction. Takes screenshots during execution to verify state.
- Relevance: Our current loop only checks return codes (exit code 0/1).
- Takeaway: We need "Verification Steps" in our commands (e.g., `waitfor "WindowName"`).
3. Proposed Improvements
Phase 1: The "Translation Layer" (Immediate Fix)
Instead of relying on the main chat model to implicitly generate commands, we introduce a dedicated translation step.
Workflow:
- Detection: Main Chat detects intent (e.g., "Computer Use").
- Translation: System calls a fast, specialized model (or same model with focused prompt) with the specific schema of available tools.
- Input: "Open Spotify and search for Jazz"
- System Prompt: "You are a Command Translator. Available tools: `open(app)`, `click(text)`, `type(text)`. Output ONLY the plan."
- Output:

  ```
  powershell bin/input.ps1 open "Spotify"
  powershell bin/input.ps1 waitfor "Search" 5
  powershell bin/input.ps1 uiclick "Search"
  powershell bin/input.ps1 type "Jazz"
  ```

- Execution: The existing `IQExchange` loop runs this reliable script.
Phase 2: Enhanced Tooling (Library Update)
Update `lib/computer-use.mjs` and `bin/input.ps1` to support UIA-based robust actions:
- `uiclick "Text"`: Finds an element by text name via UIA (more robust than coordinates).
- `waitfor "Text"`: Polling loop to wait for UI state changes.
- `app_state "App"`: Returns detailed window state/focus.
Phase 3: The "Cognitive Loop" (Architecture Shift)
Move from "Plan -> Execute All" to "Observe -> Plan -> Act -> Observe".
- Instead of generating a full script at start, the agent generates one step, executes it, observes the result (screenshot/output), then generates the next step.
- This handles dynamic popups and loading times much better.
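The Observe -> Plan -> Act -> Observe cycle can be sketched as a small driver loop. `plan`, `act`, and `observe` are injected stand-ins here: in the real agent, `plan` would call the model with the latest observation and `act` would execute a single `input.ps1` step.

```javascript
// One-step-at-a-time agent loop: re-observe after every action so the
// planner always works from current state (popups, loading, etc.).
async function cognitiveLoop({ plan, act, observe, maxSteps = 10 }) {
  let observation = await observe();
  for (let step = 0; step < maxSteps; step++) {
    const action = await plan(observation);    // decide exactly one next step
    if (action === null) return { done: true, steps: step }; // planner says finished
    await act(action);                         // execute the single step
    observation = await observe();             // fresh state before re-planning
  }
  return { done: false, steps: maxSteps };     // step budget exhausted
}
```

The `maxSteps` cap bounds token spend, which matters for the cost trade-off raised in section 5.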
4. Implementation Plan (for Phase 1 & 2)
Step 1: Implement Dedicated Translation Function
In `lib/iq-exchange.mjs` or `bin/opencode-ink.mjs`, create `translateToCommands(userRequest, context)`:
- Uses a strict system prompt defining the exact API.
- Enforces output format (e.g., JSON or strict Code Block).
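A possible shape for this function, assuming a JSON output format: the system prompt pins the tool schema, and a validator rejects anything outside it before execution. `callModel` is a stand-in for whatever chat API the project uses; the tool names and JSON shape are assumptions from this proposal, not existing code.

```javascript
// Strict system prompt defining the exact API the translator may use.
const SYSTEM_PROMPT = `You are a Command Translator.
Available tools: open(app), uiclick(text), type(text), waitfor(text).
Respond with ONLY a JSON array: [{"tool": "...", "arg": "..."}, ...]`;

const VALID_TOOLS = new Set(["open", "uiclick", "type", "waitfor"]);

// Validate the model's raw output against the schema; throws on violations,
// which the caller can feed back into a retry.
function parsePlan(raw) {
  const plan = JSON.parse(raw);                  // throws on non-JSON chatter
  if (!Array.isArray(plan)) throw new Error("plan must be an array");
  for (const step of plan) {
    if (!VALID_TOOLS.has(step.tool)) throw new Error(`bad tool: ${step.tool}`);
    if (typeof step.arg !== "string") throw new Error("arg must be a string");
  }
  return plan;
}

// Translate a natural-language request into input.ps1 command lines.
async function translateToCommands(userRequest, callModel) {
  const raw = await callModel(SYSTEM_PROMPT, userRequest);
  return parsePlan(raw).map((s) => `powershell bin/input.ps1 ${s.tool} "${s.arg}"`);
}
```

Because validation happens before execution, a hallucinated tool name becomes a parse error (cheap to retry) instead of a failed or dangerous shell command.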
Step 2: Integrate into handleExecuteCommands
- Detect if request is "Computer Use".
- If so, pause main chat generation.
- Call `translateToCommands`.
- Feed the result into the `auto-heal` loop.
Step 3: Upgrade input.ps1
- Ensure it supports the robust UIA methods discovered in Windows-Use (using .NET `System.Windows.Automation`).
5. User Review Required
- Decision: Do we want the full "Cognitive Loop" (slower, more tokens, highly reliable) or the "Batch Script" approach (faster, cheaper, less robust)?
- Recommendation: Start with Batch Script + Translation Layer (Phase 1). It fits the current TUI architecture best without a total rewrite.