Computer Use Feature Audit: OpenQode TUI GEN5 🕵️

Audit Date: 2025-12-15
Auditor: Opus 4.5


Executive Summary

OpenQode TUI GEN5 implements an input.ps1 script (1175 lines) that covers most features of the three reference projects. The remaining gaps are in advanced automation patterns, visual feedback loops, and persistent browser control.


Feature Comparison Matrix

1. Windows-Use (CursorTouch/Windows-Use)

| Feature | Windows-Use | OpenQode | Status | Notes |
|---|---|---|---|---|
| Mouse Control | PyAutoGUI | P/Invoke | FULL | Native Win32 API |
| mouse move | | mouse x y | | |
| smooth movement | | mousemove | | Duration parameter |
| click types | | all 4 types | | left/right/double/middle |
| drag | | drag | | |
| scroll | | scroll | | |
| Keyboard Control | PyAutoGUI | SendKeys/P/Invoke | FULL | |
| type text | | type | | |
| key press | | key | | Special keys supported |
| hotkey combos | | hotkey | | CTRL+C, ALT+TAB, etc. |
| keydown/keyup | | both | | For modifiers |
| UI Automation | UIAutomation | UIAutomationClient | FULL | |
| find element | | find | | By name |
| find all | | findall | | Multiple instances |
| find by property | | findby | | controltype, class, automationid |
| click element | | uiclick | | InvokePattern + fallback |
| waitfor element | | waitfor | | Timeout support |
| App Control | | | FULL | |
| list apps/windows | | apps | | With position/size |
| kill process | | kill | | By name or title |
| Shell Commands | subprocess | | ⚠️ PARTIAL | Via /run in TUI |
| Telemetry | | | 🔵 NOT NEEDED | Privacy-focused |

2. Open-Interface (AmberSahdev/Open-Interface)

| Feature | Open-Interface | OpenQode | Status | Notes |
|---|---|---|---|---|
| Screenshot Capture | Pillow/pyautogui | System.Drawing | FULL | |
| full screen | | screenshot | | |
| region capture | | region x,y,w,h | | |
| Visual Feedback Loop | GPT-4V/Gemini | TERMINUS prompt | ⚠️ PARTIAL | See improvements |
| screenshot → LLM → action | | ⚠️ prompt-based | ⚠️ | No automatic loop |
| course correction | | | MISSING | Needs implementation |
| OCR | pytesseract | (stub) | ⚠️ STUB | Needs Tesseract |
| text recognition | | Described only | ⚠️ | |
| Color Detection | | | FULL | |
| get pixel color | ? | color | | Hex output |
| wait for color | ? | waitforcolor | | With tolerance |
| Multi-Monitor | Limited | Limited | ⚠️ | Primary only |

3. Browser-Use (browser-use/browser-use)

| Feature | Browser-Use | OpenQode | Status | Notes |
|---|---|---|---|---|
| Browser Launch | Playwright | Start-Process | FULL | |
| open URL | | browse, open | | Multiple browsers |
| google search | | googlesearch | | Direct URL |
| Page Navigation | Playwright | | ⚠️ PARTIAL | |
| navigate | | playwright navigate | ⚠️ | Opens in system browser |
| Element Interaction | Playwright | UIAutomation | ⚠️ DIFFERENT | |
| click by selector | CSS/XPath | ⚠️ Name only | ⚠️ | No CSS/XPath |
| fill form | | ⚠️ browsercontrol fill | ⚠️ | UIAutomation-based |
| Content Extraction | Playwright | | MISSING | |
| get page content | | | | Needs Playwright |
| get element text | | | | |
| Persistent Session | Playwright | | MISSING | No CDP/WebSocket |
| cookies/auth | | | | |
| Multi-Tab | Playwright | | MISSING | |
| Agent Loop | Built-in | TUI TERMINUS | ⚠️ PARTIAL | Different architecture |

Missing Features & Implementation Suggestions

🔴 Critical Gaps

  1. Visual Feedback Loop (Open-Interface Style)

    • Gap: No automatic "take screenshot → analyze → act → repeat" loop
    • Fix: Implement a /vision-loop command that:
      1. Takes screenshot
      2. Sends to vision model (Qwen-VL or GPT-4V)
      3. Parses response for actions
      4. Executes via input.ps1
      5. Repeats until goal achieved
    • Credit: AmberSahdev/Open-Interface
  2. Full OCR Support

    • Gap: OCR is a stub in input.ps1
    • Fix: Integrate Windows 10+ OCR API or Tesseract
    • Code from: Windows.Media.Ocr namespace
  3. Playwright Integration (Real)

    • Gap: The playwright command only simulates Playwright; navigate opens the URL in the system browser
    • Fix: Create bin/playwright-bridge.js that:
      1. Launches Chromium with Playwright
      2. Exposes WebSocket for commands
      3. input.ps1 playwright calls this bridge
    • Credit: browser-use/browser-use
  4. Content Extraction

    • Gap: Cannot read web page content
    • Fix: Use Playwright page.content() or clipboard hack
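
The five steps proposed for /vision-loop in gap 1 could be sketched roughly as follows. This is a minimal illustration, not OpenQode code: `captureScreen`, `askVisionModel`, and `runAction` are hypothetical callbacks the TUI would supply, and the reply format (one `ACTION: <input.ps1 arguments>` per line, `DONE` when the goal is met) is an assumption made for the example.

```javascript
// Hypothetical sketch of a /vision-loop driver. All names here are
// illustrative assumptions; none are taken from the OpenQode source.

// Parse a model reply. Assumed format: "ACTION: <input.ps1 args>" per line,
// plus a line containing only "DONE" once the goal has been reached.
function parseReply(reply) {
    const lines = reply.split(/\r?\n/).map((line) => line.trim());
    const done = lines.some((line) => line === 'DONE');
    const actions = lines
        .filter((line) => line.startsWith('ACTION:'))
        .map((line) => line.slice('ACTION:'.length).trim())
        .filter((action) => action.length > 0);
    return { done, actions };
}

async function visionLoop(goal, { captureScreen, askVisionModel, runAction }, maxSteps = 10) {
    const executed = [];
    for (let step = 0; step < maxSteps; step++) {
        const screenshot = await captureScreen();              // 1. take screenshot
        const reply = await askVisionModel(goal, screenshot);  // 2. send to vision model
        const { done, actions } = parseReply(reply);           // 3. parse response for actions
        for (const action of actions) {
            await runAction(action);                           // 4. execute via input.ps1
            executed.push(action);
        }
        if (done) return { achieved: true, executed };         // 5. stop once the goal is achieved
    }
    return { achieved: false, executed };                      // bail out instead of looping forever
}
```

The `maxSteps` cap is a design choice added here so that a model that never reports `DONE` cannot drive the desktop indefinitely.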

🟡 Enhancement Opportunities

  1. Course Correction (Open-Interface)

    • After each action, automatically take screenshot and verify success
    • If UI doesn't match expected state, retry or ask for guidance
  2. CSS/XPath Selectors (Browser-Use)

    • Current findby only supports Name, ControlType, Class
    • For web: need Playwright or CDP for CSS selectors
  3. Multi-Tab Browser Control

    • Use --remote-debugging-port to connect via CDP
    • Enable tab switching, new tabs, close tabs
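
For the multi-tab item, a browser started with --remote-debugging-port exposes a small HTTP interface alongside the CDP WebSocket. The helper below is a sketch that only builds the requests for listing, opening, switching, and closing tabs; the function name, the default port 9222, and the host are assumptions, and the use of PUT for /json/new reflects recent Chrome builds and should be verified against the target browser.

```javascript
// Build DevTools HTTP requests for basic tab control.
// Assumes the browser was launched with --remote-debugging-port=<port>.
// The function and method names are illustrative, not part of input.ps1.
function cdpTabRequests(port = 9222, host = '127.0.0.1') {
    const base = `http://${host}:${port}`;
    return {
        // Returns JSON describing every open target, including its id.
        listTabs: () => ({ method: 'GET', url: `${base}/json/list` }),
        // Opens a new tab at pageUrl (older Chrome builds also accepted GET).
        newTab: (pageUrl) => ({
            method: 'PUT',
            url: `${base}/json/new?${encodeURIComponent(pageUrl)}`,
        }),
        // Brings the tab with the given target id to the foreground.
        switchTab: (targetId) => ({ method: 'GET', url: `${base}/json/activate/${targetId}` }),
        // Closes the tab with the given target id.
        closeTab: (targetId) => ({ method: 'GET', url: `${base}/json/close/${targetId}` }),
    };
}
```

Each returned object can be passed to any HTTP client, for example `fetch(request.url, { method: request.method })` on a runtime that provides `fetch`.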

Opus 4.5 Improvement Recommendations

1. Natural Language → Action Translation

The current TERMINUS prompt is complex. Simple requests can bypass it with a direct mapping:

// Decision Tree in handleSubmit
if (isComputerUseRequest) {
    // Skip AI interpretation, directly map to actions
    const actionMap = {
        'click start': 'input.ps1 key LWIN',
        'open chrome': 'input.ps1 open chrome.exe',
        'google X': 'input.ps1 googlesearch X'
    };
    // Execute immediately without LLM call for simple requests
}
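
A runnable variant of the mapping above might look like the following sketch. The function name `mapSimpleRequest` and the constant `EXACT_ACTIONS` are hypothetical. The two fixed entries come from the snippet, while "google X" is handled with a pattern because X is a free-form query rather than a literal key.

```javascript
// Fixed phrases mapped to input.ps1 commands (entries from the audit snippet).
const EXACT_ACTIONS = new Map([
    ['click start', 'input.ps1 key LWIN'],
    ['open chrome', 'input.ps1 open chrome.exe'],
]);

// Return the command for a simple request, or null so the caller can fall
// back to the LLM. The name and the null convention are assumptions.
function mapSimpleRequest(request) {
    // Normalise whitespace so "  Click   Start " still matches.
    const text = request.trim().replace(/\s+/g, ' ');
    const exact = EXACT_ACTIONS.get(text.toLowerCase());
    if (exact) return exact;
    // "google <query>": keep the query's original casing.
    const search = text.match(/^google (.+)$/i);
    if (search) return `input.ps1 googlesearch ${search[1]}`;
    return null;
}
```

Returning null instead of throwing keeps the fast path optional: anything the table does not recognise still goes through the normal prompt.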

2. Action Confirmation UI

Add visual feedback in TUI when executing:

🖱️ Executing: uiclick "Start"
⏳ Waiting for element...
✅ Clicked at (45, 1050)

3. Streaming Action Execution

Instead of generating all commands and then executing them, stream one command at a time:

  1. AI generates first command
  2. TUI executes immediately
  3. AI generates next based on result
  4. Repeat
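
The four steps above amount to a loop in which each AI call sees the results of every earlier command. A minimal synchronous sketch follows, assuming hypothetical `nextCommand` (the AI call) and `execute` (the TUI running input.ps1) callbacks; a real implementation would most likely be asynchronous.

```javascript
// Hypothetical sketch of streaming execution. `nextCommand` and `execute`
// are stand-ins supplied by the caller, not existing OpenQode functions.
function streamActions(nextCommand, execute, maxSteps = 20) {
    const history = [];
    while (history.length < maxSteps) {
        const command = nextCommand(history); // AI generates one command from prior results
        if (command === null) break;          // AI signals that it is finished
        const result = execute(command);      // TUI executes immediately
        history.push({ command, result });    // result feeds the next generation step
    }
    return history;
}
```

Because `history` is passed back on every iteration, the generator can react to a failed step rather than continuing with a plan that no longer matches the screen.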

4. Safety Sandbox

Add /sandbox mode that:

  • Shows preview of actions before execution
  • Requires confirmation for system-level changes
  • Logs all actions for audit
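
One possible shape for /sandbox is sketched below. The set of verbs treated as system-level and the `confirm` and `execute` callbacks are assumptions made for illustration; commands are written as input.ps1 argument strings such as "kill notepad".

```javascript
// Assumption: these input.ps1 verbs count as "system-level". The list is
// illustrative and not taken from OpenQode.
const SYSTEM_LEVEL_VERBS = new Set(['kill', 'open']);

// Build the preview that would be shown before anything runs.
function previewActions(commands) {
    return commands.map((command) => {
        const verb = command.trim().split(/\s+/)[0].toLowerCase();
        return { command, needsConfirmation: SYSTEM_LEVEL_VERBS.has(verb) };
    });
}

// Run the previewed actions. `confirm` and `execute` are hypothetical
// callbacks supplied by the TUI; every decision is appended to `auditLog`.
function runSandboxed(commands, { confirm, execute }, auditLog = []) {
    for (const { command, needsConfirmation } of previewActions(commands)) {
        const approved = !needsConfirmation || confirm(command);
        if (approved) execute(command);
        auditLog.push({ command, needsConfirmation, executed: approved });
    }
    return auditLog;
}
```

Declined actions are logged as well as executed ones, so the audit trail records what was attempted and not only what ran.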

5. Vision Model Integration

// In agent-prompt.mjs, add:
if (activeSkill?.id === 'win-vision') {
    // Attach screenshot to next API call
    const screenshot = await captureScreen();
    context.visionImage = screenshot;
}

Attribution Requirements

When committing changes inspired by these projects:

git commit -m "feat(computer-use): Add visual feedback loop

Inspired by: AmberSahdev/Open-Interface
Credit: https://github.com/AmberSahdev/Open-Interface
License: MIT"

git commit -m "feat(browser): Add Playwright bridge for web automation

Inspired by: browser-use/browser-use
Credit: https://github.com/browser-use/browser-use
License: MIT"

Summary

| Module | Completeness | Notes |
|---|---|---|
| Computer Use (Windows-Use) | 95% | Full parity |
| Computer Vision (Open-Interface) | ⚠️ 60% | Missing feedback loop, OCR |
| Browser Use (browser-use) | ⚠️ 50% | Missing Playwright, content extraction |
| Server Management | 90% | Via PowerShell skills |

Overall: 75% feature parity, with room for improvement in visual automation and browser control.