Computer Use Feature Audit: OpenQode TUI GEN5 🕵️
Audit Date: 2025-12-15 Auditor: Opus 4.5
Executive Summary
OpenQode TUI GEN5 has implemented a comprehensive input.ps1 script (1175 lines) that covers most features from the three reference projects. However, there are gaps in advanced automation patterns, visual feedback loops, and persistent browser control.
Feature Comparison Matrix
1. Windows-Use (CursorTouch/Windows-Use)
| Feature | Windows-Use | OpenQode | Status | Notes |
|---|---|---|---|---|
| Mouse Control | PyAutoGUI | P/Invoke | ✅ FULL | Native Win32 API |
| mouse move | ✅ | ✅ `mouse x y` | ✅ | |
| smooth movement | ✅ | ✅ `mousemove` | ✅ | Duration parameter |
| click types | ✅ | ✅ all 4 types | ✅ | left/right/double/middle |
| drag | ✅ | ✅ `drag` | ✅ | |
| scroll | ✅ | ✅ `scroll` | ✅ | |
| Keyboard Control | PyAutoGUI | SendKeys/P/Invoke | ✅ FULL | |
| type text | ✅ | ✅ `type` | ✅ | |
| key press | ✅ | ✅ `key` | ✅ | Special keys supported |
| hotkey combos | ✅ | ✅ `hotkey` | ✅ | CTRL+C, ALT+TAB, etc. |
| keydown/keyup | ✅ | ✅ both | ✅ | For modifiers |
| UI Automation | UIAutomation | UIAutomationClient | ✅ FULL | |
| find element | ✅ | ✅ `find` | ✅ | By name |
| find all | ✅ | ✅ `findall` | ✅ | Multiple instances |
| find by property | ✅ | ✅ `findby` | ✅ | controltype, class, automationid |
| click element | ✅ | ✅ `uiclick` | ✅ | InvokePattern + fallback |
| waitfor element | ✅ | ✅ `waitfor` | ✅ | Timeout support |
| App Control | | | ✅ FULL | |
| list apps/windows | ✅ | ✅ `apps` | ✅ | With position/size |
| kill process | ✅ | ✅ `kill` | ✅ | By name or title |
| Shell Commands | subprocess | | ⚠️ PARTIAL | Via /run in TUI |
| Telemetry | ✅ | ❌ | 🔵 NOT NEEDED | Privacy-focused |
2. Open-Interface (AmberSahdev/Open-Interface)
| Feature | Open-Interface | OpenQode | Status | Notes |
|---|---|---|---|---|
| Screenshot Capture | Pillow/pyautogui | System.Drawing | ✅ FULL | |
| full screen | ✅ | ✅ `screenshot` | ✅ | |
| region capture | ✅ | ✅ `region` | ✅ | x,y,w,h |
| Visual Feedback Loop | GPT-4V/Gemini | TERMINUS prompt | ⚠️ PARTIAL | See improvements |
| screenshot → LLM → action | ✅ | ⚠️ prompt-based | ⚠️ | No automatic loop |
| course correction | ✅ | ❌ | ❌ MISSING | Needs implementation |
| OCR | pytesseract | (stub) | ⚠️ STUB | Needs Tesseract |
| text recognition | ✅ | Described only | ⚠️ | |
| Color Detection | | | ✅ FULL | |
| get pixel color | ? | ✅ `color` | ✅ | Hex output |
| wait for color | ? | ✅ `waitforcolor` | ✅ | With tolerance |
| Multi-Monitor | Limited | Limited | ⚠️ | Primary only |
3. Browser-Use (browser-use/browser-use)
| Feature | Browser-Use | OpenQode | Status | Notes |
|---|---|---|---|---|
| Browser Launch | Playwright | Start-Process | ✅ FULL | |
| open URL | ✅ | ✅ `browse`, `open` | ✅ | Multiple browsers |
| google search | ✅ | ✅ `googlesearch` | ✅ | Direct URL |
| Page Navigation | Playwright | | ⚠️ PARTIAL | |
| navigate | ✅ | ✅ `playwright navigate` | ⚠️ | Opens in system browser |
| Element Interaction | Playwright | UIAutomation | ⚠️ DIFFERENT | |
| click by selector | ✅ CSS/XPath | ⚠️ Name only | ⚠️ | No CSS/XPath |
| fill form | ✅ | ⚠️ `browsercontrol fill` | ⚠️ | UIAutomation-based |
| Content Extraction | Playwright | | ❌ MISSING | |
| get page content | ✅ | ❌ | ❌ | Needs Playwright |
| get element text | ✅ | ❌ | ❌ | |
| Persistent Session | Playwright | ❌ | ❌ MISSING | No CDP/WebSocket |
| cookies/auth | ✅ | ❌ | ❌ | |
| Multi-Tab | Playwright | ❌ | ❌ MISSING | |
| Agent Loop | Built-in | TUI TERMINUS | ⚠️ PARTIAL | Different architecture |
Missing Features & Implementation Suggestions
🔴 Critical Gaps
- Visual Feedback Loop (Open-Interface Style)
  - Gap: No automatic "take screenshot → analyze → act → repeat" loop
  - Fix: Implement a `/vision-loop` command that:
    - Takes screenshot
    - Sends to vision model (Qwen-VL or GPT-4V)
    - Parses response for actions
    - Executes via `input.ps1`
    - Repeats until goal achieved
  - Credit: AmberSahdev/Open-Interface
- Full OCR Support
  - Gap: OCR is a stub in `input.ps1`
  - Fix: Integrate Windows 10+ OCR API or Tesseract
  - Code from: Windows.Media.Ocr namespace
- Playwright Integration (Real)
  - Gap: `playwright` command just simulates
  - Fix: Create `bin/playwright-bridge.js` that:
    - Launches Chromium with Playwright
    - Exposes WebSocket for commands
    - `input.ps1 playwright` calls this bridge
  - Credit: browser-use/browser-use
- Content Extraction
  - Gap: Cannot read web page content
  - Fix: Use Playwright `page.content()` or clipboard hack
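
The `/vision-loop` fix from the first gap can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not OpenQode's implementation: `captureScreen`, `askVisionModel`, and `runInput` are hypothetical injected functions standing in for the screenshot command, the vision model call, and `input.ps1` execution, and the `{ done, command }` reply shape is also an assumption.

```javascript
// Sketch of a screenshot -> analyze -> act -> repeat loop.
// All three callbacks are hypothetical and supplied by the caller:
//   captureScreen()                      -> screenshot in whatever form the model call accepts
//   askVisionModel(goal, shot, history)  -> { done: boolean, command?: string }
//   runInput(command)                    -> result of running one input.ps1 command
async function visionLoop(goal, { captureScreen, askVisionModel, runInput }, maxSteps = 10) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const shot = await captureScreen();
    const reply = await askVisionModel(goal, shot, history);
    if (reply.done) return { achieved: true, history };
    const result = await runInput(reply.command);
    history.push({ command: reply.command, result });
  }
  // Step budget exhausted without the model reporting success.
  return { achieved: false, history };
}
```

The `maxSteps` cap is a design choice to keep a confused model from clicking indefinitely; the returned `history` doubles as an action log.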
🟡 Enhancement Opportunities
- Course Correction (Open-Interface)
  - After each action, automatically take screenshot and verify success
  - If UI doesn't match expected state, retry or ask for guidance
- CSS/XPath Selectors (Browser-Use)
  - Current `findby` only supports Name, ControlType, Class
  - For web: need Playwright or CDP for CSS selectors
- Multi-Tab Browser Control
  - Use `--remote-debugging-port` to connect via CDP
  - Enable tab switching, new tabs, close tabs
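
For the multi-tab item, Chromium-based browsers started with `--remote-debugging-port` expose a small HTTP API next to the CDP WebSocket. The sketch below builds requests for listing, opening, activating, and closing tabs. The port `9222`, the helper names `cdpRequest` and `cdpTab`, and the use of Node 18+'s global `fetch` are assumptions for illustration; nothing here exists in `input.ps1` today.

```javascript
// Sketch: tab control over the DevTools HTTP endpoints of a browser started
// with --remote-debugging-port=9222 (the port is an assumption).
function cdpRequest(action, arg, port = 9222) {
  const base = `http://127.0.0.1:${port}/json`;
  switch (action) {
    case 'list':     return { method: 'GET', url: `${base}/list` };
    // Recent Chrome versions require PUT for /json/new.
    case 'new':      return { method: 'PUT', url: `${base}/new?${arg}` };
    case 'activate': return { method: 'GET', url: `${base}/activate/${arg}` };
    case 'close':    return { method: 'GET', url: `${base}/close/${arg}` };
    default: throw new Error(`Unknown tab action: ${action}`);
  }
}

// Sends the request with the global fetch (Node 18+) and returns the body text.
async function cdpTab(action, arg, port) {
  const { method, url } = cdpRequest(action, arg, port);
  const res = await fetch(url, { method });
  if (!res.ok) throw new Error(`${action} failed: HTTP ${res.status}`);
  return res.text();
}
```

`cdpTab('list')` returns a JSON array whose entries include each tab's `id`, which the `activate` and `close` actions take as `arg`. CSS selectors and content extraction would still need the WebSocket side of CDP or Playwright.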
Opus 4.5 Improvement Recommendations
1. Natural Language → Action Translation
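The current TERMINUS prompt is complex. Simplify it by mapping simple requests directly to actions:

```javascript
// Decision tree in handleSubmit: map simple requests straight to input.ps1
// commands so they run without an LLM call.
const actionMap = {
  'click start': 'input.ps1 key LWIN',
  'open chrome': 'input.ps1 open chrome.exe',
};

function resolveDirectAction(request) {
  const text = request.trim();
  const key = text.toLowerCase();
  if (Object.hasOwn(actionMap, key)) return actionMap[key];
  // 'google X' maps to 'input.ps1 googlesearch X'
  const search = text.match(/^google\s+(.+)$/i);
  return search ? `input.ps1 googlesearch ${search[1]}` : null;
}
if (isComputerUseRequest) {
  // null means the request still needs AI interpretation
  const command = resolveDirectAction(request);
}
```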
2. Action Confirmation UI
Add visual feedback in TUI when executing:
```
🖱️ Executing: uiclick "Start"
⏳ Waiting for element...
✅ Clicked at (45, 1050)
```
3. Streaming Action Execution
Instead of generating all commands then executing, stream:
- AI generates first command
- TUI executes immediately
- AI generates next based on result
- Repeat
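
The generate, execute, feed-back cycle above can be sketched with an async generator: each `yield` hands one command to the TUI and receives that command's result before the next one is produced. `streamActions`, `demoPlanner`, and `execute` are hypothetical names; in practice the planner would wrap the model call.

```javascript
// Sketch: run each command as soon as it is produced and feed the result
// back to the planner. `planner` is an async generator (hypothetical) that
// yields command strings; `execute` (hypothetical) runs one command.
async function streamActions(planner, execute) {
  const log = [];
  let step = await planner.next();            // AI generates first command
  while (!step.done) {
    const result = await execute(step.value); // TUI executes immediately
    log.push({ command: step.value, result });
    step = await planner.next(result);        // next command sees the result
  }
  return log;
}

// Example planner: opens the Start menu, then types only if that succeeded.
async function* demoPlanner() {
  const opened = yield 'key LWIN';
  if (opened === 'ok') yield 'type notepad';
}
```

Because the planner sees every result, a failed first step stops the sequence instead of typing into the wrong window.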
4. Safety Sandbox
Add /sandbox mode that:
- Shows preview of actions before execution
- Requires confirmation for system-level changes
- Logs all actions for audit
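
A minimal sketch of the three `/sandbox` behaviors, assuming actions are plain `input.ps1` command strings. The set of commands treated as system-level (`kill`, `run`) and the `preview`, `confirm`, and `execute` callbacks are illustrative assumptions, not existing OpenQode APIs.

```javascript
// Commands treated as system-level for this sketch (an assumption).
const SYSTEM_LEVEL = new Set(['kill', 'run']);

// Shows the full action list via preview(), asks confirm() before each
// system-level action, and returns an audit log of executed/skipped actions.
async function runSandboxed(actions, { preview, confirm, execute }) {
  preview(actions);
  const audit = [];
  for (const action of actions) {
    const verb = action.trim().split(/\s+/)[0].toLowerCase();
    const needsConfirm = SYSTEM_LEVEL.has(verb);
    const approved = !needsConfirm || (await confirm(action));
    if (approved) await execute(action);
    audit.push({ action, needsConfirm, status: approved ? 'executed' : 'skipped' });
  }
  return audit;
}
```

Skipped actions are logged as well, so the audit trail records what the agent attempted, not only what ran.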
5. Vision Model Integration
```javascript
// In agent-prompt.mjs, add:
if (activeSkill?.id === 'win-vision') {
  // Attach screenshot to next API call
  const screenshot = await captureScreen();
  context.visionImage = screenshot;
}
```
Attribution Requirements
When committing changes inspired by these projects:
```bash
git commit -m "feat(computer-use): Add visual feedback loop

Inspired by: AmberSahdev/Open-Interface
Credit: https://github.com/AmberSahdev/Open-Interface
License: MIT"

git commit -m "feat(browser): Add Playwright bridge for web automation

Inspired by: browser-use/browser-use
Credit: https://github.com/browser-use/browser-use
License: MIT"
```
Summary
| Module | Completeness | Notes |
|---|---|---|
| Computer Use (Windows-Use) | ✅ 95% | Near-full parity |
| Computer Vision (Open-Interface) | ⚠️ 60% | Missing feedback loop, OCR |
| Browser Use (browser-use) | ⚠️ 50% | Missing Playwright, content extraction |
| Server Management | ✅ 90% | Via PowerShell skills |
Overall: about 75% feature parity, with room for improvement in visual automation and browser control.