feat: Integrated Vision & Robust Translation Layer, Secured Repo (removed keys)
207
.opencode/feature_audit.md
Normal file
@@ -0,0 +1,207 @@
# Computer Use Feature Audit: OpenQode TUI GEN5 🕵️

**Audit Date:** 2025-12-15
**Auditor:** Opus 4.5

---

## Executive Summary

OpenQode TUI GEN5 ships a **comprehensive** `input.ps1` script (1175 lines) that covers **most** features of the three reference projects: Windows-Use, Open-Interface, and Browser-Use. Gaps remain in advanced automation patterns, the visual feedback loop, and persistent browser control.

---

## Feature Comparison Matrix

### 1. Windows-Use (CursorTouch/Windows-Use)

| Feature | Windows-Use | OpenQode | Status | Notes |
|---------|------------|----------|--------|-------|
| **Mouse Control** | PyAutoGUI | P/Invoke | ✅ FULL | Native Win32 API |
| mouse move | ✅ | ✅ `mouse x y` | ✅ | |
| smooth movement | ✅ | ✅ `mousemove` | ✅ | Duration parameter |
| click types | ✅ | ✅ all 4 types | ✅ | left/right/double/middle |
| drag | ✅ | ✅ `drag` | ✅ | |
| scroll | ✅ | ✅ `scroll` | ✅ | |
| **Keyboard Control** | PyAutoGUI | SendKeys/P/Invoke | ✅ FULL | |
| type text | ✅ | ✅ `type` | ✅ | |
| key press | ✅ | ✅ `key` | ✅ | Special keys supported |
| hotkey combos | ✅ | ✅ `hotkey` | ✅ | CTRL+C, ALT+TAB, etc. |
| keydown/keyup | ✅ | ✅ both | ✅ | For modifiers |
| **UI Automation** | UIAutomation | UIAutomationClient | ✅ FULL | |
| find element | ✅ | ✅ `find` | ✅ | By name |
| find all | ✅ | ✅ `findall` | ✅ | Multiple instances |
| find by property | ✅ | ✅ `findby` | ✅ | controltype, class, automationid |
| click element | ✅ | ✅ `uiclick` | ✅ | InvokePattern + fallback |
| waitfor element | ✅ | ✅ `waitfor` | ✅ | Timeout support |
| **App Control** | | | ✅ FULL | |
| list apps/windows | ✅ | ✅ `apps` | ✅ | With position/size |
| kill process | ✅ | ✅ `kill` | ✅ | By name or title |
| **Shell Commands** | subprocess | | ⚠️ PARTIAL | Via `/run` in TUI |
| **Telemetry** | ✅ | ❌ | 🔵 NOT NEEDED | Privacy-focused |

### 2. Open-Interface (AmberSahdev/Open-Interface)

| Feature | Open-Interface | OpenQode | Status | Notes |
|---------|---------------|----------|--------|-------|
| **Screenshot Capture** | Pillow/pyautogui | System.Drawing | ✅ FULL | |
| full screen | ✅ | ✅ `screenshot` | ✅ | |
| region capture | ✅ | ✅ `region` | ✅ | x,y,w,h |
| **Visual Feedback Loop** | GPT-4V/Gemini | TERMINUS prompt | ⚠️ PARTIAL | See improvements |
| screenshot → LLM → action | ✅ | ⚠️ prompt-based | ⚠️ | No automatic loop |
| course correction | ✅ | ❌ | ❌ MISSING | Needs implementation |
| **OCR** | pytesseract | (stub) | ⚠️ STUB | Needs Tesseract |
| text recognition | ✅ | Described only | ⚠️ | |
| **Color Detection** | | | ✅ FULL | |
| get pixel color | ? | ✅ `color` | ✅ | Hex output |
| wait for color | ? | ✅ `waitforcolor` | ✅ | With tolerance |
| **Multi-Monitor** | Limited | Limited | ⚠️ | Primary only |

### 3. Browser-Use (browser-use/browser-use)

| Feature | Browser-Use | OpenQode | Status | Notes |
|---------|-------------|----------|--------|-------|
| **Browser Launch** | Playwright | Start-Process | ✅ FULL | |
| open URL | ✅ | ✅ `browse`, `open` | ✅ | Multiple browsers |
| google search | ✅ | ✅ `googlesearch` | ✅ | Direct URL |
| **Page Navigation** | Playwright | | ⚠️ PARTIAL | |
| navigate | ✅ | ✅ `playwright navigate` | ⚠️ | Opens in system browser |
| **Element Interaction** | Playwright | UIAutomation | ⚠️ DIFFERENT | |
| click by selector | ✅ CSS/XPath | ⚠️ Name only | ⚠️ | No CSS/XPath |
| fill form | ✅ | ⚠️ `browsercontrol fill` | ⚠️ | UIAutomation-based |
| **Content Extraction** | Playwright | | ❌ MISSING | |
| get page content | ✅ | ❌ | ❌ | Needs Playwright |
| get element text | ✅ | ❌ | ❌ | |
| **Persistent Session** | Playwright | ❌ | ❌ MISSING | No CDP/WebSocket |
| cookies/auth | ✅ | ❌ | ❌ | |
| **Multi-Tab** | Playwright | ❌ | ❌ MISSING | |
| **Agent Loop** | Built-in | TUI TERMINUS | ⚠️ PARTIAL | Different architecture |

---

## Missing Features & Implementation Suggestions

### 🔴 Critical Gaps

1. **Visual Feedback Loop (Open-Interface Style)**
   - **Gap:** No automatic "take screenshot → analyze → act → repeat" loop
   - **Fix:** Implement a `/vision-loop` command that:
     1. Takes a screenshot
     2. Sends it to a vision model (Qwen-VL or GPT-4V)
     3. Parses the response for actions
     4. Executes them via `input.ps1`
     5. Repeats until the goal is achieved
   - **Credit:** AmberSahdev/Open-Interface

2. **Full OCR Support**
   - **Gap:** OCR is a stub in `input.ps1`
   - **Fix:** Integrate the Windows 10+ OCR API or Tesseract
   - **Code from:** `Windows.Media.Ocr` namespace

3. **Playwright Integration (Real)**
   - **Gap:** The `playwright` command only simulates Playwright; `playwright navigate` opens the URL in the system browser
   - **Fix:** Create `bin/playwright-bridge.js` that:
     1. Launches Chromium with Playwright
     2. Exposes a WebSocket for commands
     3. Is called by `input.ps1 playwright`
   - **Credit:** browser-use/browser-use

4. **Content Extraction**
   - **Gap:** Cannot read web page content
   - **Fix:** Use Playwright `page.content()` or a clipboard hack

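The five steps proposed for `/vision-loop` can be sketched as a small driver. This is only an illustration: `visionLoop`, `captureScreen`, `askVisionModel`, `runInput`, and `maxSteps` are hypothetical names, and the model reply is assumed to be already parsed into `{ done, command }`.

```javascript
// Hypothetical /vision-loop driver. captureScreen, askVisionModel, and runInput
// are injected placeholders, not existing OpenQode functions.
async function visionLoop(goal, { captureScreen, askVisionModel, runInput, maxSteps = 10 }) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    // 1. Take a screenshot of the current state.
    const screenshot = await captureScreen();
    // 2. Send it to the vision model together with the goal and prior steps.
    const reply = await askVisionModel({ goal, screenshot, history });
    // 5. Stop as soon as the model reports the goal as achieved.
    if (reply.done) return { success: true, steps: history };
    // 3. The reply is assumed to carry one input.ps1 command string.
    const command = reply.command;
    // 4. Execute it and remember the result for the next round.
    const result = await runInput(command);
    history.push({ command, result });
  }
  // Step budget exhausted without the model reporting success.
  return { success: false, steps: history };
}
```

The `maxSteps` cap keeps a confused model from looping indefinitely, and passing `history` back on each round gives the model a chance to correct course after a failed action.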
### 🟡 Enhancement Opportunities

1. **Course Correction (Open-Interface)**
   - After each action, automatically take a screenshot and verify success
   - If the UI doesn't match the expected state, retry or ask for guidance

2. **CSS/XPath Selectors (Browser-Use)**
   - Current `find`/`findby` lookups support only Name, ControlType, Class, and AutomationId
   - For web: need Playwright or CDP for CSS selectors

3. **Multi-Tab Browser Control**
   - Use `--remote-debugging-port` to connect via CDP
   - Enable tab switching, new tabs, close tabs

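For the multi-tab item, one possible starting point is the set of HTTP endpoints that Chrome serves when launched with `--remote-debugging-port`. The sketch below covers listing, switching, and closing tabs only; opening new tabs is left out. The helper name `createTabClient`, the default port 9222, and the injectable `fetchImpl` are assumptions for illustration, and the global `fetch` requires Node 18 or later.

```javascript
// Sketch of tab control over Chrome's DevTools HTTP endpoints.
// Assumes the browser was started with --remote-debugging-port=<port>.
function createTabClient({ port = 9222, fetchImpl = globalThis.fetch } = {}) {
  const base = `http://127.0.0.1:${port}`;
  return {
    // List open targets and keep only real pages (tabs).
    async list() {
      const res = await fetchImpl(`${base}/json/list`);
      const targets = await res.json();
      return targets.filter((t) => t.type === 'page');
    },
    // Bring the tab with the given target id to the foreground.
    async activate(id) {
      await fetchImpl(`${base}/json/activate/${id}`);
    },
    // Close the tab with the given target id.
    async close(id) {
      await fetchImpl(`${base}/json/close/${id}`);
    },
  };
}
```
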
---

## Opus 4.5 Improvement Recommendations

### 1. **Natural Language → Action Translation**
The current TERMINUS prompt is complex. Simple requests can bypass it through a direct lookup:
```javascript
// Decision tree in handleSubmit: map simple requests straight to input.ps1
// commands and skip AI interpretation entirely.
const actionMap = [
  [/^click start$/i, () => 'input.ps1 key LWIN'],
  [/^open chrome$/i, () => 'input.ps1 open chrome.exe'],
  [/^google (.+)$/i, (m) => `input.ps1 googlesearch ${m[1]}`],
];

// Returns the mapped command, or null when the request still needs the LLM.
function mapToAction(request) {
  for (const [pattern, build] of actionMap) {
    const match = request.trim().match(pattern);
    if (match) return build(match);
  }
  return null;
}
// In handleSubmit: if (isComputerUseRequest) and mapToAction(...) is non-null,
// execute immediately without an LLM call.
```

### 2. **Action Confirmation UI**
Add visual feedback in the TUI while an action executes:
```
🖱️ Executing: uiclick "Start"
⏳ Waiting for element...
✅ Clicked at (45, 1050)
```

### 3. **Streaming Action Execution**
Instead of generating every command up front and then executing the batch, stream one step at a time:
1. The AI generates the first command
2. The TUI executes it immediately
3. The AI generates the next command based on the result
4. Repeat

### 4. **Safety Sandbox**
Add a `/sandbox` mode that:
- Shows a preview of actions before execution
- Requires confirmation for system-level changes
- Logs all actions for audit

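The three sandbox behaviours above can be combined in one small runner. This is a sketch under stated assumptions: `runSandboxed`, the `SYSTEM_LEVEL` verb list, and the `preview`/`confirm`/`execute` callbacks are hypothetical, and which `input.ps1` verbs count as system-level is a policy choice the audit does not specify.

```javascript
// Hypothetical /sandbox runner. SYSTEM_LEVEL and the preview/confirm/execute
// callbacks are illustrative assumptions, not existing OpenQode APIs.
const SYSTEM_LEVEL = new Set(['kill', 'open', 'hotkey']);

async function runSandboxed(commands, { preview, confirm, execute }) {
  const log = [];
  // Show the full plan before anything runs.
  preview(commands);
  for (const command of commands) {
    // The first token is treated as the input.ps1 verb, e.g. "kill notepad".
    const verb = command.trim().split(/\s+/)[0];
    const needsConfirm = SYSTEM_LEVEL.has(verb);
    // Only system-level verbs wait for explicit approval.
    const approved = needsConfirm ? await confirm(command) : true;
    if (approved) await execute(command);
    // Record every action, including the ones that were rejected.
    log.push({ command, needsConfirm, approved, at: new Date().toISOString() });
  }
  return log;
}
```

Rejected commands are skipped rather than aborting the run; stopping at the first rejection would be an equally reasonable policy.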
### 5. **Vision Model Integration**
```javascript
// In agent-prompt.mjs, add:
if (activeSkill?.id === 'win-vision') {
  // Attach screenshot to next API call
  const screenshot = await captureScreen();
  context.visionImage = screenshot;
}
```

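How the attached screenshot reaches the model depends on the API in use, which the audit does not specify. As one possibility, assuming an OpenAI-compatible chat endpoint that accepts `image_url` content parts, the image could be wrapped as a base64 data URL. `buildVisionMessage` is a hypothetical helper, and `pngBuffer` is assumed to be a Node.js `Buffer` holding PNG bytes.

```javascript
// Sketch: wrap a PNG screenshot as an OpenAI-style multimodal user message.
// The message shape is an assumption about the target API, not OpenQode's own format.
function buildVisionMessage(prompt, pngBuffer) {
  // Encode the raw PNG bytes as a data URL the API can read inline.
  const dataUrl = `data:image/png;base64,${pngBuffer.toString('base64')}`;
  return {
    role: 'user',
    content: [
      { type: 'text', text: prompt },
      { type: 'image_url', image_url: { url: dataUrl } },
    ],
  };
}
```
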
---

## Attribution Requirements

When committing changes inspired by these projects:

```
git commit -m "feat(computer-use): Add visual feedback loop

Inspired by: AmberSahdev/Open-Interface
Credit: https://github.com/AmberSahdev/Open-Interface
License: MIT"
```

```
git commit -m "feat(browser): Add Playwright bridge for web automation

Inspired by: browser-use/browser-use
Credit: https://github.com/browser-use/browser-use
License: MIT"
```

---

## Summary

| Module | Completeness | Notes |
|--------|-------------|-------|
| **Computer Use (Windows-Use)** | ✅ 95% | Near-full parity; shell commands only via `/run` |
| **Computer Vision (Open-Interface)** | ⚠️ 60% | Missing feedback loop, OCR |
| **Browser Use (browser-use)** | ⚠️ 50% | Missing Playwright, content extraction |
| **Server Management** | ✅ 90% | Via PowerShell skills |

**Overall: ~75% feature parity** (the four rows above average 73.75%), with room for improvement in visual automation and browser control.