35 KiB
35 KiB
🦆 GOOSE SUPER - CORE SUBSYSTEMS DETAILED SPECIFICATIONS
Table of Contents
- Playwright Integration
- Browser Use System
- Computer Use System
- Full Vision Capabilities
- Vibe Server Management
1. Playwright Integration
Purpose
Automate web browsers with precision - navigate, click, fill forms, extract data, take screenshots.
Architecture
┌────────────────────────────────────────────────────────────────┐
│ PLAYWRIGHT SUBSYSTEM │
├────────────────────────────────────────────────────────────────┤
│ │
│ User: "Search for hotels in Dubai on Booking.com" │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ IQ Exchange Translation Layer │ │
│ │ → BROWSER action detected │ │
│ │ → Route to Playwright Bridge │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Playwright Bridge (bin/playwright-bridge.js) │ │
│ │ │ │
│ │ Commands: │ │
│ │ navigate <url> → Go to URL │ │
│ │ click <selector|text> → Click element │ │
│ │ fill <selector> <text> → Type in input │ │
│ │ type <text> → Type at cursor │ │
│ │ press <key> → Press key (Enter, Tab, etc) │ │
│ │ wait <selector> <ms> → Wait for element │ │
│ │ screenshot <file> → Capture page │ │
│ │ content → Get page text │ │
│ │ elements → List all elements │ │
│ │ close → Close browser session │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Chromium Browser (Headless or Visible) │ │
│ │ • Persistent session across commands │ │
│ │ • Cookies & auth remembered │ │
│ │ • DevTools access for debugging │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
Current State vs. Enhancement
| Feature | Current | Enhanced |
|---|---|---|
| Session persistence | ✅ Basic | ✅ Full cookie/auth persistence |
| Element selection | Text/CSS only | Smart AI-assisted selector |
| Form filling | Manual fields | Auto-detect all form fields |
| Multi-tab | ❌ Single tab | ✅ Multiple tabs |
| Screenshots | ✅ Basic | ✅ With element annotations |
| DOM extraction | ❌ | ✅ Structured for AI analysis |
Enhanced Playwright Bridge Commands
// NEW: Smart element finding (AI-assisted)
node playwright-bridge.js smart-click "the search button"
node playwright-bridge.js smart-fill "email" "user@example.com"
// NEW: Element discovery for AI
node playwright-bridge.js list-interactive // Lists all clickable/fillable elements
node playwright-bridge.js describe-page // AI-friendly page description
// NEW: Multi-tab
node playwright-bridge.js new-tab
node playwright-bridge.js switch-tab 2
node playwright-bridge.js list-tabs
// NEW: Authentication helpers
node playwright-bridge.js save-cookies "site-name"
node playwright-bridge.js load-cookies "site-name"
// NEW: Form automation
node playwright-bridge.js auto-fill-form '{"email":"x","password":"y"}'
Integration with IQ Exchange
// In lib/iq-exchange.mjs - TASK_PATTERNS
browser: /\b(website|browser|search|navigate|booking\.com|amazon|google|fill.*form|click.*button|login|signup)\b/i
// Translation example:
// User: "Book a hotel in Dubai on Booking.com"
// IQ Exchange generates:
[
'node playwright-bridge.js navigate "https://booking.com"',
'node playwright-bridge.js wait "[name=ss]" 5000',
'node playwright-bridge.js fill "[name=ss]" "Dubai hotels"',
'node playwright-bridge.js click "[type=submit]"',
'node playwright-bridge.js screenshot "search-results.png"'
]
2. Browser Use System
Purpose
Built-in browser panel inside Goose for:
- App Preview - See your code rendered live
- Web Browsing - AI can browse the web inside Goose
- Automation Control - Visual feedback for Playwright actions
Architecture
┌────────────────────────────────────────────────────────────────┐
│ BROWSER USE SYSTEM │
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ BUILT-IN BROWSER (Electron BrowserView) │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────┐ │ │
│ │ │ 🌐 https://localhost:5173 🔄 ◀ ▶ 🔍 🛠️ 📸 │ │ │
│ │ ├───────────────────────────────────────────────────┤ │ │
│ │ │ │ │ │
│ │ │ YOUR APP RENDERS HERE │ │ │
│ │ │ │ │ │
│ │ │ [Live preview of HTML/CSS/JS you write] │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ └───────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Toolbar: │ │
│ │ 🔄 Refresh ◀ Back ▶ Forward 🔍 URL Bar │ │
│ │ 🛠️ DevTools 📸 Screenshot 📍 Element Inspect │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Modes: │
│ [Preview] - Shows your local app (default) │
│ [Browse] - Navigate to any URL │
│ [Automate]- Watch Playwright actions in real-time │
│ │
└────────────────────────────────────────────────────────────────┘
Features
Preview Mode (Vibe Coding)
User: "Make me a landing page"
↓
Goose writes HTML/CSS/JS
↓
Auto-starts Vite dev server (port 5173)
↓
Built-in browser loads http://localhost:5173
↓
User: "Make the header blue"
↓
Goose edits CSS → Hot reload → Preview updates instantly
Browse Mode
User: "Go to GitHub and star the browser-use repo"
↓
Built-in browser navigates to github.com
↓
AI uses Playwright to interact with the page
↓
User can SEE every action happening
Automate Mode (Visible Automation)
When Playwright runs:
1. Built-in browser shows the exact page
2. Red highlight appears on clicked elements
3. Green highlight on filled inputs
4. Status bar shows current action
IPC Handlers (main.cjs)
// Browser panel control
ipcMain.handle('browser:create-view', async (event, options) => {
browserView = new BrowserView({ webPreferences });
mainWindow.addBrowserView(browserView);
return { success: true };
});
ipcMain.handle('browser:navigate', async (event, url) => {
await browserView.webContents.loadURL(url);
return { success: true, url };
});
ipcMain.handle('browser:screenshot', async () => {
const image = await browserView.webContents.capturePage();
return { success: true, base64: image.toDataURL() };
});
ipcMain.handle('browser:execute-js', async (event, script) => {
const result = await browserView.webContents.executeJavaScript(script);
return { success: true, result };
});
ipcMain.handle('browser:get-dom', async () => {
const dom = await browserView.webContents.executeJavaScript(`
Array.from(document.querySelectorAll('a, button, input, select, textarea'))
.map(el => ({
tag: el.tagName,
text: el.textContent?.trim().slice(0, 50),
id: el.id,
name: el.name,
type: el.type,
placeholder: el.placeholder
}))
`);
return { success: true, elements: dom };
});
3. Computer Use System
Purpose
Control the Windows desktop - open apps, click UI elements, type text, take screenshots.
Architecture
┌────────────────────────────────────────────────────────────────┐
│ COMPUTER USE SYSTEM │
├────────────────────────────────────────────────────────────────┤
│ │
│ User: "Open Spotify and play some jazz" │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ IQ Exchange Translation Layer │ │
│ │ → DESKTOP action detected │ │
│ │ → Route to Computer Use module │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────────────────┬─────────────────────┐ │
│ │ input.ps1 (PowerShell) │ computer-use.cjs │ │
│ │ (Current - 66KB!) │ (Node.js native) │ │
│ │ │ │ │
│ │ • UIAutomation via .NET │ • robotjs for speed │ │
│ │ • OCR via Windows API │ • screenshot-desktop│ │
│ │ • Window management │ • Cross-platform │ │
│ └────────────────────────────────────┴─────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Windows Desktop │ │
│ │ • Apps, dialogs, menus │ │
│ │ • Task bar, system tray │ │
│ │ • Any visible UI element │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
Available Commands (input.ps1)
Vision Commands (The Eyes)
# See what's on screen
input.ps1 screenshot "screen.png" # Capture full screen
input.ps1 ocr "full" # Read ALL text on screen
input.ps1 ocr "region" 100 100 500 400 # Read text in region
input.ps1 app_state "Notepad" # Get UI tree of app (buttons, menus, etc.)
Navigation Commands
# Open and focus apps
input.ps1 open "Spotify" # Launch or focus app
input.ps1 startmenu # Open Windows Start menu
input.ps1 focus "Save As" # Focus specific dialog/element
input.ps1 waitfor "Ready" 10 # Wait for text to appear (max 10s)
Interaction Commands (Smart - via UIAutomation)
# Click by element name (RELIABLE!)
input.ps1 uiclick "Play" # Click button named "Play"
input.ps1 uiclick "File" # Click menu item
input.ps1 uipress "Remember me" # Toggle checkbox
input.ps1 uipress "Jazz" # Select list item
# Type text
input.ps1 type "Hello World" # Type into focused element
input.ps1 keyboard "search query" # Alternative typing method
input.ps1 key "Enter" # Press single key
input.ps1 hotkey "Ctrl+S" # Key combination
Fallback Commands (Coordinate-based)
# When UIAutomation fails, use coordinates
input.ps1 mouse 500 300 # Move mouse to x,y
input.ps1 click # Click at current position
input.ps1 drag 100 100 500 500 # Drag from point to point
Example Flow
User: "Open Paint and draw a red circle"
IQ Exchange generates:
1. input.ps1 open "Paint"
2. input.ps1 waitfor "Untitled" 5
3. input.ps1 uiclick "Brushes"
4. input.ps1 uiclick "Red"
5. input.ps1 mouse 400 300
6. input.ps1 drag 400 300 500 400
After each step:
→ Screenshot
→ OCR/app_state to verify
→ If failed → Self-heal and retry
Enhancements Needed
| Current | Enhanced |
|---|---|
| PowerShell scripts (slow startup) | Node.js native with robotjs |
| Basic element finding | Smart element grounding with labels |
| Manual verification | Auto-screenshot after each action |
| English-only OCR | Multi-language OCR support |
4. Full Vision Capabilities
Purpose
Let the AI "see" the screen like a human - understand what's visible, where elements are, and verify actions worked.
Architecture
┌────────────────────────────────────────────────────────────────┐
│ VISION SYSTEM │
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ SCREENSHOT LAYER │ │
│ │ • Full screen capture │ │
│ │ • Window-specific capture │ │
│ │ • Region capture │ │
│ │ • Before/after comparison │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ ANALYSIS LAYER │ │
│ │ │ │
│ │ ┌────────────────┬────────────────┬──────────────────┐ │ │
│ │ │ OCR Engine │ UI Automation │ AI Vision (LLM) │ │ │
│ │ │ │ │ │ │ │
│ │ │ • Read text │ • Element tree │ • Understand │ │ │
│ │ │ • Find words │ • Button names │ context │ │ │
│ │ │ • Coordinates │ • Input fields │ • Verify success │ │ │
│ │ │ of text │ • Menus │ • Describe scene │ │ │
│ │ └────────────────┴────────────────┴──────────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ GROUNDING LAYER (Element Labeling) │ │
│ │ │ │
│ │ Screenshot with overlays: │ │
│ │ ┌─────────────────────────────────────────┐ │ │
│ │ │ [1] File [2] Edit [3] View │ │ │
│ │ │ ┌─────────────────────────────┐ │ │ │
│ │ │ │ [4] Search... │ │ │ │
│ │ │ └─────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ [5] Save [6] Cancel │ │ │
│ │ └─────────────────────────────────────────┘ │ │
│ │ │ │
│ │ AI can say: "Click element [5]" → We click Save button │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
Vision Flow
Before Action
1. Screenshot current state
2. OCR to find text locations
3. UIAutomation to get element names
4. Create "grounding map" of clickable elements
5. Send to AI: "Here's what I see on screen..."
After Action
1. Screenshot new state
2. Compare with previous
3. OCR to verify expected text appeared
4. Send to AI: "Did the action succeed?"
5. If failed → Generate fix → Retry
Grounding Map Example
{
"screenshot": "base64://...",
"elements": [
{ "id": 1, "type": "button", "text": "File", "bounds": [10, 5, 50, 25] },
{ "id": 2, "type": "button", "text": "Edit", "bounds": [60, 5, 100, 25] },
{ "id": 3, "type": "input", "placeholder": "Search...", "bounds": [150, 50, 400, 80] },
{ "id": 4, "type": "button", "text": "Save", "bounds": [300, 500, 380, 530] },
{ "id": 5, "type": "button", "text": "Cancel", "bounds": [400, 500, 480, 530] }
],
"ocr_text": [
{ "text": "Untitled - Notepad", "bounds": [100, 0, 300, 20] },
{ "text": "Hello World", "bounds": [50, 100, 200, 120] }
]
}
AI Prompt with Vision
You are controlling a Windows computer. Here's what you see:
[SCREENSHOT: base64 image]
INTERACTIVE ELEMENTS:
[1] Button: "File" at (10, 5)
[2] Button: "Edit" at (60, 5)
[3] Input: "Search..." at (150, 50)
[4] Button: "Save" at (300, 500)
[5] Button: "Cancel" at (400, 500)
VISIBLE TEXT:
- "Untitled - Notepad" (title bar)
- "Hello World" (document content)
USER REQUEST: "Save this file as hello.txt"
YOUR ACTION:
- To click an element, say: CLICK [4]
- To type, say: TYPE "hello.txt"
- To press key, say: KEY Enter
5. Vibe Server Management
Purpose
"Vibe Server Management" = Server tasks for people who don't know servers.
No SSH commands to memorize. Just tell Goose what you want in plain English.
Architecture
┌────────────────────────────────────────────────────────────────┐
│ VIBE SERVER MANAGEMENT │
├────────────────────────────────────────────────────────────────┤
│ │
│ User: "My website is down, fix it" │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ IQ Exchange Translation Layer │ │
│ │ → SERVER action detected │ │
│ │ → Route to Server Management module │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ SERVER DIAGNOSIS FLOW │ │
│ │ │ │
│ │ 1. Check if we have SSH credentials │ │
│ │ → If not: "I need SSH access. Enter credentials?" │ │
│ │ │ │
│ │ 2. Connect to server │ │
│ │ → ssh root@86.105.224.125 │ │
│ │ │ │
│ │ 3. Diagnose the issue │ │
│ │ → systemctl status nginx │ │
│ │ → pm2 status │ │
│ │ → tail -n 50 /var/log/nginx/error.log │ │
│ │ │ │
│ │ 4. Fix the issue │ │
│ │ → systemctl restart nginx │ │
│ │ → pm2 restart all │ │
│ │ │ │
│ │ 5. Verify fix worked │ │
│ │ → curl http://localhost │ │
│ │ → Report back to user │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
Noob-Friendly Commands (What User Says → What Goose Does)
| User Says | Goose Does |
|---|---|
| "Is my server running?" | ssh → systemctl status nginx → parse output |
| "Restart my website" | pm2 restart all or systemctl restart nginx |
| "Check the logs" | tail -n 100 /var/log/nginx/error.log |
| "Deploy my latest code" | git pull → npm install → npm run build → pm2 restart |
| "My server is slow" | Check CPU/memory, identify resource hogs |
| "Set up SSL" | certbot --nginx -d domain.com |
| "Add a new website" | Create nginx config, set up PM2 process |
| "Backup my database" | pg_dump or mysqldump |
| "My disk is full" | df -h, find large files, clean up safely |
Server Panel UI
┌────────────────────────────────────────────────────────────────┐
│ 🔗 SERVER MANAGEMENT [X] │
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────┐│
│ │ CONNECTIONS [+ Add New] ││
│ │ ───────────────────────────────────────────────────────── ││
│ │ ● Production Server (86.105.224.125) [Connect] ││
│ │ ○ Staging Server (192.168.1.100) [Connect] ││
│ └───────────────────────────────────────────────────────────┘│
│ │
│ ┌───────────────────────────────────────────────────────────┐│
│ │ QUICK ACTIONS ││
│ │ ───────────────────────────────────────────────────────── ││
│ │ [🔄 Restart Website] [📊 Check Status] [📋 View Logs] ││
│ │ [🚀 Deploy Latest] [💾 Backup DB] [🔒 SSL Setup] ││
│ └───────────────────────────────────────────────────────────┘│
│ │
│ ┌───────────────────────────────────────────────────────────┐│
│ │ LIVE TERMINAL ││
│ │ ───────────────────────────────────────────────────────── ││
│ │ root@server:~# systemctl status nginx ││
│ │ ● nginx.service - A high performance web server ││
│ │ Loaded: loaded (/lib/systemd/system/nginx.service) ││
│ │ Active: active (running) since Mon 2025-12-16 10:00 ││
│ │ ││
│ │ $> _ ││
│ └───────────────────────────────────────────────────────────┘│
│ │
│ ┌───────────────────────────────────────────────────────────┐│
│ │ 💬 ASK GOOSE (natural language) ││
│ │ ───────────────────────────────────────────────────────── ││
│ │ "Why is my website slow?" [Ask Goose] ││
│ └───────────────────────────────────────────────────────────┘│
│ │
└────────────────────────────────────────────────────────────────┘
SSH Service Implementation
// services/SSHService.js
const { Client } = require('ssh2');
class SSHService {
constructor() {
this.connections = new Map();
}
async connect(name, config) {
const conn = new Client();
return new Promise((resolve, reject) => {
conn.on('ready', () => {
this.connections.set(name, conn);
resolve({ success: true });
});
conn.on('error', reject);
conn.connect(config);
});
}
async exec(name, command) {
const conn = this.connections.get(name);
if (!conn) throw new Error('Not connected');
return new Promise((resolve, reject) => {
conn.exec(command, (err, stream) => {
if (err) reject(err);
let output = '';
stream.on('data', (data) => output += data);
stream.on('close', () => resolve(output));
});
});
}
// High-level helper methods
async checkWebsiteStatus(name) {
const nginx = await this.exec(name, 'systemctl is-active nginx');
const pm2 = await this.exec(name, 'pm2 jlist');
return { nginx: nginx.trim(), pm2: JSON.parse(pm2) };
}
async restartWebsite(name) {
await this.exec(name, 'pm2 restart all');
await this.exec(name, 'systemctl reload nginx');
return { success: true };
}
async getLogs(name, lines = 50) {
return this.exec(name, `tail -n ${lines} /var/log/nginx/error.log`);
}
async deploy(name, path = '/var/www/app') {
await this.exec(name, `cd ${path} && git pull`);
await this.exec(name, `cd ${path} && npm install`);
await this.exec(name, `cd ${path} && npm run build`);
await this.exec(name, 'pm2 restart all');
return { success: true };
}
}
Saved Server Configurations
// .opencode/servers.json
{
"production": {
"name": "Production Server",
"host": "86.105.224.125",
"port": 22,
"username": "root",
"authType": "password",
"webRoot": "/var/www/myapp",
"pm2App": "myapp"
},
"staging": {
"name": "Staging Server",
"host": "192.168.1.100",
"port": 22,
"username": "deploy",
"authType": "key",
"keyPath": "~/.ssh/id_rsa",
"webRoot": "/var/www/staging"
}
}
Summary: All Five Subsystems
| Subsystem | Purpose | Key Tech |
|---|---|---|
| 1. Playwright | Browser automation | Playwright + persistent sessions |
| 2. Browser Use | Built-in browser panel | Electron BrowserView |
| 3. Computer Use | Desktop automation | input.ps1 + UIAutomation |
| 4. Full Vision | See and verify | OCR + Screenshot + Grounding |
| 5. Vibe Server | Noob-friendly server ops | SSH2 + high-level helpers |
All subsystems connect through IQ Exchange which:
- Detects task type from natural language
- Routes to appropriate subsystem
- Executes with self-healing loop
- Verifies success using vision
- Retries until success or max attempts
This document provides implementation-ready specifications for each core subsystem.