# Ralph Multi-Agent Orchestration System

## Architecture Overview

The Ralph Multi-Agent Orchestration System enables running 10+ Claude instances in parallel with intelligent coordination, conflict resolution, and real-time observability.

```
┌─────────────────────────────────────────────┐
│           Meta-Agent Orchestrator           │
│           (ralph-integration.py)            │
│  - Analyzes requirements                    │
│  - Breaks into independent tasks            │
│  - Manages dependencies                     │
│  - Coordinates worker agents                │
└──────────────────┬──────────────────────────┘
                   │ Creates tasks
                   ▼
┌─────────────────────────────────────────────┐
│             Task Queue (Redis)              │
│         Stores and distributes work         │
└─────┬───────┬───────┬───────┬───────────────┘
      │       │       │       │
      ▼       ▼       ▼       ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Agent 1 │ │ Agent 2 │ │ Agent 3 │ │ Agent N │
│Frontend │ │ Backend │ │  Tests  │ │  Docs   │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
      │       │       │       │
      └───────┴───────┴───────┘
                  │
                  ▼
         ┌──────────────────┐
         │  Observability   │
         │    Dashboard     │
         │  (Real-time UI)  │
         └──────────────────┘
```

## Core Components

### 1. Meta-Agent Orchestrator

The meta-agent is Ralph running in orchestration mode where it manages other agents instead of writing code directly.

**Key Responsibilities:**

- Analyze project requirements
- Break down into parallelizable tasks
- Manage task dependencies
- Spawn and coordinate worker agents
- Monitor progress and handle conflicts
- Aggregate results

**Configuration:**

```bash
# Enable multi-agent mode
RALPH_MULTI_AGENT=true
RALPH_MAX_WORKERS=12
RALPH_TASK_QUEUE_HOST=localhost
RALPH_TASK_QUEUE_PORT=6379
RALPH_OBSERVABILITY_PORT=3001
```

### 2. Task Queue System

Uses Redis for reliable task distribution and state management.

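
Before tasks reach the queue, the orchestrator has to work out which of them are ready to run and which must wait on dependencies. The following is a minimal sketch of that step, assuming each task is identified by an ID and lists the IDs it depends on. It uses Python's standard `graphlib` module; the function name `ready_waves` and the task IDs are hypothetical and not taken from `ralph-integration.py`.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def ready_waves(dependencies: Dict[str, List[str]]) -> List[List[str]]:
    """Group task IDs into waves whose members can run in parallel.

    `dependencies` maps each task ID to the IDs it depends on.
    (Hypothetical helper for illustration only.)
    """
    sorter = TopologicalSorter(dependencies)
    # prepare() raises graphlib.CycleError if the dependencies are circular
    sorter.prepare()

    waves: List[List[str]] = []
    while sorter.is_active():
        # Every task returned here has all of its dependencies satisfied
        wave = sorted(sorter.get_ready())
        waves.append(wave)
        # Pretend the whole wave finished so the next wave is unlocked
        sorter.done(*wave)
    return waves


if __name__ == "__main__":
    example = {
        "analyze-1": [],
        "refactor-buttons": ["analyze-1"],
        "refactor-forms": ["analyze-1"],
        "update-tests-buttons": ["refactor-buttons"],
    }
    for number, wave in enumerate(ready_waves(example), start=1):
        print(f"wave {number}: {wave}")
```

In a real run the orchestrator would presumably call `done()` for each task as its worker reports completion, not for a whole wave at once, so that newly unblocked tasks enter the queue as early as possible.
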

**Task Structure:**

```json
{
  "id": "unique-task-id",
  "type": "frontend|backend|testing|docs|refactor|analysis",
  "description": "What needs to be done",
  "dependencies": ["task-id-1", "task-id-2"],
  "files": ["path/to/file1.ts", "path/to/file2.ts"],
  "priority": 5,
  "specialization": "optional-specific-agent-type",
  "timeout": 300,
  "retry_count": 0,
  "max_retries": 3
}
```

`type` takes exactly one of the listed values, and `priority` is an integer from 1 to 10 (shown here as `5`).

**Queue Operations:**

- `claude_tasks` - Main task queue
- `claude_tasks:pending` - Tasks waiting for dependencies
- `claude_tasks:complete` - Completed tasks
- `claude_tasks:failed` - Failed tasks for retry
- `lock:{file_path}` - File-level locks
- `task:{task_id}` - Task status tracking

### 3. Specialized Worker Agents

Each worker agent has a specific role and configuration.

**Agent Types:**

| Agent Type | Specialization | Example Tasks |
|------------|----------------|---------------|
| **Frontend** | UI/UX, React, Vue, Svelte | Component refactoring, styling |
| **Backend** | APIs, databases, services | Endpoint creation, data models |
| **Testing** | Unit tests, integration tests | Test writing, coverage improvement |
| **Documentation** | Docs, comments, README | API docs, inline documentation |
| **Refactor** | Code quality, optimization | Performance tuning, code cleanup |
| **Analysis** | Code review, architecture | Dependency analysis, security audit |

**Worker Configuration:**

```json
{
  "agent_id": "agent-frontend-1",
  "specialization": "frontend",
  "max_concurrent_tasks": 1,
  "file_lock_timeout": 300,
  "heartbeat_interval": 10,
  "log_level": "info"
}
```

### 4. File Locking & Conflict Resolution

Prevents multiple agents from modifying the same file simultaneously.

**Lock Acquisition Flow:**

1. Agent requests locks for required files
2. Redis attempts to set lock keys with NX flag
3. If all locks acquired, agent proceeds
4. If any lock fails, agent waits and retries
5. Locks auto-expire after timeout (safety mechanism)

**Conflict Detection:**

```python
from typing import Any, Dict, List

# A conflict record: {"file": <path>, "agents": [<agent ids>]}
Conflict = Dict[str, Any]


def detect_conflicts(agent_files: Dict[str, List[str]]) -> List[Conflict]:
    """Detect file access conflicts between agents."""
    # Map each file path to the agents that want to touch it
    file_agents: Dict[str, List[str]] = {}
    for agent_id, files in agent_files.items():
        for file_path in files:
            file_agents.setdefault(file_path, []).append(agent_id)

    # A file claimed by more than one agent is a conflict
    return [
        {"file": f, "agents": agents}
        for f, agents in file_agents.items()
        if len(agents) > 1
    ]
```

**Resolution Strategies:**

1. **Dependency-based ordering** - Add dependencies between conflicting tasks
2. **File splitting** - Break tasks into smaller units
3. **Agent specialization** - Assign conflicting tasks to same agent
4. **Merge coordination** - Use git merge strategies

### 5. Real-Time Observability Dashboard

WebSocket-based dashboard for monitoring all agents in real time.

**Dashboard Features:**

- Live agent status (active, busy, idle, error)
- Task progress tracking
- File modification visualization
- Conflict alerts and resolution
- Activity stream with timestamps
- Performance metrics

**WebSocket Events:**

```javascript
// Agent update
{
  "type": "agent_update",
  "agent": {
    "id": "agent-frontend-1",
    "status": "active",
    "currentTask": "refactor-buttons",
    "progress": 65,
    "workingFiles": ["components/Button.tsx"],
    "completedCount": 12
  }
}

// Conflict detected
{
  "type": "conflict",
  "conflict": {
    "file": "components/Button.tsx",
    "agents": ["agent-frontend-1", "agent-frontend-2"],
    "timestamp": "2025-08-02T15:30:00Z"
  }
}

// Task completed
{
  "type": "task_complete",
  "taskId": "refactor-buttons",
  "agentId": "agent-frontend-1",
  "duration": 45.2,
  "filesModified": ["components/Button.tsx", "components/Button.test.tsx"]
}
```

## Usage Examples

### Example 1: Frontend Refactor

```bash
# Start multi-agent Ralph for frontend refactor
RALPH_MULTI_AGENT=true \
RALPH_MAX_WORKERS=8 \
/ralph "Refactor all components from class to functional with hooks"
```

**Meta-Agent Breakdown:**

```json
[
  {
    "id": "analyze-1",
    "type": "analysis",
    "description": "Scan all components and create refactoring plan",
    "dependencies": [],
    "files": []
  },
  {
    "id": "refactor-buttons",
    "type": "frontend",
    "description": "Convert all Button components to functional",
    "dependencies": ["analyze-1"],
    "files": ["components/Button/*.tsx"]
  },
  {
    "id": "refactor-forms",
    "type": "frontend",
    "description": "Convert all Form components to functional",
    "dependencies": ["analyze-1"],
    "files": ["components/Form/*.tsx"]
  },
  {
    "id": "update-tests-buttons",
    "type": "testing",
    "description": "Update Button component tests",
    "dependencies": ["refactor-buttons"],
    "files": ["__tests__/Button/*.test.tsx"]
  }
]
```

### Example 2: Full-Stack Feature

```bash
# Build feature with parallel frontend/backend
RALPH_MULTI_AGENT=true \
RALPH_MAX_WORKERS=6 \
/ralph "Build user authentication with OAuth, profile management, and email verification"
```

**Parallel Execution:**

- Agent 1 (Frontend): Build login form UI
- Agent 2 (Frontend): Build profile page UI
- Agent 3 (Backend): Implement OAuth endpoints
- Agent 4 (Backend): Implement profile API
- Agent 5 (Testing): Write integration tests
- Agent 6 (Docs): Write API documentation

### Example 3: Codebase Optimization

```bash
# Parallel optimization across codebase
RALPH_MULTI_AGENT=true \
RALPH_MAX_WORKERS=10 \
/ralph "Optimize performance: bundle size, lazy loading, image optimization, caching strategy"
```

## Environment Variables

```bash
# Multi-Agent Configuration
RALPH_MULTI_AGENT=true                  # Enable multi-agent mode
RALPH_MAX_WORKERS=12                    # Maximum worker agents
RALPH_MIN_WORKERS=2                     # Minimum worker agents

# Task Queue (Redis)
RALPH_TASK_QUEUE_HOST=localhost         # Redis host
RALPH_TASK_QUEUE_PORT=6379              # Redis port
RALPH_TASK_QUEUE_DB=0                   # Redis database
RALPH_TASK_QUEUE_PASSWORD=              # Redis password (optional)

# Observability
RALPH_OBSERVABILITY_ENABLED=true        # Enable dashboard
RALPH_OBSERVABILITY_PORT=3001           # WebSocket port
RALPH_OBSERVABILITY_HOST=localhost      # Dashboard host

# Agent Behavior
RALPH_AGENT_TIMEOUT=300                 # Task timeout (seconds)
RALPH_AGENT_HEARTBEAT=10                # Heartbeat interval (seconds)
RALPH_FILE_LOCK_TIMEOUT=300             # File lock timeout (seconds)
RALPH_MAX_RETRIES=3                     # Task retry count

# Logging
RALPH_VERBOSE=true                      # Verbose logging
RALPH_LOG_LEVEL=info                    # Log level
RALPH_LOG_FILE=.ralph/multi-agent.log   # Log file path
```

## Monitoring & Debugging

### Check Multi-Agent Status

```bash
# View active agents
redis-cli keys "agent:*"

# View task queue
redis-cli lrange claude_tasks 0 10

# View file locks
redis-cli keys "lock:*"

# View task status
redis-cli hgetall "task:task-id"

# View completed tasks
redis-cli lrange claude_tasks:complete 0 10
```

### Observability Dashboard

Access the dashboard at: `http://localhost:3001`

**Dashboard Sections:**

1. **Mission Status** - Overall progress
2. **Agent Grid** - Individual agent status
3. **Conflict Alerts** - Active file conflicts
4. **Activity Stream** - Real-time event log
5. **Performance Metrics** - Agent efficiency

## Best Practices

### 1. Task Design

- Keep tasks independent when possible
- Minimize cross-task file dependencies
- Use specialization to guide agent assignment
- Set appropriate timeouts

### 2. Dependency Management

- Use topological sort for execution order
- Minimize dependency depth
- Allow parallel execution at every opportunity
- Handle circular dependencies gracefully

### 3. Conflict Prevention

- Group related file modifications in a single task
- Use file-specific agents when conflicts are likely
- Implement merge strategies for common conflicts
- Monitor lock acquisition time

### 4. Observability

- Log all agent activities
- Track file modifications in real time
- Alert on conflicts immediately
- Maintain activity history for debugging

### 5. Error Handling

- Implement retry logic with exponential backoff
- Quarantine failing tasks for analysis
- Provide detailed error context
- Allow manual intervention when needed

## Troubleshooting

### Common Issues

**Agents stuck waiting:**

```bash
# Check for stale locks
redis-cli keys "lock:*"

# Clear stale locks
redis-cli del "lock:path/to/file"
```

**Tasks not executing:**

```bash
# Check task queue
redis-cli lrange claude_tasks 0 -1

# Check pending tasks
redis-cli lrange claude_tasks:pending 0 -1
```

**Dashboard not updating:**

```bash
# Check WebSocket server
netstat -an | grep 3001

# Restart observability server
pkill -f ralph-observability
RALPH_OBSERVABILITY_ENABLED=true ralph-observability
```

## Performance Tuning

### Optimize Worker Count

```bash
# Number of CPU cores (nproc is GNU coreutils; on macOS use: sysctl -n hw.ncpu)
CPU_CORES=$(nproc)

# Pick ONE of the following, depending on the workload.

# General starting point: (cores * 1.5) - 1, rounded down by integer arithmetic
WORKERS=$(( CPU_CORES * 3 / 2 - 1 ))

# For I/O bound tasks
WORKERS=$(( CPU_CORES * 2 ))

# For CPU bound tasks
WORKERS=$CPU_CORES
```

### Redis Configuration

```bash
# redis.conf
maxmemory 2gb
# Note: allkeys-lru may evict any key under memory pressure, including
# queue, lock, and task keys. Consider noeviction if task state must not be lost.
maxmemory-policy allkeys-lru
timeout 300
tcp-keepalive 60
```

### Agent Pool Sizing

```bash
# Dynamic scaling based on queue depth
QUEUE_DEPTH=$(redis-cli llen claude_tasks)

if [ "$QUEUE_DEPTH" -gt 50 ]; then
  SCALE_UP=true
elif [ "$QUEUE_DEPTH" -lt 10 ]; then
  SCALE_DOWN=true
fi
```

## Security Considerations

1. **File Access Control** - Restrict agent file system access
2. **Redis Authentication** - Use Redis password in production
3. **Network Isolation** - Run agents in isolated network
4. **Resource Limits** - Set CPU/memory limits per agent
5. **Audit Logging** - Log all agent actions for compliance

## Integration with Claude Code

The Ralph Multi-Agent System can be used with Claude Code projects:

```bash
# Use with Claude Code projects
export RALPH_AGENT=claude
export RALPH_MULTI_AGENT=true

cd /path/to/claude-code-project
/ralph "Refactor authentication system"
```

**Claude Code Integration Points:**

- Uses Claude Code agent pool
- Respects Claude Code project structure
- Integrates with Claude Code hooks
- Supports Claude Code tool ecosystem