Massive training corpus for AI coding models, containing:

- 10 JSONL training datasets (641+ examples across coding, reasoning, planning, architecture, communication, debugging, security, workflows, error handling, UI/UX)
- 11 agent behavior specifications (explorer, planner, reviewer, debugger, executor, UI designer, Linux admin, kernel engineer, security architect, automation engineer, API architect)
- 6 skill definition files (coding, API engineering, kernel, Linux server, security architecture, server automation, UI/UX)
- Master README with project origin story and philosophy

Built by Pony Alpha 2 to help AI models learn expert-level coding approaches.

# Code Review and Debugging Dataset

This dataset contains **62 examples** of code review and debugging scenarios covering security vulnerabilities, performance issues, error handling, concurrency bugs, and memory leaks in Python, JavaScript/TypeScript, and Go.

## Dataset Format

The dataset is stored as JSONL: one JSON object per line, with the following fields:

- `type`: Either `"code_review"` or `"debugging"`
- `input_code`: The code being reviewed or debugged
- `analysis`: Step-by-step analysis of the code
- `findings`: List of issues, each with a severity level and a CWE reference
- `fix`: The recommended fix for the identified issues

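The field list above can be checked mechanically when loading the file. The following is a minimal sketch, not part of the dataset tooling: the helper names `parse_entry` and `load_dataset` are hypothetical, and the checks cover only what this README states about the format.

```python
import json

# Field names and type values come from the list above.
# The helper names below are hypothetical, chosen for illustration.
REQUIRED_FIELDS = {"type", "input_code", "analysis", "findings", "fix"}
ALLOWED_TYPES = {"code_review", "debugging"}


def parse_entry(line: str) -> dict:
    """Parse one JSONL line and check it against the documented fields."""
    entry = json.loads(line)
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if entry["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unexpected type: {entry['type']!r}")
    if not isinstance(entry["findings"], list):
        raise ValueError("findings must be a list")
    return entry


def load_dataset(path: str) -> list:
    """Read every non-empty line of a JSONL file into a list of entries."""
    with open(path, encoding="utf-8") as handle:
        return [parse_entry(line) for line in handle if line.strip()]
```

Calling `load_dataset` with the path of the `.jsonl` file would return the entries as dictionaries, or raise `ValueError` on the first line that does not match the documented structure.
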
## Coverage

### Security Vulnerabilities

- **SQL Injection**: Direct string concatenation in queries (CWE-89)
- **Cross-Site Scripting (XSS)**: Unescaped output in templates (CWE-79)
- **Command Injection**: User input in shell commands (CWE-78)
- **Path Traversal**: Unvalidated file paths (CWE-22)
- **SSRF**: Unvalidated URL parameters (CWE-918)
- **Missing Authentication**: No auth checks on endpoints (CWE-306)
- **Insecure Session Management**: Unsigned cookies, missing expiration (CWE-613)
- **Weak Cryptography**: MD5, missing salts, insecure modes (CWE-327)
- **Code Injection**: `eval()` and similar dangerous functions (CWE-94)

### Performance Issues

- **String Concatenation**: Quadratic time complexity (CWE-407)
- **N+1 Query Problem**: Sequential database queries (CWE-1050)
- **Unbounded Growth**: Memory leaks in caches, queues, maps (CWE-400)
- **Missing Connection Pooling**: Creating a new connection for every operation (CWE-407)
- **Busy Waiting**: Inefficient polling loops (CWE-842)

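To make the string-concatenation item concrete, here is a small illustrative sketch. It is not taken from the dataset, and the function names are hypothetical. Repeated `+=` on a string may copy the accumulated text on each iteration, while a single `join` is linear in the total length.

```python
def build_report_slow(rows):
    """Repeated += may copy the growing string each time (quadratic worst case)."""
    report = ""
    for row in rows:
        report += row + "\n"
    return report


def build_report_fast(rows):
    """Build the pieces lazily and join once, linear in the total length."""
    return "".join(row + "\n" for row in rows)
```

Both functions return the same string; only the cost of building it differs as the input grows.
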
### Error Handling

- **Silent Failures**: Broad exception catching (CWE-390)
- **Information Disclosure**: Leaking error details (CWE-209)
- **Missing Validation**: No input sanitization (CWE-20)
- **Resource Leaks**: Unclosed files, connections, threads (CWE-772)

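As an illustration of the silent-failure and resource-leak items, the sketch below contrasts a broad `except` that hides every error with a version that catches only the expected exceptions, logs them, and re-raises. The function names and the config-file scenario are assumptions made for the example, not entries from the dataset.

```python
import json
import logging

logger = logging.getLogger(__name__)


def read_config_silent(path):
    """Anti-pattern: every failure is swallowed and the caller only sees None."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except Exception:
        return None


def read_config(path):
    """Catch only the expected errors, log them, and re-raise for the caller."""
    try:
        # The with-block closes the file even when parsing fails.
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except (OSError, json.JSONDecodeError) as exc:
        logger.error("could not read config %s: %s", path, exc)
        raise
```

With `read_config`, a missing file surfaces as `FileNotFoundError` and malformed content as `json.JSONDecodeError`, so the caller can tell the two apart.
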
### Concurrency Bugs

- **Race Conditions**: Unprotected shared state (CWE-362)
- **TOCTOU Issues**: Check-then-act patterns (CWE-367)
- **Deadlocks**: Missing timeout handling (CWE-833)
- **Missing Synchronization**: No locks on shared data (CWE-820)

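The race-condition and missing-synchronization items usually come down to an unguarded read-modify-write on shared state. A minimal sketch of one common remedy, a lock around the update, is shown below. `SafeCounter` and `run_workers` are hypothetical names used only for this illustration.

```python
import threading


class SafeCounter:
    """Counter whose read-modify-write step is guarded by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could read the same old value.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run_workers(counter, workers, per_worker):
    """Increment the counter from several threads and return the final value."""

    def work():
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

Because every increment happens while holding the lock, the final value equals `workers * per_worker` regardless of how the threads are scheduled.
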
### Memory Leaks

- **Unbounded Caches**: No size limits or TTL (CWE-401)
- **Unclosed Resources**: Files, connections, threads (CWE-772)
- **Growing Lists**: No eviction policies (CWE-400)
- **Circular References**: Event listeners, callbacks (CWE-459)

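For the unbounded-cache item, one possible fix is to cap the number of entries and evict the least recently used one. The sketch below assumes a simple in-process cache; the `BoundedCache` class is hypothetical and is not taken from the dataset.

```python
from collections import OrderedDict


class BoundedCache:
    """Small LRU cache that never holds more than max_entries items."""

    def __init__(self, max_entries=128):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        # Evict from the least recently used end until within the limit.
        while len(self._items) > self._max_entries:
            self._items.popitem(last=False)

    def __len__(self):
        return len(self._items)
```

For plain function memoization, the standard library's `functools.lru_cache` with a `maxsize` argument offers similar bounding without a custom class.
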
## Languages Covered

- **Python**: 35+ examples
- **JavaScript/TypeScript**: 15+ examples
- **Go**: 10+ examples

## Example Entry

The entry below is pretty-printed for readability; in the file, each entry occupies a single line.

```json
{
  "type": "code_review",
  "input_code": "def login(username, password):\n    query = \"SELECT * FROM users WHERE username='\" + username + \"' AND password='\" + password + \"'\"\n    cursor.execute(query)\n    return cursor.fetchone()",
  "analysis": "1. The code directly concatenates user input into a SQL query without any sanitization.\n2. This creates a classic SQL injection vulnerability where an attacker can manipulate the query.",
  "findings": [
    {"issue": "SQL Injection Vulnerability", "severity": "CRITICAL", "location": "query construction", "cwe": "CWE-89"},
    {"issue": "Plaintext Password Storage", "severity": "HIGH", "location": "password comparison", "cwe": "CWE-256"}
  ],
  "fix": "def login(username, password):\n    cursor.execute(\"SELECT user_id, password_hash FROM users WHERE username = %s\", (username,))\n    result = cursor.fetchone()\n    if result and verify_password(password, result['password_hash']):\n        return result"
}
```

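The `fix` field above relies on a database cursor and a `verify_password` helper that the entry does not define. The following self-contained sketch shows one way the same idea could look, under stated assumptions: it uses the standard-library `sqlite3` module (whose placeholder is `?` rather than `%s`), an in-memory table, and a PBKDF2-based `verify_password`. The schema, the per-user salt column, and the iteration count are illustrative choices, not part of the dataset.

```python
import hashlib
import hmac
import os
import sqlite3


def hash_password(password, salt):
    """Derive a password hash with PBKDF2-HMAC-SHA256 (illustrative parameters)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def verify_password(password, salt, expected):
    """Compare hashes in constant time to avoid timing side channels."""
    return hmac.compare_digest(hash_password(password, salt), expected)


def login(conn, username, password):
    """Return the user_id on success, or None. The query is parameterized."""
    row = conn.execute(
        "SELECT user_id, salt, password_hash FROM users WHERE username = ?",
        (username,),
    ).fetchone()
    if row and verify_password(password, row[1], row[2]):
        return row[0]
    return None


def make_demo_db():
    """Create an in-memory database with one demo user (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "user_id INTEGER PRIMARY KEY, username TEXT UNIQUE, "
        "salt BLOB, password_hash BLOB)"
    )
    salt = os.urandom(16)
    conn.execute(
        "INSERT INTO users (username, salt, password_hash) VALUES (?, ?, ?)",
        ("alice", salt, hash_password("s3cret", salt)),
    )
    return conn
```

Because the username is passed as a bound parameter, an input such as `' OR '1'='1` is treated as a literal string and matches no user.
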
## Usage

This dataset is suitable for:

- Training code review AI models
- Teaching secure coding practices
- Automated code analysis tools
- Security awareness training
- Bug bounty preparation

## Statistics

- **Total Examples**: 62
- **Code Review**: ~32 examples
- **Debugging**: ~30 examples
- **File Size**: ~75 KB
- **Unique CWEs**: 25+ vulnerability types
- **Languages**: Python, JavaScript, TypeScript, Go

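Figures like these can be recomputed from the file itself. The sketch below is one way to do so, assuming each finding stores its identifier under a `cwe` key as in the example entry. The `summarize` function is a hypothetical helper, not part of the dataset.

```python
import json
from collections import Counter


def summarize(lines):
    """Count entries per type and distinct CWE identifiers in JSONL lines."""
    type_counts = Counter()
    cwes = set()
    total = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        total += 1
        type_counts[entry["type"]] += 1
        for finding in entry.get("findings", []):
            if "cwe" in finding:
                cwes.add(finding["cwe"])
    return {
        "total": total,
        "by_type": dict(type_counts),
        "unique_cwes": len(cwes),
    }
```

Passing an open file handle for the `.jsonl` file to `summarize` yields the total, the per-type counts, and the number of distinct CWE identifiers in one pass.
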
## File Location

```
/c/Users/admin/Pony-Alpha-2-Dataset-Training/datasets/06-code-review-debugging/code-review-debugging.jsonl
```