# Code Review and Debugging Dataset

This dataset contains **62 examples** of code review and debugging scenarios covering security vulnerabilities, performance issues, error handling, concurrency bugs, and memory leaks across multiple programming languages.

## Dataset Format

The dataset is stored in JSONL format: one JSON object per line, with the following fields:

- `type`: Either "code_review" or "debugging"
- `input_code`: The code being reviewed or debugged
- `analysis`: Step-by-step analysis of the code
- `findings`: List of issues with severity levels and CWE references
- `fix`: The recommended fix for the identified issues

## Coverage

### Security Vulnerabilities

- **SQL Injection**: Direct string concatenation in queries (CWE-89)
- **Cross-Site Scripting (XSS)**: Unescaped output in templates (CWE-79)
- **Command Injection**: User input in shell commands (CWE-78)
- **Path Traversal**: Unvalidated file paths (CWE-22)
- **SSRF**: Unvalidated URL parameters (CWE-918)
- **Missing Authentication**: No auth checks on endpoints (CWE-306)
- **Insecure Session Management**: Unsigned cookies, missing expiration (CWE-613)
- **Weak Cryptography**: MD5, missing salts, insecure modes (CWE-327)
- **Code Injection**: `eval()` and similar dangerous functions (CWE-94)

### Performance Issues

- **String Concatenation**: Quadratic time complexity (CWE-407)
- **N+1 Query Problem**: Sequential database queries (CWE-1050)
- **Unbounded Growth**: Memory leaks in caches, queues, maps (CWE-400)
- **Missing Connection Pooling**: Creating a new connection per request (CWE-407)
- **Busy Waiting**: Inefficient polling loops (CWE-842)

### Error Handling

- **Silent Failures**: Broad exception catching (CWE-390)
- **Information Disclosure**: Leaking error details (CWE-209)
- **Missing Validation**: No input sanitization (CWE-20)
- **Resource Leaks**: Unclosed files, connections, threads (CWE-772)

### Concurrency Bugs

- **Race Conditions**: Unprotected shared state (CWE-362)
- **TOCTOU Issues**: Check-then-act patterns (CWE-367)
- **Deadlocks**: Missing timeout handling (CWE-833)
- **Missing Synchronization**: No locks on shared data (CWE-820)

### Memory Leaks

- **Unbounded Caches**: No size limits or TTL (CWE-401)
- **Unclosed Resources**: Files, connections, threads (CWE-772)
- **Growing Lists**: No eviction policies (CWE-400)
- **Circular References**: Event listeners, callbacks (CWE-459)

## Languages Covered

- **Python**: 35+ examples
- **JavaScript/TypeScript**: 15+ examples
- **Go**: 10+ examples

## Example Entry

The entry below is pretty-printed for readability; in the file, each entry occupies a single line.

```json
{
  "type": "code_review",
  "input_code": "def login(username, password):\n    query = \"SELECT * FROM users WHERE username='\" + username + \"' AND password='\" + password + \"'\"\n    cursor.execute(query)\n    return cursor.fetchone()",
  "analysis": "1. The code directly concatenates user input into a SQL query without any sanitization.\n2. This creates a classic SQL injection vulnerability where an attacker can manipulate the query.",
  "findings": [
    {"issue": "SQL Injection Vulnerability", "severity": "CRITICAL", "location": "query construction", "cwe": "CWE-89"},
    {"issue": "Plaintext Password Storage", "severity": "HIGH", "location": "password comparison", "cwe": "CWE-256"}
  ],
  "fix": "def login(username, password):\n    cursor.execute(\"SELECT user_id, password_hash FROM users WHERE username = %s\", (username,))\n    result = cursor.fetchone()\n    if result and verify_password(password, result['password_hash']):\n        return result"
}
```

## Usage

This dataset is suitable for:

- Training code review AI models
- Teaching secure coding practices
- Developing automated code analysis tools
- Security awareness training
- Bug bounty preparation

## Statistics

- **Total Examples**: 62
- **Code Review**: ~32 examples
- **Debugging**: ~30 examples
- **File Size**: ~75KB
- **Unique CWEs**: 25+ vulnerability types
- **Languages**: Python, JavaScript, TypeScript, Go

## File Location

```
/c/Users/admin/Pony-Alpha-2-Dataset-Training/datasets/06-code-review-debugging/code-review-debugging.jsonl
```
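
Because the file is JSONL, it can be read one line at a time without loading everything into memory. The following is a minimal sketch of parsing entries and tallying them by `type`, `cwe`, and `severity`. The field names come from the Dataset Format section; the helper names `iter_examples` and `summarize` and the two-entry `SAMPLE` are illustrative assumptions, not part of the dataset.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_examples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed entry per non-empty JSONL line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc})") from exc


def summarize(examples: Iterable[dict]) -> dict:
    """Count entries by `type` and findings by `cwe` and `severity`."""
    by_type, by_cwe, by_severity = Counter(), Counter(), Counter()
    for example in examples:
        by_type[example.get("type", "unknown")] += 1
        for finding in example.get("findings", []):
            by_cwe[finding.get("cwe", "unknown")] += 1
            by_severity[finding.get("severity", "unknown")] += 1
    return {"type": by_type, "cwe": by_cwe, "severity": by_severity}


# Hypothetical two-entry sample standing in for the real file.
# The second entry is made up purely for illustration.
SAMPLE = "\n".join([
    json.dumps({
        "type": "code_review",
        "input_code": "...",
        "analysis": "...",
        "findings": [
            {"issue": "SQL Injection Vulnerability", "severity": "CRITICAL",
             "location": "query construction", "cwe": "CWE-89"},
            {"issue": "Plaintext Password Storage", "severity": "HIGH",
             "location": "password comparison", "cwe": "CWE-256"},
        ],
        "fix": "...",
    }),
    json.dumps({
        "type": "debugging",
        "input_code": "...",
        "analysis": "...",
        "findings": [
            {"issue": "Race Condition", "severity": "HIGH",
             "location": "shared counter", "cwe": "CWE-362"},
        ],
        "fix": "...",
    }),
])

summary = summarize(iter_examples(SAMPLE.splitlines()))
print(summary["type"])
print(summary["cwe"].most_common(3))
```

To run the same tally against the real file, open it with `open(path, encoding="utf-8")` and pass the file handle to `iter_examples` in place of `SAMPLE.splitlines()`; a file object iterates line by line, so no other change is needed.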