CAPTCHAv2 Leaderboard

Compare model performance across different CAPTCHA types

📤 Upload Results

Option A: Using the browser-use Agent Framework

  1. Start the CAPTCHA server:

    python app.py
    

    The server will run on http://127.0.0.1:7860

  2. Run the browser-use agent evaluation (the default LLM is browser-use's in-house model, BU-1.0):

    python -m agent_frameworks.browseruse_cli \
      --url http://127.0.0.1:7860 \
      --llm browser-use

    Or with a different LLM:

    python -m agent_frameworks.browseruse_cli \
      --url http://127.0.0.1:7860 \
      --llm openai \
      --model gpt-4o 
    
  3. The evaluation will automatically save results to benchmark_results.json in the project root. Each puzzle attempt is logged as a JSON object with fields:

    • puzzle_type, puzzle_id, user_answer, correct_answer, correct
    • elapsed_time, timestamp
    • model, provider, agent_framework

Option B: Using Other Agent Frameworks

Follow your framework's evaluation protocol. Ensure results are saved in benchmark_results.json format (JSONL: one JSON object per line) with the same field structure.
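
For reference, here is a minimal sketch of how a custom harness could append records in that format (the helper name and all record values below are illustrative placeholders, not part of this project):

    import json
    from datetime import datetime, timezone

    def log_result(record, path="benchmark_results.json"):
        """Append one puzzle attempt as a single JSON line (JSONL)."""
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    log_result({
        "puzzle_type": "Dice_Count",   # placeholder values throughout
        "puzzle_id": "dice1.png",
        "user_answer": "24",
        "correct_answer": 24,
        "correct": True,
        "elapsed_time": "12.5",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": "my-model",
        "provider": "my-provider",
        "agent_framework": "my-framework",
    })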

Method 1: Convert to CSV Format (Recommended)

Use the provided conversion script (convert_benchmark_to_csv.py in the project root):

python convert_benchmark_to_csv.py benchmark_results.json leaderboard/results.csv
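
If you prefer to build the CSV yourself, the core of the aggregation is a per-type pass rate over the JSONL records. The sketch below is only an illustration under that assumption, not the script's actual implementation, and it omits the Type, Avg Duration (s), and Avg Cost ($) columns:

    import csv
    import json
    from collections import defaultdict

    with open("benchmark_results.json", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    # passed/total counts per puzzle type
    counts = defaultdict(lambda: [0, 0])
    for r in records:
        counts[r["puzzle_type"]][0] += int(bool(r["correct"]))
        counts[r["puzzle_type"]][1] += 1

    row = {
        "Model": records[0]["model"],
        "Provider": records[0]["provider"],
        "Agent Framework": records[0]["agent_framework"],
        "Overall Pass Rate": round(sum(p for p, _ in counts.values()) / len(records), 3),
    }
    for puzzle_type, (passed, total) in counts.items():
        row[puzzle_type] = round(passed / total, 3)

    with open("leaderboard/results.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)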

Method 2: Directly Upload to Leaderboard (Auto-conversion)

You can upload benchmark_results.json directly here; the system will handle the conversion automatically.

Optionally provide metadata below if auto-detection fails:

  • Model Name (e.g., "gpt-4", "claude-3-sonnet", "bu-1-0")
  • Provider (e.g., "OpenAI", "Anthropic", "browser-use")
  • Agent Framework (e.g., "browser-use", "crewai")

Supported file formats:

  • ✅ benchmark_results.json - Per-puzzle results (JSONL format)
  • ✅ results.csv - Aggregated results (recommended)
  • ✅ JSON files - Single object or array of aggregated results

File format requirements:

For benchmark_results.json (per-puzzle format):

{"puzzle_type": "Dice_Count", "puzzle_id": "dice1.png", "user_answer": "24", "correct_answer": 24, "correct": true, "elapsed_time": "12.5", "timestamp": "2025-01-01T00:00:00Z", "model": "bu-1-0", "provider": "browser-use", "agent_framework": "browser-use"}

For CSV (aggregated format):

  • Required columns: Model, Provider, Agent Framework, Type, Overall Pass Rate, Avg Duration (s), Avg Cost ($), and puzzle type columns (e.g., Dice_Count, Mirror)
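
For illustration only, an aggregated CSV might look like the following; every value is a placeholder meant to show the column layout, and the actual puzzle type columns depend on your run:

    Model,Provider,Agent Framework,Type,Overall Pass Rate,Avg Duration (s),Avg Cost ($),Dice_Count,Mirror
    gpt-4o,OpenAI,browser-use,all,0.62,14.3,0.012,0.75,0.50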
