CAPTCHAv2 Leaderboard

Compare model performance across different CAPTCHA types

📤 Upload Results

Option A: Using the browser-use Agent Framework

  1. Start the CAPTCHA server:

    python app.py
    

    The server will run on http://127.0.0.1:7860

  2. Run the browser-use agent evaluation (the default LLM is browser-use's in-house model, BU-1.0):

    python -m agent_frameworks.browseruse_cli \
      --url http://127.0.0.1:7860 \
      --llm browser-use

    Or with a different LLM:

    python -m agent_frameworks.browseruse_cli \
      --url http://127.0.0.1:7860 \
      --llm openai \
      --model gpt-4o 
    
  3. The evaluation will automatically save results to benchmark_results.json in the project root. Each puzzle attempt is logged as a JSON object with fields:

    • puzzle_type, puzzle_id, user_answer, correct_answer, correct
    • elapsed_time, timestamp
    • model, provider, agent_framework

Option B: Using Other Agent Frameworks

Follow your framework's evaluation protocol. Ensure results are saved in benchmark_results.json format (JSONL: one JSON object per line) with the same field structure.
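
For reference, here is a minimal sketch of how a custom harness could append records in that format (the helper name and all record values below are illustrative placeholders, not part of this project):

    import json
    from datetime import datetime, timezone

    def log_result(record, path="benchmark_results.json"):
        """Append one puzzle attempt as a single JSON line (JSONL)."""
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    log_result({
        "puzzle_type": "Dice_Count",   # placeholder values throughout
        "puzzle_id": "dice1.png",
        "user_answer": "24",
        "correct_answer": 24,
        "correct": True,
        "elapsed_time": "12.5",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": "my-model",
        "provider": "my-provider",
        "agent_framework": "my-framework",
    })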

Method 1: Convert to CSV Format (Recommended)

Use the provided conversion script (convert_benchmark_to_csv.py in the project root):

python convert_benchmark_to_csv.py benchmark_results.json leaderboard/results.csv
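
If you prefer to build the CSV yourself, the core of the aggregation is a per-type pass rate over the JSONL records. The sketch below is only an illustration under that assumption, not the script's actual implementation, and it omits the Type, Avg Duration (s), and Avg Cost ($) columns:

    import csv
    import json
    from collections import defaultdict

    with open("benchmark_results.json", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

    # passed/total counts per puzzle type
    counts = defaultdict(lambda: [0, 0])
    for r in records:
        counts[r["puzzle_type"]][0] += int(bool(r["correct"]))
        counts[r["puzzle_type"]][1] += 1

    row = {
        "Model": records[0]["model"],
        "Provider": records[0]["provider"],
        "Agent Framework": records[0]["agent_framework"],
        "Overall Pass Rate": round(sum(p for p, _ in counts.values()) / len(records), 3),
    }
    for puzzle_type, (passed, total) in counts.items():
        row[puzzle_type] = round(passed / total, 3)

    with open("leaderboard/results.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)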

Method 2: Directly Upload to Leaderboard (Auto-conversion)

You can upload benchmark_results.json directly here; the system will handle the conversion automatically.

Optionally provide metadata below if auto-detection fails:

  • Model Name (e.g., "gpt-4", "claude-3-sonnet", "bu-1-0")
  • Provider (e.g., "OpenAI", "Anthropic", "browser-use")
  • Agent Framework (e.g., "browser-use", "crewai")

Supported file formats:

  • ✅ benchmark_results.json - Per-puzzle results (JSONL format)
  • ✅ results.csv - Aggregated results (recommended)
  • ✅ JSON files - Single object or array of aggregated results

File format requirements:

For benchmark_results.json (per-puzzle format):

{"puzzle_type": "Dice_Count", "puzzle_id": "dice1.png", "user_answer": "24", "correct_answer": 24, "correct": true, "elapsed_time": "12.5", "timestamp": "2025-01-01T00:00:00Z", "model": "bu-1-0", "provider": "browser-use", "agent_framework": "browser-use"}

For CSV (aggregated format):

  • Required columns: Model, Provider, Agent Framework, Type, Overall Pass Rate, Avg Duration (s), Avg Cost ($), and puzzle type columns (e.g., Dice_Count, Mirror)
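
For illustration only, an aggregated CSV might look like the following; every value is a placeholder meant to show the column layout, and the actual puzzle type columns depend on your run:

    Model,Provider,Agent Framework,Type,Overall Pass Rate,Avg Duration (s),Avg Cost ($),Dice_Count,Mirror
    gpt-4o,OpenAI,browser-use,all,0.62,14.3,0.012,0.75,0.50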
