Models Tested: -
Tasks: 200
Difficulty Tiers: 4

Leaderboard

Leaderboard columns: # | Model | LCB Score (0-100) | Structural Completeness | Documentation Quality | Behavioral Fidelity | T1 (Basic) | T4 (Enterprise). Results load dynamically on the live page.
Scoring: LCB Score = 30% Structural Completeness + 20% Documentation Quality + 50% Behavioral Fidelity. High BF means the AI's documentation accurately describes what the code actually does.

What We Measure

If AI truly understands code, you should be able to recreate it from the documentation. That's what we test.

Structural Completeness 30%

Static analysis extracts all business rules, data structures, control flow, and external calls. We check if the AI documented each one.
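
As a rough illustration, structural completeness can be viewed as the fraction of extracted elements that the documentation mentions. The Python sketch below is an assumption about the shape of that check (the CodeElement type and substring matching are illustrative), not the benchmark's actual matcher.

# Illustrative sketch only -- not the benchmark's actual implementation.
from dataclasses import dataclass

@dataclass
class CodeElement:
    kind: str        # e.g. "business_rule", "data_structure", "external_call"
    identifier: str  # name or short description produced by static analysis

def structural_completeness(elements: list[CodeElement], documentation: str) -> float:
    """Fraction of extracted elements that the documentation mentions (0.0-1.0)."""
    if not elements:
        return 1.0
    doc = documentation.lower()
    covered = sum(1 for e in elements if e.identifier.lower() in doc)
    return covered / len(elements)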

Documentation Quality 20%

Algorithmic assessment of structure, readability, traceability, and abstraction level. No LLM-as-judge required.
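
Deterministic heuristics of this kind are easy to sketch; the signals below (section headings, sentence length, identifier references) are illustrative stand-ins, not the scorer's actual criteria or weights.

# Illustrative sketch only -- hypothetical heuristics, no LLM-as-judge.
import re

def documentation_quality(documentation: str, code_identifiers: set[str]) -> float:
    """Average of three 0-1 heuristics: structure, readability, traceability."""
    # Structure: does the documentation use any headings at all?
    structure = 1.0 if re.search(r"^\s*#+\s+\S", documentation, re.M) else 0.0

    # Readability: penalize very long sentences (a crude proxy).
    sentences = [s for s in re.split(r"[.!?]\s+", documentation) if s.strip()]
    avg_words = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    readability = 1.0 if avg_words <= 25 else max(0.0, 1.0 - (avg_words - 25) / 25)

    # Traceability: fraction of code identifiers the documentation refers back to.
    doc = documentation.lower()
    mentioned = sum(1 for ident in code_identifiers if ident.lower() in doc)
    traceability = mentioned / max(len(code_identifiers), 1)

    return (structure + readability + traceability) / 3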

Behavioral Fidelity 50%

Claim verification via execution (with Behavioral Specification Matching fallback for infrastructure). Documentation must accurately describe what the code actually does, verified through test generation for logic and pattern matching for dependencies.
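
Conceptually, BF is the fraction of documented claims that survive verification. The outline below is hypothetical; the two callables stand in for the benchmark's real test-generation and specification-matching machinery.

# Illustrative sketch only -- the callables are stand-ins, not the real API.
from typing import Callable, Optional

def behavioral_fidelity(
    claims: list[str],
    code: str,
    run_generated_test: Optional[Callable[[str, str], bool]] = None,
    match_specification: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Fraction of documentation claims verified against the code (0.0-1.0)."""
    if not claims:
        return 0.0
    verified = 0
    for claim in claims:
        if run_generated_test is not None:
            # Preferred path: derive a test from the claim and execute it.
            verified += int(run_generated_test(claim, code))
        elif match_specification is not None:
            # Fallback path: pattern-match the claim against the code when
            # it cannot be verified by execution.
            verified += int(match_specification(claim, code))
    return verified / len(claims)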

LCB Score = (0.30 × SC) + (0.20 × DQ) + (0.50 × BF)
Critical failures (hallucinated functions, incorrect business rules) result in a score of 0 for that task.
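
Written out directly (assuming all components are reported on the same 0-100 scale as the LCB Score):

def lcb_score(sc: float, dq: float, bf: float, critical_failure: bool = False) -> float:
    """Composite LCB Score from component scores on a 0-100 scale."""
    if critical_failure:
        # Hallucinated functions or incorrect business rules zero out the task.
        return 0.0
    return 0.30 * sc + 0.20 * dq + 0.50 * bf

# Example: SC=80, DQ=70, BF=90 -> 24 + 14 + 45 = 83.0
print(lcb_score(80, 70, 90))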

Run the Benchmark

# Install
pip install -e .

# Set API keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Run full benchmark (recommended)
legacycodebench run-full-benchmark --enable-execution
For Behavioral Fidelity testing, build the Docker image:

cd docker/cobol-sandbox && docker build -t legacycodebench-cobol:latest .
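
To confirm the image built before running with --enable-execution, a standard Docker listing is enough (plain Docker, not part of the benchmark CLI):

docker images legacycodebench-cobol:latest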

Without Docker, BF evaluation falls back to heuristic verification (claim quality analysis). For full accuracy, Docker is recommended.
Submit results: Run the benchmark and open a PR on GitHub with your results/ directory.