Models Tested: -
Tasks: 200
Difficulty Tiers: 4

Leaderboard

Leaderboard columns: # | Model | LCB Score (0-100) | Structural Completeness | Documentation Quality | Behavioral Fidelity | T1 (Basic) | T4 (Enterprise). Results load dynamically on the live page.
Scoring: LCB Score = 30% Structural Completeness + 20% Documentation Quality + 50% Behavioral Fidelity. High BF means the AI's documentation accurately describes what the code actually does.

What We Measure

If AI truly understands code, you should be able to recreate it from the documentation. That's what we test.

Structural Completeness 30%

Static analysis extracts all business rules, data structures, control flow, and external calls. We check if the AI documented each one.
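
As a rough illustration, structural completeness can be viewed as the fraction of extracted elements that the documentation mentions. The Python sketch below is an assumption about the shape of that check (the CodeElement type and substring matching are illustrative), not the benchmark's actual matcher.

# Illustrative sketch only -- not the benchmark's actual implementation.
from dataclasses import dataclass

@dataclass
class CodeElement:
    kind: str        # e.g. "business_rule", "data_structure", "external_call"
    identifier: str  # name or short description produced by static analysis

def structural_completeness(elements: list[CodeElement], documentation: str) -> float:
    """Fraction of extracted elements that the documentation mentions (0.0-1.0)."""
    if not elements:
        return 1.0
    doc = documentation.lower()
    covered = sum(1 for e in elements if e.identifier.lower() in doc)
    return covered / len(elements)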

Documentation Quality 20%

Algorithmic assessment of structure, readability, traceability, and abstraction level. No LLM-as-judge required.
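
Deterministic heuristics of this kind are easy to sketch; the signals below (section headings, sentence length, identifier references) are illustrative stand-ins, not the scorer's actual criteria or weights.

# Illustrative sketch only -- hypothetical heuristics, no LLM-as-judge.
import re

def documentation_quality(documentation: str, code_identifiers: set[str]) -> float:
    """Average of three 0-1 heuristics: structure, readability, traceability."""
    # Structure: does the documentation use any headings at all?
    structure = 1.0 if re.search(r"^\s*#+\s+\S", documentation, re.M) else 0.0

    # Readability: penalize very long sentences (a crude proxy).
    sentences = [s for s in re.split(r"[.!?]\s+", documentation) if s.strip()]
    avg_words = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    readability = 1.0 if avg_words <= 25 else max(0.0, 1.0 - (avg_words - 25) / 25)

    # Traceability: fraction of code identifiers the documentation refers back to.
    doc = documentation.lower()
    mentioned = sum(1 for ident in code_identifiers if ident.lower() in doc)
    traceability = mentioned / max(len(code_identifiers), 1)

    return (structure + readability + traceability) / 3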

Behavioral Fidelity 50%

Claim verification via execution (with Behavioral Specification Matching fallback for infrastructure). Documentation must accurately describe what the code actually does, verified through test generation for logic and pattern matching for dependencies.
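
Conceptually, BF is the fraction of documented claims that survive verification. The outline below is hypothetical; the two callables stand in for the benchmark's real test-generation and specification-matching machinery.

# Illustrative sketch only -- the callables are stand-ins, not the real API.
from typing import Callable, Optional

def behavioral_fidelity(
    claims: list[str],
    code: str,
    run_generated_test: Optional[Callable[[str, str], bool]] = None,
    match_specification: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Fraction of documentation claims verified against the code (0.0-1.0)."""
    if not claims:
        return 0.0
    verified = 0
    for claim in claims:
        if run_generated_test is not None:
            # Preferred path: derive a test from the claim and execute it.
            verified += int(run_generated_test(claim, code))
        elif match_specification is not None:
            # Fallback path: pattern-match the claim against the code when
            # it cannot be verified by execution.
            verified += int(match_specification(claim, code))
    return verified / len(claims)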

LCB Score = (0.30 × SC) + (0.20 × DQ) + (0.50 × BF)
Critical failures (hallucinated functions, incorrect business rules) result in a score of 0 for that task.
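
Written out directly (assuming all components are reported on the same 0-100 scale as the LCB Score):

def lcb_score(sc: float, dq: float, bf: float, critical_failure: bool = False) -> float:
    """Composite LCB Score from component scores on a 0-100 scale."""
    if critical_failure:
        # Hallucinated functions or incorrect business rules zero out the task.
        return 0.0
    return 0.30 * sc + 0.20 * dq + 0.50 * bf

# Example: SC=80, DQ=70, BF=90 -> 24 + 14 + 45 = 83.0
print(lcb_score(80, 70, 90))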

Run the Benchmark

# Install
pip install -e .

# Set API keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Run full benchmark (recommended)
legacycodebench run-full-benchmark --enable-execution
For Behavioral Fidelity testing, build the Docker image:

cd docker/cobol-sandbox && docker build -t legacycodebench-cobol:latest .
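
To confirm the image built before running with --enable-execution, a standard Docker listing is enough (plain Docker, not part of the benchmark CLI):

docker images legacycodebench-cobol:latest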

Without Docker, BF evaluation falls back to heuristic verification (claim quality analysis). For full accuracy, Docker is recommended.
Submit results: Run the benchmark and open a PR on GitHub with your results/ directory.