What is LegacyCodeBench?
LegacyCodeBench is a benchmark designed to evaluate how well AI systems can understand and document decades-old legacy software, especially COBOL.
Modernization fails not because code conversion is hard, but because the business rules, the intent, and the unwritten assumptions are buried inside the legacy code.
You can’t modernize what you don’t understand. Yet no one measures whether AI actually understands these systems. That’s why LegacyCodeBench tests whether AI can accurately extract and explain that knowledge.
What LegacyCodeBench Measures
LegacyCodeBench evaluates two core capabilities:
- Documentation Quality — Can AI explain the program clearly?
- Program Understanding — Can AI extract the structure of the system?
Why Understanding Matters
Modernization projects often rush into COBOL→Java translation, automated refactoring, or AI code generation without first understanding:
- How the system works
- What the business rules mean
- Why exceptions exist
- What domain logic must not break
This lack of understanding leads to failed migrations.
Documentation Tasks
Explain business purpose, rules, edge cases, and data structures for legacy programs.
Understanding Tasks
Extract dependency graphs, business rules, and data flows. Scored with precision, recall, and F1 against expert-built references.
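The F1 scoring above can be sketched as set overlap between extracted and reference items. This is an illustrative sketch, not the benchmark's actual scoring script; the function name and set-based matching are assumptions:

```python
def f1_score(predicted, gold):
    """Precision, recall, and F1 over sets of extracted items
    (e.g. business rules or dependency edges)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # items the model got right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Model extracted 3 rules, 2 of which appear in the 4 reference rules.
p, r, f = f1_score({"R1", "R2", "R9"}, {"R1", "R2", "R3", "R4"})
```

In practice the real harness would also need to decide when two rule statements "match" (exact string match versus semantic match), which this sketch glosses over.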
Evaluation
The final score weights documentation and understanding equally (50% each), combining weighted metrics with expert validation.
Methodology Highlights
- Hand-curated tasks drawn from open-source, synthetic, and anonymized enterprise COBOL.
- Deterministic scoring scripts inside Docker to ensure reproducibility.
- Reference documentation created by COBOL experts with inter-annotator agreement ≥ 0.80.
- Leaderboards updated via CLI submissions.
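The inter-annotator agreement threshold (≥ 0.80) mentioned above is typically measured with a chance-corrected statistic such as Cohen's kappa; the source does not name the exact measure, so this is an illustrative sketch under that assumption:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators who assigned
    one label per item (e.g. classifying statements as business rule
    vs. edge case)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

a = ["rule", "rule", "edge", "rule"]
b = ["rule", "edge", "edge", "rule"]
kappa = cohens_kappa(a, b)
```

A threshold of 0.80 on such a statistic is a common bar for "substantial to near-perfect" agreement before reference documentation is accepted.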