The Problem We Found
Our original approach classified programs before trying to run them. We scanned source code
for patterns like EXEC CICS or EXEC SQL and decided in advance
whether a program could be executed. If it couldn't, we'd fall back to heuristic scoring.
This seemed reasonable. But pattern detection guesses; compilers know. When we examined actual evaluation runs, we found programs failing for reasons our patterns didn't anticipate—missing copybooks, cascading syntax errors, environment issues. The fallback kicked in silently, and documentation that couldn't be verified still received scores.
In one case, a program with 15 test cases failed compilation entirely. All 15 tests were skipped. The heuristic fallback scored the documentation at 100% Behavioral Fidelity (BF). That's not measurement; that's noise.
What We Changed
We moved to compile-first classification. Instead of guessing what will compile, we attempt compilation and let the compiler tell us what's actually possible.
The flow is now:
- Try to compile the program. Resolve copybooks, set up the Docker environment, run the compiler.
- If compilation succeeds: Execute tests and verify behavioral claims against actual program output.
- If compilation fails: Classify the reason (missing copybook, IBM middleware, syntax error) and route to static verification.
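The routing above can be sketched in a few lines. This is a minimal illustration, not our actual harness: the helper names are hypothetical, GnuCOBOL's `cobc` stands in for the real compile step, and the failure classifier is a deliberately coarse heuristic over compiler output.

```python
import subprocess

def classify_failure(stderr: str) -> str:
    """Map compiler output to a coarse failure reason (illustrative heuristic)."""
    text = stderr.lower()
    if "copy" in text and "not found" in text:
        return "missing_copybook"
    if "cics" in text or "exec sql" in text:
        return "ibm_middleware"
    return "syntax_error"

def route(program_path: str) -> dict:
    """Compile first; only route to static verification when the compiler fails."""
    result = subprocess.run(
        ["cobc", "-x", program_path],  # GnuCOBOL compile to executable
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        return {"method": "executed", "reason": None}
    return {"method": "static", "reason": classify_failure(result.stderr)}
```

The key property is that the classification happens after the compile attempt, so the compiler's verdict, not a pattern guess, decides the route.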
Crucially, there are now only two verification methods, and both are explicitly labeled:
- Executed: The program compiled and ran. Claims were verified against actual output.
- Static: The program couldn't compile. Claims were verified against source code patterns.
No silent fallbacks. Every result records which method was used and why.
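One way to make that guarantee concrete is to carry the method and reason on every result record and reject anything ambiguous at construction time. A minimal sketch, with field names that are assumptions rather than the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class VerificationResult:
    task_id: str
    method: str            # "executed" or "static" -- never unset
    reason: Optional[str]  # why static was used; None when executed
    claims_total: int
    claims_verified: int

    def __post_init__(self):
        # Enforce the "no silent fallbacks" invariant up front.
        assert self.method in ("executed", "static")
        if self.method == "static":
            assert self.reason is not None, "static results must record a reason"
```

A record that reaches the leaderboard without a method, or a static record without a reason, simply cannot be constructed.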
Why Static Verification Instead of Zero?
A reasonable question: if a program can't compile, why not just score it zero?
The answer is that these are real production programs extracted from real mainframe systems. They work—they just require IBM CICS, or DB2 precompilers, or proprietary middleware we can't replicate in a Docker container. Assigning zero would penalize AI models for infrastructure constraints entirely outside their control.
Static verification checks that documentation claims match what's actually in the source code. If the documentation says "TOTAL is calculated by multiplying PRICE by QUANTITY," we verify that a COMPUTE statement with those variables exists. It's not as strong as execution, but it's a meaningful signal—and we label it clearly so you know what you're looking at.
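The COMPUTE example could be checked with a pattern match over the source. A rough sketch only; a real matcher must be more forgiving, since COBOL also allows forms like `MULTIPLY PRICE BY QUANTITY GIVING TOTAL`:

```python
import re

def claim_supported(source: str, target: str, operands: list) -> bool:
    """Purely syntactic check: does a COMPUTE (or two-operand
    MULTIPLY ... GIVING) statement tie `target` to `operands`?"""
    compute = re.compile(
        rf"COMPUTE\s+{target}\s*=\s*" + r"\s*\*\s*".join(operands),
        re.IGNORECASE,
    )
    if compute.search(source):
        return True
    if len(operands) == 2:
        multiply = re.compile(
            rf"MULTIPLY\s+{operands[0]}\s+BY\s+{operands[1]}\s+GIVING\s+{target}",
            re.IGNORECASE,
        )
        return bool(multiply.search(source))
    return False
```

This is exactly the weaker guarantee described above: it confirms the statement exists, not that it produces the claimed behavior at runtime.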
What This Means for Results
The leaderboard now shows verification method breakdowns for every model. You can see exactly how many tasks were verified through execution versus static analysis. You can click into any model to see task-by-task details: which programs compiled, which didn't, and why.
Scores may look different than they would have under the old system. That's expected. The goal isn't high numbers; it's accurate measurement.
Realistic Expectations
Based on our analysis of the 200 COBOL benchmark tasks:
- Many require IBM CICS (can't be executed)
- Some require DB2 SQL precompilation (can't be executed)
- The remainder are pure batch programs (can be executed with GnuCOBOL)
A subset of COBOL tasks will use executed verification where the environment allows. The remainder use static verification with documented reasons. This is a limitation of open-source tooling, not a flaw in the benchmark.
Refining the Score: Hybrid Evaluation
In v2.0, we also improved how we calculate the Behavioral Fidelity score.
Previously, writing detailed documentation could ironically hurt a model's score. A model that made 20 claims and verified 15 (75%) might score lower than one that made 5 claims and verified all 5 (100%). This penalized comprehensiveness.
We introduced Hybrid Scoring. We now reward verified claims up to a target threshold. This ensures that models providing rich, detailed documentation are recognized for their depth, while still punishing hallucinations.
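The exact formula isn't spelled out here, but a score of this general shape captures the idea: credit verified claims up to a target, then penalize contradicted ones. The function name, target, and penalty constants below are all illustrative assumptions, not the benchmark's actual values.

```python
def hybrid_bf(verified: int, contradicted: int,
              target: int = 10, penalty: float = 0.1) -> float:
    """Credit verified claims up to `target` (so depth beyond the
    threshold never hurts), then subtract a per-hallucination penalty.
    Constants are illustrative, not the benchmark's actual values."""
    credit = min(verified, target) / target
    return max(0.0, credit - penalty * contradicted)
```

With `target=10`, the 15-of-20 model above earns full claim credit before any penalty, while the 5-of-5 model earns 0.5: detailed documentation is rewarded for depth, and only contradicted claims drag the score down.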
Looking Ahead
Benchmarks earn trust through transparency. When something doesn't work the way we expected, we'd rather fix it than hide it. The methodology details are documented in our GitHub repository for anyone who wants to review them.