Honest Verification: How We Fixed Behavioral Fidelity Testing

When we analyzed our evaluation logs, we found something uncomfortable: programs that failed to compile were sometimes receiving 100% Behavioral Fidelity (BF) scores. We've redesigned how BF verification works. Here's what changed and why.

The Problem We Found

Our original approach classified programs before trying to run them. We scanned source code for patterns like EXEC CICS or EXEC SQL and decided in advance whether a program could be executed. If it couldn't, we'd fall back to heuristic scoring.

This seemed reasonable. But pattern detection guesses; compilers know. When we examined actual evaluation runs, we found programs failing for reasons our patterns didn't anticipate—missing copybooks, cascading syntax errors, environment issues. The fallback kicked in silently, and documentation that couldn't be verified still received scores.

In one case, a program with 15 test cases failed compilation entirely. All 15 tests were skipped. The heuristic fallback scored the documentation at 100% BF. That's not measurement; that's noise.

What We Changed

We moved to compile-first classification. Instead of guessing what will compile, we attempt compilation and let the compiler tell us what's actually possible.

The flow is now:

  1. Try to compile the program. Resolve copybooks, set up the Docker environment, run the compiler.
  2. If compilation succeeds: Execute tests and verify behavioral claims against actual program output.
  3. If compilation fails: Classify the reason (missing copybook, IBM middleware, syntax error) and route to static verification.

Crucially, there are now only two verification methods, and both are explicitly labeled:

  • Executed: The program compiled and ran. Claims were verified against actual output.
  • Static: The program couldn't compile. Claims were verified against source code patterns.

No silent fallbacks. Every result records which method was used and why.
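The routing above can be sketched in a few lines. This is our own illustration of the flow, not the benchmark's actual code: the function name, failure categories, and reason strings are stand-ins, and in practice `returncode` and `stderr` would come from invoking the GnuCOBOL compiler inside the Docker sandbox.

```python
from dataclasses import dataclass

@dataclass
class VerificationResult:
    method: str  # "executed" or "static"
    reason: str  # why that method was chosen

# Hypothetical routing sketch: classify the compiler's verdict instead of
# guessing from source patterns ahead of time.
def route(returncode: int, stderr: str) -> VerificationResult:
    if returncode == 0:
        return VerificationResult("executed", "compiled successfully")
    low = stderr.lower()
    if "copybook" in low:
        return VerificationResult("static", "missing copybook")
    if "exec cics" in low or "exec sql" in low:
        return VerificationResult("static", "IBM middleware (CICS/DB2)")
    return VerificationResult("static", "syntax or environment error")
```

Because the compiler's verdict drives the branch, a program that fails for a reason no pattern anticipated still lands in an explicit, labeled static path rather than a silent fallback.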

Why Static Verification Instead of Zero?

A reasonable question: if a program can't compile, why not just score it zero?

The answer is that these are real production programs extracted from real mainframe systems. They work—they just require IBM CICS, or DB2 precompilers, or proprietary middleware we can't replicate in a Docker container. Assigning zero would penalize AI models for infrastructure constraints entirely outside their control.

Static verification checks that documentation claims match what's actually in the source code. If the documentation says "TOTAL is calculated by multiplying PRICE by QUANTITY," we verify that a COMPUTE statement with those variables exists. It's not as strong as execution, but it's a meaningful signal—and we label it clearly so you know what you're looking at.
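As a concrete sketch of that kind of check, here is a crude claim-vs-source matcher. The regex and function are our own illustration of the idea, not the benchmark's verifier:

```python
import re

def compute_claim_supported(source: str, result_var: str, operands: list[str]) -> bool:
    """Check whether some COMPUTE statement assigns `result_var`
    using every operand the documentation mentions (illustrative only)."""
    pattern = re.compile(
        r"COMPUTE\s+" + re.escape(result_var) + r"\s*=\s*(.+)",
        re.IGNORECASE,
    )
    for match in pattern.finditer(source):
        expr = match.group(1)
        if all(re.search(rf"\b{re.escape(op)}\b", expr, re.IGNORECASE)
               for op in operands):
            return True
    return False

# Claim: "TOTAL is calculated by multiplying PRICE by QUANTITY"
cobol = "COMPUTE TOTAL = PRICE * QUANTITY."
supported = compute_claim_supported(cobol, "TOTAL", ["PRICE", "QUANTITY"])  # True
```

A check like this can confirm the variables and the assignment exist, but not that the arithmetic behaves as documented at runtime, which is exactly why executed verification remains the stronger signal.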

What This Means for Results

The leaderboard now shows verification method breakdowns for every model. You can see exactly how many tasks were verified through execution versus static analysis. You can click into any model to see task-by-task details: which programs compiled, which didn't, and why.

Scores may look different than they would have under the old system. That's expected. The goal isn't high numbers; it's accurate measurement.

Realistic Expectations

Based on our analysis of the 200 COBOL benchmark tasks:

  • Many require IBM CICS (can't be executed)
  • Some require DB2 SQL precompilation (can't be executed)
  • The remainder are pure batch programs (can be executed with GnuCOBOL)

A subset of COBOL tasks uses executed verification where the environment allows; the remainder use static verification with documented reasons. This is a limitation of open-source tooling, not a flaw in the benchmark.

Refining the Score: Hybrid Evaluation

In v2.0, we also improved how we calculate the Behavioral Fidelity score.

Previously, writing detailed documentation could ironically hurt a model's score. Under pure precision scoring, a model that made 20 claims and verified 15 (75%) scored lower than a model that made 5 claims and verified all 5 (100%). This penalized comprehensiveness.

We introduced Hybrid Scoring. We now reward verified claims up to a target threshold. This ensures that models providing rich, detailed documentation are recognized for their depth, while still punishing hallucinations.
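A minimal sketch of the idea follows. The target threshold, the penalty shape, and the function itself are made up for illustration; the benchmark's actual formula may differ:

```python
def hybrid_bf_score(verified: int, total_claims: int, target: int = 10) -> float:
    """Illustrative hybrid score: credit verified claims up to a target
    threshold (rewarding depth), multiplied by precision (punishing
    unverified, potentially hallucinated claims)."""
    if total_claims == 0:
        return 0.0
    coverage = min(verified, target) / target   # depth, capped at the target
    precision = verified / total_claims         # hallucination penalty
    return coverage * precision

# Under precision-only scoring, 15/20 (0.75) loses to 5/5 (1.0).
# Here the richer documentation wins instead:
detailed = hybrid_bf_score(verified=15, total_claims=20)  # 1.0 * 0.75 = 0.75
sparse = hybrid_bf_score(verified=5, total_claims=5)      # 0.5 * 1.00 = 0.50
```

The key property is that adding a verified claim never lowers the score, while adding an unverified one always does.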

Looking Ahead

Benchmarks earn trust through transparency. When something doesn't work the way we expected, we'd rather fix it than hide it. The methodology details are documented in our GitHub repository for anyone who wants to review them.

Beyond COBOL: UniBasic Support in LegacyCodeBench

COBOL gets the headlines, but it's not the only legacy language running critical systems. We've added 50 UniBasic tasks to the benchmark, using ScarletDME for execution verification.

Why UniBasic

UniBasic (and its variants: UniVerse BASIC, Pick BASIC, QMBasic) powers a significant portion of enterprise software you've never heard of—distribution systems, healthcare records, manufacturing control. These systems share COBOL's characteristics: decades old, business-critical, poorly documented.

If AI can understand legacy code, it should work across languages. Testing only COBOL would leave a gap.

Execution Environment

UniVerse and UniData are proprietary, licensed runtimes. We can't distribute them. Instead, we use ScarletDME, a GPL-licensed fork of OpenQM that provides ~95% syntax compatibility with UniVerse BASIC.

Trade-off: programs using UniVerse-specific APIs (UDO, CallHTTP) can't execute in ScarletDME and fall back to static verification. Even so, a significant portion of UniBasic tasks can be executed, providing high-confidence verification.
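One way the failure-reason classification might look for UniBasic, scanning for the APIs named above after a ScarletDME compile or run fails. The token list and function are our own sketch; the real benchmark's classification is more thorough:

```python
# Hypothetical helper: UniVerse-only APIs that ScarletDME cannot provide.
UNIVERSE_ONLY_APIS = ("UDO", "CALLHTTP")

def unibasic_failure_reason(source: str) -> str:
    """After a ScarletDME failure, attribute it to a UniVerse-specific
    API if one appears in the source (illustrative sketch)."""
    upper = source.upper()
    hits = [api for api in UNIVERSE_ONLY_APIS if api in upper]
    if hits:
        return f"UniVerse-specific API ({', '.join(hits)}): static verification"
    return "other failure: static verification with logged output"
```

As with COBOL, the routing stays compile-first: the scan only explains a failure that already happened, it never preempts an execution attempt.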

Task Structure

UniBasic tasks follow the same tier structure as COBOL:

  • T1 (12 tasks): Simple utilities, 50-150 lines
  • T2 (11 tasks): Moderate complexity, single external system
  • T3 (10 tasks): Multi-system integration, 400-800 lines
  • T4 (17 tasks): Large programs, complex control flow

Task IDs use the format LCB-UB-T1-001 to distinguish them from COBOL tasks (LCB-T1-001).

Interpreting Results

The leaderboard shows a language filter. You can view COBOL-only results, UniBasic-only results, or aggregate scores.

Important: Cross-language comparison is statistically unsound. Execution rates differ, task distributions differ, and the languages have different documentation conventions. We show aggregate scores for convenience, but comparing "Model X scores 75% on COBOL" to "Model X scores 72% on UniBasic" doesn't mean COBOL documentation is better. It means the tasks are different.

Running UniBasic Evaluations

To evaluate UniBasic tasks locally:

# Build the Docker image
docker build -t legacycodebench-unibasic:latest docker/unibasic-sandbox/

# Run benchmark
legacycodebench run-full-benchmark --language unibasic --enable-execution

Without Docker, evaluation falls back to static verification for all tasks.

Running the Benchmark

A practical guide to evaluating your model's ability to document legacy code.

Installation

# Clone and install
git clone https://github.com/Kalmantic/legacycodebench.git
cd legacycodebench
pip install -e .

# Set API keys for the model you want to evaluate
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

Quick Test

# Run 3 tasks (fast, no Docker required)
legacycodebench run-full-benchmark --task-limit 3

This runs without execution verification. Useful for checking your setup.

Full Benchmark with Execution

# Build Docker images
docker build -t legacycodebench-cobol:latest docker/cobol-sandbox/
docker build -t legacycodebench-unibasic:latest docker/unibasic-sandbox/

# Run full benchmark
legacycodebench run-full-benchmark --enable-execution

With Docker enabled, programs that can compile will be executed, and behavioral claims will be verified against actual output.

Submitting Results

Results are saved to the results/ directory as JSON files. To submit your results to the public leaderboard, open a pull request on GitHub with your result files.

We review submissions for validity (correct task IDs, reasonable scores, no tampering) before adding them to the leaderboard.
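A minimal sanity check for the task-ID part of that review might look like this. The regex is our own sketch, derived from the ID formats described earlier (LCB-T1-001 for COBOL, LCB-UB-T1-001 for UniBasic, tiers T1 through T4); it is not the repository's actual validation code:

```python
import re

# Hypothetical validator for the documented task-ID formats.
TASK_ID = re.compile(r"LCB-(?:UB-)?T[1-4]-\d{3}")

def valid_task_id(task_id: str) -> bool:
    """True if `task_id` matches a documented COBOL or UniBasic ID."""
    return TASK_ID.fullmatch(task_id) is not None
```

Running every `task_id` in a submitted JSON file through a check like this catches typos and fabricated IDs before a human reviews the scores themselves.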