Scoring Methodology
LegacyCodeBench uses a hybrid evaluation approach: reference-based comparison with semantic similarity and NLP metrics when expert documentation exists, falling back to heuristic and COBOL-extraction methods otherwise. All scores are deterministic and reproducible.
Evaluation includes ROUGE (recall), BLEU (precision), and TF-IDF semantic similarity for comprehensive assessment.
Task Types
Documentation
Systems must generate Markdown documentation covering purpose, business rules, edge cases, and data structures.
Understanding
Systems must output JSON capturing dependencies, business rules, and data flow from COBOL programs.
Hybrid Evaluation
50% documentation score + 50% understanding score → overall benchmark score.
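The 50/50 blend can be sketched as follows (the function name is illustrative, not the benchmark's actual API):

```python
def overall_score(doc_score: float, understanding_score: float) -> float:
    """Blend the two task scores 50/50 into the overall benchmark score."""
    return 0.5 * doc_score + 0.5 * understanding_score

# e.g. 90.4% documentation and 80% understanding blend to roughly 85.2%
score = overall_score(0.904, 0.80)
```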
Evaluation Modes
LegacyCodeBench uses two evaluation approaches depending on availability of expert references:
Reference-Based
When expert-validated reference documentation exists, submissions are compared using semantic similarity, ROUGE, and BLEU metrics.
Heuristic Fallback
For documentation tasks: keyword matching and pattern detection. For understanding tasks: ground truth is extracted directly from the COBOL source via regex.
Automatic Detection
The evaluator checks for reference files and automatically selects the appropriate mode.
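A sketch of the detection logic; the directory layout and helper name are assumptions, and the file names follow the FAQ at the end of this page:

```python
from pathlib import Path

def select_mode(task_dir: str, task_type: str) -> str:
    """Pick reference-based evaluation when a reference file exists,
    otherwise fall back to heuristics / COBOL extraction.
    Directory layout is assumed for illustration."""
    ref_name = "reference.md" if task_type == "documentation" else "reference.json"
    if (Path(task_dir) / ref_name).exists():
        return "reference"
    return "heuristic"
```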
Documentation Scoring
Each documentation submission is evaluated using weighted components:
- Required sections present (40%) — Checks for: business purpose, business rules, edge cases, data structures, algorithm overview.
- Business rule coverage (30%) — Uses semantic similarity + NLP metrics when reference exists, or heuristics otherwise.
- Markdown format quality (20%) — Valid markdown, headers, code blocks, lists, paragraphs.
- Appropriate length (10%) — Compares word count to reference (if available) or minimum requirement.
Reference-Based Business Rule Scoring
When expert reference documentation exists, business rule coverage is enhanced with NLP metrics:
- Semantic Similarity — TF-IDF vectors + cosine similarity to capture meaning.
- ROUGE-L — Longest common subsequence, measures recall and structural similarity.
- BLEU — N-gram precision (1-4 grams), measures accuracy of generated text.
Edge case coverage is also evaluated separately and weighted at 30% within the rule coverage component.
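A sketch of the rule-coverage blend, using the metric weights implied by the worked example later on this page (semantic 0.5, ROUGE-L 0.3, BLEU 0.2) and the 30% edge-case weighting described above; the 70/30 split between the metric blend and edge-case coverage is an assumption:

```python
def rule_coverage(semantic: float, rouge_l: float, bleu: float,
                  edge_case: float) -> float:
    """Blend the NLP metrics, then mix in edge-case coverage at 30%.
    The 0.5/0.3/0.2 metric weights and the 70/30 split are assumptions
    drawn from this page, not the benchmark's confirmed constants."""
    metric_blend = 0.5 * semantic + 0.3 * rouge_l + 0.2 * bleu
    return 0.7 * metric_blend + 0.3 * edge_case
```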
Understanding Scoring
Understanding tasks are scored via F1 metrics across three components:
precision = true_positives / predicted
recall = true_positives / actual
f1 = 2 × (precision × recall) / (precision + recall)
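These formulas in a zero-safe form (illustrative, not the benchmark's exact code):

```python
def f1_score(true_positives: int, predicted: int, actual: int) -> float:
    """Zero-safe precision/recall/F1 over extracted items."""
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Dependencies example from this page: 5 correct of 6 predicted, 8 actual
print(round(f1_score(5, 6, 8), 2))  # 0.71
```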
Three Evaluation Components
- Dependencies (33%) — CALL and COPY relationships between programs.
- Business Rules (33%) — IF/WHEN conditions and logic extracted from code.
- Data Flow (33%) — File I/O operations (OPEN, READ, WRITE, CLOSE).
Final understanding score = average of the three F1 scores.
Ground Truth Sources
- Reference-based — When references/understanding/<task>/reference.json exists, submissions are compared against expert-validated JSON.
- COBOL extraction — Otherwise, ground truth is extracted directly from COBOL source files using regex patterns.
Reference-based evaluation includes semantic comparison for business rules, while COBOL extraction uses exact matching.
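A simplified sketch of the COBOL-extraction path (the patterns here are illustrative; the benchmark's actual regexes are more thorough):

```python
import re

def extract_ground_truth(cobol_source: str) -> dict:
    """Extract CALL/COPY dependencies and file I/O verbs from COBOL source.
    Patterns are simplified for illustration."""
    return {
        "calls": re.findall(r"CALL\s+'([A-Z0-9-]+)'", cobol_source),
        "copies": re.findall(r"\bCOPY\s+([A-Z0-9-]+)", cobol_source),
        "data_flow": re.findall(r"\b(OPEN|READ|WRITE|CLOSE)\b", cobol_source),
    }

src = "OPEN INPUT CUST-FILE. CALL 'PAYCALC'. READ CUST-FILE."
print(extract_ground_truth(src)["calls"])  # ['PAYCALC']
```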
Overall Score
Scores are clamped to the 0-100 range. Performance tiers follow the PRD definitions (>60% practically useful, 40-60% research baseline, etc.).
NLP Metrics
When reference documentation exists, we use industry-standard NLP evaluation metrics:
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- ROUGE-1 — Unigram (word) overlap, ensures important words are captured.
- ROUGE-2 — Bigram overlap, ensures phrases are captured.
- ROUGE-L — Longest common subsequence, measures structural similarity and fluency.
- ROUGE-Lsum — Sentence-level LCS, better for multi-paragraph documents.
ROUGE is recall-oriented and commonly used for summarization evaluation.
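A minimal pure-Python sketch of ROUGE-L's LCS-based F1 (real implementations such as the rouge-score package add tokenization and stemming options):

```python
def lcs_length(a: list, b: list) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: LCS-based precision and recall over word tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```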
BLEU (Bilingual Evaluation Understudy)
- BLEU-1 through BLEU-4 — N-gram precision scores (1-gram to 4-gram).
- Measures how many n-grams from the submission appear in the reference.
- Emphasizes precision and phrase-level accuracy.
BLEU complements ROUGE by measuring precision rather than recall.
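The core of BLEU, clipped n-gram precision, can be sketched as follows (brevity penalty and the geometric mean over n = 1 to 4 are omitted for brevity):

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram counts only up to
    the number of times it appears in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / sum(cand.values())
```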
Semantic Similarity
TF-IDF vectorization + cosine similarity captures meaning even when different words are used. This handles paraphrasing and synonym usage that exact-match metrics miss.
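A minimal sketch of the idea using a two-document IDF; production code would typically use scikit-learn's TfidfVectorizer, which computes IDF over a larger corpus with its own smoothing:

```python
import math
from collections import Counter

def tfidf_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two documents' TF-IDF vectors,
    with IDF computed over just these two documents (+1 smoothing)."""
    docs = [doc_a.lower().split(), doc_b.lower().split()]
    vocab = set(docs[0]) | set(docs[1])
    idf = {w: math.log(2 / (1 + sum(w in d for d in docs))) + 1 for w in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: tf[w] * idf[w] for w in vocab})
    dot = sum(vecs[0][w] * vecs[1][w] for w in vocab)
    norm = (math.sqrt(sum(v * v for v in vecs[0].values())) *
            math.sqrt(sum(v * v for v in vecs[1].values())))
    return dot / norm if norm else 0.0
```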
Scoring Examples
Documentation Example (Reference-Based)
Submission has:
- 5/5 required sections → 1.0 × 0.40 = 0.40
- Semantic: 0.75, ROUGE-L: 0.65, BLEU: 0.55 → (0.75×0.5 + 0.65×0.3 + 0.55×0.2) = 0.68 × 0.30 = 0.204
- Perfect markdown format → 1.0 × 0.20 = 0.20
- 1,800 words (reference: 1,500) → 1.0 × 0.10 = 0.10
Total Score: 0.904 (90.4%)
Understanding Example (Reference-Based)
Component F1 Scores:
- Dependencies: 5/6 predicted correct, 5/8 actual → P: 0.83, R: 0.63, F1: 0.71
- Business Rules: Semantic comparison → F1: 0.68
- Data Flow: 4/4 files correct → P: 1.0, R: 1.0, F1: 1.0
Average F1: (0.71 + 0.68 + 1.0) / 3 = 0.80 (80%)
Performance Tiers
Overall scores are interpreted using the following tiers:
| Score Range | Tier | Interpretation |
|---|---|---|
| > 60% | Excellent | Practically useful with human oversight |
| 40-60% | Good | Research baseline quality |
| 20-40% | Fair | Expected AI performance |
| < 20% | Poor | Not functional for practical use |
FAQ
How does the evaluator choose between reference and heuristic modes?
The evaluator automatically checks for reference files (reference.md or reference.json) in the appropriate directory. If found, it uses reference-based evaluation; otherwise, it falls back to heuristics or COBOL extraction.
Are NLP metrics always calculated?
ROUGE and BLEU scores are only calculated when reference documentation exists and the required libraries (rouge-score, nltk) are installed. If unavailable, the system gracefully falls back to simpler comparison methods.
Can I reproduce the scores?
Yes. All evaluation scripts are open source. Run legacycodebench evaluate locally with the same task and submission to get identical scores. Results include the evaluation method used ("reference" or "heuristic").
How are reference documents created?
2-3 COBOL experts independently document each task, then their outputs are merged to consensus with inter-annotator agreement > 0.80. The consensus version becomes the reference used for evaluation.
What if my submission doesn't match the reference wording exactly?
That's fine. Semantic similarity (TF-IDF + cosine) captures meaning even with different wording. ROUGE and BLEU complement this by measuring structural and phrase-level similarity. The combined approach rewards both accuracy and completeness.