Scoring Methodology
LegacyCodeBench uses a hybrid evaluation approach: reference-based comparison with semantic similarity and NLP metrics when expert documentation exists, falling back to heuristic and COBOL-extraction methods otherwise. All scores are deterministic and reproducible.
Evaluation includes ROUGE (recall), BLEU (precision), and TF-IDF semantic similarity for comprehensive assessment.
Task Types
Documentation
Systems must generate Markdown documentation covering purpose, business rules, edge cases, and data structures.
Understanding
Systems must output JSON capturing dependencies, business rules, and data flow from COBOL programs.
Hybrid Evaluation
50% documentation score + 50% understanding score → overall benchmark score.
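The 50/50 blend can be sketched as follows (the function name is illustrative, not the benchmark's actual API):

```python
def overall_score(doc_score: float, understanding_score: float) -> float:
    """Blend the two task scores 50/50 into the overall benchmark score."""
    return 0.5 * doc_score + 0.5 * understanding_score

# e.g. 90.4% documentation and 80% understanding blend to roughly 85.2%
score = overall_score(0.904, 0.80)
```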
Evaluation Modes
LegacyCodeBench uses two evaluation approaches depending on availability of expert references:
Reference-Based
When expert-validated reference documentation exists, submissions are compared using semantic similarity, ROUGE, and BLEU metrics.
Heuristic Fallback
For documentation tasks: keyword matching and pattern detection. For understanding tasks: ground truth is extracted directly from the COBOL source via regex.
Automatic Detection
The evaluator checks for reference files and automatically selects the appropriate mode.
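A sketch of the detection logic; the directory layout and helper name are assumptions, and the file names follow the FAQ at the end of this page:

```python
from pathlib import Path

def select_mode(task_dir: str, task_type: str) -> str:
    """Pick reference-based evaluation when a reference file exists,
    otherwise fall back to heuristics / COBOL extraction.
    Directory layout is assumed for illustration."""
    ref_name = "reference.md" if task_type == "documentation" else "reference.json"
    if (Path(task_dir) / ref_name).exists():
        return "reference"
    return "heuristic"
```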
Documentation Scoring
Each documentation submission is evaluated using weighted components:
- Required sections present (40%) — Checks for: business purpose, business rules, edge cases, data structures, algorithm overview.
- Business rule coverage (30%) — Uses semantic similarity + NLP metrics when reference exists, or heuristics otherwise.
- Markdown format quality (20%) — Valid markdown, headers, code blocks, lists, paragraphs.
- Appropriate length (10%) — Compares word count to reference (if available) or minimum requirement.
Reference-Based Business Rule Scoring
When expert reference documentation exists, business rule coverage is enhanced with NLP metrics:
- Semantic Similarity — TF-IDF vectors + cosine similarity to capture meaning.
- ROUGE-L — Longest common subsequence, measures recall and structural similarity.
- BLEU — N-gram precision (1-4 grams), measures accuracy of generated text.
Edge case coverage is also evaluated separately and weighted at 30% within the rule coverage component.
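A sketch of the rule-coverage blend, using the metric weights implied by the worked example later on this page (semantic 0.5, ROUGE-L 0.3, BLEU 0.2) and the 30% edge-case weighting described above; the 70/30 split between the metric blend and edge-case coverage is an assumption:

```python
def rule_coverage(semantic: float, rouge_l: float, bleu: float,
                  edge_case: float) -> float:
    """Blend the NLP metrics, then mix in edge-case coverage at 30%.
    The 0.5/0.3/0.2 metric weights and the 70/30 split are assumptions
    drawn from this page, not the benchmark's confirmed constants."""
    metric_blend = 0.5 * semantic + 0.3 * rouge_l + 0.2 * bleu
    return 0.7 * metric_blend + 0.3 * edge_case
```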
Understanding Scoring
Understanding tasks are scored via F1 metrics across three components:
precision = true_positives / predicted
recall = true_positives / actual
f1 = 2 × (precision × recall) / (precision + recall)
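These formulas in a zero-safe form (illustrative, not the benchmark's exact code):

```python
def f1_score(true_positives: int, predicted: int, actual: int) -> float:
    """Zero-safe precision/recall/F1 over extracted items."""
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Dependencies example from this page: 5 correct of 6 predicted, 8 actual
print(round(f1_score(5, 6, 8), 2))  # 0.71
```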
Three Evaluation Components
- Dependencies (33%) — CALL and COPY relationships between programs.
- Business Rules (33%) — IF/WHEN conditions and logic extracted from code.
- Data Flow (33%) — File I/O operations (OPEN, READ, WRITE, CLOSE).
Final understanding score = average of the three F1 scores.
Ground Truth Sources
- Reference-based — When references/understanding/<task>/reference.json exists, submissions are compared against expert-validated JSON.
- COBOL extraction — Otherwise, ground truth is extracted directly from COBOL source files using regex patterns.
Reference-based evaluation includes semantic comparison for business rules, while COBOL extraction uses exact matching.
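A simplified sketch of the COBOL-extraction path (the patterns here are illustrative; the benchmark's actual regexes are more thorough):

```python
import re

def extract_ground_truth(cobol_source: str) -> dict:
    """Extract CALL/COPY dependencies and file I/O verbs from COBOL source.
    Patterns are simplified for illustration."""
    return {
        "calls": re.findall(r"CALL\s+'([A-Z0-9-]+)'", cobol_source),
        "copies": re.findall(r"\bCOPY\s+([A-Z0-9-]+)", cobol_source),
        "data_flow": re.findall(r"\b(OPEN|READ|WRITE|CLOSE)\b", cobol_source),
    }

src = "OPEN INPUT CUST-FILE. CALL 'PAYCALC'. READ CUST-FILE."
print(extract_ground_truth(src)["calls"])  # ['PAYCALC']
```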
Overall Score
Scores are clamped to the 0-100 range. Performance tiers follow the PRD definitions (>60% practically useful, 40-60% research baseline, etc.).
NLP Metrics
When reference documentation exists, we use industry-standard NLP evaluation metrics:
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- ROUGE-1 — Unigram (word) overlap, ensures important words are captured.
- ROUGE-2 — Bigram overlap, ensures phrases are captured.
- ROUGE-L — Longest common subsequence, measures structural similarity and fluency.
- ROUGE-Lsum — Sentence-level LCS, better for multi-paragraph documents.
ROUGE is recall-oriented and commonly used for summarization evaluation.
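A minimal pure-Python sketch of ROUGE-L's LCS-based F1 (real implementations such as the rouge-score package add tokenization and stemming options):

```python
def lcs_length(a: list, b: list) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: LCS-based precision and recall over word tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```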
BLEU (Bilingual Evaluation Understudy)
- BLEU-1 through BLEU-4 — N-gram precision scores (1-gram to 4-gram).
- Measures how many n-grams from the submission appear in the reference.
- Emphasizes precision and phrase-level accuracy.
BLEU complements ROUGE by measuring precision rather than recall.
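The core of BLEU, clipped n-gram precision, can be sketched as follows (brevity penalty and the geometric mean over n = 1 to 4 are omitted for brevity):

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram counts only up to
    the number of times it appears in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / sum(cand.values())
```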
Semantic Similarity
TF-IDF vectorization + cosine similarity captures meaning even when different words are used. This handles paraphrasing and synonym usage that exact-match metrics miss.
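A minimal sketch of the idea using a two-document IDF; production code would typically use scikit-learn's TfidfVectorizer, which computes IDF over a larger corpus with its own smoothing:

```python
import math
from collections import Counter

def tfidf_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two documents' TF-IDF vectors,
    with IDF computed over just these two documents (+1 smoothing)."""
    docs = [doc_a.lower().split(), doc_b.lower().split()]
    vocab = set(docs[0]) | set(docs[1])
    idf = {w: math.log(2 / (1 + sum(w in d for d in docs))) + 1 for w in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: tf[w] * idf[w] for w in vocab})
    dot = sum(vecs[0][w] * vecs[1][w] for w in vocab)
    norm = (math.sqrt(sum(v * v for v in vecs[0].values())) *
            math.sqrt(sum(v * v for v in vecs[1].values())))
    return dot / norm if norm else 0.0
```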
Scoring Examples
Documentation Example (Reference-Based)
Submission has:
- 5/5 required sections → 1.0 × 0.40 = 0.40
- Semantic: 0.75, ROUGE-L: 0.65, BLEU: 0.55 → (0.75×0.5 + 0.65×0.3 + 0.55×0.2) = 0.68 × 0.30 = 0.204
- Perfect markdown format → 1.0 × 0.20 = 0.20
- 1,800 words (reference: 1,500) → 1.0 × 0.10 = 0.10
Total Score: 0.904 (90.4%)
Understanding Example (Reference-Based)
Component F1 Scores:
- Dependencies: 5/6 predicted correct, 5/8 actual → P: 0.83, R: 0.63, F1: 0.71
- Business Rules: Semantic comparison → F1: 0.68
- Data Flow: 4/4 files correct → P: 1.0, R: 1.0, F1: 1.0
Average F1: (0.71 + 0.68 + 1.0) / 3 = 0.80 (80%)
Performance Tiers
Overall scores are interpreted using the following tiers:
| Score Range | Tier | Interpretation |
|---|---|---|
| > 60% | Excellent | Practically useful with human oversight |
| 40-60% | Good | Research baseline quality |
| 20-40% | Fair | Expected AI performance |
| < 20% | Poor | Not functional for practical use |
FAQ
How does the evaluator choose between reference and heuristic modes?
The evaluator automatically checks for reference files (reference.md or reference.json) in the appropriate directory. If found, it uses reference-based evaluation; otherwise, it falls back to heuristics or COBOL extraction.
Are NLP metrics always calculated?
ROUGE and BLEU scores are only calculated when reference documentation exists and the required libraries (rouge-score, nltk) are installed. If unavailable, the system gracefully falls back to simpler comparison methods.
Can I reproduce the scores?
Yes. All evaluation scripts are open source. Run legacycodebench evaluate locally with the same task and submission to get identical scores. Results include the evaluation method used ("reference" or "heuristic").
How are reference documents created?
2-3 COBOL experts independently document each task, then their outputs are merged to consensus with inter-annotator agreement > 0.80. The consensus version becomes the reference used for evaluation.
What if my submission doesn't match the reference wording exactly?
That's fine. Semantic similarity (TF-IDF + cosine) captures meaning even with different wording. ROUGE and BLEU complement this by measuring structural and phrase-level similarity. The combined approach rewards both accuracy and completeness.