Scoring Methodology

LegacyCodeBench uses a hybrid evaluation approach: reference-based comparison with semantic similarity and NLP metrics when expert documentation exists, falling back to heuristic and COBOL-extraction methods otherwise. All scores are deterministic and reproducible.

Evaluation combines ROUGE (recall-oriented), BLEU (precision-oriented), and TF-IDF semantic similarity, so submissions are assessed from complementary angles.

Task Types

Documentation

Systems must generate Markdown documentation covering purpose, business rules, edge cases, and data structures.

Understanding

Systems must output JSON capturing dependencies, business rules, and data flow from COBOL programs.

Hybrid Evaluation

50% documentation score + 50% understanding score → overall benchmark score.

Evaluation Modes

LegacyCodeBench uses two evaluation approaches depending on availability of expert references:

Reference-Based

When expert-validated reference documentation exists, submissions are compared using semantic similarity, ROUGE, and BLEU metrics.

Heuristic Fallback

For documentation tasks, heuristics use keyword matching and pattern detection. For understanding tasks, ground truth is extracted directly from the COBOL source via regex.

Automatic Detection

The evaluator checks for reference files and automatically selects the appropriate mode.

Documentation Scoring

Each documentation submission is evaluated using weighted components:

doc_score = 0.40 × sections + 0.30 × rules + 0.20 × format + 0.10 × length
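The weighted blend above can be sketched as a small helper, assuming each component arrives already normalized to [0, 1] (the function name is illustrative, not the benchmark's actual API):

```python
def doc_score(sections: float, rules: float, fmt: float, length: float) -> float:
    """Combine the four documentation components using the published weights."""
    return 0.40 * sections + 0.30 * rules + 0.20 * fmt + 0.10 * length
```

For example, a submission with perfect sections, format, and length but a 0.68 rule-coverage score would land at 0.904.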

Reference-Based Business Rule Scoring

When expert reference documentation exists, business rule coverage is enhanced with NLP metrics:

rule_coverage = 0.50 × semantic_similarity + 0.30 × ROUGE-L + 0.20 × BLEU

Edge case coverage is also evaluated separately and weighted at 30% within the rule coverage component.
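The metric blend in the formula above is straightforward to express as a helper, again assuming inputs normalized to [0, 1] (this sketch covers only the three-metric blend, not the separate edge-case weighting):

```python
def rule_coverage(semantic: float, rouge_l: float, bleu: float) -> float:
    """Blend the three NLP metrics using the published rule-coverage weights."""
    return 0.50 * semantic + 0.30 * rouge_l + 0.20 * bleu
```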

Understanding Scoring

Understanding tasks are scored via F1 metrics across three components:

precision = true_positives / total_predicted
recall = true_positives / total_actual
f1 = 2 × (precision × recall) / (precision + recall)
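The formulas above can be wrapped in one helper; the handling of empty predictions or empty ground truth is an assumption, since the source does not specify it:

```python
def f1_score(true_positives: int, predicted: int, actual: int) -> float:
    """F1 from raw counts; returns 0.0 when either denominator is empty (assumed)."""
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```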

Three Evaluation Components

F1 is computed separately for dependencies, business rules, and data flow; the final understanding score is the average of the three F1 scores.

Ground Truth Sources

Reference-based evaluation includes semantic comparison for business rules, while COBOL extraction uses exact matching.

Overall Score

overall = 0.50 × documentation + 0.50 × understanding

Scores are clamped to the range 0-100. Performance tiers follow the PRD definitions (>60% practically useful, 40-60% research baseline, etc.).
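As a sketch, with component scores assumed to arrive as percentages and the clamping behavior inferred from the cap described above:

```python
def overall_score(documentation: float, understanding: float) -> float:
    """Equal-weight blend of the two track scores, clamped to [0, 100]."""
    raw = 0.50 * documentation + 0.50 * understanding
    return max(0.0, min(100.0, raw))
```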

NLP Metrics

When reference documentation exists, we use industry-standard NLP evaluation metrics:

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is recall-oriented and commonly used for summarization evaluation: it measures how much of the reference content the submission covers. The ROUGE-L variant scores overlap via the longest common subsequence.

BLEU (Bilingual Evaluation Understudy)

BLEU complements ROUGE by measuring precision rather than recall: it scores n-gram overlap with the reference, penalizing content that the reference does not contain.

Semantic Similarity

TF-IDF vectorization + cosine similarity captures meaning even when different words are used. This handles paraphrasing and synonym usage that exact-match metrics miss.
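The TF-IDF + cosine approach can be sketched in plain Python; the actual evaluator likely uses a library such as scikit-learn's TfidfVectorizer, and this minimal version mirrors its smoothed-IDF convention:

```python
import math
from collections import Counter

def tfidf_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity of TF-IDF vectors for two texts (smoothed IDF)."""
    docs = [text_a.lower().split(), text_b.lower().split()]
    n = len(docs)
    # Document frequency: how many of the two texts contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed inverse document frequency: ln((1+n)/(1+df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    dot = sum(vecs[0][t] * vecs[1].get(t, 0.0) for t in vecs[0])
    na = math.sqrt(sum(v * v for v in vecs[0].values()))
    nb = math.sqrt(sum(v * v for v in vecs[1].values()))
    return dot / (na * nb) if na and nb else 0.0
```

Identical texts score 1.0, texts with no shared vocabulary score 0.0, and paraphrases with partial word overlap fall in between.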

Scoring Examples

Documentation Example (Reference-Based)

Submission has:

  • 5/5 required sections → 1.0 × 0.40 = 0.40
  • Semantic: 0.75, ROUGE-L: 0.65, BLEU: 0.55 → (0.75×0.5 + 0.65×0.3 + 0.55×0.2) = 0.68 × 0.30 = 0.204
  • Perfect markdown format → 1.0 × 0.20 = 0.20
  • 1,800 words (reference: 1,500) → 1.0 × 0.10 = 0.10

Total Score: 0.904 (90.4%)
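The component arithmetic above can be checked directly:

```python
# Reproducing the worked example, step by step (values from the text).
sections = 1.0 * 0.40                                   # 0.40
rules = (0.75 * 0.5 + 0.65 * 0.3 + 0.55 * 0.2) * 0.30   # 0.68 * 0.30 = 0.204
fmt = 1.0 * 0.20                                        # 0.20
length = 1.0 * 0.10                                     # 0.10
total = sections + rules + fmt + length
print(round(total, 3))  # 0.904
```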

Understanding Example (Reference-Based)

Component F1 Scores:

  • Dependencies: 5/6 predicted correct, 5/8 actual → P: 0.83, R: 0.63, F1: 0.71
  • Business Rules: Semantic comparison → F1: 0.68
  • Data Flow: 4/4 files correct → P: 1.0, R: 1.0, F1: 1.0

Average F1: (0.71 + 0.68 + 1.0) / 3 = 0.80 (80%)
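The dependency F1 and the final average above can be verified in a few lines:

```python
# Dependencies: 5 correct out of 6 predicted, against 8 actual items.
p = 5 / 6                         # ~0.83
r = 5 / 8                         # 0.625, shown as 0.63
f1_deps = 2 * p * r / (p + r)     # ~0.71
average = (f1_deps + 0.68 + 1.0) / 3
print(f"{average:.2f}")  # 0.80
```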

Performance Tiers

Overall scores are interpreted using the following tiers:

Score Range | Tier      | Interpretation
----------- | --------- | ----------------------------------------
> 60%       | Excellent | Practically useful with human oversight
40-60%      | Good      | Research baseline quality
20-40%      | Fair      | Expected AI performance
< 20%       | Poor      | Not functional for practical use

FAQ

How does the evaluator choose between reference and heuristic modes?

The evaluator automatically checks for reference files (reference.md or reference.json) in the appropriate directory. If found, it uses reference-based evaluation; otherwise, it falls back to heuristics or COBOL extraction.
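A hypothetical sketch of that detection logic (the directory layout and function name are assumptions, not the benchmark's actual code):

```python
from pathlib import Path

def select_mode(task_dir: str, task_type: str) -> str:
    """Pick 'reference' when the expected reference file exists, else 'heuristic'."""
    ref_name = "reference.md" if task_type == "documentation" else "reference.json"
    if (Path(task_dir) / ref_name).exists():
        return "reference"
    return "heuristic"
```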

Are NLP metrics always calculated?

ROUGE and BLEU scores are only calculated when reference documentation exists and the required libraries (rouge-score, nltk) are installed. If unavailable, the system gracefully falls back to simpler comparison methods.
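A common pattern for that graceful fallback, probing the optional dependency at import time (shown here with the rouge-score package's RougeScorer API; the wrapper function is illustrative):

```python
try:
    from rouge_score import rouge_scorer  # optional dependency
    HAS_ROUGE = True
except ImportError:
    HAS_ROUGE = False

def rouge_l_or_none(reference: str, candidate: str):
    """Return ROUGE-L F-measure, or None when rouge-score is not installed."""
    if not HAS_ROUGE:
        return None
    scorer = rouge_scorer.RougeScorer(["rougeL"])
    return scorer.score(reference, candidate)["rougeL"].fmeasure
```

Callers treat None as "metric unavailable" and fall back to the simpler comparison path.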

Can I reproduce the scores?

Yes. All evaluation scripts are open source. Run legacycodebench evaluate locally with the same task and submission to get identical scores. Results include the evaluation method used ("reference" or "heuristic").

How are reference documents created?

2-3 COBOL experts independently document each task, then their outputs are merged to consensus with inter-annotator agreement > 0.80. The consensus version becomes the reference used for evaluation.

What if my submission doesn't match the reference wording exactly?

That's fine. Semantic similarity (TF-IDF + cosine) captures meaning even with different wording. ROUGE and BLEU complement this by measuring structural and phrase-level similarity. The combined approach rewards both accuracy and completeness.