Evaluating ML/GenAI Models
When your data is messy, incomplete, biased, or misaligned with the task, it inevitably shows up as weak or unstable evaluation metrics. This guide explains major evaluation metrics across generative AI, binary classification, and regression - including real-world examples, what “good” looks like, and remediation strategies when metrics are low.
Summary Table of Machine Learning & Generative AI Evaluation Metrics
Note: Thresholds are general industry heuristics; real “good” values depend on domain, data distribution, and business constraints.
Text Generative-AI Metrics
| Metric | What It Measures | When to Use | Good Threshold | Typical Models |
|---|---|---|---|---|
| ROUGE‑1 | Unigram (word‑level) overlap | Summarization, QA, content recall | 0.4–0.5+ good; 0.5–0.6+ strong | Summarization models, LLM eval |
| ROUGE‑2 | Bigram (two‑word sequence) overlap | Fluency, coherence, phrase accuracy | 0.2–0.3 good; >0.35 strong | Summaries, paraphrasers, LLMs |
| ROUGE‑L | Longest Common Subsequence (structure similarity) | Narrative structure, sentence‑level accuracy | >0.4 useful; >0.5 strong | Summarization, reasoning LLMs |
| BLEU | Precision‑oriented n‑gram match + brevity penalty | Translation, paraphrasing | 30–40 good; 45–55+ strong | Machine translation, multilingual LLMs |
Binary Classification Metrics
| Metric | What It Measures | When to Use | Good Threshold | Typical Models |
|---|---|---|---|---|
| AUC‑ROC | Ranking ability between positive & negative classes | Balanced datasets; general classification | 0.7–0.8 OK; 0.8–0.9 good; >0.9 high | Fraud detection, risk scoring, medical tests |
| AUC‑PR | Precision‑Recall tradeoff | Imbalanced datasets (rare events) | 2–3× the positive base rate | Fraud, medical diagnosis, anomaly detection |
| Precision | % of predicted positives that are true | When false positives hurt | >0.8 common target | Spam filters, fraud alerts |
| Recall (TPR) | % of actual positives detected | When false negatives hurt | >0.7–0.8 good; >0.85 strong | Medical tests, safety models |
| False Positive Rate (FPR) | % of negatives incorrectly predicted positive | Critical when over-flagging is costly | <10% acceptable; <5% strong | Security screening, compliance systems |
| F1 Score | Harmonic mean of precision & recall | Balanced precision/recall importance | >0.7 good; >0.8 strong | Churn prediction, NER, general classification |
Regression Metrics
| Metric | What It Measures | When to Use | Good Threshold | Typical Models |
|---|---|---|---|---|
| R² | % of variance explained by model | General regression, explainability | >0.5 OK; >0.7 strong; >0.9 excellent | Pricing models, risk scoring |
| MAE | Average absolute error | Interpretability needed; robust to outliers | <10–20% of mean target | Forecasting, scheduling |
| MSE | Squared error (punishes large mistakes) | When large errors are extra costly | Lower is better (scale-dependent) | Safety systems, physical modeling |
| RMSE | Root of MSE, in same units as target | Comparing to business KPIs | <10–20% relative error | Demand forecasting, time-series models |
Deep Dive
1. Text Generation Metrics (ROUGE, BLEU)
Text generation metrics measure how similar model outputs are to reference texts. They are widely used for tasks like summarization, translation, and generative AI fine-tuning.
```mermaid
flowchart LR
  A[Reference Text] --> C[Compare Token Overlap]
  B[Generated Text] --> C
  C --> D1[ROUGE-1: Unigrams]
  C --> D2[ROUGE-2: Bigrams]
  C --> D3[ROUGE-L: Longest Common Subsequence]
```
ROUGE‑1
Definition: Measures overlap of individual words (unigrams).
Usage: Summarization and content recall.
Real‑world example
Reference: “The Federal Reserve raised interest rates due to inflation concerns.”
Generated summary containing many of these words → higher ROUGE‑1.
Decent quality
- 0.4–0.5+ = good
- 0.5–0.6+ = strong
If low, improve by
- Adding domain‑specific training samples
- Fixing noisy or inconsistent reference summaries
- Improving decoding strategies (beam search, temperature)
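A minimal sketch of the unigram-overlap computation in plain Python (whitespace tokenization only; production implementations such as the `rouge-score` package add stemming and proper tokenization):

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (ROUGE-1), illustrative only."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Each word counts only as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "The Federal Reserve raised interest rates due to inflation concerns."
candidate = "The Fed raised rates because of inflation worries."
print(rouge_1(reference, candidate))
```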
ROUGE‑2
Definition: Measures bigram overlap (two-word sequences).
Usage: Fluency and phrase accuracy.
Real‑world example
Bigram match: “raised interest” appearing in both reference and output.
Decent quality
- 0.2–0.3 = good
- >0.35 = strong
If low
- Add more high-quality examples with clear phrasing
- Improve tokenization and text normalization
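If the `rouge-score` package (Google's reference implementation) is available, ROUGE‑1/2/L can be computed together; a sketch assuming that dependency is installed:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The Federal Reserve raised interest rates due to inflation concerns."
candidate = "The Federal Reserve raised interest rates to fight inflation."
scores = scorer.score(reference, candidate)  # score(target, prediction)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```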
ROUGE‑L
Definition: Longest Common Subsequence (LCS) between generated and reference text.
Usage: Evaluates structural similarity.
Real‑world example
Two summaries using similar sentence structures achieve higher ROUGE‑L even if some words differ.
Decent quality
- 0.4+ = useful
- 0.5+ = strong
If low
- Fine‑tune on datasets emphasizing reasoning and structure
- Remove inconsistent or unstructured reference texts
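ROUGE‑L is built on the longest common subsequence; a minimal, unstemmed LCS sketch in plain Python:

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

ref = "the federal reserve raised interest rates".split()
gen = "the reserve raised rates sharply".split()
lcs = lcs_length(ref, gen)                      # 4 shared tokens in order
recall, precision = lcs / len(ref), lcs / len(gen)
f1 = 2 * precision * recall / (precision + recall)
print(f"ROUGE-L (unstemmed) F1 = {f1:.2f}")
```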
BLEU
Definition: Precision‑based n‑gram match score with brevity penalty.
Usage: Machine translation, paraphrasing.
Real‑world example
“How are you?” → “¿Cómo estás?”
A generated translation identical to the reference yields a perfect n‑gram match → BLEU = 100.
Decent quality
- 30–40 = acceptable
- 45–55+ = high-quality translation
If low
- Add multi‑reference translations
- Improve preprocessing and normalization
- Include more domain-specific parallel corpora
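A sentence-level BLEU sketch using NLTK (assumed installed); corpus-level tools such as sacrebleu are generally preferred for reporting, and smoothing matters on short sentences:

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["¿", "Cómo", "estás", "?"]]   # one (tokenized) reference translation
hypothesis = ["¿", "Cómo", "estás", "?"]    # model output, identical here
smooth = SmoothingFunction().method1        # avoids zero scores on short sentences
score = sentence_bleu(reference, hypothesis, smoothing_function=smooth)
print(f"BLEU = {100 * score:.1f}")          # ~100 for a perfect match
```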
2. Binary Classification Metrics
Used in fraud detection, healthcare diagnostics, spam filtering, customer churn prediction, and more.
```mermaid
flowchart TD
  A[Model Predictions] --> B[Confusion Matrix]
  B --> C1["Precision = TP / (TP + FP)"]
  B --> C2["Recall = TP / (TP + FN)"]
  B --> C3[F1 = Harmonic Mean]
  B --> C4[TPR, FPR]
```
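The quantities in the diagram can be read off a confusion matrix; a sketch with scikit-learn and toy labels:

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]   # toy model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)       # true positive rate
fpr = fp / (fp + tn)          # false positive rate
print(f"precision={precision:.2f} recall={recall:.2f} FPR={fpr:.2f} "
      f"F1={f1_score(y_true, y_pred):.2f}")
```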
AUC‑ROC
Definition: Measures ability to distinguish positive vs. negative classes.
Usage: Ranking and risk scoring.
Real-world example
A fraud detection model with AUC‑ROC = 0.90 ranks a randomly chosen fraud case above a randomly chosen non‑fraud case 90% of the time.
Decent quality
- 0.7–0.8 = acceptable
- 0.8–0.9 = good
- >0.9 = excellent
If low
- Remove label leakage
- Add more representative examples
- Use better feature engineering
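A minimal AUC‑ROC check with scikit-learn; note that it takes predicted scores or probabilities, not hard 0/1 labels:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                      # toy labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]    # predicted probabilities for class 1
print(f"AUC-ROC = {roc_auc_score(y_true, y_score):.2f}")
```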
AUC‑PR
Definition: Area under the precision‑recall curve.
Usage: Imbalanced datasets where positives are rare.
Real-world example
Cancer detection where positives are 1%.
The no‑skill AUC‑PR baseline equals the positive base rate, 0.01.
A model achieving 0.40 is performing extremely well.
Decent quality
- 2–3× the base rate is reasonable
- Higher is better
If low
- Improve sampling of positive cases
- Apply oversampling/SMOTE techniques
- Use task‑specific loss weighting
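A sketch comparing AUC‑PR (average precision) to the positive base rate, which is the no‑skill baseline on imbalanced data (synthetic scores, ~1% positives):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = rng.random(10_000) < 0.01                     # ~1% positive rate, like rare-disease screening
y_score = np.where(y_true,
                   rng.normal(0.7, 0.2, y_true.shape), # positives tend to score higher
                   rng.normal(0.3, 0.2, y_true.shape))

base_rate = y_true.mean()
auc_pr = average_precision_score(y_true, y_score)
print(f"base rate = {base_rate:.3f}, AUC-PR = {auc_pr:.2f}")  # judge AUC-PR relative to base rate
```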
Precision
Definition: % of predicted positives that are correct.
Real-world example
Spam classifier:
100 emails marked as spam → 95 truly spam → precision = 0.95
Decent quality
- >0.8 typical for many domains
If low
- Raise decision threshold
- Reduce noisy negative samples
Recall (True Positive Rate)
Definition: % of actual positives the model detects.
Real-world example
Medical imaging model detecting diabetic retinopathy.
Recall = 0.95 means 95% of positive cases are captured.
Decent quality
- >0.7–0.8 = acceptable
- >0.85 = strong
If low
- Lower decision threshold
- Add better domain‑specific features
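Precision and recall trade off through the decision threshold; a sketch that sweeps candidate thresholds with scikit-learn's `precision_recall_curve` to pick an operating point:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.2, 0.4, 0.35, 0.8, 0.1, 0.9, 0.3, 0.7, 0.55, 0.45])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# Raising the threshold tends to raise precision; lowering it tends to raise recall.
```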
False Positive Rate (FPR)
Definition: % of negatives incorrectly predicted as positives.
Real-world example
Airport screening model:
5 out of 100 non‑threat items flagged → FPR = 5%
Decent quality
- <10% acceptable
- <5% strong
If high
- Improve calibration
- Re‑evaluate model thresholds
F1 Score
Definition: Harmonic mean of precision and recall.
Real-world example
Churn model:
Precision = 0.8, Recall = 0.8 → F1 = 0.8
Decent quality
- >0.7 good
- >0.8 strong
If low
- Increase minority-class examples
- Adjust threshold to balance precision/recall
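The churn example as a quick worked calculation; the harmonic mean penalizes an imbalance between precision and recall more than an arithmetic mean would:

```python
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f1(0.8, 0.8))    # 0.80, matching the churn example
print(f1(0.95, 0.40))  # ~0.56: a high-precision, low-recall model scores poorly
```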
3. Regression Metrics
Used in forecasting, pricing models, risk modeling, and time-series problems.
R² (Coefficient of Determination)
Definition: Measures how much variance in the target is explained.
Real-world example
House prices:
R² = 0.85 → 85% of price variance explained by features.
Decent quality
- >0.5 acceptable
- >0.7 strong
- >0.9 excellent
If low
- Add more predictive features
- Consider nonlinear models
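A minimal R² check with scikit-learn on illustrative house-price numbers (not from a real dataset):

```python
from sklearn.metrics import r2_score

y_true = [310_000, 450_000, 230_000, 520_000, 410_000]   # actual sale prices
y_pred = [300_000, 460_000, 250_000, 500_000, 420_000]   # model predictions
print(f"R² = {r2_score(y_true, y_pred):.2f}")
```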
MAE (Mean Absolute Error)
Definition: Mean absolute difference between predictions and true values.
Real-world example
Weather forecasting:
MAE = 2°F → predictions are off by about 2°F on average.
Decent quality
- <10–20% of mean target value
If high
- Remove outliers
- Normalize or scale features
MSE (Mean Squared Error)
Definition: Average of squared errors; punishes large mistakes.
Real-world example
Autonomous drone path prediction: large deviations are dangerous, so MSE is critical.
If high
- Investigate extreme errors
- Add regularization to prevent overfitting
RMSE (Root Mean Squared Error)
Definition: Square root of MSE; interpretable in target units.
Real-world example
Retail demand forecasting:
Average demand 100 units/day, RMSE = 12 → roughly 12% relative error.
Decent quality
- <10–20% relative error
If high
- Add seasonal features
- Use ensemble or gradient boosting models
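A sketch computing MAE, MSE, and RMSE for the demand-forecasting example, with RMSE also expressed relative to mean demand (toy numbers):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([ 98, 105, 110,  90, 120, 100, 95])   # actual daily demand (units)
y_pred = np.array([110,  95, 118, 100, 105, 108, 90])   # forecasted demand

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                                      # back in the target's units
print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}")
print(f"relative RMSE = {rmse / y_true.mean():.1%}")     # compare against the 10-20% heuristic
```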