Follow-up on the TranslateGemma subtitle benchmark: human review of segments rated "clean" by MetricX-24 and COMETKiwi [D]
Our take
A few weeks ago I shared the results of a benchmark here comparing 6 LLMs on subtitle translation, scored with two reference-free QE metrics, MetricX-24 (~13B, mT5-XXL) and COMETKiwi (~10.7B, XLM-R-XXL), combined into a TQI index. I'm posting a follow-up because we ran a human review afterwards, and the result is worth discussing.
The original benchmark put TranslateGemma-12b first in every language pair. The natural question: are those high scores accurate, or are the metrics insensitive in their high-confidence zone? These metrics correlate well with human judgment at the population level (that's what they're trained for), but population-level correlation doesn't tell you whether the segments they call "clean" are actually clean.
So we ran the check directly: 21 English subtitle segments from one tutorial video, translated by TranslateGemma into 4 languages (ES, JA, TH, ZH-CN; Korean and Traditional Chinese were dropped). All 84 translations were chosen because they passed the dashboard clean-rule (MX < 5 AND CK ≥ 0.70) in all 4 languages simultaneously. They then received full MQM annotation by professional linguists: Major/Minor severity, with categories covering accuracy (mistranslation, omission, addition, untranslated), fluency (grammar, punctuation, inconsistency), style, and terminology.
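For concreteness, the selection step can be sketched like this. It's a minimal sketch, not the dashboard's actual code; the function names and the per-segment score layout are assumptions, but the thresholds and the "clean in all 4 languages simultaneously" condition are the ones stated above. MetricX-24 is an error scale (lower is better), COMETKiwi a quality scale (higher is better).

```python
def is_clean(metricx: float, cometkiwi: float) -> bool:
    """Dashboard clean-rule from the post: MX < 5 AND CK >= 0.70."""
    return metricx < 5.0 and cometkiwi >= 0.70

def clean_in_all_languages(scores_by_lang: dict[str, tuple[float, float]]) -> bool:
    """A segment qualifies only if it passes the rule in every language pair.

    scores_by_lang maps a language code to (MetricX-24, COMETKiwi) scores;
    this layout is illustrative, not the actual dashboard schema.
    """
    return all(is_clean(mx, ck) for mx, ck in scores_by_lang.values())

# Example with made-up scores for one segment:
seg = {"ES": (1.2, 0.84), "JA": (0.9, 0.88), "TH": (2.1, 0.79), "ZH-CN": (1.5, 0.81)}
clean_in_all_languages(seg)  # True: passes in all four pairs
```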
Results under the dashboard threshold:
- Auto-flagged: 1/84
- Human-flagged: 60/84 any-error, 13/84 Major-only
- Metric-blindness rate (|auto-clean ∩ human-flagged| / |auto-clean|): 59/83 = 71% any-error, 12/83 = 14.5% Major-only
- All 25 human-found Accuracy-class errors fell in the metric-blind quadrant. Zero overlap with the auto-flagged region (which contained one Style-category Major error).
- Japanese carries 10 of 15 total mistranslations across the dataset, all metric-blind, despite having the highest mean COMETKiwi (0.863) of the four languages.
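To make the arithmetic behind the blindness rate explicit, here is a minimal sketch of the computation. The segment indices in the usage example are synthetic; only the counts (84 segments, 1 auto-flagged, 60 human-flagged, 59 in the overlap) come from the results above.

```python
def blindness_rate(auto_flagged: set[int], human_flagged: set[int], n: int) -> float:
    """Fraction of auto-clean segments that humans flagged anyway.

    blindness = |auto-clean ∩ human-flagged| / |auto-clean|
    """
    auto_clean = set(range(n)) - auto_flagged
    blind = auto_clean & human_flagged
    return len(blind) / len(auto_clean)

# Synthetic indices reproducing the post's counts: segment 0 is the one
# auto-flagged item (which humans also flagged), segments 0-59 are the
# 60 human-flagged items.
rate = blindness_rate(auto_flagged={0}, human_flagged=set(range(60)), n=84)
# rate == 59/83, i.e. ~71% any-error blindness
```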
Caveat: small n, one model, one content set, so the numbers are directional rather than definitive.
Original thread: [link]
Full benchmark report: in comments.