A Structured Review of the Validity of BLEU

Ehud Reiter

doi:10.1162/COLI_a_00322

Research output: Contribution to journal › Article › peer-review

165 Citations (Scopus)

15 Downloads (Pure)

Abstract

The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique, in other words whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.

Original language	English
Pages (from-to)	393-401
Number of pages	9
Journal	Computational Linguistics
Volume	44
Issue number	3
Early online date	21 Sept 2018
DOIs	https://doi.org/10.1162/COLI_a_00322
Publication status	Published - Sept 2018

Access to Document

10.1162/COLI_a_00322 (Licence: CC BY)

A Structured Review of the Validity of BLEU
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.
Final published version, 126 KB (Licence: CC BY)

Structured Review of the Validity of BLEU
Reiter, E. (Creator), University of Aberdeen, 2018
DOI: 10.20392/766c9dd8-75a7-4761-915d-856c0f7cc3c4
Dataset

Ehud Reiter
- Computational Linguistics at Aberdeen
- School of Natural & Computing Sciences, Computing Science - Chair in Computing Science.
Person: Academic

Cite this

@article{d72886c4bc0e4a2fabe64b173120beaf,
  title = "A Structured Review of the Validity of BLEU",
  abstract = "The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique, in other words whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.",
  author = "Ehud Reiter",
  year = "2018",
  month = sep,
  doi = "10.1162/COLI_a_00322",
  language = "English",
  volume = "44",
  pages = "393--401",
  journal = "Computational Linguistics",
  issn = "0891-2017",
  publisher = "MIT Press Journals",
  number = "3",
}

TY  - JOUR
T1  - A Structured Review of the Validity of BLEU
AU  - Reiter, Ehud
PY  - 2018/9
Y1  - 2018/9
N2  - The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique, in other words whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.
AB  - The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique, in other words whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outwith MT, for evaluation of individual texts, or for scientific hypothesis testing.
U2  - 10.1162/COLI_a_00322
DO  - 10.1162/COLI_a_00322
M3  - Article
SN  - 0891-2017
VL  - 44
SP  - 393
EP  - 401
JO  - Computational Linguistics
JF  - Computational Linguistics
IS  - 3
ER  -
