Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP

Anya Belz, Craig Thomson, Ehud Reiter, Simon Mille

Research output: Chapter in Book/Report/Conference proceeding › Chapter


Abstract

Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried, let alone formally tested, in NLP, which means that their repeatability and the reproducibility of their results are currently open questions. This focused contribution reports our review of human evaluation experiments reported in NLP papers over the past five years, which we assessed in terms of their ability to be rerun. Overall, we estimate that just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goes up to about 20% when author help is sought. We complement this investigation with a survey of results concerning the reproducibility of human evaluations where those are repeatable in the first place. Here we find worryingly low degrees of reproducibility, both in terms of similarity of scores and of findings supported by them. We summarise what insights can be gleaned so far regarding how to make human evaluations in NLP more repeatable and more reproducible.
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: ACL 2023
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Place of Publication: Toronto, Canada
Publisher: Association for Computational Linguistics
Pages: 3676-3687
Number of pages: 12
DOIs
Publication status: Published - 1 Jul 2023
