Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

Anya Belz; Craig Thomson; Ehud Reiter; Gavin Abercrombie; Jose M. Alonso Moral; Mohammad Arvan; Jackie Cheung; Mark Cieliebak; Elizabeth Clark; Kees van Deemter; Tanvi Dinkar; Ondřej Dušek; Steffen Eger; Qixiang Fang; Albert Gatt; Dimitra Gkatzia; Javier González Corbelle; Dirk Hovy; Manuela Hürlimann; Takumi Ito; John D. Kelleher; Filip Klubicka; Huiyuan Lai; Chris van der Lee; Emiel van Miltenburg; Yiru Li; Saad Mahamood; Margot Mieskes; Malvina Nissim; Natalie Parde; Ondrej Plátek; Verena Rieser; Pablo Mosteiro Romero; Joel Tetreault; Antonio Toral; Xiaojun Wang; Leo Wanner; Lewis Watson; Diyi Yang

doi:10.18653/v1/2023.insights-1.1

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

12 Citations (Scopus)

Abstract

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

Original language	English
Title of host publication	The Fourth Workshop on Insights from Negative Results in NLP
Editors	Shabnam Tafreshi, Arjun Akula, João Sedoc, Aleksandr Drozd, Anna Rogers, Anna Rumshisky
Place of Publication	Dubrovnik, Croatia
Publisher	Association for Computational Linguistics
Pages	1-10
Number of pages	10
DOIs	https://doi.org/10.18653/v1/2023.insights-1.1
Publication status	Published - 1 May 2023
Event	Insights 2023: The Fourth Workshop on Insights from Negative Results in NLP, Dubrovnik, Croatia. Duration: 2 Jun 2023 → 6 Jun 2023

Workshop

Workshop	Insights 2023: The Fourth Workshop on Insights from Negative Results in NLP
Country/Territory	Croatia
City	Dubrovnik
Period	2/06/23 → 6/06/23

Access to Document

10.18653/v1/2023.insights-1.1 (Licence: CC BY)

Belz_etal_ACL_Missing_Information_Unresposive_VoR
ACL materials are Copyright © 1963–2023 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License
Final published version, 184 KB (Licence: CC BY)

Cite this

Belz, A., Thomson, C., Reiter, E., Abercrombie, G., Alonso Moral, J. M., Arvan, M., Cheung, J., Cieliebak, M., Clark, E., van Deemter, K., Dinkar, T., Dušek, O., Eger, S., Fang, Q., Gatt, A., Gkatzia, D., González Corbelle, J., Hovy, D., Hürlimann, M., ... Yang, D. (2023). Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP. In S. Tafreshi, A. Akula, J. Sedoc, A. Drozd, A. Rogers, & A. Rumshisky (Eds.), The Fourth Workshop on Insights from Negative Results in NLP (pp. 1-10). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.insights-1.1

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP. / Belz, Anya; Thomson, Craig; Reiter, Ehud et al.
The Fourth Workshop on Insights from Negative Results in NLP. ed. / Shabnam Tafreshi; Arjun Akula; João Sedoc; Aleksandr Drozd; Anna Rogers; Anna Rumshisky. Dubrovnik, Croatia: Association for Computational Linguistics, 2023. p. 1-10.

Belz, A, Thomson, C, Reiter, E, Abercrombie, G, Alonso Moral, JM, Arvan, M, Cheung, J, Cieliebak, M, Clark, E, van Deemter, K, Dinkar, T, Dušek, O, Eger, S, Fang, Q, Gatt, A, Gkatzia, D, González Corbelle, J, Hovy, D, Hürlimann, M, Ito, T, Kelleher, JD, Klubicka, F, Lai, H, van der Lee, C, van Miltenburg, E, Li, Y, Mahamood, S, Mieskes, M, Nissim, M, Parde, N, Plátek, O, Rieser, V, Mosteiro Romero, P, Tetreault, J, Toral, A, Wang, X, Wanner, L, Watson, L & Yang, D 2023, Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP. in S Tafreshi, A Akula, J Sedoc, A Drozd, A Rogers & A Rumshisky (eds), The Fourth Workshop on Insights from Negative Results in NLP. Association for Computational Linguistics, Dubrovnik, Croatia, pp. 1-10, Insights 2023: The Fourth Workshop on Insights from Negative Results in NLP, Dubrovnik, Croatia, 2/06/23. https://doi.org/10.18653/v1/2023.insights-1.1

Belz A, Thomson C, Reiter E, Abercrombie G, Alonso Moral JM, Arvan M et al. Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP. In Tafreshi S, Akula A, Sedoc J, Drozd A, Rogers A, Rumshisky A, editors, The Fourth Workshop on Insights from Negative Results in NLP. Dubrovnik, Croatia: Association for Computational Linguistics. 2023. p. 1-10 doi: 10.18653/v1/2023.insights-1.1

Belz, Anya ; Thomson, Craig ; Reiter, Ehud et al. / Missing Information, Unresponsive Authors, Experimental Flaws : The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP. The Fourth Workshop on Insights from Negative Results in NLP. editor / Shabnam Tafreshi ; Arjun Akula ; João Sedoc ; Aleksandr Drozd ; Anna Rogers ; Anna Rumshisky. Dubrovnik, Croatia : Association for Computational Linguistics, 2023. pp. 1-10

@inproceedings{7c179fec358649f38b0daf7f357d1ffc,

title = "Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP",

abstract = "We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.",

author = "Anya Belz and Craig Thomson and Ehud Reiter and Gavin Abercrombie and {Alonso Moral}, {Jose M.} and Mohammad Arvan and Jackie Cheung and Mark Cieliebak and Elizabeth Clark and {van Deemter}, Kees and Tanvi Dinkar and Ond{\v r}ej Du{\v s}ek and Steffen Eger and Qixiang Fang and Albert Gatt and Dimitra Gkatzia and {Gonz{\'a}lez Corbelle}, Javier and Dirk Hovy and Manuela H{\"u}rlimann and Takumi Ito and Kelleher, {John D.} and Filip Klubicka and Huiyuan Lai and {van der Lee}, Chris and {van Miltenburg}, Emiel and Yiru Li and Saad Mahamood and Margot Mieskes and Malvina Nissim and Natalie Parde and Ondrej Pl{\'a}tek and Verena Rieser and {Mosteiro Romero}, Pablo and Joel Tetreault and Antonio Toral and Xiaojun Wang and Leo Wanner and Lewis Watson and Diyi Yang",

year = "2023",

month = may,

day = "1",

doi = "10.18653/v1/2023.insights-1.1",

language = "English",

pages = "1--10",

editor = "Shabnam Tafreshi and Akula, {Arjun} and Sedoc, {Jo{\~a}o} and Aleksandr Drozd and Anna Rogers and Rumshisky, {Anna}",

booktitle = "The Fourth Workshop on Insights from Negative Results in NLP",

publisher = "Association for Computational Linguistics",

note = "Insights 2023: The Fourth Workshop on Insights from Negative Results in NLP ; Conference date: 02-06-2023 Through 06-06-2023",

}

TY - GEN

T1 - Missing Information, Unresponsive Authors, Experimental Flaws

T2 - Insights 2023: The Fourth Workshop on Insights from Negative Results in NLP

AU - Belz, Anya

AU - Thomson, Craig

AU - Reiter, Ehud

AU - Abercrombie, Gavin

AU - Alonso Moral, Jose M.

AU - Arvan, Mohammad

AU - Cheung, Jackie

AU - Cieliebak, Mark

AU - Clark, Elizabeth

AU - van Deemter, Kees

AU - Dinkar, Tanvi

AU - Dušek, Ondřej

AU - Eger, Steffen

AU - Fang, Qixiang

AU - Gatt, Albert

AU - Gkatzia, Dimitra

AU - González Corbelle, Javier

AU - Hovy, Dirk

AU - Hürlimann, Manuela

AU - Ito, Takumi

AU - Kelleher, John D.

AU - Klubicka, Filip

AU - Lai, Huiyuan

AU - van der Lee, Chris

AU - van Miltenburg, Emiel

AU - Li, Yiru

AU - Mahamood, Saad

AU - Mieskes, Margot

AU - Nissim, Malvina

AU - Parde, Natalie

AU - Plátek, Ondrej

AU - Rieser, Verena

AU - Mosteiro Romero, Pablo

AU - Tetreault, Joel

AU - Toral, Antonio

AU - Wang, Xiaojun

AU - Wanner, Leo

AU - Watson, Lewis

AU - Yang, Diyi

PY - 2023/5/1

Y1 - 2023/5/1

AB - We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

U2 - 10.18653/v1/2023.insights-1.1

DO - 10.18653/v1/2023.insights-1.1

M3 - Published conference contribution

SP - 1

EP - 10

BT - The Fourth Workshop on Insights from Negative Results in NLP

A2 - Tafreshi, Shabnam

A2 - Akula, Arjun

A2 - Sedoc, João

A2 - Drozd, Aleksandr

A2 - Rogers, Anna

A2 - Rumshisky, Anna

PB - Association for Computational Linguistics

CY - Dubrovnik, Croatia

Y2 - 2 June 2023 through 6 June 2023

ER -
