A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems

Craig Alexander Thomson; Ehud Reiter

Research output: Contribution to conference › Unpublished paper › peer-review

Abstract

Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

Original language	English
Pages	158-168
Number of pages	11
Publication status	Published - Dec 2020
Event	Proceedings of the 13th International Conference on Natural Language Generation - Held online, Dublin City University, Dublin, Ireland. Duration: 15 Dec 2020 → 18 Dec 2020. Conference number: 13. https://www.inlg2020.org/

Conference

Conference	Proceedings of the 13th International Conference on Natural Language Generation
Abbreviated title	INLG 2020
Country/Territory	Ireland
City	Dublin
Period	15/12/20 → 18/12/20
Internet address	https://www.inlg2020.org/

Bibliographical note

Acknowledgements:
Many thanks to the Mechanical Turk annotators who participated in our experiment, and also to David Reiter, Tim Daniels, Rodrigo de Oliveira, and Andrew Smith for serving as pilot annotators when we were developing the methodology described in this paper. We would also like to thank Moray Greig for being our basketball domain expert during development. We are also grateful for the very helpful comments on this paper from the anonymous reviewers, the Aberdeen CLAN group, David Howcroft, Clément Rebuffel, and Chris van der Lee. We would also like to thank Sam Wiseman, Ratish Puduppully, and Clément Rebuffel for providing the generated texts from their respective systems. The work presented here is partially funded by the Engineering and Physical Sciences Research Council (EPSRC), which funds Craig Thomson under a National Productivity Investment Fund Doctoral Studentship (EP/R512412/1).

Access to Document

Thomson_et_al_ACL_AGoldStandardMethodology_VoR
Materials published in or after 2016 are licensed under a Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by/4.0/
Final published version, 197 KB. Licence: CC BY

https://www.aclweb.org/anthology/2020.inlg-1.22/
Licence: CC BY

Cite this

@conference{479a65c139fe4dc1a075609ad193f532,

title = "A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems",

abstract = "Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.",

author = "Thomson, {Craig Alexander} and Ehud Reiter",

note = "Acknowledgements: Many thanks to the Mechanical Turk annotators who participated in our experiment, and also to David Reiter, Tim Daniels, Rodrigo de Oliveira, and Andrew Smith for serving as pilot annotators when we were developing the methodology described in this paper. We would also like to thank Moray Greig for being our basketball domain expert during development. We are also grateful for the very helpful comments on this paper from the anonymous reviewers, the Aberdeen CLAN group, David Howcroft, Cl{\'e}ment Rebuffel, and Chris van der Lee. We would also like to thank Sam Wiseman, Ratish Puduppully, and Cl{\'e}ment Rebuffel for providing the generated texts from their respective systems. The work presented here is partially funded by the Engineering and Physical Sciences Research Council (EPSRC), which funds Craig Thomson under a National Productivity Investment Fund Doctoral Studentship (EP/R512412/1).; Proceedings of the 13th International Conference on Natural Language Generation, INLG 2020 ; Conference date: 15-12-2020 Through 18-12-2020",

year = "2020",

month = dec,

language = "English",

pages = "158--168",

url = "https://www.inlg2020.org/",

}

TY  - CONF
T1  - A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems
AU  - Thomson, Craig Alexander
AU  - Reiter, Ehud
N1  - Conference code: 13
PY  - 2020/12
Y1  - 2020/12
N2  - Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.
AB  - Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.
M3  - Unpublished paper
SP  - 158
EP  - 168
T2  - Proceedings of the 13th International Conference on Natural Language Generation
Y2  - 15 December 2020 through 18 December 2020
ER  -
