Making test corpora for question answering more representative

Andrew Walker, Andrew Starkey, Jeff Z. Pan, Advaith Siddharthan

Research output: Chapter in Book/Report/Conference proceeding > Published conference contribution

1 Citation (Scopus)


Despite two high-profile series of challenges devoted to question answering technologies, there remains no formal study into how representative question corpora are of real end-user inputs. We examine the corpora used presently and historically in the TREC and QALD challenges in juxtaposition with two more from natural sources and identify a degree of disjointedness between the two. We analyse these differences in depth before discussing a candidate approach to question corpora generation and assessing its own representativeness. We conclude that these artificial corpora have good overall coverage of grammatical structures but the distribution is skewed, meaning performance measures may be inaccurate.

Original language: English
Title of host publication: Information Access Evaluation. Multilinguality, Multimodality, and Interaction
Subtitle of host publication: CLEF 2014
Editors: Evangelos Kanoulas, Mihai Lupu, Paul Clough, Mark Sanderson, Mark Hall, Allan Hanbury, Elaine Toms
Number of pages: 6
ISBN (Electronic): 9783319113821
ISBN (Print): 9783319113814
Publication status: Published - 2014
Event: 5th International Conference of the CLEF Initiative, CLEF 2014 - Sheffield, United Kingdom
Duration: 15 Sept 2014 - 18 Sept 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8685 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 5th International Conference of the CLEF Initiative, CLEF 2014
Country/Territory: United Kingdom

