SportSett: Basketball - A robust and maintainable dataset for Natural Language Generation

Craig Thomson; Ehud Reiter; Somayajulu Sripada

University of Aberdeen

Research output: Contribution to conference › Unpublished paper › peer-review

Abstract

Data2Text Natural Language Generation is a complex and varied task. We investigate the data requirements for the difficult real-world problem of generating statistic-focused summaries of basketball games. This has recently been tackled using the Rotowire and Rotowire-FG datasets of paired data and text. It can, however, be difficult to filter, query, and maintain such large volumes of data. In this resource paper, we introduce the SportSett:Basketball database. This easy-to-use resource allows for simple scripts to be written which generate data in suitable formats for a variety of systems. Building upon the existing data, we provide more attributes, across multiple dimensions, increasing the overlap of content between data and text. We also highlight and resolve issues of training, validation and test partition contamination in these previous datasets.

Original language	English
Publication status	Accepted/In press - 17 Aug 2020
Event	IntelLanG : Intelligent Information Processing and Natural Language Generation, Santiago de Compostela, Spain. Duration: 7 Sept 2020 → 7 Sept 2020. https://intellang.github.io/

Conference

Conference	IntelLanG
Country/Territory	Spain
City	Santiago de Compostela
Period	7/09/20 → 7/09/20
Internet address	https://intellang.github.io/

Cite this

@conference{ed090ffbdb1943fe9311ba59196310d2,

title = "SportSett: Basketball - A robust and maintainable dataset for Natural Language Generation",

abstract = "Data2Text Natural Language Generation is a complex and varied task. We investigate the data requirements for the difficult real-world problem of generating statistic-focused summaries of basketball games. This has recently been tackled using the Rotowire and Rotowire-FG datasets of paired data and text. It can, however, be difficult to filter, query, and maintain such large volumes of data. In this resource paper, we introduce the SportSett:Basketball database. This easy-to-use resource allows for simple scripts to be written which generate data in suitable formats for a variety of systems. Building upon the existing data, we provide more attributes, across multiple dimensions, increasing the overlap of content between data and text. We also highlight and resolve issues of training, validation and test partition contamination in these previous datasets.",

author = "Craig Thomson and Ehud Reiter and Somayajulu Sripada",

year = "2020",

month = aug,

day = "17",

language = "English",

note = "IntelLanG : Intelligent Information Processing and Natural Language Generation ; Conference date: 07-09-2020 Through 07-09-2020",

url = "https://intellang.github.io/",

}

TY - CONF

T1 - SportSett

T2 - IntelLanG

AU - Thomson, Craig

AU - Reiter, Ehud

AU - Sripada, Somayajulu

PY - 2020/8/17

Y1 - 2020/8/17

N2 - Data2Text Natural Language Generation is a complex and varied task. We investigate the data requirements for the difficult real-world problem of generating statistic-focused summaries of basketball games. This has recently been tackled using the Rotowire and Rotowire-FG datasets of paired data and text. It can, however, be difficult to filter, query, and maintain such large volumes of data. In this resource paper, we introduce the SportSett:Basketball database. This easy-to-use resource allows for simple scripts to be written which generate data in suitable formats for a variety of systems. Building upon the existing data, we provide more attributes, across multiple dimensions, increasing the overlap of content between data and text. We also highlight and resolve issues of training, validation and test partition contamination in these previous datasets.

AB - Data2Text Natural Language Generation is a complex and varied task. We investigate the data requirements for the difficult real-world problem of generating statistic-focused summaries of basketball games. This has recently been tackled using the Rotowire and Rotowire-FG datasets of paired data and text. It can, however, be difficult to filter, query, and maintain such large volumes of data. In this resource paper, we introduce the SportSett:Basketball database. This easy-to-use resource allows for simple scripts to be written which generate data in suitable formats for a variety of systems. Building upon the existing data, we provide more attributes, across multiple dimensions, increasing the overlap of content between data and text. We also highlight and resolve issues of training, validation and test partition contamination in these previous datasets.

M3 - Unpublished paper

Y2 - 7 September 2020 through 7 September 2020

ER -
