Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

Francesco Moramarco; Alex Papadopoulos Korfiatis; Mark Perera; Damir Juric; Jack Flann; Ehud Reiter; Aleksandar Savkov; Anja Belz

doi:10.48550/ARXIV.2204.00447

Babylon Health

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

16 Citations (Scopus)

Abstract

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this, we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore. All our findings and annotations are open-sourced.
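To make the kind of comparison described in the abstract concrete, the following is a minimal Python sketch of a character-based Levenshtein distance and of correlating such a metric with human scores. It is an illustration only, not the authors' implementation: the function names (levenshtein, normalised_distance, pearson), the length normalisation, the choice of Pearson correlation, and the toy notes and scores at the bottom are all assumptions introduced here.

```python
import math
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalised_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented toy data, NOT taken from the paper: (generated note, reference note)
    pairs = [
        ("Patient reports headache for 3 days.", "Patient reports a headache for three days."),
        ("No fever. Advised rest.", "Mild fever. Advised paracetamol and rest."),
        ("Follow up in two weeks.", "Follow up in two weeks."),
    ]
    human_error_counts = [1.0, 2.0, 0.0]  # hypothetical human judgements
    metric_scores = [normalised_distance(gen, ref) for gen, ref in pairs]
    print(pearson(metric_scores, human_error_counts))
```

A rank-based coefficient could be swapped in for pearson if the human scores are ordinal; the sketch uses Pearson only because it needs no external dependencies.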

Original language	English
Title of host publication	Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Editors	Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Place of Publication	Dublin
Publisher	Association for Computational Linguistics
Pages	5739–5754
Number of pages	16
Volume	1
ISBN (Electronic)	978-1-955917-21-6
DOIs	https://doi.org/10.48550/ARXIV.2204.00447 https://doi.org/10.18653/v1/2022.acl-long.394
Publication status	Published - 1 May 2022
Event	ACL 2022: 60th Annual Meeting of the Association for Computational Linguistics, The Convention Centre Dublin, Dublin, Ireland. Duration: 22 May 2022 → 27 May 2022. Conference number: 60. https://www.2022.aclweb.org/

Conference

Conference	ACL 2022
Abbreviated title	ACL
Country/Territory	Ireland
City	Dublin
Period	22/05/22 → 27/05/22
Internet address	https://www.2022.aclweb.org/

Bibliographical note

The authors would like to thank Rachel Young and Tom Knoll for supporting the team and hiring the evaluators, Vitalii Zhelezniak for his advice on revising the paper, and Kristian Boda for helping to set up the Stanza+Snomed fact-extraction system.

Keywords

Computation and Language (cs.CL)
FOS: Computer and information sciences

Access to Document

10.48550/ARXIV.2204.00447 (Licence: Unspecified)
10.18653/v1/2022.acl-long.394 (Licence: CC BY)

Moramarco_etal_ACL_Human_Evaluation_And_VoR
Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/
Final published version, 2.72 MB (Licence: CC BY)

Cite this

Moramarco, F., Korfiatis, A. P., Perera, M., Juric, D., Flann, J., Reiter, E., Savkov, A., & Belz, A. (2022). Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 5739–5754). Association for Computational Linguistics. https://doi.org/10.48550/ARXIV.2204.00447, https://doi.org/10.18653/v1/2022.acl-long.394

Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. / Moramarco, Francesco; Korfiatis, Alex Papadopoulos; Perera, Mark et al.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ed. / Smaranda Muresan; Preslav Nakov; Aline Villavicencio. Vol. 1 Dublin: Association for Computational Linguistics, 2022. p. 5739–5754.

Moramarco, F, Korfiatis, AP, Perera, M, Juric, D, Flann, J, Reiter, E, Savkov, A & Belz, A 2022, Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. in S Muresan, P Nakov & A Villavicencio (eds), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). vol. 1, Association for Computational Linguistics, Dublin, pp. 5739–5754, ACL 2022, Dublin, Ireland, 22/05/22. https://doi.org/10.48550/ARXIV.2204.00447, https://doi.org/10.18653/v1/2022.acl-long.394

Moramarco F, Korfiatis AP, Perera M, Juric D, Flann J, Reiter E et al. Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. In Muresan S, Nakov P, Villavicencio A, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. Dublin: Association for Computational Linguistics. 2022. p. 5739–5754 doi: 10.48550/ARXIV.2204.00447, 10.18653/v1/2022.acl-long.394

Moramarco, Francesco ; Korfiatis, Alex Papadopoulos ; Perera, Mark et al. / Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). editor / Smaranda Muresan ; Preslav Nakov ; Aline Villavicencio. Vol. 1 Dublin : Association for Computational Linguistics, 2022. pp. 5739–5754

@inproceedings{784c0b86988146ce8c981524ac7bad21,
title = "Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation",
abstract = "In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore. All our findings and annotations are open-sourced.",
keywords = "Computation and Language (cs.CL), FOS: Computer and information sciences",
author = "Francesco Moramarco and Korfiatis, {Alex Papadopoulos} and Mark Perera and Damir Juric and Jack Flann and Ehud Reiter and Aleksandar Savkov and Anja Belz",
note = "The authors would like to thank Rachel Young and Tom Knoll for supporting the team and hiring the evaluators, Vitalii Zhelezniak for his advice on revising the paper, and Kristian Boda for helping to set up the Stanza+Snomed fact-extraction system. ; ACL 2022 : 60th Annual Meeting of the Association for Computational Linguistics, ACL ; Conference date: 22-05-2022 Through 27-05-2022",
year = "2022",
month = may,
day = "1",
doi = "10.48550/ARXIV.2204.00447",
language = "English",
volume = "1",
pages = "5739–5754",
editor = "Smaranda Muresan and Preslav Nakov and Aline Villavicencio",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
publisher = "Association for Computational Linguistics",
url = "https://www.2022.aclweb.org/",
}

TY - GEN
T1 - Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation
AU - Moramarco, Francesco
AU - Korfiatis, Alex Papadopoulos
AU - Perera, Mark
AU - Juric, Damir
AU - Flann, Jack
AU - Reiter, Ehud
AU - Savkov, Aleksandar
AU - Belz, Anja
N1 - Conference code: 60
PY - 2022/5/1
Y1 - 2022/5/1
N2 - In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore. All our findings and annotations are open-sourced.
AB - In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore. All our findings and annotations are open-sourced.
KW - Computation and Language (cs.CL)
KW - FOS: Computer and information sciences
UR - https://deepai.org/publication/human-evaluation-and-correlation-with-automatic-metrics-in-consultation-note-generation
U2 - 10.48550/ARXIV.2204.00447
DO - 10.48550/ARXIV.2204.00447
M3 - Published conference contribution
VL - 1
SP - 5739
EP - 5754
BT - Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A2 - Muresan, Smaranda
A2 - Nakov, Preslav
A2 - Villavicencio, Aline
PB - Association for Computational Linguistics
CY - Dublin
T2 - ACL 2022
Y2 - 22 May 2022 through 27 May 2022
ER -
