In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet there is little work on how to properly evaluate generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this, we present an extensive human evaluation study of consultation notes in which 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study between 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics such as BERTScore. All our findings and annotations are open-sourced.
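The headline finding, that a normalised character-level Levenshtein metric can rival model-based scores, is straightforward to reproduce in outline. Below is a minimal, illustrative Python sketch, not the authors' released code: it computes a character-based Levenshtein similarity between a generated note and a reference note, then correlates metric scores with human ratings using Spearman's rho via `scipy.stats.spearmanr`. The example notes and ratings are invented placeholders, and the normalisation choice (dividing by the longer string's length) is an assumption, not necessarily the paper's exact formulation.

```python
from scipy.stats import spearmanr


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(hyp: str, ref: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings. Normalisation by
    the longer string's length is an illustrative assumption."""
    if not hyp and not ref:
        return 1.0
    return 1.0 - levenshtein(hyp, ref) / max(len(hyp), len(ref))


# Placeholder generated/reference note pairs and human quality ratings,
# purely hypothetical; the paper's real judgements are in its released annotations.
pairs = [("pt reports mild headache for 3 days", "Patient reports a mild headache for three days"),
         ("no chest pain or sob", "Denies chest pain and shortness of breath"),
         ("started amoxicillin 500mg", "Started on amoxicillin 500 mg three times daily")]
human_ratings = [4.0, 3.5, 4.5]

metric_scores = [levenshtein_similarity(hyp, ref) for hyp, ref in pairs]
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A correlation study along these lines would be run over all notes and clinicians, comparing each candidate metric's scores against the human judgements; in the paper, 18 such metrics are compared this way.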
Title of host publication: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Place of Publication: Dublin
Publisher: Association for Computational Linguistics
Number of pages: 16
Publication status: Published - 1 May 2022
Event: ACL 2022: 60th Annual Meeting of the Association for Computational Linguistics, The Convention Centre Dublin, Dublin, Ireland
Duration: 22 May 2022 → 27 May 2022
Conference number: 60
Bibliographical note: The authors would like to thank Rachel Young and Tom Knoll for supporting the team and hiring the evaluators, Vitalii Zhelezniak for his advice on revising the paper, and Kristian Boda for helping to set up the Stanza+SNOMED fact-extraction system.
- Computation and Language (cs.CL)
- FOS: Computer and information sciences