It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

Chen Chen; Ruizhe Li; Yuchen Hu; Sabato Marco Siniscalchi; Pin-Yu Chen; Ensiong Chng; Chao-Han Huck Yang

doi:10.48550/arXiv.2402.05457

Research output: Working paper › Preprint

Abstract

Recent studies have shown that large language models (LLMs) can be used for generative error correction (GER) on top of automatic speech recognition (ASR) output. Specifically, an LLM is used to map the N-best hypotheses list generated by an ASR system directly to the predicted output transcription. However, despite its effectiveness, GER introduces extra data uncertainty because the LLM is trained without taking into account the acoustic information available in the speech signal. In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a multimodal fusion approach implemented within an auto-regressive decoding process and works in two stages: (i) it first analyzes and calibrates the token-level LLM decision, and (ii) it then dynamically assimilates the information from the acoustic modality. Experimental evidence collected from various ASR tasks shows that UADF surpasses existing fusion mechanisms in several ways. It yields significant improvements in word error rate (WER) while mitigating data uncertainty issues in the LLM and addressing the poor generalization that results from relying on a single modality during fusion. We also demonstrate that UADF seamlessly adapts to audio-visual speech recognition.
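
The two stages named in the abstract can be illustrated with a minimal sketch of one decoding step. This is not the authors' implementation: the abstract does not give the formulas, so the calibration method (temperature scaling), the uncertainty measure (normalized entropy), the convex fusion rule, and the names `fuse_step`, `llm_logits`, `asr_logits`, and `temperature` are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_step(llm_logits, asr_logits, temperature=1.5):
    """One hypothetical late-fusion decoding step over a shared vocabulary.

    Stage (i): calibrate the LLM token distribution (temperature scaling
    is an assumption, not taken from the source).
    Stage (ii): weight the acoustic distribution by the LLM's uncertainty
    (normalized entropy is likewise an assumption).
    """
    llm_probs = softmax(llm_logits, temperature)
    asr_probs = softmax(asr_logits)

    # Normalized entropy lies in [0, 1]: 0 = fully confident, 1 = uniform.
    uncertainty = entropy(llm_probs) / math.log(len(llm_probs))

    # The less certain the LLM is, the more the acoustic model contributes.
    fused = [
        (1.0 - uncertainty) * p + uncertainty * q
        for p, q in zip(llm_probs, asr_probs)
    ]
    token = max(range(len(fused)), key=fused.__getitem__)
    return token, uncertainty, fused


if __name__ == "__main__":
    # Confident LLM: its choice (token 0) survives the fusion.
    print(fuse_step([10.0, 0.0, 0.0], [0.0, 5.0, 0.0])[0])
    # Uncertain LLM: the acoustic model's choice (token 1) dominates.
    print(fuse_step([0.1, 0.0, 0.0], [0.0, 5.0, 0.0])[0])
```

Under these assumptions the fusion weight is recomputed at every token, which is what makes the scheme dynamic: a static interpolation weight would trust the acoustic model equally whether or not the LLM is confident.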

Original language	English
Publisher	ArXiv
Number of pages	17
DOIs	https://doi.org/10.48550/arXiv.2402.05457
Publication status	Published - 8 Feb 2024

Bibliographical note

Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

Keywords

cs.CL
cs.AI
cs.MM
cs.SD
eess.AS

Access to Document

10.48550/arXiv.2402.05457 (Licence: CC BY-SA)

2402.05457v1
https://creativecommons.org/licenses/by-nc-sa/4.0/
Final published version, 592 KB (Licence: CC BY-SA)

Cite this

@techreport{8b96a662d8504ca0908ffa2bc7aabab6,

title = "It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition",

abstract = "Recent studies have shown that large language models (LLMs) can be used for generative error correction (GER) on top of automatic speech recognition (ASR) output. Specifically, an LLM is used to map the N-best hypotheses list generated by an ASR system directly to the predicted output transcription. However, despite its effectiveness, GER introduces extra data uncertainty because the LLM is trained without taking into account the acoustic information available in the speech signal. In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a multimodal fusion approach implemented within an auto-regressive decoding process and works in two stages: (i) it first analyzes and calibrates the token-level LLM decision, and (ii) it then dynamically assimilates the information from the acoustic modality. Experimental evidence collected from various ASR tasks shows that UADF surpasses existing fusion mechanisms in several ways. It yields significant improvements in word error rate (WER) while mitigating data uncertainty issues in the LLM and addressing the poor generalization that results from relying on a single modality during fusion. We also demonstrate that UADF seamlessly adapts to audio-visual speech recognition.",

keywords = "cs.CL, cs.AI, cs.MM, cs.SD, eess.AS",

author = "Chen Chen and Ruizhe Li and Yuchen Hu and Siniscalchi, {Sabato Marco} and Pin-Yu Chen and Ensiong Chng and Yang, {Chao-Han Huck}",

note = "Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license",

year = "2024",

month = feb,

day = "8",

doi = "10.48550/arXiv.2402.05457",

language = "English",

publisher = "ArXiv",

type = "WorkingPaper",

institution = "ArXiv",

}

TY - UNPB

T1 - It's Never Too Late

T2 - Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

AU - Chen, Chen

AU - Li, Ruizhe

AU - Hu, Yuchen

AU - Siniscalchi, Sabato Marco

AU - Chen, Pin-Yu

AU - Chng, Ensiong

AU - Yang, Chao-Han Huck

N1 - Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

PY - 2024/2/8

Y1 - 2024/2/8

AB - Recent studies have shown that large language models (LLMs) can be used for generative error correction (GER) on top of automatic speech recognition (ASR) output. Specifically, an LLM is used to map the N-best hypotheses list generated by an ASR system directly to the predicted output transcription. However, despite its effectiveness, GER introduces extra data uncertainty because the LLM is trained without taking into account the acoustic information available in the speech signal. In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a multimodal fusion approach implemented within an auto-regressive decoding process and works in two stages: (i) it first analyzes and calibrates the token-level LLM decision, and (ii) it then dynamically assimilates the information from the acoustic modality. Experimental evidence collected from various ASR tasks shows that UADF surpasses existing fusion mechanisms in several ways. It yields significant improvements in word error rate (WER) while mitigating data uncertainty issues in the LLM and addressing the poor generalization that results from relying on a single modality during fusion. We also demonstrate that UADF seamlessly adapts to audio-visual speech recognition.

KW - cs.CL

KW - cs.AI

KW - cs.MM

KW - cs.SD

KW - eess.AS

U2 - 10.48550/arXiv.2402.05457

DO - 10.48550/arXiv.2402.05457

M3 - Preprint

BT - It's Never Too Late

PB - ArXiv

ER -
