GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Yuchen Hu; Chen Chen; Chao-Han Huck Yang; Ruizhe Li; Dong Zhang; Zhehuai Chen; Eng Siong Chng

doi:10.48550/arXiv.2402.06894

Research output: Working paper › Preprint

Abstract

Recent advances in large language models (LLMs) have driven progress in multilingual speech and machine translation through reduced representation errors and the incorporation of external knowledge. However, both translation tasks typically rely on beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them suboptimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in the N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.
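
The contrast the abstract draws between top-1 selection and N-best integration can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `select_top1` and `build_nbest_prompt` and the prompt wording are hypothetical, and the actual prompt format and finetuning setup are those of the linked repository, not this example. The sketch only shows the general idea of passing every candidate to an LLM instead of discarding all but the first.

```python
from typing import List


def select_top1(hypotheses: List[str]) -> str:
    """Baseline: keep only the highest-ranked hypothesis and discard the rest."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    return hypotheses[0]


def build_nbest_prompt(hypotheses: List[str], target_language: str) -> str:
    """Format an N-best list as one instruction prompt (hypothetical template)."""
    # Drop empty strings and exact duplicates while preserving ranking order.
    seen = set()
    unique = []
    for hyp in hypotheses:
        text = hyp.strip()
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    if not unique:
        raise ValueError("need at least one non-empty hypothesis")

    numbered = "\n".join(f"{i}. {h}" for i, h in enumerate(unique, start=1))
    # The wording below is an assumption made for illustration only.
    return (
        f"Below are {len(unique)} candidate {target_language} translations "
        "of the same source input, ranked best first.\n"
        f"{numbered}\n"
        "Write one improved translation that combines their strengths."
    )


if __name__ == "__main__":
    nbest = ["Hello world", "Hello, world!", "Hello world"]
    print(select_top1(nbest))
    print(build_nbest_prompt(nbest, "English"))
```

The resulting prompt string would then be sent to whichever LLM is being finetuned or queried; that call is deliberately left out because it depends on the model and library in use.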

Original language	English
Publisher	ArXiv
DOIs	https://doi.org/10.48550/arXiv.2402.06894
Publication status	Published - 10 Feb 2024

Bibliographical note

17 pages. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate

Keywords

cs.CL
cs.AI
cs.LG
cs.SD
eess.AS

Access to Document

10.48550/arXiv.2402.06894 (Licence: Unspecified)

2402.06894v1 (Submitted manuscript, 858 KB)

Cite this

@techreport{bf78f5e88d4943df8fd1d615c4d88caf,

title = "GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators",

abstract = "Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely {"}GenTranslate{"}, which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.",

keywords = "cs.CL, cs.AI, cs.LG, cs.SD, eess.AS",

author = "Yuchen Hu and Chen Chen and Yang, {Chao-Han Huck} and Ruizhe Li and Dong Zhang and Zhehuai Chen and Chng, {Eng Siong}",

note = "17 pages. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate",

year = "2024",

month = feb,

day = "10",

doi = "10.48550/arXiv.2402.06894",

language = "English",

publisher = "ArXiv",

type = "WorkingPaper",

institution = "ArXiv",

}

TY - UNPB

T1 - GenTranslate

T2 - Large Language Models are Generative Multilingual Speech and Machine Translators

AU - Hu, Yuchen

AU - Chen, Chen

AU - Yang, Chao-Han Huck

AU - Li, Ruizhe

AU - Zhang, Dong

AU - Chen, Zhehuai

AU - Chng, Eng Siong

N1 - 17 pages. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate

PY - 2024/2/10

Y1 - 2024/2/10

N2 - Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.

AB - Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.

KW - cs.CL

KW - cs.AI

KW - cs.LG

KW - cs.SD

KW - eess.AS

U2 - 10.48550/arXiv.2402.06894

DO - 10.48550/arXiv.2402.06894

M3 - Preprint

BT - GenTranslate

PB - ArXiv

ER -
