ENACT Multimodal adaptation of end-to-end speech recognition systems.

PhD thesis offer

Application deadline

15-04-2025

Contract start date

01-09-2025

Thesis supervisor

ILLINA Irina

Supervision

Co-supervision, split 50/50.

Contract type

Funding from a French public institution

Doctoral school

IAEM - INFORMATIQUE - AUTOMATIQUE - ELECTRONIQUE - ELECTROTECHNIQUE - MATHEMATIQUES

Team

MULTISPEECH

Context

See the project description.

Specialty

Computer science

Laboratory

LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications

Keywords

Natural language processing, Speech recognition, Large language models, Domain adaptation, Multimodality, End-to-end systems

Subject details

Motivations and context:

End-to-end (E2E) automatic speech recognition (ASR) systems have gained popularity thanks to their simplified architecture and promising results. However, their accuracy can degrade significantly when the test conditions differ from the training conditions. Domain adaptation (DA) algorithms address this issue by adapting ASR models to a target domain whose content does not match the source domain on which they were trained. The new domain can be a new topic, a new language, a low-resource language, etc. Since E2E models tend to overfit the training data, their performance usually degrades significantly in a new domain, and collecting a large amount of paired speech-text data for a specific target domain remains difficult.

An E2E ASR model no longer has separate acoustic and language models (LM). Although the predictor has a structure similar to that of an LM, and an internal LM [Variani20, Meng21c] can be extracted from the predictor and the joint network, it does not work like an LM, because the predictor needs to coordinate closely with the acoustic encoder. It is therefore not straightforward to use text data alone to adapt the Transformer model from the source domain to the target domain. Effective and efficient DA of the LM remains an open research problem for E2E ASR models.

A widely adopted approach to DA is to fuse the E2E model with an external LM trained on text data from the new domain: shallow fusion [Gulcehre15], deep fusion [Gulcehre15], cold fusion [Sriram18], etc. Other approaches have been proposed, based on fine-tuning the internal LM [Chen22, Meng21b] or replacing it with a target-domain LM [Deng23].

With the advancement of text-to-speech (TTS) technologies, a new direction is to adapt E2E models with synthesized speech generated from new-domain text [Zheng21]. However, TTS speech differs from real speech, which sometimes degrades recognition accuracy on real speech [Li19], and this approach requires synthesizing audio files from text. Since the amount of text-only data is much larger than that of paired speech-text data, generating TTS audio from text-only data on a large scale is very costly.

In E2E ASR models, it is very difficult to continually add new, unseen words during training. Although some work on subword-to-word embeddings exists [Settle19] [Collobert20], this problem is far from solved.

How to leverage text-only data to improve the accuracy of E2E ASR models is an interesting direction to explore.

Thesis objectives:

In this thesis, we are interested in the DA of E2E Transformer-based ASR systems using text-only data. To this end, we propose a multimodal approach that incorporates speech and text simultaneously [Oneata23].

The adaptation of large LMs (LLMs) to multimodal tasks has recently become a research hotspot: MiniGPT-4 [Zhu23], which directly feeds visual features into the LLM; LLaMA-Adapter [Zhang23a], which uses fixed-length trainable vectors as layer-wise prompts; and SpeechGPT [Zhang23b], a multimodal LLM that integrates speech and text.

In this thesis, we propose to develop a deep multimodal LLM/ASR fusion, which can be seen as adapting the LM to the speech modality. The problem of new or rare words, not seen during training, will also be taken into account. Fusion will be studied at both the decoder level and the embedding level.
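
To make the external-LM fusion mentioned in the motivations more concrete, here is a minimal Python sketch of shallow fusion, in which the ASR log-probability of a hypothesis is combined log-linearly with the log-probability given by a target-domain LM. It is an illustration under stated assumptions, not the method of the thesis: the function names, the toy unigram LM, the example scores and the weight of 0.3 are all hypothetical, and the fusion is applied here to complete hypotheses (n-best rescoring) for simplicity, whereas it is usually applied at each step of beam search.

```python
import math


def make_unigram_lm(word_probs, unk_prob=1e-4):
    """Build a toy target-domain LM (hypothetical): returns log p_LM(text)."""
    def logprob(text):
        # Unknown words receive a small fixed probability.
        return sum(math.log(word_probs.get(w, unk_prob)) for w in text.split())
    return logprob


def shallow_fusion_score(asr_logprob, lm_logprob, lm_weight):
    """Log-linear interpolation: log p_ASR(y|x) + lm_weight * log p_LM(y)."""
    return asr_logprob + lm_weight * lm_logprob


def rescore(hypotheses, lm_logprob_fn, lm_weight=0.3):
    """Return the hypothesis text with the highest fused score.

    hypotheses: list of (text, asr_logprob) pairs from the source-domain ASR.
    """
    best_text, _ = max(
        ((text, shallow_fusion_score(lp, lm_logprob_fn(text), lm_weight))
         for text, lp in hypotheses),
        key=lambda pair: pair[1],
    )
    return best_text


# Illustrative (made-up) values: an LM estimated from target-domain text only.
lm = make_unigram_lm(
    {"the": 0.1, "patient": 0.1, "has": 0.1, "a": 0.1, "fever": 0.05}
)
hyps = [
    ("the patient has a fevor", -4.0),  # slightly preferred by the ASR alone
    ("the patient has a fever", -4.2),
]

print(rescore(hyps, lm, lm_weight=0.0))
print(rescore(hyps, lm, lm_weight=0.3))
```

With the LM weight set to 0, the decision rests on the ASR score alone; with a positive weight, the LM trained on text-only target-domain data penalizes the out-of-domain spelling and the in-domain hypothesis is selected.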

Candidate profile

– MSc/MEng degree in speech/audio processing, computer vision, machine learning, or in a related field,
– ability to work independently as well as in a team,
– solid programming skills (Python, PyTorch), and deep learning knowledge,
– good level of written and spoken English.

References

[Baskar19] M. K. Baskar, S. Watanabe, R. Astudillo, T. Hori, L. Burget, and J. Cernocky, “Semi-supervised sequence-to-sequence ASR using unpaired speech and text,” in Proc. Interspeech, 2019, pp. 3790–3794.
[Chen22] X. Chen, Z. Meng, S. Parthasarathy, and J. Li, “Factorized neural transducer for efficient language model adaptation,” in Proc. ICASSP. IEEE, 2022, pp. 8132–8136.
[Collobert20] R. Collobert, A. Hannun, and G. Synnaeve, “Word-level speech recognition with a letter to word encoder,” in International Conference on Machine Learning. PMLR, 2020, pp. 2100–2110.
[Deng23] K. Deng and P. C. Woodland, “Adaptable end-to-end ASR models using replaceable internal LMs and residual softmax,” in Proc. ICASSP. IEEE, 2023, pp. 1–5.
[Jain20] M. Jain, G. Keren, J. Mahadeokar, G. Zweig, F. Metze, and Y. Saraf, “Contextual RNN-T for open domain ASR,” in Proc. Interspeech, 2020, pp. 11–15.
[Gulcehre15] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,” arXiv preprint arXiv:1503.03535, 2015.
[Li19] B. Li, T. N. Sainath, R. Pang, and Z. Wu, “Semi-supervised training for end-to-end models via weak distillation,” in Proc. ICASSP. IEEE, 2019, pp. 2837–2841.
[McDermott19] E. McDermott, H. Sak, and E. Variani, “A density ratio approach to language model fusion in end-to-end automatic speech recognition,” in Proc. ASRU. IEEE, 2019, pp. 434–441.
[Meng21a] Z. Meng, Y. Gaur, N. Kanda, J. Li, X. Chen, Y. Wu, and Y. Gong, “Internal language model adaptation with text-only data for end-to-end speech recognition,” arXiv preprint arXiv:2110.05354, 2021.
[Meng21b] Z. Meng, Y. Wu, N. Kanda, L. Lu, X. Chen, G. Ye, E. Sun, J. Li, and Y. Gong, “Minimum word error rate training with language model fusion for end-to-end speech recognition,” in Proc. Interspeech, 2021, pp. 2596–2600.
[Meng21c] Z. Meng, S. Parthasarathy, et al., “Internal language model estimation for domain-adaptive end-to-end speech recognition,” in Proc. SLT. IEEE, 2021.
[Oneata23] D. Oneata and H. Cucu, “Improving multimodal speech recognition by data augmentation and speech representations,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022.
[Renduchintala18] A. Renduchintala, S. Ding, M. Wiesner, and S. Watanabe, “Multimodal data augmentation for end-to-end ASR,” in Proc. Interspeech, 2018, pp. 2394–2398.
[Settle19] S. Settle, K. Audhkhasi, K. Livescu, and M. Picheny, “Acoustically grounded word embeddings for improved acoustics-to-word speech recognition,” in Proc. ICASSP. IEEE, 2019, pp. 5641–5645.
[Sriram18] A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold fusion: Training seq2seq models together with language models,” in Proc. Interspeech, 2018, pp. 387–391.
[Tjandra17] A. Tjandra, S. Sakti, and S. Nakamura, “Listening while speaking: Speech chain by deep learning,” in Proc. ASRU. IEEE, 2017, pp. 301–308.
[Variani20] E. Variani, D. Rybach, C. Allauzen, and M. Riley, “Hybrid autoregressive transducer (HAT),” in Proc. ICASSP, 2020.
[Zhang23a] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and Y. Qiao, “LLaMA-adapter: Efficient fine-tuning of language models with zero-init attention,” arXiv preprint arXiv:2303.16199, 2023.
[Zhang23b] D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,” in Findings of the ACL: EMNLP, 2023.
[Zheng21] X. Zheng, Y. Liu, D. Gunceler, and D. Willett, “Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems,” in Proc. ICASSP. IEEE, 2021, pp. 5674–5678.
[Zhu23] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023.