PhD offer

Generic language- and speaker-independent articulatory model of the vocal tract

Application deadline

28-04-2025

Contract start date

01-10-2025

Thesis supervisor

LAPRIE Yves

Supervision

A follow-up meeting will be held every week, and each of the two teams runs a weekly scientific seminar. The PhD student will also have the opportunity to attend one or two summer schools, as well as conferences on MRI and on automatic speech processing, and will receive support in writing conference and journal articles.

Type of contract

Competitive selection for a doctoral contract

Doctoral school

IAEM - INFORMATIQUE - AUTOMATIQUE - ELECTRONIQUE - ELECTROTECHNIQUE - MATHEMATIQUES

Team

MULTISPEECH

Context

The work will make use of real-time MRI databases [3], which provide images of the evolution of the geometric shape of the vocal tract in the midsagittal plane at a frequency of 50 Hz. This frequency is sufficient to capture articulator gestures during speech production. We have data for around twenty speakers in several languages covering different places of articulation. The task will be to build, from these data, a dynamic model of the vocal tract that can be adapted to a specific language and speaker.

Specialty

Computer science

Laboratory

LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications

Keywords

Artificial intelligence, automatic speech processing, real-time MRI, speech synthesis

Offer details

The proposed PhD project aims to improve multilingual speech synthesis by taking into account the temporal dynamics of the vocal tract. Existing methods use static representations of phonemes, which do not capture the anticipation and coarticulation phenomena essential to natural speech. The objective of the project is to build a dynamic model of the vocal tract that can be adapted to any language and any speaker. The work relies on real-time MRI data, which show the evolution of the vocal tract at a frequency of 50 Hz.
The project comprises three stages: 1) anatomical registration of the MRI data, to align them in a single anatomical frame of reference; 2) construction of a generic articulatory model that merges the dynamics of the different languages and speakers; 3) adaptation of this model to a language not present in the initial database. A sketch of the registration stage follows below.
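As an illustration of stage 1, the sketch below estimates a similarity (Procrustes) transform from matched anatomical landmarks and uses it to map one speaker's midsagittal coordinates into a reference anatomical frame. This is a minimal sketch, not the project's actual pipeline: the landmark choices, array shapes, and function names are illustrative assumptions.

```python
# Minimal landmark-based registration sketch (illustrative assumptions,
# not the project's pipeline): estimate the similarity transform that
# best maps one speaker's anatomical landmarks onto a reference
# speaker's, then apply it to any midsagittal contour.
import numpy as np

def estimate_similarity_transform(src: np.ndarray, dst: np.ndarray):
    """Scale s, rotation R, translation t minimizing ||dst - (s R src + t)||.

    src, dst: (N, 2) arrays of matched 2D landmarks (e.g., points on
    the hard palate, which is rigid and visible in MRI). Umeyama method.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    # SVD of the cross-covariance gives the optimal rotation.
    U, S, Vt = np.linalg.svd(dst_c.T @ src_c)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (src_c ** 2).sum()
    t = mu_d - s * (R @ mu_s)
    return s, R, t

def apply_transform(points: np.ndarray, s, R, t):
    """Map (N, 2) points into the reference anatomical frame."""
    return s * points @ R.T + t

# Toy check with three matched landmarks related by a known transform.
ref = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 8.0]])
rot90 = np.array([[0.0, -1.0], [1.0, 0.0]])
spk = 1.2 * ref @ rot90.T + np.array([3.0, 1.0])
s, R, t = estimate_similarity_transform(spk, ref)
assert np.allclose(apply_transform(spk, s, R, t), ref)
```

Estimating the transform from explicitly identified anatomical points, rather than from image intensities alone, keeps the link between the transformation and the articulators concerned, which is the preference stated in the subject details below.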


This PhD offer is provided by the ENACT AI cluster and its partners. Find all ENACT PhD offers and actions at https://cluster-ia-enact.ai/.

Subject details

CONTEXT

Current methods for multilingual acoustic speech synthesis [1] rely on static phoneme representations built from phonological databases. Although these allow the phonemes of all languages to be “immersed” in a single space, so that acoustic databases can be merged to synthesize speech for low-resourced languages, they do not capture the temporal dynamics of the vocal tract corresponding to the anticipation and coarticulation phenomena of natural speech. Anticipation and coarticulation [2] are essential for the realization of phonetic contrasts. Moreover, articulatory gestures depend on individual anatomy (the shape of the hard palate, for instance) and require millimetric precision to guarantee the expected acoustic properties.

OBJECTIVE

This project aims to synthesize the temporal evolution of the vocal tract for any language and any speaker. It falls within the field of articulatory synthesis, which seeks to model and simulate the physical process of human speech production. The work will make use of real-time MRI databases [3] (see the references below), which provide images of the evolution of the geometric shape of the vocal tract in the midsagittal plane at a frequency of 50 Hz. This frequency is sufficient to capture articulator gestures during speech production. The task will be to build, from these data, a dynamic model of the vocal tract that can be adapted to a specific language and speaker.

WORK

The work will involve three stages: (i) anatomical registration of the real-time MRI data, with the aim of representing all gestures in a single anatomical frame of reference; (ii) construction of a generic articulatory model merging the dynamics of the languages and speakers in the database used; (iii) adaptation of the generic model to a language not included in the original database.

The first step, anatomical registration, relies on the search for anatomical points that are visible and robustly identifiable in the MRI images. Of the numerous registration techniques available, we prefer those that explicitly identify anatomical points, so that an anatomical transformation can be linked to the articulators concerned.

The second step is to develop a generic dynamic model capable of taking all places of articulation into account. The model we built previously [4] used discrete phonetic labels, which limits it to languages whose places of articulation correspond exactly to the phonemes of the database language. To obtain a generic model, we need to move to a continuous coding covering the entire vocal tract (see the sketch at the end of this section).

The third step will be to adapt the generic model to a specific language, described by its places of articulation, and to a specific speaker, described by anatomical points. The resulting model can be used in conjunction with multilingual acoustic synthesis, or as input for acoustic simulations.

ENVIRONMENT

Our two teams have been working closely together for several years on deep learning for modeling articulatory gestures, making extensive use of dynamic MRI data, and we are among the leading teams in the use of real-time MRI for automatic speech processing. The PhD student will have access to the large databases already acquired; it will also be possible to acquire complementary data using the MRI system available in the IADI laboratory. The PhD student will also have the opportunity to attend one or two summer schools and conferences on MRI and automatic speech processing.
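To make the continuous coding mentioned in step (ii) concrete, here is a minimal sketch under assumed conventions: each phone is described by a constriction location along the normalized vocal tract midline and a constriction degree, rather than by a one-hot phonetic label. The phone set, coordinate values, and names below are hypothetical, not measured data.

```python
# Illustrative continuous coding of phones (hypothetical values): a
# constriction location on the normalized midline (0 = glottis,
# 1 = lips) and a constriction degree, instead of one-hot labels.
import numpy as np

PHONE_CODES = {
    "p": (1.00, 1.0),  # bilabial stop: full closure at the lips
    "t": (0.85, 1.0),  # alveolar stop: full closure
    "k": (0.60, 1.0),  # velar stop: full closure
    "s": (0.85, 0.7),  # alveolar fricative: narrow constriction
    "a": (0.45, 0.2),  # open vowel: wide tract
}

def constriction_profile(phone: str, n_points: int = 64, width: float = 0.05):
    """Continuous activation over the tract midline for one phone.

    Returns an (n_points,) profile: a Gaussian bump centred on the
    constriction location, scaled by the constriction degree.
    """
    place, degree = PHONE_CODES[phone]
    x = np.linspace(0.0, 1.0, n_points)  # glottis -> lips
    return degree * np.exp(-0.5 * ((x - place) / width) ** 2)

# One profile per frame (interpolated between successive phones) could
# replace the discrete labels used as input to the dynamic model.
profile = constriction_profile("t")
```

Because the places of articulation of any language map into this same continuous space, a model trained on such profiles is not tied to the phoneme inventory of the training languages.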

Candidate profile

Master's degree in computer science or applied mathematics
The applicant should have a solid background in deep learning, applied mathematics, and computer science. Knowledge of speech and MRI processing will also be appreciated.

References

[1] Do, P., Coler, M., Dijkstra, J. and Klabbers, E. 2023. Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection. 12th ISCA Speech Synthesis Workshop (SSW2023) (2023), 21–26.
[2] Farnetani, E. and Recasens, D. 2010. Coarticulation and Connected Speech Processes. The Handbook of Phonetic Sciences: Second Edition. 316–352.
[3] Isaieva, K., Laprie, Y., Leclère, J., Douros, I., Felblinger, J. and Vuissoz, P.-A. 2021. Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers. Scientific Data. 8, (2021).
[4] Ribeiro, V., Isaieva, K., Leclere, J., Vuissoz, P.-A. and Laprie, Y. 2022. Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated. Speech Communication. 141, (Apr. 2022), 1–13.