PhD offer
Synthetic data for training multi-modal Vision-Language-Action models for generalist robots
Application deadline
10-04-2025
Contract start date
01-10-2025
Thesis supervisor
MOURET Jean-Baptiste
Supervision
Enrico Mingo-Hoffman, Inria researcher
Contract type
Doctoral school
Team
LARSEN
Context
This PhD offer is provided by the ENACT AI Cluster and its partners. Find all ENACT PhD offers and actions at https://cluster-ia-enact.ai/

# Research team

The HUCEBOT team is a new team of the Inria center at the University of Lorraine. The team is dedicated to advancing algorithms for human-centered robots: robots that do not work autonomously in isolation, but instead react, interact, collaborate, and assist humans. These robots need to intertwine multi-contact whole-body control, digital simulation of the interacting humans, and machine learning models to predict and respond to human instructions, movements, and intentions. The team works on scenarios that involve humanoid robots, cobots, and exoskeletons; the application domains span from industrial robotics to space teleoperation. The main robots of the team are the Tiago++ bimanual mobile manipulator, the Unitree G1 humanoid, and the Talos humanoid; the team also works with Franka cobots and custom robots. The team comprises approximately 25 members, including permanent researchers, postdoctoral fellows, PhD students, and engineers.

# Research environment: LORIA and the Inria center at the University of Lorraine

The PhD will be conducted at the LORIA laboratory (https://www.loria.fr/fr/), a joint research unit (UMR 7503) of CNRS, Université de Lorraine, CentraleSupélec, and Inria. Established in 1997, LORIA conducts basic and applied research in computer science and is part of the AM2I scientific cluster of the Université de Lorraine. The scientific work is conducted within 28 teams organized into five departments; thirteen of these teams are shared with Inria, for a total of over 500 people. LORIA is one of the largest laboratories in the Grand Est region. The HUCEBOT team also belongs to the Inria center at the University of Lorraine (https://www.inria.fr/fr/centre-inria-universite-lorraine).
Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions, and 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects with a worldwide impact. The Inria center at the University of Lorraine hosts 17 research teams, all of them shared with the University of Lorraine and the CNRS. Inria, LORIA, and their partners invested in the Creativ'Lab platform (https://creativlab.loria.fr), which provides the experimental facilities needed for robotics and embodied AI research.

Defence and security: this position is situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter the area is granted by the director of the unit, following a positive Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.

Speciality
Computer Science
Laboratory
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Keywords
Robotics, Language Models, Evolutionary Computation
Offer details
The objective of this PhD is to contribute to the next generation of Vision-Language-Action (VLA) models, a recent family of multi-modal large language models that make it possible to give verbal instructions to robots. They intertwine VLMs (Vision-Language Models), LLMs (Large Language Models), and imitation learning (often based on diffusion) to mix language, vision, actions, and sometimes force and touch sensing. Ultimately, these multi-modal models are expected to enable generalist robots that can be instructed to perform virtually any task in any environment, from factories to homes. For instance, a user could verbally ask a robot to "make breakfast with some eggs and coffee and bring it to me", or ask the same robot with the same policy to "put all these items in a box, close the package, and send it to a client".
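The interface of such a model can be sketched as follows (a minimal illustration of the idea, not the OpenVLA or Pi0 API; the class and method names are hypothetical): the policy maps a camera image and a language instruction to a short chunk of robot actions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class VLAPolicy:
    """Stand-in for a trained vision-language-action model."""

    action_dim: int = 7   # e.g., a 6-DoF end-effector delta plus gripper
    chunk_len: int = 8    # number of actions predicted per query

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real VLA runs a VLM backbone and an action decoder here;
        # we return zeros just to show the interface.
        return np.zeros((self.chunk_len, self.action_dim))


policy = VLAPolicy()
image = np.zeros((224, 224, 3), dtype=np.uint8)  # camera frame
actions = policy.predict(image, "put the eggs in the pan")
# Each row of `actions` would be sent to the robot controller in turn.
```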
The starting point of the PhD will be the recently published OpenVLA [1] and Pi0 [2,3].
The main challenge in the development of these models is the lack of data: for now, robotics does not have access to "internet-scale" data sources, contrary to LLMs, which limits the scope and generalization abilities of current models. The main current efforts focus on acquiring large datasets of teleoperation data, either through large collaborative efforts [4] or through private companies hiring dedicated teleoperators (e.g., Google DeepMind [5] or Physical Intelligence). These efforts require very large investments and are specialized to a few tasks and robots.
In this PhD, we will explore synthetic data as a more scalable alternative, replacing at least part of the real data with data generated in simulation. The dataset that is needed associates language, 3D images, forces, and motions/actions. Modern simulators [6] can generate photorealistic images, but they need scenarios/scenes that fit the capabilities of the robot. We propose to address this challenge with Quality Diversity algorithms like MAP-Elites [7,8], a family of optimization algorithms that search for a large set of high-quality solutions to an optimization problem; we introduced the first algorithms of this kind in 2015 [7]. These algorithms excel at generating, in an unsupervised way, all the "interesting" behaviors that a robot can perform. They have previously been used successfully to generate scenarios for human-robot interaction [9] and programming problems [10]. To generate the textual descriptions, we plan to experiment with video description foundation models [11].
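The core MAP-Elites loop can be sketched in a few lines (a toy example, not the scenario-generation pipeline of the PhD: the solution encoding, fitness, and behavior descriptor below are placeholders). The algorithm keeps an archive holding the best solution found so far for each cell of a discretized behavior space, and repeatedly mutates random elites from the archive:

```python
import random

GRID = 10          # archive resolution per descriptor dimension
ITERATIONS = 5000  # evaluation budget


def fitness(x):
    # Placeholder quality measure (would be, e.g., scenario feasibility).
    return -(x[0] - 0.5) ** 2 - (x[1] - 0.5) ** 2


def descriptor(x):
    # Map the solution to a discrete archive cell (its "behavior").
    return (min(int(x[0] * GRID), GRID - 1),
            min(int(x[1] * GRID), GRID - 1))


def map_elites():
    archive = {}  # cell -> (fitness, solution)
    for _ in range(ITERATIONS):
        if archive:
            # Select a random elite and mutate it.
            _, parent = random.choice(list(archive.values()))
            child = [min(max(v + random.gauss(0, 0.1), 0.0), 1.0)
                     for v in parent]
        else:
            child = [random.random(), random.random()]
        cell, f = descriptor(child), fitness(child)
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, child)  # keep the best solution per cell
    return archive
```

The result is not a single optimum but a map of diverse, locally optimal solutions, which is what makes the approach attractive for covering the space of scenarios a robot could face.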
The second challenge of these models is that even the best dataset will never encompass all the possible scenarios a robot might encounter. Therefore, it is crucial to combine VLA models with anomaly/out-of-distribution detection algorithms that can assess the reliability of the policy given its training set. We will tackle this challenge by employing flow matching models [12], which can model the distribution of the training set, and by developing new algorithms compatible with VLAs.
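The flow matching training objective can be illustrated with a toy NumPy sketch (a deliberately small linear model; [12] uses neural networks, and all names here are illustrative): a velocity model v(x_t, t) is regressed toward the velocity of a linear probability path from a noise sample x0 to a data sample x1. Once trained, such a model captures the training distribution and can serve as a basis for scoring how typical a new sample is.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 2
W = np.zeros((dim, dim + 1))  # linear velocity model: v = W @ [x_t, t]


def flow_matching_step(x1, lr=0.05):
    """One gradient step on a batch x1 of training data; returns the loss."""
    global W
    n = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.random((n, 1))              # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1          # point on the linear path
    target = x1 - x0                    # velocity of that path
    inp = np.hstack([xt, t])            # model input [x_t, t]
    pred = inp @ W.T
    err = pred - target
    W -= lr * (err.T @ inp) / n         # gradient step on the MSE loss
    return float((err ** 2).mean())
```

Training repeatedly calls `flow_matching_step` on batches of data; the same regression objective carries over to the deep models used in practice.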
Candidate profile
The ideal applicant loves robots and experimenting with real hardware, and has in-depth knowledge of and previous projects in at least one of the following topics:
- robot learning
- machine learning
- large language models
- visual language models
- evolutionary computation (especially Quality Diversity)
Technical skills:
- Proficiency in Python (PyTorch) (required)
- Docker/Singularity (not required)
- ROS (not required)
Languages: the official language of the team is English. French is not required (French classes are offered by Inria to PhD students if they are interested).
References
[1] Kim, Moo Jin, et al. 'OpenVLA: An open-source vision-language-action model.' arXiv preprint arXiv:2406.09246 (2024) - https://arxiv.org/pdf/2406.09246
[2] Black, Kevin, et al. 'π0: A Vision-Language-Action Flow Model for General Robot Control.' arXiv preprint arXiv:2410.24164 (2024). https://arxiv.org/pdf/2410.24164 - https://www.physicalintelligence.company/blog/pi0
[3] Pertsch, Karl, et al. 'FAST: Efficient Action Tokenization for Vision-Language-Action Models.' arXiv preprint arXiv:2501.09747 (2025). https://arxiv.org/pdf/2501.09747
[4] O'Neill, Abby, et al. 'Open X-Embodiment: Robotic Learning Datasets and RT-X Models.' 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024. https://robotics-transformer-x.github.io
[5] Gemini Robotics - https://storage.googleapis.com/deepmind-media/gemini-robotics/gemini_robotics_report.pdf
[6] https://developer.nvidia.com/isaac/sim
[7] Cully A, Clune J, Tarapore D, Mouret JB. Robots that can adapt like animals. Nature. 2015 May 28;521(7553):503-7. https://arxiv.org/pdf/1407.3501
[8] Chatzilygeroudis, Konstantinos, et al. 'Quality-diversity optimization: a novel branch of stochastic optimization.' Black Box Optimization, Machine Learning, and No-Free Lunch Theorems. Cham: Springer International Publishing, 2021. 109-135. https://arxiv.org/pdf/2012.04322
[9] Fontaine, Matthew, and Stefanos Nikolaidis. 'A quality diversity approach to automatically generating human-robot interaction scenarios in shared autonomy.' Proc. of Robotics: Science and Systems. arXiv preprint arXiv:2012.04283 (2020). https://arxiv.org/pdf/2012.04283
[10] Pourcel, Julien, et al. 'ACES: Generating a Diversity of Challenging Programming Puzzles with Autotelic Generative Models.' Advances in Neural Information Processing Systems 37 (2024): 67627-67662. https://proceedings.neurips.cc/paper_files/paper/2024/file/7d0c6ff18f16797b92e77d7cc95b3c53-Paper-Conference.pdf
[11] Yang, Antoine, et al. 'Vid2seq: Large-scale pretraining of a visual language model for dense video captioning.' Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. http://openaccess.thecvf.com/content/CVPR2023/papers/Yang_Vid2Seq_Large-Scale_Pretraining_of_a_Visual_Language_Model_for_Dense_CVPR_2023_paper.pdf
[12] Rouxel, Quentin, et al. 'Flow matching imitation learning for multi-support manipulation.' 2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids). IEEE, 2024. https://hal.science/hal-04650144/