Papers


ECCV 2024

PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation

Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez

The PoseEmbroider framework combines multi-modal views of human poses (images, textual descriptions, 3D joint rotations) to derive a rich visual-, semantic- and 3D-aware representation, which can be reused off-the-shelf in various downstream tasks (e.g. any-to-any multi-modal retrieval, pose estimation, pose instruction generation). Its transformer core is trained in a contrastive fashion, and is shown to outperform the standard multi-modal alignment baseline.
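As a rough illustration of what training "in a contrastive fashion" can mean, here is a minimal sketch of a symmetric InfoNCE-style loss between two batches of paired embeddings. This is a generic textbook formulation, not the PoseEmbroider training objective; the function names, the temperature value and the toy 2D embeddings are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def symmetric_contrastive_loss(a, b, temperature=0.1):
    """Average of the a-to-b and b-to-a InfoNCE terms."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

# Toy (hypothetical) embeddings: row i of each list describes the same pose.
pose_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.2, 0.8]]

aligned = symmetric_contrastive_loss(pose_emb, text_emb)
shuffled = symmetric_contrastive_loss(pose_emb, text_emb[::-1])
print(aligned < shuffled)
```

The loss is lower when matching rows are paired than when the text embeddings are shuffled, which is the behavior a contrastive objective rewards.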

ICCV 2023

PoseFix: Correcting 3D Human Poses with Natural Language

Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez

We introduce the PoseFix dataset, which consists of over 6k triplets, each made of a pair of 3D human poses and a text modifier describing how the source pose needs to be modified to obtain the target pose. We further train a text-based pose editing model, which generates corrected 3D body poses given a query pose and a text modifier, and a correctional text generation model, which generates correctional instructions based on the differences between two body poses.

ECCV 2022, TPAMI 2024

PoseScript: 3D Human Poses from Natural Language

Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, Grégory Rogez

We collect a dataset, PoseScript, pairing 3D human poses from AMASS with descriptions, both written by human annotators and generated automatically by our proposed pipeline. We use PoseScript to train text-to-pose models, both for retrieval and generation. Pretraining on automatic data boosts performance by a factor of 2.

ICLR 2022

ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity

Ginger Delmas, Rafael Sampaio De Rezende, Gabriela Csurka, Diane Larlus

We take inspiration from image retrieval and cross-modal retrieval to tackle the task of composed image retrieval: we design two complementary modules, each focusing on one modality of the query. The Explicit Matching module assesses how potential targets fit the textual modifier while the Implicit Similarity module compares potential target images to the reference image, assisted by the text. We validate our approach on FashionIQ, Shoes and CIRR.
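The two-module idea can be sketched as a scoring function over candidate targets. The sketch below is only an illustration under strong assumptions: the sigmoid gate standing in for attention, the use of cosine similarity, the plain sum of the two scores, and all names and toy vectors are hypothetical and are not taken from the ARTEMIS implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def gate(text_vec):
    """Hypothetical text-conditioned weights: one sigmoid per feature dimension."""
    return [1.0 / (1.0 + math.exp(-x)) for x in text_vec]

def explicit_matching(text_vec, target_vec):
    """How well a candidate target fits the textual modifier."""
    return cosine(text_vec, target_vec)

def implicit_similarity(reference_vec, target_vec, text_vec):
    """Compare target and reference after re-weighting both with the text gate."""
    w = gate(text_vec)
    return cosine([wi * r for wi, r in zip(w, reference_vec)],
                  [wi * t for wi, t in zip(w, target_vec)])

def composed_score(reference_vec, text_vec, target_vec):
    # Assumption: the two module scores are simply added.
    return (explicit_matching(text_vec, target_vec)
            + implicit_similarity(reference_vec, target_vec, text_vec))

def rank_targets(reference_vec, text_vec, targets):
    """Return candidate names sorted from best to worst composed score."""
    return sorted(targets,
                  key=lambda name: composed_score(reference_vec, text_vec, targets[name]),
                  reverse=True)

# Toy (hypothetical) features for one query and three candidate targets.
reference = [1.0, 0.0, 1.0]
modifier = [0.0, 2.0, 0.0]
candidates = {
    "matches_both": [1.0, 1.0, 1.0],
    "matches_reference_only": [1.0, 0.0, 1.0],
    "matches_neither": [-1.0, 0.0, -1.0],
}
ranking = rank_targets(reference, modifier, candidates)
print(ranking)
```

With these toy vectors, the candidate that agrees with both the reference image and the text modifier is ranked first, ahead of the one that only resembles the reference.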

Thesis


Defended in May 2025

Linking Human Poses With Natural Language

Ginger Delmas

Directors: Francesc Moreno-Noguer & Philippe Weinzaepfel

Reviewers: Javier Romero & Siyu Tang

Human pose is key to multiple human-centric applications (art, sport, embodied AI...). Until recently, researchers mostly studied human pose in conjunction with images. The arrival of efficient language models fostered the incorporation of language into vision frameworks, and thereby powered multi-modal applications. This thesis is part of this trend. We aim to leverage Natural Language (NL) to develop human pose understanding in human-centric tasks. In contrast to prior work, we consider static 3D human poses, images and detailed NL texts all together. We further explore novel multi-modal applications that require a fine-grained understanding of the human pose.

First, to alleviate the lack of data, we introduce new datasets linking 3D human poses with NL texts. We notably investigate two settings: one where the text is a description of the target pose, and another where the text provides modification instructions to reach the target pose from a source pose. These datasets result from both (i) the collection of crowd-sourced annotations, and (ii) the automatic, rule-based generation of texts, which inserts classified pose measurements into template sentences.
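To make the rule-based generation step more concrete, here is a minimal sketch of the general idea: measure a quantity on the pose, classify it into a discrete category, and insert the category into a template sentence. The angle thresholds, category labels, template wording and joint coordinates below are all hypothetical and are not the ones used in the thesis pipeline.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the 3D points a, b and c."""
    ba = [x - y for x, y in zip(a, b)]
    bc = [x - y for x, y in zip(c, b)]
    dot = sum(x * y for x, y in zip(ba, bc))
    cos = dot / (math.dist(a, b) * math.dist(c, b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

# Hypothetical bins: (exclusive upper bound in degrees, category label).
ANGLE_BINS = [
    (75.0, "completely bent"),
    (105.0, "bent at a right angle"),
    (160.0, "slightly bent"),
]

def classify_angle(degrees):
    """Map a continuous angle to a discrete, human-readable category."""
    for upper, label in ANGLE_BINS:
        if degrees < upper:
            return label
    return "straight"

# Hypothetical template sentence with slots for the classified measurement.
TEMPLATE = "The {side} {joint} is {label}."

def describe_joint(side, joint, a, b, c):
    """Fill the template with the category of the angle measured at b."""
    return TEMPLATE.format(side=side, joint=joint,
                           label=classify_angle(joint_angle(a, b, c)))

# Toy coordinates (hip, knee, ankle) for a straight leg and a bent leg.
straight_leg = describe_joint("left", "knee", (0, 1, 0), (0, 0.5, 0), (0, 0, 0))
bent_leg = describe_joint("right", "knee", (0, 1, 0), (0, 0.5, 0), (0.5, 0.5, 0))
print(straight_leg)
print(bent_leg)
```

A full pipeline would presumably cover many more measurements (distances, relative positions, orientations) and vary the templates, but the measure, classify and fill structure stays the same.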

Next, we use these datasets to develop several cross-modal generation models for tasks such as text-driven pose synthesis, pose captioning, text-guided pose editing and generation of textual posture feedback. Finally, we connect 3D, text and images through a novel framework that combines them, so as to derive a versatile, multi-modal pose representation that can be leveraged for downstream tasks such as pose estimation or NL posture feedback from visual input.

In summary, we tackle multiple machine learning tasks that involve human pose understanding by connecting human pose and Natural Language.

Talks


Biography


2025
  • 🧳 Traveled around Asia for almost 6 months: India, Vietnam, Mongolia, China, Japan.
  • 🎓 Defended my PhD!
2024
  • 🚀 Internship at Amazon (Seattle, USA).
  • ✨ Outstanding reviewer awards (CVPR, ECCV).
  • ✨ Accepted TPAMI journal extension for PoseScript.
  • ✨ Presented at ECCV.
2023
  • ✨ Presented at ICCV, DLBCN.
  • 📚 Attended ICVSS.
2022
  • ✨ Presented at ICLR, ECCV, DLBCN.
2021
  • 📚 Started my PhD at IRI-UPC-CSIC in collaboration with NAVER LABS Europe, supervised by Francesc Moreno-Noguer, Philippe Weinzaepfel and Grégory Rogez.
2020
  • 🚀 Internship at NAVER LABS Europe, supervised by Diane Larlus and Rafael Sampaio De Rezende.
  • 🎓 Obtained the MVA master's degree from ENS Paris-Saclay and Institut Polytechnique de Paris.
  • 🎓 Graduated in Computer Science from Télécom Paris (French engineering school, equivalent to a master's degree).