PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation

Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez

ECCV 2024

Paper Code Video Poster

TL;DR


The PoseEmbroider builds on the idea that all modalities are only partial views of the same concept. Metaphorically, modalities are like 2D shadows of a 3D object. Each 2D shadow is informative in its own way, but is also only partial. Put together, they make it possible to reach a better understanding of the core concept. In this project, we train a model with three complementary, partial yet valuable modalities (3D poses, images and textual pose descriptions) so as to derive a generic 3D-, visual- and semantic-aware pose representation.

To do so, we use a subset of input modality embeddings obtained from pretrained, frozen modality-specific encoders. We feed these modality embeddings to a transformer, which outputs the generic pose representation through an added special token ("step 1"). This generic pose representation is then re-projected into uni-modal spaces, where it is compared to the original modality embeddings ("step 2"). Metaphorically, we try to approximate the 3D object in the picture from the available shadows (= step 1), then we assess the quality of the reconstruction by checking that the shadows are matching (= step 2).

We showcase the potential of our proposed representation on two downstream applications: human mesh recovery and the PoseFix corrective instruction task, where the goal is to generate a text explaining how to modify one pose into another.

Introduction


We want to derive a representation of the human pose that can be reused as is in tasks requiring human understanding, for instance pose instruction, pose estimation or pose generation. In this work, we propose a multi-modal model that takes texts, images and 3D poses as input and is designed to produce such a representation.

Related works can be classified in two categories.

One body of work seeks to align different modalities so that all object representations live in the same joint embedding space. This is usually achieved with contrastive objectives, and it enables cross-modal retrieval (as in CLIP [Radford et al., ICML 2021] or ImageBind [Girdhar et al., CVPR 2023]).

Another body of work learns to translate some modalities into others, using modality-specific projector heads on top of large language models, or by tokenizing all modalities and performing masked modeling (as in NExT-GPT [Wu et al., ICML 2024], ChatPose [Feng et al., CVPR 2024] or 4M [Mizrahi et al., NeurIPS 2023]).

In our case, we don’t want to learn to align or to translate modality features, but rather to enrich them. The main idea is that all modalities are only partial views of the same concept.

Metaphorically, modalities are like 2D shadows of a 3D object. Each 2D shadow is informative in its own way, but is also only partial. Put together, they make it possible to reach a better understanding of the core concept.

For instance, a 3D pose representation, that is, body joint rotations, informs about the kinematics and provides some spatial cues. But it lacks the reality anchoring that images provide, with cues such as blur, clothing, etc. At the same time, the person in an image can be occluded (you only see their back, or their upper body), so the 3D pose makes it possible to fill in the gaps. Finally, while images and 3D poses carry intrinsic semantics, the connection with human semantics is missing: one person could have their hand over their head or at shoulder height and be waving in both cases. For this reason, we also consider textual pose descriptions.

We hope that training a model with these three complementary, partial yet valuable modalities will help to derive a generic 3D-, visual- and semantic-aware pose representation. The goal is to obtain this representation from any single modality input at test time.

Data


Currently, there is no multi-modal dataset providing images, 3D pose annotations and text descriptions at the same time. So we use BEDLAM [Black et al., CVPR 2023], which is a synthetic dataset, and extract both human image crops and their corresponding 3D poses. Then we run the automatic captioning pipeline designed in PoseScript [Delmas et al., ECCV 2022] to produce textual pose descriptions.

We hence obtain a fully synthetic dataset in which all three modalities are simultaneously available, and use it to train the PoseEmbroider framework.
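For concreteness, here is a minimal PyTorch sketch (not the released data code) of how such triplets could be served for training; the record keys, file layout and tensor shapes are assumptions made for illustration.

```python
# A minimal sketch of a triplet dataset built from BEDLAM crops, their 3D poses
# and PoseScript-style captions; field names and shapes are assumptions.
import torch
from torch.utils.data import Dataset
from PIL import Image

class PoseTripletDataset(Dataset):
    """Yields (image crop, 3D pose, caption) triplets."""

    def __init__(self, records, image_transform):
        # `records` is assumed to be a list of dicts with keys
        # "crop_path", "pose" (e.g. per-joint rotations) and "caption".
        self.records = records
        self.image_transform = image_transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = self.image_transform(Image.open(rec["crop_path"]).convert("RGB"))
        pose = torch.as_tensor(rec["pose"], dtype=torch.float32)  # e.g. (num_joints, 3)
        return {"image": image, "pose": pose, "caption": rec["caption"]}
```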

Method


We start by encoding each modality with frozen pretrained uni-modal encoders. We add a small modality-specific encoding, akin to a positional encoding, to each of these representations, and feed them to a transformer along with a learnable token \(x\). The output at that token is our 3D-, visual- and semantic-aware generic pose representation (let's denote this process "step 1").
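Below is a minimal PyTorch sketch of this first step, assuming all frozen encoders output features of a common dimension \(d\); the class name, hyper-parameters and tensor layout are illustrative, not the paper's exact implementation.

```python
# Sketch of "step 1": fuse any subset of frozen modality embeddings through a
# transformer, and read the generic representation off a learnable token.
import torch
import torch.nn as nn

class PoseEmbroiderCore(nn.Module):
    def __init__(self, d=512, num_layers=4, num_heads=8, num_modalities=3):
        super().__init__()
        # Learnable token x whose output serves as the generic pose representation.
        self.query_token = nn.Parameter(torch.randn(1, 1, d))
        # One learnable "modality encoding" per modality (pose / image / text),
        # playing the role of a positional encoding.
        self.modality_emb = nn.Parameter(torch.randn(num_modalities, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, features):
        # `features`: dict mapping modality index -> (batch, d) frozen embedding,
        # containing any non-empty subset of the modalities.
        batch_size = next(iter(features.values())).size(0)
        tokens = [self.query_token.expand(batch_size, -1, -1)]
        for m, feat in features.items():
            tokens.append((feat + self.modality_emb[m]).unsqueeze(1))
        out = self.transformer(torch.cat(tokens, dim=1))
        return out[:, 0]  # output at the learnable token = generic representation
```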

We learn this representation by projecting it back into each uni-modal space and applying contrastive losses between the projections and the original modality representations ("step 2").
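The following sketch illustrates this second step under the same assumptions: one small head per modality re-projects the generic representation, and a symmetric InfoNCE-style contrastive loss compares each projection to the corresponding original embedding (the exact losses and head architectures in the paper may differ).

```python
# Sketch of "step 2": re-project the generic representation into each uni-modal
# space and contrast it with the original frozen embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE loss: matched pairs along the diagonal are positives.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

class ReProjectionHeads(nn.Module):
    def __init__(self, d=512, num_modalities=3):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d, d) for _ in range(num_modalities)])

    def forward(self, generic_rep, original_features):
        # `original_features`: dict modality index -> (batch, d) frozen embedding.
        loss = 0.0
        for m, feat in original_features.items():
            loss = loss + info_nce(self.heads[m](generic_rep), feat)
        return loss
```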

Metaphorically, we try to approximate the 3D object in the schema from the available shadows (= step 1), then we assess the quality of its reconstruction by checking that the projected and actual shadows are matching (= step 2).

Since we want the model to work with any kind of input, we train it with different subsets of input modalities (hence the "or \( \emptyset \)"), while applying all the losses at the end. Therefore, the model does not learn to compress input information, but to infer or guess missing information, thus enriching the input modality features.
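A possible training step under this scheme, reusing the assumed names from the sketches above: a random non-empty subset of modalities is fed to the model, while the contrastive losses are computed against all three original embeddings.

```python
# Sketch of one training iteration with random modality subsets.
import random

MODALITIES = [0, 1, 2]  # e.g. 0: 3D pose, 1: image, 2: text

def training_step(core, heads, frozen_embeddings, optimizer):
    # `frozen_embeddings`: dict modality index -> (batch, d) output of the frozen encoders.
    subset = random.sample(MODALITIES, k=random.randint(1, len(MODALITIES)))
    generic_rep = core({m: frozen_embeddings[m] for m in subset})   # step 1, partial input
    loss = heads(generic_rep, frozen_embeddings)                    # step 2, all modalities supervised
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```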

Direct application: multi-modal retrieval


As a direct result of this design, we can perform any-to-any cross-modal retrieval, and even combine multiple, complementary inputs to retrieve relevant elements: we feed two modalities as input to the model and project the obtained enriched representation into the space of the third modality, where retrieval is performed.
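As a rough sketch (again with the assumed names from above), combined-input retrieval amounts to encoding, say, the image and the text, projecting the enriched representation into the 3D-pose space, and ranking a gallery of pose embeddings by cosine similarity.

```python
# Sketch of combined-input retrieval: image + text query, 3D-pose gallery.
import torch.nn.functional as F

def retrieve_poses(core, heads, image_feat, text_feat, pose_gallery, pose_modality=0, k=5):
    generic_rep = core({1: image_feat, 2: text_feat})           # combine image + text inputs
    query = F.normalize(heads.heads[pose_modality](generic_rep), dim=-1)
    gallery = F.normalize(pose_gallery, dim=-1)                 # (num_poses, d) frozen pose embeddings
    scores = query @ gallery.t()
    return scores.topk(k, dim=-1).indices                       # indices of the top-k retrieved poses
```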

In the first example, the positions of the person's legs are originally switched.

We also use this setting to compare our proposed framework to the standard multi-modal alignment baseline mentioned earlier, where the outputs of the frozen modality-specific encoders are fed to trainable modality-specific heads that project into a common latent space thanks to contrastive training (similar to CLIP). We find that our PoseEmbroider outperforms this baseline on average, especially when it comes to combining inputs.

Downstream application: Human mesh recovery


We showcase the potential of our proposed representation on two downstream applications, the first of which is human mesh recovery [Kanazawa et al., CVPR 2018].

We use our pretrained pose representation to train task-specific heads.

Since our proposed representation can seamlessly be derived from one or several input modalities, the neural head can be trained using input images only (following classical datasets for this task), while an optional textual cue can be added at test time to improve the results.
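Here is a minimal sketch of what such a head could look like, assuming it regresses SMPL-style pose and shape parameters from the generic representation; the actual head architecture and output parameterization may differ.

```python
# Sketch of a mesh-recovery head trained on top of the frozen generic representation.
import torch.nn as nn

class MeshRegressionHead(nn.Module):
    def __init__(self, d=512, num_joints=24):
        super().__init__()
        # Predict e.g. per-joint axis-angle rotations plus 10 shape coefficients.
        self.mlp = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, num_joints * 3 + 10),
        )

    def forward(self, generic_rep):
        out = self.mlp(generic_rep)
        return out[:, :-10], out[:, -10:]  # (pose parameters, shape parameters)

# Training:  rep = core({1: image_feat})                 -> head(rep)
# Inference: rep = core({1: image_feat, 2: text_feat})   -> head(rep), no retraining needed
```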

Downstream application: PoseFix task


We next illustrate the value of our proposed representation on the PoseFix corrective instruction task [Delmas et al., ICCV 2023], where the goal is to generate a text explaining how to modify one pose into another, with application in automatic coaching.

We proceed as for the human mesh recovery task, training solely the neural head. While the head is trained with the PoseFix dataset, which contains only pairs of 3D poses and correction texts, our design enables inference from other input modalities, such as images, without further training.
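A sketch of this setup with assumed names: only the correction head below is trained, on generic representations derived from 3D poses, and the same head can later be fed image-derived representations.

```python
# Sketch of a PoseFix-style correction head on top of two generic representations.
# `decoder` stands in for whatever autoregressive text decoder is used (hypothetical).
import torch
import torch.nn as nn

class CorrectionHead(nn.Module):
    def __init__(self, d=512, decoder=None):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)  # fuse source and target representations
        self.decoder = decoder           # hypothetical text decoder conditioned on the fused context

    def forward(self, rep_a, rep_b, captions=None):
        ctx = self.fuse(torch.cat([rep_a, rep_b], dim=-1))
        return self.decoder(context=ctx, targets=captions)

# Training uses pose-only inputs: rep_a = core({0: pose_feat_a}), rep_b = core({0: pose_feat_b})
# Inference can swap in images:   rep_a = core({1: image_feat_a}), rep_b = core({1: image_feat_b})
```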

Final words


Limitation: our derived models for downstream tasks do not outperform task-expert models.

Future work: we could use multiple partial datasets for training, and consider more modalities (e.g., scene, motion...).

Acknowledgment


This work is supported by the Spanish government with the project MoHuCo PID2020-120049RB-I00, and by NAVER LABS Europe under technology transfer contract 'Text4Pose'.

BibTeX


@inproceedings{delmas2024poseembroider,
  title={{PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation}},
  author={Delmas, Ginger and Weinzaepfel, Philippe and Moreno-Noguer, Francesc and Rogez, Gr\'egory},
  booktitle={{ECCV}},
  year={2024}
}