Predicting probabilistic body pose from egocentric cameras
International Conference on 3D Vision 2024
Our work addresses the problem of egocentric human pose estimation from downwards-facing cameras on head-mounted devices (HMD). This presents a challenging scenario, as parts of the body often fall outside of the image or are occluded. Previous solutions minimize this problem by using fish-eye camera lenses to capture a wider view, but these can present hardware design issues. They also predict 2D heat-maps per joint and lift them to 3D space to deal with self-occlusions, but this requires large network architectures which are impractical to deploy on resource-constrained HMDs. We predict pose from images captured with conventional rectilinear camera lenses. This resolves hardware design issues, but means body parts are often out of frame. As such, we directly regress probabilistic joint rotations represented as matrix Fisher distributions for a parameterized body model. This allows us to quantify pose uncertainties and explain out-of-frame or occluded joints. This also removes the need to compute 2D heat-maps and allows for simplified DNN architectures which require less compute. Given the lack of egocentric datasets using rectilinear camera lenses, we introduce the SynthEgo dataset, a synthetic dataset with 60K stereo images containing high diversity of pose, shape, clothing and skin tone. Our approach achieves state-of-the-art results for this challenging configuration, reducing mean per-joint position error by 23% overall and 58% for the lower body. Our architecture also has eight times fewer parameters and runs twice as fast as the current state-of-the-art. Experiments show that training on our synthetic dataset leads to good generalization to real world images without fine-tuning.
To construct the SynthEgo dataset we render 60K stereo pairs at 1280×720 pixel resolution, building on the pipeline of Hewitt et al. The dataset comprises 6000 unique identities, each performing 10 different poses in 10 different lighting environments. Each identity is made up of a randomly sampled body shape, skin textures sampled from a library of 25 and randomly recolored, and clothing assets sampled from a library of 202. Lighting environments are sampled from a library of 489 HDRIs. To ensure correct disparity of the environment between the stereo pair, we project the HDRI background onto the ground plane. Poses are sampled from a library of over 2 million unique poses and randomly mirrored; sampling is weighted by the mean absolute joint angle, and common poses such as the T-pose are significantly down-weighted to increase diversity.
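The weighted pose sampling described above can be illustrated with a minimal sketch. The exact weighting function is not given in the text, so the sketch assumes the weight is simply proportional to the mean absolute joint angle; the pose library, its size, and the function name `pose_sampling_weights` are hypothetical.

```python
import numpy as np

def pose_sampling_weights(joint_angles: np.ndarray) -> np.ndarray:
    """Return normalized sampling weights for a pose library.

    joint_angles: (P, J) array of joint angles in radians for P poses.
    Assumption: weight is proportional to the mean absolute joint angle,
    so near-neutral poses (e.g. close to a T-pose) are rarely drawn.
    """
    weights = np.abs(joint_angles).mean(axis=1)
    return weights / weights.sum()

rng = np.random.default_rng(0)
# Hypothetical stand-in for the pose library: 1000 poses, 20 joint angles each.
library = rng.normal(scale=0.5, size=(1000, 20))
library[0] = 0.01  # a near-neutral pose that should be down-weighted

weights = pose_sampling_weights(library)
# Draw 10 distinct poses for one identity.
chosen = rng.choice(len(library), size=10, replace=False, p=weights)
```

Under this assumption the near-neutral pose at index 0 receives a much smaller weight than a typical pose in the library.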
| | Mo2Cap2 | xR-EgoPose | UnrealEgo | SynthEgo |
---|---|---|---|---|
Unique Identities | 700 | 46 | 17 | 6000 |
Environments | Unspecified | Unspecified | 14 | 489 |
Body Model | SMPL | Unspecified | UnrealEngine | SMPL-H |
Lens Type | Fisheye | Fisheye | Fisheye | Rectilinear |
Mono/Stereo | Mono | Mono | Stereo | Stereo |
Body Shape GT | | | | ✓ |
Joint Location GT | ✓ | ✓ | ✓ | ✓ |
Joint Rotation GT | | | | ✓ |
Realism | Low | Medium | High | High |
We position the camera on the front of the forehead, looking down at the body. The camera uses a pinhole model approximating the ZED mini stereo camera. To simulate misplacement and movement of the HMD on the head, we add uniform noise within ±1 cm to the camera location and within ±10° around each of its rotation axes. The resulting images are typically quite challenging for pose estimation, as many parts of the body are often not seen by the camera.
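As a rough illustration of this camera perturbation, the sketch below jitters a nominal camera pose. Several details are assumptions that the text does not specify: the noise is drawn independently per axis, the three rotations are composed in x, y, z order, and the perturbation is applied in the camera's local frame. The nominal pose values and the function names are hypothetical.

```python
import numpy as np

def axis_rotation(axis: int, angle_rad: float) -> np.ndarray:
    """Rotation matrix about the x (0), y (1) or z (2) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    i, j = (axis + 1) % 3, (axis + 2) % 3
    R = np.eye(3)
    R[i, i], R[i, j] = c, -s
    R[j, i], R[j, j] = s, c
    return R

def jitter_camera(position, rotation, rng, max_offset_m=0.01, max_angle_deg=10.0):
    """Perturb a camera pose with uniform noise (assumed independent per axis).

    position: (3,) camera location in metres.
    rotation: (3, 3) camera-to-world rotation matrix.
    """
    offset = rng.uniform(-max_offset_m, max_offset_m, size=3)
    angles = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg, size=3))
    # Assumed composition order: rotate about x, then y, then z.
    delta = (axis_rotation(2, angles[2])
             @ axis_rotation(1, angles[1])
             @ axis_rotation(0, angles[0]))
    return position + offset, rotation @ delta

rng = np.random.default_rng(0)
nominal_position = np.array([0.0, 0.10, 1.70])  # hypothetical forehead location
nominal_rotation = np.eye(3)                    # hypothetical nominal orientation
new_position, new_rotation = jitter_camera(nominal_position, nominal_rotation, rng)
```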
The goal of our method is to estimate the probability distribution over joint rotations $\mathbf{R} = \{\mathbf{R}_i\}^{N}_{i=1}$ conditioned on input image data $\mathbf{X}$, $p(\mathbf{R}|\mathbf{X})$. Following Sengupta et al., we train a neural network to regress Fisher parameters $\mathbf{F} = \{\mathbf{F}_i\}^{N}_{i=1}$ given input image data $\mathbf{X}$. From these predicted parameters we can calculate, for each joint $i$, the expected rotation $\mathbf{\hat{R}}_i$ and the concentration parameters $\kappa_{i,j}$ about each axis $j$. The latter describe the uncertainty of the rotation distribution.
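One common way to read a rotation estimate and per-axis concentrations out of a single 3×3 Fisher parameter is through its proper singular value decomposition. The sketch below follows that standard construction and is not taken from the paper's code: the rotation is computed as the mode of the distribution, and the concentration about each principal axis is the sum of the other two proper singular values. The function name and the example parameter are hypothetical.

```python
import numpy as np

def fisher_mode_and_concentration(F: np.ndarray):
    """Mode rotation and per-axis concentrations of a matrix Fisher parameter.

    F: (3, 3) parameter matrix of one joint.
    Returns (R_mode, kappa) with R_mode in SO(3) and kappa of shape (3,).
    """
    U, S, Vt = np.linalg.svd(F)
    # Proper SVD: flip the last axis where needed so both factors have det +1.
    det_u, det_v = np.linalg.det(U), np.linalg.det(Vt)
    U_p = U @ np.diag([1.0, 1.0, det_u])
    Vt_p = np.diag([1.0, 1.0, det_v]) @ Vt
    S_p = S * np.array([1.0, 1.0, det_u * det_v])
    R_mode = U_p @ Vt_p
    # Concentration about principal axis j: sum of the other two singular values.
    kappa = np.array([S_p[1] + S_p[2], S_p[0] + S_p[2], S_p[0] + S_p[1]])
    return R_mode, kappa

# Hypothetical example: a 30 degree rotation about z with anisotropic confidence.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
F_example = R_true @ np.diag([30.0, 20.0, 1.0])
R_mode, kappa = fisher_mode_and_concentration(F_example)
```

A small concentration value indicates that the distribution is spread out around that axis, so the rotation about it is less certain.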
We train the neural network by minimizing the loss $\mathcal{L} = \mathcal{L}_{FNLL} + \mathcal{L}_J$. $\mathcal{L}_{FNLL}$ is the matrix Fisher negative log-likelihood, which promotes accurate local joint rotations:
$$ \begin{aligned} \mathcal{L}_{FNLL}&=\sum_{i=1}^{N}\left[\log c(\mathbf{F}_i)-\text{tr}(\mathbf{F}_i^\top \mathbf{R}_i)\right] \end{aligned} $$
where $c(\mathbf{F}_i)$ is the normalizing constant of the distribution. $\mathcal{L}_J$ supervises the 3D joint positions obtained from the parametric body model, SMPL-H, with shape parameters $\boldsymbol\beta$, using the joint regressor $\mathcal{J}$:
$$ \begin{aligned} J_{3D}(\mathbf{R},\boldsymbol\beta)=&\mathcal{J}(\textit{SMPL-H}(\mathbf{R}, \boldsymbol\beta))\\ \mathcal{L}_{J}=&\left \| J_{3D}(\hat{\mathbf{R}},\boldsymbol\beta)- J_{3D}(\mathbf{R},\boldsymbol\beta) \right \|^2_2 \end{aligned} $$
This causes the network to consider the effect of the predicted rotations on the final pose, as the positions of child joints are influenced by the rotation of their parents in the kinematic tree of our body model.
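The coupling between parent rotations and child joint positions can be seen in a toy example. The sketch below replaces SMPL-H and its joint regressor with a hypothetical three-joint chain and plain forward kinematics, so it only illustrates the structure of $\mathcal{L}_J$ and is not the body model used in the paper. The joint names, offsets, and function names are assumptions.

```python
import numpy as np

def forward_kinematics(local_rotations, offsets, parents):
    """Joint positions of a kinematic tree from local joint rotations.

    local_rotations: (N, 3, 3) rotation of each joint relative to its parent.
    offsets: (N, 3) rest-pose offset of each joint from its parent.
    parents: parent index per joint, -1 for the root; parents precede children.
    """
    n = len(parents)
    global_rot = np.zeros((n, 3, 3))
    positions = np.zeros((n, 3))
    for i, p in enumerate(parents):
        if p < 0:
            global_rot[i] = local_rotations[i]
            positions[i] = offsets[i]
        else:
            global_rot[i] = global_rot[p] @ local_rotations[i]
            # A joint's position depends on the accumulated rotation of its ancestors.
            positions[i] = positions[p] + global_rot[p] @ offsets[i]
    return positions

def joint_position_loss(pred_rotations, gt_rotations, offsets, parents):
    """Squared L2 distance between predicted and ground-truth joint positions."""
    diff = (forward_kinematics(pred_rotations, offsets, parents)
            - forward_kinematics(gt_rotations, offsets, parents))
    return float(np.sum(diff ** 2))

# Hypothetical chain: hip -> knee -> ankle, each segment 0.4 m long.
parents = [-1, 0, 1]
offsets = np.array([[0.0, 0.0, 0.0], [0.0, -0.4, 0.0], [0.0, -0.4, 0.0]])
gt_rotations = np.stack([np.eye(3)] * 3)
pred_rotations = gt_rotations.copy()
# Only the hip rotation is wrong: 90 degrees about z.
pred_rotations[0] = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
loss = joint_position_loss(pred_rotations, gt_rotations, offsets, parents)
```

In this toy chain a single error at the hip displaces both the knee and the ankle, and the ankle contributes the larger share of the loss because it sits further down the chain.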
PA-MPJPE (mm) by body region:

Input | Method | Upper Body | Lower Body | Hands | All |
---|---|---|---|---|---|
Monocular | xR-EgoPose | 50.18 | 76.76 | 127.34 | 97.48 |
Monocular | Ours | 38.48 | 62.35 | 98.94 | 76.05 |
Stereo | UnrealEgo | 48.06 | 77.06 | 117.85 | 91.67 |
Stereo | Ours | 34.00 | 54.59 | 87.78 | 67.31 |
To evaluate performance on real-world data, we recorded a dataset of 8378 stereo image pairs from 11 different subjects performing actions such as squatting, sitting, stretching, crossing arms, and interacting with small objects. Overall, our stereo network has the best performance. We observe that the extra information provided by the right image helps the network to better predict extremities. We also note that UnrealEgo and xR-EgoPose perform particularly poorly for lower body joints. This may be caused by the fact that the legs are not always visible, and that 2D heat-maps cannot provide uncertainties for joints outside of the image frame.
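For readers unfamiliar with the PA-MPJPE metric reported in the results table, the sketch below shows the usual definition: the predicted joints are aligned to the ground truth with a similarity transform (Procrustes alignment) before the mean per-joint position error is taken. This is a generic NumPy implementation written for illustration, not the paper's evaluation code, and the test joints are random stand-ins.

```python
import numpy as np

def procrustes_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Align pred (J, 3) to gt (J, 3) with the best scale, rotation and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    # Force a proper rotation (no reflection).
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    scale = (S * np.array([1.0, 1.0, d])).sum() / (P ** 2).sum()
    return scale * P @ R + mu_g

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between corresponding joints."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)

rng = np.random.default_rng(0)
gt_joints = rng.normal(size=(22, 3))  # hypothetical ground-truth joints
theta = np.deg2rad(40.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
# A prediction that differs from the ground truth only by a similarity transform.
pred_joints = 2.0 * gt_joints @ Rz + np.array([0.5, -0.2, 1.0])
raw_error = mpjpe(pred_joints, gt_joints)
aligned_error = pa_mpjpe(pred_joints, gt_joints)
```

Because the alignment removes global scale, rotation and translation, the metric measures the articulated pose only; in the example the raw error is large while the aligned error is numerically zero.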
Qualitative results of our method compared to recent work for synthetic and real data.
Axis-specific concentration for different joints. Concentration is lowest around the primary axis of rotation for a given joint.
Our paper demonstrates that the predicted uncertainty estimates capture extra information and priors about body pose, and shows empirically that the estimated uncertainties are reliable. The former allows us to better explain the predictions of the model. The latter is important for deploying our method in downstream tasks such as avatar animation, where uncertainty estimates can be used as a measure of the reliability of the predicted poses.
Correlation of confidence with error: the higher the confidence, the lower the error. Our confidence estimates are therefore reliable for downstream use.
@inproceedings{cuevas2024simpleego,
  title={{SimpleEgo}: Predicting probabilistic body pose from egocentric cameras},
  author={Cuevas-Velasquez, Hanz and Hewitt, Charlie and Aliakbarian, Sadegh and Baltru{\v{s}}aitis, Tadas},
  booktitle={2024 International Conference on 3D Vision (3DV)},
  pages={1446--1455},
  year={2024},
  organization={IEEE}
}