Abstract
We present V-HPOT, a novel approach for improving the cross-domain performance of 3D hand pose estimation from egocentric images across diverse, unseen domains. State-of-the-art methods demonstrate strong performance when trained and tested within the same domain. However, they struggle to generalise to new environments because training data is limited and their depth perception overfits to specific camera intrinsics. Our method addresses this by estimating keypoint z-coordinates in a virtual camera space, normalised by focal length and image size, enabling camera-agnostic depth prediction. We further leverage this invariance to camera intrinsics to propose a self-supervised test-time optimisation strategy that refines the model's depth perception during inference. This is achieved by applying a 3D consistency loss between predicted and in-space scale-transformed hand poses, allowing the model to adapt to target domain characteristics without requiring ground truth annotations. V-HPOT significantly improves 3D hand pose estimation performance in cross-domain scenarios, achieving a 71% reduction in mean pose error on the H2O dataset and a 41% reduction on the AssemblyHands dataset. Compared to state-of-the-art methods, V-HPOT outperforms all single-stage approaches across all datasets and competes closely with two-stage methods, despite needing approximately 3.5 to 14 times less data.
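The virtual-camera idea can be illustrated with a minimal sketch. It assumes a pinhole camera and reads "normalised by focal length and image size" as rescaling depth by the focal length expressed as a fraction of the image size. That formula is an assumption made for illustration, not the paper's definition, and the function names and the `virtual_focal` parameter are hypothetical.

```python
import math


def to_virtual_depth(z, focal_px, image_size_px, virtual_focal=1.0):
    """Map a metric depth z to a hypothetical virtual camera.

    ASSUMPTION: the virtual camera is defined by a focal length given as a
    fraction of the image size (virtual_focal); this is illustrative only.
    """
    # Focal length as a fraction of image size (unitless).
    normalised_focal = focal_px / image_size_px
    # Rescale depth so that apparent object size is preserved in the virtual camera.
    return z * virtual_focal / normalised_focal


def from_virtual_depth(z_virtual, focal_px, image_size_px, virtual_focal=1.0):
    """Inverse mapping: recover metric depth for a specific real camera."""
    normalised_focal = focal_px / image_size_px
    return z_virtual * normalised_focal / virtual_focal


# Camera A: focal length 600 px, 640 px wide image, hand at 0.5 m.
z_a = to_virtual_depth(0.5, focal_px=600.0, image_size_px=640.0)
# Camera B: focal length 1200 px, same image width, hand at 1.0 m.
# The hand occupies the same fraction of the image as in camera A.
z_b = to_virtual_depth(1.0, focal_px=1200.0, image_size_px=640.0)
```

Under these assumptions, `z_a` and `z_b` coincide: two cameras in which the hand appears at the same relative size yield the same virtual depth, which is the sense in which such a target would be camera-agnostic. Metric depth is recovered per camera with `from_virtual_depth`.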
| Original language | English |
|---|---|
| Number of pages | 15 |
| Publication status | Accepted/In press - 11 Nov 2025 |
| Event | IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, United States. Duration: 6 Mar 2026 → 10 Mar 2026 |
Conference
| Conference | IEEE/CVF Winter Conference on Applications of Computer Vision |
|---|---|
| Country/Territory | United States |
| City | Tucson |
| Period | 6/03/26 → 10/03/26 |
Research Groups and Themes
- Intelligent Systems Laboratory (MaVi)