We introduce UniCon3R, a unified feed-forward framework for online human-scene 4D reconstruction from monocular video. Current feed-forward human-scene reconstruction methods suffer from physically implausible artifacts: bodies float above the ground or penetrate scene geometry. A key reason is the lack of effective interaction modelling between the human and the environment. UniCon3R models this interaction explicitly by inferring 4D contact from human pose and scene geometry, then using the inferred contact as a corrective cue when estimating pose. This enables UniCon3R to jointly recover scene geometry and spatially aligned 4D humans within the scene, establishing contact as an internal prior for physically grounded reconstruction.
UniCon3R extends Human3R with two tightly coupled mechanisms. First, a scene-aware contact prompt fuses current-frame scene features, recurrent scene memory, local metric geometry, and temporal contact history to build a physically meaningful interaction token. Second, contact-guided latent refinement feeds the refined contact token back into the human branch before SMPL-X regression, turning contact from an auxiliary readout into an internal corrective prior.
The result is a unified recurrent reconstruction pipeline that preserves feed-forward efficiency while producing more physically grounded human motion and better body-scene alignment.
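The two mechanisms can be illustrated with a minimal NumPy sketch. This is not the UniCon3R implementation: the linear fusion, the sigmoid gate, the residual update, the feature dimension, and every name below (`build_contact_token`, `refine_human_latent`, `w_fuse`, `w_gate`, `w_delta`) are assumptions chosen only to show how a fused contact token might feed back into a human latent before pose regression.

```python
import numpy as np

def build_contact_token(frame_feat, scene_memory, local_geom, contact_history, w_fuse):
    """Fuse the four cues into one contact token (hypothetical linear fusion)."""
    cues = np.concatenate([frame_feat, scene_memory, local_geom, contact_history])
    return np.tanh(w_fuse @ cues)

def refine_human_latent(human_latent, contact_token, w_gate, w_delta):
    """Apply a gated residual correction to the human latent (assumed form)."""
    joint = np.concatenate([human_latent, contact_token])
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ joint)))  # per-channel gate in (0, 1)
    return human_latent + gate * (w_delta @ contact_token)

# Random stand-ins for learned weights and per-frame features (illustrative only).
rng = np.random.default_rng(0)
d = 8
frame_feat, scene_memory, local_geom, contact_history = rng.normal(size=(4, d))
w_fuse = rng.normal(size=(d, 4 * d)) * 0.1
w_gate = rng.normal(size=(d, 2 * d)) * 0.1
w_delta = rng.normal(size=(d, d)) * 0.1
human_latent = rng.normal(size=d)

token = build_contact_token(frame_feat, scene_memory, local_geom, contact_history, w_fuse)
refined = refine_human_latent(human_latent, token, w_gate, w_delta)
print(token.shape, refined.shape)
```

The residual form means that a zero correction leaves the human latent unchanged, which is one simple way to let contact act as a corrective signal rather than a replacement for the original estimate.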
Contact improves unified 4D human-scene reconstruction by correcting body-scene misalignment, reducing penetration, and improving global motion estimation while preserving feed-forward inference.
UniCon3R predicts dense per-vertex contact probabilities and uses them internally to refine the human reconstruction rather than treating contact as a detached side output.
Qualitative comparison of contact prediction on web images. UniCon3R captures contact on challenging body regions, including the lower torso, palms, and fingertips, while DECO tends to concentrate its predictions on narrower, visually salient support regions.
On EMDB-2, UniCon3R improves global human trajectory estimation relative to Human3R after world-coordinate alignment.
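As an illustration of what a trajectory error after alignment can look like, the sketch below fits a rigid transform (rotation plus translation, via the Kabsch algorithm) from a predicted root trajectory to the ground truth and reports the mean residual distance. The exact alignment protocol and metric used for EMDB-2 are not stated here, so the rigid fit over the whole sequence, the function names, and the synthetic trajectory are all assumptions.

```python
import numpy as np

def rigid_align(pred, gt):
    """Least-squares rotation + translation mapping pred onto gt (Kabsch)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    h = (pred - mu_p).T @ (gt - mu_g)          # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a proper rotation.
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    trans = mu_g - rot @ mu_p
    return pred @ rot.T + trans

def mean_trajectory_error(pred, gt):
    """Mean Euclidean distance between aligned prediction and ground truth."""
    return float(np.linalg.norm(rigid_align(pred, gt) - gt, axis=1).mean())

# Synthetic check: a rotated and shifted copy of the ground-truth path.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(scale=0.05, size=(100, 3)), axis=0)
theta = 0.4
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
pred = gt @ rot_z.T + np.array([1.0, -2.0, 0.5])
print(mean_trajectory_error(pred, gt))
```

Because the synthetic prediction differs from the ground truth only by a rigid transform, the error after alignment is numerically zero; any remaining error on real data reflects drift or shape differences in the trajectory itself.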
UniCon3R preserves competitive local mesh accuracy on SLOPER4D and 3DPW while substantially reducing maximum scene penetration on SLOPER4D.
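One simple way to picture a maximum-penetration measure is as the deepest signed distance of any body vertex below the scene surface. The sketch below approximates the scene by a single planar floor, which is an assumption made purely for illustration; an evaluation on real scans would query the reconstructed scene geometry instead, and the metric definition used on SLOPER4D may differ. The function name and the example vertices are hypothetical.

```python
import numpy as np

def max_penetration_depth(vertices, plane_point, plane_normal):
    """Deepest distance of any vertex below a plane; 0.0 if none penetrate."""
    normal = plane_normal / np.linalg.norm(plane_normal)
    signed = (vertices - plane_point) @ normal   # negative means below the plane
    return float(max(0.0, -signed.min()))

# Three example vertices (metres); one sits 3 cm below a floor at z = 0.
verts = np.array([[0.0, 0.0, 0.10],
                  [0.0, 0.0, -0.03],
                  [1.0, 0.0, 0.50]])
floor_point = np.zeros(3)
floor_normal = np.array([0.0, 0.0, 1.0])
depth = max_penetration_depth(verts, floor_point, floor_normal)
print(depth)
```

Taking the maximum rather than the mean makes the measure sensitive to a single badly penetrating limb, which is the failure mode the caption above refers to.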
Qualitative comparison of local human mesh recovery. UniCon3R produces body pose and scene alignment closer to ground truth, especially when clear scene support is visible.
UniCon3R uses contact-guided refinement to reduce floating and penetrating bodies in world-frame reconstructions.
Qualitative comparison on RICH. Yellow boxes highlight floating or implausible Human3R reconstructions compared to the more grounded predictions of UniCon3R.
@misc{sur2026unicon3rcontactaware3dhumanscene,
title={UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video},
author={Tanuj Sur and Shashank Tripathi and Nikos Athanasiou and Ha Linh Nguyen and Kai Xu and Michael J. Black and Angela Yao},
year={2026},
eprint={2604.19923},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.19923},
}
We thank the authors of ST4RTrack for the template, which follows SD+DINO and DreamBooth. The interactive 4D visualization on this page is powered by Viser.