UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video

Tanuj Sur¹ Shashank Tripathi² Nikos Athanasiou²
Ha Linh Nguyen¹ Kai Xu¹ Michael J. Black² Angela Yao¹

¹ National University of Singapore ² Max Planck Institute for Intelligent Systems, Tübingen, Germany

Abstract

We introduce UniCon3R, a unified feed-forward framework for online human-scene 4D reconstruction from monocular video. Current feed-forward human-scene reconstruction methods suffer from artifacts, where bodies float above the ground or penetrate parts of the scene. A key reason is the lack of effective interaction modelling between the human and the environment. UniCon3R explicitly models interaction by inferring 4D contact from human pose and scene geometry, then uses contact as a corrective cue for generating the pose. This enables UniCon3R to jointly recover scene geometry and spatially aligned 4D humans within the scene, establishing contact as an internal prior for physically grounded reconstruction.

Interactive 4D Visualization

Explore the 4D reconstruction results of UniCon3R on various dynamic scenes. Use the controls below to orbit the scene, zoom, and fly through the reconstruction.

Left-click and drag to rotate the view, scroll to zoom in or out, and right-click and drag to move the view. The viewer also supports moving forward and backward, left and right, and upward and downward.

Temporal Contact Evolution

Input Video

Results

Method Overview

UniCon3R extends Human3R with two tightly coupled mechanisms. First, a scene-aware contact prompt fuses current-frame scene features, recurrent scene memory, local metric geometry, and temporal contact history to build a physically meaningful interaction token. Second, contact-guided latent refinement feeds the refined contact token back into the human branch before SMPL-X regression, turning contact from an auxiliary readout into an internal corrective prior.

The result is a unified recurrent reconstruction pipeline that preserves feed-forward efficiency while producing more physically grounded human motion and better body-scene alignment.
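
As a rough illustration of how the two mechanisms fit together, the Python sketch below fuses four cue vectors into one contact token and then feeds that token back into a human latent through a gated residual update. It is a toy under stated assumptions, not the UniCon3R architecture: the element-wise averaging, the scalar sigmoid gate, the function names `fuse_contact_token` and `refine_human_latent`, and the 2-D example vectors are all hypothetical, since this page does not specify the fusion or refinement operators.

```python
import math


def fuse_contact_token(scene_feat, scene_memory, local_geometry, contact_history):
    """Toy fusion (assumption): element-wise mean of four equally sized cue vectors."""
    cues = [scene_feat, scene_memory, local_geometry, contact_history]
    dim = len(scene_feat)
    if any(len(cue) != dim for cue in cues):
        raise ValueError("all cue vectors must share one dimension")
    return [sum(cue[i] for cue in cues) / len(cues) for i in range(dim)]


def refine_human_latent(human_latent, contact_token, gate_logit=0.0):
    """Toy gated residual update (assumption): latent + sigmoid(gate) * token."""
    if len(human_latent) != len(contact_token):
        raise ValueError("latent and contact token must share one dimension")
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [h + gate * c for h, c in zip(human_latent, contact_token)]


# Hypothetical 2-D cues: current-frame scene features, recurrent scene memory,
# local metric geometry, and temporal contact history.
token = fuse_contact_token([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0])

# The refined latent is what a pose regressor would consume in this toy.
refined = refine_human_latent([0.2, -0.4], token)
```

The ordering is the point of the sketch: the contact token modifies the latent before any pose is regressed from it, so contact acts as a correction and not only as a readout.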

Qualitative Overview

Contact improves unified 4D human-scene reconstruction by correcting body-scene misalignment, reducing penetration, and improving global motion while preserving feed-forward inference.

Contact Prediction

UniCon3R predicts dense per-vertex contact probabilities and uses them internally to refine the human reconstruction rather than treating contact as a detached side output.
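
To make "dense per-vertex contact probabilities" concrete, the sketch below converts one raw score per mesh vertex into a probability and selects the vertices treated as in contact. The sigmoid activation, the 0.5 threshold, and the names `contact_probabilities` and `contact_vertices` are assumptions for illustration; the page does not state how UniCon3R parameterizes or thresholds its contact output.

```python
import math


def contact_probabilities(logits):
    """Map one raw score per vertex to a probability in (0, 1) with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def contact_vertices(probabilities, threshold=0.5):
    """Return indices of vertices whose contact probability meets the threshold."""
    return [i for i, p in enumerate(probabilities) if p >= threshold]


# Hypothetical scores for a three-vertex mesh.
probs = contact_probabilities([-2.0, 0.5, 3.0])
in_contact = contact_vertices(probs)
```

A real body mesh has thousands of vertices, but the per-vertex structure is the same: one probability for each vertex, independent of how many vertices are in contact.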

Qualitative comparison of contact prediction on web images. UniCon3R captures contact on challenging body regions, including the lower torso, palms, and fingertips, while DECO tends to concentrate its predictions on narrower, visually salient support regions.

Global Human Motion Estimation

On EMDB-2, UniCon3R improves global human trajectory estimation relative to Human3R after world-coordinate alignment.
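
The idea of scoring a trajectory only after alignment can be sketched as follows. This is a deliberately simplified stand-in, assuming translation-only alignment on the first frame followed by a mean Euclidean error over root positions; benchmark protocols typically use rigid or similarity alignment over frames or segments, and the exact metrics reported for EMDB-2 are not given on this page. The function names are hypothetical.

```python
import math


def align_first_frame(pred, gt):
    """Translate the predicted trajectory so its first point matches the ground truth."""
    offset = [g - p for p, g in zip(pred[0], gt[0])]
    return [tuple(c + o for c, o in zip(point, offset)) for point in pred]


def mean_trajectory_error(pred, gt):
    """Mean Euclidean distance between aligned predicted and ground-truth positions."""
    if not pred or len(pred) != len(gt):
        raise ValueError("trajectories must be non-empty and equally long")
    aligned = align_first_frame(pred, gt)
    return sum(math.dist(a, g) for a, g in zip(aligned, gt)) / len(gt)


# Hypothetical root positions in meters: the prediction is offset by 1 m in x
# and drifts 1 m upward in the last frame.
pred = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 1.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
error = mean_trajectory_error(pred, gt)
```

In this example the constant offset is removed by the alignment, so only the drift in the last frame contributes to the error.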

Local Mesh Reconstruction

UniCon3R preserves competitive local mesh accuracy on SLOPER4D and 3DPW while substantially reducing maximum scene penetration on SLOPER4D.
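
A maximum-penetration measure can be sketched under a strong simplifying assumption: the scene is a single ground plane, and penetration is how far a body vertex lies below it. Real scenes are point clouds or meshes, and the page does not define the exact metric used on SLOPER4D, so the plane model, the unit upward normal, and the function names here are illustrative only.

```python
def signed_distance_to_plane(point, plane_point, plane_normal):
    """Signed distance from a point to a plane; positive on the normal's side.

    Assumes plane_normal has unit length.
    """
    return sum((p - q) * n for p, q, n in zip(point, plane_point, plane_normal))


def max_penetration(vertices, plane_point=(0.0, 0.0, 0.0), plane_normal=(0.0, 0.0, 1.0)):
    """Largest depth (in scene units) by which any vertex lies below the plane."""
    depths = [
        max(0.0, -signed_distance_to_plane(v, plane_point, plane_normal))
        for v in vertices
    ]
    return max(depths, default=0.0)


# Hypothetical body vertices in meters: one above the floor, two below it.
verts = [(0.0, 0.0, 0.10), (0.1, 0.0, -0.03), (0.2, 0.0, -0.05)]
depth = max_penetration(verts)
```

Taking the maximum over vertices means a single deeply buried foot dominates the score, which is why a maximum is a stricter measure than a mean.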

Qualitative comparison of local human mesh recovery. UniCon3R produces body pose and scene alignment closer to ground truth, especially when clear scene support is visible.

Physical Plausibility

UniCon3R uses contact-guided refinement to reduce floating and penetrating bodies in world-frame reconstructions.
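
The two failure modes named here can be told apart with a simple check on the lowest point of the body relative to the supporting surface. The sketch below assumes per-vertex signed heights above that surface are already available (positive above, negative below) and uses a hypothetical 2 cm tolerance; it is a diagnostic illustration, not part of the UniCon3R pipeline.

```python
def classify_grounding(signed_heights, tolerance=0.02):
    """Label a body as floating, penetrating, or grounded from its lowest vertex.

    signed_heights: non-empty per-vertex heights above the support surface, in meters.
    tolerance: assumed slack around zero height that still counts as touching.
    """
    lowest = min(signed_heights)
    if lowest > tolerance:
        return "floating"
    if lowest < -tolerance:
        return "penetrating"
    return "grounded"


# Hypothetical frames: hovering 15 cm up, sunk 8 cm in, and resting on the surface.
labels = [
    classify_grounding([0.15, 0.90]),
    classify_grounding([-0.08, 0.50]),
    classify_grounding([0.01, 1.20]),
]
```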

Qualitative comparison on RICH. Yellow boxes highlight floating or implausible Human3R reconstructions compared to the more grounded predictions of UniCon3R.

BibTeX

@misc{sur2026unicon3rcontactaware3dhumanscene,
      title={UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video},
      author={Tanuj Sur and Shashank Tripathi and Nikos Athanasiou and Ha Linh Nguyen and Kai Xu and Michael J. Black and Angela Yao},
      year={2026},
      eprint={2604.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.19923},
}

Acknowledgements

We thank the authors of ST4RTrack for the template, which follows SD+DINO and DreamBooth. The interactive 4D visualization on this page is powered by Viser.