Abstract

We introduce UniCon3R, a unified feed-forward framework for online human-scene 4D reconstruction from monocular video. Current feed-forward human-scene reconstruction methods suffer from physically implausible artifacts: bodies float above the ground or penetrate parts of the scene. A key reason is the lack of effective interaction modelling between the human and the environment. UniCon3R models this interaction explicitly by inferring 4D contact from human pose and scene geometry, then using that contact as a corrective cue when generating the pose. This enables UniCon3R to jointly recover scene geometry and spatially aligned 4D humans within the scene, establishing contact as an internal prior for physically grounded reconstruction.

Interactive 4D Visualization

Explore the 4D reconstruction results of UniCon3R on various dynamic scenes. Use the controls below to orbit the scene, zoom, and fly through the reconstruction.

Left click: drag to rotate the view
Scroll wheel: scroll to zoom in or out
Right click: drag to move the view
Moving forward and backward
Moving left and right
Moving upward and downward

Temporal Contact Evolution

[Interactive viewer: input video frames shown alongside the body surface with the current contact regions highlighted, over 48 frames.]

Method Overview

UniCon3R pipeline

UniCon3R extends Human3R with two tightly coupled mechanisms. First, a scene-aware contact prompt fuses current-frame scene features, recurrent scene memory, local metric geometry, and temporal contact history to build a physically meaningful interaction token. Second, contact-guided latent refinement feeds the refined contact token back into the human branch before SMPL-X regression, turning contact from an auxiliary readout into an internal corrective prior.

The result is a unified recurrent reconstruction pipeline that preserves feed-forward efficiency while producing more physically grounded human motion and better body-scene alignment.
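The data flow of the two mechanisms can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function names, the concatenate-and-project fusion, the gated residual update, and the random weights are hypothetical and are not taken from the UniCon3R implementation. It only shows the order of operations described above: build a contact token from four cues, then feed it back into the human latent before regression.

```python
import math
import random

def linear(weights, x):
    """Dense layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def build_contact_token(scene_feat, scene_memory, local_geometry, contact_history, w_fuse):
    """Fuse the four cues by concatenation and a linear projection (assumed design)."""
    fused = scene_feat + scene_memory + local_geometry + contact_history
    return [math.tanh(v) for v in linear(w_fuse, fused)]

def refine_human_latent(human_latent, contact_token, w_delta, w_gate):
    """Gated residual update of the human latent from the contact token (assumed design)."""
    delta = linear(w_delta, contact_token)
    gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(w_gate, contact_token)]
    return [h + g * d for h, g, d in zip(human_latent, gate, delta)]

rng = random.Random(0)
dim = 4
# Four placeholder cue vectors: scene features, scene memory, local geometry, contact history.
cues = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
w_fuse = random_matrix(dim, 4 * dim, rng)
w_delta = random_matrix(dim, dim, rng)
w_gate = random_matrix(dim, dim, rng)

human_latent = [rng.uniform(-1, 1) for _ in range(dim)]
token = build_contact_token(*cues, w_fuse)
refined = refine_human_latent(human_latent, token, w_delta, w_gate)
# In this sketch, `refined` would replace `human_latent` as input to the SMPL-X regression head.
```

The residual form keeps the original human latent intact when the contact token carries no correction, which is one plausible way to let contact act as a corrective prior instead of a separate output.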

Qualitative Overview

Contact improves unified 4D human-scene reconstruction by correcting body-scene misalignment, reducing penetration, and improving global motion while preserving feed-forward inference.

UniCon3R qualitative overview

Contact Prediction

UniCon3R predicts dense per-vertex contact probabilities and uses them internally to refine the human reconstruction rather than treating contact as a detached side output.

UniCon3R contact comparison

Qualitative comparison of contact prediction on web images. UniCon3R captures contact on challenging body regions, including the lower torso, palms, and fingertips, while DECO tends to concentrate predictions on narrower visually salient support regions.
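As a minimal sketch of how dense per-vertex contact probabilities can be turned into binary contact labels and scored, the snippet below applies a sigmoid, thresholds at 0.5, and computes precision, recall, and F1 against ground-truth labels. The threshold, the function names, and the choice of F1 are assumptions for illustration; the page does not state which contact metric is reported.

```python
import math

def contact_probabilities(logits):
    """Map per-vertex logits to probabilities with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def contact_f1(probs, gt_labels, threshold=0.5):
    """Threshold probabilities and return (precision, recall, F1)."""
    pred = [p >= threshold for p in probs]
    tp = sum(1 for p, g in zip(pred, gt_labels) if p and g)
    fp = sum(1 for p, g in zip(pred, gt_labels) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt_labels) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Six made-up vertices: logits from a hypothetical contact head and ground-truth labels.
logits = [2.0, -1.0, 0.5, -3.0, 1.5, -0.2]
gt = [True, False, False, False, True, True]

probs = contact_probabilities(logits)
precision, recall, f1 = contact_f1(probs, gt)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```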

Global Human Motion Estimation

On EMDB-2, UniCon3R improves global human trajectory estimation relative to Human3R after world-coordinate alignment.

Global motion comparison
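To make the idea of comparing trajectories after alignment concrete, here is a simplified sketch of a path-length-normalised trajectory error. It aligns the predicted root trajectory to the ground truth by translation only (matching centroids); this is a deliberate simplification and an assumption, since benchmark protocols typically use a rigid or similarity alignment. The function names and sample trajectories are hypothetical.

```python
import math

def centroid(points):
    """Mean 3D position of a trajectory."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def align_by_translation(pred, gt):
    """Shift the predicted trajectory so its centroid matches the ground truth."""
    cp, cg = centroid(pred), centroid(gt)
    offset = tuple(g - p for p, g in zip(cp, cg))
    return [tuple(v + o for v, o in zip(p, offset)) for p in pred]

def path_length(points):
    """Total distance travelled along the trajectory."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def relative_trajectory_error(pred, gt):
    """Mean per-frame error after alignment, divided by ground-truth path length."""
    length = path_length(gt)
    if length == 0.0:
        raise ValueError("ground-truth trajectory has zero length")
    aligned = align_by_translation(pred, gt)
    mean_err = sum(math.dist(a, g) for a, g in zip(aligned, gt)) / len(gt)
    return mean_err / length

gt_traj = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in gt_traj]  # constant offset only
drifting = [(x, y, 0.1 * i) for i, (x, y, _) in enumerate(gt_traj)]  # vertical drift

rte_shifted = relative_trajectory_error(shifted, gt_traj)
rte_drifting = relative_trajectory_error(drifting, gt_traj)
```

A constant offset disappears under alignment, while drift that accumulates over time does not, which is why an aligned error isolates the shape of the motion from the choice of world origin.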

Local Mesh Reconstruction

UniCon3R preserves competitive local mesh accuracy on SLOPER4D and 3DPW while substantially reducing maximum scene penetration on SLOPER4D.

Qualitative local human mesh recovery

Qualitative comparison of local human mesh recovery. UniCon3R produces body pose and scene alignment closer to ground truth, especially when clear scene support is visible.

Physical Plausibility

UniCon3R uses contact-guided refinement to reduce floating and penetrating bodies in world-frame reconstructions.

Physical plausibility comparison on RICH

Qualitative comparison on RICH. Yellow boxes highlight floating or implausible Human3R reconstructions compared to the more grounded predictions of UniCon3R.
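Floating and penetration can both be read off the signed distance between body vertices and the supporting surface. The sketch below assumes a flat ground plane and a 1 cm tolerance purely for illustration; real scenes would use the reconstructed scene geometry, and the function names and thresholds here are hypothetical. Note that a floating gap only indicates an error when the body is expected to be in contact with the ground.

```python
import math

def signed_heights(vertices, plane_point, plane_normal):
    """Signed distance of each vertex to the plane (positive means above it)."""
    norm = math.sqrt(sum(n * n for n in plane_normal))
    unit = [n / norm for n in plane_normal]
    return [sum((v[i] - plane_point[i]) * unit[i] for i in range(3)) for v in vertices]

def grounding_report(vertices, plane_point=(0.0, 0.0, 0.0),
                     plane_normal=(0.0, 0.0, 1.0), tolerance=0.01):
    """Summarise floating and penetration from the lowest vertex (units: metres)."""
    lowest = min(signed_heights(vertices, plane_point, plane_normal))
    return {
        "max_penetration": max(0.0, -lowest),  # depth of the deepest vertex below the plane
        "floating_gap": max(0.0, lowest),      # clearance of the lowest vertex above the plane
        "is_penetrating": lowest < -tolerance,
        "is_floating": lowest > tolerance,
    }

# Two made-up bodies, each reduced to a foot vertex and a head vertex.
floating_body = [(0.0, 0.0, 0.05), (0.0, 0.0, 1.70)]
sunken_body = [(0.0, 0.0, -0.03), (0.0, 0.0, 1.62)]

floating_report = grounding_report(floating_body)
sunken_report = grounding_report(sunken_body)
```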

BibTeX

@misc{sur2026unicon3rcontactaware3dhumanscene,
      title={UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video},
      author={Tanuj Sur and Shashank Tripathi and Nikos Athanasiou and Ha Linh Nguyen and Kai Xu and Michael J. Black and Angela Yao},
      year={2026},
      eprint={2604.19923},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.19923},
}

Acknowledgements

We thank the authors of ST4RTrack for the template, which follows SD+DINO and DreamBooth. The interactive 4D visualization on this page is powered by Viser.