Sora3R: Can Video Diffusion Model Reconstruct 4D Geometry?

Mar 16, 2025 · Jinjie Mai, Wenxuan Zhu, Haozhe Liu, Bing Li, Cheng Zheng, Jürgen Schmidhuber, Bernard Ghanem · 24 mins read
Abstract
Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) We adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces. (2) We finetune a diffusion backbone in a combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Unlike previous approaches, Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.
Type
Publication
arXiv preprint

Visualization

Note that our method does not output a dynamic or confidence mask by default. For a fair comparison, when the dataset provides groundtruth background masks, we apply them to every frame for all methods (Sora3R, DUSt3R, MonST3R) to filter out background regions in the visualization results.

For completeness, MonST3R w dyn seg is the default visualization setting of MonST3R in its source repo: it uses MonST3R's confidence-based dynamic segmentation mask and preserves static background regions, instead of applying the per-frame groundtruth background mask.

Comparison on Sintel Dataset

Sora3R (Ours)

DUSt3R

MonST3R

MonST3R w dyn seg

source video

Comparison on TUM-dynamic Dataset

Sora3R (Ours)

DUSt3R

MonST3R

MonST3R w dyn seg

source video

Comparison on ScanNet Dataset

Sora3R (Ours)

DUSt3R

MonST3R

MonST3R w dyn seg

source video

Visualization of Depth maps and Camera Trajectories compared to Groundtruth

Additional Visualization

Besides the results presented on the test datasets above, we also show additional results on the test splits of our training datasets, covering unseen objects as well as indoor and outdoor videos.

To align with the camera pose visualization, fused 4D pointmaps are obtained by first estimating camera poses and depth maps from the raw pointmap predictions, then re-backprojecting the depth maps with the estimated poses, as sketched below.
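For reference, a rough sketch of this fusion, composing the post-optimization utilities sketched later in the Method section (estimate_focal, estimate_pose, pointmap_to_depth, and build_global_pointmaps are illustrative helper names, not our actual code):

```python
import numpy as np

def fuse_pointmaps(pred_pointmaps):
    """Assemble fused 4D pointmaps from raw predictions for visualization.

    pred_pointmaps: (N, H, W, 3) raw pointmap predictions in first-frame (world) coordinates.
    """
    _, H, W, _ = pred_pointmaps.shape
    f = estimate_focal(pred_pointmaps[0])                            # shared focal from frame 1
    K = np.array([[f, 0.0, W / 2.0], [0.0, f, H / 2.0], [0.0, 0.0, 1.0]])
    poses = np.stack([estimate_pose(P, K) for P in pred_pointmaps])  # cam-to-world poses
    depths = np.stack([pointmap_to_depth(P, T) for P, T in zip(pred_pointmaps, poses)])
    return build_global_pointmaps(depths, K, poses)                  # re-backproject with estimates
```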

RealEstate10K Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

Objaverse Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

TartanAir Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

Dynamic Replica Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

Point Odyssey Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

Method

Training pipeline

We describe our model design and training in the Temporal Pointmap Latent and VAE and 4D Geometry DiT sections, while model inference is detailed in 4D Pointmap Inference. We also present post-optimization to infer intrinsics, extrinsics, and depth from 4D pointmaps in Post-Optimization for Downstream Tasks.

Temporal Pointmap Latent and VAE

Given a raw video input $\mathbf{V} \in \mathbb{R}^{N\times H\times W\times C}$, a pretrained temporal VAE with encoder $\mathcal{E}_{\text{RGB}}$ and decoder $\mathcal{D}_{\text{RGB}}$ models the video latent distribution.

A pointmap represents pixel-wise 3D coordinates per frame, establishing one-to-one pixel-point correspondences in the world frame. Unlike prior methods that freeze pretrained VAEs, we argue that fine-tuning is necessary to transfer from temporal RGB images to temporal pointmaps: depth values can be highly unbounded, making a naively reused encoder ineffective. We therefore propose a temporal pointmap latent space that remains compatible with video latents while capturing 4D geometry.

To learn this representation, we fine-tune the RGB VAE $\{\mathcal{E}_{\text{RGB}},\mathcal{D}_{\text{RGB}}\}$ into an XYZ VAE $\{\mathcal{E}_{\text{XYZ}},\mathcal{D}_{\text{XYZ}}\}$ using known groundtruth camera poses $\{\mathbf{T}_i\}_{i=1}^N$, where $\mathbf{T}_i \in \mathrm{SE}(3)$. We always fix the first frame as the world coordinate frame, i.e., $\mathbf{T}_1=\mathbf{I}$. Given depth maps $\mathbf{D}_i$ and the intrinsic matrix $\mathbf{K}$, the global pointmap is computed as:

\[ \mathbf{P}_i(u, v) = \mathbf{T}_i \cdot \mathbf{K}^{-1} \begin{bmatrix} u \cdot \mathbf{D}_i(u, v) \\ v \cdot \mathbf{D}_i(u, v) \\ \mathbf{D}_i(u, v) \end{bmatrix}, \quad \forall i \in \{1, 2, \dots, N\} \]

To normalize large coordinate variations, we apply a norm scale factor:

\[ \mathbf{P}_i(u, v) = \frac{\mathbf{P}_i(u, v)}{\frac{1}{N \cdot H \cdot W} \sum_{i=1}^{N} \sum_{u=1}^{W} \sum_{v=1}^{H} \|\mathbf{P}_i(u, v)\|} \]
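For concreteness, a minimal NumPy sketch of the two equations above, assuming per-frame metric depth maps, a shared intrinsic matrix, and cam-to-world poses with the first frame fixed to the identity (names and shapes are illustrative, not our training code):

```python
import numpy as np

def build_global_pointmaps(depths, K, poses):
    """Backproject per-frame depth into the first-frame (world) coordinates and normalize.

    depths: (N, H, W) metric depth; K: (3, 3) intrinsics;
    poses:  (N, 4, 4) cam-to-world matrices with poses[0] == identity.
    Returns pointmaps of shape (N, H, W, 3).
    """
    N, H, W = depths.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                   # pixel grid (x, y)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)                 # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                  # K^{-1} [u, v, 1]^T per pixel
    pts_cam = rays[None] * depths[..., None]                         # camera-frame points (N, H, W, 3)
    pts_h = np.concatenate([pts_cam, np.ones_like(depths)[..., None]], axis=-1)
    pts_world = np.einsum('nij,nhwj->nhwi', poses, pts_h)[..., :3]   # apply T_i

    # Normalize by the mean point norm over all frames and pixels (the norm scale factor).
    scale = np.linalg.norm(pts_world, axis=-1).mean()
    return pts_world / scale
```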

For training, we observe that the $L1$ loss lacks sensitivity to small residuals, while the $L2$ loss over-penalizes outliers. Instead, we adopt the Huber loss:

\[ \mathcal{L}_{\text{rec}}(\mathbf{\hat{P}}, \mathbf{P}) = \begin{cases} 0.5 \cdot \|\mathbf{\hat{P}} - \mathbf{P}\|_2^2, & \text{if } \|\mathbf{\hat{P}} - \mathbf{P}\|_2 < \beta \\ \|\mathbf{\hat{P}} - \mathbf{P}\|_1 - 0.5 \cdot \beta, & \text{otherwise} \end{cases} \]

We calculate $\mathcal{L}_{\text{rec}}$ only on valid depth points, masking out infinite depth values (e.g., sky regions). The final VAE training loss is:

\[ \mathcal{L}_{\{\mathcal{E}_{\text{XYZ}},\mathcal{D}_{\text{XYZ}}\}}=\mathcal{L}_{\text{rec}} + \lambda_{\text{KL}}\mathcal{L}_{\text{KL}} \]
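A compact PyTorch sketch of this training loss, assuming a Gaussian VAE posterior parameterized by mu and logvar; the smooth-L1 form with threshold beta matches the Huber loss above, and the mask drops pixels with invalid depth. The beta and lambda_kl defaults here are illustrative, not the values used in our experiments:

```python
import torch
import torch.nn.functional as F

def xyz_vae_loss(pred_pts, gt_pts, valid_mask, mu, logvar, beta=1.0, lambda_kl=1e-6):
    """Masked Huber (smooth-L1) reconstruction loss plus KL regularization.

    pred_pts, gt_pts: (N, H, W, 3) pointmaps; valid_mask: (N, H, W) bool,
    False where depth is invalid (e.g., sky); mu, logvar: posterior parameters.
    """
    rec = F.smooth_l1_loss(pred_pts, gt_pts, beta=beta, reduction='none').sum(dim=-1)
    rec = (rec * valid_mask).sum() / valid_mask.sum().clamp(min=1)   # average over valid pixels
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL to standard normal
    return rec + lambda_kl * kl
```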

4D Geometry DiT

Once the XYZ VAE is trained, we leverage pretrained video diffusion models to denoise temporal pointmap latents in the same way as RGB video latents. We use a Transformer-based DiT instead of a UNet for its scalability and spatiotemporal attention.

To leverage spatiotemporal priors from pretrained video diffusion models, we fine-tune on pointmap latents using rectified flow:

\[ \textbf{H}_{XYZ}^t = t \textbf{H}_{XYZ} + (1-t) \epsilon, \quad \epsilon \sim \mathcal{N}(0,1) \]

The 4D DiT model $\mathcal{F}$ predicts the velocity $\boldsymbol{\nu}_\epsilon$:

\[ \mathbb{E}_{t,\mathbf{H}_{XYZ},\textbf{H}_{RGB},\epsilon} || \mathcal{F}(\textbf{H}_{XYZ}^t,\textbf{H}_{RGB},t)- \boldsymbol{\nu}_\epsilon||^2 \]

where the RGB video latent $\textbf{H}_{RGB}$ acts as an additional condition for denoising. Since the XYZ VAE is fine-tuned from the RGB VAE, we hypothesize that both representations share internal scene features, aiding transfer learning.
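As a sketch, one training step of this objective under the rectified-flow convention above, assuming a DiT that takes the noisy pointmap latent, the RGB condition latent, and the timestep; whether the condition is concatenated channel-wise or passed separately is left as an implementation detail here, and the velocity target $\textbf{H}_{XYZ} - \epsilon$ follows the standard parameterization for this interpolation:

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, h_xyz, h_rgb):
    """One rectified-flow training step on pointmap latents conditioned on RGB latents.

    h_xyz: clean pointmap latents; h_rgb: RGB video latents;
    model(h_t, h_rgb, t) -> predicted velocity, same shape as h_xyz.
    """
    b = h_xyz.shape[0]
    t = torch.rand(b, device=h_xyz.device)                       # t ~ U(0, 1)
    eps = torch.randn_like(h_xyz)                                 # Gaussian noise
    t_ = t.view(b, *([1] * (h_xyz.dim() - 1)))                    # broadcast over latent dims
    h_t = t_ * h_xyz + (1.0 - t_) * eps                           # H_XYZ^t = t H_XYZ + (1 - t) eps
    v_target = h_xyz - eps                                        # velocity of the interpolation
    v_pred = model(h_t, h_rgb, t)
    return F.mse_loss(v_pred, v_target)
```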

4D Pointmap Inference

At inference, we require only $\mathcal{E}_{\text{RGB}}$ and $\mathcal{D}_{\text{XYZ}}$. We sample random noise $\epsilon\sim \mathcal{N}(0,1)$ and concatenate it with the video latent $\textbf{H}_{RGB}=\mathcal{E}_{\text{RGB}}(\textbf{V})$. The denoised pointmap latent $\textbf{H}_{XYZ}$ is then decoded:

\[ \mathbf{\hat{P}}=\mathcal{D}_{\text{XYZ}} (\textbf{H}_{XYZ}) \]

Since our method processes all video frames at once, we capture global spatiotemporal dependencies, ensuring temporally consistent and spatially coherent 4D pointmaps.
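For illustration, a simple Euler sampler under the same convention (t = 0 is pure noise, t = 1 is data); the number of steps, the assumption that the XYZ latent shares the RGB latent's shape, and the way the condition is passed to the model are sketch-level assumptions rather than our exact inference settings:

```python
import torch

@torch.no_grad()
def infer_pointmaps(video, enc_rgb, dec_xyz, model, num_steps=25):
    """Feedforward 4D pointmap inference by integrating the learned velocity field.

    video: input RGB frames; enc_rgb / dec_xyz: RGB encoder and XYZ decoder;
    model(h_t, h_rgb, t) -> predicted velocity. num_steps is illustrative.
    """
    h_rgb = enc_rgb(video)                                  # condition latents
    h_t = torch.randn_like(h_rgb)                           # start from pure noise (t = 0)
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=h_rgb.device)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        t = torch.full((h_rgb.shape[0],), t0.item(), device=h_rgb.device)
        v = model(h_t, h_rgb, t)                            # velocity at the current time
        h_t = h_t + (t1 - t0) * v                           # Euler step toward the data (t = 1)
    return dec_xyz(h_t)                                     # decode pointmap latents to 4D pointmaps
```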

Inference pipeline

Post-Optimization for Downstream Tasks

Our 4D pointmaps $\mathbf{\hat{P}}$ naturally support various geometry tasks via simple post-optimization.

Intrinsic Estimation

Assuming all frames share the same camera intrinsics, we set the principal point to the image center:

\[ c_x={W}/{2}, \quad c_y={H}/{2} \]

Since the first frame is fixed as the coordinate frame, we optimize the focal length $f$ via Weiszfeld’s algorithm:

\[ \hat{f} = \arg \min_{f} \sum_{u=1}^{W} \sum_{v=1}^{H} \left\| (u-c_x,\, v-c_y) - f\, \frac{\left(\mathbf{\hat{P}}_1(u,v,0),\, \mathbf{\hat{P}}_1(u,v,1)\right)}{\mathbf{\hat{P}}_1(u,v,2)} \right\| \]
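A NumPy sketch of this estimate via iteratively reweighted least squares (the Weiszfeld update); since $\mathbf{T}_1=\mathbf{I}$, the first-frame pointmap is already in its own camera coordinates. The initialization and iteration count are illustrative:

```python
import numpy as np

def estimate_focal(P1, num_iters=10):
    """Robust focal length from the first-frame pointmap (assumed to have positive depth).

    P1: (H, W, 3) pointmap of frame 1 in its own camera coordinates.
    """
    H, W, _ = P1.shape
    u, v = np.meshgrid(np.arange(W) - W / 2.0, np.arange(H) - H / 2.0)     # centered pixels
    pix = np.stack([u, v], axis=-1).reshape(-1, 2)
    xy = (P1[..., :2] / np.clip(P1[..., 2:3], 1e-6, None)).reshape(-1, 2)  # (X/Z, Y/Z)

    f = np.median(np.linalg.norm(pix, axis=1) / (np.linalg.norm(xy, axis=1) + 1e-8))
    for _ in range(num_iters):
        res = np.linalg.norm(pix - f * xy, axis=1) + 1e-8     # per-pixel L1 residual
        w = 1.0 / res                                         # Weiszfeld reweighting
        f = (w * (pix * xy).sum(1)).sum() / (w * (xy * xy).sum(1)).sum()
    return f
```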

Camera Pose Estimation

Given estimated intrinsics and the fixed first-frame pose ($\mathbf{\hat{T}}_1=\mathbf{I}$), we infer the remaining camera poses using RANSAC PnP:

\[ \mathbf{\hat{T}}_i = \arg \min_{\mathbf{\hat{T}}_i} \sum_{u=1}^{W} \sum_{v=1}^{H} \left\| (u,v) - \pi \left( \mathbf{\hat{K}}\, \mathbf{\hat{T}}_i^{-1}\, \mathbf{\hat{P}}_i(u,v) \right) \right\|_2 \]

where $\mathbf{\hat{T}}_i \in \mathrm{SE}(3)$ for all frames $i$.
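One way to implement this step is OpenCV's RANSAC PnP solver; a sketch, assuming the intrinsics from the previous step and world-frame pointmaps (the reprojection threshold is illustrative). Since $\mathbf{\hat{T}}_i$ is cam-to-world, PnP returns its inverse:

```python
import cv2
import numpy as np

def estimate_pose(P_i, K):
    """Recover the cam-to-world pose of frame i from its world-frame pointmap via RANSAC PnP.

    P_i: (H, W, 3) predicted pointmap in world (first-frame) coordinates; K: (3, 3) intrinsics.
    """
    H, W, _ = P_i.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64), np.arange(H, dtype=np.float64))
    pix = np.stack([u, v], axis=-1).reshape(-1, 2)            # 2D pixel coordinates
    pts = P_i.reshape(-1, 3).astype(np.float64)               # corresponding 3D world points

    _, rvec, tvec, _ = cv2.solvePnPRansac(
        pts, pix, K.astype(np.float64), distCoeffs=None, reprojectionError=3.0)
    R, _ = cv2.Rodrigues(rvec)                                # world-to-camera rotation
    T_w2c = np.eye(4)
    T_w2c[:3, :3], T_w2c[:3, 3] = R, tvec.ravel()
    return np.linalg.inv(T_w2c)                               # cam-to-world pose T_i
```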

Video Depth Estimation

Depth maps are extracted by mapping the pointmaps back into each estimated camera frame and taking the z-component under the pinhole model:

\[ \mathbf{\hat{D}}_i(u,v) = \left[\, \mathbf{\hat{T}}_i^{-1}\, \mathbf{\hat{P}}_i(u,v) \,\right]_{z} \]
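A corresponding sketch, assuming the cam-to-world poses estimated above:

```python
import numpy as np

def pointmap_to_depth(P_i, T_i):
    """Per-frame depth as the z-component of the pointmap mapped back into camera i.

    P_i: (H, W, 3) world-frame pointmap; T_i: (4, 4) estimated cam-to-world pose.
    """
    pts_h = np.concatenate([P_i, np.ones_like(P_i[..., :1])], axis=-1)   # homogeneous points
    pts_cam = pts_h @ np.linalg.inv(T_i).T                               # world -> camera frame
    return pts_cam[..., 2]                                               # depth = Z coordinate
```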

Acknowledgements

We build our method on top of these awesome repositories:

Sincere thanks to the authors for their great work!