Sora3R: Can Video Diffusion Model Reconstruct 4D Geometry?

Mar 16, 2025 · Jinjie Mai, Wenxuan Zhu, Haozhe Liu, Bing Li, Cheng Zheng, Jürgen Schmidhuber, Bernard Ghanem · 24 mins read
Abstract
Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) We adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces. (2) We finetune a diffusion backbone in a combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Unlike previous approaches, Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.
Type
Publication
arXiv preprint

Visualization

Note that our method does not output a dynamic or confidence mask by default. For a fair comparison, when the dataset provides groundtruth background masks, we apply them to every frame for all methods (Sora3R, DUSt3R, MonST3R) to filter out background regions in the visualization results.

For completeness, MonST3R w dyn seg is the default visualization setting of MonST3R in its source repo: it uses MonST3R's confidence-based dynamic segmentation mask and preserves static background regions, instead of applying the per-frame groundtruth background mask.

Comparison on Sintel Dataset

Sora3R (Ours)

DUSt3R

MonST3R

MonST3R w dyn seg

source video

Comparison on TUM-dynamic Dataset

Sora3R (Ours)

DUSt3R

MonST3R

MonST3R w dyn seg

source video

Comparison on ScanNet Dataset

Sora3R (Ours)

DUSt3R

MonST3R

MonST3R w dyn seg

source video

Visualization of Depth maps and Camera Trajectories compared to Groundtruth

Additional Visualization

Besides the results presented on the test datasets above, we also show additional results on the test splits of our training datasets, covering unseen objects as well as indoor and outdoor videos.

To align with the camera pose visualization, fused 4D pointmaps are obtained by first estimating camera poses and depth maps from the raw pointmap predictions, then re-backprojecting the depth maps with the estimated poses, as sketched below.
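For reference, a rough sketch of this fusion, composing the post-optimization utilities sketched later in the Method section (estimate_focal, estimate_pose, pointmap_to_depth, and build_global_pointmaps are illustrative helper names, not our actual code):

```python
import numpy as np

def fuse_pointmaps(pred_pointmaps):
    """Assemble fused 4D pointmaps from raw predictions for visualization.

    pred_pointmaps: (N, H, W, 3) raw pointmap predictions in first-frame (world) coordinates.
    """
    _, H, W, _ = pred_pointmaps.shape
    f = estimate_focal(pred_pointmaps[0])                            # shared focal from frame 1
    K = np.array([[f, 0.0, W / 2.0], [0.0, f, H / 2.0], [0.0, 0.0, 1.0]])
    poses = np.stack([estimate_pose(P, K) for P in pred_pointmaps])  # cam-to-world poses
    depths = np.stack([pointmap_to_depth(P, T) for P, T in zip(pred_pointmaps, poses)])
    return build_global_pointmaps(depths, K, poses)                  # re-backproject with estimates
```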

RealEstate10K Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

Objaverse Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

TartanAir Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

Dynamic Replica Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

Point Odyssey Dataset

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

source video

Sora3R (Ours)

Fused 4D pointmaps

Fused 4D pointmaps

Method

Training pipeline

We describe our model design and training in the Temporal Pointmap Latent and VAE and 4D Geometry DiT sections, while model inference is detailed in 4D Pointmap Inference. We also present post-optimization to infer intrinsics, extrinsics, and depth from 4D pointmaps in Post-Optimization for Downstream Tasks.

Temporal Pointmap Latent and VAE

Given a raw video input $\mathbf{V} \in \mathbb{R}^{N\times H\times W\times C}$, a pretrained temporal VAE with encoder $\mathcal{E}_{\text{RGB}}$ and decoder $\mathcal{D}_{\text{RGB}}$ models the video latent distribution.

A pointmap represents pixel-wise 3D coordinates per frame, establishing one-to-one pixel-point correspondences in the world frame. Unlike prior methods that freeze pretrained VAEs, we argue that fine-tuning is necessary to transfer from temporal RGB images to temporal pointmaps: depth values can be highly unbounded, making a naively reused encoder ineffective. We therefore propose a temporal pointmap latent space that remains compatible with video latents while capturing 4D geometry.

To learn this representation, we fine-tune the RGB VAE $\{\mathcal{E}_{\text{RGB}},\mathcal{D}_{\text{RGB}}\}$ into an XYZ VAE $\{\mathcal{E}_{\text{XYZ}},\mathcal{D}_{\text{XYZ}}\}$ using known groundtruth camera poses $\{\mathbf{T}_i\}_{i=1}^N$, where $\mathbf{T}_i \in \mathrm{SE}(3)$. We always fix the first frame as the world coordinate frame, i.e., $\mathbf{T}_1=\mathbf{I}$. Given depth maps $\mathbf{D}_i$ and the intrinsic matrix $\mathbf{K}$, the global pointmap is computed as:

\[ \mathbf{P}_i(u, v) = \mathbf{T}_i \cdot \mathbf{K}^{-1} \begin{bmatrix} u \cdot \mathbf{D}_i(u, v) \\ v \cdot \mathbf{D}_i(u, v) \\ \mathbf{D}_i(u, v) \end{bmatrix}, \quad \forall i \in \{1, 2, \dots, N\} \]

To normalize large coordinate variations, we apply a norm scale factor:

\[ \mathbf{P}_i(u, v) = \frac{\mathbf{P}_i(u, v)}{\frac{1}{N \cdot H \cdot W} \sum_{i=1}^{N} \sum_{u=1}^{W} \sum_{v=1}^{H} \|\mathbf{P}_i(u, v)\|} \]
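For concreteness, a minimal NumPy sketch of the two equations above, assuming per-frame metric depth maps, a shared intrinsic matrix, and cam-to-world poses with the first frame fixed to the identity (names and shapes are illustrative, not our training code):

```python
import numpy as np

def build_global_pointmaps(depths, K, poses):
    """Backproject per-frame depth into the first-frame (world) coordinates and normalize.

    depths: (N, H, W) metric depth; K: (3, 3) intrinsics;
    poses:  (N, 4, 4) cam-to-world matrices with poses[0] == identity.
    Returns pointmaps of shape (N, H, W, 3).
    """
    N, H, W = depths.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                   # pixel grid (x, y)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)                 # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                  # K^{-1} [u, v, 1]^T per pixel
    pts_cam = rays[None] * depths[..., None]                         # camera-frame points (N, H, W, 3)
    pts_h = np.concatenate([pts_cam, np.ones_like(depths)[..., None]], axis=-1)
    pts_world = np.einsum('nij,nhwj->nhwi', poses, pts_h)[..., :3]   # apply T_i

    # Normalize by the mean point norm over all frames and pixels (the norm scale factor).
    scale = np.linalg.norm(pts_world, axis=-1).mean()
    return pts_world / scale
```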

For training, we observe that the $L1$ loss lacks sensitivity to small residuals, while the $L2$ loss over-penalizes outliers. Instead, we adopt the Huber loss:

\[ \mathcal{L}_{\text{rec}}(\mathbf{\hat{P}}, \mathbf{P}) = \begin{cases} 0.5 \cdot \|\mathbf{\hat{P}} - \mathbf{P}\|_2^2, & \text{if } \|\mathbf{\hat{P}} - \mathbf{P}\|_2 < \beta \\ \|\mathbf{\hat{P}} - \mathbf{P}\|_1 - 0.5 \cdot \beta, & \text{otherwise} \end{cases} \]

We calculate $\mathcal{L}_{\text{rec}}$ only on valid depth points, masking out infinite depth values (e.g., sky regions). The final VAE training loss is:

\[ \mathcal{L}_{\{\mathcal{E}_{\text{XYZ}},\mathcal{D}_{\text{XYZ}}\}}=\mathcal{L}_{\text{rec}} + \lambda_{\text{KL}}\mathcal{L}_{\text{KL}} \]
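A compact PyTorch sketch of this training loss, assuming a Gaussian VAE posterior parameterized by mu and logvar; the smooth-L1 form with threshold beta matches the Huber loss above, and the mask drops pixels with invalid depth. The beta and lambda_kl defaults here are illustrative, not the values used in our experiments:

```python
import torch
import torch.nn.functional as F

def xyz_vae_loss(pred_pts, gt_pts, valid_mask, mu, logvar, beta=1.0, lambda_kl=1e-6):
    """Masked Huber (smooth-L1) reconstruction loss plus KL regularization.

    pred_pts, gt_pts: (N, H, W, 3) pointmaps; valid_mask: (N, H, W) bool,
    False where depth is invalid (e.g., sky); mu, logvar: posterior parameters.
    """
    rec = F.smooth_l1_loss(pred_pts, gt_pts, beta=beta, reduction='none').sum(dim=-1)
    rec = (rec * valid_mask).sum() / valid_mask.sum().clamp(min=1)   # average over valid pixels
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL to standard normal
    return rec + lambda_kl * kl
```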

4D Geometry DiT

Once the XYZ VAE is trained, we leverage pretrained video diffusion models to denoise temporal pointmap latents in the same way as RGB video latents. We use a Transformer-based DiT instead of a UNet for its scalability and spatiotemporal attention.

To leverage spatiotemporal priors from pretrained video diffusion models, we fine-tune on pointmap latents using rectified flow:

\[ \textbf{H}_{XYZ}^t = t \textbf{H}_{XYZ} + (1-t) \epsilon, \quad \epsilon \sim \mathcal{N}(0,1) \]

The 4D DiT model $\mathcal{F}$ predicts the velocity $\boldsymbol{\nu}_\epsilon$:

\[ \mathbb{E}_{t,\mathbf{H}_{XYZ},\textbf{H}_{RGB},\epsilon} || \mathcal{F}(\textbf{H}_{XYZ}^t,\textbf{H}_{RGB},t)- \boldsymbol{\nu}_\epsilon||^2 \]

where the RGB video latent $\textbf{H}_{RGB}$ acts as an additional condition for denoising. Since the XYZ VAE is fine-tuned from the RGB VAE, we hypothesize that both representations share internal scene features, aiding transfer learning.
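As a sketch, one training step of this objective under the rectified-flow convention above, assuming a DiT that takes the noisy pointmap latent, the RGB condition latent, and the timestep; whether the condition is concatenated channel-wise or passed separately is left as an implementation detail here, and the velocity target $\textbf{H}_{XYZ} - \epsilon$ follows the standard parameterization for this interpolation:

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, h_xyz, h_rgb):
    """One rectified-flow training step on pointmap latents conditioned on RGB latents.

    h_xyz: clean pointmap latents; h_rgb: RGB video latents;
    model(h_t, h_rgb, t) -> predicted velocity, same shape as h_xyz.
    """
    b = h_xyz.shape[0]
    t = torch.rand(b, device=h_xyz.device)                       # t ~ U(0, 1)
    eps = torch.randn_like(h_xyz)                                 # Gaussian noise
    t_ = t.view(b, *([1] * (h_xyz.dim() - 1)))                    # broadcast over latent dims
    h_t = t_ * h_xyz + (1.0 - t_) * eps                           # H_XYZ^t = t H_XYZ + (1 - t) eps
    v_target = h_xyz - eps                                        # velocity of the interpolation
    v_pred = model(h_t, h_rgb, t)
    return F.mse_loss(v_pred, v_target)
```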

4D Pointmap Inference

At inference, we require only $\mathcal{E}_{\text{RGB}}$ and $\mathcal{D}_{\text{XYZ}}$. We sample random noise $\epsilon\sim \mathcal{N}(0,1)$ and concatenate it with the video latent $\textbf{H}_{RGB}=\mathcal{E}_{\text{RGB}}(\textbf{V})$. The denoised pointmap latent $\textbf{H}_{XYZ}$ is then decoded:

\[ \mathbf{\hat{P}}=\mathcal{D}_{\text{XYZ}} (\textbf{H}_{XYZ}) \]

Since our method processes all video frames at once, we capture global spatiotemporal dependencies, ensuring temporally consistent and spatially coherent 4D pointmaps.
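For illustration, a simple Euler sampler under the same convention (t = 0 is pure noise, t = 1 is data); the number of steps, the assumption that the XYZ latent shares the RGB latent's shape, and the way the condition is passed to the model are sketch-level assumptions rather than our exact inference settings:

```python
import torch

@torch.no_grad()
def infer_pointmaps(video, enc_rgb, dec_xyz, model, num_steps=25):
    """Feedforward 4D pointmap inference by integrating the learned velocity field.

    video: input RGB frames; enc_rgb / dec_xyz: RGB encoder and XYZ decoder;
    model(h_t, h_rgb, t) -> predicted velocity. num_steps is illustrative.
    """
    h_rgb = enc_rgb(video)                                  # condition latents
    h_t = torch.randn_like(h_rgb)                           # start from pure noise (t = 0)
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=h_rgb.device)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        t = torch.full((h_rgb.shape[0],), t0.item(), device=h_rgb.device)
        v = model(h_t, h_rgb, t)                            # velocity at the current time
        h_t = h_t + (t1 - t0) * v                           # Euler step toward the data (t = 1)
    return dec_xyz(h_t)                                     # decode pointmap latents to 4D pointmaps
```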

Inference pipeline

Post-Optimization for Downstream Tasks

Our 4D pointmaps $\mathbf{\hat{P}}$ naturally support various geometry tasks via simple post-optimization.

Intrinsic Estimation

Assuming all frames share the same camera intrinsics, we set the principal point to the image center:

\[ c_x={W}/{2}, \quad c_y={H}/{2} \]

Since the first frame is fixed as the coordinate frame, we optimize the focal length $f$ via Weiszfeld’s algorithm:

\[ \hat{f} = \arg \min_{f} \sum_{u=1}^{W} \sum_{v=1}^{H} \left\| (u-c_x,\, v-c_y) - f\, \frac{\left(\mathbf{\hat{P}}_1(u,v,0),\, \mathbf{\hat{P}}_1(u,v,1)\right)}{\mathbf{\hat{P}}_1(u,v,2)} \right\| \]
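A NumPy sketch of this estimate via iteratively reweighted least squares (the Weiszfeld update); since $\mathbf{T}_1=\mathbf{I}$, the first-frame pointmap is already in its own camera coordinates. The initialization and iteration count are illustrative:

```python
import numpy as np

def estimate_focal(P1, num_iters=10):
    """Robust focal length from the first-frame pointmap (assumed to have positive depth).

    P1: (H, W, 3) pointmap of frame 1 in its own camera coordinates.
    """
    H, W, _ = P1.shape
    u, v = np.meshgrid(np.arange(W) - W / 2.0, np.arange(H) - H / 2.0)     # centered pixels
    pix = np.stack([u, v], axis=-1).reshape(-1, 2)
    xy = (P1[..., :2] / np.clip(P1[..., 2:3], 1e-6, None)).reshape(-1, 2)  # (X/Z, Y/Z)

    f = np.median(np.linalg.norm(pix, axis=1) / (np.linalg.norm(xy, axis=1) + 1e-8))
    for _ in range(num_iters):
        res = np.linalg.norm(pix - f * xy, axis=1) + 1e-8     # per-pixel L1 residual
        w = 1.0 / res                                         # Weiszfeld reweighting
        f = (w * (pix * xy).sum(1)).sum() / (w * (xy * xy).sum(1)).sum()
    return f
```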

Camera Pose Estimation

Given estimated intrinsics and the fixed first-frame pose ($\mathbf{\hat{T}}_1=\mathbf{I}$), we infer the remaining camera poses using RANSAC PnP:

\[ \mathbf{\hat{T}}_i = \arg \min_{\mathbf{\hat{T}}_i} \sum_{u=1}^{W} \sum_{v=1}^{H} \left\| (u,v) - \pi \left( \mathbf{\hat{K}}\, \mathbf{\hat{T}}_i^{-1}\, \mathbf{\hat{P}}_i(u,v) \right) \right\|_2 \]

where $\mathbf{\hat{T}}_i \in \mathrm{SE}(3)$ for all frames $i$.
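One way to implement this step is OpenCV's RANSAC PnP solver; a sketch, assuming the intrinsics from the previous step and world-frame pointmaps (the reprojection threshold is illustrative). Since $\mathbf{\hat{T}}_i$ is cam-to-world, PnP returns its inverse:

```python
import cv2
import numpy as np

def estimate_pose(P_i, K):
    """Recover the cam-to-world pose of frame i from its world-frame pointmap via RANSAC PnP.

    P_i: (H, W, 3) predicted pointmap in world (first-frame) coordinates; K: (3, 3) intrinsics.
    """
    H, W, _ = P_i.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64), np.arange(H, dtype=np.float64))
    pix = np.stack([u, v], axis=-1).reshape(-1, 2)            # 2D pixel coordinates
    pts = P_i.reshape(-1, 3).astype(np.float64)               # corresponding 3D world points

    _, rvec, tvec, _ = cv2.solvePnPRansac(
        pts, pix, K.astype(np.float64), distCoeffs=None, reprojectionError=3.0)
    R, _ = cv2.Rodrigues(rvec)                                # world-to-camera rotation
    T_w2c = np.eye(4)
    T_w2c[:3, :3], T_w2c[:3, 3] = R, tvec.ravel()
    return np.linalg.inv(T_w2c)                               # cam-to-world pose T_i
```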

Video Depth Estimation

Depth maps are extracted by mapping the pointmaps back into each estimated camera frame and taking the z-component under the pinhole model:

\[ \mathbf{\hat{D}}_i(u,v) = \left[\, \mathbf{\hat{T}}_i^{-1}\, \mathbf{\hat{P}}_i(u,v) \,\right]_{z} \]
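A corresponding sketch, assuming the cam-to-world poses estimated above:

```python
import numpy as np

def pointmap_to_depth(P_i, T_i):
    """Per-frame depth as the z-component of the pointmap mapped back into camera i.

    P_i: (H, W, 3) world-frame pointmap; T_i: (4, 4) estimated cam-to-world pose.
    """
    pts_h = np.concatenate([P_i, np.ones_like(P_i[..., :1])], axis=-1)   # homogeneous points
    pts_cam = pts_h @ np.linalg.inv(T_i).T                               # world -> camera frame
    return pts_cam[..., 2]                                               # depth = Z coordinate
```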

Acknowledgements

We build our method on top of these awesome repositories:

Sincere thanks to the authors for their great work!