Sora3R: Can Video Diffusion Model Reconstruct 4D Geometry?

Visualization
Note that our method does not output a dynamic or confidence mask by default. For a fair comparison, we apply the ground-truth background mask to each frame for all methods (Sora3R, DUSt3R, MonST3R), whenever the dataset provides one, to filter out background regions in all visualization results.
For completeness, MonST3R w dyn seg is the default visualization setting of MonST3R in its source repo, which uses its confidence-based dynamic segmentation mask to preserve static background regions instead of the per-frame ground-truth background mask.
Comparison on Sintel Dataset
[Videos: Sora3R (Ours), DUSt3R, MonST3R, MonST3R w dyn seg, source video]
Comparison on TUM-dynamic Dataset
[Videos: Sora3R (Ours), DUSt3R, MonST3R, MonST3R w dyn seg, source video]
Comparison on ScanNet Dataset
[Videos: Sora3R (Ours), DUSt3R, MonST3R, MonST3R w dyn seg, source video]

Additional Visualization
Besides the results on the evaluation datasets above, we also show additional results on the held-out test splits of our training datasets, covering unseen objects as well as indoor and outdoor videos.
To align with the camera pose visualization, fused 4D pointmaps are obtained by first estimating camera poses and depth maps from the raw pointmap predictions, then back-projecting them into the world frame.
RealEstate10K Dataset
[Two examples: source video, Sora3R (Ours), fused 4D pointmaps]
Objaverse Dataset
[Two examples: source video, Sora3R (Ours), fused 4D pointmaps]
TartanAir Dataset
[Two examples: source video, Sora3R (Ours), fused 4D pointmaps]
Dynamic Replica Dataset
[Two examples: source video, Sora3R (Ours), fused 4D pointmaps]
Point Odyssey Dataset
[Two examples: source video, Sora3R (Ours), fused 4D pointmaps]
Method
We describe our model design and training in Sec. \ref{sec:xyz_vae} and Sec. \ref{sec:4d_dit}, and detail model inference in Sec. \ref{sec:inference}. We also present a post-optimization stage that recovers intrinsics, extrinsics, and depth from the 4D pointmaps in Sec. \ref{sec:post_optimization}.
Temporal Pointmap Latent and VAE
Given a raw video input $\mathbf{V} \in \mathbb{R}^{N\times H\times W\times C}$, a pretrained temporal VAE with encoder $\mathcal{E}_{\text{RGB}}$ and decoder $\mathcal{D}_{\text{RGB}}$ models the video latent distribution.
A pointmap represents pixel-wise 3D coordinates per frame, establishing one-to-one pixel-point correspondences in the world frame. Unlike prior methods that freeze pretrained VAEs, we argue that fine-tuning is necessary when transferring from temporal RGB frames to temporal pointmaps: depth values can be highly unbounded, making a naive encoder ineffective. We therefore propose a temporal pointmap latent space that remains compatible with video latents while capturing 4D geometry.
To learn this representation, we fine-tune the RGB VAE $\{\mathcal{E}_{\text{RGB}},\mathcal{D}_{\text{RGB}}\}$ into an XYZ VAE $\{\mathcal{E}_{\text{XYZ}},\mathcal{D}_{\text{XYZ}}\}$ using known ground-truth camera poses $\{\mathbf{T}_i\}_{i=1}^N$, where $\mathbf{T}_i \in \mathrm{SE}(3)$. We always fix the first frame as the world coordinate frame, i.e., $\mathbf{T}_1=\mathbf{I}$. Given depth maps $\mathbf{D}$ and intrinsic matrix $\mathbf{K}$, the global pointmap is computed as:
\[ \mathbf{P}_i(u, v) = \mathbf{T}_i \cdot \mathbf{K}^{-1} \begin{bmatrix} u \cdot \mathbf{D}_i(u, v) \\ v \cdot \mathbf{D}_i(u, v) \\ \mathbf{D}_i(u, v) \end{bmatrix}, \quad \forall i \in \{1, 2, \dots, N\} \]
To normalize large coordinate variations, we apply a norm scale factor:
\[ \mathbf{P}_i(u, v) \leftarrow \frac{\mathbf{P}_i(u, v)}{\frac{1}{N \cdot H \cdot W} \sum_{i=1}^{N} \sum_{u=1}^{W} \sum_{v=1}^{H} \|\mathbf{P}_i(u, v)\|} \]
For training, we observe that the $L1$ loss lacks sensitivity, while the $L2$ loss over-penalizes outliers. Instead, we adopt the Huber loss:
\[ \mathcal{L}_{\text{rec}}(\mathbf{\hat{P}}, \mathbf{P}) = \begin{cases} 0.5 \cdot \|\mathbf{\hat{P}} - \mathbf{P}\|_2^2, & \text{if } \|\mathbf{\hat{P}} - \mathbf{P}\|_2 < \beta \\ \|\mathbf{\hat{P}} - \mathbf{P}\|_1 - 0.5 \cdot \beta, & \text{otherwise} \end{cases} \]
We compute $\mathcal{L}_{\text{rec}}$ only on valid depth points, masking out infinite depth values (e.g., sky regions). The final VAE training loss is:
\[ \mathcal{L}_{\{\mathcal{E}_{\text{XYZ}},\mathcal{D}_{\text{XYZ}}\}}=\mathcal{L}_{\text{rec}} + \lambda_{\text{KL}}\mathcal{L}_{\text{KL}} \]
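To make the construction concrete, below is a minimal PyTorch sketch of the global pointmap computation, the norm scale factor, and the masked Huber reconstruction loss. Tensor shapes and helper names (`build_pointmaps`, `huber_recon_loss`) are our own assumptions rather than the released code.

```python
# Minimal sketch of pointmap construction, normalization, and the masked Huber loss.
import torch

def build_pointmaps(depth, K, T):
    """depth: (N, H, W); K: (3, 3) intrinsics; T: (N, 4, 4) camera-to-world, T[0] = I."""
    N, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()    # (H, W, 3) homogeneous pixels
    cam = (pix @ K.inverse().T) * depth[..., None]                   # back-project: K^{-1} [uD, vD, D]
    cam_h = torch.cat([cam, torch.ones_like(cam[..., :1])], dim=-1)  # (N, H, W, 4)
    world = torch.einsum("nij,nhwj->nhwi", T, cam_h)[..., :3]        # apply T_i -> world frame
    scale = world.norm(dim=-1).mean()                                # norm scale factor
    return world / scale                                             # normalized pointmaps P_i

def huber_recon_loss(pred, gt, valid, beta=1.0):
    """Huber-style reconstruction loss over valid-depth points only."""
    err = (pred - gt).norm(dim=-1)
    loss = torch.where(err < beta, 0.5 * err ** 2, err - 0.5 * beta)
    return (loss * valid.float()).sum() / valid.float().sum().clamp(min=1)
```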
4D Geometry DiT
Once the XYZ VAE is trained, we leverage pretrained video diffusion models to denoise temporal pointmap latents in the same way as RGB video latents. We use a Transformer-based DiT instead of a UNet due to its scalability and spatiotemporal attention.
To leverage spatiotemporal priors from pretrained video diffusion models, we fine-tune on pointmap latents using rectified flow:
\[ \mathbf{H}_{XYZ}^t = t\, \mathbf{H}_{XYZ} + (1-t)\, \epsilon, \quad \epsilon \sim \mathcal{N}(0,1) \]
The 4D DiT model $\mathcal{F}$ predicts the velocity $\boldsymbol{\nu}_\epsilon$:
\[ \mathbb{E}_{t,\mathbf{H}_{XYZ},\mathbf{H}_{RGB},\epsilon} \left\| \mathcal{F}(\mathbf{H}_{XYZ}^t,\mathbf{H}_{RGB},t)- \boldsymbol{\nu}_\epsilon \right\|^2 \]
where the RGB video latent $\mathbf{H}_{RGB}$ acts as an additional condition for denoising. Since the XYZ VAE is fine-tuned from the RGB VAE, we hypothesize that both representations share internal scene features, aiding transfer learning.
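As a sketch of this objective, a single rectified-flow training step might look as follows; `dit` and the latent shapes are illustrative assumptions, and we adopt the velocity convention $\boldsymbol{\nu}_\epsilon = \mathbf{H}_{XYZ} - \epsilon$ implied by the interpolation above.

```python
# Minimal rectified-flow training step for the 4D DiT (names and shapes assumed).
import torch
import torch.nn.functional as F

def rf_training_step(dit, h_xyz, h_rgb):
    """h_xyz, h_rgb: pointmap / RGB video latents of shape (B, C, T, H, W)."""
    b = h_xyz.shape[0]
    t = torch.rand(b, device=h_xyz.device).view(b, 1, 1, 1, 1)  # random timesteps in [0, 1]
    eps = torch.randn_like(h_xyz)
    h_t = t * h_xyz + (1 - t) * eps                      # noisy latent H_XYZ^t
    v_target = h_xyz - eps                               # velocity implied by the interpolation
    v_pred = dit(h_t, cond=h_rgb, t=t.flatten())         # RGB latent as condition
    return F.mse_loss(v_pred, v_target)
```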
4D Pointmap Inference
At inference, we require only $\mathcal{E}_{\text{RGB}}$ and $\mathcal{D}_{\text{XYZ}}$ from the two VAEs. We sample random noise $\epsilon\sim \mathcal{N}(0,1)$ and concatenate it with the video latent $\mathbf{H}_{RGB}=\mathcal{E}_{\text{RGB}} (\mathbf{V})$. The denoised pointmap latent $\mathbf{H}_{XYZ}$ is then decoded:
\[ \mathbf{\hat{P}}=\mathcal{D}_{\text{XYZ}} (\mathbf{H}_{XYZ}) \]
Since our method processes all video frames in a single pass, it captures global spatiotemporal dependencies, yielding temporally consistent and spatially coherent 4D pointmaps.
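A rough sketch of this inference loop, assuming a simple Euler solver over the rectified flow, channel-wise concatenation of the noisy pointmap latent with the RGB latent, and matching latent shapes (function and argument names are ours, not the released API):

```python
# Sketch: encode RGB video, integrate the flow from noise to data, decode pointmaps.
import torch

@torch.no_grad()
def infer_pointmaps(video, enc_rgb, dec_xyz, dit, num_steps=25):
    h_rgb = enc_rgb(video)                       # RGB video latent H_RGB
    h = torch.randn_like(h_rgb)                  # start from Gaussian noise (t = 0)
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=h.device)
    for i in range(num_steps):                   # Euler steps toward t = 1 (data)
        t = ts[i].expand(h.shape[0])
        v = dit(torch.cat([h, h_rgb], dim=1), t=t)   # condition via channel concat
        h = h + (ts[i + 1] - ts[i]) * v
    return dec_xyz(h)                            # decoded 4D pointmaps P_hat
```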
Post-Optimization for Downstream Tasks
Our 4D pointmaps $\mathbf{\hat{P}}$ naturally support various geometry tasks via simple post-optimization.
Intrinsic Estimation
Assuming all frames share the same camera intrinsics, we set the principal point:
\[ c_x={W}/{2}, \quad c_y={H}/{2} \]
Since the first frame is fixed as the world coordinate frame, $\mathbf{P}_1$ is already expressed in its own camera frame, and we optimize the focal length $f$ via Weiszfeld's algorithm:
\[ \hat{f} = \arg \min_{f} \sum_{u=1}^{W} \sum_{v=1}^{H} \left\| (u-c_x, v-c_y) - f \, \frac{\left(\mathbf{P}_1(u,v,0), \mathbf{P}_1(u,v,1)\right)}{\mathbf{P}_1(u,v,2)} \right\| \]
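A compact sketch of this step as iteratively reweighted least squares (one common way to run Weiszfeld-style updates); `estimate_focal` and its initialization are our own assumptions:

```python
# Rough Weiszfeld-style (IRLS) focal estimate from the first-frame pointmap.
import torch

def estimate_focal(P1, num_iters=10):
    """P1: (H, W, 3) first-frame pointmap, already in its own camera frame (T_1 = I)."""
    H, W, _ = P1.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u - W / 2.0, v - H / 2.0], dim=-1).reshape(-1, 2)  # (u - c_x, v - c_y)
    ray = (P1[..., :2] / P1[..., 2:3].clamp(min=1e-6)).reshape(-1, 2)     # (X/Z, Y/Z)
    f = (pix * ray).sum() / (ray * ray).sum()                             # plain least-squares init
    for _ in range(num_iters):                                            # reweighted updates
        w = 1.0 / (pix - f * ray).norm(dim=-1).clamp(min=1e-6)
        f = (w * (pix * ray).sum(-1)).sum() / (w * (ray * ray).sum(-1)).sum()
    return f
```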
Camera Pose Estimation
Given the estimated intrinsics and the fixed first-frame pose ($\mathbf{\hat{T}}_1=\mathbf{I}$), we infer the remaining camera poses using RANSAC PnP:
\[ \mathbf{\hat{T}}_i = \arg \min_{\mathbf{\hat{T}}_i} \sum_{u=1}^{W} \sum_{v=1}^{H} \left\| (u,v) - \pi \left( \mathbf{K} \mathbf{\hat{T}}_i \mathbf{P}_i(u,v) \right) \right\|_2 \]
where $\mathbf{\hat{T}}_i \in \mathrm{SE}(3)$ for all frames $i$.
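In practice this per-frame optimization can be run with OpenCV's RANSAC PnP solver on the pixel-to-pointmap correspondences; the sketch below (array layouts and the world-to-camera output convention are our assumptions) illustrates the idea:

```python
# Per-frame pose from 2D-3D correspondences with OpenCV's RANSAC PnP solver.
import cv2
import numpy as np

def estimate_pose(P_i, K):
    """P_i: (H, W, 3) world-frame pointmap of frame i; K: (3, 3) intrinsics."""
    H, W, _ = P_i.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pts2d = np.stack([u, v], axis=-1).reshape(-1, 2).astype(np.float64)
    pts3d = P_i.reshape(-1, 3).astype(np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K.astype(np.float64), None, reprojectionError=3.0)
    assert ok, "PnP failed"
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)   # rotation from Rodrigues vector
    T[:3, 3] = tvec.ravel()
    return T                             # maps world points into camera i
```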
Video Depth Estimation
Depth maps are extracted via simple pinhole projection:
\[ \mathbf{\hat{D}}_i(u,v) = \left[ \mathbf{\hat{K}}\, \mathbf{\hat{T}}_i\, \mathbf{P}_i(u,v) \right]_z \]
i.e., depth is read off as the $z$-component of each pointmap entry after it is mapped into camera $i$ and projected through the pinhole model.
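A short NumPy sketch of this read-out, assuming $\mathbf{\hat{T}}_i$ is the world-to-camera pose estimated above (the intrinsics do not affect the $z$-component, so they are omitted here):

```python
# Depth as the z-component of each world point mapped into camera i's frame.
import numpy as np

def extract_depth(P_i, T_i):
    """P_i: (H, W, 3) world-frame pointmap; T_i: (4, 4) world-to-camera pose."""
    cam = P_i @ T_i[:3, :3].T + T_i[:3, 3]   # rotate + translate into camera frame
    return cam[..., 2]                        # per-pixel depth map D_i
```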
Acknowledgements
We build our method on top of these awesome repositories:
Sincere thanks to the authors for their great works!