EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries

Oct 1, 2023 · Jinjie Mai, Abdullah Hamdi, Silvio Giancola, Chen Zhao, Bernard Ghanem · 4 mins read
Figure: An overview of the VQ3D task.
Abstract
With the recent advances in video and 3D understanding, novel 4D spatio-temporal methods fusing both concepts have emerged. Towards this direction, the Ego4D Episodic Memory Benchmark proposed a task for Visual Queries with 3D Localization (VQ3D). Given an egocentric video clip and an image crop depicting a query object, the goal is to localize the 3D position of the center of that query object with respect to the camera pose of a query frame. Current methods tackle VQ3D by unprojecting the 2D localization results of the sibling task, Visual Queries with 2D Localization (VQ2D), into 3D predictions. However, we point out that the low number of camera poses recovered by the camera re-localization step of previous VQ3D methods severely hinders their overall success rate. In this work, we formalize a pipeline (which we dub EgoLoc) that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos. Our approach estimates more robust camera poses and aggregates multi-view 3D displacements weighted by the 2D detection confidence, which increases the fraction of successfully answered queries and leads to a significant improvement over the VQ3D baseline. Specifically, our approach achieves an overall success rate of up to 87.12%, setting a new state of the art on the VQ3D task. We provide a comprehensive empirical analysis of the VQ3D task and existing solutions, and highlight the remaining challenges in VQ3D.
Publication
IEEE/CVF International Conference on Computer Vision

Overview

Overview of VQ3D Task

We formalize the VQ3D task as defined in the Ego4D Episodic Memory Benchmark. Given an egocentric video $\mathcal{V}$, a query object $o$ defined by a single visual crop $v$, and a query frame $q$, the objective is to estimate the relative displacement vector $\Delta d = (\Delta x, \Delta y, \Delta z)$ defining the 3D location where the query object $o$ was last seen in the environment, with respect to the reference system defined by the 3D pose of the query frame $q$.
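
As a concrete reading of this definition, a minimal sketch of one query's inputs and expected output is given below. The names and types are illustrative placeholders, not an official Ego4D API.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class VQ3DQuery:
    """One VQ3D query (illustrative structure, not the benchmark's API)."""
    video_frames: List[np.ndarray]   # egocentric video V, one RGB array per frame
    visual_crop: np.ndarray          # image crop v depicting the query object o
    query_frame_idx: int             # index of the query frame q

# Expected output: the displacement (dx, dy, dz) of the object's last observed
# 3D position, expressed in the camera coordinate system of the query frame q.
```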

Method

To geometrically localize a given visual query in the video, we propose a multi-stage pipeline that entangles 2D information with 3D geometry.

First, we perform Structure from Motion (SfM), which estimates the 3D poses $\{T_0, ..., T_{N-1}\}$ for all the $N$ video frames $\{k_{0},...,k_{N-1}\}$.
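
As a hedged illustration of this step, the pose estimation could be driven by an off-the-shelf SfM engine such as COLMAP, roughly as sketched below. COLMAP is a stand-in here; the paper's exact pose-estimation recipe may differ, and all paths and parameters are placeholders.

```python
import subprocess
from pathlib import Path

def run_sfm(frames_dir: str, workspace: str) -> Path:
    """Estimate camera poses T_0..T_{N-1} for extracted video frames with COLMAP."""
    ws = Path(workspace)
    ws.mkdir(parents=True, exist_ok=True)
    database = ws / "database.db"
    sparse = ws / "sparse"
    sparse.mkdir(exist_ok=True)

    # 1) Detect and describe local features in every frame.
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", str(database),
                    "--image_path", frames_dir], check=True)
    # 2) Match features between frames (sequential matching suits video).
    subprocess.run(["colmap", "sequential_matcher",
                    "--database_path", str(database)], check=True)
    # 3) Incremental mapping: recovers per-frame camera poses and a sparse map.
    subprocess.run(["colmap", "mapper",
                    "--database_path", str(database),
                    "--image_path", frames_dir,
                    "--output_path", str(sparse)], check=True)
    return sparse  # contains cameras/images files holding the estimated poses
```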

Second, we feed the frames of the egocentric video $\mathcal{V}$ and the visual crop $v$ of the query object $o$ to a model that retrieves peak response frames $\{k_{p_0}, k_{p_1}, \dots\}$ with corresponding 2D bounding boxes $\{b_{p_0}, b_{p_1}, \dots\}$ of the query object $o$.
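
A minimal sketch of how peak response frames might be selected from per-frame detector outputs follows. The confidence threshold and the local-maximum rule are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def peak_response_frames(scores, boxes, score_thresh=0.5):
    """Select peak-response frames from per-frame detection results.

    scores[i] is the best detector confidence for the query object in frame i
    (0 if not detected) and boxes[i] the corresponding 2D bounding box.
    """
    scores = np.asarray(scores, dtype=float)
    peaks = []
    for i in range(len(scores)):
        if scores[i] < score_thresh:
            continue
        # Keep local maxima of the detection-confidence signal over time.
        left = scores[i - 1] if i > 0 else -np.inf
        right = scores[i + 1] if i < len(scores) - 1 else -np.inf
        if scores[i] >= left and scores[i] >= right:
            peaks.append((i, boxes[i], float(scores[i])))
    return peaks  # [(frame_index, bbox, confidence), ...]
```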

Finally, for each response frame $k_{p_i}$, we estimate the depth and back-project the object centroid to 3D using the estimated pose $T_{k_{p_i}}$. We recover the world 3D location $[\hat{x},\hat{y},\hat{z}]$ of the object by aggregating the per-response-frame predictions $[x_{p_i},y_{p_i},z_{p_i},s_{p_i}]$, where $s_{p_i}$ is the 2D detection confidence used to weight the aggregation. The final relative displacement vector $\Delta d$ is obtained by transforming this world location into the coordinate system of the query frame $q$ using its pose $T_q$.
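
The back-projection and confidence-weighted aggregation can be sketched as follows, assuming a pinhole intrinsic matrix $K$, camera-to-world poses for the response frames and the query frame, and a dense depth estimate per response frame. All of these details are assumptions of the sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def backproject_center(box, depth_map, K, T_wc):
    """Back-project the 2D box center of a response frame into world coordinates."""
    u = 0.5 * (box[0] + box[2])                        # box = (x1, y1, x2, y2)
    v = 0.5 * (box[1] + box[3])
    z = depth_map[int(v), int(u)]                      # metric depth at the box center
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])     # pixel -> normalized camera ray
    p_cam = np.append(z * ray, 1.0)                    # homogeneous point in camera frame
    return (T_wc @ p_cam)[:3]                          # 3D point in world coordinates

def localize(responses, T_q_wc):
    """Confidence-weighted aggregation and displacement w.r.t. the query frame.

    responses: [{"xyz_world": (3,), "score": float}, ...] from backproject_center.
    T_q_wc: 4x4 camera-to-world pose of the query frame q.
    """
    pts = np.array([r["xyz_world"] for r in responses])   # (M, 3) world-frame points
    w = np.array([r["score"] for r in responses])         # 2D detection confidences
    xyz_world = (w[:, None] * pts).sum(0) / w.sum()       # confidence-weighted centroid
    T_cw_q = np.linalg.inv(T_q_wc)                        # world -> query-camera transform
    delta_d = (T_cw_q @ np.append(xyz_world, 1.0))[:3]    # displacement in query frame
    return delta_d
```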

Figure: Overview of our EgoLoc pipeline. Given an egocentric video and a visual query, we first perform Structure from Motion to estimate 3D poses for all frames. Then, we retrieve frames containing the query object with corresponding 2D bounding boxes. Finally, we estimate depth and back-project to 3D to recover the object’s location.