Abstract
Scaling has powered recent advances in vision foundation models; however, extending this paradigm to metric depth estimation remains
challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in cross-source 3D data. We introduce
Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources
without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse
Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from
sensor and camera biases. Using ∼20M image–depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000+ camera models,
we demonstrate, for the first time, a clear scaling trend for metric depth estimation. The pretrained model excels at prompt-driven tasks such
as depth completion, super-resolution, and radar–camera fusion, while its distilled prompt-free student achieves state-of-the-art results on
monocular depth estimation, camera intrinsics recovery, single- and multi-view metric 3D reconstruction, and VLA planning. We also show that
using the pretrained ViT of Metric Anything as a visual encoder significantly boosts the spatial-intelligence capabilities of Multimodal Large Language Models. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models,
establishing a new path toward scalable and efficient real-world metric perception.
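To make the Sparse Metric Prompt described above concrete, the following is a minimal sketch of how such a prompt could be formed by randomly masking a metric depth map; the function name, tensor shapes, and masking ratio are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch (assumption, not the reference implementation): build a
# Sparse Metric Prompt by randomly masking a metric depth map so that only
# a small fraction of valid metric values is exposed to the model.
import torch

def sparse_metric_prompt(depth: torch.Tensor, keep_ratio: float = 0.01):
    """depth: (H, W) metric depth in meters; 0 marks invalid pixels.

    Returns the sparsified depth map and the binary mask of kept pixels.
    """
    valid = depth > 0                               # pixels with a usable metric value
    keep = torch.rand_like(depth) < keep_ratio      # random subset to retain
    mask = valid & keep
    prompt = torch.where(mask, depth, torch.zeros_like(depth))
    return prompt, mask

# Example: a synthetic 480x640 depth map with ~1% of pixels kept as the prompt.
depth = torch.rand(480, 640) * 10.0                 # placeholder metric depth, 0-10 m
prompt, mask = sparse_metric_prompt(depth, keep_ratio=0.01)
print(prompt.shape, mask.float().mean().item())     # sparsity of the resulting prompt
```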