Abstract
Scaling has powered recent advances in vision foundation models; however, extending this paradigm to metric depth estimation remains
challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in cross-source 3D data. We introduce
Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources
without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse
Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from
sensor and camera biases. Using ∼20M image–depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000+ camera models,
we demonstrate, for the first time, a clear scaling trend for metric depth estimation. The pretrained model excels at prompt-driven tasks such
as depth completion, super-resolution, and radar–camera fusion, while its distilled prompt-free student achieves state-of-the-art results on
monocular depth estimation, camera intrinsics recovery, single- and multi-view metric 3D reconstruction, and VLA planning. We also show that
using the pretrained ViT of Metric Anything as a visual encoder significantly boosts the spatial-intelligence capabilities of Multimodal Large Language Models. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models,
establishing a new path toward scalable and efficient real-world metric perception.
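To make the Sparse Metric Prompt described above concrete, the following is a minimal sketch of how such a prompt could be formed by randomly masking a metric depth map; the function name, tensor shapes, and masking ratio are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch (assumption, not the reference implementation): build a
# Sparse Metric Prompt by randomly masking a metric depth map so that only
# a small fraction of valid metric values is exposed to the model.
import torch

def sparse_metric_prompt(depth: torch.Tensor, keep_ratio: float = 0.01):
    """depth: (H, W) metric depth in meters; 0 marks invalid pixels.

    Returns the sparsified depth map and the binary mask of kept pixels.
    """
    valid = depth > 0                               # pixels with a usable metric value
    keep = torch.rand_like(depth) < keep_ratio      # random subset to retain
    mask = valid & keep
    prompt = torch.where(mask, depth, torch.zeros_like(depth))
    return prompt, mask

# Example: a synthetic 480x640 depth map with ~1% of pixels kept as the prompt.
depth = torch.rand(480, 640) * 10.0                 # placeholder metric depth, 0-10 m
prompt, mask = sparse_metric_prompt(depth, keep_ratio=0.01)
print(prompt.shape, mask.float().mean().item())     # sparsity of the resulting prompt
```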