3 million sample pairs, 2 million real-captured pairs: the data bottleneck in depth estimation has finally been broken.
(Source: Muck and Minds)
Muck and Minds Editorial Office
People who work on depth estimation and depth completion have probably all had a moment like this.
On a classic benchmark like NYU Depth V2, the model posts impressive numbers. But the moment you deploy the same model on a real robot, the problems show up immediately: depth-map edges blur, distant objects drift, and on reflective materials the model fails almost completely.
Your first reaction is often that there must be a bug in the implementation, so you trace things from the code all the way through the training pipeline. But in the end, you find the code is fine.
The issue is with the data.
In fact, this isn’t an isolated case—it’s one of the long-standing dilemmas in this field.
Academic progress in depth estimation and depth completion, to a certain extent, has been constrained by the “ceiling” of available datasets. Over the past decade-plus, the community has relied heavily on a few classic datasets: NYU Depth V2 focuses mainly on apartment and office scenes, with limited indoor coverage; KITTI targets autonomous driving and has solid outdoor road scenes, but it offers little direct value for embodied intelligence; ScanNet makes a huge contribution to indoor reconstruction, but its frame-sequence format was not designed for paired depth training; ETH3D and DIML each have their own focus, and their scale is not sufficient to support today’s large-model training needs.
Although synthetic datasets can help address data scarcity, there is a visible gap between rendered materials and real-world scenes. The depth priors a model learns on synthetic data often break down immediately when faced with real reflective metal, transparent glass, and complex textures.
Without large-scale real data, it is difficult to close this gap systematically. At the end of March, the situation finally began to change.
Antgroup LingBo completed a long-awaited move in this area: it has open-sourced about 3 million pairs of high-quality RGB–depth data in one go—LingBot-Depth-Dataset. Each sample includes an RGB image, the sensor’s original depth, and the corresponding ground-truth depth, providing a complete set of paired signals for training.
The entire dataset is 2.71TB in size, including about 2 million pairs of real captured RGB-D data and 1 million pairs of high-quality rendered data; for the real data portion, it covers 6 mainstream depth cameras on the market—Orbbec 335, 335L, RealSense D405, D415, D435, D455—to reproduce the real sensing distribution under different hardware conditions as much as possible.
The dataset is released under the CC BY-NC-SA 4.0 license, allowing free use and re-creation in academic and non-commercial scenarios.
ModelScope Model Hub Community: https://modelscope.cn/datasets/Robbyant/LingBot-Depth-Dataset
HuggingFace: https://huggingface.co/datasets/robbyant/mdm_depth
In fact, the dataset’s effectiveness has long been validated at the model level: LingBot-Depth, the embodied-intelligence perception model that Antgroup LingBo open-sourced in January this year, was trained on this dataset.
In terms of real-world performance, without changing hardware, LingBot-Depth can significantly improve the quality of depth outputs in scenes with complex materials such as transparency and reflections. Moreover, on two core metrics—depth accuracy and pixel coverage—it has already comprehensively outperformed the current top industrial RGB-D cameras on the market.
Against this backdrop, Antgroup LingBo chose to fully open-source this dataset, making the internally validated data available to the entire community.
LingBot-Depth, built on the LingBot-Depth-Dataset, can still output high-precision, metric-scale depth even in complex scenes where traditional depth sensors tend to fail
LingBot-Depth related links:
Hugging Face: https://huggingface.co/robbyant/lingbot-depth
ModelScope: https://modelscope.cn/models/robbyant/lingbot-depth
Tech Report: https://arxiv.org/abs/2601.17895
Why is the scale of real data so critical?
To understand the value of LingBot-Depth-Dataset, you first need to understand why real-captured depth data is so hard to obtain.
Data collection cost is the first hurdle. To capture high-quality RGB-D data, you must time-synchronize the RGB camera with the depth sensor and calibrate them spatially; calibration precision directly determines how well depth-map pixels align with the color image. Deploying multiple devices at scale and capturing systematically across many scenes is far more complex, in engineering terms, than ordinary video capture. Moreover, different conditions (strong light, weak light, reflective surfaces, transparent materials) affect sensor performance in markedly different ways and require targeted handling.
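The spatial-registration step described above, projecting depth-camera pixels into the RGB camera's pixel grid, can be sketched with a simple pinhole model. This is a minimal illustration, not the dataset's actual pipeline; the intrinsics `K_d`, `K_rgb` and the depth-to-RGB extrinsics `R`, `t` stand in for a real calibration.

```python
import numpy as np

def register_depth_to_rgb(depth, K_d, K_rgb, R, t, rgb_shape):
    """Reproject a depth map from the depth camera's frame into the RGB
    camera's pixel grid using pinhole intrinsics and a rigid depth->RGB
    extrinsic (R, t). Simplified nearest-pixel splatting; when several
    points land on one RGB pixel, the nearest one is kept."""
    fx, fy, cx, cy = K_d[0, 0], K_d[1, 1], K_d[0, 2], K_d[1, 2]

    # Back-project every valid depth pixel to 3D in the depth-camera frame.
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=0)            # shape (3, N)

    # Transform into the RGB camera frame and project with K_rgb.
    pts_rgb = R @ pts + t[:, None]
    z_rgb = pts_rgb[2]
    u_rgb = (K_rgb[0, 0] * pts_rgb[0] / z_rgb + K_rgb[0, 2]).round().astype(int)
    v_rgb = (K_rgb[1, 1] * pts_rgb[1] / z_rgb + K_rgb[1, 2]).round().astype(int)

    out = np.zeros(rgb_shape, dtype=depth.dtype)
    h, w = rgb_shape
    ok = (u_rgb >= 0) & (u_rgb < w) & (v_rgb >= 0) & (v_rgb < h) & (z_rgb > 0)
    order = np.argsort(-z_rgb[ok])               # write far points first, near last
    out[v_rgb[ok][order], u_rgb[ok][order]] = z_rgb[ok][order]
    return out
```

The point of the sketch is that any error in `R`, `t`, or the intrinsics shifts where each depth value lands on the color image, which is exactly why calibration precision dominates pixel alignment quality.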
Raw depth maps have inherent flaws. Depth images captured by structured-light and ToF sensors usually contain many invalid pixels (holes); near the edges there are flying pixels; and on reflective or transparent surfaces, the depth values fail. This means raw sensor depth maps cannot be used directly as training ground truth and require additional steps to generate dense, accurate ground-truth depth maps—and this processing itself is a technical challenge.
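The flaws listed above (holes, flying pixels near edges) are exactly what a cleanup pass must handle before raw sensor depth can be used for anything. A minimal sketch, assuming depth in meters with the usual zero-means-invalid convention; the thresholds are illustrative, not the dataset's actual processing parameters:

```python
import numpy as np

def clean_raw_depth(depth, min_m=0.1, max_m=10.0, edge_jump_m=0.3):
    """Heuristic cleanup of a raw sensor depth map (meters, 0 = invalid):
    drop out-of-range readings, then suppress 'flying pixels', i.e. points
    whose depth no valid 4-neighbour agrees with. Thresholds are
    illustrative; np.roll's wrap-around at borders is ignored here."""
    d = depth.astype(np.float32).copy()
    d[(d < min_m) | (d > max_m)] = 0.0           # holes / out-of-range readings

    valid = d > 0
    support = np.zeros_like(valid)
    for shift in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        nb = np.roll(d, shift, axis=(0, 1))
        nb_valid = np.roll(valid, shift, axis=(0, 1))
        # A neighbour 'supports' a pixel if both are valid and depths agree.
        support |= nb_valid & (np.abs(nb - d) < edge_jump_m)
    d[valid & ~support] = 0.0                    # no neighbour agrees: flying pixel
    return d
```

Note that this only discards bad pixels; turning the result into a dense, accurate ground-truth map still requires the heavier processing the article refers to.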
Obtaining ground-truth labels is hard. Unlike image classification, where human labeling can be used or weak supervision from networks can help, depth ground truth must rely on physical measurement or precise multi-sensor fusion. LiDAR can provide high-precision sparse point clouds, but it needs precise calibration with the camera and time synchronization. Structured-light systems have limited precision and are sensitive to lighting. Stereo matching can provide dense depth, but it often fails in texture-flat regions. No single solution is perfect; large-scale collection requires trade-offs among precision, cost, and coverage.
Copyright and willingness to open up are another implicit hurdle. The industry invests a lot of resources into large-scale data collection, but data is often viewed as a competitive moat rather than a public resource. Many teams have sizeable internal datasets but have never considered releasing them. This creates a peculiar situation: academia’s desire for data is far apart from the industry’s possession of data, and the datasets academia depends on are often side products created by some team years ago.
For all these reasons, large-scale real-world RGB-D datasets remain a scarce resource in the open-source community.
3 million RGB-D pairs: a step-change in scale
Antgroup LingBo has released 3 million RGB-D samples at once. In the current open-source community, this is already one of the largest real-world RGB-D datasets.
The entire dataset is not simply an aggregation of data. Instead, it was structurally designed around the real-world depth perception task, consisting of four subsets:
RobbyReal: 1,400,000 pairs of real indoor scene data captured with multiple devices, forming the core body of the dataset.
This portion covers 6 mainstream depth cameras on the market: Orbbec 335, 335L, RealSense D405, D415, D435, and D455. These devices differ significantly in measurement range, noise patterns, edge behavior, and responses to different materials. The point of this design is to bake cross-device variation into the training distribution from the start.
Conventional datasets are often tied to a single device. Models perform well on that device, but once transferred to other hardware environments, performance drops noticeably. By using multi-device data, the LingBot-Depth-Dataset lets the model encounter different sensor characteristics during training, thereby improving cross-device generalization.
For models that need to be deployed on robots, AR devices, or industrial systems in practice, this point directly determines their engineering usability.
Example from the RobbyReal dataset
RobbyVla: 580,960 pairs of data captured while a robot executes real vision–language–action (VLA) tasks.
Traditional depth datasets are typically captured by a person scanning the scene with a handheld camera: viewpoints are naturally continuous, and objects sit at mid-to-far range. Robot manipulation is different: target objects are often only 20–50 cm away; depth accuracy at object edges decides whether a grasp succeeds; and in tabletop scenes, lighting is complex and materials such as metal, glass, and transparent plastic are inherently hard to measure.
These characteristics give RobbyVla unique value that existing datasets cannot replace: it is depth data collected under real embodied task constraints, with a scene distribution highly aligned with robot learning tasks. For researchers who want to train spatial perception capabilities to serve manipulation tasks, this batch of data can directly reduce the loss from out-of-distribution generalization.
RobbyVla dataset examples
RobbySim: 999,264 pairs of simulated rendering data, generated from dual-camera viewpoints.
Single-camera rendering can easily introduce systematic viewpoint bias. The dual-camera setup introduces a stereo disparity constraint during generation, making the generated depth maps more reliable in terms of geometric consistency.
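The stereo constraint behind a dual-camera setup reduces to the standard triangulation relation Z = f·B/d for a rectified pair (focal length f in pixels, baseline B in meters, disparity d in pixels). A minimal conversion sketch; this is the textbook relation, not a claim about how RobbySim's renderer is implemented:

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Stereo triangulation for a rectified camera pair: Z = f * B / d.
    Zero or negative disparities are marked invalid (depth 0)."""
    d = np.asarray(disparity_px, dtype=np.float32)
    depth = np.zeros_like(d)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth
```

Because depth is inversely proportional to disparity, a depth map that is geometrically consistent between the two viewpoints must also produce consistent disparities, which is the cross-check a dual-camera setup buys over single-camera rendering.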
RobbySim dataset examples
The RobbySimVal validation set (38,976 pairs) provides a standardized simulated-scene evaluation benchmark, enabling researchers to quickly assess model performance in the simulation domain without consuming real data.
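For quick model checks against such a validation set, one could compute the metrics most commonly reported in depth estimation, AbsRel and the δ < 1.25 accuracy. These are the community's standard definitions; the dataset's official evaluation protocol may differ:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Standard monocular-depth metrics over valid GT pixels:
    AbsRel  = mean(|pred - gt| / gt)
    delta1  = fraction of pixels with max(pred/gt, gt/pred) < 1.25
    Pixels with gt <= eps are treated as unlabeled and skipped."""
    mask = gt > eps
    p, g = pred[mask], gt[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
    return {"abs_rel": abs_rel, "delta1": delta1}
```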
Example from the RobbySimVal validation set
Beyond its large quantity, Antgroup LingBo also set extremely high standards for dataset quality. From raw capture to ground-truth construction, LingBot-Depth-Dataset does not simply rely on sensor outputs; it performs systematic processing and calibration on the depth data.
Each sample includes one RGB image, the sensor’s raw depth map, and the ground-truth depth map.
By providing a complete paired signal of raw observations plus ground truth, the model can not only learn depth prediction, but also learn how to recover real structure from noisy data.
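As a sketch of how such a paired sample might be represented, and why the raw-plus-ground-truth pairing matters for training: the pixels that are invalid in the raw map but labeled in the ground truth are precisely where a completion model gets its supervision signal. The field names here are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DepthSample:
    """One paired sample as the article describes it: an RGB image, the
    sensor's raw depth, and a dense ground-truth depth map.
    Field names are hypothetical, not the released file layout."""
    rgb: np.ndarray        # (H, W, 3) uint8
    raw_depth: np.ndarray  # (H, W) float32, meters; 0 = invalid
    gt_depth: np.ndarray   # (H, W) float32, meters; dense

def supervision_targets(sample: DepthSample):
    """Return the dense target plus the 'completion mask': pixels the
    sensor missed (raw invalid) but the ground truth labels. Training on
    this mask teaches recovery of structure from noisy observations."""
    holes = (sample.raw_depth <= 0) & (sample.gt_depth > 0)
    return sample.gt_depth, holes
```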
At the same time, the labels follow unified standards, with strict controls on precision and consistency to avoid the training bias that label noise introduces. This matters especially for depth supervision: incorrect depth labels are often more destructive than no labels at all.
With such assurances in both quantity and quality, the value of LingBot-Depth-Dataset is no longer just a usable dataset—it starts to carry more fundamental significance.
In the past few years, industry attention has focused more on models: larger parameter scales, more complex architectures, and stronger inference capabilities. But an increasingly clear consensus is that the upper bound of model capability is being determined more and more by data. Especially as AI moves from language to the physical world, the importance of data is amplified: world models need environment data that can be interacted with; robots rely on long-tail and truly real scene distributions; and multimodal systems must align signals coming from different sensing channels. In this context, large-scale, high-quality, structured datasets are becoming a new core of competitive advantage.
The emergence of LingBot-Depth-Dataset, at its core, drives a deeper shift: depth perception is gradually moving from a lab problem dependent on ideal conditions toward an engineering problem that is practical and reusable.
Closing remarks
Depth estimation and depth completion have long been in an awkward position: downstream demand (robots, AR, autonomous driving) is growing fast, while the openness of foundational data resources lags far behind fields like visual recognition and NLP. More than a decade on, NYUv2 is still a standard evaluation set, less because it is inherently good enough than because nothing better has come along.
Just as ImageNet reshaped vision and simulation environments pushed autonomous driving forward, high-quality spatial perception data may be the gap that embodied intelligence has yet to fill, and LingBot-Depth-Dataset could well become the next-generation benchmark foundation for depth estimation and depth completion.
Open-sourcing may not immediately cause a performance explosion. But it is changing something more fundamental: we are finally starting to have depth data that is close enough to the real world.
Antgroup LingBo’s investment in open-sourcing this foundational infrastructure means research teams no longer need to collect data from scratch and can instead focus on higher-level problems.