The MonoDepth model, which calculates depth from a single picture, was studied by researchers from the Technical University of Delft in the Netherlands. A single image necessitates Deep Neural Networks’ relying on visual cues, which necessitates awareness of the surroundings, a fundamentally assumption-laden process.
The current focus of monocular depth estimation research has been on feature visualization and attribution. This is useful, however, the researchers at Delft used a different approach:
We treat the neural network as a black box, only measuring the responses (in this case depth maps) to certain inputs. [..] We modify or disturb the images, for instance by adding conflicting visual cues, and look for a correlation in the resulting depth maps.
In other words, they manipulate pictures to observe how the model alters the depth map. This allows them to guess which characteristics the model uses to calculate depth.
Extensive study on human vision yields a broad number of potential DNN characteristics, including:
• Position in the image. Objects that are further away tend to be closer to the horizon. When resting on the ground, the objects also appear higher in the image.
• Occlusion. Objects that are closer occlude those that lie behind them. Occlusion provides information on depth order, but not distance.
• Texture density. Textured surfaces that are further away appear more fine-grained in the image.
• Linear perspective. Straight, parallel lines in the physical world appear to converge in the image.
• Apparent size of objects. Objects that are further away appear smaller.
• Shading and illumination. Surfaces appear brighter when their normal points towards a light source. Light is often assumed to come from above. Shading typically provides information on depth changes within a surface, rather than relative to other parts of the image.
• Focus blur. Objects that lie in front or behind the focal plane appear blurred.
• Aerial perspective. Very far away objects (kilometers) have less contrast and take on a blueish tint.



In their studies, the researchers discovered that MonoDepth estimates the depth of items based on their vertical location rather than their apparent size. This can be influenced by camera location – roll and pitch – causing the model to overestimate distance. Furthermore, MonoDepth is untrustworthy when confronted with items that were not part of its training set.
While this work is confined to a single DNN trained on a single dataset, it highlights the importance of profiling machine learning models. With the advancement of deep neural networks, experimenting with novel network frameworks such as 3D convolution, graph convolution, attentional mechanism, and knowledge distillation may provide superior outcomes.
How do neural networks see depth in single images?, Tom van Dijk, Guido de Croon
Published: May 2019
DOI: https://arxiv.org/pdf/1905.07005.pdf