Depth and coordinates processing in neural networks

Researchers from Delft University of Technology in the Netherlands studied MonoDepth, a model that estimates depth from a single image. With only one image available, a deep neural network must rely on visual cues, and interpreting those cues requires assumptions about the scene; monocular depth estimation is therefore a fundamentally assumption-laden process.

Interpretability research on monocular depth estimation has so far focused on feature visualization and attribution. These techniques are useful, but the Delft researchers took a different approach:

We treat the neural network as a black box, only measuring the responses (in this case depth maps) to certain inputs. [..] We modify or disturb the images, for instance by adding conflicting visual cues, and look for a correlation in the resulting depth maps.

In other words, they perturb images and observe how the model's depth map changes. This lets them infer which visual cues the model relies on to estimate depth.
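Assuming the model can be queried freely as a black box, this probing loop can be sketched as below. The names `model`, `perturb`, and `region` are hypothetical placeholders for the reader's own network, image manipulation, and measurement region; this is not the paper's code.

```python
import numpy as np

def probe_depth_model(model, image, perturb, region):
    """Black-box probe: compare the model's depth map before and after
    a perturbation, averaged over a region of interest.

    model   : callable mapping an HxW image to an HxW depth map
    image   : HxW array
    perturb : callable returning a modified copy of the image
              (e.g. one that inserts a conflicting visual cue)
    region  : HxW boolean mask selecting the measurement region
    Returns the mean change in predicted depth inside the region.
    """
    base_depth = model(image)
    perturbed_depth = model(perturb(image))
    return float(perturbed_depth[region].mean() - base_depth[region].mean())
```

Running this for a family of perturbations (e.g. shifting an object's position while keeping its scale fixed) and looking for a correlation between the perturbation parameter and the returned depth change mirrors the paper's experimental setup.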

The extensive literature on human vision suggests a broad range of depth cues that a DNN might exploit, including:

Position in the image. Objects that are further away tend to be closer to the horizon. When resting on the ground, the objects also appear higher in the image.
Occlusion. Objects that are closer occlude those that lie behind them. Occlusion provides information on depth order, but not distance.
Texture density. Textured surfaces that are further away appear more fine-grained in the image.
Linear perspective. Straight, parallel lines in the physical world appear to converge in the image.
Apparent size of objects. Objects that are further away appear smaller.
Shading and illumination. Surfaces appear brighter when their normal points towards a light source. Light is often assumed to come from above. Shading typically provides information on depth changes within a surface, rather than relative to other parts of the image.
Focus blur. Objects that lie in front or behind the focal plane appear blurred.
Aerial perspective. Very far away objects (kilometers) have less contrast and take on a blueish tint.
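The first cue, position in the image, can be made quantitative under a pinhole-camera assumption: for a camera at height h above a flat ground plane, with focal length f in pixels, a ground point imaged y pixels below the horizon lies at distance Z = f * h / y. A minimal sketch under that assumption (the numbers are illustrative, not taken from the paper):

```python
def ground_distance(y_below_horizon_px, focal_px, cam_height_m):
    """Pinhole ground-plane cue: a ground point imaged y pixels below
    the horizon lies at distance Z = f * h / y from the camera."""
    if y_below_horizon_px <= 0:
        raise ValueError("point must lie below the horizon")
    return focal_px * cam_height_m / y_below_horizon_px

# Closer objects project lower in the image (larger y below the horizon):
# with f = 700 px and a camera 1.5 m above the ground,
# y = 100 px gives 10.5 m, while y = 50 px gives 21.0 m.
```

This is exactly why vertical position is such a strong cue for a network trained on road scenes: the camera height and ground plane are nearly constant across the training set.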

Figure: true object size H and position Y, Z in the camera frame; vertical image position y and apparent size h. Image coordinates (e.g. y) are measured from the center of the image.
Figure: example test images and resulting disparity maps. The white car on the left is inserted into the image at a relative distance of 1.0 (left column), 1.5 (middle column) and 3.0 (right column), where a distance of 1.0 corresponds to the same scale and position at which the car was cropped from its original image. In the top row, both the position and scale of the car vary with distance; in the middle row only the position changes and the scale is kept constant; and in the bottom row the scale is varied while the position remains constant. The measurement region from which the estimated distance is obtained is indicated by a white outline in the disparity maps.
Figure: objects do not need to have a familiar texture or shape to be detected. The distance towards these non-existent obstacles appears to be determined by the position of their lower extent.
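The insertion experiment described in the captions above follows from simple pinhole geometry: moving an object from relative distance 1.0 to distance d scales both its apparent size and its ground-contact offset below the horizon by 1/d. Varying only one of the two isolates each cue. A sketch under that assumption (the function name and numbers are illustrative):

```python
def placement_at_relative_distance(h0_px, y0_px, d):
    """For an object cropped at relative distance 1.0, with apparent
    height h0 and ground-contact point y0 pixels below the horizon,
    pinhole projection scales both quantities by 1/d when the object
    is placed at relative distance d.

    Returns (apparent_height_px, y_below_horizon_px)."""
    return h0_px / d, y0_px / d

# Consistent condition at d = 3.0: both size and position change.
# The position-only and scale-only rows of the figure keep one of the
# two returned values fixed at its d = 1.0 value instead.
h, y = placement_at_relative_distance(80.0, 120.0, 3.0)
```

If the model used apparent size, the scale-only row would change its estimate; the paper's finding is that only the position change does.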

In their experiments, the researchers found that MonoDepth estimates the distance to obstacles from their vertical position in the image rather than from their apparent size. This makes the estimates sensitive to camera pose: changes in roll and pitch shift objects relative to the expected horizon, causing the model to overestimate distance. Furthermore, MonoDepth is unreliable when confronted with objects that did not appear in its training set.
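The pitch sensitivity can be illustrated with a toy calculation, assuming the model relies purely on the ground-plane vertical-position cue with a fixed, learned horizon (the formula and numbers are illustrative, not the paper's measurement):

```python
import math

def biased_distance(y_px, focal_px, cam_height_m, pitch_rad):
    """Apparent distance from the vertical-position cue when the camera
    is pitched down by pitch_rad, but the model still assumes the horizon
    at its trained position. Pitching down shifts the image of the scene
    up by roughly f * tan(pitch), so an object sits closer to the assumed
    horizon and the inferred distance grows."""
    y_shifted = y_px - focal_px * math.tan(pitch_rad)
    if y_shifted <= 0:
        raise ValueError("object appears at or above the assumed horizon")
    return focal_px * cam_height_m / y_shifted

# f = 700 px, camera 1.5 m high, object truly 100 px below the horizon:
# a level camera gives 10.5 m, while a few degrees of pitch inflate it.
level = biased_distance(100, 700, 1.5, 0.0)
pitched = biased_distance(100, 700, 1.5, math.radians(3))
```

A model that instead inferred the horizon from the image content (e.g. from linear-perspective cues) would not show this bias, which is one way such probing distinguishes between candidate strategies.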

While this work is confined to a single DNN trained on a single dataset, it highlights the importance of probing the behavior of machine learning models rather than trusting them blindly. As deep neural networks advance, experimenting with newer architectural components such as 3D convolutions, graph convolutions, attention mechanisms, and knowledge distillation may yield better results.

How do neural networks see depth in single images?, Tom van Dijk, Guido de Croon

Published: May 2019
