# Language, trees, and geometry in neural networks Language is made of discrete structures, yet neural networks operate on continuous data: vectors in high-dimensional space. A successful language-processing network must translate this symbolic information into some kind of geometric representation—but in what form? Word embeddings provide two well-known examples: distance encodes semantic similarity, while certain directions correspond to polarities (e.g. male vs. female).

A fresh and interesting finding suggests a completely new form of representation. The syntactic structure of a sentence is an important piece of linguistic information. This structure may be represented as a tree, with nodes corresponding to sentence words. In A structural probe for discovering syntax in word representations, Hewitt and Manning demonstrate that various language-processing networks build geometric replicas of such syntax trees. Words are assigned places in a high-dimensional space, and the Euclidean distance between these sites translates to tree distance (through a transformation).

This finding, however, is accompanied with an intriguing dilemma. The relationship between tree distance and Euclidean distance is not a straight line. Hewitt and Manning discovered that tree distance equals the square of Euclidean distance. They wonder whether squaring distance is required and if there are alternative potential translations.

Squared-distance mappings of trees are particularly intuitive from a mathematical standpoint. Certain randomized tree embeddings will also follow an approximate squared-distance law. Furthermore, simply understanding the squared-distance connection enables us to provide a straightforward, clear description of the overall form of a tree embedding. A simple Pythagorean embedding into the vertices of a unit cube. A chain of four points also has a Pythagorean embedding into the vertices of a unit cube.

Any two Pythagorean embeddings of the same tree are isometric—and related by a rotation or reflection—since distances between all pairs of points are the same in both. So we may speak of the Pythagorean embedding of a tree, and this theorem tells us exactly what it looks like. Left: Assigning basis vectors to edges. Middle: two example embeddings. Right: Distance squared equals tree distance.

Many people have studied these embeddings to see what sort of information they might contain. In our terminology, the context embeddings approximate a Pythagorean embedding of a sentence’s dependency parse tree. That means we have a good idea—simply from the squared-distance property and Theorem 1.1—of the overall shape of the tree embedding.

To begin, the context embeddings must be transformed using a specific matrix B, known as a structural probe. However, the square of the Euclidean distance between the context embeddings of two words approximates the parse tree distance between the two words.

“Here is where the math in the previous section pays off. In our terminology, the context embeddings approximate a Pythagorean embedding of a sentence’s dependency parse tree. That means we have a good idea—simply from the squared-distance property and Theorem 1.1—of the overall shape of the tree embedding.”

The precise form is unknown because the embedding is only roughly Pythagorean. However, the disparity between the desired form and the actual shape has the potential to be quite fascinating. Systematic discrepancies between empirical embeddings and their mathematical idealization may give further information about how BERT analyzes language.

A visualization tool uses a sentence with associated dependency parse tree.  to examine these discrepancies. The program collects context embeddings from BERT for this sentence, which is then converted using Hewitt and Manning’s “structural probe” matrix, giving a collection of points in 1,024-dimensional space.
PCA is then used to project these points to two dimensions. Pairs of dots representing words with a dependency relation are linked to illustrate the underlying tree structure. The outcome for a sample text is shown below, along with PCA projections for the same data of an exact Pythagorean embedding for comparison. a) PCA view of a BERT parse tree embedding. b) an exact Pythagorean embedding. c) different random-branch embeddings. d) different embeddings where node positions are chosen independently at random. Mouse over nodes in any image to compare across embeddings.

Visualizing and Measuring the Geometry of BERT, Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, Martin Wattenberg

Published: June 2019
https://arxiv.org/abs/1906.02715v2