Theoretical Machine Learning Seminar

A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs

Low-dimensional vector embeddings, computed using LSTMs or simpler techniques, are a popular approach for capturing the “meaning” of text and a form of unsupervised learning useful for downstream tasks. However, their power is not theoretically understood. This work derives formal understanding by looking at the subcase of linear embedding schemes. Using the theory of compressed sensing, we show that representations combining the constituent word vectors are essentially information-preserving linear measurements of Bag-of-n-Grams (BonG) representations of text. This leads to a new theoretical result about LSTMs: linear classifiers over low-dimensional embeddings derived from a low-memory LSTM are provably at least as powerful on classification tasks, up to small error, as linear classifiers over BonG vectors, a result that extensive empirical work has thus far been unable to show. Our experiments support these theoretical findings and establish strong, simple, and unsupervised baselines on standard benchmarks that in some cases are state of the art among word-level methods. We also show a surprising new property of embeddings such as GloVe and word2vec: they form a good sensing matrix for text that is more efficient than random matrices, the standard sparse recovery tool, which may explain why they lead to better representations in practice.

Joint work with Sanjeev Arora, Nikunj Saunshi, and Kiran Vodrahalli.
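The “linear measurement” view can be made concrete with a small demonstration. The sketch below (not the authors' code) uses a random Gaussian matrix as a stand-in for pretrained GloVe/word2vec vectors, a synthetic vocabulary and document, and scikit-learn's Lasso as a stand-in for the Basis Pursuit–style sparse recovery the theory is based on; all names and dimensions are illustrative assumptions. It shows that summing a document's word vectors is exactly a linear measurement of its bag-of-words vector, and that the word counts can then be recovered from the low-dimensional embedding by L1 minimization.

```python
# Minimal sketch: document embedding = linear measurement A^T x of the
# bag-of-words vector x; recover x from the embedding by sparse (L1) recovery.
# A random Gaussian matrix stands in for GloVe/word2vec word vectors.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

V, d = 2000, 300                              # vocabulary size, embedding dim (d << V)
A = rng.standard_normal((V, d)) / np.sqrt(d)  # rows are (stand-in) unit-norm word vectors

# A short "document": vocabulary indices of its words, with one repeat.
doc = [3, 17, 17, 42, 108, 999, 1500]
x = np.zeros(V)                               # bag-of-words (unigram BonG) count vector
for w in doc:
    x[w] += 1.0

z = A.T @ x                                   # document embedding = sum of its word vectors

# Sparse recovery: find a nonnegative, sparse x_hat with A^T x_hat ~= z,
# treating the d x V matrix A^T as the sensing/design matrix.
lasso = Lasso(alpha=1e-4, positive=True, fit_intercept=False, max_iter=50000)
lasso.fit(A.T, z)
x_hat = lasso.coef_

support = np.flatnonzero(x_hat > 0.5)
print("true word indices     :", sorted(set(doc)))
print("recovered word indices:", support.tolist())
print("recovered counts      :", np.round(x_hat[support], 2))
```

Per the abstract, replacing the random matrix with actual GloVe or word2vec vectors gives an even more efficient sensing matrix for text, which is part of what the talk explains.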

Date & Time

April 05, 2018 | 12:15pm – 1:45pm

Location

White-Levy Room

Speakers

Mikhail Khodak

Affiliation

Princeton University