Theoretical Machine Learning Seminar

The current era of large-scale machine learning, powered by Deep Learning methods, has brought about tremendous advances, driven by the lightweight Stochastic Gradient Descent (SGD) method. Despite relying on a simple algorithmic primitive, this era presents a multitude of algorithmic challenges, including non-convexity, parallel/distributed computation, speeding up learning algorithms, and automating these (learning) procedures. Focusing on three of these issues, this talk will present a mix of theoretical results and their (qualitative) implications in practice.

The first part of this talk presents a precise understanding of mini-batch SGD, which is commonly used as a means to parallelize SGD. Our result indicates that mini-batch SGD offers linear parallelization speedups up to a certain (efficiently computable) problem-dependent batch size threshold, corroborating practical observations noted in the context of deep learning.
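
For concreteness, the following is a minimal sketch of the mini-batch SGD update, where the batch size is the knob that controls parallelism; the gradient oracle `grad_fn`, the data array, and the constant step size are illustrative placeholders rather than the talk's exact setting.

```python
import numpy as np

def minibatch_sgd(grad_fn, w0, data, batch_size, lr, steps, seed=0):
    """Run `steps` mini-batch SGD updates and return the final iterate.

    grad_fn(w, batch) -> average gradient of the loss over `batch`.
    A larger `batch_size` lets the gradient be computed in parallel; the
    talk's result characterizes the threshold batch size up to which this
    parallelism yields linear speedups.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(data)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # sample a mini-batch
        w -= lr * grad_fn(w, data[idx])                      # SGD step on its gradient
    return w
```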

This talk then considers momentum methods, which are commonly used in practice in conjunction with SGD. We present a counterpoint to the widely held belief that the momentum technique allows SGD to converge faster, especially when used with small batch sizes. We then present Accelerated SGD, an algorithm that provably converges faster than SGD across both small and large batch sizes. Empirically, Accelerated SGD is shown to improve over standard momentum when training practically useful deep learning models.
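
As a point of reference for the momentum discussion, here is a sketch of the standard (heavy-ball) momentum update that is typically paired with SGD; it is not the talk's Accelerated SGD algorithm, and the gradient oracle and hyperparameters are hypothetical.

```python
import numpy as np

def sgd_momentum(grad_fn, w0, data, batch_size, lr, beta, steps, seed=0):
    """Standard heavy-ball momentum: v <- beta*v + g, then w <- w - lr*v."""
    rng = np.random.default_rng(seed)
    w, v = w0.copy(), np.zeros_like(w0)
    n = len(data)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        g = grad_fn(w, data[idx])  # stochastic gradient on the mini-batch
        v = beta * v + g           # accumulate the momentum buffer
        w = w - lr * v             # take the momentum-augmented step
    return w
```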

The final part of this talk considers the learning rate schedules employed in conjunction with SGD. There is a stark disparity between the theoretically prescribed polynomially decaying stepsizes (with iterate averaging) and what is used in practice, namely variants of "cut the learning rate by a constant factor every constant number of epochs" (which resembles an exponentially decaying stepsize scheme). This talk presents reasons supporting the use of schemes that mimic the latter, both from a theoretical perspective and in terms of their implications for optimization in Deep Learning problems.
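
To make the contrast concrete, the snippet below sketches the two stepsize schemes in question; the specific constants (initial rate, decay power, cut factor, cut interval) are illustrative assumptions, not prescriptions from the talk.

```python
def poly_decay(eta0, t, power=0.5):
    """Theory-style polynomially decaying stepsize: eta_t = eta0 / (1 + t)^power."""
    return eta0 / (1.0 + t) ** power

def step_decay(eta0, epoch, factor=0.1, every=30):
    """Practice-style schedule: cut the rate by `factor` every `every` epochs;
    overall this resembles an exponentially decaying stepsize."""
    return eta0 * factor ** (epoch // every)
```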

Bio: Rahul Kidambi is a PhD candidate at the University of Washington, Seattle, working with Sham Kakade. He is interested in the design and implementation of provably effective algorithms for Machine Learning problems. Prior to UW, he was at Microsoft Research India, working on problems at the intersection of Structured Prediction, Semi-Supervised Learning, and Active Learning. More information can be found on his webpage: https://rahulkidambi.github.io/

Date & Time

February 13, 2019 | 12:15pm – 1:15pm

Location

Princeton University, CS 302

Speakers

Rahul Kidambi

Affiliation

University of Washington