MathsDL-spring19
Topics course: Mathematics of Deep Learning, NYU, Spring 19. CSCI-GA 3033.
Logistics

Mondays, 7:10pm-9pm, CIWW 102

Tutoring Session with Parallel Curricula (optional): Fridays 11am-12:15pm, CIWW 101.

Piazza: signup here

Office Hours: Tuesdays 4:30pm-6:00pm, office 612, 60 5th Ave.
Instructors
Lecture Instructor: Joan Bruna (bruna@cims.nyu.edu)
Tutor (Parallel Curricula): Luca Venturi (lv800@nyu.edu)
Tutor (Parallel Curricula): Aaron Zweig (az831@nyu.edu)
Syllabus
This graduate-level topics course aims at offering a glimpse into the emerging mathematical questions around Deep Learning. In particular, we will focus on the different geometrical aspects surrounding these models, from input geometric stability priors to the geometry of optimization, generalization and learning. We will cover both the background and the current open problems.
Besides the lectures, we will also run a parallel curriculum (optional), following the Depth First Learning methodology. We will start with an inverse curriculum on the Neural ODE paper by Chen et al.
Detailed Syllabus

Introduction: the Curse of Dimensionality
 Part I: Geometry of Data
 Euclidean Geometry: transportation metrics, CNNs, scattering.
 Non-Euclidean Geometry: Graph Neural Networks.
 Unsupervised Learning under Geometric Priors (Implicit vs explicit models, microcanonical, transportation metrics).
 Applications and Open Problems: adversarial examples, graph inference, inverse problems.
 Part II: Geometry of Optimization and Generalization
 Stochastic Optimization (Robbins & Monro, convergence of SGD)
 Stochastic Differential Equations (Fokker-Planck, Gradient Flow, Langevin Dynamics, links with SGD; open problems)
 Dynamics of Neural Network Optimization (Mean Field Models using Optimal Transport, Kernel Methods)
 Landscape of Deep Learning Optimization (Tensor/Matrix factorization, Deep Nets; open problems).
 Generalization in Deep Learning.
 Part III (time permitting): Open questions on Reinforcement Learning
Prerequisites
Multivariate Calculus, Linear Algebra, and Probability and Statistics at a solid undergraduate level.
Notions of Harmonic Analysis, Differential Geometry and Stochastic Calculus are nice to have, but not essential.
Grading
The course will be graded with a final project, consisting of an in-depth survey of a topic related to the syllabus, plus a participation grade. The detailed abstract of the project will be graded at the midterm.
The final project is due May 1st, by email to the instructors.
Lectures
| Week | Lecture Date | Topic | References |
| --- | --- | --- | --- |
| 1 | 1/28 | Guest Lecture: Arthur Szlam (Facebook) | References |
| 2 | 2/4 | Lec2 Euclidean Geometric Stability. Slides | References |
| 3 | 2/11 | Guest Lecture: Leon Bottou (Facebook/NYU). Slides | References |
| 4 | 2/18 | Lec3 Scattering Transforms and CNNs. Slides | References |
| 5 | 2/25 | Lec4 Non-Euclidean Geometric Stability. Gromov-Hausdorff distances. Graph Neural Nets. Slides | References |
| 6 | 3/4 | Lec5 Graph Neural Network Applications. Slides | References |
| 7 | 3/11 | Lec6 Unsupervised Learning under Geometric Priors. Implicit vs Explicit models. Optimal Transport models. Microcanonical Models. Open Problems. Slides | References |
| 8 | 3/18 | Spring Break | References |
| 9 | 3/25 | Lec7 Discrete vs Continuous Time Optimization. The Convex Case. Slides | References |
| 10 | 4/1 | Lec8 Discrete vs Continuous Time Optimization. Stochastic and Non-convex Case. Slides | References |
| 11 | 4/8 | Lec9 Gradient Descent on Non-convex Optimization. Slides | References |
| 12 | 4/15 | Lec10 Gradient Descent on Non-convex Optimization. Escaping Saddle Points Efficiently. Slides | References |
| 13 | 4/22 | Lec11 Landscape of Deep Learning Optimization. Spin Glasses, Kac-Rice, RKHS, Topology. Slides | References |
| 14 | 4/29 | Lec12 Guest Lecture: Behnam Neyshabur (IAS/NYU): Generalization in Deep Learning. Slides | References |
| 15 | 5/6 | Lec13 Stability. Open Problems. | References |
Lab sessions / Parallel Curricula
Distributional RL: Living document
 Class 1: Basics of RL and Q-learning
 Required Reading:
 Sutton and Barto (Ch 3, Ch 4, Ch 5, Ch 6.5)
 The standard introduction to RL. Focus in Chapter 3 on getting used to the notation we’ll use throughout the module, and on the introduction of the Bellman operator and fixed-point equations. In Chapter 4 the most important idea is value iteration (and Exercise 4.10 will ask you to show why iterating the Q function is basically the same algorithm).
 Chapter 5 considers using full rollouts to estimate our value/Q function, rather than the DP updates. Focus on the difference between on-policy and off-policy, which will be relevant to the final algorithm.
 Section 6.5 is an introduction to Q-learning in practice, updating one state-action pair at a time (without worrying about function approximation yet); a minimal tabular sketch follows this class's questions.
 Contraction Mapping Theorem (3.1)
 We’ll need the notion of contractions repeatedly throughout the module. Their essential property is a unique fixed point, and you should have a clear understanding of the constructive proof of this fixed point (don’t worry about the ODE applications).
 Questions:
 Exercise 3.14, Exercise 4.10 in S & B
 Prove that the Bellman operator is a contraction on Q functions with respect to the infinity norm.
 What is a sanity-check lower bound on the complexity of Q-learning? Why might this be infeasible for RL problems in the wild?
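To make the value-iteration / Q-learning connection from the readings concrete, here is a minimal tabular Q-learning sketch. It assumes a generic finite MDP given as arrays `P` (transition probabilities) and `R` (expected rewards); the random MDP, hyperparameters, and function name are illustrative choices, not taken from Sutton and Barto.

```python
import numpy as np

def q_learning(P, R, gamma=0.99, alpha=0.1, eps=0.1, episodes=500, horizon=100, seed=0):
    """Tabular Q-learning on a finite MDP.

    P : transition probabilities, shape [S, A, S] (each P[s, a] sums to 1)
    R : expected immediate rewards, shape [S, A]
    Returns the learned Q table of shape [S, A].
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(episodes):
        s = rng.integers(S)                               # arbitrary start state
        for _ in range(horizon):
            # epsilon-greedy behaviour policy (off-policy w.r.t. the greedy target)
            a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next = rng.choice(S, p=P[s, a])             # sample a transition
            # one-step update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (R[s, a] + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q

# illustrative random MDP with 5 states and 2 actions
rng = np.random.default_rng(1)
P = rng.random((5, 2, 5)); P /= P.sum(axis=-1, keepdims=True)
R = rng.random((5, 2))
print(q_learning(P, R))
```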
NeuralODE: Living document
 Class 6: Neural ODEs
 Motivation: Let’s read the paper!
 Required Reading:
 Neural Ordinary Differential Equations (Chen et al.)
 Optional Reading:
 A follow-up paper by the authors on scalable continuous normalizing flows: Free-form Continuous Dynamics for Scalable Reversible Generative Models
 Class 5: The adjoint method (and autodiff)
 Motivation: The adjoint method is a numerical technique for efficiently computing the gradient of a function in numerical optimization problems. Understanding this method is essential to understanding how ‘continuous-depth’ nets are trained. We also review the basics of Automatic Differentiation, which will help us understand the efficiency of the algorithm proposed in the NeuralODE paper.
 Required Reading:
 Section 8.7 from Computational Science and Engineering (CSE)
 Sections 2,3 from Automatic Differentiation in Machine Learning: a Survey
 Optional Reading:
 Questions:
 Exercises 1,2,3 from Section 8.7 of CSE
 Consider the problem of optimizing a real-valued function g over the solution of the ODE y' = Ay, y(0) = y_0, at time T > 0: min_{y_0, A} g(y(T)). What is the solution of the adjoint equation? (A numerical sketch for this linear case follows these questions.)
 How do you get eq. (14) in Section 8.7 of CSE?
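For the linear question above, one possible sanity check is to compute the adjoint-method gradient in closed form and compare it against finite differences. This is a sketch under the assumptions of the question (y' = Ay, loss g(y(T)) at time T, gradient taken with respect to y_0 only); the matrices, loss, and function names are illustrative.

```python
import numpy as np
from scipy.linalg import expm

def grad_wrt_y0(A, y0, grad_g, T=1.0):
    """Adjoint-method gradient of g(y(T)) w.r.t. y0 for the linear ODE y' = A y.

    Forward pass:  y(T) = expm(A*T) @ y0.
    Adjoint ODE:   a'(t) = -A^T a(t),  with terminal condition a(T) = grad_g(y(T)).
    Its value at t = 0 is a(0) = expm(A^T * T) @ grad_g(y(T)) = d g(y(T)) / d y0.
    """
    yT = expm(A * T) @ y0
    return expm(A.T * T) @ grad_g(yT)

# check against finite differences with g(y) = 0.5*||y||^2 (so grad_g(y) = y)
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
y0, T, eps = np.array([1.0, 0.0]), 1.0, 1e-6
g = lambda y: 0.5 * float(y @ y)
print(grad_wrt_y0(A, y0, grad_g=lambda y: y, T=T))
print(np.array([(g(expm(A * T) @ (y0 + eps * e)) - g(expm(A * T) @ (y0 - eps * e))) / (2 * eps)
                for e in np.eye(2)]))
```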
 Class 4: Normalizing Flow
 Motivation: In this class we take a little detour through the topic of Normalizing Flows. These are used for density estimation and generative modeling, and they give another model which can be seen as a time-discretization of its continuous-time counterpart.
 Required Reading:
 Optional Reading:
 Questions:
 In DE, what is the difference between t and t, i.e. what do they represent?
 In DE, why does eq. (4.2) imply convergence t as t ?
 What is the computational complexity of evaluating the determinant of an N x N matrix, and why is that relevant in this context? (See the sketch after these questions.)
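The determinant question refers to the change-of-variables formula log p_x(x) = log p_z(f(x)) + log |det ∂f/∂x| used by normalizing flows. Below is a minimal numerical sketch with an invertible affine map as the flow; the map, dimensions, and base density are illustrative assumptions, and the point is that a dense Jacobian determinant costs O(N^3), which structured flows are designed to avoid.

```python
import numpy as np

def flow_log_density(x, W, b):
    """log p_x(x) for the invertible affine flow z = f(x) = W @ x + b with a
    standard normal base density p_z, via the change of variables
        log p_x(x) = log p_z(f(x)) + log |det df/dx|.
    For a dense W the log-det term costs O(N^3); structured Jacobians
    (triangular, low-rank, ...) are what make practical flows tractable.
    """
    N = x.shape[0]
    z = W @ x + b
    log_base = -0.5 * (z @ z) - 0.5 * N * np.log(2 * np.pi)  # standard normal log-density
    _, logabsdet = np.linalg.slogdet(W)                      # Jacobian of an affine map is W
    return log_base + logabsdet

rng = np.random.default_rng(0)
N = 4
W = rng.standard_normal((N, N)) + 3 * np.eye(N)   # well-conditioned, invertible for the demo
print(flow_log_density(rng.standard_normal(N), W, np.zeros(N)))
```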
 Class 3: ResNets
 Motivation: The introduction of Residual Networks (ResNets) made it possible to train very deep networks. In this section we study some residual architecture variants and their properties. We then look into how ResNets approximate ODEs and how this interpretation can motivate neural net architectures and new training approaches; a minimal forward-Euler sketch follows this class's questions.
 Required Reading:
 ResNets: ResNets and An Overview of ResNet and its Variants
 ResNets and ODEs:
 Optional Reading:
 The original ResNets paper: Deep Residual Learning for Image Recognition
 Another blog post on ResNets: Understanding and Implementing Architectures of ResNet and ResNeXt for state-of-the-art Image Classification
 Invertible ResNets: The Reversible Residual Network: Backpropagation Without Storing Activations
 Stable Architectures for Deep Neural Networks
 Questions:
 Can you think of any other neural network architectures which can be seen as discretizations of some ODE?
 Do you understand why adding ‘residual layers’ should not degrade the network performance?
 How do the authors of (Multilevel […]) explain the phenomenon that residual networks retain almost the same performance when a layer is removed?
 Implement your favourite ResNet variant.
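A minimal sketch of the ResNet/ODE correspondence discussed above: a residual block x_{k+1} = x_k + h*f(x_k) is exactly one forward Euler step of x' = f(x), so stacking blocks amounts to integrating an ODE. The two-layer residual function, step size, and dimensions are illustrative assumptions.

```python
import numpy as np

def residual_block(x, W1, W2, h=0.1):
    """One residual block x_{k+1} = x_k + h * f(x_k): a forward Euler step of x' = f(x)."""
    return x + h * (W2 @ np.tanh(W1 @ x))

def resnet_forward(x, weights, h=0.1):
    """Stacking residual blocks = integrating the ODE with one Euler step per block."""
    for W1, W2 in weights:
        x = residual_block(x, W1, W2, h)
    return x

# illustrative weights: 20 blocks acting on an 8-dimensional input
rng = np.random.default_rng(0)
d = 8
weights = [(rng.standard_normal((d, d)) / np.sqrt(d),
            rng.standard_normal((d, d)) / np.sqrt(d)) for _ in range(20)]
print(resnet_forward(rng.standard_normal(d), weights))
```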
 Class 2: Numerical solution of ODEs II
 Motivation: In the previous class we introduced some simple schemes for numerically solving ODEs. In this class we go through some more involved schemes and their convergence analysis; an order-of-convergence sketch follows this class's questions.
 Required Reading:
 Runge-Kutta methods: Section 11.8 from NM or Sections 12.{5,12} from NA
 Multistep methods: Sections 12.6-9 from NA or Sections 11.5-6 from NM
 Systems of ODEs: Sections 11.9-10 from NM or Sections 12.10-11 from NA
 Optional Reading:
 Prof. Trefethen’s class ODEs and Nonlinear Dynamics 4.1
 Predictor-corrector methods: Section 11.7 from NM
 Richardson extrapolation: Section 16.4 from Numerical Recipes
 Automatic Selection of Methods for Solving Stiff and Nonstiff Systems of Ordinary Differential Equations
 Questions:
 From NA, Section 12: Exercises 12.11, 12.12, 12.19
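To see the convergence orders from the readings empirically, here is a small sketch comparing forward Euler with the classical Runge-Kutta (RK4) method on the test problem y' = -y; the test problem and step counts are arbitrary choices for illustration.

```python
import numpy as np

def euler_step(f, y, h):
    """Forward Euler: first-order accurate."""
    return y + h * f(y)

def rk4_step(f, y, h):
    """Classical Runge-Kutta: fourth-order accurate."""
    k1 = f(y)
    k2 = f(y + 0.5 * h * k1)
    k3 = f(y + 0.5 * h * k2)
    k4 = f(y + h * k3)
    return y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(step, f, y0, T, n):
    """Apply a one-step method n times on [0, T]."""
    h, y = T / n, y0
    for _ in range(n):
        y = step(f, y, h)
    return y

# Test problem y' = -y, y(0) = 1; the exact solution at T = 1 is exp(-1).
f, y0, T, exact = (lambda y: -y), 1.0, 1.0, np.exp(-1.0)
for n in (10, 20, 40):
    print(n,
          abs(integrate(euler_step, f, y0, T, n) - exact),  # halves as h halves (order 1)
          abs(integrate(rk4_step, f, y0, T, n) - exact))    # drops ~16x as h halves (order 4)
```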
 Class 1: Numerical solution of ODEs I
 Motivation: ODEs are used to mathematically model a number of natural processes and phenomena. The study of their numerical simulations is one of the main topics in numerical analysis and of fundamental importance in applied sciences.
 Required Reading:
 Sections 12.1-4 from An Introduction to Numerical Analysis (NA) or Sections 11.1-3 from Numerical Mathematics (NM)
 Optional Reading:
 Section 12.5 from NM
 Prof. Trefethen’s class ODEs and Nonlinear Dynamics 4.2
 Questions:
 From NM, Section 11.12: Exercise 1
 From NA, Section 12: Exercises 12.3,12.4, 12.7
 Consider the following method for solving y' = f(y):
y_{n+1} = y_n + h*(theta*f(y_n) + (1-theta)*f(y_{n+1}))
Assuming sufficient smoothness of y and f, for what value of 0 <= theta <= 1 is the truncation error the smallest? What does this mean about the accuracy of the method? (See the sketch below.)  Notebook
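A small numerical experiment for checking your answer to the last question: for the linear test problem y' = lam*y the implicit theta-method above can be solved for y_{n+1} in closed form, so one can watch how the error scales with h for a few values of theta. The test problem and step counts are illustrative.

```python
import numpy as np

def theta_method(lam, y0, T, n, theta):
    """y_{n+1} = y_n + h*(theta*f(y_n) + (1-theta)*f(y_{n+1})) for f(y) = lam*y.
    Because f is linear, the implicit update can be solved for y_{n+1} exactly."""
    h, y = T / n, y0
    factor = (1 + h * theta * lam) / (1 - h * (1 - theta) * lam)
    for _ in range(n):
        y *= factor
    return y

lam, y0, T = -1.0, 1.0, 1.0
exact = y0 * np.exp(lam * T)
for theta in (0.0, 0.5, 1.0):
    errs = [abs(theta_method(lam, y0, T, n, theta) - exact) for n in (10, 20, 40)]
    print(theta, errs)  # compare how fast each row of errors shrinks as h halves
```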