MathsDLspring18
Topics course Mathematics of Deep Learning, NYU, Spring 18. CSCIGA 3033.
Logistics

Tuesdays from 7.10pm9pm. CIWW 102

Tutoring Session with Parallel Curricula (optional): Fridays 11am12:30pm CIWW 101. The first week only it is 10:30am12pm.

Piazza: signup here

Office Hours: Tuesdays 9:30am11:00am
Instructors
Lecture Instructor: Joan Bruna (bruna@cims.nyu.edu)
Tutor (Parallel Curricula): Cinjon Resnick (cinjon@nyu.edu)
Syllabus
This Graduatelevel topics course aims at offering a glimpse into the emerging mathematical questions around Deep Learning. In particular, we will focus on the different geometrical aspects surounding these models, from input geometric stability priors to the geometry of optimization, generalisation and learning. We will cover both the background and the current open problems.
Besides the lectures, we will also run a parallel curricula (optional), which, starting from a landmark recent DL paper (AlphaGo), will trace back the fundamentals of Dynammic Programming, Policy Learning and MonteCarlo Tree Search through the literature and lab materials.
Detailed Syllabus

Introduction: the Curse of Dimensionality
 Part I: Geometry of Data
 Euclidean Geometry: transportation metrics, CNNs , scattering.
 NonEuclidean Geometry: HausdorffGromov distances, Graph Neural Networks.
 Unsupervised Learning under Geometric Priors (Implicit vs explicit models, microcanonical, transportation metrics).
 Applications and Open Problems: adversarial examples, graph inference, inverse problems.
 Part II: Geometry of Optimization and Generalization
 Stochastic Optimization (Robbins & Munro, Convergence of SGD)
 Stochastic Differential Equations (FokkerPlank, Gradient Flow, Langevin Dynamics, links with SGD; open problems)
 Information Geometry and Optimal Transport (Amari, FisherRao metric, Wasserstein)
 Reproducing Kernel Hilbert Spaces
 Landscape of Deep Learning Optimization (Tensor/Matrix factorization, Deep Nets; open problems).
 Generalization in Deep Learning.
Prerequisites
Multivariate Calculus, Linear Algebra, Probability and Statistics at solid undergraduate level.
Notions of Harmonic Analysis, Differential Geometry and Stochastic Calculus are nicetohave, but not essential.
Grading
The course will be graded with a final project – consisting in an indepth survey of a topic related to the syllabus, plus a participation grade. The detailed abstract of the project will be graded at the midterm.
Final Project is due May 1st by email to the instructors
Lectures
Week  Lecture Date  Topic  References 

1  1/23  Lec1 Introduction: The Curse of Dimensionality in ML Slides  References 
2  1/30  Lec2 Euclidean Geometric Stability. Slides  References 
3  2/6  Guest Lecture: Leon Bottou (Facebook/NYU) Slides  References 
4  2/13  Lec3 Scattering Transforms and CNNs Slides  References 
5  2/20  Lec4 NonEuclidean Geometric Stability. GromovHausdorff distances. Graph Neural Nets Slides  References 
6  2/27  Lec5 Graph Neural Network Applications Slides  References 
7  3/6  Lec6 Unsupervised Learning under Geometric Priors. Implicit vs Explicit models. Optimal Transport models. Microcanonical Models. Open Problems Slides  References 
8  3/13  Spring Break  References 
9  3/20  Lec7 Discrete vs Continuous Time Optimization. The Convex Case. Slides  References 
10  3/27  Lec8 Discrete vs Continuous Time Optimization. Stochastic and Nonconvex case Slides  References 
11  4/3  Lec9 Gradient Descent on Nonconvex Optimization. Slides  References 
12  4/10  Lec10 Gradient Descent on Nonconvex Optimization. Escaping Saddle Points efficiently. Slides  References 
13  4/17  Lec11 Landscape of Deep Learning Optimization. Spin Glasses, KacRice, RKHS, Topology. Slides  References 
14  4/24  Lec12 Guest Lecture: Behnam Neyshabur (IAS/NYU): Generalization in Deep Learning Slides  References 
15  5/1  Lec13 Landscape of Deep Learning Optimization. Positive and Negative results. Open Problems. Slides  References 
Lab sessions / Parallel Curricula
DeepStack living document: https://goo.gl/zzMzoz
 Resources:
 Class 6: DeepStack
 Motivation: Let’s read the paper!
 Required Reading:
 Optional Reading:
 Questions:
 What are the differences between the approaches taken in DeepStack and in Libratus?
 Do you understand “Continual Resolving”?
 Do you understand AIVAT?
 Class 5: Counterfactual Regret Minimization #2
 Motivation: We saw last week the practical side of CFR and how effective it can be. This week we’ll be diving more into the theory underlying it. This will culminate with Blackwell’s Approachability Theorem, a generalization of repeated twoplayer zerosum games.
 Required:
 PLG: Section 7.37.7, 7.9
 Optional:
 Questions:
 Prove Lemma 7.1.
 It’s brushed over in the proof of Theorem 7.5 (PLG), but prove that if set S is approachable, then every halfspace H containing S is approachable.
 Class 4: Counterfactual Regret Minimization #1
 Motivation: Counterfactual Regret Minimization (CFR) is only a decade old but has already achieved huge success as the foundation underlying DeepStack and Libratus. In the first of two weeks dedicated to CFR, we learn how it works algorithmically.
 Required Reading:
 ICRM: 2.12.4, 3.13.4
 LT: 2.2
 Original Paper –> Regret Minimization in Games with Incomplete Information
 Optional Reading: The two below are CFR extensions used in DeepStack.
 Questions:
 What is the difference between internal regret, external regret, and counterfactual regret?
 Implement CFR (or CFR+ / CFRD) in your favorite programming language to play Leduc Poker or Liar’s Dice.
 How do you know if you’ve implemented CFR correctly?
 Class 3: ExtensiveForm Games
 Motivation: What happens when players don’t act simultaneously? ExtensiveForm Games are an answer to this question. While this representation of a game always has a comparable NormalForm, it’s much more natural to reason about in this format.
 Required Reading:
 MAS 5.1.{1,2,3}
 MAS 5.2.{1,2,3}
 Accelerating Best Response Calculation in Large Extensive Games –> Important for understanding how to evaluate Poker algorithms.
 Optional Reading:
 LT Section 2.1.2
 Questions:
 What is the intuition for why not all normal form games can be transformed into perfectform extensive games?
 How are the set of behavioral strategies different from the set of mixed strategies?
 Succinctly describe the technique demonstrated in the Accelerating Best Response paper.
 Class 2: Optimality and Equilibrium
 Motivation: How do you reason about games? The best strategy in multiagent scenario depends on the choices of others. Game theory deals with this problem by identifying subsets of outcomes called solution concepts, of which fundamental ones are the Nash Equilibrium, Pareto Optimality, and Correlated Equilibrium.
 Required Reading:
 MAS Sections 3.3, 3.4.5, 3.4.7, 4.1, 4.2.4, 4.3, 4.6
 LT Section 2.1.1
 Optional Reading:
 The rest of section 3.4 in MAS.
 Questions:
 Why must every game have a Pareto Optimal strategy?
 Why must there always exist at least one Pareto Optimal Strategy in which all players adopt pure strategies?
 Why in commonpayoff games do all Pareto optimal strategies have the same payoff?
 Why does definition 3.3.12 imply that the vertices of a simplex must all receive different labels?
 Why in definition 3.4.12 does it not matter that the mapping is to pure strategies rather than a mixed strategy?
 Take your favorite normalform game, find a Nash Equilibrium, and then find a corresponding Correlated Equilibrium.
 Class 1: NormalForm Games & Poker
 Motivation: NormalForm games are the backbone for many of the techniques that later were used in DeepStack and Libratus. Understanding them will be a necessary foundation to understanding the innovations they presented.
 Required Reading:
 MAS sections 3.1 & 3.2
 LT pages 57
 The Game of Poker (Supplementary #1 on pages 1617)
 Optional Reading:
 The State of Solving Large IncompleteInformation Games, and Application to Poker (2010)
 Why Poker is Difficult (very good video by Noam Brown, the main author on Libratus. The first 18 minutes are most relevant for now.)
 Questions:
 Prove that in a zerosum game, the nash equilibrium strategies are interchangeable. (LT)
 Prove that in a zerosum game, the expected payoff to each player is the same for every equilibrium. (LT)
 Can you prove Lemma 3.1.6?
 Can you prove Theorem 3.1.8 (which is a really cool result)?
AlphaGoZero living document: https://goo.gl/iFZ4XD
 Class 6: The Paper
 Motivation: Let’s read the paper!
 Required Reading:
 Optional Reading:
 Questions:
 What were the differences between “Mastering the Game of Go …” and “Thinking Fast and Slow …”?
 What was common to both “Mastering the Game of Go …” and “Thinking Fast and Slow …”?
 Class 5: MCTS & RL
 Motivation: Up to this point, we’ve learned a lot about how games can be solved and how RL works on a foundational level. Before we jump into the paper, one last foray contrasting and unifying the games vs learning perspective is worthwhile for understanding the domain more fully.
 Required Reading:
 Vodopivec:
 Connection between MCTS and RL → 3.13.4
 Integrating MCTS and RL → 4.14.3
 Why did TDGammon Work?
 Vodopivec:
 Optional Reading:
 Vodopivec: Survey of research inspired by both fields → 5.
 Questions:
 What are key differences between MCTS and RL?
 UCT can be described in RL terms as the following “The original UCT searches identically as an offline onpolicy everyvisit MC control algorithm that uses UCB1 as the policy.” What do each of these terms mean?
 What is a Representation Policy? Give an example not described in the text.
 What is a Control Policy? Give an example not described in the text.
 Class 4: MCTS & UCT
 Motivation: Monte Carlo Tree Search (MCTS) forms the backbone of AlphaGoZero. It’s what lets it reliably explore and then hone in on the best policy. UCT (UCB for Trees) builds on top of what we’ve been learning and, paired with MCTS, is integral to the training process.
 Required Reading:
 Sutton: Section 8.11
 Browne: Sections 2.2, 2.4, 3.13.5, 8.28.4.
 Silver Thesis: Sections 1.4.2 and 3.
 Optional Reading:
 Jess Hamrick on Browne.
 Original MCTS Paper.
 Original UCT Paper.
 Browne:
 4.8: MCTS applied to Stochastic or Imperfect Information Games.
 7.2, 7.3, 7.5, 7.7: Applications of MCTS.
 Questions:
 Can you detail each of the four parts of the MCTS algorithm?
 What characteristics make MCTS a good choice?
 What are examples of domain knowledge default policies in Go?
 Why is UCT optimal? Can you prove that the failure probability at the root converges to zero at a polynomial rate in the number of games simulated?
 Class 3: Policy and Value Functions
 Motivation: The Policy and Value Functions are at the core of Reinforcement Learning. The Policy function is the set of probabilities you give to each possible move. The Value function is your estimate of how good is the current state. In AlphaGoZero, a single network calculates both a value and a policy, then later updates its weights based off of the difference between those figures and the empirical results.
 Required Reading (note: Sutton from here out refers to the final version. Make sure it says COMPLETE DRAFT):
 Value Function:
 Sutton 3.5, 3.6, 3.7
 Sutton: 9.1, 9.2, 9.3 (important!)
 Policy Function:
 Sutton: 4.1, 4.2, 4.3
 Sutton: 13.1, 13.2 (important!), 13.3, 13.4
 Value Function:
 Optional Reading:
 Sergey Levine: Berkeley Fall’17: Policy Gradients → This is really good.
 Sergey Levine: Berkeley Fall’17: Value Functions → This is really good.
 Karpathy does Pong.
 David Silver on PG.
 David Silver on Value.
 Questions:
 Why does policy gradient have such high variance?
 What is the difference between offpolicy and onpolicy?
 Sutton:
 3.13: What is the Bellman equation for action values, that is, for qπ? …
 3.14: In the gridworld example … are the signs of these rewards important, or only the intervals between them? Prove …
 3.15: Now consider adding a constant c to all the rewards in an episodic task … would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.
 3.20: Give the Bellman equation for q∗ for the recycling robot.
 4.3: What are the equations analogous to (4.3), (4.4), and (4.5) for the actionvalue function qπ and its successive approximation by a sequence of functions q0, q1, q2, . . . ?
 4.6 (important!): How would policy iteration be defined for action values? Give a complete algorithm for computing q∗, analogous to that on page 65 for computing v∗. Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.
 13.2 (important!): Prove (13.7) using the definitions and elementary calculus.
 Class 2: MultiArmed Bandits and Upper Confidence Bounds
 Motivation: Bandits and UCB are key components of how MCTS was originally formalized. The node selection during the search is achieved through the UCB approach, which is analogues to how its performed in a multiarmed bandit scenario.
 Required Reading:
 Sutton: Sections 2.1  2.6 (Find on newclasses.nyu.edu in the class materials)
 Jeremy Kun: Optimizing in the Face of Uncertainty
 Optional Reading:
 Questions:
 Sutton Exercises 2.1, 2.2, 2.4, 2.5
 Sutton: What are the pros and cons of the optimistic initial values method?
 Kun: In the proof for the expected cumulative regret of UCB1, why is delta(T) a trivial regret bound if the deltas are all the same?
 Kun: Do you understand the argument for why the regret bound is O(sqrt(KTlog(T)))?
 Can you reproduce the UCB1 algorithm?
 Class 1: Minimax and Alpha Beta Pruning
 Motivation: These original core ideas did so much for the study of games. They spurred the field forward starting in the 50s and still to this day have mindshare in how to build a computer engine that beats games, including in popular chess engines like Stockfish.
 Required Reading:
 Cornell Recitation on Minimax & AB Pruning
 Knuth: 6 (Theorems 1&2, Corollaries 1&3).
 Optional Reading:
 Questions:
 (Knuth) Show that AlphaBetaMin(p, alpha, beta) = AlphaBetaMax(p, beta, alpha). (p. 300)
 (Knuth) For Theorem 1.(1), why are the successor positions of type 2? (p. 305)
 (Knuth) For Theorem 1.(2), why is it that p’s successor position is of type 3 if p is not terminal?
 (Knuth) For Theorem 1.(3), why is it that p’s successor positions are of type 2 if p is not terminal?
 (Knuth) Show that the subparts of Theorem 2, are correct.