
CS6780  Advanced Machine Learning
Spring 2019
Prof. Thorsten Joachims
Cornell University, Department of Computer Science & Department of Information Science



Time and Place
First lecture: January 29, 2019
Last meeting: May 7, 2019
Time: Tuesday/Thursday, 2:55pm  4:10pm
Room: Gates G01 / Bloomberg 91
Exam: April 25
Project Report: May 13


Course Description
This course gives a graduatelevel introduction to machine learning and indepth coverage of new and advanced methods in machine learning, as well as their underlying theory. It emphasizes approaches with practical relevance and discusses a number of recent applications of machine learning in areas like information retrieval, recommender systems, data mining, computer vision, natural language processing and robotics. An open research project is a major part of the course.
In particular, the course will cover the following topics:
 Supervised Batch Learning: model, decision theoretic foundation, model selection, model assessment, empirical risk minimization
 Decision Trees : TDIDT, attribute selection, pruning and overfitting
 Statistical Learning Theory : generalization error bounds, VC dimension
 LargeMargin Methods and Kernels: linear Rules, margin, Perceptron, SVMs, duality, nonlinear rules through kernels
 Deep Learning : multilayer perceptrons, deep networks, stochastic gradient
 Probabilistic Models: generative vs. discriminative, maximum likelihood, Bayesian inference
 Structured Output Prediction : undirected graphical models, structural SVMs, conditional random fields
 Latent Variable Models: kmeans clustering, mixture of Gaussians, expectationmaximization algorithm, matrix factorization, embeddings
 Online Learning : experts, bandits, online convex optimization
 Causal Inference : interventional vs. observational data, treatment effect estimation
The prerequisites for the class are: programming skills (at the level of CS 2110) and basic knowledge of linear algebra (at the level of MATH 2940) and probability theory (at the level of MATH 4710) and multivariable calculus (at the level of MATH 1920).
Enrollment is limited to PhD students.


Syllabus
 01/29: Introduction [slides] [slides 6up] [video]
 What is learning?
 What is machine learning used for?
 Overview of course, course policies, and contact info.
 01/31: Supervised Batch Learning [slides] [slides 6up] [video]
 Supervised learning for binary classification, multiclass classification, regression, and stuctured output prediction
 Instance space, target function, training examples
 Hypothesis space, consistency, and version space
 Classifying with a decision tree
 Representational power of decision trees
 TDIDT decision tree learning algorithm
 02/05: Empirical Risk Minimization [slides] [slides 6up] [video]
 Independently identically distributed (iid) data
 Risk, empirical risk, and Empirical Risk Minimization (ERM)
 Bayes decision rule and Bayes risk
 Model selection and overfitting
 02/07: Statistical Learning Theory 1: Generalization Error Bounds [slides] [slides 6up] [video]
 Model assessment (see slides from last lecture)
 Generalization error bound for finite H and zero error
 02/12: Statistical Learning Theory 2: Generalization Error Bounds [slides from last lecture] [video]
 Generalization error bound for finite H and nonzero error
 Probably Approximately Correct (PAC) learning
 Theoretical characterization of overfitting
 02/14: Linear Rules and Perceptron [slides] [slides 6up] [video]
 Linear classification rules
 Batch Perceptron learning algorithm
 Online Perceptron learning algorithm
 Margin of linear classifier
 Perceptron mistake bound
 02/19: Optimal Hyperplanes and SVMs [slides] [slides 6up] [video]
 Optimal hyperplanes and margins
 Hardmargin Support Vector Machine
 VapnikChervonenkis Dimension
 VC dimension of hyperplanes
 Symmetrization and error bound
 02/21: SoftMargin and Duality [slides] [slides 6up] [video]
 Softmargin Support Vector Machine
 Dual Perceptron
 Primal and dual SVM optimization problem
 Connection between primal and dual
 02/28: Kernels [slides] [slides 6up] [video]
 Bounds on the leaveoneout error of an SVM (see slides from last lecture)
 Input space and feature space
 Kernels as a method for nonlinear learning
 Kernels for learning with nonvectorial data
 03/05: Regularized Linear Models [slides] [slides 6up] [video]
 Conditional maximum likelihood
 Maximum a posteriori estimates
 Logistic regression
 Ridge regression
 03/07: Deep Network Models [slides] [slides 6up] [video]
 Multilayer perceptrons
 Relationship to kernels
 Stochastic gradient descent
 Convolution and pooling
 03/12: Generative Models for Classification [slides] [slides 6up] [video]
 Modeling the data generating process
 Multivariate naive Bayes classifier
 Linear discriminant analysis
 Multinomial naive Bayes classifier
 03/14: Generative Models for Sequence Prediction [slides] [slides 6up] [video]
 Hidden Markov Model (HMM)
 Estimation of HMM parameters
 Viterbi algorithm
 03/19: Discriminative Training for Structured Output Prediction [slides] [slides 6up] [video]
 Structural SVMs
 Conditional Random Fields
 Encoder/Decoder Networks
 03/21: Online Learning: Expert Setting [slides] [slides 6up] [video]
Expert online learning model
 Halving algorithm and optimal mistake bound
 Regret
 Weighted Majority algorithm
 03/26: Online Learning: Bandit Setting [slides] [slides 6up] [video]
 Exponentiated Gradient algorithm (see slides from last lecture)
 Adversarial Bandit model
 EXP3 algorithm
 Stochastic Bandit model
 UCB1 algorithm
 03/28: Clustering [slides] [slides 6up] [video]
 Unsupervised learning tasks
 Hierarchical agglomerative clustering
 kMeans
 04/09: Latent Variable Models [slides] [slides 6up] [video]
 Mixture of Gausssians
 Expectationmaximization (EM) algorithm
 Derivation of EM
 Latent variable problems beyond mixtures
 04/11: Recommender Systems [slides] [slides 6up] [video]
 Uses of recommender systems
 Content vs. collaborative recommendation
 Lowrank matrix completion
 Selection biases in preference data
 04/16: Counterfactual Evaluation and Learning [slides] [slides 6up] [video]
 Contextualbandit feedback
 Policy and utility
 Potential outcomes model
 Onpolicy vs. offpolicy estimators
 04/21: Counterfactual Evaluation and Learning [slides] [slides 6up] [video]
 Counterfactual Risk Minimization
 POEM
 BanditNet
 04/30: Learning to Rank [slides] [slides 6up] [video]
 Behavior and ranking metrics
 LTR with expert labels
 LTR with implicit feedback
 Connection to counterfactual evaluation and learning
 05/07: WrapUp [slides] [slides 6up] [video]
 Main themes of class
 What we did not cover
 Followup courses


Staff and Office Hours
Please use the CS6780 Piazza Forum as the primary channel for questions and discussions.
For peer feedback, we are using this CMT Instance for this course.
For grades, we are using CMS.
Office hours (join via Zoom):
 Mondays, 10:30am11:30am: Ashudeep Singh (Rhodes 402)
 Mondays, 1:30pm2:30pm: Aman Agarwal (Rhodes 412)
 Thursdays, 11:30am12:30pm: Ashudeep Singh (Rhodes 400)
 Fridays, 11:00am12:00pm: Thorsten Joachims (Gates 418)
 Fridays, 2:00pm3:00pm: Aman Agarwal (Rhodes 402)


Assignments and Exams
Homework assignments can be downloaded from CMS and they are submitted using the mechanism specified in the homework handout.
All assignments are due at the beginning of class on the due date. Assignments turned in late will be charged a 1 percentage point reduction of the cumulated final homework grade for each period of 24 hours for which the assignment is late. However, every student has a budget of 5 late days (i.e. 24 hour periods after the time the assignment was due) throughout the semester for which there is no late penalty. So, if you have perfect scores of 100 on all 4 homeworks and a total of 8 late days, your final homework score will be 97. No assignment will be accepted after the solution was made public, which is typically 35 days after the time it was due. You can submit late assignments via CMS.
Regrade requests can be submitted within 7 days after the grades have been made available using the mechanism specified in the homework handout.


Grading
This is a 4credit course. Grades will be determined based on a written exam, a research project, homework assignments, and class participation.
 40%: Exam
 35%: Research Project
 20%: Homework (~4 assignments)
 5%: Class Participation (e.g., lecture, piazza)
To eliminate outlier grades for homeworks, the lowest grade is replaced by the second lowest grade when grades are cumulated at the end of the semester.
All assignment, exam, and final grades (including + and  of that grade) are roughly on the following scale: A=92100; B=8288; C=7278; D=6068; F= below 60.
For the research project, we will use peer review analogous to how scientific papers are reviewed for conferences and journals. This means that you will read and comment on other students work, which provides input for the TA's and the professor to determine the project grades. The quality of your reviewing also becomes a component of your own grade.
Students taking the class S/U do all work, except for the project, and need to receive at least a grade equivalent to a D to pass the course.


Reference Material
The main textbooks for the class is:
 Kevin Murphy, "Machine Learning  a Probabilistic Perspective", MIT Press, 2012. (online via Cornell Library)
We will also read original research papers and other sources from the following list:
 Tom Mitchell, "Machine Learning", McGraw Hill, 1997.
 Cristianini, ShaweTaylor, "Introduction to Support Vector Machines", Cambridge University Press, 2000. (online via Cornell Library)
 Schoelkopf, Smola, "Learning with Kernels", MIT Press, 2001. (online)
 Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.
 Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.
 Devroye, Gyoerfi, Lugosi, "A Probabilistic Theory of Pattern Recognition", Springer, 1997.
 Duda, Hart, Stork, "Pattern Classification", Wiley, 2000.
 Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", Springer, 2001.
 Imbens, Rubin, Causal Inference for Statistical Social Science, Cambridge, 2015. (online via Cornell Library)
 Leeds Tutorial on HMMs (online)
 Manning, Schuetze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999. (online via Cornell Library)
 Manning, Raghavan, Schuetze, "Introduction to Information Retrieval", Cambridge, 2008. (online)
 Vapnik, "Statistical Learning Theory", Wiley, 1998.


Academic Integrity
This course follows the Cornell University Code of Academic Integrity. Each student in this course is expected to abide by the Cornell University Code of Academic Integrity. Any work submitted by a student in this course for academic credit will be the student's own work. Collaborations are allowed only if explicitly permitted. Violations of the rules (e.g. cheating, copying, nonapproved collaborations) will not be tolerated. Respectful, constructive and inclusive conduct is expected of all class participants.
