CSE 544, Spring 2020: Probability & Statistics for Data Science

News:
02/02: A1 is now up on BB, due 2/12 in class.
01/15: Welcome to CSE 544! Our first class will be on 1/27.

When: Mon Wed, 2:30pm - 3:50pm
Where: Engineering 143
Instructor: Anshul Gandhi
Instructor Office Hours: Tues, Thurs, 3-4pm
347, New CS building
Course TA and Graders: Supreeth Narasimhaswamy, Sayontan Ghosh
TA Office Hours: Fri, 12-1pm
109, New CS building

### Course Info

This graduate-level course covers the probability and statistics topics that data scientists need to analyze and interpret data. The course is part of the Data Science and Engineering Specialization and is targeted primarily at PhD and Master's students in the Computer Science Department. Topics covered include Probability Theory, Random Variables, Stochastic Processes, Statistical Inference, Hypothesis Testing, Regression, and Time Series Analysis. For more details, refer to the syllabus below.

The class is expected to be interactive and students are encouraged to participate in class discussions.

Grading will be on a curve, and will be based on assignments, exams, a semester-end mini project, and in-class quizzes. For more details, see the section on grading below.

### Syllabus & Schedule

Jan 27 (Mon)
[Lec 01]
Course introduction, class logistics
Jan 29 (Wed)
[Lec 02]
Probability review - 1
• Basics: sample space, outcomes, probability
• Events: mutually exclusive, independent
• Calculating probability: sets, counting, tree diagram
• AoS 1.1 - 1.5
MHB 3.1 - 3.4
Feb 03 (Mon)
[Lec 03]
Probability review - 2
• Conditional probability
• Law of total probability
• Bayes' theorem
• AoS 1.6, 1.7
MHB 3.3 - 3.6
assignment 1 out
Feb 05 (Wed)
[Lec 04]
Random variables - 1: Overview and Discrete RVs
• Discrete and Continuous RVs
• Mean, Moments, Variance
• pmf, pdf, cdf
• Discrete RVs: Bernoulli, Binomial, Geometric, Indicator
• AoS 2.1 - 2.3, 3.1 - 3.4
MHB 3.7 - 3.9
Python scripts:
draw_Bernoulli, draw_Binomial, draw_Geometric
Feb 10 (Mon)
[Lec 05]
Random variables - 2: Continuous RVs
• Uniform(a, b)
• Exponential(λ)
• Normal(μ, σ²), and its several properties
• AoS 2.4, 3.1 - 3.4
MHB 3.7 - 3.9, 3.14.1
Python scripts:
draw_Uniform, draw_Exponential, draw_Normal
Feb 12 (Wed)
[Lec 06]
Random variables - 3: Joint distributions & conditioning
• Joint probability distribution
• Linearity and product of expectation
• Conditional expectation
• AoS 2.5 - 2.8
MHB 3.10 - 3.13, 3.15
assignment 2 out
assignment 1 due
Feb 17 (Mon)
[Lec 07]
Probability Inequalities
• Weak Law of Large Numbers
• Central Limit Theorem
• AoS 5.3, 5.4
MHB 3.14.2, 5.2
Feb 19 (Wed)
[Lec 08]
Markov chains
• Stochastic processes
• Setting up Markov chains
• Balance equations
• AoS 23.1 - 23.3
MHB 8.1 - 8.7
Feb 24 (Mon)
[Lec 09]
Non-parametric inference - 1
• Basics of inference
• Simple examples
• Empirical PMF
• Sample mean
• bias, se, MSE
• AoS 6.1 - 6.2, 6.3.1
assignment 3 out
Required data: q2.dat, weather.dat
assignment 2 due
Feb 26 (Wed)
[Lec 10]
Non-parametric inference - 2
• Empirical Distribution Function (or eCDF)
• Kernel Density Estimation (KDE)
• Statistical Functionals
• Plug-in estimator
• AoS 7.1 - 7.2
Python scripts:
sample_Bernoulli, sample_Binomial, sample_Geometric,
sample_Uniform, sample_Exponential, sample_Normal, draw_eCDF
Mar 02 (Mon)
[Lec 11]
Confidence intervals
• Percentiles, quantiles
• Normal-based confidence intervals
• DKW inequality
• AoS 6.3.2, 7.1
Mar 04 (Wed)
[Lec 12]
Parametric inference - 1
• Consistency, Asymptotic Normality
• Basics of parametric inference
• Method of Moments Estimator (MME)
• AoS 6.3.1 - 6.3.2, 9.1 - 9.2
assignment 3 due
Mar 09 (Mon)
[Lec 13]
Parametric inference - 2
• Properties of MME
• Basics of MLE
• Maximum Likelihood Estimator (MLE)
• Properties of MLE
• AoS 9.3, 9.4, 9.6
Mar 11 (Wed) Mid-term 1 This will be in-class, closed notes, closed book.
Mar 16 (Mon) Spring Break No class. Stay safe and healthy.
Mar 18 (Wed) Spring Break No class. Stay safe and healthy.
Mar 23 (Mon) Extended Spring Break No class. Stay safe and healthy.
Mar 25 (Wed) Extended Spring Break No class. Stay safe and healthy.
Mar 30 (Mon)
[Lec 14]
Hypothesis testing - 1
• Basics of hypothesis testing
• Wald test
• AoS 10 - 10.1
DSD 5.3.1
assignment 4 out
Required data: acceleration, model, mpg, q6_X.dat, q6_Y.dat
Apr 01 (Wed)
[Lec 15]
Hypothesis testing - 2
• Type I and Type II errors
• Wald test
• AoS 10 - 10.1
DSD 5.3.1
Apr 06 (Mon)
[Lec 16]
Hypothesis testing - 3
• Z-test
• t-test
• AoS 10.10.2
DSD 5.3.2
Apr 08 (Wed)
[Lec 17]
Hypothesis testing - 4
• Kolmogorov-Smirnov test (KS test)
• p-values
• AoS 15.4, 10.2
DSD 5.3.3, 5.5
assignment 5 out
assignment 4 due on Friday (4/10) 2:30pm via google forms
Apr 13 (Mon)
[Lec 18]
Hypothesis testing - 5
• p-values
• Permutation test
• AoS 10.2, 10.5
DSD 5.5
Apr 15 (Wed)
[Lec 19]
Hypothesis testing - 6
• Pearson correlation coefficient
• Chi-square test for independence
• AoS 3.3, 10.3 - 10.4
DSD 2.3
Apr 20 (Mon)
[Lec 20]
Bayesian inference - 1
• Bayesian reasoning
• Bayesian inference
• AoS 11.1 - 11.2, 11.6
DSD 5.6
assignment 6 out
Required data: q2_sigma3.dat, q2_sigma100.dat, q4.dat, q5.csv
assignment 5 due by 2:30pm via google forms
Apr 22 (Wed)
[Lec 21]
Bayesian inference - 2
• Priors
• Conjugate priors
• AoS 11.1 - 11.2, 11.6
DSD 5.6
Apr 27 (Mon)
[Lec 22]
Regression - 1
• Basics of Regression
• Simple Linear Regression
• AoS 13.1, 13.3 - 13.4
DSD 9.1
Apr 29 (Wed)
[Lec 23]
Regression - 2
• Multiple Linear Regression
• AoS 13.5
DSD 9.1
assignment 6 due by 2:30pm via google forms
May 04 (Mon)
[Lec 24]
Time Series Analysis
• EWMA Time Series modeling
• AR Time Series modeling

### Resources

• Required text: (AoS) "All of Statistics : A Concise Course in Statistical Inference" by Larry Wasserman (Springer publication).
• Students are strongly encouraged to purchase a copy of this book.
• Recommended text: (MHB) "Performance Modeling and Design of Computer Systems: Queueing Theory in Action" by Mor Harchol-Balter (Cambridge University Press)
• Suggested for probability review and stochastic processes.
• There is a copy on reserve in the library. The instructor also has a few personal copies that you can borrow.
• Recommended text: (DSD) "The Data Science Design Manual" by (our very own) Steven Skiena (Springer publication).
• Suggested for data science topics in the second half of the course.

• Others:
• S.M. Ross, Introduction to Probability Models, Academic Press
• S.M. Ross, Stochastic Processes, Wiley

### Grading

• Assignments: 45%
• 5-6 assignments during the semester. Expect 6-8 questions per assignment, including some programming questions.
• Collaboration is allowed (max group size 5). Submit one solution per group.
• Assignments are due in class, at the beginning of the lecture. No late submissions allowed. Hard copies only, please.

• Exams: 40%
• Two in-class exams.
• Mid-term 1: 15%.
• Mid-term 2: 25%.
• Easier than the assignments, but the questions will be along the same lines.

• Mini group project: 10%
• One semester-end project to be done in groups. The project work is expected to begin around mid-term 2.
• Further details on the project will be provided in class around mid-March.

• In-class mini-quizzes: 5%
• Echo360-based in-class quizzes, roughly one per class.
• Will count as attendance as well.

• Important:
• Academic dishonesty will immediately result in an F, and the student will be referred to the Academic Judiciary. See the section on Academic Integrity below.
• Grading will be on a curve.
• Assignment of grades by the instructor will be final; no regrading requests will be entertained.
• There is a University policy on grading, as well as a set of grading guidelines agreed upon by the CS faculty.
No exceptions will be made for any student and no special circumstances will be entertained.
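
As a rough illustration of how the component weights listed above (45% assignments, 15% mid-term 1, 25% mid-term 2, 10% project, 5% quizzes) combine, here is a minimal Python sketch. It is not an official grade calculator: the function name, the dictionary keys, and the assumption that each component is scored as a percentage from 0 to 100 are all hypothetical. Because grading is on a curve, this raw weighted score does not map directly to a letter grade.

```python
# Weights copied from the grading breakdown above, as fractions of the total.
WEIGHTS = {
    "assignments": 0.45,
    "midterm_1": 0.15,
    "midterm_2": 0.25,
    "project": 0.10,
    "quizzes": 0.05,
}


def raw_weighted_score(scores):
    """Combine per-component percentages (0-100) into one raw score (0-100).

    `scores` is a hypothetical dict keyed like WEIGHTS; how each component
    percentage is actually computed is an assumption, not course policy.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up component percentages, for illustration only.
    example = {
        "assignments": 90,
        "midterm_1": 80,
        "midterm_2": 70,
        "project": 85,
        "quizzes": 100,
    }
    print(round(raw_weighted_score(example), 2))
```

The weights sum to 1, so a student who scores 100 on every component gets a raw score of 100. The curve is applied afterwards and is not modeled here.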