CSE 357, Fall 2019: Statistical Methods for Data Science

News:
09/02: Course schedule updated.
08/27: Lecture 1 slides have now been posted.
08/19: Our first lecture will be on Aug 27th (Tu) at 4pm in Frey 309.

When: Tue Thu, 4:00pm - 5:20pm
Where: Frey Hall 309

Instructor: Anshul Gandhi
Instructor Office Hours: Tue 1:30-2:30pm, Wed 3:30-4:30pm; location: 347, New CS building

Course TAs: Sai Sivalenka, Vamsikrishna K M, Rohit Patil

### Course Info

This undergraduate-level course covers the probability and statistics topics that data scientists need to analyze and interpret data. It combines theoretical topics with some programming assignments. The course is aimed primarily at junior and senior undergraduates who are comfortable with basic probability concepts and basic programming. Undergraduates from Computer Science, Applied Mathematics and Statistics, and Electrical and Computer Engineering are well suited to take this class. Topics covered include Probability Theory, Random Variables, Stochastic Processes, Statistical Inference, Hypothesis Testing, and Regression. The class is expected to be interactive, and students are encouraged to participate in class discussions.

Grading will be on a curve, and will tentatively be based on assignments, exams, and class participation. For more details, refer to the section on grading below.

### Syllabus & Schedule

Aug 27 (Tue)
[Lec 01]
Course introduction, class logistics
Aug 29 (Thu)
[Lec 02]
Probability review - 1
• Basics: sample space, outcomes, probability
• Events: mutually exclusive, independent
• Calculating probability: sets, counting, tree diagram
• AoS 1.1 - 1.5
MHB 3.1 - 3.4
Sep 03 (Tue)
[Lec 03]
Probability review - 2
• Conditional probability
• Law of total probability
• Bayes' theorem
• AoS 1.6, 1.7
MHB 3.3 - 3.6
assignment 1 out
Sep 05 (Thu)
[Lec 04]
Random variables - 1
• Mean, Moments, Variance
• pmf, pdf, cdf
• Bernoulli(p)
• Indicator RV
• Binomial(n, p)
• Geometric(p)
• AoS 2.1 - 2.3, 3.1 - 3.4
MHB 3.7 - 3.9
Python scripts:
draw_Bernoulli, draw_Binomial, draw_Geometric
Sep 10 (Tue)
[Lec 05]
Random variables - 2
• Uniform(a, b)
• Exponential(λ)
• Normal(μ, σ²), and several of its properties
• AoS 2.4, 3.1 - 3.4
MHB 3.7 - 3.9, 3.14.1
Python scripts:
draw_Uniform, draw_Exponential, draw_Normal
Sep 12 (Thu)
[Lec 06]
Random variables - 3
• Joint probability distribution
• Linearity and product of expectation
• Central Limit Theorem
• AoS 2.5 - 2.7, 5.3 - 5.4
MHB 3.10, 3.13, 3.14.2
assignment 1 due
assignment 2 out
Sep 17 (Tue)
[Lec 07]
Non-parametric inference - 1
• Basics of inference
• Empirical PMF
• Sample mean
• bias, se, MSE
• AoS 6.1, 6.2, 6.3.1
Python scripts:
sample_Bernoulli, sample_Binomial, sample_Geometric
Sep 19 (Thu)
[Lec 08]
Non-parametric inference - 2
• Empirical Distribution Function (or eCDF)
• Statistical Functionals
• Plug-in estimator
• AoS 6.3.1, 7.1 - 7.2
Python scripts:
sample_Uniform, sample_Exponential, sample_Normal, eCDF
Sep 24 (Tue)
[Lec 09]
Confidence intervals
• Percentiles, quantiles
• Normal-based confidence intervals
• DKW inequality
• AoS 6.3.2, 7.1
assignment 2 due
assignment 3 out
Required collisions.csv dataset for A3.
Sep 26 (Thu)
[Lec 10]
Parametric inference - 1
• Basics of parametric inference
• Method of Moments Estimator (MME)
• AoS 6.3.1 - 6.3.2, 9.1 - 9.2
Oct 01 (Tue)
[Lec 11]
Parametric inference - 2
• Method of Moments Estimator (MME)
• Properties of MME
• AoS 9.1 - 9.2
Oct 03 (Thu)
[Lec 12]
Parametric inference - 3
• Likelihood
• Maximum Likelihood Estimator (MLE)
• Properties of MLE
• Practice mid-term 1 solutions
• AoS 9.3 - 9.4, 9.6
assignment 3 due
assignment 4 out
Required datasets: Freedom, Generosity, Trust.
Oct 08 (Tue)
Mid-term 1
Oct 10 (Thu)
[Lec 13]
Python programming tutorial
Python scripts: basic.py, test_plot.py, matrix.py
Oct 15 (Tue)
Fall break: no class
Oct 17 (Thu)
[Lec 14]
Hypothesis testing - 1
• Basics of hypothesis testing
• The Wald test
• AoS 10 - 10.1
DSD 5.3 - 5.3.1
Oct 22 (Tue)
Instructor traveling: no class
Oct 24 (Thu)
Instructor traveling: no class
Oct 29 (Tue)
[Lec 15]
Hypothesis testing - 2
• The Wald test
• Type I and Type II errors
• AoS 10 - 10.1
assignment 4 due
Oct 31 (Thu)
[Lec 16]
Hypothesis testing - 3
• t-test
• Kolmogorov-Smirnov test (KS test)
• AoS 10.10.2, 15.4
DSD 5.3.2 - 5.3.3
Nov 05 (Tue)
[Lec 17]
Hypothesis testing - 4
• p-values
• Permutation test
• AoS 10.2, 10.5
DSD 5.5
assignment 5 out
Nov 07 (Thu)
[Lec 18]
Hypothesis testing - 5
• Pearson correlation coefficient
• Chi-square test for independence
• AoS 3.3, 10.3 - 10.4
DSD 2.3
Nov 12 (Tue)
[Lec 19]
Bayesian inference - 1
• Bayesian reasoning
• Bayesian inference
• AoS 11.1 - 11.2
DSD 5.6
Nov 14 (Thu)
[Lec 20]
Statistics in Medicine
Guest lecture by Dr. Shrivastava
Nov 19 (Tue)
[Lec 21]
Bayesian inference - 2
• Bayesian inference
• Conjugate priors
• AoS 11.1 - 11.2
DSD 5.6
assignment 5 due
assignment 6 out
Required datasets: q2_sigma3.dat, q2_sigma100.dat.
Nov 21 (Thu)
[Lec 22]
Regression - 1
• Basics of Regression
• Simple Linear Regression
• AoS 13.1, 13.3 - 13.4
DSD 9.1
Nov 26 (Tue)
[Lec 23]
Regression - 2
• Multiple Linear Regression
• AoS 13.5
DSD 9.1
Nov 28 (Thu)
Thanksgiving break: no class
Dec 03 (Tue)
Mid-term 2 review
assignment 6 due
Dec 05 (Thu)
Mid-term 2

### Resources

• Recommended text: (AoS) "All of Statistics: A Concise Course in Statistical Inference" by Larry Wasserman (Springer publication).
• Students are strongly encouraged to purchase a copy of this book.
• Recommended text: (MHB) "Performance Modeling and Design of Computer Systems: Queueing Theory in Action" by Mor Harchol-Balter (Cambridge University Press).
• Suggested for probability review and stochastic processes.
• A copy has been placed on reserve in the library. The instructor also has a few personal copies that you can borrow.
• Recommended text: (DSD) "The Data Science Design Manual" by (our very own) Steven Skiena (Springer publication).
• Suggested for data science topics in the second half of the course.

• Others:
• S.M. Ross, Introduction to Probability Models, Academic Press
• S.M. Ross, Stochastic Processes, Wiley

### Grading

• Assignments: 50%
• 6 assignments during the semester. Expect 5-7 questions per assignment, including some programming questions (after mid-term 1).
• Collaboration is allowed (max group size 3). Submit one solution per group.
• Assignments are due in class, at the beginning of the lecture. No late submissions will be accepted. Hard copies only, please.

• Exams: 45%
• Two in-class exams.
• Mid-term 1: 20%.
• Mid-term 2: 25%.
• Exams will be easier than the assignments, with no long derivations or programming questions.
• Make-up exams will only be given at the discretion of the instructor and only for medical emergencies, with required evidence.

• Class interaction: 5%
• The basic idea is to get you to talk in the class and contribute to discussions.
• By the end of the semester, if I can recognize you based on your contributions to the class discussion, you should get a good score on this.

• Important:
• Academic dishonesty will immediately result in an F, and the student will be referred to the Academic Judiciary. See the section on Academic Integrity below.
• Grading will be on a curve.
• Assignment of grades by the instructor will be final; no regrading requests will be entertained.
• There is a University policy on grading, as well as a set of grading guidelines agreed upon by the CS faculty. The instructor is obligated to uphold these policies.
• No exceptions will be made for any student, and no special circumstances will be entertained.
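
As a rough illustration of how the weights listed above combine (assignments 50%, mid-term 1 20%, mid-term 2 25%, class interaction 5%), the following minimal Python sketch computes a raw weighted percentage. The function name, the dictionary keys, and the example scores are hypothetical, and the sketch assumes each component has already been reduced to a single percentage. Since grading is on a curve, this raw number is only an input to the final grade, not the grade itself.

```python
# Weights taken from the grading breakdown above; they sum to 100%.
WEIGHTS = {
    "assignments": 0.50,
    "midterm_1": 0.20,
    "midterm_2": 0.25,
    "class_interaction": 0.05,
}


def weighted_score(scores):
    """Combine per-component percentages (0-100) into a raw course percentage.

    `scores` maps every key of WEIGHTS to a percentage. How the six
    assignments are averaged into one number is an assumption of this
    sketch, not stated course policy.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example = {
        "assignments": 80,
        "midterm_1": 70,
        "midterm_2": 90,
        "class_interaction": 100,
    }
    print(f"{weighted_score(example):.1f}")  # prints 81.5 (raw, before the curve)
```

With these example scores the components contribute 40 + 14 + 22.5 + 5 points respectively.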