CSE 390, Fall 2017: Probability & Statistics for Data Science

News:
10/29: Mid-term 2 date has been added (11/30).
10/18: Syllabus & Schedule updated.
09/19: Lecture 6 slides and py scripts posted.
09/14: Lecture 5 slides and py scripts posted.
09/12: Lecture 4 slides posted.
08/29: Lecture 1 slides posted.
08/05: Our first lecture will be on Aug 29th (Tues) at 4pm in Frey 205.


When: Tue Thu, 4:00pm - 5:20pm
Where: Frey Hall 205
Instructor: Anshul Gandhi
Instructor Office Hours: Tue 3-4pm and Thu 5:30-6:30pm (347, New CS building)
Course TAs: Caitao Zhan, Kunal Shah
TA Office Hours: By appointment (please email the TA(s) to schedule)

### Course Info

This undergraduate-level special topics course covers the probability and statistics topics that data scientists need to analyze and interpret data. The course will involve theoretical topics and some programming assignments. It is targeted primarily at junior and senior undergraduate students who are comfortable with concepts relating to probability and with basic programming. Undergraduates from Computer Science, Applied Mathematics and Statistics, and Electrical and Computer Engineering would be well suited to this class. Topics covered include Probability Theory, Random Variables, Stochastic Processes, Statistical Inference, Hypothesis Testing, Regression, Classification, and Clustering. For more details, refer to the syllabus below.

The class is expected to be interactive and students are encouraged to participate in class discussions.
Grading will be on a curve, and will tentatively be based on assignments, exams, and class participation. For more details, refer to the section on grading below.

### Syllabus & Schedule

Date Topic Readings Notes
Aug 29 (Tue)
[Lec 01]
Course introduction, class logistics
Aug 31 (Thu)
[Lec 02]
Probability review - 1
• Basics: sample space, outcomes, probability
• Events: mutually exclusive, independent
• Calculating probability: sets, counting, tree diagram
• AoS 1.1 - 1.6
MHB 3.1 - 3.5
Sep 05 (Tue) Labor Day No class
Sep 07 (Thu)
[Lec 03]
Probability review - 2
• Conditional probability
• Law of total probability
• Bayes' theorem
• AoS 1.7
MHB 3.6, 3.10 - 3.11
assignment 1 out
Sep 12 (Tue)
[Lec 04]
Random variables - 1: Overview
• Discrete and Continuous RVs
• Mean, Moments, Variance
• pmf, pdf, cdf
• AoS 2.1 - 2.3
MHB 3.7 - 3.9
Sep 14 (Thu)
[Lec 05]
Random variables - 2: Discrete RVs
• Bernoulli(p)
• Binomial(n, p)
• Geometric(p)
• Indicator RV
• AoS 2.4
MHB 3.7 - 3.9, 3.14.1
Python scripts:
draw_Bernoulli, draw_Binomial, draw_Geometric,
sample_Bernoulli, sample_Binomial, sample_Geometric
Sep 19 (Tue)
[Lec 06]
Random variables - 3: Continuous RVs
• Uniform(a, b)
• Exponential(λ)
• Normal(μ, σ²), and its several properties
• AoS 2.7
MHB 3.14.1, 3.10, 3.13
assignment 1 due
assignment 2 out
Python scripts:
draw_Uniform, draw_Exponential, draw_Normal,
sample_Uniform, sample_Exponential, sample_Normal
Sep 21 (Thu) Instructor traveling No class
Sep 26 (Tue)
[Lec 07]
Random variables - 4: Joint distributions & conditioning
• Joint probability distribution
• Linearity (and product) of expectation
• Conditional expectation
• Sum of a random number of RVs
• AoS 2.8
MHB 3.11 - 3.12, 3.15
Sep 28 (Thu)
[Lec 08]
Probability inequalities - 1
• Markov's Inequality
• Chebyshev's inequality
• AoS 4.1 - 4.2, 23.1 - 23.3
MHB 3.14.2, 8.1 - 8.7
Oct 03 (Tue)
[Lec 09]
Probability inequalities - 2
• Weak Law of Large Numbers
• Central Limit Theorem
• A2 solutions
AoS 4.1 - 4.2, 23.1 - 23.3
MHB 3.14.2, 8.1 - 8.7
assignment 2 due
Oct 05 (Thu)
[Lec 10]
Mid-term 1 review
Oct 10 (Tue) Mid-term 1
Oct 12 (Thu)
[Lec 11]
Non-parametric inference - 1
• Basics of inference
• Simple examples
• AoS 6.1 - 6.2
Oct 17 (Tue)
[Lec 12]
Non-parametric inference - 2
• Empirical PMF
• Sample mean
• bias, se, MSE
• AoS 6.3.1
Oct 19 (Thu)
[Lec 13]
Non-parametric inference - 3
• Empirical Distribution Function (or eCDF)
• Statistical Functionals
• Plug-in estimator
• AoS 7.1 - 7.2
assignment 3 out
Required weather.dat dataset for A3.

Python script used in class: eCDF
Oct 24 (Tue)
[Lec 14]
Confidence intervals
• Percentiles, quantiles
• Normal-based confidence intervals
• DKW inequality
• AoS 6.3.2, 7.1
Oct 26 (Thu)
[Lec 15]
Parametric inference - 1
• Basics of parametric inference
• Method of Moments Estimator (MME)
• AoS 6.3.1 - 6.3.2, 9.1 - 9.2
Oct 31 (Tue)
[Lec 16]
Parametric inference - 2
• Method of Moments Estimator (MME)
• Properties of MME
• AoS 9.1 - 9.2
assignment 3 due
assignment 4 out

Required q4.dat dataset for A4.
Nov 02 (Thu)
[Lec 17]
Parametric inference - 3
• Likelihood
• Maximum Likelihood Estimator (MLE)
• Properties of MLE
• AoS 9.3 - 9.4, 9.6
Nov 07 (Tue)
[Lec 18]
Hypothesis testing - 1
• Basics of hypothesis testing
• The Wald test
• AoS 10 - 10.1, 10.10.2
DSD 5.3.1 - 5.3.2
Nov 09 (Thu)
[Lec 19]
Hypothesis testing - 2
• The Wald test
• t-test
• Kolmogorov-Smirnov test (KS test)
• AoS 15.4
DSD 5.3.3
assignment 4 due
assignment 5 out

Required q3_X.dat, q3_Y.dat, and gamma.dat datasets for A5.
Nov 14 (Tue)
[Lec 20]
Hypothesis testing - 3
• p-values
• Permutation test
• AoS 10.2, 10.5
DSD 5.5
Nov 16 (Thu)
[Lec 21]
Bayesian inference
• Bayesian reasoning
• Bayesian inference
• AoS 11.1 - 11.2
DSD 5.6
Nov 21 (Tue)
[Lec 22]
Regression - 1
• Basics of Regression
• Simple Linear Regression
• A5 solutions
AoS 13.1, 13.3 - 13.4
assignment 5 due
assignment 6 out, due Dec 11, to Caitao Zhan.

Required q2.dat dataset for A6.
Nov 23 (Thu) Thanksgiving Break No class.
Nov 28 (Tue)
[Lec 23]
Mid-term 2 review
Nov 30 (Thu) Mid-term 2
Dec 05 (Tue)
[Lec 24]
Regression - 2
• Multiple Linear Regression
• AoS 13.5
Dec 07 (Thu) Instructor traveling

### Resources

• Recommended text: (AoS) "All of Statistics: A Concise Course in Statistical Inference" by Larry Wasserman (Springer publication).
• Students are strongly encouraged to purchase a copy of this book.
• Recommended text: (MHB) "Performance Modeling and Design of Computer Systems: Queueing Theory in Action" by Mor Harchol-Balter (Cambridge University Press).
• Suggested for probability review and stochastic processes.
• There is a copy on reserve in the library. The instructor also has a few personal copies that you can borrow.
• Recommended text: (DSD) "The Data Science Design Manual" by (our very own) Steven Skiena (Springer publication).
• Suggested for data science topics in the second half of the course.

• Others:
• S.M. Ross, Introduction to Probability Models, Academic Press
• S.M. Ross, Stochastic Processes, Wiley

### Grading

• Assignments: 48%
• Roughly 6 assignments during the semester. Expect 5-6 questions per assignment, including some programming questions (after mid-term 1).
• Assignments are due in class, at the beginning of the lecture. No late submissions allowed. Hard copies only, please.

• Exams: 42%
• Two in-class exams.
• Mid-term 1: 17%.
• Mid-term 2: 25%.
• Roughly as difficult as the assignments.

• Class interaction: 10%
• The basic idea is to get you to talk in the class and contribute to discussions.
• By the end of the semester, if I can recognize you based on your contributions to the class discussion, you should get a good score on this.
• Very helpful for bumping your grade if you are on the border.

• Important:
• Academic dishonesty will immediately result in an F, and the student will be referred to the Academic Judiciary. See the section on Academic Integrity below.
• Grading will be on a curve.
• Assignment of grades by the instructor will be final; no regrading requests will be entertained.
• There is a University policy on grading, as well as a set of grading guidelines agreed upon by the CS faculty. The instructor is obligated to uphold these policies.
No exceptions will be made for any student and no special circumstances will be entertained.
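
As a rough illustration of how the weights listed above combine, here is a minimal Python sketch. It only computes a raw weighted score out of 100; the curve mentioned above is applied by the instructor and is not modeled here. The names `WEIGHTS` and `weighted_score` are hypothetical and are not part of any course material.

```python
# Weights (percent of the final score) taken from the grading breakdown:
# assignments 48, mid-term 1 17, mid-term 2 25, class interaction 10.
WEIGHTS = {
    "assignments": 48,
    "midterm_1": 17,
    "midterm_2": 25,
    "class_interaction": 10,
}


def weighted_score(fractions):
    """Combine per-component scores, each a fraction in [0, 1],
    into a raw (pre-curve) score out of 100."""
    missing = set(WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= fractions[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


# Hypothetical student: 90% on assignments, 80% and 70% on the two
# mid-terms, and full marks for class interaction.
example = {
    "assignments": 0.90,
    "midterm_1": 0.80,
    "midterm_2": 0.70,
    "class_interaction": 1.00,
}
print(round(weighted_score(example), 1))
```

Because grading is on a curve, this raw score indicates only relative standing, not a letter grade.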