CSE 544, Spring 2023: Probability & Statistics for Data Science

News:
01/13: Welcome to CSE 544! We will have our first lecture in Engineering 145 on Jan 24th (Tuesday) at 9:45am

CSE 544: Probability & Statistics for Data Science
Spring 2023

When: Tu Th, 9:45am - 11:05am
Where: Engineering 145

Instructor: Anshul Gandhi
Instructor Office Hours: Tu 11am, Th 11am (NCS 347)

TA Office Hours: TBD

### Course Description

The course will cover core concepts of probability theory and an assortment of standard statistical techniques. Specific topics will include random variables and distributions, quantitative research methods (correlation and regression), and modern techniques of optimization and machine learning (clustering and prediction).

More informally, this 3-credit, grad-level course covers probability and statistics topics required for data scientists to analyze and interpret data. The course is also part of the Data Science and Engineering Specialization. The course is targeted primarily at PhD and Masters students in the Computer Science Department. Topics covered include Probability Theory, Random Variables, Stochastic Processes, Statistical Inference, Hypothesis Testing, Regression, and Time Series Analysis. For more details, refer to the syllabus below.

The class is in-person and is expected to be interactive; students are encouraged to participate in class discussions.

Grading will be on a curve, and will be based on assignments, exams, and a semester-end mini data analysis project. For more details, see the section on grading below.

Prerequisites: There are no formal prerequisites, but comfort with probability theory and proficiency with Python (since programming assignment tasks will be in Python) will be helpful.

Learning Objectives: Students taking this course will learn the necessary probability and statistical techniques and skills required for data scientists and quantitative analysts. At the completion of the course, students will be able to answer questions such as "What distribution does the data follow?", "How do we estimate the parameters of a data distribution?", and "Do two data populations come from the same distribution?". Additionally, students will be able to provide a measure of confidence or statistical significance when answering these questions. Finally, students will be able to implement techniques that answer the above questions using Python programming.

### Syllabus & Schedule

Jan 24 (Tu)
[Lec 01]
Course introduction, class logistics
Jan 26 (Th)
[Lec 02]
Probability review - 1
• Basics: sample space, outcomes, probability
• Events: mutually exclusive, independent
• Calculating probability: sets, counting, tree diagram
• AoS 1.1 - 1.5
MHB 3.1 - 3.4
Jan 31 (Tu)
[Lec 03]
Probability review - 2
• Conditional probability
• Law of total probability
• Bayes' theorem
• AoS 1.6, 1.7
MHB 3.3 - 3.6
assignment 1 out, due Feb 8
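The law of total probability and Bayes' theorem from this lecture can be checked numerically. A minimal sketch using a standard disease-testing setup; all the numbers below are illustrative, not from the course:

```python
# Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+), where P(+) is
# expanded via the law of total probability.
p_d = 0.01          # prior: disease prevalence
p_pos_d = 0.95      # P(test positive | disease)
p_pos_nd = 0.05     # P(test positive | no disease)

# Law of total probability: P(+) = P(+|D)P(D) + P(+|~D)P(~D)
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)

# Bayes' theorem
p_d_pos = p_pos_d * p_d / p_pos
print(p_d_pos)  # ≈ 0.161: even a positive test leaves P(disease) low
```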
Feb 02 (Th)
[Lec 04]
Random variables - 1: Overview and Discrete RVs
• Discrete and Continuous RVs
• Mean, Moments, Variance
• pmf, pdf, cdf
• Discrete RVs: Bernoulli, Binomial, Geometric, Indicator
• AoS 2.1 - 2.3, 3.1 - 3.4
MHB 3.7 - 3.9
Feb 07 (Tu)
[Lec 05]
Random variables - 2: Continuous RVs
• Uniform(a, b)
• Exponential(λ)
• Normal(μ, σ²), and several of its properties
• AoS 2.4, 3.1 - 3.4
MHB 3.7 - 3.9, 3.14.1
Python scripts:
draw_Bernoulli, draw_Binomial, draw_Geometric,
draw_Uniform, draw_Exponential, draw_Normal
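The draw_* scripts are posted separately and are not reproduced here; as a rough NumPy sketch of what sampling from these distributions looks like (parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 10,000 samples from each distribution and compare the
# sample mean against the theoretical mean.
bern = rng.binomial(n=1, p=0.3, size=10_000)        # Bernoulli(0.3), mean 0.3
unif = rng.uniform(2, 6, size=10_000)               # Uniform(2, 6), mean 4
expo = rng.exponential(scale=1 / 0.5, size=10_000)  # Exponential(λ=0.5), mean 2
norm = rng.normal(loc=1, scale=2, size=10_000)      # Normal(1, 2²), mean 1

for name, sample, mean in [("Bernoulli", bern, 0.3), ("Uniform", unif, 4.0),
                           ("Exponential", expo, 2.0), ("Normal", norm, 1.0)]:
    print(f"{name}: sample mean {sample.mean():.2f} vs true mean {mean}")
```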
Feb 09 (Th)
[Lec 06]
Random variables - 3: Joint distributions & conditioning
• Joint probability distribution
• Linearity of expectation
• AoS 2.5 - 2.8
MHB 3.10 - 3.13, 3.15
assignment 2 out, due Feb 23
Feb 14 (Tu)
[Lec 07]
Random variables - 4: Joint distributions & conditioning
• Independent random variables
• Product of expectation
• Conditional expectation
• AoS 2.5 - 2.8
MHB 3.10 - 3.13, 3.15
Feb 16 (Th)
[Lec 08]
Probability Inequalities
• Weak Law of Large Numbers
• Central Limit Theorem
• AoS 5.3, 5.4
MHB 3.14.2, 5.2
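A quick simulation illustrates the CLT: sample means of a skewed distribution are approximately normal with mean μ and standard deviation σ/√n. The sample sizes below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# CLT: means of n Exponential(1) draws are approximately
# Normal(1, 1/n) for large n, even though each draw is skewed.
n, trials = 100, 5000
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

print(sample_means.mean())  # close to the true mean, 1
print(sample_means.std())   # close to sigma/sqrt(n) = 1/10 = 0.1
```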
Feb 21 (Tu)
[Lec 09]
Markov chains
• Stochastic processes
• Setting up Markov chains
• Balance equations
• AoS 23.1 - 23.3
MHB 8.1 - 8.7
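The balance equations πP = π (with π summing to 1) can be solved numerically as a left-eigenvector problem; the two-state transition matrix below is a made-up example:

```python
import numpy as np

# Two-state chain: P[i, j] = P(next state = j | current state = i).
P = np.array([[0.8, 0.2],
              [0.5, 0.5]])

# The stationary distribution pi solves the balance equations
# pi P = pi, sum(pi) = 1, i.e. pi is the left eigenvector of P
# for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()
print(pi)  # [5/7, 2/7] ≈ [0.714, 0.286]
```

Checking by hand: balance across states gives 0.2·π₀ = 0.5·π₁, and normalization then yields π₀ = 5/7, π₁ = 2/7.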
Feb 23 (Th)
[Lec 10]
Non-parametric inference - 1
• Basics of inference
• Simple examples
• Empirical PMF
• Sample mean
• bias, se, MSE
• AoS 6.1 - 6.2, 6.3.1
assignment 3 out, due March 5
Required data: a3_q2_data.csv, a3_q2_ecdf.ipynb, a3_q6_kde_estimate_bimodal.ipynb, a3_q5_parametric_bootstrapping.ipynb

Feb 28 (Tu)
[Lec 11]
Non-parametric inference - 2
• Empirical Distribution Function (or eCDF)
• Kernel Density Estimation (KDE)
• Statistical Functionals
• Plug-in estimator
• AoS 7.1 - 7.2
Python scripts:
sample_Bernoulli, sample_Binomial, sample_Geometric,
sample_Uniform, sample_Exponential, sample_Normal, draw_eCDF
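The sample_* and draw_eCDF scripts are posted separately; a minimal hand-rolled eCDF, for illustration only:

```python
import numpy as np

def ecdf(data):
    """Empirical CDF: return sorted data x and, for each x[i],
    the fraction of data points <= x[i]."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

data = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
x, y = ecdf(data)
print(x)  # [1. 1. 3. 4. 5.]
print(y)  # [0.2 0.4 0.6 0.8 1. ]
```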
Mar 02 (Th)
[Lec 12]
Non-parametric inference - 3
• Percentiles, quantiles
• Normal-based confidence intervals
• DKW inequality
• Bootstrapping
• AoS 6.3.2, 7.1, 8 - 8.3
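A sketch of the percentile bootstrap for a confidence interval on the median; the data and number of resamples are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=200)  # true median = 2 ln 2 ≈ 1.386

# Nonparametric bootstrap: resample with replacement, recompute the
# statistic, then take percentiles of the bootstrap distribution.
B = 2000
boot_medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(B)
])
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% bootstrap CI for the median: ({lo:.2f}, {hi:.2f})")
```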
Mar 07 (Tu)
[Lec 13]
Parametric inference - 1
• Consistency, Asymptotic Normality
• Basics of parametric inference
• Method of Moments Estimator (MME)
• AoS 6.3.1 - 6.3.2, 9.1 - 9.2
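As a one-line illustration of the MME: for Exponential(λ), the first moment is E[X] = 1/λ, so matching it to the sample mean gives λ̂ = 1/x̄:

```python
import numpy as np

rng = np.random.default_rng(3)
true_lambda = 0.5
data = rng.exponential(scale=1 / true_lambda, size=10_000)

# MME: equate the first theoretical moment E[X] = 1/lambda to the
# first sample moment (the sample mean), then solve for lambda.
lambda_mme = 1 / data.mean()
print(lambda_mme)  # close to the true value, 0.5
```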
Mar 09 (Th) Mid-term 1 In-class
Mar 14 (Tu) No class Spring Break
Mar 16 (Th) No class Spring Break
Mar 21 (Tu)
[Lec 14]
Parametric inference - 2
• Properties of MME
• Basics of MLE
• Maximum Likelihood Estimator (MLE)
• Properties of MLE
• AoS 9.3, 9.4, 9.6
assignment 4 out, due April 3
Required data: iris dataset, q7_a.csv, q7_b_X.csv, q7_b_Y.csv
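For Normal(μ, σ²) the MLE has the familiar closed form obtained by maximizing the log-likelihood: the sample mean, and the variance with an n (not n−1) denominator. A quick numerical check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=1.5, size=5_000)  # true mu=3, sigma^2=2.25

# MLE for Normal(mu, sigma^2), from setting the gradient of the
# log-likelihood to zero:
mu_mle = data.mean()
sigma2_mle = ((data - mu_mle) ** 2).mean()  # divides by n, not n-1
print(mu_mle, sigma2_mle)  # close to 3.0 and 2.25
```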
Mar 23 (Th)
[Lec 15]
Hypothesis testing - 1
• Basics of hypothesis testing
• Wald test
• AoS 10 - 10.1
DSD 5.3.1
Mar 28 (Tu)
[Lec 16]
Hypothesis testing - 2
• Type I and Type II errors
• Wald test
• AoS 10 - 10.1
DSD 5.3.1
Mar 30 (Th)
[Lec 17]
Hypothesis testing - 3
• Z-test
• t-test
• AoS 10.10.2
DSD 5.3.2
Apr 04 (Tu)
[Lec 18]
Hypothesis testing - 4
• Kolmogorov-Smirnov test (KS test)
• p-values
• Permutation test
• AoS 15.4, 10.2, 10.5
DSD 5.3.3, 5.5
assignment 5 out, due April 17
Required data: a5_q5
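The KS test is available in SciPy; a sketch on synthetic data (sample sizes and the mean shift are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=300)
y = rng.normal(0.8, 1.0, size=300)  # same shape, shifted mean

# Two-sample KS test: H0 is that x and y come from the same distribution.
ks_stat, ks_p = stats.ks_2samp(x, y)
print(ks_p)  # tiny p-value: reject H0

# One-sample KS test of x against a standard normal.
stat, p = stats.kstest(x, "norm")
print(p)  # p-value for H0: x ~ N(0, 1)
```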
Apr 06 (Th)
[Lec 19]
Hypothesis testing - 5
• Pearson correlation coefficient
• Chi-square test for independence
• AoS 3.3, 10.3 - 10.4
DSD 2.3
Apr 11 (Tu)
[Lec 20]
Bayesian inference - 1
• Bayesian reasoning
• Bayesian inference
• AoS 11.1 - 11.2, 11.6
DSD 5.6
Example plots
Apr 13 (Th)
[Lec 21]
Bayesian inference - 2
• Priors
• Conjugate priors
• AoS 11.1 - 11.2, 11.6
DSD 5.6
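The standard Beta-Bernoulli conjugate pair makes the posterior update pure arithmetic; the prior and data below are illustrative:

```python
# Conjugate prior update: with a Beta(a, b) prior on p and Bernoulli
# data, observing k successes in n trials gives the posterior
# Beta(a + k, b + n - k) -- no numerical integration needed.
a, b = 2, 2          # prior: symmetric around 0.5
k, n = 7, 10         # observed 7 successes in 10 trials

a_post, b_post = a + k, b + (n - k)
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)  # 9 5 ≈ 0.643
```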
Apr 18 (Tu)
[Lec 22]
Regression - 1
• Basics of Regression
• Simple Linear Regression
• AoS 13.1, 13.3 - 13.4
DSD 9.1
assignment 6 out, due April 30
Required data: q2.dat, q4.dat, q5.csv, q6.csv
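Simple linear regression has closed-form least-squares estimates; a sketch on synthetic data (the true coefficients and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 200)
y = 1.5 + 2.0 * x + rng.normal(0, 1, size=x.size)  # intercept 1.5, slope 2

# Least squares for y = b0 + b1 x:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   b0 = ybar - b1 * xbar
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(intercept, slope)  # close to 1.5 and 2.0
```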
Apr 20 (Th)
[Lec 23]
Normality Testing
• Q-Q plot
• Shapiro-Wilk test
• Guest lecture: Anurag Dutt
Python scripts:
q-q_plot.ipynb, sw_std_normal.ipynb, sw_theoretical.ipynb
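The posted notebooks are not reproduced here; SciPy's built-in Shapiro-Wilk test illustrates the idea on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_data = rng.normal(0, 1, size=200)
skewed_data = rng.exponential(1.0, size=200)

# Shapiro-Wilk tests H0: the sample was drawn from a normal distribution.
w1, p1 = stats.shapiro(normal_data)
w2, p2 = stats.shapiro(skewed_data)
print(p1, p2)  # normality is rejected for the skewed sample
```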
Apr 25 (Tu)
[Lec 24]
Regression - 2
• Multiple Linear Regression
• AoS 13.5
DSD 9.1
Apr 27 (Th)
[Lec 25]
Time Series Analysis
• EWMA Time Series modeling
• AR Time Series modeling
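A minimal EWMA smoother, as a sketch of the recurrence s_t = α·x_t + (1−α)·s_{t−1}; the smoothing constant and series are arbitrary:

```python
import numpy as np

def ewma(series, alpha):
    """Exponentially weighted moving average:
    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    out = np.empty(len(series))
    out[0] = series[0]
    for t in range(1, len(series)):
        out[t] = alpha * series[t] + (1 - alpha) * out[t - 1]
    return out

x = np.array([10.0, 12.0, 11.0, 13.0])
print(ewma(x, alpha=0.5))  # [10. 11. 11. 12.]
```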
May 02 (Tu)
[Lec 26]
Mid-term 2 review
Solved practice exam: m2_s23_prac
May 04 (Th) Mid-term 2 In-class

### Resources

• Required text: (AoS) "All of Statistics : A Concise Course in Statistical Inference" by Larry Wasserman (Springer publication).
• Students are strongly suggested to purchase a copy of this book.
• Recommended text: (MHB) "Performance Modeling and Design of Computer Systems: Queueing Theory in Action" by Mor Harchol-Balter (Cambridge University Press)
• Suggested for probability review and stochastic processes.
• There is a copy placed on reserve in the library. The instructor also has a few personal copies that you can borrow.
• Recommended text: (DSD) "The Data Science Design Manual" by (our very own) Steven Skiena (Springer publication).
• Suggested for data science topics in the second half of the course.

• Others:
• S.M. Ross, Introduction to Probability Models, Academic Press
• S.M. Ross, Stochastic Processes, Wiley

### Grading

• Assignments: 36%
• 6 assignments during the semester, each accounting for 6% of the grade. Expect 5-7 questions per assignment, including some programming questions.
• Collaboration is allowed (max group size 4). You are free to form your own groups, and group membership can change between assignments.
• Submit one softcopy (e.g., PDF or doc) solution per group; it may be typed or handwritten, but must be legible. It should include all required plots and work. Include group member names and ID numbers on the first page.
• Assignments are due via Brightspace, at 11:59pm. For example, if A1 has a due date of Feb 6, that means it is due by 11:59pm on Feb 6th. No late submissions allowed. Brightspace will automatically flag your submission as late if it is received after the 11:59pm deadline.
• One person per group (not all) should submit the softcopy on Brightspace under the designated assignment number.

• Exams: 56%
• Two exams, in-person.
• Mid-term 1: 23%.
• Mid-term 2: 33%.
• Easier than the assignments, but the questions will be along the same lines.
• Make-up exams will be given only in extenuating circumstances (e.g., doctor's note stating that you were ill and unfit to take the exam). Students who miss an exam for a valid reason must contact the instructor immediately to take a make-up exam at the earliest possible time; specific arrangements will be made on a case-by-case basis.

• Mini group project: 8%
• One semester-end project to be done in groups. The project work is expected to begin in the last month of the course and will be due during finals exam week.
• Further details on the project will be provided in class around mid-March.

• Attendance: 0%
• Attendance is not required but strongly encouraged.
• Lectures will not be recorded.
• Exam questions are often based on class discussions, so attendance is helpful!

• Important:
• Academic dishonesty will immediately result in an F and the student will be referred to the Academic Judiciary. See below section on Academic Integrity.
• Grading will be on a curve.
• Assignment of grades by the instructor will be final; no regrading requests will be entertained.
• There is a University policy on grading, as well as a set of grading guidelines agreed upon by the CS faculty. No exceptions will be made for any student and no special circumstances will be entertained.