CSE 544, Spring 2017: Probability and Statistics for Data Science

News:
4/25: Assignment 5 out, due in CS 347 on 5/11 (Thursday).

When: Mon Wed, 2:30pm - 3:50pm
Where: Old CS 2129
Instructor: Anshul Gandhi
Instructor Office Hours: Mon 5-6pm and Wed 4-5pm in 347, New CS building
Also by appointment (email instructor to schedule)
Course TA: Xi Zhang (xizhang1 [at] cs [dot] stonybrook [dot] edu)

### Course Info

This course covers probability and statistics topics required for data scientists to analyze and interpret data. The course is also part of the Data Science and Engineering Specialization. The course is targeted primarily at PhD and Masters students in the Computer Science Department. Topics covered include Probability Theory, Random Variables, Stochastic Processes, Statistical Inference, Hypothesis Testing, Regression, Classification, and Clustering. For more details, refer to the syllabus below.

The class is expected to be interactive, and students are encouraged to participate in class discussions.
Grading will be on a curve, and will tentatively be based on assignments, exams, a group project, and class participation. For more details, refer to the section on grading below.

### Syllabus & Schedule

| Date | Topic | Readings | Notes |
|---|---|---|---|
| Jan 23 (Mon) | Course introduction, class logistics | | |
| Jan 25 (Wed) | Probability review - 1 | AoS 1.1 - 1.6; MHB 3.1 - 3.5 | |
| Jan 30 (Mon) | Probability review - 2 | AoS 1.7; MHB 3.6, 3.10 - 3.11 | |
| Feb 01 (Wed) | Random variables - 1 | AoS 2.1 - 2.3; MHB 3.7 - 3.9 | Assignment 1 out |
| Feb 06 (Mon) | Random variables - 2 | AoS 2.4; MHB 3.7 - 3.9, 3.14.1 | MATLAB scripts: draw_Bernoulli, draw_Binomial, draw_Geometric, sample_Bernoulli, sample_Binomial, sample_Geometric |
| Feb 08 (Wed) | Random variables - 3 | AoS 2.7; MHB 3.14.1, 3.10, 3.13 | MATLAB scripts: draw_Uniform, draw_Exponential, draw_Normal, sample_Uniform, sample_Exponential, sample_Normal |
| Feb 13 (Mon) | Conditioning and Expectations | AoS 2.8; MHB 3.11 - 3.12, 3.15 | |
| Feb 15 (Wed) | Probability inequalities, Stochastic processes, Markov chains | AoS 4.1 - 4.2, 23.1 - 23.3; MHB 3.14.2, 8.1 - 8.7 | Assignment 1 due; Assignment 2 out |
| Feb 20 (Mon) | Non-parametric inference - 1 | AoS 6.1 - 6.2, 7.1 - 7.2 | MATLAB scripts: draw_ecdf_random, draw_ecdf_normal, draw_ecdf_exponential |
| Feb 22 (Wed) | Non-parametric inference - 2 | AoS 20.2 | MATLAB scripts: draw_histogram_random, draw_histogram_normal |
| Feb 27 (Mon) | Non-parametric inference - 3, Confidence intervals | AoS 20.3, 6.3.2, 7.1 | MATLAB scripts: draw_kde |
| Mar 01 (Wed) | Parametric inference - 1 | AoS 6.3.1 - 6.3.2, 9.1 - 9.2 | Assignment 2 due |
| Mar 06 (Mon) | Mid-term 1 | | |
| Mar 08 (Wed) | Parametric inference - 2 | AoS 9.3 - 9.4, 9.9 | Assignment 3 out. Required data file for Q8. |
| Mar 13 (Mon) | Spring break | | No class |
| Mar 15 (Wed) | Spring break | | No class |
| Mar 20 (Mon) | Hypothesis testing - 1 | AoS 10 - 10.1, 10.10.2 | |
| Mar 22 (Wed) | Project discussion - 1 | | Finalize project dataset. Meet in CS 220. |
| Mar 27 (Mon) | Hypothesis testing - 2 | AoS 10.2, 10.5 | Assignment 3 due; Assignment 4 out. Required data: q5_sigma3.dat, q5_sigma100.dat, q7_X.dat, q7_Y.dat. |
| Mar 29 (Wed) | Bayesian inference | AoS 11.1 - 11.2, 11.7 | |
| Apr 03 (Mon) | Statistical Models | | |
| Apr 05 (Wed) | Project discussion - 2 | | Finalize project deliverables. |
| Apr 10 (Mon) | Mid-term prep | | No class. Assignment 4 due |
| Apr 12 (Wed) | Mid-term 2 | | |
| Apr 17 (Mon) | Regression | | |
| Apr 19 (Wed) | Time series analysis | | |
| Apr 24 (Mon) | Project discussion - 3 | | Mid-project review. |
| Apr 26 (Wed) | Classification | | Assignment 5 out. Required data: q2.dat, q4.dat, q5.dat. |
| May 01 (Mon) | Clustering | | |
| May 03 (Wed) | Project prep | | |
| May 08 (Mon) | Project final ppts | | |
| May 10 (Wed) | Project final ppts | | |

### Resources

• Recommended text: (AoS) "All of Statistics: A Concise Course in Statistical Inference" by Larry Wasserman (Springer).
  • Students are strongly encouraged to purchase a copy of this book.
• Recommended text: (MHB) "Performance Modeling and Design of Computer Systems: Queueing Theory in Action" by Mor Harchol-Balter (Cambridge University Press).
  • Suggested for probability review and stochastic processes.
  • There is a copy on reserve in the library. The instructor also has a few personal copies that you can borrow.

• Others:
  • S.M. Ross, Introduction to Probability Models, Academic Press
  • S.M. Ross, Stochastic Processes, Wiley

### Grading

• Assignments: 35%
  • Roughly 5 assignments during the semester. Expect 5-8 questions per assignment, including some programming questions.
  • Assignments are due in class, at the beginning of the lecture. No late submissions allowed. Hard copies only, please.

• Exams: 35%
  • Two in-class exams.
  • Mid-term 1: 15%.
  • Mid-term 2: 20%.
  • Roughly as difficult as the assignments.

• Group project: 20%
  • One semester-long group project.
  • Suggested group size of 4-8; the idea is to have around 10 teams total.
  • See further details in the project section below.
  • Grading (tentative) for the project consists of:
    • 10% data cleaning and processing (final dataset to be submitted).
    • 10% hypotheses to be tested (range of hypotheses involved and the logic/reasoning behind them).
    • 20% techniques used (range of techniques involved and how thoroughly they were applied and evaluated).
    • 30% discretionary (largely based on group discipline/timeliness, non-triviality of project, and effort involved).
    • 15% final in-class project presentation.
    • 15% final 5-page project report.

• Class interaction: 10%
  • The basic idea is to get you to talk in class and contribute to discussions.
  • By the end of the semester, if I can recognize you based on your contributions to the class discussion, you should get a good score on this.

• Important:
  • Academic dishonesty will immediately result in an F, and the student will be referred to the Academic Judiciary. See the section on Academic Integrity below.
  • Grading will be on a curve.
  • Assignment of grades by the instructor will be final; no regrading requests will be entertained.
  • There is a University policy on grading, as well as a set of grading guidelines agreed upon by the CS faculty. The instructor is obligated to uphold these policies. No exceptions will be made for any student and no special circumstances will be entertained.
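
As a rough illustration of how the weights above combine, the following Python sketch computes a raw weighted total from per-component scores. The 0-100 scale, the function name `weighted_score`, and the example student are assumptions made for illustration. The curve is not specified here, so this number is only the input to curving, not a final grade.

```python
# Weights copied from the grading breakdown above (fractions of the course total).
WEIGHTS = {
    "assignments": 0.35,
    "midterm1": 0.15,
    "midterm2": 0.20,
    "project": 0.20,
    "interaction": 0.10,
}


def weighted_score(scores):
    """Combine per-component scores (assumed to be on a 0-100 scale) into a raw total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student, for illustration only.
example = {
    "assignments": 80,
    "midterm1": 70,
    "midterm2": 90,
    "project": 85,
    "interaction": 100,
}
total = weighted_score(example)
print(f"raw weighted total before curving: {total:.1f}")
```

Because the two mid-terms carry different weights (15% and 20%), they are kept as separate entries rather than averaged.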

### Group Project

• The basic idea behind the project is for each team to:
1. Pick a raw dataset, process it, clean it (correct any errors or omit any outliers), and read it into a program.
2. Form multiple different hypotheses about the dataset and provide references supporting the hypotheses, if applicable.
3. Analyze the processed data using multiple techniques to accept or reject each hypothesis.
4. Provide a measure of confidence for each case, as applicable.
5. Conclude with findings based on the data analysis.
• For item #1, you can pick your own dataset (either based on your projects or any publicly available dataset that you are interested in), or pick one from the suggested list found here. In case of the former, please discuss it with the instructor first. At the very least, the dataset must be large enough that it provides multiple columns of data to play with and is non-trivial to process. In case of the latter, the datasets are typically large enough, some ideas about hypotheses have been provided, and references already exist that analyze or discuss some of the data.
• For item #2, the hypotheses could simply be that the data (or part of the data) follows a given distribution, or that the data can be modeled using some function, like linear or exponential, or combinations thereof.
• For item #3, the techniques will depend on item #2. If forming hypotheses about the distribution of data, then use parametric or non-parametric inference techniques; if forming hypotheses about the model, then use ML techniques like regression, etc. Some projects may require both sets of techniques.
• The project should include multiple hypotheses, likely one for each column of data or sub-series of data, and must involve multiple techniques of analysis.
• The tentative schedule for the project is:
1. March 22: Form groups and pick a dataset to work on. Meet with the instructor on March 22 to finalize the dataset(s).
2. April 03: Preliminary analysis of the data should be complete so that you have an idea of what the dataset is about. Form appropriate hypotheses and finalize the set of techniques to use for data analysis. Meet with instructor on April 3rd to finalize the deliverables.
3. April 24: Mid-project review to discuss progress and results.
4. May 08-10: In-class final project presentations.
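
To make items #2 and #3 concrete, here is a minimal Python sketch of the two kinds of hypotheses mentioned above: one about a distribution and one about a functional model. Everything in it is an assumption for illustration: the data is synthetic, the helper names (`ks_statistic`, `fit_exponential_rate`, `fit_line`) are hypothetical, and the Kolmogorov-Smirnov distance is just one of several techniques a project could use. Note that when the rate is estimated from the same data, the standard KS critical values no longer apply directly, so the distance is used here only to compare candidate models.

```python
import math
import random


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a model `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides of the jump.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


def fit_exponential_rate(sample):
    """Maximum-likelihood estimate of the exponential rate: 1 / sample mean."""
    return len(sample) / sum(sample)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


rng = random.Random(544)  # fixed seed so the sketch is reproducible

# Hypothesis A: a column of (synthetic) waiting times is exponentially distributed.
waits = [rng.expovariate(2.0) for _ in range(1000)]
rate_hat = fit_exponential_rate(waits)
d_exp = ks_statistic(waits, lambda x: 1.0 - math.exp(-rate_hat * x))

# The same distance against a deliberately wrong model, Uniform(0, max), for contrast.
top = max(waits)
d_unif = ks_statistic(waits, lambda x: min(max(x / top, 0.0), 1.0))

# Hypothesis B: a second (synthetic) column grows linearly in x, plus Gaussian noise.
xs = [float(i) for i in range(100)]
ys = [3.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
a_hat, b_hat = fit_line(xs, ys)

print(f"rate estimate {rate_hat:.3f}, KS distance to exponential {d_exp:.3f}")
print(f"KS distance to uniform {d_unif:.3f}")
print(f"fitted line: y = {a_hat:.2f} + {b_hat:.2f} x")
```

A real project would replace the synthetic columns with the cleaned dataset from item #1 and attach a measure of confidence (item #4), for example a p-value or a confidence interval for the fitted parameters.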