CSE 544, Spring 2018: Probability & Statistics for Data Science

News:
03/07: Today's lecture is canceled, as per SBU weather advisory.
03/07: A3 is now due 3/21 instead of 3/19.
02/21: M1 will be in class on Monday, Feb 26.

When: Mon Wed, 2:30pm - 3:50pm
Where: Humanities 1003
Instructor: Anshul Gandhi
Instructor Office Hours: Mon Wed, 4-5pm (347, New CS building)
Course TA and Graders: Eugenia Soroka, Leena Shekhar, Sai Rachana Patel Siddam

### Course Info

This grad-level course covers probability and statistics topics required for data scientists to analyze and interpret data. The course is also part of the Data Science and Engineering Specialization. The course is targeted primarily at PhD and Masters students in the Computer Science Department. Topics covered include Probability Theory, Random Variables, Stochastic Processes, Statistical Inference, Hypothesis Testing, Regression, and Time Series Analysis. For more details, refer to the syllabus below.

The class is expected to be interactive and students are encouraged to participate in class discussions.

Grading will be on a curve, and will tentatively be based on assignments, exams, a semester-long group project, and class participation. For more details, see the section on grading below.

### Syllabus & Schedule

Date Topic Readings Notes
Jan 22 (Mon)
[Lec 01]
Course introduction, class logistics
Jan 24 (Wed)
[Lec 02]
Probability review - 1
• Basics: sample space, outcomes, probability
• Events: mutually exclusive, independent
• Calculating probability: sets, counting, tree diagram
• AoS 1.1 - 1.6
MHB 3.1 - 3.5
Jan 29 (Mon)
[Lec 03]
Probability review - 2
• Conditional probability
• Law of total probability
• Bayes' theorem
• AoS 1.7
MHB 3.6, 3.10 - 3.11
Jan 31 (Wed)
[Lec 04]
Random variables - 1: Overview and Discrete RVs
• Discrete and Continuous RVs
• Mean, Moments, Variance
• pmf, pdf, cdf
• Discrete RVs: Bernoulli, Binomial, Geometric, Indicator
• AoS 2.1 - 2.4
MHB 3.7 - 3.9, 3.14.1
assignment 1 out
Feb 05 (Mon)
[Lec 05]
Random variables - 2: Continuous RVs
• Uniform(a, b)
• Exponential(λ)
• Normal(μ, σ²), and its several properties
• AoS 2.7
MHB 3.14.1, 3.10, 3.13
Python scripts:
draw_Bernoulli, draw_Binomial, draw_Geometric,
sample_Bernoulli, sample_Binomial, sample_Geometric
Feb 07 (Wed)
[Lec 06]
Random variables - 3: Joint distributions & conditioning
• Joint probability distribution
• Linearity (and product) of expectation
• Conditional expectation
• Sum of a random number of RVs
• AoS 2.8
MHB 3.11 - 3.12, 3.15
Python scripts:
draw_Uniform, draw_Exponential, draw_Normal,
sample_Uniform, sample_Exponential, sample_Normal
Feb 12 (Mon)
[Lec 07]
Probability inequalities
• Markov's Inequality
• Chebyshev's inequality
• Weak Law of Large Numbers
• Central Limit Theorem
• Chernoff Bounds
• AoS 4.1 - 4.2, 23.1 - 23.3
MHB 3.14.2, 8.1 - 8.7
assignment 1 due
assignment 2 out
Feb 14 (Wed)
[Lec 08]
Markov chains
• Stochastic processes
• Setting up Markov chains
• Balance equations
• AoS 4.1 - 4.2, 23.1 - 23.3
MHB 3.14.2, 8.1 - 8.7
Feb 19 (Mon)
[Lec 09]
Non-parametric inference - 1
• Basics of inference
• Simple examples
• Empirical PMF
• Sample mean
• bias, se, MSE
• AoS 6.1 - 6.2, 6.3.1
Feb 21 (Wed)
[Lec 10]
Non-parametric inference - 2
• Empirical Distribution Function (or eCDF)
• Kernel Density Estimation (KDE)
• Statistical Functionals
• Plug-in estimator
• AoS 7.1 - 7.2
assignment 2 due
assignment 3 out
Required data q8.dat
Feb 26 (Mon) Mid-term 1
Feb 28 (Wed)
[Lec 11]
Confidence intervals
• Percentiles, quantiles
• Normal-based confidence intervals
• DKW inequality
• AoS 6.3.2, 7.1
Mar 05 (Mon) Instructor Traveling No class
Mar 07 (Wed) Snow day No class
Mar 19 (Mon)
[Lec 12]
Parametric inference - 1
• Consistency, Asymptotic Normality
• Basics of parametric inference
• Method of Moments Estimator (MME)
• Properties of MME
• Basics of MLE
• AoS 6.3.1 - 6.3.2, 9.1 - 9.2
Mar 21 (Wed) Snow day No class
Mar 26 (Mon)
[Lec 13]
Parametric inference - 2 and Hypothesis testing - 1
• Maximum Likelihood Estimator (MLE)
• Properties of MLE
• Basics of hypothesis testing
• The Wald test
• t-test
• AoS 9.3 - 9.4, 9.6
AoS 10 - 10.1, 10.10.2, 15.4
DSD 5.3.1 - 5.3.2
assignment 3 due
assignment 4 out
Required data q6_X.dat, q6_Y.dat, q9_sigma3.dat, q9_sigma100.dat
Mar 28 (Wed)
[Lec 14]
Hypothesis testing - 2
• Kolmogorov-Smirnov test (KS test)
• p-values
• Permutation test
• AoS 15.4, 10.2, 10.5
DSD 5.3.3, 5.5
Apr 02 (Mon)
[Lec 15]
Bayesian inference
• Bayesian reasoning
• Bayesian inference
• AoS 11.1 - 11.2
DSD 5.6
Apr 04 (Wed) Project proposal Meet in NCS 347.
Schedule of group meetings.
assignment 4 due on Apr 05 at 2:30pm (NCS 347).
Apr 09 (Mon)
[Lec 16]
Regression - 1
• Basics of Regression
• Simple Linear Regression
• AoS 13.1, 13.3 - 13.4
assignment 5 out
Required data A5_q2.dat, A5_q6.dat
Apr 11 (Wed) Mid-term 2
Apr 16 (Mon)
[Lec 17]
Regression - 2
• Multiple Linear Regression
• AoS 13.5
Apr 18 (Wed)
[Lec 18]
Time Series Analysis
• EWMA Time Series modeling
• AR Time Series modeling
Apr 23 (Mon)
[Lec 19]
Review
Apr 25 (Wed) Project discussion
assignment 5 due
Apr 30 (Mon) Project final ppts
May 02 (Wed) Project final ppts

### Resources

• Required text: (AoS) "All of Statistics: A Concise Course in Statistical Inference" by Larry Wasserman (Springer publication).
• Students are strongly encouraged to purchase a copy of this book.
• Recommended text: (MHB) "Performance Modeling and Design of Computer Systems: Queueing Theory in Action" by Mor Harchol-Balter (Cambridge University Press)
• Suggested for probability review and stochastic processes.
• There is a copy placed on reserve in the library. The instructor also has a few personal copies that you can borrow.
• Recommended text: (DSD) "The Data Science Design Manual" by (our very own) Steven Skiena (Springer publication).
• Suggested for data science topics in the second half of the course.

• Others:
• S.M. Ross, Introduction to Probability Models, Academic Press
• S.M. Ross, Stochastic Processes, Wiley

### Grading

• Assignments: 30%
• Roughly 5 assignments during the semester. Expect 5-8 questions per assignment, including some programming questions (after mid-term 1).
• Collaboration is allowed (max group size 5). Submit one solution per group.
• Assignments are due in class, at the beginning of the lecture. No late submissions allowed. Hard copies only, please.

• Exams: 30%
• Two in-class exams.
• Mid-term 1: 12%.
• Mid-term 2: 18%.
• Easier than the assignments, but the questions will be along the same lines.

• Group project: 30%
• One semester-long group project. The project work is expected to begin after mid-term 1.
• Suggested group size of 4-8; the idea is to have around 15 teams total.
• Further details on the project will be added in mid-February.
• Grading (tentative) for the project consists of:
• 10% data cleaning and processing (final dataset to be submitted).
• 10% hypotheses to be tested (range of hypotheses involved and logic/reasoning behind them).
• 20% techniques used (range of techniques involved and how thoroughly they were applied and evaluated).
• 30% discretionary (largely based on group discipline/timeliness, non-triviality of project, and effort involved).
• 15% final in-class project presentation.
• 15% final 5-page project report.

• Class interaction: 10%
• The basic idea is to:
• get you to attend classes, and
• encourage you to participate in class discussions.
• By the end of the semester, if I can recognize you based on your contributions to the class discussion, you should get a good score on this.
• Very helpful for bumping your grade if you are on the border.

• Important:
• Academic dishonesty will immediately result in an F and the student will be referred to the Academic Judiciary. See the section on Academic Integrity below.
• Grading will be on a curve.
• Assignment of grades by the instructor will be final; no regrading requests will be entertained.
• There is a University policy on grading, as well as a set of grading guidelines agreed upon by the CS faculty.
No exceptions will be made for any student and no special circumstances will be entertained.
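
The component weights listed in this section can be combined into a raw weighted total, as in the minimal sketch below. It assumes each component is scored on a 0-100 scale, which this page does not state, and it does not model the curve. The function name and the example scores are hypothetical.

```python
# Component weights in percent, taken from the grading breakdown on this page.
WEIGHTS = {
    "assignments": 30,
    "midterm_1": 12,
    "midterm_2": 18,
    "group_project": 30,
    "class_interaction": 10,
}

# Sanity check: the listed weights should cover the whole grade.
assert sum(WEIGHTS.values()) == 100


def raw_weighted_score(scores):
    """Combine per-component scores (assumed 0-100 each) into a raw 0-100 total.

    This is only the pre-curve weighted sum; the final letter grade depends on
    the curve, which is not modeled here.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student scores, for illustration only.
example = {
    "assignments": 80,
    "midterm_1": 70,
    "midterm_2": 90,
    "group_project": 85,
    "class_interaction": 100,
}
print(raw_weighted_score(example))  # 84.1
```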

### Group Project

• Groups should be of size 4-8 students, with the hope that we end up with around 15 groups total.
• For a group of size x, the expectation is to propose and test at least x hypotheses (see below).
• The basic idea behind the project is for each team to:
1. Pick a raw dataset, process it, clean it (correct any errors or omit any outliers), and read it into a program.
2. Form multiple different hypotheses about the dataset and provide references that support the hypotheses, if applicable.
3. Analyze the processed data using multiple techniques to accept or reject each hypothesis.
4. Provide a measure of confidence for each case, as applicable.
5. Conclude with findings based on the data analysis.
• For item #1, you can pick your own dataset (either based on your projects or any publicly available dataset that you are interested in), or pick one from the suggested list found here. In case of the former, please first discuss with the instructor. At the very least, the dataset must be large enough so that it provides multiple columns of data to play with and is non-trivial to process. In case of the latter, the datasets are typically large enough, some ideas about hypotheses have been provided, and references already exist that analyze or discuss some of the data.
• For item #2, the hypotheses could simply be that the data (or part of the data) follows a given distribution, or that the data can be modeled using some function, like linear or exponential, or combinations thereof.
• For item #3, the techniques will depend on step #2. If forming hypotheses about the distribution of data, then use parametric or non-parametric inference techniques; if forming hypotheses about the model, then use ML techniques like regression, etc. Some projects might require both sets of techniques.
• The project should include multiple hypotheses, likely one for each column of data or sub-series of data, and must involve multiple techniques of analysis.
• The tentative schedule for the project is:
1. March 21: Finalize teams.
2. March 28: Finalize dataset (you may change your dataset later, but at your own risk as you may lose invaluable time).
3. April 02: In-person meeting with instructor to shortlist hypotheses for each team. Preliminary analysis of the data should be complete so that you have an idea of what the dataset is about.
4. April 25: Last-minute issues to be discussed with the instructor.
5. April 30 and May 02: In-class final project presentations.
6. May 09: Project reports due. Should be at least 5 pages, double column. Use standard formats (1-inch margins on all sides). You can use Word or LaTeX, though the latter is preferred. If you go up to 6-7 pages, that is ok as well. But no less than 5 pages. Hand in the final pdf by email to the instructor no later than May 09 midnight (no extensions). Some pointers:
• The report will be graded for accuracy and readability (of text and graphs). It should roughly follow the same contents as the presentation. I would suggest having at least the following sections in the report:
1. Introduction: Start with a captivating and motivating introduction. Should contain high-level information on the dataset and the range of hypotheses you want to test. Maybe briefly summarize your results. This should be at most 1 page.
2. Dataset: Describe the dataset. Describe the data cleaning and formatting, if any. This would be at most half a page, per my estimates.
3. Hypotheses: This is the main section. A subsection for each hypothesis with (i) a sentence on the hypothesis, and why it makes sense to test it, (ii) method(s) used and whether they were applicable, (iii) results, (iv) CI or measures of confidence, if any, (v) conclusion for this hypothesis based on the results. This whole section could be 2-3 pages, or more.
4. Prior Work: It is highly unlikely that nobody else has ever analyzed your dataset before, so please do a careful search and describe each prior work that has analyzed your dataset. Mention similarities and differences in hypotheses and methods used and results, if any. This should be about 0.5-1 page. This is where most groups lose at least some points due to lack of due diligence.
5. Future work: Mention what you would do further for your specific hypotheses if you had more time. Maybe there are specific results you were unhappy with and would want to tune your hypothesis or method for, etc. Half a page, at most.
• Include graphs where it makes sense, but do not fill up your page limit using graphs that are not needed.
• No plagiarism, please. Write your own text. You will lose ALL points on the report if I detect plagiarism.
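
As a rough illustration of items #2 and #3 in the project outline above, the sketch below tests one distribution hypothesis with a one-sample Kolmogorov-Smirnov test and one model hypothesis with a simple linear regression. It is one possible approach under stated assumptions, not a prescribed method for the project. The data values and the function names `ks_statistic`, `ks_reject`, and `fit_line` are made up for illustration. The KS decision uses the large-sample critical value, and the hypothesized distribution must be fully specified in advance, not fitted to the same data.

```python
import math
from statistics import NormalDist


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of `data` and a hypothesized CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The eCDF jumps from (i-1)/n to i/n at x, so check both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_reject(data, cdf, alpha=0.05):
    """Large-sample KS decision: reject H0 if D > sqrt(-ln(alpha/2) / 2) / sqrt(n)."""
    n = len(data)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) / math.sqrt(n)
    d = ks_statistic(data, cdf)
    return d > critical, d, critical


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, returned with R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothesis A (made-up data): this column follows Normal(10, 2).
column = [8.1, 9.4, 9.9, 10.2, 10.8, 11.5, 12.3, 7.6, 10.0, 9.1]
reject, d, critical = ks_reject(column, NormalDist(mu=10, sigma=2).cdf)
print(f"KS test: D={d:.3f}, critical={critical:.3f}, reject H0: {reject}")

# Hypothesis B (made-up data): y grows linearly with x.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
slope, intercept, r2 = fit_line(xs, ys)
print(f"Linear fit: y = {intercept:.2f} + {slope:.2f} x, R^2 = {r2:.3f}")
```

With only a handful of points, the large-sample critical value is a crude approximation; a real project dataset would typically have far more rows, and a permutation test is an alternative when the approximation is in doubt.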