CSE 544, Spring 2018: Probability & Statistics for Data Science

News:
03/07: Today's lecture is canceled, as per SBU weather advisory.
03/07: A3 is now due 3/21 instead of 3/19.
02/21: M1 will be in class on Monday, Feb 26.

When: Mon Wed, 2:30pm - 3:50pm
Where: Humanities 1003
Instructor: Anshul Gandhi
Instructor Office Hours: Mon Wed, 4-5pm
347, New CS building
Course TA and Graders: Eugenia Soroka, Leena Shekhar, Sai Rachana Patel Siddam

Course Info

This grad-level course covers probability and statistics topics required for data scientists to analyze and interpret data. The course is also part of the Data Science and Engineering Specialization. The course is targeted primarily at PhD and Masters students in the Computer Science Department. Topics covered include Probability Theory, Random Variables, Stochastic Processes, Statistical Inference, Hypothesis Testing, Regression, and Time Series Analysis. For more details, refer to the syllabus below.

The class is expected to be interactive and students are encouraged to participate in class discussions.

Grading will be on a curve, and will tentatively be based on assignments, exams, a semester-long group project, and class participation. For more details, see the section on grading below.

Syllabus & Schedule

Date	Topic	Readings	Notes
Jan 22 (Mon) [Lec 01]	Course introduction, class logistics
Jan 24 (Wed) [Lec 02]	Probability review - 1: Basics (sample space, outcomes, probability); Events (mutually exclusive, independent); Calculating probability (sets, counting, tree diagram)	AoS 1.1 - 1.6; MHB 3.1 - 3.5
Jan 29 (Mon) [Lec 03]	Probability review - 2: Conditional probability; Law of total probability; Bayes' theorem	AoS 1.7; MHB 3.6, 3.10 - 3.11
Jan 31 (Wed) [Lec 04]	Random variables - 1 (Overview and Discrete RVs): Discrete and Continuous RVs; Mean, Moments, Variance; pmf, pdf, cdf; Discrete RVs (Bernoulli, Binomial, Geometric, Indicator)	AoS 2.1 - 2.4; MHB 3.7 - 3.9, 3.14.1	assignment 1 out
Feb 05 (Mon) [Lec 05]	Random variables - 2 (Continuous RVs): Uniform(a, b); Exponential(λ); Normal(μ, σ²), and its several properties	AoS 2.7; MHB 3.14.1, 3.10, 3.13	Python scripts: draw_Bernoulli, draw_Binomial, draw_Geometric, sample_Bernoulli, sample_Binomial, sample_Geometric
Feb 07 (Wed) [Lec 06]	Random variables - 3 (Joint distributions & conditioning): Joint probability distribution; Linearity (and product) of expectation; Conditional expectation; Sum of a random number of RVs	AoS 2.8; MHB 3.11 - 3.12, 3.15	Python scripts: draw_Uniform, draw_Exponential, draw_Normal, sample_Uniform, sample_Exponential, sample_Normal
Feb 12 (Mon) [Lec 07]	Probability inequalities: Markov's Inequality; Chebyshev's inequality; Weak Law of Large Numbers; Central Limit Theorem; Chernoff Bounds	AoS 4.1 - 4.2, 23.1 - 23.3; MHB 3.14.2, 8.1 - 8.7	assignment 1 due; assignment 2 out
Feb 14 (Wed) [Lec 08]	Markov chains: Stochastic processes; Setting up Markov chains; Balance equations	AoS 4.1 - 4.2, 23.1 - 23.3; MHB 3.14.2, 8.1 - 8.7
Feb 19 (Mon) [Lec 09]	Non-parametric inference - 1: Basics of inference; Simple examples; Empirical PMF; Sample mean; bias, se, MSE	AoS 6.1 - 6.2, 6.3.1
Feb 21 (Wed) [Lec 10]	Non-parametric inference - 2: Empirical Distribution Function (or eCDF); Kernel Density Estimation (KDE); Statistical Functionals; Plug-in estimator	AoS 7.1 - 7.2	assignment 2 due; assignment 3 out; Required data: q8.dat
Feb 26 (Mon)	Mid-term 1
Feb 28 (Wed) [Lec 11]	Confidence intervals: Percentiles, quantiles; Normal-based confidence intervals; DKW inequality	AoS 6.3.2, 7.1
Mar 05 (Mon)	Instructor Traveling		No class
Mar 07 (Wed)	Snow day		No class
Mar 19 (Mon) [Lec 12]	Parametric inference - 1: Consistency, Asymptotic Normality; Basics of parametric inference; Method of Moments Estimator (MME); Properties of MME; Basics of MLE	AoS 6.3.1 - 6.3.2, 9.1 - 9.2
Mar 21 (Wed)	Snow day		No class
Mar 26 (Mon) [Lec 13]	Parametric inference - 2 and Hypothesis testing - 1: Maximum Likelihood Estimator (MLE); Properties of MLE; Basics of hypothesis testing; The Wald test; t-test	AoS 9.3 - 9.4, 9.6; AoS 10 - 10.1, 10.10.2, 15.4; DSD 5.3.1 - 5.3.2	assignment 3 due; assignment 4 out; Required data: q6_X.dat, q6_Y.dat, q9_sigma3.dat, q9_sigma100.dat
Mar 28 (Wed) [Lec 14]	Hypothesis testing - 2: Kolmogorov-Smirnov test (KS test); p-values; Permutation test	AoS 15.4, 10.2, 10.5; DSD 5.3.3, 5.5
Apr 02 (Mon) [Lec 15]	Bayesian inference: Bayesian reasoning; Bayesian inference	AoS 11.1 - 11.2; DSD 5.6
Apr 04 (Wed)	Project proposal	Meet in NCS 347. Schedule of group meetings.	assignment 4 due on Apr 05 at 2:30pm (NCS 347).
Apr 09 (Mon) [Lec 16]	Regression - 1: Basics of Regression; Simple Linear Regression	AoS 13.1, 13.3 - 13.4	assignment 5 out; Required data: A5_q2.dat, A5_q6.dat
Apr 11 (Wed)	Mid-term 2
Apr 16 (Mon) [Lec 17]	Regression - 2: Multiple Linear Regression	AoS 13.5
Apr 18 (Wed) [Lec 18]	Time Series Analysis: EWMA Time Series modeling; AR Time Series modeling
Apr 23 (Mon) [Lec 19]	Review
Apr 25 (Wed)	Project discussion		assignment 5 due
Apr 30 (Mon)	Project final presentations
May 02 (Wed)	Project final presentations

Resources

Required text: (AoS) "All of Statistics : A Concise Course in Statistical Inference" by Larry Wasserman (Springer publication).

Students are strongly encouraged to purchase a copy of this book.

Recommended text: (MHB) "Performance Modeling and Design of Computer Systems: Queueing Theory in Action" by Mor Harchol-Balter (Cambridge University Press)

Suggested for probability review and stochastic processes.
There is a copy on reserve in the library. The instructor also has a few personal copies that you can borrow.

Recommended text: (DSD) "The Data Science Design Manual" by (our very own) Steven Skiena (Springer publication).

Suggested for data science topics in the second half of the course.

Others:

S.M. Ross, Introduction to Probability Models, Academic Press
S.M. Ross, Stochastic Processes, Wiley

Grading (tentative)

Assignments: 30%

Roughly 5 assignments during the semester. Expect 5-8 questions per assignment, including some programming questions (after mid-term 1).
Collaboration is allowed (max group size 5). Submit one solution per group.
Assignments are due in class, at the beginning of the lecture. No late submissions allowed. Hard copies only, please.

Exams: 30%

Two in-class exams.
Mid-term 1: 12%.
Mid-term 2: 18%.
Easier than the assignments, but the questions will be along the same lines.

Group project: 30%

One semester-long group project. The project work is expected to begin after mid-term 1.
Suggested group size of 4-8; the idea is to have around 15 teams total.
Further details on the project will be added in mid-February.
Grading (tentative) for the project consists of:
- 10% data cleaning and processing (final dataset to be submitted).
- 10% hypotheses to be tested (range of hypotheses involved and logic/reasoning behind it).
- 20% techniques used (range of techniques involved and how thoroughly they were applied and evaluated).
- 30% discretionary (largely based on group discipline/timeliness, non-triviality of project, and effort involved).
- 15% final in-class project presentation.
- 15% final 5-page project report.

Class interaction: 10%

The basic idea is to:
- get you to attend classes, and
- encourage you to participate in class discussions.
By the end of the semester, if I can recognize you based on your contributions to the class discussion, you should get a good score on this.
Very helpful for bumping your grade if you are on the border.
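
As a rough illustration of how the tentative weights above combine, here is a minimal Python sketch. The component names, the function `weighted_score`, and the sample percentages are made up for illustration. Since grading is on a curve, the number it produces is only a raw weighted score, not a letter grade.

```python
# Tentative course weights from the grading breakdown above.
WEIGHTS = {
    "assignments": 0.30,
    "midterm_1": 0.12,
    "midterm_2": 0.18,
    "project": 0.30,
    "class_interaction": 0.10,
}


def weighted_score(scores):
    """Combine per-component percentages (0-100) into one raw score.

    `scores` must have exactly the same keys as WEIGHTS.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the graded components")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student (illustrative numbers only).
example = {
    "assignments": 80.0,
    "midterm_1": 70.0,
    "midterm_2": 90.0,
    "project": 85.0,
    "class_interaction": 100.0,
}
print(f"raw weighted score: {weighted_score(example):.1f} / 100")
```

Note that the two mid-term weights (12% and 18%) add up to the 30% listed for exams, and all components together add up to 100%.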

Important:

Academic dishonesty will immediately result in an F, and the student will be referred to the Academic Judiciary. See the section on Academic Integrity below.
Grading will be on a curve.
Assignment of grades by the instructor will be final; no regrading requests will be entertained.
There is a University policy on grading, as well as a set of grading guidelines agreed upon by the CS faculty.
No exceptions will be made for any student and no special circumstances will be entertained.

Group Project

Groups should be of size 4-8 students, with the hope that we end up with around 15 groups total.
- For a group of size x, the expectation is to propose and test at least x hypotheses (see below).
The basic idea behind the project is for each team to:
1. Pick a raw dataset, process it, clean it (correct any errors or omit any outliers), and read it into a program.
2. Form multiple different hypotheses about the dataset and provide references for supporting the hypotheses, if applicable.
3. Analyze the processed data using multiple techniques to accept or reject each hypothesis.
4. Provide a measure of confidence for each case, as applicable.
5. Conclude with findings based on the data analysis.
For item #1, you can pick your own dataset (either based on your projects or any publicly available dataset that you are interested in), or pick one from the suggested list found here. In the former case, please discuss it with the instructor first. At the very least, the dataset must be large enough that it provides multiple columns of data to play with and is non-trivial to process. In the latter case, the datasets are typically large enough, some ideas about hypotheses have been provided, and references already exist that analyze or discuss some of the data.
For item #2, the hypotheses could simply be that the data (or part of the data) follows a given distribution, or that the data can be modeled using some function, like linear or exponential, or combinations thereof.
For item #3, the techniques will depend on step #2. If forming hypotheses about the distribution of data, then use parametric or non-parametric inference techniques; if forming hypotheses about the model, then use ML techniques like regression, etc. Some projects may require both sets of techniques.
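
As one possible illustration of a distribution-type hypothesis, the sketch below checks whether a column of data is consistent with an Exponential distribution using a one-sample KS test (Lec 14). Everything here is an assumption made for the example: the column is synthetic, the function names are made up, and the hypothesized rates are fully specified in advance. The critical value is the usual large-sample approximation; it is not strictly valid if the rate is estimated from the same data.

```python
import math
import random


def ks_statistic(sample, cdf):
    """One-sample KS statistic: the largest gap between the empirical
    CDF of `sample` and the hypothesized `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The eCDF jumps from (i-1)/n to i/n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_critical_value(n, alpha=0.05):
    """Large-sample approximation to the KS critical value at level alpha."""
    return math.sqrt(-0.5 * math.log(alpha / 2)) / math.sqrt(n)


def exponential_cdf(rate):
    """CDF of Exponential(rate), returned as a function of x."""
    return lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0


# Synthetic stand-in for one cleaned column of a dataset (assumption).
rng = random.Random(544)
column = [rng.expovariate(2.0) for _ in range(500)]

for rate in (2.0, 0.5):
    d = ks_statistic(column, exponential_cdf(rate))
    reject = d > ks_critical_value(len(column))
    print(f"H0: Exponential({rate}): D = {d:.3f}, reject at 5%: {reject}")
```

In a real project the column would come from the cleaned dataset, and the choice of test (KS, Wald, t-test, permutation test) would depend on the hypothesis being examined.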
The project should include multiple hypotheses, likely one for each column of data or sub-series of data, and must involve multiple techniques of analysis.
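
For a model-type hypothesis, for example that one column depends roughly linearly on another, a simple linear regression (Lec 16) is one natural second technique. The following is only a sketch under assumptions: the two columns and their names are invented, and `fit_line` and `r_squared` are illustrative helpers rather than part of any course script.

```python
def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Fraction of the variance in ys explained by the fitted line."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Invented columns, standing in for two columns of a cleaned dataset.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 55.0, 61.0, 64.0, 68.0]

b0, b1 = fit_line(hours, score)
print(f"fitted line: score = {b0:.2f} + {b1:.2f} * hours")
print(f"R^2 = {r_squared(hours, score, b0, b1):.3f}")
```

A high R^2 alone does not confirm the hypothesis; a confidence measure for the slope, as asked for in step 4 above, would still be needed.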
The tentative schedule for the project is:
1. March 21: Finalize teams.
2. March 28: Finalize dataset (you may change your dataset later, but at your own risk, as you may lose valuable time).
3. April 02: In-person meeting with instructor to shortlist hypotheses for each team. Preliminary analysis of the data should be complete so that you have an idea of what the dataset is about.
4. April 25: Last-minute issues to be discussed with the instructor.
5. April 30 and May 02: In-class final project presentations.
6. May 09: Project reports due. Reports should be at least 5 pages, double column, in a standard format (1-inch margins on all sides). You can use Word or LaTeX, though the latter is preferred. Going up to 6-7 pages is fine, but no fewer than 5 pages. Email the final PDF to the instructor no later than midnight on May 09 (no extensions). Some pointers:
  - The report will be graded for accuracy and readability (of text and graphs). It should roughly follow the same contents as the presentation. I would suggest having at least the following sections in the report:
  - Include graphs where it makes sense, but do not fill up your page limit using graphs that are not needed.
  - No plagiarism, please. Write your own text. You will lose ALL points on the report if I detect plagiarism.

Academic Integrity

Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always wrong. Faculty are required to report any suspected instances of academic dishonesty to the Academic Judiciary. For more comprehensive information on academic integrity, including categories of academic dishonesty, please refer to the academic judiciary website at http://www.stonybrook.edu/commcms/academic_integrity. Please note that any incident of academic dishonesty will immediately result in an F grade for the student.

Critical Incident Management

Stony Brook University expects students to respect the rights, privileges, and property of other people. Faculty are required to report to the Office of Judicial Affairs any disruptive behavior that interrupts their ability to teach, compromises the safety of the learning environment, or inhibits students' ability to learn.

Disability Support Services

If you have a physical, psychological, medical or learning disability that may impact your course work, please contact Disability Support Services, ECC (Educational Communications Center) Building, room 128, (631) 632-6748. They will determine with you what accommodations, if any, are necessary and appropriate. All information and documentation is confidential. http://studentaffairs.stonybrook.edu/dss.