CSE 544, Spring 2025: Probability & Statistics for Data Science

News:
01/09: Piazza course sign-up link
01/09: Welcome to CSE 544! We will have our first lecture in Old CS 2120 on Jan 28th (Tuesday) at 2pm


When: Tu Th, 2pm - 3:20pm
Where: Old CS 2120

Instructor: Anshul Gandhi
Instructor Office Hours: Tu Th, 3:20pm - 4:20pm (NCS 347)

TA Office Hours: Wed 5-6pm, NCS 336

Course Description

The course will cover core concepts of probability theory and an assortment of standard statistical techniques. Specific topics will include random variables and distributions, quantitative research methods (correlation and regression), and modern techniques of optimization and machine learning (clustering and prediction).

More informally, this 3-credit, grad-level course covers probability and statistics topics required for data scientists to analyze and interpret data. The course is also part of the Data Science and Engineering Specialization. The course is targeted primarily at PhD and Masters students in the Computer Science Department. Topics covered include Probability Theory, Random Variables, Stochastic Processes, Statistical Inference, Hypothesis Testing, Regression, and Time Series Analysis. For more details, refer to the syllabus below.

The class is in-person and is expected to be interactive; students are encouraged to participate in class discussions.

Grading will be on a curve, and will be based on assignments, exams, and a semester-end mini data analysis project. For more details, see the section on grading below.

Prerequisites: There are no formal prerequisites, but comfort with probability theory and proficiency in Python (since programming assignment tasks will be in Python) will be helpful.

Learning Objectives: Students taking this course will learn the necessary probability and statistical techniques and skills required for data scientists and quantitative analysts. At the completion of the course, students will be able to answer questions such as "what distribution does the data follow?", "how to estimate the parameters of a data distribution?", "do the two data populations come from the same distribution?". Additionally, students will be able to provide a measure of confidence or statistical significance when answering these questions. Finally, students will be able to implement techniques that answer the above questions using Python programming.
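
As a rough illustration of the kind of question listed above, the sketch below estimates a mean with a Normal-based confidence interval and runs a permutation test for whether two samples come from the same distribution. It is only a minimal example under stated assumptions: the function names (mean_with_normal_ci, permutation_test), the synthetic data, and the choice of test statistic are hypothetical and are not taken from the course scripts or assignments.

```python
import math
import random
import statistics


def mean_with_normal_ci(sample, z=1.96):
    """Return the sample mean and a Normal-based (roughly 95%) confidence interval."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error of the mean
    return mean, (mean - z * se, mean + z * se)


def permutation_test(x, y, num_permutations=2000, seed=0):
    """Approximate p-value for H0: x and y come from the same distribution.

    The test statistic (an illustrative choice) is |mean(x) - mean(y)|.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled data
        perm_x, perm_y = pooled[:len(x)], pooled[len(x):]
        if abs(statistics.fmean(perm_x) - statistics.fmean(perm_y)) >= observed:
            extreme += 1
    return extreme / num_permutations


if __name__ == "__main__":
    # Synthetic data, generated only for this illustration.
    rng = random.Random(544)
    a = [rng.gauss(0.0, 1.0) for _ in range(200)]
    b = [rng.gauss(0.5, 1.0) for _ in range(200)]
    mean, (lo, hi) = mean_with_normal_ci(a)
    print(f"mean = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
    print(f"permutation p-value = {permutation_test(a, b):.4f}")
```

A small p-value here would suggest that the two samples differ in their means; the confidence interval gives the accompanying measure of uncertainty for the estimate.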

Syllabus & Schedule

Date Topic Readings Notes
Jan 28 (Tu)
[Lec 01]
Course introduction, class logistics
Jan 30 (Th)
[Lec 02]
Probability review - 1
  • Basics: sample space, outcomes, probability
  • Events: mutually exclusive, independent
  • Calculating probability: sets, counting, tree diagram
  • AoS 1.1 - 1.5
    MHB 3.1 - 3.4
    Feb 04 (Tu)
    [Lec 03]
    Probability review - 2
  • Conditional probability
  • Law of total probability
  • Bayes' theorem
  • AoS 1.6, 1.7
    MHB 3.3 - 3.6
    assignment 1 out, due Feb 12
    Feb 06 (Th)
    [Lec 04]
    Random variables - 1: Overview and Discrete RVs
  • Discrete and Continuous RVs
  • Mean, Moments, Variance
  • pmf, pdf, cdf
  • Discrete RVs: Bernoulli, Binomial, Geometric, Indicator
  • AoS 2.1 - 2.3, 3.1 - 3.4
    MHB 3.7 - 3.9
    Feb 11 (Tu)
    [Lec 05]
    Random variables - 2: Continuous RVs
  • Uniform(a, b)
  • Exponential(λ)
  • Normal(μ, σ²), and its several properties
  • AoS 2.4, 3.1 - 3.4
    MHB 3.7 - 3.9, 3.14.1
    Python scripts:
    draw_Bernoulli, draw_Binomial, draw_Geometric,
    draw_Uniform, draw_Exponential, draw_Normal
    Feb 13 (Th)
    [Lec 06]
    Random variables - 3: Joint distributions & conditioning
  • Joint probability distribution
  • Linearity of expectation
  • AoS 2.5 - 2.8
    MHB 3.10 - 3.13, 3.15
    assignment 2 out, due Feb 28
    Feb 18 (Tu)
    [Lec 07]
    Random variables - 4: Joint distributions & conditioning
  • Independent random variables
  • Product of expectation
  • Conditional expectation
  • AoS 2.5 - 2.8
    MHB 3.10 - 3.13, 3.15
    Feb 20 (Th)
    [Lec 08]
    Probability Inequalities
  • Weak Law of Large Numbers
  • Central Limit Theorem
  • AoS 5.3, 5.4
    MHB 3.14.2, 5.2
    Feb 25 (Tu)
    [Lec 09]
    Markov chains
  • Stochastic processes
  • Setting up Markov chains
  • Balance equations
  • AoS 23.1 - 23.3
    MHB 8.1 - 8.7
    Feb 27 (Th)
    [Lec 10]
    Non-parametric inference - 1
  • Basics of inference
  • Simple examples
  • Empirical PMF
  • Sample mean
  • bias, se, MSE
  • AoS 6.1 - 6.2, 6.3.1
    assignment 3 out, due March 9
    Required data: firstFreezing.csv, a3_q6_kde_estimate_bimodal.ipynb,
    a3_q5_parametric_bootstrapping.ipynb

    Mar 04 (Tu)
    [Lec 11]
    Non-parametric inference - 2
  • Empirical Distribution Function (or eCDF)
  • Kernel Density Estimation (KDE)
  • Statistical Functionals
  • Plug-in estimator
  • AoS 7.1 - 7.2
    Python scripts:
    sample_Bernoulli, sample_Binomial, sample_Geometric,
    sample_Uniform, sample_Exponential, sample_Normal, draw_eCDF
    Mar 06 (Th)
    [Lec 12]
    Non-parametric inference - 3
  • Percentiles, quantiles
  • Normal-based confidence intervals
  • DKW inequality
  • Bootstrapping
  • AoS 6.3.2, 7.1, 8 - 8.3
    Mar 11 (Tu)
    M1 review
    Mar 13 (Th) Mid-term 1 In-class
    Mar 18 (Tu) No class Spring Break
    Mar 20 (Th) No class Spring Break
    Mar 25 (Tu)
    [Lec 13]
    Parametric inference - 1
  • Consistency, Asymptotic Normality
  • Basics of parametric inference
  • Method of Moments Estimator (MME)
  • AoS 6.3.1 - 6.3.2, 9.1 - 9.2
    Mar 27 (Th)
    [Lec 14]
    Parametric inference - 2
  • Properties of MME
  • Basics of MLE
  • Maximum Likelihood Estimator (MLE)
  • Properties of MLE
  • AoS 9.3, 9.4, 9.6
    assignment 4 out, due April 7
    Required data: penguin dataset, q6_a.csv, q6_b_X.csv, q6_b_Y.csv
    Apr 01 (Tu)
    [Lec 15]
    Hypothesis testing - 1
  • Basics of hypothesis testing
  • Wald test
  • AoS 10 - 10.1
    DSD 5.3.1
    Apr 03 (Th)
    [Lec 16]
    Hypothesis testing - 2
  • Type I and Type II errors
  • Wald test
  • AoS 10 - 10.1
    DSD 5.3.1
    Apr 08 (Tu)
    [Lec 17]
    Hypothesis testing - 3
  • Z-test
  • t-test
  • AoS 10.10.2
    DSD 5.3.2
    assignment 5 out, due April 21
    Required data: confFileSizes.dat
    Apr 10 (Th)
    [Lec 18]
    Hypothesis testing - 4
  • Kolmogorov-Smirnov test (KS test)
  • p-values
  • Permutation test
  • AoS 15.4, 10.2, 10.5
    DSD 5.3.3, 5.5
    Apr 15 (Tu)
    [Lec 19]
    Hypothesis testing - 5
  • Pearson correlation coefficient
  • Chi-square test for independence
  • AoS 3.3, 10.3 - 10.4
    DSD 2.3
    Apr 17 (Th)
    [Lec 20]
    Bayesian inference - 1
  • Bayesian reasoning
  • Bayesian inference
  • AoS 11.1 - 11.2, 11.6
    DSD 5.6
    Example plots
    Apr 22 (Tu)
    [Lec 21]
    Bayesian inference - 2
  • Priors
  • Conjugate priors
  • AoS 11.1 - 11.2, 11.6
    DSD 5.6
    assignment 6 out, due May 02
    Required data: q2.dat, gpu_dataset, q5.csv
    Apr 24 (Th)
    [Lec 22]
    Regression - 1
  • Basics of Regression
  • Simple Linear Regression
  • AoS 13.1, 13.3 - 13.4
    DSD 9.1
    Apr 29 (Tu)
    [Lec 23]
    Regression - 2
  • Multiple Linear Regression
  • AoS 13.5
    DSD 9.1
    May 01 (Th)
    [Lec 24]
    Time Series Analysis
  • EWMA Time Series modeling
  • AR Time Series modeling
    May 06 (Tu) M2 review
    May 08 (Th) Mid-term 2 In-class

    Resources

    Grading (tentative)

  • Important:
    Academic Integrity Statement

    Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always wrong. Faculty is required to report any suspected instances of academic dishonesty to the Academic Judiciary. Faculty in the Health Sciences Center (School of Health Professions, Nursing, Social Welfare, Dental Medicine) and School of Medicine are required to follow their school-specific procedures. For more comprehensive information on academic integrity, including categories of academic dishonesty please refer to the academic judiciary website at http://www.stonybrook.edu/commcms/academic_integrity/index.html.

    Critical Incident Management

    Stony Brook University expects students to respect the rights, privileges, and property of other people. Faculty are required to report to the Office of Student Conduct and Community Standards any disruptive behavior that interrupts their ability to teach, compromises the safety of the learning environment, or inhibits students' ability to learn. Faculty in the HSC Schools and the School of Medicine are required to follow their school-specific procedures. Further information about most academic matters can be found in the Undergraduate Bulletin, the Undergraduate Class Schedule, and the Faculty-Employee Handbook.

    Student Accessibility Support Center Statement

    If you have a physical, psychological, medical, or learning disability that may impact your course work, please contact the Student Accessibility Support Center, Stony Brook Union Suite 107, (631) 632-6748, or at sasc@stonybrook.edu. They will determine with you what accommodations are necessary and appropriate. All information and documentation is confidential.
     Please report any errors to the Instructor.