CSE 591 - Data Science

Fall 2014

Data Science is a rapidly emerging discipline at the intersection of statistics, machine learning, data visualization, and mathematical modeling. This course is designed to provide a hands-on introduction to Data Science by challenging student groups to build predictive models for upcoming events and to validate their models against the actual outcomes.

This website is from an old version of the course. Visit http://www.cs.stonybrook.edu/~skiena/519 for the site of the current offering.

  • Course Time: 10AM - 11:20AM Tuesday-Thursday
    Place: Frey Hall 316
  • Steven Skiena's office hours are 11:30AM-1PM Tuesday-Thursday, in 1417 Computer Science, and by appointment.
  • Syllabus
  • Lecture Schedule

Homework Assignments

Lecture Notes

I will give about twenty formal lectures during the semester. The other class periods will be devoted to project presentations and progress reports, three presentations per team. All classes will be filmed by Echo360 and made available on Blackboard.

Supplemental Lectures for Next Time?

If I decide not to continue the projects/video format the next time I teach this, I would add the following lecture topics to fill out the semester:

  • A: Ranking and Scoring Functions
  • B: Network Data and Analysis
  • C: Deep Learning
  • D: Hashing/Feature Extraction
  • E: Big Data Architectures
  • F: Societal Implications of Big Data
  • G: Image/Video Analysis and Sensor Data
  • H: Time Series Analysis

Video

A unique feature of this course is that I will be running it as a "TV reality show" in data science, with the goal of producing a professionally edited episode for each project, showing its evolution from an ill-defined problem through the development of a principled model and its evaluation.

Student teams will be given video cameras, and each team will be charged with producing rough-edited video segments of up to 20 minutes per group at five different points over the course of the semester. Students who are reluctant to appear in these videos, or to work on filming and editing them, should not register for the course. The final video episodes will be made available on YouTube and/or disseminated by other channels/media.

  • Filming Consent and Release Form -- Every person appearing in a video must fill this out. Return these forms to Prof. Skiena, not University Communications.

Semester Projects

Each 3-4 person group will be assigned a single, distinct modeling challenge, drawn from the following set of representative topics:

  • Movie gross prediction (Christmas releases)
  • Golden Globe awards (January)
  • Beauty contest (Miss Universe -- November-ish)
  • Football (playoffs / Super Bowl in January, College Football bowls Dec/Jan)
  • Election Forecasting (congressional elections in November)
  • Ghoul Pools (Halloween)
  • Baby Pool (what will the birth date/weight be? Flexible, but aim for December)
  • Weather Forecasting (will it be a white Xmas or rain on a wedding date?)
  • Time Magazine's Man of the Year (who will it be? Announced in December)
  • Stock Market (closing prices of each index/stock on December 31)
  • Bankruptcy (which NYSE company will go bankrupt/be delisted first after Dec 1?)
  • Economic variables (what will the unemployment rate, Consumer Price Index (CPI), or Michigan Survey of Consumer Sentiment be in October, November, December?)
  • Commodity prices (what will the price of oil and gold be in October, November, December?)

Visit the final project websites.

Grades

Grades will be assigned according to the following weights:

  • Individual Homework -- Python data manipulation (10% of course grade)
  • Group project (total of 80% of the course grade)
    • In-class presentations (background, progress, and final reports) (20% of grade)
    • Written reports (background, progress, and final reports) in LaTeX (25% of grade)
    • Video segments (five segments: B-reel, background, progress, final model, event/post-mortem): (5% each for a total of 25% of grade)
    • Final project webpage drawn from reports and presentations (10% of grade)
  • Course participation (10% of course grade)

I will not grade you on whether your final predictions are correct. Grades will be based on the general soundness of your modeling, visualization, and evaluation, your level of effort throughout the semester, and the quality and clarity of your oral, web, video, and written presentations.
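As a rough illustration of how the weights listed above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and that the components are combined as a simple weighted sum; this page does not say how component scores are actually combined, so treat that as an assumption. The names `WEIGHTS` and `course_score`, and the example scores, are hypothetical.

```python
# Weights copied from the grade breakdown above, as fractions of the course grade.
# The dictionary keys are illustrative names, not official component identifiers.
WEIGHTS = {
    "homework": 0.10,          # individual Python homework
    "presentations": 0.20,     # in-class presentations
    "written_reports": 0.25,   # written reports
    "video_segments": 0.25,    # five segments at 5% each
    "project_webpage": 0.10,   # final project webpage
    "participation": 0.10,     # course participation
}


def course_score(scores):
    """Return the weighted sum of component scores (each assumed to be 0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student, for illustration only.
example_scores = {
    "homework": 90,
    "presentations": 80,
    "written_reports": 85,
    "video_segments": 70,
    "project_webpage": 95,
    "participation": 100,
}

if __name__ == "__main__":
    print(round(course_score(example_scores), 2))
```

Note that the four group-project components together carry 80% of the total, so in this sketch a team's shared work dominates the result.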

Recommended Readings

The field of data science is still emerging, and we will not use a textbook for the semester. However, there are several books that will be useful to read and consult:

Related Links

Related Courses

Data Science is an emerging discipline, with courses and textbooks still works in progress. Here are pointers to interesting data science courses at other universities:
  • CS109 Data Science, Harvard University, Fall 2013 -- This course stresses statistical modeling and Python programming. Very interesting, well thought-out assignments.
  • Introduction to Data Science, University of Washington (Coursera), instructor Bill Howe -- Stresses scalability and system issues associated with Big Data (MapReduce, NoSQL databases), but with sections on machine learning and visualization.
  • Intro to Data Science, Udacity -- Data wrangling, statistics, visualization, MapReduce

Professor

Steven S. Skiena
1417 Computer Science Building
Department of Computer Science
State University of New York at Stony Brook
Stony Brook, NY 11794-4400, USA
skiena@cs.sunysb.edu
631-632-9026