CSE 591 - Data Science
Data Science is a rapidly emerging discipline at the intersection of statistics, machine learning, data visualization, and mathematical modeling. This course is designed to provide a hands-on introduction to Data Science by challenging student groups to build predictive models for upcoming events, and validating their models against the actual outcomes.
This website is from an old version of the course.
Visit http://www.cs.stonybrook.edu/~skiena/519 for the site of the current offering.
Course Time: 10AM - 11:20AM Tuesday-Thursday
Place: Frey Hall 316
Steven Skiena's office hours are 11:30AM-1PM Tuesday-Thursday,
in 1417 Computer Science, and by appointment.
Homework 1: DUE 9/16/14 --
The country data file is available at
This data comes from a variety of sources, but primarily the CIA World FactBook.
Project 1: Video Back Reel: DUE 9/23/14 --
Start by carefully reviewing our official class video instructions.
Project 2: Background Report: DUE 10/16/14 --
LaTeX source/macros for the report are available from background-example.zip, following the structure of skiena-template.pdf.
Use this PowerPoint template for your presentation.
The presentation schedule will be groups 1/2 (Tuesday 10/7), 3/4 (Thursday 10/9), 5/6 (Tuesday 10/14) and 7/8 (Thursday 10/16).
Project 3: Status Report: DUE 11/13/14 -- LaTeX source/macros for the report
are available from status-example.zip, following
the structure of status-template.pdf.
Use this PowerPoint template for your presentation.
The presentation schedule will be groups 1/2/3/4 (Tuesday 11/11) and 5/6/7/8 (Thursday 11/13).
Project 4: Final Report: DUE December 12, 2014.
The presentation schedule will be groups 1/2/3/4 (Tuesday 12/2) and 5/6/7/8 (Thursday 12/4).
For your website, download the source files teamfolder.zip and fill them in with your project specifics (include title, your names/websites, photos, report text/links, and data).
Edit the content.php file with your content.
Show the site to me at the final group debriefing on 12/12/14. I will then ask you to send me a zip file with the final contents, which I will link to through something like this. Thanks to Wenbin Lin for making the template.
LaTeX source/macros for your final report
are available from final-report.zip.
The final group debriefings will be held in my office on Friday, December 12.
The times are:
- Group 1 -- 10:00AM
- Group 2 -- 10:30AM
- Group 3 -- 11:30AM
- Group 4 -- 12:00PM
- Group 5 -- 1:00PM
- Group 6 -- 1:30PM
- Group 7 -- 2:30PM
- Group 8 -- 3:00PM
Project 5: Final Video
I will give about twenty formal lectures during the semester. The other class periods will be devoted to project presentations and progress reports, three presentations per team. All classes will be filmed by Echo360 and made available on Blackboard.
Supplemental Lectures for Next Time?
If I decide not to continue the projects/video format the next time I teach
this, I would add the following lecture topics to fill out the semester:
A: Ranking and Scoring Functions
B: Network Data and Analysis
C: Deep Learning
D: Hashing/Feature Extraction
E: Big Data Architectures
F: Societal Implications of Big Data
G: Image/Video Analysis and Sensor Data
H: Time Series Analysis
A unique feature of this course is that I will be running it as a "TV reality show" in data science,
with the goal of producing a professionally-edited episode for each project, showing its evolution from an ill-defined problem through the development of a principled model and its evaluation.
Student teams will be given video cameras, and each team will be charged with producing rough-edited video segments of up to 20 minutes/group at five different points in the course of the semester.
Students who are reluctant to appear in these videos, or to work on filming/editing them, should not register for the course.
The final video episodes will be made available on YouTube and/or disseminated by other channels/media.
Filming Consent and Release Form -- Every person appearing in a video must fill this out.
Return these forms to Prof. Skiena, not University Communications.
Each 3-4 person group will be assigned a single, distinct modeling challenge, drawn from the following set of representative topics:
Movie gross prediction (Christmas releases)
Golden Globe awards (January)
Beauty contest (Miss Universe -- November-ish)
Football (playoffs / Super Bowl in January, College Football bowls Dec/Jan)
Election Forecasting (congressional elections in November)
Ghoul Pools (Halloween)
Baby Pool (what will the birth date/weight be? Flexible, but aim for December)
Weather Forecasting (will it be a white Xmas or rain on a wedding date?)
Time Magazine's Man of the Year (who will it be? Announced in December)
Stock Market (closing prices of each index/stock on December 31)
Bankruptcy (which NYSE company will go bankrupt/be delisted first after Dec 1?)
Economic variables (what will the unemployment rate, Consumer Price Index (CPI), or Michigan Survey of Consumer Sentiment be in October, November, December?)
Commodity prices (what will the price of oil and gold be in October, November, December?)
Visit the final project websites.
Grades will be assigned according to the following scale:
Individual Homework -- Python data manipulation (10% of course grade)
Group project (total of 80% of the course grade)
In-class presentations (background, progress, and final reports) (20% of grade)
Written reports (background, progress, and final reports) in LaTeX (25% of grade)
Video segments (five segments: B-reel, background, progress, final model, event/post-mortem): (5% each for a total of 25% of grade)
Final project webpage drawn from reports and presentations (10% of grade)
Course participation (10% of course grade)
I will not grade you on whether your final predictions are correct. Grades will be based on the general soundness of your modeling, visualization, and evaluation, your level of effort throughout the semester, and the quality and clarity of your oral, web, video, and written presentations.
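As a quick check on the arithmetic above, the listed weights can be tallied in a few lines of Python. This is only an illustrative sketch: the component names are made-up labels, and the assumptions that each component is scored on a 0-100 scale and combined as a simple weighted average are mine, not something stated in the grading policy.

```python
# Component weights (percent of course grade), copied from the list above.
# The dictionary keys are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "homework": 10,
    "presentations": 20,
    "written_reports": 25,
    "video_segments": 25,   # five segments at 5% each
    "project_webpage": 10,
    "participation": 10,
}

# Components that together make up the group project.
GROUP_PROJECT = (
    "presentations",
    "written_reports",
    "video_segments",
    "project_webpage",
)


def course_score(scores):
    """Weighted average of per-component scores.

    Assumes (hypothetically) that every component is scored from 0 to 100.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100.0


group_total = sum(WEIGHTS[name] for name in GROUP_PROJECT)
overall_total = sum(WEIGHTS.values())

if __name__ == "__main__":
    print("group project share:", group_total)
    print("all components:", overall_total)
```

Under these assumptions, the group project components account for 80 of the 100 points, matching the stated breakdown.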
The field of data science is still emerging, and we will not use
a textbook for the semester.
However, there are several books which it will be useful to read and consult:
Python for Data Analysis, by Wes McKinney, O'Reilly Media, 2013 --
This book is a nuts-and-bolts guide to data wrangling with Python,
including such tools/libraries as Pandas, NumPy, and IPython.
You will be expected to use these tools in doing your course project.
The Signal and the Noise: Why so many predictions fail but some don't, by Nate Silver, Penguin Press, 2012 --
This popular, easy-to-read book focuses on how effectively data can
be used to make predictions in domains like sports, science, and economics.
This is exactly what we are trying to do in this course, and Silver's
book is an excellent model to build on.
Data Science is an emerging discipline, and its courses and textbooks are still works in progress.
Here are pointers to interesting data science courses at other universities:
CS109 Data Science, Harvard University, Fall 2013 --
This course stresses statistical modeling and Python programming.
Very interesting, well thought-out assignments.
Introduction to Data Science, University of Washington (Coursera), instructor Bill Howe -- Stresses scalability and system issues associated with Big Data (MapReduce, NoSQL databases), but with sections on machine learning and visualization.
Intro to Data Science, Udacity --
Data wrangling, statistics, visualization, MapReduce
Steven S. Skiena
1417 Computer Science Building
Department of Computer Science
State University of New York at Stony Brook
Stony Brook, NY 11794-4400, USA