Stony Brook University
Data Science Fundamentals
CSE51d - Fall 2015

Assignments

COURSEWORK:

Coursework mainly consists of 3 mini-projects and 1 course project on data analysis
(the weight of each deliverable is listed under Important Dates).
NOTE: Mini-projects are to be done individually. Groups for the course project can consist of
up to 3 students.

IMPORTANT DATES:

Assignment                    Due           Weight
Mini-Project #1               September 17  15%
Mini-Project #2               October 8     20%
Mini-Project #3               October 27    20%
Course Project Proposal       October 29    5%
Course Project Milestone      November 17   10%
Course Project Write-up       December 3    10%
Course Project Presentation   December 3    10%
In-class Participation        At all times  10%

ASSIGNMENTS ARE DUE AT THE BEGINNING OF LECTURE ON THE DUE DATE


Projects: The class projects will challenge the students to think creatively in formulating and solving problems.

For each mini-project, you will be given the specific dataset(s) to be studied. The main tasks for the mini-projects are (1) scoping problems/questions and (2) performing analytics. For the course project, you will be free to choose/collect your own dataset.

Each mini-project report should be 8 pages maximum, 4 pages minimum, including references and figures; this page limit is strict.

Note: Students can use the same dataset(s) provided to them for the mini-projects for their course project, provided that the course project addresses orthogonal problems to those studied for their mini-projects.

Scope: You will be given 2-3 weeks for each mini-project. As such, the scope of the mini-projects is expected to be smaller than that of the course project, for which you will have more time. Moreover, the course project is more open-ended and exploratory than the mini-projects, and you are expected to use a broader tool/skill-set for it. Therefore, its weight in evaluation is higher (35% total, see table above). Note that students can also form teams of 1-2 students for the course project.

Project presentations: For each mini-project, the course staff will choose the best few (2-4) reports, which we will schedule to be presented in class (along with many brownie points :-)). For the course project, every team will present their final results at the end of the semester.

MINI-PROJECT #1

The task for this mini-project is to analyze network flow traffic data. The data consists of simulation traces created by domain experts at Northrop Grumman Aerospace Systems (data is ITAR-Clear and Unclassified).

You will be given two versions of this data and you are free to choose which one you would like to study for this project (or both). At a high level, these are network connections between individual stations (e.g., IP address, sensor, etc.) over time. You can find more details about each dataset in the README files.

Network Traffic Data 1
Network Traffic Data 2

Your task is to first understand the type and the nature of the data you are given, scope potentially relevant problem statements for the domain, and perform the necessary analytics to find answers to these problems.

Example questions one might ask are as follows: What does the data look like? How can I represent it? What kind of (temporal) patterns exist? Are there any anomalies? What do the anomalies look like (e.g., intrusion, failure, etc.)? Of course, you are free to come up with other interesting questions that you would like to use the data to obtain answers to.
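
As one hedged illustration of the temporal-pattern and anomaly questions above, the sketch below counts connections per fixed time window and flags windows whose count deviates strongly from the average. The record layout (timestamp, source, destination), the function names, the window size, and the threshold are all assumptions made for illustration; the actual fields are described in each dataset's README, and a simple z-score is only one of many possible ways to look for anomalies.

```python
from collections import Counter
from statistics import mean, pstdev


def connections_per_window(records, window_seconds=60):
    """Count connections in fixed-width time windows.

    `records` is an iterable of (timestamp_seconds, source, destination)
    tuples. This layout is hypothetical, not the datasets' real schema.
    Windows with no connections at all do not appear in the result.
    """
    counts = Counter()
    for timestamp, _source, _destination in records:
        counts[int(timestamp // window_seconds)] += 1
    return counts


def flag_unusual_windows(counts, threshold=2.0):
    """Return sorted window indices whose count has |z-score| > threshold."""
    values = list(counts.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every window has the same count, nothing stands out
    return sorted(w for w, c in counts.items() if abs(c - mu) / sigma > threshold)


# Toy data: one connection per minute for 10 minutes, then a burst of 50
# connections inside the eleventh minute (window index 10).
toy = [(60 * i, "A", "B") for i in range(10)]
toy += [(600 + j, "C", "B") for j in range(50)]

counts = connections_per_window(toy)
print(flag_unusual_windows(counts))  # prints [10]
```

The same per-window counts can be computed per station or per station pair to ask whether a particular sender or receiver behaves unusually, rather than the network as a whole.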

MINI-PROJECT #2

NYC Open Data is an open source collection of 1200+ data sources, such as electric consumption, fire alerts, 311 calls, traffic tickets by zipcode in NYC, among many others.

Your task for this mini-project is to think creatively in formulating problems and developing solutions, ideally along the lines of solving societal problems and making NYC a better place. You are free to make use of any dataset or a combination thereof for your project. You should motivate and justify how your project relates to problems in society, and how your findings/analysis could be used toward a good solution.

As motivating examples, we provide references to work by the NYC Office of Policy and Strategic Planning in the Office of the Mayor of New York City, which recently used several of those data sources for problems such as illegal housing conversions, prescription drug abuse, illegal cooking oil dumping, and cigarette tax evasion. See here and here.

NYC Open Data

MINI-PROJECT #3

The task for this mini-project is to analyze user reviews for products/businesses. You will be given two different review datasets (one of which you can choose). The general format of the data is { user-ID, product-ID, star-rating, review-text, timestamp }. Depending on the dataset, you may have access to additional information, such as user profile information (e.g., location) or user-user friendship relations. For more details, please refer to the specific dataset.

Review Data 1 [also see here.]
Review Data 2

Your task consists of formalizing interesting questions one would ask given this data, and performing the necessary analytics. The outcome of your project could be in the form of interesting, surprising, and/or previously unknown findings, normal user behavior patterns, suspicious behaviors, etc.
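
To make the record format above concrete, here is a minimal sketch that models a review as { user-ID, product-ID, star-rating, review-text, timestamp } and computes how far each user's ratings sit, on average, from the mean rating of the products they reviewed. The class name, field types, function names, and the deviation measure itself are assumptions chosen for illustration, not part of either dataset or a prescribed method.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Review:
    user_id: str
    product_id: str
    star_rating: int
    review_text: str
    timestamp: int  # assumed here to be seconds since the epoch


def mean_rating_by_product(reviews):
    """Map each product ID to the mean of its star ratings."""
    ratings = defaultdict(list)
    for r in reviews:
        ratings[r.product_id].append(r.star_rating)
    return {product: mean(values) for product, values in ratings.items()}


def mean_deviation_by_user(reviews):
    """Map each user ID to the average of (rating - product mean rating).

    `reviews` must be a list (it is traversed twice). Large positive or
    negative values indicate users who consistently rate above or below
    the consensus, which may or may not be suspicious.
    """
    product_mean = mean_rating_by_product(reviews)
    deviations = defaultdict(list)
    for r in reviews:
        deviations[r.user_id].append(r.star_rating - product_mean[r.product_id])
    return {user: mean(values) for user, values in deviations.items()}


# Toy records, invented for this example.
toy_reviews = [
    Review("u1", "p1", 5, "great", 1),
    Review("u2", "p1", 1, "bad", 2),
    Review("u3", "p1", 3, "ok", 3),
    Review("u1", "p2", 5, "love it", 4),
    Review("u2", "p2", 3, "fine", 5),
]

print(mean_deviation_by_user(toy_reviews))
```

In the toy data, user u1 rates 1.5 stars above the product averages and u2 rates 1.5 stars below them, while u3 matches the average exactly.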

COURSE PROJECT

The course project is similar to the mini-projects, with a few key differences:

  • (1) the students can choose their own dataset(s),
  • (2) we expect the scope of the project to be greater than that of the mini-projects, and
  • (3) we expect students to use a larger tool/skill-set acquired during the course of the semester.

There are four deliverables:

Project proposal

The proposal should survey the most closely related work, identify the strengths and weaknesses of those papers, and explain how the weaknesses may be addressed. The proposal should then focus on describing the proposed research directions and questions. How precisely do you plan to pursue them? What methods do you plan to use? When writing the proposal you should try to answer the following questions:

  • What is the problem you are solving? Give the formal problem definition.
  • Which algorithms/techniques/models do you plan to use/develop? Be as specific as you can!
  • How will you evaluate your method? How will you test it? How will you measure success?
  • What data will you use (how will you get it)? Give data specifics (e.g., size, format, etc.).
  • What do you expect to submit/accomplish by the end of the semester? (e.g., novel algorithm, new findings, parallel implementation, etc.)
The project proposal is limited to 1 page in length; this is a strict limit.

Project milestone

  • Think of this as a draft of your final report but without your major results.
  • We expect that you have completed 50-60% of the project.
  • Provide a complete picture / overview of your project even if certain key parts have not yet been implemented/solved.
  • Include the parts of your project which have been completed so far, such as:
    • Thorough introduction of your problem
    • Review of the relevant prior work
    • Description of the data collection process
    • Description of any initial findings or summary statistics from your dataset
    • Description of any mathematical background necessary for your problem
    • Formal description of any important methods/tools used
    • Description of general difficulties with your problem which bear elaboration
  • Make sure to at least outline the parts which have not yet been completed so that it is clear specifically what you plan to do for the final version.
  • Recommended length is 3-5 pages including all figures and references.

Project writeup

Course staff will use the following guidelines when grading your final project reports.
  • Introduction/Motivation/Problem Definition (15%)
    What is it that you are trying to solve/achieve and why does it matter?
  • Prior Work (10%)
    How does your project relate to previous work? Please give a short summary on each paper you cite and include how it is relevant.
  • Main Contributions (30%)
    This is where you give a detailed description of your primary contribution (which can be in the form of new findings, models, algorithms, methods, and/or tools). It is very important that this part be clear and well written so that we can fully understand what you did.
  • Results and Findings (35%)
    How did you evaluate your solution to the question you addressed, and what do these evaluation methods tell you about your solution? It is not so important how well your method performs but rather how interesting and clever your experiments and analysis are. We are interested in seeing a clear and conclusive set of experiments which successfully evaluate the problem you set out to solve. Make sure to interpret the results and talk about what we can conclude and learn from your evaluations. This part should clearly demonstrate the validity or value of your project (e.g., its potential impacts).
  • Style and writing (10%)
    Overall writing, grammar, organization, figures and illustrations.
You are expected to use the ACM format to write your project reports (12 pages maximum, 4 pages minimum, including references and figures; this page limit is strict).

Project presentation

  • Think of this as an oral version of your final project writeup.
  • Present your work in a meaningful and interesting flow (e.g., motivation, problem definition, approach, results and their interpretation).
  • Make sure to include enough technical details and background of your approach (similar to a conference talk).
  • See here and here for some how-to on giving a good/bad talk.
  • Be prepared to ask (tough) questions of other project groups.
  • Depending on the number of project groups, we will spend 1-2 lectures on project talks. Each group will be given 5-10 minutes (depending on the number of groups), including questions.

