Data Mining
Spring 2018

Course Information



Tuesday, May 22, 1:00pm - 4:00 pm


LAST DAY of CLASSES is Wednesday, the day of the LAST Presentations

I UPDATED  the DESCRIPTION   of the Final Report
You have to write FOUR ESSAYS, and you CAN write a FIFTH for extra credit
Download the UPDATED Final  Report   HERE

ATTENTION: Presentations must be 30-35 minutes long, not a minute longer. I will keep very strict time to give BOTH Teams a fair time allocation. Practice your timing!

DAYS 1 - 9    Presentations POSTED

ATTENTION: You do not need  to submit the Presentation Report PART FOUR in class
PLEASE make your own NOTES about the presentations. You will need them for your NEW Final Report

NEW Final Report  is now a team  report, not individual
NEW Final Report is different in content and structure from the one described in  the Syllabus

Contact TAs if you need more information or need to talk about grading
We also use BLACKBOARD for PROJECT and FINAL REPORT submissions

Time:  Monday, Wednesday  5:30 pm  - 6:50  pm

Place: JAVITS 111

Professor: Anita Wasilewska

208 CS Building; tel: 632-8458
e-mail: anita@cs.stonybrook.edu
Office Hours: Wednesday 7:00 pm - 8:00 pm, Friday 3:30 pm - 4:30 pm, and by appointment

Teaching Assistants

Amol Damare
e-mail: Amol.Damare@stonybrook.edu
Office hours: Tuesday, Thursday 3pm - 4:30 pm
Office Location: Room  2217  in Old CS Building
Fan Wang
e-mail: fanwang1@cs.stonybrook.edu
Office hours: Tuesday 3:30 pm - 5:00 pm
Office Location: Room  2217  in Old CS Building

General Course Description:

Data Mining (DM), also called Knowledge Discovery in Databases (KDD) and now also called BIG DATA, is a multidisciplinary field.
It brings together research and ideas from database technology, machine learning, neural networks, statistics, pattern recognition,
knowledge based systems, information retrieval, high-performance computing, and data visualization.
 Its main focus is the automated extraction of patterns representing knowledge implicitly stored in large databases,
data warehouses, and other massive information repositories.
The course will closely follow the book and is designed to give a broad, yet in-depth overview of the Data Mining field
and examine the most recognized techniques in more rigorous detail.
It will also explore the newest trends and developments of the field in the form of student talks based on the newest research papers from the field.

Course Book

DATA MINING Concepts and Techniques
Jiawei Han, Micheline Kamber
Morgan Kaufmann Publishers, 2003, 2011

Course  Structure

The course is divided into seven parts, the last one reserved for student presentations.
We will cover all or parts of the following:

PART 1 : Introduction; Data Preprocessing
Book chapters 1, 2 and Lectures 1, 2
PART 2 :  Classification: Decision Tree Induction and Neural Networks

Book chapter 6 and Lectures 3 - 7
PART 3 :  Association Analysis: Apriori Algorithm, Classification by Association
Book chapter 6 and Lectures 8, 9
Test Review One  
Lecture 10
PART 4 : Genetic Algorithms: Introduction and Examples
Book chapter 6 and Lectures 11, 12
SPRING BREAK March 12 - 18
Test Review Two  
Lecture 13
The test is in class and covers material from Parts 1 - 4
PART 5 : Cluster Analysis
Book chapter 7 and Lectures 14, 15
PART 6 : Foundations of Data Mining
Lecture 16
Presentations start  Monday, April 2

TEAMS consist of 5 students and are the SAME for the Classification Project, Presentations, and Final Report

Project, Tests, Presentations Schedule

The schedule may change; changes will be posted in the NEWS

PRESENTATIONS: April 2 - May 2

Course  Content

  • We will use my own Lecture Notes based on the BOOK, and I will also post the original Book Slides as a reference

  • We will follow the DATA MINING book very closely and in particular we will cover A PART OR ALL OF the following chapters and subjects. Some can be assigned as reading or presented as a part of student presentations
  • The order does not need to be sequential
  • Chapter 1  Introduction. General overview: what is Data Mining, which data, what kinds of patterns can be mined
    Chapter 2  Data preprocessing: data cleaning, data integration and transformation, data reduction, discretization and concept hierarchy generation
    Chapter 5  Mining Association Rules in transactional databases and Apriori Algorithm
    Chapter 6 Classification and prediction:
    1. Decision Tree Induction ID3, C4.5
    2. Neural Networks
    3. Bayesian Classification
    4. Classification based on Concepts from Association rule mining
    5. Genetic algorithms
    Chapter 7  Cluster Analysis
    Chapters 8-11  Applications and TRENDS in DM

    Grading Components

    During the semester you have to complete the following
    1. Team Project  30 pts
    2. Midterm/Final test  100 pts
    3. Team Research Presentation  50 pts
    4. Final Report  20 pts

    Final grade computation

    NONE of the grades will be CURVED

    During the semester you can earn 200pts
    The grade will be determined in the following way:  # of earned points/2 = % grade
    The % grade is translated into a letter grade in a standard way, i.e.
    100 - 90 % is A range, 89 - 80 % is B range, 79 - 70 % is C range, 69 - 60 % is D range, and F is below 60%.
    See course SYLLABUS for details 
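
The computation above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not an official grading tool: the function names and the example scores are hypothetical, and it assumes that no rounding is applied and that plus/minus variants within each range are ignored (see the SYLLABUS for the actual details).

```python
# Point values copied from the Grading Components list above.
MAX_POINTS = {
    "Team Project": 30,
    "Midterm/Final test": 100,
    "Team Research Presentation": 50,
    "Final Report": 20,
}


def percent_grade(earned_points):
    """Return the % grade: number of earned points divided by 2."""
    return earned_points / 2


def letter_grade(percent):
    """Map a % grade to a letter range using the cutoffs 90/80/70/60.

    Assumption: no rounding is applied, so 89.5% falls in the B range.
    """
    if percent >= 90:
        return "A"
    if percent >= 80:
        return "B"
    if percent >= 70:
        return "C"
    if percent >= 60:
        return "D"
    return "F"


# Hypothetical scores for one student, for illustration only.
example_scores = {
    "Team Project": 27,
    "Midterm/Final test": 82,
    "Team Research Presentation": 45,
    "Final Report": 18,
}
total = sum(example_scores.values())
pct = percent_grade(total)
print(total, pct, letter_grade(pct))
```

With the hypothetical scores above, the student earns 172 of 200 points, which gives 86% and falls in the B range.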

    Records of student grades are kept by the course TA. Contact the TA for information

    Project Description
    Project Data: bakarydata.xls (play around with the project data and familiarize yourself with it)
    Project Tools :   WEKA     


  • Here is the pdf for  Presentations Description (as in the Syllabus)
  • Read it to get an idea  of structure and rules and some of the subjects to choose from.
  • Please e-mail TA Amol Damare the number of members in the group, the name, e-mail, and SB ID of each group member, along with the possible subject/area of the presentation
  •  You CAN change the subject later
  • We will try to coordinate subject choices
  • Students CAN present the same subject but in this case MUST collaborate.
  • You can use any sources, but use the terminology developed in the Professor's Lecture Notes and in the Book.

    TEAMS consist of 5 students and are the same for the Project and Presentations
    Teams have to be formed by FEBRUARY 20 at the latest

    Each student has to deliver a 30-minute presentation on a chosen topic of AI as a member of a chosen Presentation Team of five students
    It is the students' responsibility to form the Presentation Teams
    Each team has to have a designated Team Leader in order to communicate with Professor and the course TA


    Please e-mail TA Amol Damare as soon as possible, and by FEBRUARY 20 at the latest, the following:
    names and e-mails of your Team members, denoting who is the designated Team Leader, and a GENERAL SUBJECT of your future Presentation

    You can START with a TEAM of 2 (or even of 1); TA will put you in contact with a Team (teams) with similar GENERAL SUBJECT and help to FORM a TEAM of 5 students

    TA will assign a Team Number to each team and e-mail it to each Team Leader, to be used for future correspondence

     All this has to be done  by February 20  as  the PROJECT  is   due March 7


    Each TEAM LEADER has to e-mail TA Amol Damare a TITLE and a paragraph-long description of the team presentation when the team is ready, but not later than MARCH 16

    RESERVE the Presentation Date via e-mail to TA Amol Damare as soon as possible; "first come, first served"
    You have to use your Team Number when  reserving the presentation date; you don't need the TITLE to reserve the  presentation date

    Presentations SCHEDULE

    D1   Monday, April 2
    Team 16:    Cluster Analysis (Introduction and K-medoids, K-means Clustering)
    Cluster Analysis
    Team 20:    Data Warehouse and OLAP (Data Cube computation and Development of data cube)
    Data Warehouse and OLAP

    D2  Wednesday, April 4
    Team 14:  Bayesian Classification
    Bayesian Classification
    Team 3:    Data Warehouse and OLAP (Data Cube Computation Methods, Processing Advanced Kinds of Queries by Exploring Cube Technology)
    Data Warehouse and OLAP

    D3   Monday, April 9 
    Team 18: Regression Analysis for prediction
    Regression Analysis
    Team 6:  Cluster Analysis (Distance metrics, hard and soft clustering, dimensionality reduction)
    Cluster Analysis

    D4   Wednesday, April 11
    Team 2: Data Clustering (Hierarchical Clustering, Density based clustering, Grid based clustering)
    Team 9: Genetic Algorithms: Applications
    Genetic Algorithms

    D5   Monday, April 16
    Team 1: New Advances in Data Mining (NLP topics)
    NLP 1
    Team  5: Text Mining
    Text Mining

    D6   Wednesday, April 18
    Team 15: Statistical Methods 1: Polynomial, Stepwise, ElasticNet regression and their applications in data mining
    Statistical Methods 1
    Team 8: Web Mining
    Web Mining

    D7   Monday, April 23
    Team 12: Opinion and Sentiment Analysis
    Opinion and Sentiment
    Team 11: Web Information Classification and Clustering (Web Mining)
    Web Mining

    D8   Wednesday, April 25
    Team 13: Statistical Methods 2: Linear, Logistic, Lasso, Ridge Regression
    Statistical Methods 2
    Team 4:  NLP
    Sentiment Analysis

    D9   Monday, April 30
    Team 17: Text Mining, NLP towards Computational Social Science
    Text Mining, NLP
    Team 19: Image Classification by Neural Networks
    Image Classification Using Convolutional NN

    D10  Wednesday, May 2
    Team 7:  Bayesian Classification
    Bayesian Classification
    Team 10: Deep Learning
    Deep Learning

    Download the NEW Final  Report description   HERE


    Course SYLLABUS
    Syllabus Slides
    PROJECT Description
    PRESENTATIONS Description
    FINAL Report Format


    L1.  Chapter1: Introduction
    L2.  Chapter2: Preprocessing
    L2a.  Chapter 2: Short Preprocessing 
    L3.  Chapter 6: Classification Introduction
    L4.  Chapter 6: Classification Testing
    L4a.  Data Preparation and Metaclassifiers
      Paper:  A model Proteins SSP Metaclassifiers
    L5. Chapter 6: Decision Trees Introduction
    L6.   Chapter 6: Decision Trees Full Algorithm
    L7.   Chapter 6: Neural Networks
    L8.  Chapter 5: Association Analysis
    L9.   Classification by Association
    L10.  Chapter 6: Genetic Algorithms
    L11.  Genetic Algorithms Examples
    L12.   TEST REVIEW
    L13.  Chapter 7: Cluster Analysis 1
    L14.  Chapter 7: Cluster Analysis 2

    Data Mining Book Slides

    Here are  book slides
    Book Chapter 2
    Book Chapter 5
    Book Chapter 6
    Book Chapter 7


    Datasets for data mining and knowledge discovery
    Datasets for data mining competitions
    University of California, Irvine KDD Archive
    World Bank datasets

    Academic Integrity Statement

    Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always wrong. Any suspected instance of academic dishonesty will be reported to the Academic Judiciary. For more comprehensive information on academic integrity, including categories of academic dishonesty, please refer to the academic judiciary website at Academic Judiciary Website

    Stony Brook University Syllabus Statement

    If you have a physical, psychological, medical, or learning disability that may impact your course work, please contact Disability Support Services at (631) 632-6748 or Disability Support Services Website. They will determine with you what accommodations are necessary and appropriate. All information and documentation is confidential. Students who require assistance during emergency evacuation are encouraged to discuss their needs with their professors and Disability Support Services. For procedures and information go to the following website: Disability Support Services Website