Data Mining
Spring 2019

Course Information


Project Description POSTED

Midterm will be given as scheduled, Tuesday, MARCH 12
Midterm covers Lectures L1 - L4 and L5 - L7 and the related chapters of the Book

Contact TA SAYONTAN GHOSH about formation of PROJECT TEAMS
Teams must consist of 2-5 students. He will help you if you do not have a partner

PROJECT submission is via Blackboard

Contact TAs if you need more information or need to talk about grading
We also use BLACKBOARD for PROJECT and FINAL REPORT submissions

Time:  Tuesday, Thursday  4:00 pm  - 5:20  pm

Place: Engineering 145

Professor: Anita Wasilewska

208 CS Building; tel: 632-8458
e-mail: anita@cs.stonybrook.edu
Office Hours: Tuesday, Thursday 5:30 pm - 7:00 pm, and by appointment

Teaching Assistants

 Sayontan Ghosh
e-mail: sayontan.ghosh@stonybrook.edu
Office hours:   Wednesday: 5:30 - 7:00 pm
Office Location: Room  2217  in Old CS Building
 Radhika Gaonkar
e-mail: rgaonkar@cs.stonybrook.edu
Office hours:  Thursday: 4:00 - 5:30 pm
Office Location: Room  2217  in Old CS Building
  Parth Limbachiya
e-mail: plimbachiya@cs.stonybrook.edu
Office hours:  Monday, 2:30pm - 4:00pm
Office Location: Room  2217  in Old CS Building
 Sushma Indluru
e-mail: sindluru@cs.stonybrook.edu
Office hours:  Monday, 2:30pm - 4:00pm
Office Location: Room  2217  in Old CS Building

General Course Description:

Data Mining (DM), also called Knowledge Discovery in Databases (KDD) and now also called BIG DATA, is a multidisciplinary field.
It brings together research and ideas from database technology, machine learning, neural networks, statistics, pattern recognition,
knowledge based systems, information retrieval, high-performance computing, and data visualization.
 Its main focus is the automated extraction of patterns representing knowledge implicitly stored in large databases,
data warehouses, and other massive information repositories.
The course will closely follow the book and is designed to give a broad, yet in-depth, overview of the Data Mining field
and to examine the most recognized techniques in more rigorous detail.

Course Book

DATA MINING Concepts and Techniques
Jiawei Han, Micheline Kamber
Morgan Kaufmann Publishers, 2003, 2011
Second or Third Edition

Course  Structure

The course is divided into six parts. Course Lecture slides are written by me, except when other sources are indicated
We list here Chapter numbers from the 2nd edition, followed by the respective Chapter numbers from the 3rd edition in parentheses

 In particular we will cover all or part of the following subjects

PART 1 : Introduction; Data Preprocessing, Data Warehouse
Book chapters 1 - 3 (1 - 4) and Lectures 3 - 7
PART 2 :  Classification:
Decision Tree Induction and Neural Networks

Book chapter 6  (8-9) and Lectures 3 - 7
Midterm 1  Review
Classification Project
PART 3 :  Association Analysis: Apriori Algorithm, Classification by Association
Book chapters 5, 6  (6, 9)  and Lectures 9, 10
PART 4 : Other Classification Models
Genetic Algorithms
Bayesian Classification
Book chapter 6  (9) and Lectures 11 -13
Midterm 2 Review
PART 5 : Cluster Analysis
Book chapter 7  (10, 11) and Lectures 14, 15
PART 6 : Other DM Areas,   Foundations of Data Mining
Book chapters 9-10 (13) Lectures 16, 17

Project and Final Report CAN be conducted in Teams
Teams consist of 2-5 students and must be the SAME for BOTH. All members of the Team receive the same grade

Preliminary Tests Schedule

MIDTERM 1  -  March 12
Spring Break  -  March 18-24
Project  -  March 26
MIDTERM 2  -  April 23
Final Report  -  May 9

Course  Content

We will use my own Lecture Notes  and I will also post the original  Book Slides as a reference

We will follow the BOOK very closely and in particular we will cover A PART OR ALL OF the following chapters and subjects. Chapter numbers below are from the 2nd edition. The respective Chapter numbers in the 3rd edition are listed in the Course Structure section
The order does not need to be sequential

Chapter 1  Introduction. General overview: what is Data Mining, which data, what kinds of patterns can be mined
Chapter 2  Data preprocessing: data cleaning, data integration and transformation, data reduction, discretization and concept hierarchy generation
Chapter 5  Mining Association Rules in transactional databases and Apriori Algorithm
Chapter 6 Classification and prediction:
1. Decision Tree Induction ID3, C4.5
2. Neural Networks
3. Bayesian Classification
4. Classification based on Concepts from Association rule mining
5. Genetic algorithms
Chapter 7  Cluster Analysis
 A Categorization of major Clustering methods
Chapter 10  Text Mining
Chapter 11 Other DM Areas and Foundations of DM

Grading Components

During the semester you have to complete the following
1. Midterm 1 - 70 pts
2. Midterm 2 - 70 pts
3. Project - 30 pts
4. Final Report - 30 pts

Final grade computation

NONE of the grades will be CURVED

During the semester you can earn 200 pts
The grade will be determined in the following way:  # of earned points/2 = % grade
The % grade is translated into a letter grade in the standard way, i.e.
100 - 90 % is A range, 89 - 80 % is B range, 79 - 70 % is C range, 69 - 60 % is D range, and F is below 60%.
See course SYLLABUS for details 
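
The grade computation described above can be sketched in a few lines of Python. This is only an illustration, not an official grading tool: the function names, the component keys, and the sample scores are hypothetical, and the handling of fractional percentages (for example 89.5 %) is an assumption, since this page does not say how they are rounded.

```python
# Maximum points per component, as listed under Grading Components.
# The dictionary keys are hypothetical names chosen for this sketch.
MAX_POINTS = {"midterm1": 70, "midterm2": 70, "project": 30, "final_report": 30}


def percent_grade(earned):
    """Return the % grade: number of earned points divided by 2 (200 pts maximum)."""
    for name, pts in earned.items():
        if name not in MAX_POINTS:
            raise ValueError(f"unknown component: {name}")
        if not 0 <= pts <= MAX_POINTS[name]:
            raise ValueError(f"{name}: {pts} is outside 0..{MAX_POINTS[name]}")
    return sum(earned.values()) / 2


def letter_range(percent):
    """Map a % grade to its letter range; no curving is applied."""
    # Assumption: a fractional percentage falls into the range of the
    # cutoff it has reached, so 89.5 % is still in the B range.
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if percent >= cutoff:
            return letter
    return "F"


# Hypothetical scores, for illustration only.
scores = {"midterm1": 62, "midterm2": 58, "project": 27, "final_report": 25}
pct = percent_grade(scores)      # 172 points / 2 = 86.0
print(pct, letter_range(pct))    # 86.0 B
```

With these sample scores the student has earned 172 of the 200 points, which gives 86 % and therefore falls in the B range.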

Records of student grades are kept on Blackboard
Contact TAs for information and questions about grading


PROJECT Description 
Project Data: - play around with the project data and familiarize yourself with it   bakarydata.xls
Project Tools :   WEKA    


Please e-mail TA SAYONTAN GHOSH the names and e-mails of your Team members, indicating who is the designated Team Leader
The TA will assign a Team Number to each team and e-mail it to each Team Leader to be used for future correspondence. CONTACT him if you do not HAVE a team partner. He will help you to FORM A TEAM
All this has to be done by March 14 as the PROJECT is due March 26

Download   (soon) the  Final  Report description   HERE


PROJECT Description 

Syllabus Slides


L1.  Chapter 1: Introduction
L2.  Chapter 2: Preprocessing
L2a.  Chapter 2: Short Preprocessing 
L3.  Chapter 6: Classification Introduction
L4.  Chapter 6: Classification Testing
L4a.  Example: Data Preparation and Metaclassifiers
  Paper:  A model Proteins SSP Metaclassifiers
L5. Chapter 6: Decision Trees Introduction
L6.   Chapter 6: Decision Trees Full Algorithm
L7.   Chapter 6: Neural Networks
L8.  Chapter 5: Association Analysis
L9.   Classification by Association
L10.  Chapter 6: Genetic Algorithms
L11.  Genetic Algorithms Examples
L13.  Chapter 7: Cluster Analysis 1
L14.  Chapter 7: Cluster Analysis 2

Data Mining Book Slides

Here are some  book slides - more to be posted
Book Chapter 2
Book Chapter 5
Book Chapter 6
Book Chapter 7


Datasets for data mining and knowledge discovery
Datasets for data mining competitions
University of California Irvine KDD Archive
World Bank datasets

Academic Integrity Statement

Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always wrong. Any suspected instance of academic dishonesty will be reported to the Academic Judiciary. For more comprehensive information on academic integrity, including categories of academic dishonesty, please refer to the Academic Judiciary Website

Stony Brook University Syllabus Statement

If you have a physical, psychological, medical, or learning disability that may impact your course work, please contact Disability Support Services at (631) 632-6748 or visit the Disability Support Services Website. They will determine with you what accommodations are necessary and appropriate. All information and documentation is confidential. Students who require assistance during emergency evacuation are encouraged to discuss their needs with their professors and Disability Support Services. For procedures and information go to the following website: Disability Support Services Website