Data Mining
Spring 2023

Course Information


TEST 2  - Thursday, April 13, in class
Test covers  Lectures 12-16

Consult Blackboard for DETAILS
FINAL PRESENTATIONS Description mailed to students and posted

TEST 1 and PROJECT Grades Consultation Week: March 27 - 31
DETAILS posted on Blackboard and mailed to students

TEST 1 Solutions POSTED

Time:  Tuesday,  Thursday   11:30 am  - 12:50pm

Place:   Melville Library, Room E4320 

 Professor:   Anita Wasilewska

208 New CS Building
Phone: 632-8458
e-mail: anita at cs.stonybrook.edu

Professor Office Hours:
Tuesday, Thursday  5:00 pm - 6:00 pm   and  by appointment
In person: 208 New CS Building and email
I read emails DAILY and respond within a day or two

Teaching Assistants

Contact TAs if you need more information or need to talk about grading
We have very good TAs - please e-mail them, go to see them anytime  you need help

TA:    posted on Blackboard

Office Hours:  
Office Location:   2126 Old CS Building

Course Book

DATA MINING Concepts and Techniques
Jiawei Han, Micheline Kamber
Morgan Kaufmann Publishers, 2003, 2011
Second or Third Edition

General Course Description:

Data Mining (DM), also called Knowledge Discovery in Databases (KDD), is a multidisciplinary field.
It brings together research and ideas from database technology, machine learning, neural networks, statistics, pattern recognition,
knowledge based systems, information retrieval, high-performance computing, and data visualization.
 Its main focus is the automated extraction of patterns representing knowledge implicitly stored in large databases,
data warehouses, and other massive information repositories.
The course will closely follow the book and is designed to give a broad, yet in-depth overview of the Data Mining field
and to examine the most recognized techniques in more rigorous detail.

Grading General Principles and Workload


ALL TESTS are personal, IN CLASS tests
PROJECT and FINAL PRESENTATIONS are to be conducted in TEAMS of 4-5 students
All members of the Team receive the same grade


Please e-mail the TA (TBA) the names, IDs, and e-mails of your Team members, indicating the designated Team Leader
The TA will assign a Team Number to each team and email it to each Team Leader to be used for future correspondence.
CONTACT the TA if you do not HAVE a team partner. The TA will help you to FORM A TEAM

Course  Structure and Content

The course is divided into six parts. Course Lecture slides are written by me, except when other sources are indicated
We list here Chapter numbers from the 2nd edition, followed by the respective Chapter numbers from the 3rd edition in parentheses

 In particular we will cover all or part of the following subjects

PART 1 : Introduction; Data Preprocessing, Data Warehouse
Book chapters 1 - 3 (1- 4)  and Lectures 1 - 3
PART 2 : Classification: Decision Tree Induction and Neural Networks
Book chapter 6  (8-9) and Lectures 4 - 11
  TEST 1
PART 3 : Association Analysis: Apriori Algorithm, Classification by Association
Book chapters 5, 6  (6, 7)  and Lectures 12 - 14
PART 4 : Other Classification Models
Genetic Algorithms
Bayesian Classification
Book chapter 6  (9) and Lectures 15, 16
PART 5 : Cluster  Analysis
Book chapter 7  (10, 11) and Lectures  17, 18
PART 6 : Other DM Areas, Foundations of Data Mining
Book chapters 9-10 (13) and Lectures 19 - 23


Project and Final Presentations are to be conducted in Teams
Teams consist of 4-5 students and must be the SAME for all assignments. All members of the Team receive the same grade

 Tests  and Assignments -  PRELIMINARY Schedule

Spring Break  March 13 - 19
Project  - due Tuesday, March 23
Final Presentation - APRIL 24 - MAY 4

We will use my own Lecture Notes  and I will also post the original  Book Slides as a reference

We will follow the BOOK very closely and in particular we will cover a part or all of the following chapters and subjects. Chapter numbers below are from the 2nd edition. The respective Chapter numbers in the 3rd edition are given in parentheses and are also listed in the Course Structure section
The order does not need to be sequential

Chapter 1  (1) Introduction. General overview: what is Data Mining, which data, what kinds of patterns can be mined
Chapter 2  (2,3) Data Preprocessing: data cleaning, data integration and transformation, data reduction, discretization and concept hierarchy generation
Chapter 3 (4,5) Data Warehouse and OLAP technology for Data Mining
Chapter 5 (6,7) Mining Association Rules in transactional databases and Apriori Algorithm
Chapter 6 (8,9) Classification and prediction
1. Decision Tree Induction ID3, C4.5
2. Neural Networks
3. Bayesian Classification
4. Classification based on Concepts from Association rule mining
5. Genetic algorithms
Chapter 7  (10,11,12) Cluster Analysis
 A Categorization of major Clustering methods
Chapter 10  Text Mining
Chapter 11  (13) Other DM Areas and Foundations of DM 

Grading Components

During the semester you have to complete the following
1. TEST 1 - 70 pts
2. TEST 2 - 70 pts
3. Project - 30 pts
4. Final Presentation - 30 pts

NONE of the grades will be CURVED

During the semester you can earn 200pts
The grade will be determined in the following way:  # of earned points/2 = % grade
The % grade is translated into a letter grade in the standard way, i.e.
100 - 90 % is A range, 89 - 80 % is B range, 79 - 70 % is C range, 69 - 60 % is D range, and F is below 60%.
See course SYLLABUS for details 
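
As an illustration only, the conversion from earned points to a letter range can be sketched in Python. The function names and the sample scores are hypothetical, and the treatment of fractional percentages (for example 89.5 %) is an assumption of this sketch; the course SYLLABUS remains the authority on the actual rule.

```python
def percent_grade(points_earned: float) -> float:
    """Convert points earned (out of 200) into a % grade: points / 2."""
    return points_earned / 2


def letter_range(percent: float) -> str:
    """Map a % grade to the letter range listed above.

    Assumption: each range starts at its lower bound (90, 80, 70, 60),
    so a fractional value such as 89.5 falls in the B range.
    """
    if percent >= 90:
        return "A"
    if percent >= 80:
        return "B"
    if percent >= 70:
        return "C"
    if percent >= 60:
        return "D"
    return "F"


# Hypothetical scores: TEST 1, TEST 2, Project, Final Presentation
scores = [62, 58, 27, 28]
total = sum(scores)              # 175 of 200 points
percent = percent_grade(total)   # 87.5
print(total, percent, letter_range(percent))
```

With these sample scores the student has 175 points, which is 87.5 % and falls in the B range.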

Records of students' grades are kept on Blackboard
Contact  TAs for information and questions about grading


PROJECT Description
Project Data: play around with the project data and familiarize yourself with it:   bakarydata.xls
Final Presentations description   HERE


TEST 1 Solutions
Syllabus Slides
PROJECT Description
TEST1  Review
 TEST2 Review


L1.  Chapter1 (1): Introduction
L2.  Chapter2 (2,3): Preprocessing
L2a.  Chapter 2 (2,3): Short Preprocessing
L2b. BOOK 3rd Edition Chapter 1 Overview
L3.  Chapter 3 (4,5): Data Warehouse
L4.  Chapter 6 (8, 9): Classification Introduction
L5.  Chapter 6: Classification Testing
L6.  Example: Data Preparation and Metaclassifiers
  Paper:  A model Proteins SSP Metaclassifiers
L7. Chapter 6: Decision Trees Introduction
L8.   Chapter 6: Decision Trees Full Algorithm
L9.   Chapter 6: Neural Networks
L10.   Modular Neural Network
L11.   Image Classification and Convolutional NN
L12.  Chapter 5 (6,7): Association Analysis
L13.   Association Analysis Review
L14.   Classification by Association
L15.  Chapter 6: Genetic Algorithms
L16.  Genetic Algorithms Examples
L17.  Chapter 7 (10):  Basics of Cluster Analysis
L18.  Chapter 7 (11,12): Cluster Analysis
L19.   Deep Learning
L20.   Text Data Mining
L21.   NLP-Natural Language Processing

L22.   Frequent Patterns Mining Basic - Book 3rd Edition chapter 6
L23.   Frequent Patterns Mining Advanced- Book 3rd Edition chapter 7


Here are some Lectures-Presentations for the FINAL REPORT - YOU CAN also USE only YOUR Own Sources
Bayes 1
Bayes 2
Genetic Algorithms Applications
Image Classification
NLP Models
Natural Language Processing
Opinion Mining
Clustering 1
Clustering 2
Regression 1
Regression 2
Regression 3
Text Mining 1
Text Mining 2
Web Mining 1
Web Mining 2

Data Mining Book Slides

Here are some  book slides - more to be posted
Book Chapter 2
Book Chapter 5
Book Chapter 6
Book Chapter 7


Datasets for data mining and knowledge discovery
Datasets for data mining competitions
University of California, Irvine KDD Archive
World Bank datasets

Academic Integrity Statement

Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always wrong. Any suspected instance of academic dishonesty will be reported to the Academic Judiciary. For more comprehensive information on academic integrity, including categories of academic dishonesty, please refer to the Academic Judiciary website

Stony Brook University Syllabus Statements - included in the course SYLLABUS