Contents

Data Mining (309AA) - 9 CFU A.Y. 2025/2026
News
Learning Goals
Schedule
Teaching Material
- Past Exercises and past exams of similar courses
Class Calendar (2025/2026)
- First Semester
Exam
Previous years

Data Mining (309AA) - 9 CFU A.Y. 2025/2026

Instructors:

Anna Monreale
- KDDLab, Università di Pisa
- anna [dot] monreale [at] unipi [dot] it
Mattia Setzu
- KDDLab, Università di Pisa
- mattia [dot] setzu [at] unipi [dot] it

Teaching Assistant:

Lorenzo Mannocci
- University of Pisa
- lorenzo [dot] mannocci [at] di [dot] unipi [dot] it

News

[18-11-2025]: Project deadline available: January 5th, 2026.
[23-09-2025]: Please register yourself and your group for the project. Group registration is available here.

Learning Goals

The Data Mining course tackles the analysis of large collections of data, and the extraction of information and patterns. It aims to explore core components of the Knowledge Discovery from Data (KDD) process, and focuses on:

Data understanding
Data cleaning, preparation, and transformation
Data analysis: outlier detection and data representation
Data clustering
Pattern extraction: itemsets, rules, association rules, and sequential patterns
Inference models: trees and ensemble models
Responsible data use: privacy and interpretability

Schedule

Classes

Day of Week	Hour	Room
Tuesday	11:00 - 13:00	Room C
Wednesday	14:00 - 16:00	Room C
Thursday	14:00 - 16:00	Room A1

Office hours:

Anna Monreale: TBD. Online via Teams or in my office (appointment by email).
Mattia Setzu: information available on Unimap

A Teams channel will be used ONLY to post news, Q&A, and other material related to the course. Lectures will be held in person only and will NOT be live-streamed.

Teaching Material

Books

Title	Authors	Edition
Introduction to Data Mining	Pang-Ning Tan, Michael Steinbach, Vipin Kumar	2nd
Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications	Laura Igual, Santi Seguí	2nd
Python Data Science Handbook: Essential Tools for Working with Data	Jake VanderPlas	1st
Deep Learning	Ian Goodfellow, Yoshua Bengio, Aaron Courville
Introduction to Linear Algebra	Gilbert Strang	5th

Online tutorials

Title	Authors
Digital Signals Theory	Brian McFee
An introduction to Dynamic Time Warping	Romain Tavenard
Introduction to Python	Mattia Setzu

Slides

The slides used in the course will be added to the calendar after each class. Some are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".

Past Exercises and past exams of similar courses

Exercises on Clustering: ex._clustering.pdf
Some texts of past exams of a similar course: 2017-1-19.pdf, 2017-9-6.pdf, 2016-05-30-dm1-seconda.pdf, dm2_exam.2017.06.13_solutions.pdf, dm2_exam.2017.07.04_solutions.pdf, dm2_mid-term_exam.2017.06.06_solutions.pdf
Some exercises (partially with solutions) on sequential patterns and time series can be found in the following exam texts from past years: dm2_exam.2015.04.13.results.pdf, dm2_exam.2016.04.4_sol.pdf, dm2_exam.2016.04.5_sol.pdf, dm2_exam.2016.06.20_sol.pdf, dm2_exam.2016.07.08_sol.pdf
Some very old exercises (part of them with solutions) are available here. Most of them are in Italian, and not all of them cover topics in this year's program: Verifica 2006, Verifica 2005 (con soluzioni), Verifica 2004, Verifica 5 giugno 2007, Verifica 26 giugno 2007, Verifica 24 luglio 2007 (e Soluzioni), Verifica 18 luglio 2008 - parte 1, Verifica 18 luglio 2008 - parte 2, Exam with solution 2010-06-01, Exam with solution 2010-06-22, Exam with solution 2010-09-09, Exam with solution 2010-07-13

Class Calendar (2025/2026)

First Semester

	Day	Topic	Teaching material	References	Teacher
1.	18.09	Course Overview. Introduction to Data Mining	Introduction to DM	Chap. 1 Kumar Book	Setzu
	23.09	Canceled due to the teacher's health issues
2.	24.09	Data Understanding + Data Preparation	data_understanding.pdf Data Preparation	Chap. 2 of the Kumar book and the additional resource of the Kumar book: Data Exploration chapter (this is Chap. 3 if you have the first edition)	Setzu
3.	25.09	Data representation	data_representation.pdf	References: Introduction to linear algebra (Sections 1, 3.1, 4.2, 6.1, 6.4, 6.5, 7.3), t-SNE paper, UMAP paper (Section 3)	Setzu
4.	30.09	Data Cleaning + Transformations. PyLab: Data Understanding	Data Cleaning & Transformations		Monreale, Mannocci
5.	01.10	PyLab: Data Understanding + Preparation	1_basics_and_understanding.ipynb.zip 2_feature_engineering_and_data_representation.ipynb.zip data_notebook.zip		Monreale, Mannocci
6.	02.10	Similarities + Introduction to Clustering and Centroid-based clustering	6-data_similarity.pdf 6-basic_cluster_analysis-intro.pdf 8-basic_cluster_analysis-kmeans.pdf		Monreale
7.	07.10	K-means	8-basic_cluster_analysis-kmeans.pdf		Monreale
8.	08.10	Hierarchical Clustering + Density Based Clustering + Validity	9-basic_cluster_analysis-hierarchical.pdf 8.basic_cluster_analysis-dbscan-validity.pdf		Monreale
9.	14.10	Clustering evaluation and Python notebooks	Clustering validation 3_clustering.ipynb.zip		Setzu, Mannocci
10.	15.10	Anomaly detection	Slides		Setzu
11.	16.10	Anomaly detection	Slides, Notebook, Rule extraction from isolation forests		Setzu
12.	21.10	Variants of K-means + Association Rule Mining	11-basic_cluster_analysis-kmeans-variants.pdf 17_association_analysis2023.pdf		Monreale
13.	22.10	Association Rule Mining: Apriori	17_association_analysis2023.pdf		Monreale
14.	23.10	Association Rule Mining: CORELS	Slides, Online tool		Setzu
15.	28.10	Visual Analytics	Slides Code for data visualization with Altair		Monreale, Rinzivillo
16.	29.10	Association Rule Mining: FP-Growth + Sequential Pattern Mining	FP-Growth SPM		Monreale
	30.10	Lecture is canceled
17.	04.11	Sequential Pattern Mining with time constraints + Python Lab: FPM + SPM.	For SPM the same set of slides used in the previous lecture 5_patternmining.ipynb.zip		Monreale
18.	05.11	Supervised learning and classification	Slides		Setzu
19.	06.11	Classification: Decision Trees	Decision Trees Video		Monreale
20.	07.11	Classification: Decision Trees			Monreale
21.	11.11	Classification: Decision Trees & evaluation + Decision Rules	Evaluation Decision Rules		Monreale
22.	12.11	Classification: Decision Rules + Instance based methods + Q&A for Project work	10-knn.pdf		Monreale
23.	13.11	Exercises: DT simulation, clustering, sequences	dt-learning-simulation.pdf learnedtree.pdf 2025-ex-clustering.pdf ex-sequences.pdf		Monreale
24.	18.11	Advanced Decision Trees, GAMs, and ensemble models	Slides		Setzu
25.	25.11	Neural networks	Slides		Setzu
26.	26.11	Time series, Python Supervised Learning & Imbalanced Scenarios	Slides supervised_learning.zip data_notebook.zip		Setzu, Mannocci
27.	27.11	Time series, Python Supervised Learning & Imbalanced Scenarios	Slides, Slides in HTML (with working animation)		Setzu
28.	02.12	Shapelet-based Classification, Motif discovery	Slides	shaplet.pdf matrixprofile.pdf Papers and resources on motifs	Monreale
29.	03.12	Py: Time Series	timeseries.zip		Monreale, Mannocci
30.	04.12	Responsible AI: introduction and EU Regulations	Slides		Monreale
31.	09.12	Responsible AI: privacy	Same slides as the previous lecture	chap-anonymity.pdf MIA attack against ML	Monreale
32.	10.12	Responsible AI: Explainable AI	XAI	Digital book where students can find some basic XAI models and notions, XAI Survey describing the taxonomy and dimensions of XAI, LORE approach, ABELE approach, LASTS, SHAP, LIME	Monreale
33.	11.12	XAI Python Notebook + Private and explainable FL, Assessing privacy in XAI	XAI Notebook Slides	GLOR-FLEX FASTSHAP++ REVEAL	Naretto
34.	16.12	Project Presentations - second check - ONLINE - MANDATORY
35.	17.12	Project Presentations - second check - ONLINE - MANDATORY
36.	18.12	Project Presentations - second check - ONLINE - MANDATORY

Exam

The exam can be taken in one of two ways:

Project track:

Project (70% of the final score) to be delivered after the end of the course
Oral exam (30% of the final score)

During the course, you will have some “Project presentation” sessions wherein you’ll briefly (~3 minutes) present your work, and receive feedback from the lecturers. These sessions do not contribute to your grade.

Written test track:

Written exam (70% of the final score): taken during the exam sessions after the end of the course; it can include both theoretical questions and exercises.
Oral exam (30% of the final score)

Note that a passing grade for the project/written exam is required to be admitted to the oral exam.
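
As a rough illustration of the weighting described above, the final score can be sketched as a weighted sum of the two parts. The function name, the `passing_grade` parameter, and the 30-point scale used in the example are assumptions made for illustration; this page does not state the passing threshold, the scale, or how the result is rounded.

```python
def final_score(first_part: float, oral: float, passing_grade: float) -> float:
    """Combine the project (or written exam) score with the oral exam score.

    The 70% / 30% weights come from the exam description.
    `passing_grade` is a hypothetical parameter: the threshold is not
    stated on this page.
    """
    if first_part < passing_grade:
        # A passing project/written grade is required for the oral exam.
        raise ValueError("project/written grade is below the passing grade")
    return 0.7 * first_part + 0.3 * oral


# Example on an assumed 30-point scale with an assumed threshold of 18.
print(final_score(first_part=28, oral=24, passing_grade=18))
```

The same function covers both tracks, since the project and the written exam carry the same weight.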

Project Guidelines: A project consists of data analyses carried out with data mining tools. The project must be completed by a team of 3 students, using Python. The guidelines require addressing specific tasks. Results must be reported in a single paper, whose total length must be at most 25 pages of text, including figures. The students must deliver both the paper (single column) and well-commented Python notebooks.

Specifically, if any of these tasks appear in the project track, make sure to focus on the following:

Data understanding

An analysis of all variables, their relations, distributions, and quality
Feature imputation and/or selection, where needed
The engineering of additional features, including the aforementioned analyses

Clustering Analysis

A properly justified feature selection phase
Tackling all clustering families, exploring their respective hyperparameters
An analysis of the best clusterings per family, including cluster description
A comparison of the best clusterings per family

Anomaly detection

A selection of outliers through appropriate algorithms
An interpretation of such outliers
An analysis of the impact of the outliers on the previously performed data understanding

Time series analysis

Appropriate representation choice for the task at hand

Supervised learning

Feature selection
Test different families of models
Proper model validation, including both model performance and model complexity
Comparison of the best models of each family

Explainability

Justified selection of instances to explain
Analysis of the explanations

Project and Deadlines: Information about the dataset to be analyzed and the project description:

Dataset. https://drive.google.com/file/d/1K9garfm03-PFUMYyOenH9kqEJ7D5RrmD/view?usp=sharing
Project description. data_mining_project.pdf
Project description Task 4. data_mining_project2.pdf
Dataset Task 4. https://drive.google.com/file/d/1Li2roWMoREN6_nKy-trB7pXWDA1xkAzh/view?usp=sharing
Project description Task 5.
Project Questions & Answers.
Complete Project Description
Deadline. January 5th, 2026.
Delivery instructions. The final deadline of the project is 5th January 2026 at 23:59. This deadline is STRICT. No extension is possible because the winter exam session starts afterwards. Groups that do not deliver the project by 5th January must take the written exam during the exam sessions. Each group must deliver by email to anna.monreale@unipi.it, mattia.setzu@unipi.it, lorenzo.mannocci@di.unipi.it a zipped folder named DM_GroupID.zip containing 4 folders and 1 pdf file:
- a folder named DM_GroupID_TASK1, containing the source code of data understanding;
- a folder named DM_GroupID_TASK2, containing the source code of data clustering;
- a folder named DM_GroupID_TASK3, containing the source code of classification and explanation analysis;
- a folder named DM_GroupID_TASK4, containing the source code of time series analysis;
- a pdf file of at most 25+2 pages, including figures, discussing the results of the tasks (25 pages for tasks 1-4 and 2 pages for task 5). The name of this file must be DM_Report_GroupID.pdf, and the file must contain the list of authors (i.e., members of the group).
The subject of the email must be “DMProject25_GroupID”.
How to book the oral exam (colloquium)? On https://esami.unipi.it/ you can find the exam dates: one in January and one in February. Each student must register for one of the 2 dates. These are not the dates of the oral exam itself; we will use the list of registered students to organize the oral exams and will share a calendar with you.
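
The folder layout requested in the delivery instructions can be checked and packaged with a short script. The following is a minimal sketch, not an official tool: the function names, the `source_dir` working folder, and the choice to place the four task folders and the report at the root of the archive are all assumptions, since the instructions only fix the names of the zip file, the folders, and the report.

```python
from pathlib import Path
import zipfile

# Task suffixes taken from the delivery instructions.
TASK_FOLDERS = ("TASK1", "TASK2", "TASK3", "TASK4")


def expected_entries(group_id: str) -> list[str]:
    """Top-level names required by the delivery instructions."""
    folders = [f"DM_{group_id}_{task}" for task in TASK_FOLDERS]
    return folders + [f"DM_Report_{group_id}.pdf"]


def missing_entries(source_dir: Path, group_id: str) -> list[str]:
    """Return the required folders or report that are absent from source_dir."""
    missing = []
    for name in expected_entries(group_id):
        path = source_dir / name
        present = path.is_file() if name.endswith(".pdf") else path.is_dir()
        if not present:
            missing.append(name)
    return missing


def build_archive(source_dir: Path, group_id: str) -> Path:
    """Create DM_<group_id>.zip next to source_dir (hypothetical location)."""
    missing = missing_entries(source_dir, group_id)
    if missing:
        raise FileNotFoundError(f"missing required entries: {missing}")
    archive = source_dir.parent / f"DM_{group_id}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in expected_entries(group_id):
            path = source_dir / name
            if path.is_file():
                zf.write(path, arcname=name)
                continue
            # Add every file inside the task folder, keeping relative paths.
            for file in sorted(path.rglob("*")):
                if file.is_file():
                    zf.write(file, arcname=file.relative_to(source_dir).as_posix())
    return archive
```

For a group with the hypothetical ID `07`, calling `build_archive(Path("delivery"), "07")` would produce `DM_07.zip` beside the `delivery` folder, provided `delivery` contains `DM_07_TASK1` to `DM_07_TASK4` and `DM_Report_07.pdf`. Empty folders are not written to the archive in this sketch.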

Previous years

Data Mining (309AA) - 9 CFU A.Y. 2024/2025

Data Mining (309AA) - 9 CFU A.Y. 2023/2024

DM-INF 2022-2023

Data Mining (309AA) - 9 CFU A.Y. 2021/2022

Data Mining (309AA) - 9 CFU A.Y. 2020/2021

DM-2019/20