====== Data Mining (309AA) - 9 CFU A.Y. 2025/2026 ======

**Instructors:**
  * **Anna Monreale**
    * KDDLab, Università di Pisa
    * [[anna.monreale@unipi.it]]
  * **Mattia Setzu**
    * KDDLab, Università di Pisa
    * [[mattia.setzu@unipi.it]]

**Teaching Assistant:**
  * **Lorenzo Mannocci**
    * University of Pisa
    * [[lorenzo.mannocci@di.unipi.it]]

====== News ======

[23-09-2025]: Please register yourself and your group for the project. Group registration is available [[https://docs.google.com/spreadsheets/d/1Xl8Hd-giIuJQw0x2NDkXjbGZ2REGF-OukqC5XGU6pzA/edit?gid=0#gid=0|here]].

====== Learning Goals ======

The Data Mining course tackles the analysis of large collections of data and the extraction of information and patterns. It explores the core components of the Knowledge Discovery from Data (KDD) process, focusing on:
  * Data understanding
  * Data cleaning, preparation, and transformation
  * Data analysis: outlier detection and data representation
  * Data clustering
  * Pattern extraction: itemsets, rules, association rules, and sequential patterns
  * Inference models: trees and ensemble models
  * Responsible data use: privacy and interpretability

====== Schedule ======

**Classes**
^ Day of Week ^ Hour ^ Room ^
| Tuesday | 11:00 - 13:00 | Room C |
| Wednesday | 14:00 - 16:00 | Room C |
| Thursday | 14:00 - 16:00 | Room A1 |

**Office hours - Ricevimento:**
  * Anna Monreale: TBD. Online using Teams or in my office (appointment by email).
  * Mattia Setzu: information on [[https://unimap.unipi.it/cercapersone/dettaglio.php?ri=177323&template=dett_didattica.tpl|Unimap]]

A [[https://teams.microsoft.com/l/team/19%3Ai_Ge38xXm8FdnepLNud6ddbz_OECbBPRKfA1UKbUsQo1%40thread.tacv2/conversations?groupId=41e56778-e965-462a-9fef-250df0ee7055&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams Channel]] will be used ONLY to post news, Q&A, and other material related to the course.

The lectures will be held in person only and will **NOT** be live-streamed.

====== Teaching Material ======

**Books**
^ Title ^ Authors ^ Edition ^
| [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php|Introduction to Data Mining]] | Pang-Ning Tan, Michael Steinbach, Vipin Kumar | 2nd |
| [[https://link.springer.com/book/10.1007/978-3-031-48956-3|Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications]] | Laura Igual, Santi Seguí | 2nd |
| [[http://shop.oreilly.com/product/0636920034919.do|Python Data Science Handbook: Essential Tools for Working with Data]] | Jake VanderPlas | 1st |
| [[https://github.com/janishar/mit-deep-learning-book-pdf|Deep Learning]] | Ian Goodfellow, Yoshua Bengio, Aaron Courville | |
| [[https://math.mit.edu/~gs/linearalgebra/ila5/indexila5.html|Introduction to Linear Algebra]] | Gilbert Strang | 5th |

**Online tutorials**
^ Title ^ Authors ^
| [[https://brianmcfee.net/dstbook-site/content/intro.html|Digital Signals Theory]] | Brian McFee |
| [[https://rtavenar.github.io/blog/dtw.html|An introduction to Dynamic Time Warping]] | Romain Tavenard |
| [[https://github.com/msetzu/intro_to_ds_and_ml/blob/master/python/notebooks/Python.ipynb|Introduction to Python]] | Mattia Setzu |

**Slides**

The slides used in the course will be added to the calendar after each class. Some are taken from the slides provided by the textbook's authors: [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4|Slides for "Introduction to Data Mining"]].

====== Class Calendar (2025/2026) ======

===== First Semester =====

^ ^ Day ^ Topic ^ Teaching material ^ References ^ Teacher ^
| 1. | 18.09 | Course Overview. Introduction to Data Mining | {{ :magistraleinformatica:dmi:intro_dm.pdf |Introduction to DM}} | Chap. 1 Kumar Book | Setzu |
| | 23.09 | Canceled due to the teacher's health issues | | | |
| 2. | 24.09 | Data Understanding + Data Preparation | {{ :magistraleinformatica:dmi:data_understanding.pdf |}} {{ :magistraleinformatica:dmi:data_preparation_and_cleaning.pdf | Data Preparation}} | Chap. 2 Kumar Book, plus an additional resource from the Kumar Book: [[https://www-users.cs.umn.edu/~kumar001/dmbook/data_exploration_1st_edition.pdf|Data Exploration Chap.]] In the first edition of the Kumar Book, this is Chap. 3 | Setzu |
| 3. | 25.09 | Data representation | {{ :magistraleinformatica:dmi:data_representation.pdf |}} | Introduction to Linear Algebra (Sections 1, 3.1, 4.2, 6.1, 6.4, 6.5, 7.3), [[https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf|t-SNE paper]], [[https://arxiv.org/abs/1802.03426 | UMAP paper (Section 3)]] | Setzu |
| 4. | 30.09 | Data Cleaning + Transformations. PyLab: Data Understanding | {{ :magistraleinformatica:dmi:5-data_cleaning_transformation.pdf | Data Cleaning & Transformations}} | | Monreale, Mannocci |
| 5. | 01.10 | PyLab: Data Understanding + Preparation | {{ :magistraleinformatica:dmi:1_basics_and_understanding.ipynb.zip |}} {{ :magistraleinformatica:dmi:2_feature_engineering_and_data_representation.ipynb.zip |}} {{ :magistraleinformatica:dmi:data_notebook.zip |}} | | Monreale, Mannocci |
| 6. | 02.10 | Similarities + Introduction to Clustering and Centroid-based clustering | {{ :magistraleinformatica:dmi:6-data_similarity.pdf |}} {{ :magistraleinformatica:dmi:6-basic_cluster_analysis-intro.pdf |}} {{ :magistraleinformatica:dmi:8-basic_cluster_analysis-kmeans.pdf |}} | | Monreale |
| 7. | 07.10 | K-means | {{ :magistraleinformatica:dmi:8-basic_cluster_analysis-kmeans.pdf |}} | | Monreale |
| 8. | 08.10 | Hierarchical Clustering + Density Based Clustering + Validity | {{ :magistraleinformatica:dmi:9-basic_cluster_analysis-hierarchical.pdf |}} {{ :magistraleinformatica:dmi:8.basic_cluster_analysis-dbscan-validity.pdf |}} | | Monreale |
| 9. | 14.10 | Clustering evaluation and Python notebooks | {{ https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/dmi/12-basic_cluster_analysis-validity.pdf | Clustering validation}} {{ :magistraleinformatica:dmi:3_clustering.ipynb.zip |}} | | Setzu, Mannocci |
| 10. | 15.10 | Anomaly detection | {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/anomaly%20detection/Anomaly%20detection.html.pdf | Slides }} | | Setzu |
| 11. | 16.10 | Anomaly detection | {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/anomaly%20detection/Anomaly%20detection.html.pdf | Slides }} | | Setzu |

====== Exam ======

The exam can be taken in one of two ways:

**Project track**:
  * Project (70% of the final score), to be delivered after the end of the course
  * Oral exam (30% of the final score)

During the course, there will be some “Project presentation” sessions in which you will briefly (~3 minutes) present your work and receive feedback from the lecturers. These sessions do not contribute to your grade.

**Written test track**
  * Written exam (70% of the final score), taken after the end of the course during the exam sessions; it can include both theoretical questions and exercises.
  * Oral exam (30% of the final score)

Note that a passing grade in the project/written exam is required to be admitted to the oral exam.

**Project Guidelines:**

A project consists of data analyses carried out with data mining tools. It must be carried out by a team of 3 students, using Python. The guidelines specify the tasks to be addressed. Results must be reported in a single paper of at most 25 pages, including figures. Students must deliver both the paper (single column) and well-commented Python notebooks.

Specifically, if any of these tasks appear in the project track, make sure to focus on the following:

**Data understanding**
  * An analysis of all variables, their relations, distributions, and quality
  * Feature imputation and/or selection, where needed
  * The engineering of additional features, including the aforementioned analyses

**Clustering Analysis**
  * A properly justified feature selection phase
  * Tackling all clustering families, exploring their respective hyperparameters
  * An analysis of the best clusterings per family, including cluster description
  * A comparison of the best clusterings per family

**Anomaly detection**
  * A selection of outliers through appropriate algorithms
  * An interpretation of such outliers
  * An analysis of the impact of the outliers on the previously performed data understanding

**Time series analysis**
  * Appropriate representation choice for the task at hand

**Supervised learning**
  * Feature selection
  * Testing different families of models
  * Proper model validation, including both model performance and model complexity
  * Comparison of the best models of each family

**Explainability**
  * Justified selection of instances to explain
  * Analysis of the explanations

**Project and Deadlines**

Information about the dataset to be analyzed and the project description:
  * **Dataset.** https://drive.google.com/file/d/1K9garfm03-PFUMYyOenH9kqEJ7D5RrmD/view?usp=sharing
  * **Project description.** {{ :magistraleinformatica:dmi:data_mining_project.pdf |}}
  * **Project Questions & Answers.** https://docs.google.com/spreadsheets/d/1D6lMKJTGNtMiUuNGFsrQwMPQgAjgAEflw8F5_LNQlXM/edit?usp=sharing
  * **Deadline.**
  * **Delivery instructions.**

====== Previous years ======

[[DM-INF 2024-2025]]
[[DM-INF 2023-2024]]
[[DM-INF 2022-2023]]
[[DM-INF 2021-2022]]
[[DM-INF 2020-2021]]
[[http://didawiki.cli.di.unipi.it/doku.php/dm/dm.2019-20|DM-2019/20]]