
Data Mining (309AA) - 9 CFU A.Y. 2021/2022

Instructor:

Teaching Assistant:

News

Learning Goals

  • Fundamental concepts of knowledge discovery from data
  • Data understanding
  • Data preparation
  • Clustering
  • Classification
  • Pattern Mining and Association Rules
  • Outlier Detection
  • Time Series Analysis
  • Sequential Pattern Mining
  • Ethical Issues

Hours and Rooms

Classes

Day of Week    Hour             Room
Wednesday      14:00 - 16:00    Room C - Online
Thursday       14:00 - 16:00    Room C - Online
Friday         09:00 - 11:00    Room A1 - Online

Office hours:

  • Anna Monreale: Wednesday 11:00-13:00, online via Teams (appointment by email)
  • Francesca Naretto: Monday 15:00-18:00, online via Teams (appointment by email)

Learning Material

Textbook

Slides

Software

  • Python - Anaconda (use version 3.7): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
  • Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
  • Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
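A quick way to confirm that an installation matches the list above is to check the interpreter version and whether the two libraries can be found. The following is only a minimal sketch: the helper names `check_environment` and `python_version` are hypothetical, and it assumes the usual import names `sklearn` (for Scikit-learn) and `pandas`.

```python
import sys
from importlib import util

# Assumption: import names of the libraries listed above
# (Scikit-learn is imported as "sklearn").
REQUIRED = ("sklearn", "pandas")


def check_environment(required=REQUIRED):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: util.find_spec(name) is not None for name in required}


def python_version():
    """Return the running interpreter version as a (major, minor) tuple."""
    return sys.version_info[:2]


if __name__ == "__main__":
    major, minor = python_version()
    print(f"Python {major}.{minor} (the course asks for 3.7)")
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

The check only locates the modules without importing them, so it runs quickly and does not fail when a library is absent.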

Class Calendar (2021/2022)

First Semester

Day Topic Learning material References Video Lectures
15.09 14:15-16:00 Lecture canceled
1. 16.09 14:15-16:00 Overview. Introduction to KDD 2021-1-overview.pdf 1-intro-dm.pdf Chap. 1 Kumar Book Video 1 Video 2
2. 17.09 09:00-10:45 Data Understanding Slides DU Chap. 2 Kumar Book and additional resource of Kumar Book: Exploring Data. If you have the first ed. of Kumar, this is Chap. 3 Video 1 Video 2
3. 22.09 14:15-16:00 Data Understanding + Data Preparation 3-data_preparation.pdf Chap. 2 Kumar Book Video
4. 23.09 14:15-16:00 Data Preparation + Data Similarities. 4-data_similarity.pdf Data Similarity is in Chap. 2
5. 24.09 09:00-10:45 Introduction to Clustering. Center-based clustering: kmeans 5-basic_cluster_analysis-intro.pdf 6.1-basic_cluster_analysis-kmeans.pdf Clustering is in Chap. 7
6. 29.09 14:15-16:00 Hierarchical clustering 7.basic_cluster_analysis-hierarchical.pdf Chap. 7 Kumar Book
7. 30.09 14:15-16:00 Density based clustering. Clustering validity. Lab. DU 8.basic_cluster_analysis-dbscan-validity.pdf Notebook DU tips Another Notebook on DU Chap. 7 Kumar Book
8. 01.10 09:00-10:45 Python Lab - Clustering Notebook Clustering Tips
9. 06.10 14:15-16:00 Center-based clustering: Bisecting K-means, Xmeans, EM 6.2-basic_cluster_analysis-kmeans-variants.pdf Chap. 7 Kumar Book, clusteringmixturemodels.pdf xmeans.pdf
10. 07.10 14:15-16:00 Classification Problem. Decision Trees 9.chap3_basic_classification-2020.pdf Chap. 3 Kumar Book
08.10 09:00-10:45 Lecture canceled
11. 13.10 14:15-16:00 Decision Trees + Classifier Evaluation same slides of the previous lecture Chap. 3 Kumar Book
12. 14.10 14:15-16:00 Evaluation Methods for Classification Models same slides of the previous lecture Chap. 3 Kumar Book
13. 15.10 09:00-10:45 Statistical tool for model evaluation + Rule based classification 10-rule-based-clussifiers.pdf Chap. 3 Kumar Book + Chap. 4 Kumar Book
14. 20.10 14:15-16:00 Rule based classification + Instance-based Classification 10-knn.pdf Chap. 4 Kumar Book
15. 21.10 14:15-16:00 Exercise on DT learning + Naive Bayesian Classifier 11_2021-naive_bayes.pdf 2021-dt-ex.pdf Chap. 4 Kumar Book
16. 22.10 09:00-10:45 SVM & Ensemble Classifiers 14_svm_2020.pdf 13_ensemble_2020.pdf Chap. 4 Kumar Book
17. 27.10 14:15-16:00 Neural Networks 15_neural_networks_2021.pdf Chap. 4 Kumar Book
18. 28.10 14:15-16:00 Python Lab on Classification adult_classification_2021.ipynb.zip
29.10 09:00-10:45 Lecture canceled
19. 03.11 14:15-16:00 Python Lab on Classification + Association Rule Mining classificationpython2.zip 17_association_analysis2021.pdf Chap.5 Association Rules: Kumar Book
20. 04.11 14:15-16:00 Association Rule Mining Chap.5 Association Rules: Kumar Book
21. 05.11 09:00-10:45 FP-Growth - Sequential Pattern Mining 17_2021-fp-growth.pdf Chap.6 Kumar Book
22. 10.11 14:15-16:00 Sequential Pattern Mining 18_sequential_patterns_2021.pdf Chap.7 Kumar Book
23. 11.11 14:15-16:00 Time Series Similarities, Transformations & Clustering 22_time_series_similarity_2021.pdf Overview on DM for time series
24. 12.11 09:00-10:45 Motif & Shapelet Discovery 23_time_series_shapelets-motif-2021.pdf matrixprofile.pdf shaplet.pdf
25. 17.11 14:15-16:00 Lab: Association Rules & Sequential pattern mining by Python arm-spm.zip
26. 18.11 14:15-16:00 Ethics & Privacy 19_ethics_privacy2021.pdf > Overview on Privacy allegato11-cpdp13.pdf Privacy by design
27. 19.11 09:00-10:45 Lab: Time series timeseries-py.zip
28. 24.11 14:15-16:00 Explainability 20_explainability_2021.pdf Material: LORE LIME Survey ABELE SHAP LASTS
29. 25.11 14:15-16:00 Explainability + LAB XAI xai-lab.zip
30. 26.11 09:00-10:45 LAB XAI + Anomaly Detection AD&OD
31. 01.12 14:15-16:00 Anomaly Detection + Lab ADPY
32. 02.12 14:15-16:00 CRISP-DM crisp-dm.pdf
03.12 09:00-10:45 Lecture canceled
33. 15.12 14:15-16:00 Room C Paper Presentation
34. 16.12 14:15-16:00 Room C Paper Presentation
35. 17.12 09:00-12:45 Room C Paper Presentation

Exams

Mid-term Project

The project consists of data analyses carried out with data mining tools. It must be done in Python by a team of 2 or 3 students. The guidelines require specific tasks to be addressed. Results must be reported in a single paper of at most 25 pages, including figures. Students must deliver both the paper (single column) and well-commented Python notebooks.

  1. Dataset: Dataset
  2. Deadline: the first part must be delivered by November 15th, 2021 (originally November 5th). Send an email to: anna.monreale@unipi.it and francesca.naretto@sns.it
  1. Note that the document also contains the rules for the delivery and the final exam.
  2. Data for time series analysis: CityTemp
  3. Deadline: 5th January 2022

Students who did not deliver the above project by 5th January 2022 must email the teachers to request a new project. The assigned project will require about 2 weeks of work and, once delivered, will be discussed during the oral exam.

Paper Presentation (OPTIONAL)

Students may present a research paper (made available by the teacher) during the last week of the course. This presentation is OPTIONAL: students who choose to do it can skip the open-question part of the oral exam and only need to present the project (see next point). The paper presentation can be done by the group or by a single person.

Oral Exam

  • Project presentation (with slides) – 10 minutes: mandatory for all the students
  • Open questions on the entire program: optional only for students opting for paper presentation.

Reading About the "Data Scientist" Job

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

  • Data, data everywhere. The Economist, Feb. 2010 download
  • Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
  • Welcome to the yotta world. The Economist, Sept. 2011 download
  • Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
  • Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
  • Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
  • Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
  • Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Previous years

magistraleinformatica/dmi/dm-inf_2021-2022.txt · Last modified: 04/11/2022 at 12:11 (17 months ago) by Salvatore Ruggieri