
Data Mining A.A. 2019/20

DM 1: Foundations of Data Mining (6 CFU)

Instructors:

DM 2: Advanced topics on Data Mining and case studies (6 CFU)

Instructors:

DM: Data Mining (9 CFU)

Instructors:

News

  • [03.10.2019] Please fill in the spreadsheet with the name of your group (Group1, Group2, …) and the list of students in the group.
  • [26.09.2019] Global Climate Strike: the teachers of the DM course will join the Global Climate Strike tomorrow, Friday September 27, so tomorrow's lecture is cancelled.
  • [18.09.2019] Event: “Privacy: limite o opportunità? Gli esempi delle Nuove Tecnologie e dei Dati Sanitari” Information here.

Learning goals

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

The wide availability of data from relational databases, the web and other sources motivates the study of data analysis techniques that enable a better understanding and an easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge extraction process, to the main data mining techniques and to the related algorithms. Particular emphasis is given to the methodological aspects, presented through some paradigmatic classes of applications such as Market Basket Analysis, market segmentation and fraud detection. Finally, the course introduces the privacy and ethical aspects of using inference techniques on data, of which the analyst must be aware. The course consists of the following parts:

  1. the basic concepts of the knowledge extraction process: data understanding and preparation, forms of data, data measures and similarity;
  2. the main data mining techniques (association rules, classification and clustering), studied in both their formal and implementation aspects;
  3. some case studies in marketing and customer management support, fraud detection and epidemiological studies;
  4. the last part of the course introduces the privacy and ethical aspects of using inference techniques on data, of which the analyst must be aware.

Reading about the "data scientist" job

  • Data, data everywhere. The Economist, Feb. 2010 download
  • Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
  • Welcome to the yotta world. The Economist, Sept. 2011 download
  • Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
  • Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
  • Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
  • Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
  • Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Hours and Rooms

DM1 & DM

Classes

| Day of Week | Hour | Room |
| Monday | 14:00 - 16:00 | Room E1 |
| Wednesday | 16:00 - 18:00 | Room A1 |
| Friday | 11:00 - 13:00 | Room C1 |

Office hours:

  • Prof. Pedreschi: Monday 14:00 - 16:00, Dipartimento di Informatica
  • Prof. Monreale: Monday 09:00 - 11:00, Dipartimento di Informatica

DM 2

Classes

| Day of Week | Hour | Room |
| Thursday | 14:00 - 16:00 | Room A1 |
| Friday | 16:00 - 18:00 | Room C1 |

Office hours:

  • Nanni: by appointment via email, c/o ISTI-CNR

Learning Material

Textbooks

  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
  • Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st edition, 2010. ISBN 978-1-84882-259-7
  • Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed., 2017.

Slides of the classes

Past Exams

  • Some texts of past exams on DM1 (6 CFU):

  • Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers, DM1 (9 CFU):

  • Some exercises (partially with solutions) on sequential patterns and time series can be found in the following exam texts from recent years:

Data mining software

  • KNIME, the Konstanz Information Miner. Download page
  • Python - Anaconda (version 3.7): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
  • Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
  • Pandas: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
  • WEKA: data mining software in Java. University of Waikato, New Zealand. Download page

Class calendar (2019/2020)

First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining)

| Day | Topic | Learning material | Instructor |
| 1. 16.09 14:00-16:00 | Overview. Introduction to KDD | Course Overview Introduction DM | Pedreschi |
| 18.09 16:00-18:00 | Lecture cancelled (event at Scuola S. Anna; information in the News section of this page) | | Pedreschi |
| 2. 20.09 11:00-13:00 | Introduction to KDD: technologies, applications and data | | Pedreschi |
| 3. 23.09 14:00-16:00 | Data Understanding (from the Berthold book) | Slides DU; Slides on Descriptive Statistics, useful for clarifying some statistical notions (unfortunately this material is only in Italian) | Monreale |
| 4. 25.09 16:00-18:00 | Data Preparation | Slides DP | Monreale |
| 27.09 11:00-13:00 | Climate Strike | | |
| 5. 30.09 14:00-16:00 | Introduction to Python | Python Introduction | Monreale |
| 6. 02.10 16:00-18:00 | Clustering: Introduction + Centroid-based clustering, K-means | Clustering: Intro and K-means | Pedreschi |
| 7. 04.10 11:00-13:00 | Lab: Data Understanding & Preparation in Knime | Knime: 01_data_understanding.zip Data: Titanic File | Monreale |
| 8. 07.10 14:00-16:00 | Lab: DU Python + Project presentation | Python: titanic_data_understanding2.ipynb.zip | Monreale |
| 9. 09.10 16:00-18:00 | Clustering: K-means + Hierarchical | 5.basic_cluster_analysis-hierarchical.pdf | Monreale |
| 10. 11.10 11:00-13:00 | Cancelled for the Internet Festival | | Pedreschi |
| 11. 14.10 14:00-16:00 | Clustering: DBSCAN & Validity | 6.basic_cluster_analysis-dbscan-validity.pdf | Pedreschi |
| 12. 16.10 16:00-18:00 | Exercises on Clustering | Tool for DM ex.: Didactic Data Mining | Monreale |
| 13. 18.10 11:00-13:00 | Lab: Clustering | | Monreale |
| 14. 21.10 14:00-16:00 | Classification | | Pedreschi |
| 15. 23.10 16:00-18:00 | Classification | | Pedreschi |
| 16. 25.10 11:00-13:00 | Classification | | Pedreschi / Milli |
| 17. 28.10 14:00-16:00 | Lab: Classification | | Monreale |
| 18. 30.10 16:00-18:00 | Exercises on Classification + Discussion on Clustering | | Monreale |
| 19. 04.11 11:00-13:00 | Pattern Mining | | Pedreschi |
| 20. 06.11 16:00-18:00 | Pattern Mining | | Pedreschi |
| 08-14.11 | Project work | | |
| 21. 15.11 11:00-13:00 | Exercises and Lab on Pattern Mining | | Monreale |
| 18.11 14:00-16:00 | Cancelled | | |
| 20.11 16:00-18:00 | Cancelled | | |
| 22. 22.11 11:00-13:00 | Exercises on Classification | | Monreale |
The next classes are dedicated to DM (9 CFU).

Second part of course, second semester (DMA - Data mining: advanced topics and case studies)

| Day | Room | Topic | Learning material | Instructor (default: Nanni) |
| 1. 21.02.2019 14:00-16:00 | A1 | Introduction + Sequential patterns/1 | Introduction, Sequential patterns | |
| 2. 22.02.2019 16:00-18:00 | C1 | Sequential patterns/2 | | |
| 3. 01.03.2019 16:00-18:00 | C1 | Sequential patterns/3 | Sample exercises (fixed) | |
| 4. 07.03.2019 14:00-16:00 | A1 | Sequential patterns/4 | Sequential pattern tools: Link to SPMF + Sample datasets, Python2 GSP educational implementation (source), PrefixSpan-py (requires Python3) | |
| 5. 08.03.2019 16:00-18:00 | C1 | Time series/1 | Time series | |
| 6. 14.03.2019 14:00-16:00 | A1 | Time series/2 | Overview on DM for time series, DTW paper by Sakoe and Chiba, 1978 | |
| 7. 15.03.2019 16:00-18:00 | C1 | Time series/3 | | |
| 8. 21.03.2019 14:00-16:00 | A1 | Time series/4 | Preprocessing in Python DTW in Python | |
| 9. 22.03.2019 16:00-18:00 | C1 | Time series/5 | | |
| 10. 28.03.2019 14:00-16:00 | A1 | Exercises for mid-term exam | Exercises from past exams | |
| 11. 29.03.2019 16:00-18:00 | C1 | Exercises for mid-term exam | Exercises from past exams (with some solutions) | |
| 04.04.2019 16:00-18:00 | A1 + E | Mid-term exam | | |
| 11. 11.04.2019 14:00-16:00 | A1 | Classification: alternative methods/1 | kNN and Bayes classifier | |
| 12. 12.04.2019 16:00-18:00 | C1 | Classification: alternative methods/2 | NN and SVM, Exercises | |
| 02.05.2019 14:00-16:00 | A1 | Cancelled | | |
| 13. 03.05.2019 16:00-18:00 | C1 | Classification: alternative methods/3 | | |
| 14. 09.05.2019 14:00-16:00 | A1 | Classification: alternative methods/4 | Ex. on NNs and SVM, Ex. on KNN and Naive Bayes | |
| 15. 10.05.2019 16:00-18:00 | C1 | Classification: Model Evaluation | Model performances | |
| 16. 16.05.2019 14:00-16:00 | A1 | Classification: Model Evaluation | Unbalanced data, Classification weights | |
| 17. 17.05.2019 16:00-18:00 | C1 | Classification: alternative methods/5 | Ensembles, Homeworks! | |
| 18. 23.05.2019 14:00-16:00 | A1 | Exercises + Outlier detection/1 | Ex. on Lift chart, Ex. on Ensembles, Outlier detection | |
| 19. 24.05.2019 16:00-18:00 | C1 | Outlier detection/2 | Ex. on outliers, Ex. from past exams | |
| 20. 31.05.2019 16:00-18:00 | C1 | Due to a strike, the lesson will not take place. For your convenience, here is some material you can use: Examples of classification and validation in Python, Examples of outlier detection in Python, CRISP-DM guidelines. Feel free to contact me if you need clarifications. Remark: the CRISP-DM model will not be part of the exam program. | | |
| 06.06.2019 16:00-18:00 | E (+A1) | Mid-term exam | 2nd mid-term of last year and its solutions (careful: they were not double-checked). | |

Exams

Exam DM part I (DMF)

The exam is composed of three parts:

  • A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the mid-term test of December.
  • An oral exam (optional), which includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam. It is optional for students who pass the written part ONLY through the mid-term test.
  • A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, clustering analysis, frequent pattern mining, and classification. The project has to be carried out by groups of min 3, max 4 people, using Knime, Python or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 20 pages of text including figures. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM 2018-2019] Project 2” in the subject.

Tasks of the project:

  1. Data Understanding: Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations. (see Guidelines for details)
  2. Clustering analysis: Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details)
  3. Classification: Explore the dataset using classification trees and random forest. Use them to predict the target variable. (see Guidelines for details)
  4. Association Rules: Explore the dataset using frequent pattern mining and association rule extraction. Then use the rules to predict a variable, either to replace missing values or to predict the target variable. (see Guidelines for details)
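
The quantities behind Task 4 can be illustrated with a minimal sketch in plain Python. It is not the method required by the guidelines: the toy transactions, the function names and the brute-force enumeration are assumptions made only for illustration, and a real project would rely on the tools listed in the software section.

```python
from itertools import combinations

# Hypothetical toy transactions (not the project dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent.

    Computed as support(antecedent ∪ consequent) / support(antecedent).
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for frequent itemsets with at most `max_size` items.

    This enumerates every candidate, so it is only suitable for toy data.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


if __name__ == "__main__":
    for itemset, s in frequent_itemsets(transactions, min_support=0.6).items():
        print(itemset, s)
    print(confidence({"diapers"}, {"beer"}, transactions))
```

A rule such as diapers -> beer can then be used as a simple predictor: whenever the antecedent holds in a record, the consequent is predicted, and the confidence estimates how often that prediction is right on the training data.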

Guidelines for the project are here.

Exam DM part II (DMA)

The exam is composed of three parts:

  • A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of April and June.
  • An oral exam, that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
  • A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: sequential patterns, time series, classification (alternative methods and validation), outlier detection. The project has to be carried out by groups of max 3 people, using Knime, Python, other software or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 20 pages of text including figures. The project must be delivered at least 2 days before the oral exam.
    • Dataset: the data is a time series dataset on air quality, which can be downloaded here: Dataset.
    • Task 1: Time series: Consider only attribute “PT08.S1(CO)” and split the corresponding time series into daily series, deleting those with too many missing values (value = -200) and fixing the others in some way. Also make sure that all time series have 24 values. Compute a clustering (with an algorithm of your choice) based on DTW and on Euclidean distance, and compare the results.
    • Task 2: Sequential patterns: discover contiguous sequential patterns of length at least 4. Before that, the time series should be discretized in some way.
    • Task 3: Classification methods: define a target variable “WE” for the time series dataset, set to “true” for weekend days and “false” for the others. Test the K-NN classification method using DTW as distance measure, and at least one other classification method using the 24 values as separate variables.
    • Task 4: Outlier detection: from the original dataset (i.e. the raw records with all attributes, not the time series built only on the “PT08.S1(CO)” attribute), identify the top 1% outliers. Adopt at least two different methods belonging to different families (e.g. model-based, distance-based, density-based, angle-based, …) to identify the 1% of input records with the highest likelihood of being outliers, and compare the results. Before doing the analysis, the records containing missing values should be deleted to avoid trivial results.
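
The preprocessing and the two distances mentioned in Tasks 1 and 3 can be sketched in plain Python as follows. This is only one possible reading of the task text: the threshold `max_missing`, the mean-filling rule, the assumption that the hourly series starts at the first hour of a day, and all function names are illustrative choices, not requirements of the project. The clustering and the K-NN classifier themselves are not shown.

```python
import math

MISSING = -200  # missing-value marker given in the task description


def split_daily(values, hours_per_day=24, max_missing=4):
    """Split an hourly series into daily series of `hours_per_day` values.

    Assumes the series starts at the first hour of a day. Days with more
    than `max_missing` missing readings are dropped; the remaining gaps are
    filled with the mean of the valid readings of the same day. Both rules
    are illustrative assumptions. Incomplete trailing days are discarded.
    """
    days = []
    for start in range(0, len(values) - hours_per_day + 1, hours_per_day):
        day = values[start:start + hours_per_day]
        valid = [v for v in day if v != MISSING]
        if not valid or len(day) - len(valid) > max_missing:
            continue
        mean = sum(valid) / len(valid)
        days.append([mean if v == MISSING else v for v in day])
    return days


def euclidean(a, b):
    """Euclidean distance between two series of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b, window=None):
    """Dynamic time warping distance with squared point-wise cost.

    `window` is an optional Sakoe-Chiba band width; None means no constraint.
    """
    n, m = len(a), len(b)
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])


if __name__ == "__main__":
    x = [0, 0, 1, 0]
    y = [0, 1, 0, 0]  # same shape as x, shifted by one step
    print(euclidean(x, y), dtw(x, y))
```

With this cost definition, DTW on two series of equal length is never larger than their Euclidean distance, because the diagonal alignment is one of the admissible warping paths. Setting `window=0` forces that diagonal path, which is a quick sanity check when comparing the two distances.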

Exam sessions

Mid-term exams

| | Date | Hour | Place | Notes | Marks |
| DM1: First Mid-term 2018 | 30.10.2018 | 11-13 | Room C1, L1, N1 | Please use the system for registration: https://esami.unipi.it/ | Results |
| DM1: Second Mid-term 2018 | 18.12.2018 | 11-13 | Room C1, L1, N1 | Please use the system for registration: https://esami.unipi.it/ | |
| DM2: First Mid-term 2019 | 04.04.2019 | 16-18 | Room A1, E | Please use the system for registration: https://esami.unipi.it/ | Text + Solutions, Results |
| DM2: Second Mid-term 2019 | 06.06.2019 | 16-18 | Room E (+ A1 if needed) | Please use the system for registration: https://esami.unipi.it/ | Text, Results |

Regular exam sessions

| Session | Date | Time | Room | Notes | Marks |
| 1. | 16.01.2019 | 14:00 - 18:00 | Room E | | |
| 2. | 06.02.2019 | 14:00 - 18:00 | Room E | | |
| 3. | 19.06.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. | Results |
| 4. | 10.07.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. | Results |

Extra sessions A.A. 2017/18

| Date | Time | Room | Notes | Results |

Previous years

dm/start.txt · Last modified: 16/10/2019 at 09:36 by Anna Monreale