Data Mining A.A. 2015/16
DM 1: Foundations of Data Mining
Instructors:
- Dino Pedreschi, Anna Monreale
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
Teaching assistant:
- Riccardo Guidotti
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
DM 2: Advanced topics on Data Mining and case studies
Instructors:
- Mirco Nanni, Anna Monreale
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
News
- [21/06/2016] The results of the written exam of 20 June 2016 for DM1 and DM2 are out: Results
- [03/06/2016] The results of the written exam for DM1 and DM2 are out: Results
- [30/05/2016] The DM1 project is out
- [10/05/2016] The topics for the final projects are out: Projects DM II, 2015-16
- [01/03/2016] The dates of the mid-term test and the regular summer exam sessions are out!
- [09/02/2016] The results of the exam of 8 Feb 2016 are available online.
- [08/02/2016] Oral Exam Sessions 11/02/2016 at 11.00, 22/02/2016 at 11.00 in Pedreschi's office. Send an email to pedre@di.unipi.it, monreale@di.unipi.it and guidotti@di.unipi.it to register for the oral exam if you have not already registered during the written exam of 08/02/2016.
- [05/02/2016] The schedule for the February oral exams will be decided during the written exam of 08/02/2016, held in room A1 from 9.00 to 12.00. You can also decide to take the oral exam during the written exam. Note that this information is already available on this web page in the Exams section.
- [05/02/2016] San Francisco Crime list of projects received. If you submitted the project but your name does not appear in the list, please submit your project (as a single PDF file) again to guidotti@di.unipi.it. Buonaccorsi_Carta_Galassi
- [01/02/2016] During the February exam session it is possible to take the written exam to improve the mark of one or both of the two DM1 mid-term tests. All students who attended the DM1 course in the 2015-2016 academic year must register for the written exam by 4th February by sending an email to: anna.monreale@unipi.it, pedre@di.unipi.it and guidotti@di.unipi.it. They do not need to register online, since that would require them to fill in the evaluation form for the DM2 course without having attended it.
- [22/01/2016] The oral exam session of 25/01/2016 will start at 14.00 instead of 15.00.
- [19/01/2016] The results of the exam of 18 Jan 2016 are available online.
- [18/01/2016] Oral Exam Sessions 20/01/2016 at 11.00, 25/01/2016 at 15.00 in Pedreschi's office. Send an email to pedre@di.unipi.it and guidotti@di.unipi.it to register for the oral exam if you have not already registered during the written exam of 18/01/2016.
- [12/01/2016] Titanic Disaster Classification Top 5.
1 Rizzi-Romano-Scigliuzzo 0,8134; 2 Criscolo-Quintini-Trafficante 0,80383; 3 Bazzali-Borghi-Giannella 0,79904; 3 Deidda-Policardo-Salamida 0,79904; 3 DelleMacchie-Iavarone-Rambelli 0,79904; 3 Kocan-Erdem 0,79904; 3 Stili-Strazzulla-Gaggioli 0,79904; 4 Calamia-Ortolani-Tardelli 0,79426; 5 Abedini-Baltakiene 0,78947; 5 Loconte-Spontella-Di Modugno 0,7894;
- [08/01/2016] During the January and February exam sessions it is possible to take the written exam to improve the mark of one or both of the two DM1 mid-term tests. All students who attended the DM1 course in the 2015-2016 academic year must register for the written exam by 14th January or 4th February by sending an email to: anna.monreale@unipi.it, pedre@di.unipi.it and guidotti@di.unipi.it. They do not need to register online, since that would require them to fill in the evaluation form for the DM2 course without having attended it.
- [07/01/2016] Titanic Disaster list of projects received. If you submitted the project but your name does not appear in the list, please submit your project (as a single PDF file) again to guidotti@di.unipi.it.
Abedini_Baltakiene, Alzetta_Miaschi_Semplici, Bambini_Catania_Incorvaia, Bazzali_Borghi_Giannella, Boncoraglio_Delicto_Veshi, Calamia_Ortolani_Tardelli, Criscolo_Quintini_Trafficante, Deidda_Policardo_Salamida, DelleMacchie_Iavarone_Rambelli, Donati, Dossena_Grossi_LaPerna, Fuccio_Furlan_LaPusata, Gentile_Miliani_Rossi, Giacalone_Montisci_Salerno, Kocan_Erdem, LaCroce, Loconte_Spontella_DiModugno, Rizzi_Romano_Scigliuzzo, Russo, Stili_Strazzulla_Gaggioli, Xu
- [07/01/2016] A new project is now available! Details are in the Exams section.
- [22/12/2015] The results of the first mid-term test are online. If your name is missing from the file, please send me an email. During the January and February exam sessions it is possible to take only one of the two parts of the written exam.
- [04/12/2015] The lesson planned for 7th Dec 2015 is canceled.
- [23/11/2015] Every student who wishes to take the second mid-term test MUST register for the exam at https://esami.unipi.it/
- [19/11/2015] The results of the first mid-term test are online. If your name is missing from the file, please send me an email.
- [15/09/2015] The first lesson of Data Mining I will take place on Friday, Sept. 25th, in room A1.
Learning goals
… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.
Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.
The wide availability of data from relational databases, the web and other sources motivates the study of data analysis techniques that allow a better understanding and an easier use of the results in decision-making processes. The aim of the course is to provide an introduction to the basic concepts of the knowledge discovery process, to the main data mining techniques and to the related algorithms. Particular emphasis is placed on methodological aspects, presented through paradigmatic classes of applications such as Market Basket Analysis, market segmentation and fraud detection. Finally, the course introduces the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware. The course consists of the following parts:
- the basic concepts of the knowledge discovery process: data understanding and preparation, forms of data, data measures and similarity;
- the main data mining techniques (association rules, classification and clustering), studied in both their formal and implementation aspects;
- some case studies in marketing and customer management support, fraud detection and epidemiological studies;
- the last part of the course introduces the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware.
Reading about the "data scientist" job
- Data, data everywhere. The Economist, Feb. 2010 download
- Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
- Welcome to the yotta world. The Economist, Sept. 2011 download
- Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
- Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
- Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
- Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
- Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download
Hours and rooms
DM 1
Classes
Day | Hour | Room |
---|---|---|
Monday | 16:00 - 18:00 | Room C |
Friday | 14:00 - 16:00 | Room A1 |
Office hours:
- Prof. Pedreschi/Monreale: Monday 14:00 - 16:00, Dipartimento di Informatica
DM 2
Classes
Day of week | Hour | Room |
---|---|---|
Monday | 9:00 - 11:00 | Room N1 |
Thursday | 9:00 - 11:00 | Room A1 |
Office hours:
- Nanni / Monreale: appointment by email, c/o ISTI-CNR
Learning Material
Textbook
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- Chapters 4, 6 and 8 are also available at the publisher's Web site.
- Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. GUIDE TO INTELLIGENT DATA ANALYSIS. Springer Verlag, 1st Edition., 2010. ISBN 978-1-84882-259-7
Slides of the classes
- The slides used in the course will be added to the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".
Past exams
- In addition to the texts and (where available) solutions of the exams of recent years, the following exercises proposed in previous years are available.
Data mining software
- KNIME The Konstanz Information Miner. Download page
- R: a language and environment for statistical computing
- Scikit-learn: python library with tools for data mining and data analysis
- WEKA Data Mining Software in JAVA. University of Waikato, New Zealand Download page
Class calendar (2015-2016)
First part of course, first semester (DMF - Data mining: foundations)
# | Day | Room | Topic | Learning material | Instructor |
---|---|---|---|---|---|
1. | 21.09.2015 16:00-18:00 | C | Canceled | - | |
2. | 25.09.2015 14:00-16:00 | A1 | Overview | 1.dm-overview.pdf | Pedreschi/Monreale |
3. | 28.09.2015 16:00-18:00 | C | Introduction | 2.dm_ml_introduction.pdf | Pedreschi |
4. | 02.10.2015 14:00-16:00 | A1 | Introduction | 2.dm_ml_introduction.pdf | Monreale |
5. | 05.10.2015 16:00-18:00 | C | Data Understanding | 3.dataunderstanding.pdf 3.data-understanting-appendix.pdf | Monreale |
6. | 09.10.2015 14:00-16:00 | A1 | Data Preparation | 4.data_preparation.pdf | Monreale |
7. | 12.10.2015 16:00-18:00 | C | Clustering analysis. Centroid-based methods. | dm2014_clustering_intro.pdf dm2014_clustering_kmeans.pdf | Monreale |
8. | 16.10.2015 14:00-16:00 | A1 | Clustering analysis. Hierarchical methods. Tutorial Knime | dm2014_clustering_hierarchical.pdf knime_slides_mains.pdf | Monreale |
9. | 19.10.2015 16:00-18:00 | C | Clustering Analysis. Density Based Clustering and Validation | dm2014_clustering_dbscan.pdf dm2014_clustering_validation.pdf | Monreale |
10. | 21.10.2015 16:00-18:00 | C | Exercises on Data Understanding. | exercises-dm1.pdf | Monreale |
11. | 23.10.2015 14:00-16:00 | A1 | Exercises on Clustering. | HC with Group Average exercises-clustering.pdf | Monreale/Guidotti |
12. | 26.10.2015 16:00-18:00 | C | Knime Exercises | datamanipulation.zip knime_clustering_iris.zip | Pedreschi/Guidotti |
13. | 30.10.2015 14:00-16:00 | A1 | R and Python Exercises | manipulation-clystering-r.zip manipulation-clustering-py.zip | Pedreschi/Guidotti |
02.11.2015-06.11.2015 | First Mid-term test: 6th November 14:00-16:00 Room A | ||||
14. | 09.11.2015 16:00-18:00 | C | Classification | chap4_basic_classification.pdf | Monreale |
15. | 13.11.2015 14:00-16:00 | A1 | Classification | | Monreale |
16. | 16.11.2015 16:00-18:00 | C | Classification | | Monreale |
17. | 20.11.2015 14:00-16:00 | A1 | Classification | | Monreale |
18. | 23.11.2015 16:00-18:00 | C | Exercises on Classification. Knime Exercises | knime_classification_iris.zip knime_classification_adult.zip knime_classification_over_adult.zip | Guidotti/Monreale |
19. | 27.11.2015 14:00-16:00 | A1 | Frequent Patterns & Association Rules | 4-5tdm-restructured_assoc.pdf | Monreale |
20. | 30.11.2015 16:00-18:00 | C | Canceled | ||
21. | 04.12.2015 14:00-16:00 | A1 | Canceled | ||
22. | 07.12.2015 16:00-18:00 | C | Canceled | | Pedreschi |
23. | 11.12.2015 14:00-16:00 | A1 | Exercises on Patterns. Knime Exercises | knime_pattern.zip | Guidotti / Pedreschi |
24. | 14.12.2015 16:00-18:00 | C | | python-classification-pattern.zip r-classification-patterns.zip | Guidotti / Pedreschi |
16.12.2015-18.12.2015 | Second Mid-term test |
Second part of course, second semester (DMA - Data mining: advanced topics and case studies)
# | Day | Room | Topic | Learning material | Instructor |
---|---|---|---|---|---|
1. | 22.02.2016 09:00-11:00 | N1 | Introduction + Sequential Patterns / 1 | sequential_patterns.pdf, textbook Ch. 7.4 | Nanni & Pedreschi |
2. | 25.02.2016 09:00-11:00 | A1 | Sequential Patterns / 2 | ||
3. | 29.02.2016 09:00-11:00 | A1 | Sequential Patterns / Exercises | Link to SPMF, a tool for seq. patterns and sample dataset. Exercises: Text 1 and Text 2 | |
4. | 03.03.2016 09:00-11:00 | A1 | Advanced Classification Methods / 1 | alternative_classification_1_dino_03.03.2016.pdf | Pedreschi |
5. | 07.03.2016 09:00-11:00 | A1 | Advanced Classification Methods / 2 | alternative_classification_2_dino_07.03.2016.pdf | Pedreschi |
6. | 10.03.2016 09:00-11:00 | A1 | Advanced Classification Methods / Tools and Exercises | exercises_classification.pdf sample_knime_workflows.zip | |
7. | 14.03.2016 09:00-11:00 | A1 | Advanced Classification Methods / Exercises | Exercises (also) on classification from 2014-15 | |
8. | 17.03.2016 09:00-11:00 | A1 | Time Series / 1 | time_series_from_keogh_tutorial.pdf | |
9. | 21.03.2016 09:00-11:00 | A1 | Time Series / 2 | ||
10. | 24.03.2016 09:00-11:00 | A1 | Time Series / Exercises | Some exercises from past exams: (Sequences and time series) (Classification) | |
25-29.03.2016 | EASTER HOLIDAYS | ||||
04.04.2016 09:00-13:00 | TBD | Midterm tests | |||
11. | 07.04.2016 09:00-11:00 | A1 | Case study: CRM - Customer Segmentation + CRISP-DM | Customer segmentation CRISP-DM | |
12. | 11.04.2016 09:00-11:00 | A1 | Case study: CRM - Churn Analysis | Intro_CRM Churn External_Churn | |
13. | 14.04.2016 09:00-11:00 | A1 | Case study: CRM - Promotions and Sophistication | Promotions Sophistication | |
14. | 18.04.2016 09:00-11:00 | A1 | Mobility Data Analysis / 1 | Preprocessing Patterns and models | |
15. | 21.04.2016 09:00-11:00 | A1 | Mobility Data Analysis / 2 | Individual/Collective models GSM_DM | |
16. | 28.04.2016 09:00-11:00 | A1 | Case study: Mobility Data Analysis | Case studies | |
17. | 02.05.2016 09:00-11:00 | A1 | Complements: Ethical Issues / 1 | slides | Monreale |
18. | 05.05.2016 09:00-11:00 | A1 | Complements: Ethical Issues / 2 | Monreale | |
19. | 09.05.2016 09:00-11:00 | A1 | Projects presentation | Projects | |
20. | 12.05.2016 09:00-11:00 | A1 | Complements: Outlier Detection | Slides from SDM2010 tutorial | |
21. | 16.05.2016 09:00-11:00 | A1 | Projects discussion |
Exams
Exam DM part I (DMF)
The exam is composed of three parts:
- A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of November and December.
- An oral exam, that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
- A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, market basket analysis, clustering analysis and classification. The project must be carried out by groups of at most 3 people, using Knime, R, Python or a combination of them. The results of the different tasks must be reported in a single paper of at most 10 pages of text, figures excluded. The project must be delivered at least 2 days before the oral exam. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM 2015-2016] Project” in the subject. Tasks of the project:
- Data Understanding: San Francisco Crime Data Set. Assigned on: 30.05.2016. Download the data set (train.csv) here: https://www.kaggle.com/c/sf-crime/data (in CSV format) where you can also find the data description. From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay. This dataset contains incidents derived from SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning week 1,3,5,7… belong to test set, week 2,4,6,8 belong to training set. Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations.
- Clustering analysis: San Francisco Crime Data Set. Assigned on: 30.05.2016. Perform the clustering analysis on the above data (train.csv), with any of the studied methods, using an appropriate subset of variables. Determine an adequate number of clusters, if any, and try to explain the properties of the discovered clusters (or else, argue why this dataset does not exhibit a clustering structure).
- Association Rule Mining: San Francisco Crime Data Set. Assigned on: 30.05.2016. Given the above data (train.csv), find the frequent itemsets, then analyse and discuss the most interesting association rules that can be derived from the frequent patterns.
- Classification: San Francisco Crime Data Set. Assigned on: 30.05.2016. Given time, location and additional information, build decision trees to predict the category of the crime that occurred. Use the dataset “train.csv” to train the model, then use the file “test.csv” as test set, checking your accuracy on the Kaggle web site. The paper has to illustrate the adopted classification methodology and the validation and interpretation of the decision tree. To obtain a score on “test.csv”, fit your model as usual on “train.csv”. When you think your model is well trained, run the prediction on “test.csv”. You have to produce a .csv file with the same format as the file “sampleSubmission.csv” on the Kaggle website. Then upload this file to Kaggle and you will receive a score indicating the accuracy of your model. Report your score in the final paper.
- Hint 1! For those using Knime: since the train.csv file can be too big to be managed with Knime, you can work on a sample of (test.csv), but select a fixed sample (e.g. a subset of the temporal window or a particular geographical area) rather than a random one, and specify your selection in the final report.
- Hint 2! For those using Knime: exploit the OSM Map View node to visualize the San Francisco Map with the crimes.
- Hint 3! Classification task: try to build a sequence of binary classifiers in addition to a single classifier for all categories (e.g. the first tree decides whether the crime is PROSTITUTION or not; if it is not, the second one decides whether the crime is KIDNAPPING or not; if it is not, the third tree decides whether the crime is a ROBBERY or not…).
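Hint 1 above asks for a non-random sample, for example a slice of the temporal window. A minimal Python sketch of such a selection follows. The column name `Dates`, its timestamp layout and the toy rows are assumptions made for illustration; check them against the data description on Kaggle. The chosen window is a hypothetical example, not a recommendation.

```python
import csv
import io
from datetime import datetime

# Assumed column name and timestamp layout of train.csv; verify both
# against the Kaggle data description before relying on them.
DATE_COLUMN = "Dates"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def select_time_window(rows, start, end):
    """Keep the rows whose timestamp falls in the interval [start, end)."""
    selected = []
    for row in rows:
        when = datetime.strptime(row[DATE_COLUMN], DATE_FORMAT)
        if start <= when < end:
            selected.append(row)
    return selected


# Tiny made-up stand-in for train.csv, only to show the mechanics.
toy_csv = io.StringIO(
    "Dates,Category,PdDistrict\n"
    "2014-12-31 23:50:00,ROBBERY,MISSION\n"
    "2015-01-01 00:10:00,KIDNAPPING,CENTRAL\n"
    "2015-03-15 12:00:00,ROBBERY,MISSION\n"
    "2015-05-13 08:30:00,PROSTITUTION,TENDERLOIN\n"
)
rows = list(csv.DictReader(toy_csv))

# Fixed window: the first quarter of 2015 (a hypothetical choice).
sample = select_time_window(rows, datetime(2015, 1, 1), datetime(2015, 4, 1))
sample_categories = [row["Category"] for row in sample]
print(sample_categories)
```

Because the window bounds are explicit, the same rows are selected on every run, and the bounds can be quoted directly in the final report.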
Guidelines for the project are here.
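For the association rule mining task above, the two quantities to report are support and confidence. The sketch below computes them by brute-force enumeration on a handful of made-up transactions, where each incident is encoded as a set of `attribute=value` items. The encoding, the toy values and the thresholds are assumptions for illustration. On the real data an Apriori-style tool (for instance in Knime or R) would be the practical choice, since brute force does not scale.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Brute-force enumeration by itemset size; it stops at the first size
    with no frequent itemset, since no larger itemset can be frequent.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(itemset <= t for t in transactions) / n
            if support >= min_support:
                result[itemset] = support
                found = True
        if not found:
            break
    return result


def association_rules(itemsets, min_confidence):
    """Derive rules X -> Y with confidence = support(X u Y) / support(X)."""
    rules = []
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), size):
                lhs = frozenset(lhs_items)
                # Every subset of a frequent itemset is frequent, so the
                # lookup below always succeeds.
                confidence = support / itemsets[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, support, confidence))
    return rules


# Made-up incidents, each encoded as a set of attribute=value items.
transactions = [
    {"Day=Friday", "District=MISSION", "Category=ROBBERY"},
    {"Day=Friday", "District=MISSION", "Category=ROBBERY"},
    {"Day=Friday", "District=CENTRAL", "Category=KIDNAPPING"},
    {"Day=Monday", "District=MISSION", "Category=ROBBERY"},
]

itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.9)
for lhs, rhs, support, confidence in rules:
    print(sorted(lhs), "->", sorted(rhs), support, confidence)
```

High confidence alone does not make a rule interesting: in a dataset dominated by a few categories, a rule can be confident simply because its consequent is very frequent, which is worth checking when discussing the results.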
Exam DM part II (DMA)
The exam is composed of three parts:
- A written exam, with exercises and questions about classification (advanced topics), sequential patterns and time series.
- A project, assigned among those proposed during the classes, or proposed by the students themselves. In the latter case, they are invited to submit a short project proposal (max. 1 page) describing the data to use and the analysis objectives. The work done for the project should be summarized in a report, to be sent to the teachers at least 2 days before the oral exam. The proposed projects are the following:
- Market basket: Individual vs collective purchase behaviours
- Online services: Churn analysis on LastFM listenings
- Mobility: Taxi cabs & criminality in San Francisco
- An oral exam, that includes: (1) discussing the project report with a group presentation (15 minutes for the whole group); (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
Exam sessions
Mid-term exams
Test | Date | Hour | Place | Notes | Marks |
---|---|---|---|---|---|
First Mid-term 2015 | Friday 06.11.2015 | 14.00 | Room A | | Results |
Second Mid-term 2015 | Wednesday 16.12.2015 | 11.00 | Room A1 | | Results |
Test | Date | Hour | Place | Notes | Marks |
---|---|---|---|---|---|
Mid-term 2016 | Monday 04.04.2016 | 9.00 | Room A1 | | Results |
Regular exam sessions
Session | Date | Time | Room | Notes | Results |
---|---|---|---|---|---|
1. | Monday 18 January 2016 | 9.00 | A1 | The dates of the oral exam will be set on the same day. | |
2. | Monday 08 February 2016 | 9.00 | A1 | The dates of the oral exam will be set on the same day. | |
3. | Monday, 30 May 2016 | 9.00 | C | The dates of the oral exam will be set on the same day. | DM1: Written exam results DM2: Written exam results |
4. | Monday, 20 June 2016 | 9.00 | C | The dates of the oral exam will be set on the same day. | |
5. | Friday, 08 July 2016 | 9.00 | C | The dates of the oral exam will be set on the same day. | |
6. | Monday, 05 Sept 2016 | 9.00 | C | The dates of the oral exam will be set on the same day. |
Extra sessions A.A. 2014/15
Date | Time | Room | Notes | Results |
---|---|---|---|---|
6 November 2015 | 14:00-16:00 | Room A | ||
04 April 2016 | 9.00-13:00 | Room A1 |