
Data Mining A.A. 2014/15

DM 1: Foundations of Data Mining

Instructors:

Teaching assistant:

DM 2: Advanced topics on Data Mining and case studies

Instructors:

News

  • [21/09/2015] The results of the written exam for DM II of 09.09.2015 are available here: DMII-Results-Sept
  • [22/07/2015] The results of the written exam for DM II of 17.07.2015 are available here: Results
  • [06/07/2015] The results of the written exam for DM II of 26.06.2015 are available here: Results
  • [30/06/2015] The results of the written exam for DM I of 26.06.2015 are available here: result
  • [14/06/2015] The results of the written exam for DM II of 05.06.2015 are available here: Results 05.06.2015
  • [18/05/2015] The results of the midterm exam for DM II are available here: Results 13.04.2015
  • [03/03/2015] The midterm test for DM I and the special session for exams for DM II will take place on April 13th, 2015 in room C1, at 9 a.m.
  • [21/02/2015] Results of DM I (written exam) available Data Mining I:result
  • [19/02/2015] The first lesson of Data Mining 2 will take place on Monday, February 23rd, in room N1.
  • [16/02/2015] The next oral exam will be on Monday 23 February 2015 at 11.00 and Monday 2 March 2015 at 11:00 at Pedreschi's office. Note that you have to send an email (milli [at] di [dot] unipi [dot] it or dino [dot] pedreschi [at] di [dot] unipi [dot] it) to register for the oral exam.
  • [21/01/2015] Results of DM I (written exam) available Data Mining I: Results of written exam, January 19, 2015
  • [19/01/2015] The next oral exam will be on Monday 19 January 2015 at 9.00 and Thursday 29 January 2015 at 14:00 at Pedreschi's office. Note that you have to send an email (milli [at] di [dot] unipi [dot] it or dino [dot] pedreschi [at] di [dot] unipi [dot] it) to register for the oral exam.
  • [15/12/2014] The text for the third and fourth exercises has been released. Deadline: three days before the oral exam.
  • [12/12/2014] Today's lesson is cancelled due to a strike. The lesson is moved to Monday 15/12/2014 at 16:00.
  • [12/12/2014] The evaluation of the second homework is online.
  • [12/12/2014] The evaluation of the first homework is online.
  • [24/11/2014] On 27 and 28 November the KDD Lab holds its annual workshop, open to everyone interested: KddLab Workshop
  • [07/11/2014] The text for the first and second exercises has been released. Deadline: 28/11/2014.
  • [17/10/2014] Extra exam session for the academic year 2013/2014: Friday 7 November 2014, 9:00-11:00, room C1
  • Request for collaboration on the scientific research project MOTUS - Mobility and Tourism in Urban Scenarios. Spend 2 hours next Thursday, 16 October, at the Dipartimento di Informatica testing and evaluating new mobility apps. Details and registration: focus group MOTUS
  • New hours: Monday 16:00-18:00, Room C; Friday 14:00-16:00, Room A1
  • Because of commitments of the instructor that predate the change of hours, the class of Monday 13 October will begin at 16:30.
  • [07/10/2014] The doodle to decide whether to move the Thursday class is online here. Please indicate your availability by the class of Thursday 9 October.
  • [07/10/2014] You are all invited to the Internet Festival event “Big Data e la mobilità del futuro” on Saturday 11 October, from 10:00 to 18:30, in the main lecture hall (aula magna) of Polo Fibonacci (building E). Big Data e la Mobilità del Futuro
  • [25/09/2014] Today's class is replaced by the BRIGHT event at the CNR in Pisa - Big Data Tales: Notte dei Ricercatori al CNR di Pisa

Learning goals

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

The wide availability of data from relational databases, the web, and other sources motivates the study of data analysis techniques that enable a better understanding and an easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge discovery process, to the main data mining techniques, and to the related algorithms. Particular emphasis is given to the methodological aspects, presented through some paradigmatic classes of applications such as market basket analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical aspects of using inference techniques on data, of which the analyst must be aware. The course consists of the following parts:

  1. the basic concepts of the knowledge discovery process: data understanding and preparation, data types, measures and similarity of data;
  2. the main data mining techniques (association rules, classification, and clustering), studied in both their formal and implementation aspects;
  3. some case studies in marketing and customer management support, fraud detection, and epidemiological studies;
  4. the privacy and ethical aspects of using inference techniques on data, of which the analyst must be aware.

Reading about the "data scientist" job

  • Data, data everywhere. The Economist, Feb. 2010 download
  • Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
  • Welcome to the yotta world. The Economist, Sept. 2011 download
  • Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
  • Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
  • Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
  • Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
  • Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Hours and rooms

DM 1

Classes

Day | Time | Room
Monday | 16:00 - 18:00 | Room C
Friday | 14:00 - 16:00 | Room A1

Office hours:

  • Prof. Pedreschi: Monday, 14:00 - 16:00, Dipartimento di Informatica
  • Giannotti/Milli: appointment by email, c/o ISTI-CNR

DM 2

Classes

Day of week | Hour | Room
Monday | 9:00 - 11:00 | Room N1
Thursday | 9:00 - 11:00 | Room A1

Office hours:

  • Nanni / Monreale: appointment by email, c/o ISTI-CNR

Learning material

Textbook

  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006

Slides of the classes

  • The slides used in the course will be added to the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".

Exam texts

Data mining software

Class calendar (2014-2015)

First part of course, first semester (DMF - Data mining: foundations)

No. | Date | Time | Room | Topic | Learning material | Instructor
1 | 25.09.2014 | 14:00-16:00 | B | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1; dm_intro-2011.pdf | Pedreschi
2 | 26.09.2014 | 16:00 | CNR | BRIGHT event at the CNR in Pisa - Big Data Tales | | Pedreschi
3 | 02.10.2014 | 14:00-16:00 | B | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1; dm_intro-2011.pdf | Pedreschi
4 | 03.10.2014 | 14:00-16:00 | A1 | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1; dm_intro-2011.pdf | Pedreschi
5 | 09.10.2014 | 14:00-16:00 | B | Data: types and basic measures | Textbook, Chapt. 2; chap2_data_new.pdf | Pedreschi
6 | 10.10.2014 | 14:00-16:00 | A1 | Data: types and basic measures | Textbook, Chapt. 2; chap2_data_new.pdf | Pedreschi
7 | 13.10.2014 | 14:00-16:00 | B | Data: types and basic measures | Textbook, Chapt. 2; chap2_data_new.pdf | Pedreschi
8 | 17.10.2014 | 14:00-16:00 | A1 | Cancelled | | Pedreschi
9 | 20.10.2014 | 14:00-16:00 | B | Exploratory data analysis and data understanding | Textbook, Chapt. 3; chap3_data_exploration.pdf | Pedreschi
10 | 24.10.2014 | 14:00-16:00 | A1 | Clustering analysis. Centroid-based methods | Textbook, Chapt. 8; dm2014_clustering_intro.pdf, dm2014_clustering_kmeans.pdf | Pedreschi
11 | 27.10.2014 | 14:00-16:00 | B | Clustering analysis. Hierarchical methods | Textbook, Chapt. 8; dm2014_clustering_hierarchical.pdf | Pedreschi
12 | 31.10.2014 | 14:00-16:00 | A1 | Tutorial on KNIME | Slides: knime_slides_dm.pdf; Workflows: data-manipulation_iris.zip, data-manipulation_adult.zip, clustering_iris.zip | Pedreschi
13 | 10.11.2014 | 14:00-16:00 | B | Clustering analysis. Density-based methods | Textbook, Chapt. 8; dm2014_clustering_dbscan.pdf | Pedreschi
14 | 14.11.2014 | 14:00-16:00 | A1 | Classification and predictive methods | Textbook, Chapt. 4; chap4_basic_classification.pdf | Pedreschi
15 | 17.11.2014 | 14:00-16:00 | B | Classification. Decision trees | Textbook, Chapt. 4; chap4_basic_classification.pdf | Pedreschi
16 | 21.11.2014 | 14:00-16:00 | A1 | Classification. Decision trees | Textbook, Chapt. 4; chap4_basic_classification.pdf | Pedreschi
17 | 24.11.2014 | 14:00-16:00 | B | Classification. Validation and Weka & KNIME Lab | Workflows: decisiontreeiris.zip, decisiontreeadult.zip, decisiontreeadultoverfitting.zip | Milli
18 | 28.11.2014 | 14:00-16:00 | A1 | Classification. Rule-based and Bayesian methods | Textbook, Chapt. 4; chap4_basic_classification.pdf | Pedreschi
19 | 01.12.2014 | 14:00-16:00 | B | Frequent Pattern Mining | Textbook, Chapt. 6; 2-3tdm-restructured_assoc_2013.pdf | Pedreschi
20 | 05.12.2014 | 14:00-16:00 | A1 | Association Rule Mining | Textbook, Chapt. 6; 2-3tdm-restructured_assoc_2013.pdf | Pedreschi
21 | 12.12.2014 | 14:00-16:00 | A1 | Cancelled due to a strike | | Pedreschi
22 | 15.12.2014 | 14:00-16:00 | B | Association Rule Mining and KNIME | Workflow: FP and AR | Monreale

Second part of course, second semester (DMA - Data mining: advanced topics and case studies)

No. | Date | Time | Room | Topic | Learning material | Instructor
1 | 23.02.2015 | 09:00-11:00 | N1 | Introduction + Sequential patterns / 1 | Sequential Patterns - Slides | Nanni
2 | 26.02.2015 | 09:00-11:00 | A1 | Sequential patterns / 2 | Link to Tool for seq. patterns | Nanni
3 | 02.03.2015 | 09:00-11:00 | N1 | Graph mining | Slides | Nanni
 | 05.03.2015 | 09:00-11:00 | A1 | - | |
4 | 09.03.2015 | 09:00-11:00 | N1 | Advanced Classification Methods / 1 | Slides | Monreale
5 | 12.03.2015 | 09:00-11:00 | A1 | Advanced Classification Methods / 2 | | Monreale
6 | 16.03.2015 | 09:00-11:00 | N1 | Advanced Classification Methods / 3 | Exercises on Classification | Monreale
7 | 19.03.2015 | 09:00-11:00 | A1 | Time series / 1 | Slides | Nanni
8 | 23.03.2015 | 09:00-11:00 | N1 | Time series / 2 | Example of DTW in R | Nanni
9 | 26.03.2015 | 09:00-11:00 | A1 | Exercises | Exercises from past exams | Nanni
10 | 30.03.2015 | 09:00-11:00 | N1 | Exercises | | Monreale
11 | 02.04.2015 | 09:00-11:00 | A1 | Exercises | | Monreale
 | 03-07.04.2015 | | | EASTER HOLIDAYS | |
 | 13.04.2015 | 09:00-11:00 | C1 | Midterm test | |
12 | 16.04.2015 | 09:00-11:00 | A1 | Case study: CRM - Customer Segmentation + CRISP-DM | AMRP & Stulong CRISP-DM | Nanni
13 | 23.04.2015 | 09:00-11:00 | A1 | Case study: CRM - Churn Analysis | Intro CRM Churn ST-Churn | Nanni
14 | 27.04.2015 | 09:00-11:00 | N1 | Case study: CRM - Promotions and Sophistication | Promotions Sophistication | Nanni
15 | 30.04.2015 | 09:00-11:00 | A1 | Spatiotemporal analysis / 1 | ST Analysis REF: Survey paper | Nanni
16 | 04.05.2015 | 09:00-11:00 | N1 | Spatiotemporal analysis / 2 | | Nanni
17 | 07.05.2015 | 09:00-11:00 | A1 | Case study: Spatiotemporal analysis / 1 + Projects presentation | Case study 1 Projects | Nanni
18 | 11.05.2015 | 09:00-11:00 | N1 | Case study: Spatiotemporal analysis / 2 | Case study 2 | Nanni
19 | 14.05.2015 | 09:00-11:00 | A1 | Spatiotemporal analysis / 3 | ST Classification | Nanni
20 | 18.05.2015 | 09:00-11:00 | N1 | Outlier detection | Slides from SDM2010 tutorial | Nanni
21 | 21.05.2015 | 09:00-11:00 | A1 | Ethical Issues in Data Analytics | Slides | Monreale
22 | 25.05.2015 | 09:00-11:00 | N1 | Ethical Issues in Data Analytics / Fraud Detection Case Study | | Monreale

Exams

Exam DM part I (DMF)

The exam consists of a written test and an oral test:

  • The written test consists mainly of exercises on the methods and algorithms seen in class. The texts of past exam sessions are regularly put online and can be used as a general reference. The written test can be replaced by the two midterm tests: if both are passed, the average of their marks is the mark with which to take the oral test, unless the written exam is taken again, in which case the most recent mark replaces the previous ones (for better or worse). It is not possible to retake a single midterm test during the regular exam sessions. For the academic year 2013-2014, the midterm tests are replaced by a series of exercises that will be proposed during the course.
  • The oral test covers the more theoretical aspects of the course (definitions, methods, algorithms, etc.) treated in class, or consists of a discussion of readings agreed upon with the instructors.

Exam DM part II (DMA)

The exam is composed of three parts:

  • A written exam, with exercises and questions about classification (advanced topics), sequential patterns, graph mining and time series.
  • A project, assigned among those proposed during the classes, or proposed by the students themselves. In the latter case, they are invited to submit a short project proposal (max. 1 page) describing the data to use and the analysis objectives. The work done should be summarized in a report, to be sent to the teachers at least 2 days before the oral exam. The proposed projects are the following:
  • An oral exam that includes: (1) discussing the project report with a group presentation (15 minutes for the whole group); (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.

Exercises 2014-2015

Exercises DM First Part

Guidelines for the homework are here.

  • Data Understanding: Thyroid Disease Data Set. Assigned on: 07.11.2014. To be completed by: 28.11.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 1” in the subject. Download the Thyroid Disease Data Set (in CSV format, zipped). This data set is one of several databases about thyroid disease available at the UCI repository, http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease, where you can also find the data description. Explore the dataset with the analytical tools of KNIME or Weka (or whatever you like) and write a concise “data understanding” report describing data semantics and assessing data quality, the distribution of the variables and the pairwise correlations.
  • Clustering analysis: Thyroid Disease Data Set. Assigned on: 07.11.2014. To be completed by: 28.11.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 2” in the subject. Download the Thyroid Disease Data Set (in CSV format, zipped). Perform an adequate data understanding phase, and then clustering analysis, with any of the studied methods, using an appropriate subset of variables. Determine an adequate number of clusters, if any, and try to explain the properties of the discovered clusters (or else, argue why this dataset does not exhibit a clustering structure).
  • Market Basket Analysis: SuperMarket dataset. Assigned on: 15.12.2014. To be completed by: three days before the oral exam. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 3” in the subject. Download the SuperMarket dataset (in CSV format, zipped). Given a database of customer transactions of a supermarket, find the sets of frequently co-purchased items and analyse the most interesting association rules that can be derived from the frequent patterns. Provide a short document that illustrates the input dataset, the adopted frequent pattern algorithm and the association rule analysis, discussing your findings on the most interesting rules. The database is composed of two files: (1) transactions.csv, containing the customer transactions, where each row contains a SCONTRINO_ID (transaction code) and a COD_MKT_ID (the code of the item purchased); (2) segments-description.csv, containing the full description of each item. For each COD_MKT_ID you can find information about the CATEGORY, SECTOR, AREA, SEGMENT and so on. Perform the analysis at the segment level.
  • Classification: Serie A player statistics dataset. Assigned on: 15.12.2014. To be completed by: three days before the oral exam. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 4” in the subject. Download the dataset here: serieA_dataset (in CSV format, zipped). The dataset contains two files: serieA_aggregated_player.csv, which contains the statistics aggregated for each player during the last soccer championship, and serieA_events.csv, which contains, for each match, all the statistics for each player. You can choose the file that you prefer. Objective: finding decision trees to predict the position of a player (Defender, Goalkeeper, Forward, Midfielder). The paper has to illustrate the input dataset, some analyses for data understanding, the adopted classification methodology, and the decision tree validation and interpretation.
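
As a rough illustration of the market basket exercise, the sketch below lifts each purchased item to its segment, counts itemset supports, and derives association rules with confidence and lift. It is only a minimal sketch under stated assumptions: the rows and the segment names are invented toy data standing in for transactions.csv and segments-description.csv, the helper names are hypothetical, and the brute-force counting is a stand-in for the frequent pattern algorithms available in KNIME or Weka, not a replacement for them.

```python
from collections import defaultdict
from itertools import combinations

# Toy rows standing in for transactions.csv: (SCONTRINO_ID, COD_MKT_ID).
# All values are invented for illustration.
transactions = [
    (1, "A1"), (1, "B1"), (1, "C1"),
    (2, "A2"), (2, "B1"),
    (3, "A1"), (3, "B2"),
    (4, "C1"), (4, "A1"),
]

# Toy mapping standing in for segments-description.csv: COD_MKT_ID -> SEGMENT.
segment_of = {
    "A1": "pasta", "A2": "pasta",
    "B1": "tomato sauce", "B2": "tomato sauce",
    "C1": "milk",
}


def baskets_by_segment(rows, mapping):
    """Group item codes by receipt and replace each item with its segment."""
    baskets = defaultdict(set)
    for receipt_id, item_code in rows:
        baskets[receipt_id].add(mapping[item_code])
    return list(baskets.values())


def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets up to max_size (brute force)."""
    n = len(baskets)
    counts = defaultdict(int)
    for basket in baskets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(basket), size):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def association_rules(supports, min_confidence):
    """Derive rules X -> Y from frequent pairs, with confidence and lift."""
    rules = []
    for itemset, support in supports.items():
        if len(itemset) != 2:
            continue
        for antecedent in itemset:
            (consequent,) = itemset - {antecedent}
            # Subsets of a frequent itemset are frequent, so these lookups succeed.
            confidence = support / supports[frozenset([antecedent])]
            lift = confidence / supports[frozenset([consequent])]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence, lift))
    # Highest-confidence rules first.
    return sorted(rules, key=lambda r: r[3], reverse=True)


baskets = baskets_by_segment(transactions, segment_of)
supports = frequent_itemsets(baskets, min_support=0.5)
rules = association_rules(supports, min_confidence=0.7)
for lhs, rhs, sup, conf, lift in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

On real data the thresholds (here 0.5 and 0.7, chosen arbitrarily) would need tuning, and a lift close to 1 suggests that a rule mostly reflects how common the consequent already is rather than a genuine association.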

Exam sessions

Mid-term exams

Session | Date | Hour | Place | Notes | Marks
Mid-term 2015 | Monday 13.04.2015 | 9.00 | Room C1 | |

Regular exam sessions

Session | Date | Time | Room | Notes | Results
1 | Monday 19 January 2015 | 9.00 | C | | Results of written exam
1 | Wednesday 21 January 2015 | 9.00 | Pedreschi's office | Oral exam. Send an email to register for the oral exam. |
1 | Thursday 29 January 2015 | 14.00 | Pedreschi's office | Oral exam. Send an email to register for the oral exam. |
2 | Monday 16 February 2015 | 9.00 | C | | Results of written exam
2 | Monday 23 February 2015 | 11.00 | Pedreschi's office | Oral exam. Send an email to register for the oral exam. |
2 | Monday 2 March 2015 | 11.00 | Pedreschi's office | Oral exam. Send an email to register for the oral exam. |
3 | Friday 05 June 2015 | 14.00 | C | | Results of written exam

Session | Date | Time | Room | Notes | Results
1 | Monday 19 January 2015 | 9.00 | C | |
2 | Monday 16 February 2015 | 9.00 | C | |
3 | Friday 05.06.2015 | 14.00 | C | |
4 | Friday 26.06.2015 | 14.00 | C | |
5 | Friday 17.07.2015 | 9.00 | C | |
6 | Wednesday 09.09.2015 | 9.00 | C | |

Extra sessions A.A. 2013/14

Date | Time | Room | Notes | Results
7 November 2014 | 9:00-11:00 | C1 | |

Previous years' editions

dm/dm.2014-15.txt · Last modified: 21/09/2015 at 13:17 by Dino Pedreschi