Data Mining A.A. 2018/19

DM 1: Foundations of Data Mining (6 CFU)

Instructors:

Teaching assistant:

DM 2: Advanced topics on Data Mining and case studies (6 CFU)

Instructors:

DM: Data Mining (9 CFU)

Instructors:

Teaching assistant:

News

  • Material for the course. To download it, use the login and password that we sent by email.
  • Students must decide the group composition for the project and fill in this spreadsheet by October 1, 2018. A heterogeneous composition with respect to the master's degree is strongly recommended. Each group must have 3 or 4 members.

Learning goals

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

The wide availability of data from relational databases, the web, and other sources motivates the study of data analysis techniques that allow a better understanding and easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge extraction process, the main data mining techniques, and the related algorithms. Particular emphasis is placed on methodological aspects, presented through several paradigmatic classes of applications such as market basket analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical aspects inherent in the use of inference techniques on data, of which the analyst must be aware. The course consists of the following parts:

  1. the basic concepts of the knowledge extraction process: data understanding and preparation, forms of data, data measures and similarity;
  2. the main data mining techniques (association rules, classification, and clustering). Both the formal and the implementation aspects of these techniques will be studied;
  3. some case studies in marketing and customer management support, fraud detection, and epidemiological studies;
  4. the last part of the course introduces the privacy and ethical aspects inherent in the use of inference techniques on data, of which the analyst must be aware.

Reading about the "data scientist" job

  • Data, data everywhere. The Economist, Feb. 2010 download
  • Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
  • Welcome to the yotta world. The Economist, Sept. 2011 download
  • Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
  • Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
  • Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
  • Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
  • Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Hours and Rooms

DM1 & DM

Classes

Day of week | Hour | Room
Monday | 14:00 - 16:00 | Aula C1
Wednesday | 14:00 - 16:00 | Aula C1
Friday | 11:00 - 13:00 | Aula C1

Office hours:

  • Prof. Pedreschi: Monday, 14:00 - 16:00, Dipartimento di Informatica
  • Prof. Monreale: Thursday, 14:00 - 16:00, Dipartimento di Informatica
  • Dr. Guidotti: appointment by email (guidotti@di.unipi.it), Dipartimento di Informatica

DM 2

Classes

Day of week | Hour | Room
Thursday | 14:00 - 16:00 | A1
Friday | 16:00 - 18:00 | C1

Office hours:

  • Nanni: appointment by email, c/o ISTI-CNR

Learning Material

Textbooks

  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, 2006. ISBN 0-321-32136-7
  • Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn. Guide to Intelligent Data Analysis. Springer Verlag, 1st edition, 2010. ISBN 978-1-84882-259-7
  • Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st edition, 2017.

Slides of the classes

The slides used during the course will be added to the calendar at the end of each lecture. Most of them are taken from those provided by the authors of the textbook: Slides for "Introduction to Data Mining"

Past Exams

  • Some texts of past exams on DM1 (6 CFU):

  • Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers, DM1 (9 CFU):

  • Some exercises (partially with solutions) on sequential patterns and time series can be found in the following exam texts from previous years:

Data mining software

Class calendar (2018/2019)

First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining)

No. | Day | Time | Room | Topic | Learning material | Instructor
1 | 19.09 | 14:00-16:00 | C1 | Overview. Introduction. | 1.2018-dm-overview.pdf | Pedreschi
2 | 20.09 | 16:00-18:00 | C1 | Introduction | | Pedreschi
3 | 21.09 | 11:00-13:00 | C1 | Lecture canceled | | Pedreschi
4 | 24.09 | 14:00-16:00 | C1 | KDD Process & Applications. Data Understanding. | DM + Applications, DU | Monreale
5 | 26.09 | 14:00-16:00 | C1 | Data Understanding. Data Preparation | | Monreale
6 | 28.09 | 11:00-13:00 | C1 | Introduction to Python, Knime | intro_knime, intro_python | Monreale/Guidotti
7 | 01.10 | 14:00-16:00 | C1 | Data Preparation | Data Preparation | Monreale
8 | 03.10 | 14:00-16:00 | C1 | Clustering: Introduction and Centroid-based clustering | 4.basic_cluster_analysis-intro-kmeans.pdf | Monreale
9 | 05.10 | 11:00-13:00 | C1 | Lecture canceled | |
10 | 08.10 | 14:00-16:00 | C1 | Knime - Python: Data Understanding | du_knime, du_python | Guidotti
11 | 10.10 | 14:00-16:00 | C1 | Clustering: K-means & Hierarchical | 5.basic_cluster_analysis-hierarchical.pdf | Pedreschi
12 | 12.10 | 11:00-13:00 | C1 | Lecture canceled for IF | |
13 | 15.10 | 14:00-16:00 | C1 | Clustering: DBSCAN | 6.basic_cluster_analysis-dbscan-validity.pdf | Pedreschi
14 | 17.10 | 14:00-16:00 | C1 | Clustering: Validity | | Pedreschi
15 | 19.10 | 11:00-13:00 | C1 | Discussion on Projects - DU | | Guidotti
16 | 22.10 | 14:00-16:00 | C1 | Exercises for mid-term test | | Monreale
17 | 24.10 | 14:00-16:00 | C1 | Knime - Python: Clustering | | Guidotti
18 | 26.10 | 11:00-13:00 | C1 | Exercises for mid-term test | | Monreale

Second part of course, second semester (DMA - Data mining: advanced topics and case studies)

Day | Room | Topic | Learning material | Instructor (default: Nanni)

Exams

Exam DM part I (DMF)

The exam is composed of three parts:

  • A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of November and December.
  • An oral exam (optional) that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam. It is optional for students who pass the written part through the mid-term tests.
  • A project consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, clustering analysis, frequent pattern mining, and classification. The project must be carried out by groups of 3 to 4 people, using Knime, Python, or a combination of the two. The results of the different tasks must be reported in a single paper of at most 20 pages, including figures. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM 2018-2019] Project” in the subject. Students who decide to carry out the project during the winter or summer exam sessions will find the project dataset online after 05/01/2019. In this case the project must be delivered at least 2 days before the oral exam.

Tasks of the project:

  1. Data Understanding (Collective discussion on: 19/10/2018): Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations. (see Guidelines for details)
  2. Clustering analysis (Collective discussion on: ??/??/????): Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details)
  3. Classification (Collective discussion on: ??/??/????): Explore the dataset using classification trees and random forests. Use them to predict the target variable. (see Guidelines for details)
  4. Association Rules (Collective discussion on: ??/??/????): Explore the dataset using frequent pattern mining and association rule extraction. Then use the rules to predict a variable, either to replace missing values or to predict the target variable. (see Guidelines for details)
  • Project 1
  1. Dataset: Credit Card Default
  2. Assigned: 01/10/2018
  3. Firm Deadline: 05/01/2019

Guidelines for the project are here.
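
As a rough illustration of the pairwise-correlation step in the Data Understanding task, the sketch below computes Pearson correlations between numeric columns in plain Python. The column names and values are made up for illustration and are not taken from the project dataset; Knime or a Python data analysis library would normally do the same job.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("both columns need the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(columns):
    """Map every unordered pair of column names to its Pearson correlation."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical toy table (assumption: not the real project data).
toy = {
    "limit": [10, 20, 30, 40, 50],
    "age": [25, 35, 45, 55, 65],
    "late_payments": [5, 4, 3, 2, 1],
}

correlations = pairwise_correlations(toy)
for pair, value in correlations.items():
    print(pair, round(value, 3))
```

Values close to 1 or -1 point to strongly related variables, which is worth noting in the report before any clustering or classification is attempted.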

Exam DM part II (DMA)

The exam is composed of three parts:

  • A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of April and June.
  • An oral exam that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
  • A project consisting of exercises that require the use of data mining tools for data analysis. Exercises include: sequential patterns, time series, classification (alternative methods and validation), outlier detection. The project must be carried out by at most 3 people, using Knime, Python, other software, or a combination of them. The results of the different tasks must be reported in a single paper of at most 20 pages, including figures. The project must be delivered at least 2 days before the oral exam and covers the following tasks:
    • Time series: given the 50+ year history of a company's stock values, split it into years and study their similarities, also using clustering. Objectives: compare similarities, compute clustering. Dataset: IBM stocks (source: Yahoo Finance); it includes a Python snippet to read and split the data.
    • Sequential patterns: discover patterns in the stock value time series above. First, preprocess the data by splitting it into monthly time series and discretizing them in some way. Objective: find motif-like patterns (i.e. frequent contiguous subsequences) of length at least 4 days. Dataset: same as the previous point.
    • (Alternative) Classification methods: test different classification methods on a simple classification problem. Dataset: the UCI Abalone dataset, containing various features of abalones, including the age, to be inferred from the number of rings. Objective: (i) discard the “Infant” abalones; (ii) discretize the attribute “Number of rings” into 2 classes; (iii) try at least 3 different classification methods (among those discussed in DM2, including ensemble methods) on the resulting dataset, using the discretized number of rings as the class, and evaluate them with cross-validation.
    • Outlier detection: from the Abalone dataset used above, identify the top 1% outliers. Objective: adopt at least two different methods belonging to different families (e.g. model-based, distance-based, density-based, angle-based, …) to identify the 1% of input records with the highest likelihood of being outliers, and compare the results. Dataset: same as the previous point.
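
The sequential-pattern task above can be sketched roughly as follows, in plain Python. This is only one possible reading of "discretizing them in some way": each day-to-day relative change is encoded as U, D or S, so a pattern of 4 symbols covers 4 consecutive daily changes. The function names, the 1% threshold, the support threshold and the prices are all assumptions made for illustration; the prices are not real IBM data.

```python
from collections import Counter, defaultdict


def split_by_month(records):
    """Group (ISO date, price) pairs into one price list per 'YYYY-MM' key."""
    months = defaultdict(list)
    for date, price in sorted(records):
        months[date[:7]].append(price)
    return dict(months)


def discretize(prices, threshold=0.01):
    """Encode each day-to-day relative change as U (up), D (down) or S (stable)."""
    symbols = []
    for previous, current in zip(prices, prices[1:]):
        change = (current - previous) / previous
        if change > threshold:
            symbols.append("U")
        elif change < -threshold:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)


def frequent_contiguous(sequences, min_len=4, min_support=2):
    """Return contiguous subsequences of at least min_len symbols that occur
    in at least min_support different sequences, with their support."""
    support = Counter()
    for seq in sequences:
        seen = set()  # count each pattern at most once per sequence
        for length in range(min_len, len(seq) + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}


# Made-up closing prices (assumption: toy values, not the course dataset).
records = [
    ("2018-01-02", 100.0), ("2018-01-03", 102.0), ("2018-01-04", 104.0),
    ("2018-01-05", 106.0), ("2018-01-08", 108.0), ("2018-01-09", 108.0),
    ("2018-02-01", 110.0), ("2018-02-02", 112.2), ("2018-02-05", 114.5),
    ("2018-02-06", 116.8), ("2018-02-07", 119.2), ("2018-02-08", 116.0),
    ("2018-03-01", 115.0), ("2018-03-02", 113.0), ("2018-03-05", 111.0),
    ("2018-03-06", 109.0), ("2018-03-07", 107.0), ("2018-03-08", 105.0),
]

monthly = split_by_month(records)
symbols = {month: discretize(prices) for month, prices in monthly.items()}
motifs = frequent_contiguous(symbols.values())
print(symbols)
print(motifs)
```

Enumerating every substring is quadratic in the length of each monthly sequence, which is acceptable for about 20 trading days per month; a different discretization (for example, quantile bins of the price level) would plug into the same counting step.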

Exam sessions

Mid-term exams

Test | Date | Hour | Place | Notes | Marks
First Mid-term 2018 | 30.10.2018 | 11-13 | Room C1, L1, N1 | Please use the registration system: https://esami.unipi.it/ |
Second Mid-term 2018 | 18.12.2018 | 11-13 | Room C1, L1, N1 | Please use the registration system: https://esami.unipi.it/ |

Regular exam sessions

Session | Date | Time | Room | Notes | Marks

Extra sessions A.A. 2017/18

Date | Time | Room | Notes | Results

Previous years

dm/start.txt · Last modified: 17/10/2018 at 22:55 by Anna Monreale