Data Mining A.A. 2020/21

DM1 - Data Mining: Foundations (6 CFU)

Instructors:

Teaching Assistant

DM2 - Data Mining: Advanced Topics and Applications (6 CFU)

News

  • [09.09.2020] The course will be held online on MS Teams.
  • [09.09.2020] The first lesson will be held on 16/09.

Learning Goals

  • DM1
    • Fundamental concepts of knowledge discovery from data
    • Data understanding
    • Data preparation
    • Clustering
    • Classification
    • Pattern Mining and Association Rules
  • DM2
    • Outlier Detection
    • Regression and Forecasting
    • Advanced Classification
    • Time Series Analysis
    • Sequential Pattern Mining
    • Advanced Clustering
    • Transactional Clustering
    • Ethical Issues

Hours and Rooms

DM1

Classes

Day of Week | Hour | Room
Monday | 14:00 - 16:00 | MS Teams
Wednesday | 16:00 - 18:00 | MS Teams

Office hours:

  • Prof. Pedreschi: Monday 16:00 - 18:00, Online
  • Prof. Nanni: appointment by email, Online

DM2

Classes

Day of Week | Hour | Room
Monday | 09:00 - 11:00 | MS Teams
Wednesday | 16:00 - 18:00 | MS Teams

Office hours:

  • Room 268 Dept. of Computer Science
  • Thursday: 15-17, Room: 286
  • Appointment by email

Learning Material

Textbooks

  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, 2006. ISBN 0-321-32136-7
  • Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st ed., 2010. ISBN 978-1-84882-259-7
  • Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed., 2017.

Slides

Software

  • Python - Anaconda (version 3.7): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
  • Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
  • Pandas: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
  • KNIME: the Konstanz Information Miner. Download page
  • WEKA: data mining software in Java. University of Waikato, New Zealand. Download page

Class Calendar (2020/2021)

First Semester (DM1 - Data Mining: Foundations)

Day | Room | Topic | Learning material | Instructor
1. 16.09.2020 14:00-16:00 | MS Teams | Introduction. Course Overview | Introduction DM | Pedreschi
2. 23.09.2020 16:00-18:00 | MS Teams | Data Understanding | Slides DU, Slides on Descriptive Statistics | Pedreschi
3. 28.09.2020 14:00-16:00 | MS Teams | Data Understanding | | Pedreschi
4. 30.09.2020 16:00-18:00 | MS Teams | Data Preparation | Slides DP | Pedreschi
5. 05.10.2020 14:00-16:00 | MS Teams | Lab: Introduction to Python and Knime | Python Introduction, Knime simple workflow, Lecture 5 part 1, Lecture 5 part 2 | Guidotti, Citraro
6. 07.10.2020 16:00-18:00 | MS Teams | Lab: Data Understanding & Preparation | Dataset: Iris, Titanic; Knime: 01_data_understanding.zip; Python: titanic_data_understanding2.ipynb.zip; Lecture 6 part 1, Lecture 6 part 2 | Guidotti, Citraro
7. 12.10.2020 14:00-16:00 | MS Teams | Clustering: Intro & K-means | Slides clustering 1 | Nanni
8. 14.10.2020 16:00-18:00 | MS Teams | Clustering: Hierarchical methods | Slides clustering 2 | Nanni
9. 19.10.2020 14:00-16:00 | MS Teams | Clustering: Density-based methods and exercises | Slides clustering 3, Clustering exercises | Nanni
10. 21.10.2020 16:00-18:00 | MS Teams | Clustering: Validation methods and exercises | Slides clustering 4 | Nanni
11. 26.10.2020 14:00-16:00 | MS Teams | Lab: Clustering | Knime, Python Iris, Python Titanic | Citraro

Second Semester (DM2 - Data Mining: Advanced Topics and Applications)

Day | Room | Topic | Learning material | Instructor (Guidotti)
1. 17.02.2020 09:00-11:00 | MS Teams | Introduction | Intro, Libraries |

Exams

Exam DM1

The exam is composed of two parts:

  • An oral exam that includes: (1) a discussion of the project report; (2) a discussion of the topics presented during the classes, including both theory and practical exercises.
  • A project consisting of exercises that require the use of data mining tools to analyze data. The exercises cover data understanding, clustering analysis, frequent pattern mining, and classification (see the guidelines for more details). The project must be carried out by groups of 3 to 4 people, using Knime, Python, or a combination of the two. The results of the different tasks must be reported in a single paper of at most 20 pages of text, including figures. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM1 2020-2021] Project” as the subject.

Tasks of the project:

  1. Data Understanding: Explore the dataset with the analytical tools studied and write a concise “data understanding” report that describes the data semantics and assesses data quality, the distribution of the variables, and the pairwise correlations (see the Guidelines for details).
  2. Clustering analysis: Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches (see the Guidelines for details).
  3. Classification: Explore the dataset using classification trees and use them to predict the target variable (see the Guidelines for details).
  4. Association Rules: Explore the dataset using frequent pattern mining and association rule extraction. Then use the rules to predict a variable, either to replace missing values or to predict the target variable (see the Guidelines for details).
  • Project 1
    1. Dataset: IBM-HR
    2. Assigned: 16/09/2020
    3. Midterm Deadline: 21/11/2020 (half project required, i.e., data understanding and at least two clustering algorithms)
    4. Final Deadline: 07/01/2021 (complete project required)
    5. Data: here
    6. Description: IBM-HR
    7. Note: please download the data from the "Data" link above and not from the link in the description, as we are using a different version of the data.

Guidelines for the project are here.
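
As a rough illustration of the Association Rules task listed above, here is a minimal sketch in plain Python. It is not the method prescribed by the Guidelines: the toy transactions, the attribute names (overtime, travel, attrition), the thresholds, and the helper functions are all hypothetical, and the brute-force itemset enumeration is only suitable for very small data. In the project you would normally rely on the course tools (Knime or Python libraries) instead.

```python
from collections import Counter
from itertools import combinations

# Toy transactions (hypothetical values, NOT the IBM-HR dataset).
transactions = [
    {"overtime=yes", "travel=frequent", "attrition=yes"},
    {"overtime=yes", "travel=rare", "attrition=yes"},
    {"overtime=no", "travel=rare", "attrition=no"},
    {"overtime=no", "travel=frequent", "attrition=no"},
    {"overtime=yes", "travel=frequent", "attrition=yes"},
]


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force count of all itemsets up to max_size; keep the frequent ones."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for items in combinations(sorted(t), k):
                counts[frozenset(items)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def association_rules(itemsets, min_confidence):
    """Build single-consequent rules: (antecedent, consequent, support, confidence)."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            # Every subset of a frequent itemset is frequent, so the lookup succeeds.
            confidence = support / itemsets[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, item, support, confidence))
    return rules


def predict(record, rules, target_prefix):
    """Return the target value of the most confident rule that applies to the record."""
    candidates = [
        (conf, cons)
        for ante, cons, _sup, conf in rules
        if cons.startswith(target_prefix) and ante <= record
    ]
    return max(candidates)[1] if candidates else None


itemsets = frequent_itemsets(transactions, min_support=0.4)
rules = association_rules(itemsets, min_confidence=0.8)
prediction = predict({"overtime=yes", "travel=rare"}, rules, "attrition=")
print(prediction)
```

The same `predict` helper could be pointed at a different prefix (for example `"travel="`) to fill in a missing value rather than the target variable, which mirrors the two uses mentioned in the task description.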

Exam DM2

TBD

Exam Dates

Exam Sessions

Session | Date | Time | Room | Notes | Marks
1. | 16.01.2019 | 14:00 - 18:00 | MS Teams | Please use the registration system: https://esami.unipi.it/ |

Past Exams

  • Past exam texts can be found on the course pages from previous years. Please do not treat these exercises as the only way to test your knowledge. Exercises can change every year and will be published together with the lecture slides.

Reading About the "Data Scientist" Job

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

  • Data, data everywhere. The Economist, Feb. 2010 download
  • Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
  • Welcome to the yotta world. The Economist, Sept. 2011 download
  • Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
  • Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
  • Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
  • Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
  • Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Previous years

dm/start.txt · Last modified: 23/10/2020 at 10:20 (2 days ago) by Riccardo Guidotti