| dm:start [29/04/2019 at 09:50 (7 years ago)] – [Second part of course, second semester (DMA - Data mining: advanced topics and case studies)] Mirco Nanni | dm:start [21/01/2026 at 11:16 (11 days ago)] (current version) – [Exam Enrollment Instruction] Riccardo Guidotti |
|---|
| | ====== Data Mining A.A. 2025/26 ====== |
| | |
| | ===== DM1 - Data Mining: Foundations (6 CFU) ===== |
| | |
| ====== Data Mining A.A. 2018/19 ====== | |
| |
| ===== DM 1: Foundations of Data Mining (6 CFU) ===== | Instructors: |
| | |
| Instructors - Docenti: | |
| * **Dino Pedreschi** | * **Dino Pedreschi** |
| * KDD Laboratory, Università di Pisa ed ISTI - CNR, Pisa | * KDDLab, Università di Pisa |
| * [[http://www-kdd.isti.cnr.it]] | * [[http://www-kdd.isti.cnr.it]] |
| * [[dino.pedreschi@unipi.it]] | * [[dino.pedreschi@unipi.it]] |
| |
| Teaching assistant - Assistente: | |
| * **Riccardo Guidotti** | * **Riccardo Guidotti** |
| * KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa | * KDDLab, Università di Pisa |
| * [[guidotti@di.unipi.it]] | * [[https://kdd.isti.cnr.it/people/guidotti-riccardo]] |
| | * [[riccardo.guidotti@di.unipi.it]] |
| | |
| |
| ===== DM 2: Advanced topics on Data Mining and case studies (6 CFU) ===== | Teaching Assistant |
| | * **Alessio Cascione** |
| Instructors: | * KDDLab, Università di Pisa |
| * **Mirco Nanni, Dino Pedreschi** | * [[https://www.linkedin.com/in/alessio-cascione-a77224159/?originalSubdomain=it]] |
| * KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa | * [[alessio.cascione@phd.unipi.it]] |
| * [[http://www-kdd.isti.cnr.it]] | |
| * [[mirco.nanni@isti.cnr.it]] | |
| * [[dino.pedreschi@unipi.it]] | |
| |
| ===== DM: Data Mining (9 CFU) ===== | ===== DM2 - Data Mining: Advanced Topics and Applications (6 CFU) ===== |
| |
| Instructors: | Instructors: |
| * **Dino Pedreschi, Anna Monreale** | |
| * KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa | |
| * [[http://www-kdd.isti.cnr.it]] | |
| * [[mirco.nanni@isti.cnr.it]] | |
| * [[dino.pedreschi@unipi.it]] | |
| * [[anna.monreale@unipi.it]] | |
| |
| Teaching assistant - Assistente: | |
| * **Riccardo Guidotti** | * **Riccardo Guidotti** |
| * KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa | * KDDLab, Università di Pisa |
| * [[guidotti@di.unipi.it]] | * [[https://kdd.isti.cnr.it/people/guidotti-riccardo]] |
| | * [[riccardo.guidotti@di.unipi.it]] |
| |
| ====== News ====== | Teaching Assistant |
| * ** Last exam session on Feb, 14. Please register your name here: https://doodle.com/poll/6dgc5du4fgpnbyyx ** | * **Alessio Cascione** |
| * Results of the written exam of Feb {{ :dm:dm_evaluation_1819_-_appello-feb.pdf |}} | * KDDLab, Università di Pisa |
| * Results of the written exam of January {{:dm:dm_evaluation_1819-jan-session.pdf |}} | * [[https://www.linkedin.com/in/alessio-cascione-a77224159/?originalSubdomain=it]] |
| * Dates for exam registration: (a) Jan 21: slot 14 - 15, 16-17; (b) Jan 22: slot 10 - 11; ( c ) Jan 23: slot 09 - 10. Location: Monreale's office. | * [[alessio.cascione@phd.unipi.it]] |
| * ** I set up 3 days for the oral exam: 25, 28, 29 January. Other dates will be available after the written exam of Feb. To book your oral exam, please use the doodle, indicating your Surname and Name: https://doodle.com/poll/3wunys9yd8s9q8ay ** | |
| * Final results including project evaluation available here: {{ :dm:dm_evaluation_1819.pdf |}}. If you do not find your evaluation please write an email to Anna Monreale. | |
| * **New project is available!** | |
| * *Results of the {{ :dm:secondmidterm-2018.pdf |Second mid-term test}}. After the evaluation of the project we will propose you the average grade considering: first and second midterm tests and project. * | |
| * Get clusters from scipy dendrogram: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster | |
| * Help for installing Pyfim library https://anaconda.org/conda-forge/pyfim, https://pypi.org/project/fim/, http://www.borgelt.net/pyfim.html | |
| * *Results of the {{ :dm:20181030-midterm-test.pdf | first mid-term test}}.* | |
| * Students need to decide the group composition for the project and fill in this [[https://docs.google.com/spreadsheets/d/1LubmSiJobg6WstjojdG4r7er2pzOrFrDElG6kdVJrAM/edit?usp=sharing| spreadsheet]] by October 1, 2018. A heterogeneous composition with respect to the master degree is strongly recommended. Each group can have 3 or 4 members. | |
| |
| | ====== News ====== |
| | * **[17.12.2025] DM Exam Registration instruction available in Exam section**. |
| ====== Learning goals -- Obiettivi del corso ====== | * [01.12.2025] The lecture of Thursday 04/12/2025 is moved to Friday 05/12/2025 9-11 in room C (the project presentation of Prof. Pierotti will start at 11, after the DM lecture). The last lecture will be held on Tuesday 09/12/2025 9-11 in room M1 (as Monday 08/12/2025 is a holiday), while the lecture of P4DS is moved to 09/12/2025 16-18 in room C1. |
| | * [19.11.2025] The lecture of Thursday 20/11/2025 will be held in room N1 because room E is unavailable. |
| | * [07.10.2025] The lecture of Thursday 09/10/2025 is canceled due to the UniPi Orienta event. The make-up lecture is on Tuesday 14/10/2025, 9-11, in room M1. |
| | * [06.10.2025] Link to Project Groups Registration DM1 [25/26] (max 3 students per group; access with your University of Pisa account; deadline 17/10/2025): [[https://docs.google.com/spreadsheets/d/1JX3VRwcZZFcTdpiguEwPsR_p4gDyRd7J89O84J7AeyY/edit?gid=0#gid=0| Link]] |
| | * [28.07.2025] Lectures will start on Monday 29 September 2025 at 09:00 in room E. Lectures will be held in person only. Recordings of past years' lectures can be found at the bottom of this web page. |
| | |
| | |
| |
| ** ... a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the "sexiest" around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them. ** | ---- |
| |
| //Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.// | ====== Learning Goals ====== |
| | * DM1 |
| | * Fundamental concepts of knowledge discovery from data. |
| | * Data understanding |
| | * Data preparation |
| | * Clustering |
| | * Classification |
| | * Pattern Mining and Association Rules |
| | * Sequential Pattern Mining |
| |
| The wide availability of data from relational databases, the web, and other sources motivates the study of data analysis techniques that enable a better understanding and easier use of results in decision-making processes. The course aims to provide an introduction to the basic concepts of the knowledge discovery process, the main data mining techniques, and the related algorithms. Particular emphasis is placed on methodological aspects, presented through several paradigmatic classes of applications such as Market Basket Analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical issues inherent in applying inference techniques to data, of which the analyst must be aware. The course consists of the following parts: | * DM2 |
| - the basic concepts of the knowledge discovery process: data understanding and preparation, forms of data, data measures and similarity; | * Outlier Detection |
| - the main data mining techniques (association rules, classification, and clustering), studied in both their formal and implementation aspects; | * Dimensionality Reduction |
| - some case studies in marketing and customer-management support, fraud detection, and epidemiological studies; | * Regression |
| - the last part of the course introduces the privacy and ethical issues inherent in applying inference techniques to data, of which the analyst must be aware. | * Advanced Classification and Regression |
| | * Time Series Analysis |
| | * Transactional Clustering |
| | * Explainability |
| |
| ===== Reading about the "data scientist" job ===== | ====== Hours and Rooms ====== |
| |
| * Data, data everywhere. The Economist, Feb. 2010 {{:dm:economist--010.pdf|download}} | ===== DM1 ===== |
| * Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 [[http://tech.fortune.cnn.com/2011/09/06/data-scientist-the-hot-new-gig-in-tech/|link]] | |
| * Welcome to the yotta world. The Economist, Sept. 2011 {{:dm:economist-2012-dm.pdf|download}} | |
| * Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 [[http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1|link]] | |
| * Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 [[http://www.ilsole24ore.com/art/tecnologie/2012-09-21/futuro-scritto-data-155044.shtml?uuid=AbOQCOhG|link]] | |
| * Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics {{:dm:crossroadsxrds2012fall-dl.pdf|download}} | |
| * Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: [[https://www.youtube.com/watch?v=mXLy3nkXQVM|YouTube video]] | |
| | |
| * Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. [[http://www.fusioncharts.com/whitepapers/downloads/Towards-Effective-Decision-Making-Through-Data-Visualization-Six-World-Class-Enterprises-Show-The-Way.pdf|download]] | |
| ====== Hours - Orario e Aule ====== | |
| | |
| ===== DM1 & DM ===== | |
| |
| **Classes - Lezioni** | **Classes** |
| |
| ^ Day of Week ^ Hour ^ Room ^ | ^ Day of Week ^ Hour ^ Room ^ |
| | Monday | 14:00 - 16:00 | C1 | | | Monday | 09:00 - 11:00 | E | |
| | Wednesday | 14:00 - 16:00 | C1 | | | Thursday | 09:00 - 11:00 | E | |
| | Friday | 11:00 - 13:00 | C1 | | |
| |
| **Office hours - Ricevimento:** | **Office hours - Ricevimento:** |
| |
| * Prof. Pedreschi: Monday 14:00 - 16:00, Department of Computer Science | * Prof. Pedreschi |
| * Prof. Monreale: by appointment, Room 374/DO, Dept. of Computer Science. | * Monday 15:00-17:00 or Appointment by email |
| * Dr. Guidotti: class-appointment (see calendar) | * Room 318 Dept. of Computer Science or MS Teams |
| | |
| | * Prof. Guidotti |
| | * Thursday 16:00 - 18:00 or Appointment by email |
| | * Room 363 Dept. of Computer Science or MS Teams |
| | |
| | |
| | * Alessio Cascione |
| | * Google Meet slot - https://calendly.com/alessio-cascione-phd/30min |
| | * Alternative appointment by email |
| | * I will be out of office from 05/12/2025 to 15/12/2025, checking emails and answering sporadically. |
| | |
| ===== DM 2 ===== | ===== DM 2 ===== |
| |
| |
| **Classes - Lezioni** | **Classes** |
| |
| ^ Day of week ^ Hour ^ Room ^ | ^ Day of Week ^ Hour ^ Room ^ |
| | Thursday | 14 - 16 | A1 | | | Monday | 11:00 - 13:00 | E | |
| | Friday | 16 - 18 | C1 | | | Wednesday | 09:00 - 11:00 | E | |
| |
| **Office hours - Ricevimento:** | **Office Hours - Ricevimento:** |
| | |
| | * Tuesday 15.00-17.00 or Appointment by email |
| | * Room 363 Dept. of Computer Science or MS Teams |
| |
| * Nanni : appointment by email, c/o ISTI-CNR | |
| ====== Learning Material -- Materiale didattico ====== | ====== Learning Material -- Materiale didattico ====== |
| |
| * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, 2006 | * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, 2006 |
| * [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php]] | * [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php]] |
| * Chapters 4, 6 and 8 are also available at the publisher's Web site. | * Chapters 3, 5 and 7 are also available at the publisher's Web site. |
| * Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. **GUIDE TO INTELLIGENT DATA ANALYSIS.** Springer Verlag, 1st Edition., 2010. ISBN 978-1-84882-259-7 | * Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. **GUIDE TO INTELLIGENT DATA ANALYSIS.** Springer Verlag, 1st Edition., 2010. ISBN 978-1-84882-259-7 |
| * Laura Igual et al.** Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications**. 1st ed. 2017 Edition. | * Laura Igual et al.** Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications**. 1st ed. 2017 Edition. |
| |
| |
| ===== Slides of the classes -- Slides del corso ===== | ===== Slides ===== |
| |
| * The slides used in the course will be inserted in the calendar after each class. Most of them are part of the slides provided by the textbook's authors [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4|Slides for "Introduction to Data Mining"]]. | * The slides used in the course will be inserted in the calendar after each class. Most of them are part of the slides provided by the textbook's authors [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4|Slides for "Introduction to Data Mining"]]. |
| ===== Past Exams ===== | |
| | ===== FAQ ===== |
| |
| * Some text of past exams on **DM1 (6CFU)**: | For the academic year 2025/2026, we make available a document containing **frequently asked questions (FAQs)** about the project at the end of the lecture. |
| | Please consult this document first, as your question may already be answered there. |
| | The FAQ will be updated regularly after each lecture with new relevant questions from students. |
| |
| * {{ :dm:2017-1-19.pdf |}}, {{ :dm:2017-9-6.pdf |}}, {{ :dm:2016-05-30-dm1-seconda.pdf |}} | Check the document: |
| | https://docs.google.com/document/d/1OLa02xofxRPj1zUJ7zm_boxL_ZeAFR1HWCB4lgozgz8/edit?usp=sharing |
| |
| * Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers **DM1 (9CFU)**: | |
| * {{ :dm:dm2_exam.2017.06.13_solutions.pdf |}}, {{ :dm:dm2_exam.2017.07.04_solutions.pdf |}}, {{ :dm:dm2_mid-term_exam.2017.06.06_solutions.pdf |}} | |
| |
| * Some exercises (partially with solutions) on **sequential patterns** and **time series** can be found in the following texts of exams from the last years: | |
| * {{ :dm:dm2_exam.2015.04.13.results.pdf|}}, {{ :dm:dm2_exam.2016.04.4_sol.pdf |}}, {{ :dm:dm2_exam.2016.04.5_sol.pdf |}}, {{ :dm:dm2_exam.2016.06.20_sol.pdf |}}, {{ :dm:dm2_exam.2016.07.08_sol.pdf |}} | |
| |
| | ===== Recordings of Past Years ===== |
| |
| * Some very old exercises (some with solutions) are available here; most are in Italian, and not all cover topics in this year's program: | Link to past years' recordings (updated incrementally, following the current lectures of the course) |
| * {{tdm:verifica2006.pdf|Test 2006}}, {{tdm:verifica2005.pdf|Test 2005 (with solutions)}}, {{tdm:verifica2004.pdf|Test 2004}} | |
| * {{dm:verifica.05.06.2007.pdf|Test 5 June 2007}}, {{dm:verifica.26.06.2007.pdf|Test 26 June 2007}}, {{dm:verifica.24.07.2007_corretto.pdf|Test 24 July 2007}} (and {{dm:verifica.24.07.2007_soluzioni.pdf|Solutions}}) | |
| * {{:dm:verifica.2008.04.03.pdf|Test 3 April 2008}} (and {{:dm:soluzioni.2008.04.03.pdf|Solutions}}), {{:dm:dm-tdm.appello_2008_07_18_parte1.pdf|Test 18 July 2008 - part 1}}, {{:dm:dm-tdm.appello_2008_07_18_parte2.pdf|Test 18 July 2008 - part 2}} | |
| * {{:dm:appello.2010.06.01_soluzioni.pdf| Exam with solution 2010-06-01}} {{:dm:appello.2010.06.22_soluzioni.pdf|Exam with solution 2010-06-22}} {{:dm:appello.2010.09.09_soluzioni.pdf|Exam with solution 2010-09-09}}{{:dm:appello.2010.07.13_soluzioni.pdf| Exam with solution 2010-07-13}} | |
| |
| ===== Data mining software ===== | https://unipiit-my.sharepoint.com/:f:/g/personal/a_cascione_studenti_unipi_it/IgCdnqZe6wTKQJR_4yVrXE3gAcmqWHBSxvxW0HtsA596LWQ?e=OCa34K |
| | ===== Software ===== |
| |
| * [[http://www.knime.org | KNIME ]] The Konstanz Information Miner. [[http://www.knime.org/download-desktop| Download page ]] | * Python - Anaconda (>3.7): Anaconda is the leading open data science platform powered by Python. [[https://www.anaconda.com/distribution/| Download page]] (the following libraries are already included) |
| * [[https://www.continuum.io/downloads | Python - Anaconda (2.7 version!!!)]]: Anaconda is the leading open data science platform powered by Python. [[https://www.continuum.io/downloads | Download page]] (the following libraries are already included) | |
| * Scikit-learn: python library with tools for data mining and data analysis [[http://scikit-learn.org/stable/ | Documentation page]] | * Scikit-learn: python library with tools for data mining and data analysis [[http://scikit-learn.org/stable/ | Documentation page]] |
| * Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. [[http://pandas.pydata.org/ | Documentation page]] | * Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. [[http://pandas.pydata.org/ | Documentation page]] |
| | |
| | Other software for Data Mining |
| | * [[http://www.knime.org | KNIME ]] The Konstanz Information Miner. [[http://www.knime.org/download-desktop| Download page ]] |
| * [[http://www.cs.waikato.ac.nz/ml/weka/ | WEKA ]] Data Mining Software in JAVA. University of Waikato, New Zealand [[http://www.cs.waikato.ac.nz/ml/weka/ | Download page ]] | * [[http://www.cs.waikato.ac.nz/ml/weka/ | WEKA ]] Data Mining Software in JAVA. University of Waikato, New Zealand [[http://www.cs.waikato.ac.nz/ml/weka/ | Download page ]] |
| | * Didactic Data Mining [[http://matlaspisa.isti.cnr.it:5055/Help| DDMv1]], [[https://kdd.isti.cnr.it/ddm/#/| DDMv2]] |
| | |
| | ====== Class Calendar (2025/2026) ====== |
| |
| ====== Class calendar - Calendario delle lezioni (2018/2019) ====== | ===== First Semester (DM1 - Data Mining: Foundations) ===== |
| |
| ===== First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining) ===== | ^ ^ Day ^ Time ^ Room ^ Topic ^ Material ^ Lecturer ^ |
| | | | 15.09.2025 | | | No Lecture | | | |
| | | | 18.09.2025 | | | No Lecture | | | |
| | | | 22.09.2025 | | | No Lecture | | | |
| | | | 25.09.2025 | | | No Lecture | | | |
| | |01.| 29.09.2025 | 09-11 | E | Overview, Introduction | {{ :dm:00_dm1_introduction_2025_26.pptx.pdf | Intro}} | Pedreschi | |
| | |02.| 02.10.2025 | 09-11 | E | The KDD process | {{ :dm:00_dm1_introduction_2025_26.pptx.pdf | Intro}} | Pedreschi | |
| | |03.| 06.10.2025 | 09-11 | E | Introduction to Python | {{:dm:06.10.25_python_basic_2025_lecture_in_class.zip |}} | Pedreschi, Cascione | |
| | | | 09.10.2025 | | | No Lecture (UNIPI Orienta) | | | |
| | |04.| 13.10.2025 | 09-11 | E | Data Understanding | {{ :dm:01_dm1_data_understanding_2025_26.pdf | Data Understanding }} | Pedreschi | |
| | |05.| 14.10.2025 | 09-11 | C1 | Data Preparation | {{ :dm:02_dm1_data_preparation_2025_26.pdf | Data Preparation}}, {{ :dm:03_dm1_data_similarity_2025_26.pdf | Data Similarity}} | Guidotti | |
| | |06.| 16.10.2025 | 09-11 | E | Data Understanding Lab| {{ :dm:16.10.25_data_understanding_2025_lecture_in_class.zip |}} | Guidotti, Cascione | |
| | |07.| 20.10.2025 | 09-11 | E | Data Similarity and Introduction to Clustering | {{ :dm:03_dm1_data_similarity_2025_26.pdf | Data Similarity}}, {{ :dm:04_dm1_clustering_intro_2025_26.pdf | Introduction to Clustering}} | Guidotti | |
| | |08.| 23.10.2025 | 09-11 | E | Centroid-based Clustering Algorithm | {{ :dm:05_dm1_kmeans_2025_26.pdf | Centroid-based Clustering}} | Guidotti | |
| | |09.| 27.10.2025 | 09-11 | E | Hierarchical Clustering Algorithm | {{ :dm:06_dm1_hierarchical_clustering_2025_26.pdf | Hierarchical Clustering}} | Guidotti | |
| | |10.| 27.10.2025 | 09-11 | E | Density-based Clustering Algorithm | {{ :dm:07_dm1_density_based_2025_26.pdf | Density-based Clustering}} | Guidotti | |
| | |11.|03.11.2025 | 09-11 | E | Clustering Lab | {{ :dm:03.11.25_clustering_2025_lecture_in_class.zip |}} | Pedreschi, Cascione | |
| | |12.|04.11.2025 | 09-11 | C1 | Classification: Overview and K-Nearest Neighbours | {{ :dm:08_dm1_classification_intro_2024_25.pptx.pdf | Classification Overview }} {{ :dm:09_dm1_knn_2024_25.pptx.pdf | KNN Classifier }} | Pedreschi | |
| | |13.|06.11.2025 | 09-11 | E | Classification: Naive Bayes Classifier and Exercises | {{ :dm:10_dm1_naive_bayes_2024_25.pptx.pdf | Naive Bayes }} | Pedreschi | |
| | |14.|10.11.2025 | 09-11 | E | Classification: Evaluation | {{ :dm:11_dm1_classification_eval_2024_25.pptx.pdf | Model evaluation }} | Pedreschi | |
| | |15.|13.11.2025 | 09-11 | E | Classification: Decision Trees (1) | {{ :dm:12_dm1_decision_trees_2024_25.pptx.pdf | Decision trees }} | Pedreschi | |
| | |16.|17.11.2025 | 09-11 | D5 | Classification: Decision Trees (2) | | Pedreschi | |
| | |17.|18.11.2025 | 09-11 | C1 | Classification: Decision Trees (3) | | Pedreschi | |
| | |18.|20.11.2025 | 09-11 | N1 | Classification Lab | {{ :dm:20.11.25_classification_2025_lecture_in_class.zip |}} | Guidotti, Cascione | |
| | |19.|24.11.2025 | 09-11 | E | Pattern Mining: Apriori | {{ :dm:14_dm1_pattern_mining_2024_25.pptx.pdf | Pattern mining & association rules }} | Pedreschi | |
| | |20.|25.11.2025 | 09-11 | C | Pattern Mining: Lift, Interest, Multiattribute | | Pedreschi | |
| | |21.|27.11.2025 | 09-11 | E | Regression: Problem, Linear, KNN, Decision Tree | {{ :dm:13_dm1_linear_regression_2024_25.pptx.pdf | Regression }} | Pedreschi | |
| | |22.|01.12.2025 | 09-11 | E | Lab on Regression and Pattern Mining; FPGROWTH| {{ :dm:01.12.25_regression_2025_lecture_in_class.zip |}}, {{ :dm:01.12.25_pattern_mining_2025_lecture_in_class.zip |}}, {{ :dm:14_dm1_pattern_mining_2024_25.pptx.pdf | FPGROWTH }}| Guidotti, Cascione | |
| | |23.|04.12.2025 | 09-11 | C | Exercises Pattern Mining & Decision Trees | | Guidotti | |
| | |24.|09.12.2025 | 09-11 | M1 | Rule-based Classifiers |{{ :dm:15_dm1_rule_based_classifier_2025_26.pdf | Rule-Based Classifier}} | Guidotti | |
| |
| ^ ^ Day ^ Room ^ Topic ^ Learning material ^ Instructor ^ | |
| |1.| 19.09 14:00-16:00 | C1 | Overview. Introduction. | {{ :dm:1.2018-dm-overview.pdf |}} | Pedreschi | | |
| |2.| 20.09 16:00-18:00 | C1 | Introduction | | Pedreschi | | |
| | | 21.09 11:00-13:00 | C1 | Lecture canceled | | Pedreschi | | |
| |3.| 24.09 14:00-16:00 | C1 | KDD Process & Applications. Data Understanding. | {{ :dm:2.2018-dm-introduction.pdf |DM + Applications}} {{ :dm:2-dataunderstanding-sa.pdf |DU}}| Monreale | | |
| |4.| 26.09 14:00-16:00 | C1 | Data Understanding. Data Preparation | | Monreale | | |
| |5.| 28.09 11:00-13:00 | C1 | Introduction to Python, Knime | {{ :dm:00_start_with_knime.zip | intro_knime}} {{ :dm:intro_python_jupyter.zip | intro_python}} | Monreale/Guidotti | | |
| |6.| 01.10 14:00-16:00 | C1 | Data Preparation | {{ :dm:3.dm_ml_data_preparation.pdf | Data Preparation}} | Monreale | | |
| |7.| 03.10 14:00-16:00 | C1 | Clustering Introduction and Centroid-based clustering | {{ :dm:4.basic_cluster_analysis-intro-kmeans.pdf |}} | Monreale | | |
| | | 05.10 11:00-13:00 | C1 | Lecture canceled | | | | |
| |8.| 08.10 14:00-16:00 | C1 | Knime - Python: Data Understanding |{{ :dm:01_data_understanding.zip | du_knime}} {{ :dm:titanic_data_understanding.ipynb.zip | du_python}}| Guidotti | | |
| |9.| 10.10 14:00-16:00 | C1 | Clustering: K-means & Hierarchical | {{ :dm:5.basic_cluster_analysis-hierarchical.pdf |}}| Pedreschi | | |
| | | 12.10 11:00-13:00 | C1 | Lecture canceled for IF | | | | |
| |10.| 15.10 14:00-16:00 | C1 | Clustering: DBSCAN | {{ :dm:6.basic_cluster_analysis-dbscan-validity.pdf |}}| Pedreschi | | |
| |11.| 17.10 14:00-16:00 | C1 | Clustering: Validity | | Pedreschi | | |
| |12.| 19.10 11:00-13:00 | C1 | Discussion on Projects - DU | | Guidotti | | |
| |13.| 22.10 14:00-16:00 | C1 | Exercises for mid-term test | Tool for Dm ex: [[http://matlaspisa.isti.cnr.it:5055/Help|Didactic Data Mining ]] {{ :dm:ex-clustering.pdf | Ex. Clustering PDF}} {{ :dm:ex-clustering.zip |Ex. Clustering PPTX}}| Monreale | | |
| |14.| 24.10 14:00-16:00 | C1 | Knime - Python: Clustering | {{ :dm:knime_clustering.zip |clustering_knime}} {{ :dm:python_clustering.zip |clustering_python}}| Guidotti | | |
| |15.| 26.10 11:00-13:00 | C1 | Exercises for mid-term test | {{ :dm:clustering-2.zip |Ex. Clustering PPTX - complete }} {{ :dm:clustering-2.pdf |Ex. Clustering PDF - complete }} {{ :dm:exercises-dm1.pdf | Exercises DU}} {{ :dm:ex-silhouette.pdf |}}| Monreale | | |
| |16.| 05.11 14:00-16:00 | C1 | Classification/1 | {{ :dm:7.chap3_basic_classification.ppt |}}| Monreale | | |
| |17.| 07.11 14:00-16:00 | C1 | Classification/2 | | Monreale | | |
| | | 09.11 11:00-13:00 | C1 | CANCELED| | | | |
| |18.| 12.11 14:00-16:00 | C1 | LAB: Classification |{{ :dm:knime_classification.zip | knime_classification}} {{ :dm:python_classification.zip | python_classification}} | Guidotti | | |
| |19.| 14.11 14:00-16:00 | C1 | Pattern Mining | {{ :dm:opentheblackbox.pdf | Explanation of classification/ML models }} {{ :dm:dm_patternmining.intro.pptx.pdf | Pattern mining Intro }} {{ :dm:8.tdm-patterns-assrules.pdf | Apriori Algorithm for Pattern/AR Mining }} | Pedreschi| | |
| |20.| 16.11 11:00-13:00 | C1 | Pattern Mining| | Pedreschi | | |
| |21.| 19.11 14:00-16:00 | C1| Exercises for the mid-term| {{ :dm:ex-second-midterm.pdf |}} |Monreale | | |
| |22.| 21.11 14:00-16:00 | C1| Lab Pattern Mining+ Discussion Clustering |{{ :dm:pattern_knime.zip |knime_pattern}} {{ :dm:pattern_python.zip |python_pattern}} https://anaconda.org/conda-forge/pyfim, https://pypi.org/project/fim/, http://www.borgelt.net/pyfim.html|Guidotti/Pedreschi| | |
| | | | | **The next lectures are dedicated to the DM of 9 credits** | | | | |
| |23.| 23.11 11:00-13:00 | C1| Alternative methods for Pattern Mining. Privacy in DM | {{ :dm:fp-growth.pdf |}}|Monreale| | |
| |24.| 26.11 14:00-16:00 | C1| Alternative methods for Clustering. Privacy in DM | {{ :dm:1-alternative-clustering.pdf |}}|Monreale| | |
| |25.| 28.11 14:00-16:00 | C1| Privacy in DM. Transactional Clustering | {{ :dm:2-transactionalclustering.pdf |}} {{ :dm:privacydt.pdf |}} {{ :dm:papers.zip |Papers on Clustering}} |Monreale| | |
| |26.| 30.11 11:00-13:00 | C1| Alternative methods for classification/1 | {{ :dm:lezioneadvancedclassificationmethods1-knn_nb.pdf | K-Nearest Neighbors & Naive Bayes }} |Pedreschi| | |
| |27.| 03.12 14:00-16:00 | C1| Alternative methods for classification/2 | {{ :dm:lezioneadvancedclassificationmethods3_rules-ensemble.pdf | Ensemble methods}} {{ :dm:ensemblemethod_wisdomofthecrowd.pdf | Wisdom of the crowd & Ensemble methods }} {{ :dm:voxpopuli-galton-1907.pdf | Galton's Vox Populi}} |Pedreschi| | |
| |28.| 05.12 14:00-16:00 | C1| Alternative methods for classification/3 | |Pedreschi| | |
| |29.| 07.12 11:00-13:00 | C1| Exercises on clustering and classification | {{ :dm:exercises-clope.pdf | CLOPE}} {{ :dm:exercises-clustering-kmode.pdf | K-mode}} {{ :dm:ex-classification-knn-nb.pdf | KNN & NB}}|Monreale| | |
| |30.| 10.12 14:00-16:00 | C1| Exercises on Second part - all students | | Monreale| | |
| |31.| 12.12 14:00-16:00 | C1|Final Discussion on Project - all students | |Pedreschi/Guidotti| | |
| |32.| 14.12 11:00-13:00 | C1| Cancelled | | | | |
| |
| | ===== Second Semester (DM2 - Data Mining: Advanced Topics and Applications) ===== |
| |
| ===== Second part of course, second semester (DMA - Data mining: advanced topics and case studies) ===== | ^ ^ Day ^ Time ^ Room ^ Topic ^ Material ^ Lecturer ^ |
| | |01.| 18.02.2025 | 14-16 |A1| Overview, Imbalanced Learning | {{ :dm:16_dm2_intro_2024_25.pdf | Introduction}}, {{ :dm:dm2_project_guidelines_24_25.pdf | Guidelines}}, {{ :dm:17_dm2_imbalanced_learning_2024_25.pdf | Imbalanced Learning}}, [[https://unipiit.sharepoint.com/:v:/s/a__td_64992/EWrX2F6xAS9JtNXh1l5JIgMByAU0eMWBFr5sbGIYL3jakA|Link]] | Guidotti| |
| |
| ^ ^ Day ^ Room (Aula) ^ Topic ^ Learning material ^ Instructor (default: Nanni)^ | |
| |1.| 21.02.2019 14:00-16:00 | A1 | Introduction + Sequential patterns/1 | {{ :dm:dm2_2019_intro.pdf |Introduction}}, {{ :dm:sequential_patterns_2019.pdf |Sequential patterns}} | | | |
| |2.| 22.02.2019 16:00-18:00 | C1 | Sequential patterns/2 | | | | |
| |3.| 01.03.2019 16:00-18:00 | C1 | Sequential patterns/3 | {{ :dm:exercises_2019.03.01_fixed.zip |Sample exercises (fixed)}} | | | |
| |4.| 07.03.2019 14:00-16:00 | A1 | Sequential patterns/4 | Sequential pattern tools: Link to [[http://www.philippe-fournier-viger.com/spmf/|SPMF]] + {{ :dm:spmf_datasets.zip | Sample datasets}}, {{ :dm:gsp_py_2019.zip |Python2 GSP educational implementation}}([[http://sequenceanalysis.github.io/|source]]), [[https://github.com/chuanconggao/PrefixSpan-py|PrefixSpan-py]] (requires Python3) | | | |
| |5.| 08.03.2019 16:00-18:00 | C1 | Time series/1 | {{ :dm:time_series_2019.pdf |Time series}} | | | |
| |6.| 14.03.2019 14:00-16:00 | A1 | Time series/2 | [[https://cs.gmu.edu/~jessica/BookChapterTSMining.pdf|Overview on DM for time series]], [[https://pdfs.semanticscholar.org/18f3/55d7ef4aa9f82bf5c00f84e46714efa5fd77.pdf|DTW paper by Sakoe and Chiba, 1978]] | | | |
| |7.| 15.03.2019 16:00-18:00 | C1 | Time series/3 | | | | |
| |8.| 21.03.2019 14:00-16:00 | A1 | Time series/4 | {{ :dm:timeseries_1_preprocess_2019.zip |Preprocessing in Python}} {{ :dm:timeseries_2_dtw_2019.zip |DTW in Python}} | | | |
| |9.| 22.03.2019 16:00-18:00 | C1 | Time series/5 | | | | |
| |10.| 28.03.2019 14:00-16:00 | A1 | Exercises for mid-term exam | {{ :dm:0.dm2_mid-term_exam.2018.04.10.pdf |Exercises from past exams}} | | | |
| |11.| 29.03.2019 16:00-18:00 | C1 | Exercises for mid-term exam | {{ :dm:exercises_2019.03.29.zip |Exercises from past exams (with some solutions)}} | | | |
| | | 04.04.2019 16:00-18:00 | A1 + E | **mid-term exam** | | | | |
| |11.| 11.04.2019 14:00-16:00 | A1 | Classification: alternative methods/1 | {{ :dm:lezioneadvancedclassificationmethods1-knn_nb.pdf |kNN and Bayes classifier}} | | | |
| |12.| 12.04.2019 16:00-18:00 | C1 | Classification: alternative methods/2 | {{ :dm:classification_nnandsvm.pdf |NN and SVM}}, {{ :dm:exercises_classification_2.pdf |Exercises}} | | | |
| | | <del>02.05.2019 14:00-16:00</del> | <del>A1</del> | Cancelled | | | | |
| |13.| 03.05.2019 16:00-18:00 | C1 | Classification: alternative methods/3 | | | | |
| ====== Exams ====== | ====== Exams ====== |
| |
| ===== Exam DM part I (DMF) ====== | ** How and Where: ** |
| | The exam is oral only and takes place in the teacher's office or in a previously designated classroom. |
| | The exam is held online, on the 420AA Data Mining course channel, only at the request of the |
| | student and in accordance with current legislation. |
| |
| The exam is composed of three parts: | ** When: ** |
| | The dates relating to the start of the three exams are/will be published on the online platform |
| | https://esami.unipi.it/. Within each session, we will identify dates and slots in order to distribute the |
| | various orals. The dates and slots to take the exam will be published on the course page by the end of |
| | May. Each student must also register on https://esami.unipi.it/. The exam can only be taken after the project has been delivered. The project must be delivered one week before the date on which you want to take the exam. Group oral discussions that follow the project groups are preferred, so that the project discussion can be held in parallel. It is not mandatory to take the oral exam together with the other members of the group. |
| | If the oral exam is not passed, it cannot be retaken until the next exam session. If the project is not considered sufficient, it must be redone on a new dataset or on a substantially updated version of the current one. |
| |
| * A **written exam**, with exercises and questions about methods and algorithms presented during the classes. It can be substituted with the first and second mid-term tests of November and December. | ** What: ** |
| | The oral test will evaluate the practical understanding of the algorithms. The exam will evaluate three aspects. |
| | - Understanding of the theoretical aspects of the topics addressed during the course. The student may be required to write formulas or pseudocode. During the explanations, the student can use pen and paper. |
| | - Understanding of the algorithms illustrated during the course and their practical implementation. You will be asked to perform one or more simple exercises. The text will be shown on the teacher's screen and/or copied to Miro. The student will have to use pen and paper (or, if online, Miro: https://miro.com/) to show how the exercise is solved. |
| | - Discussion of the project with questions from the teacher regarding unclear aspects, questionable steps or choices. |
| |
| * An **oral exam (optional)**, that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam. It is optional for students who passed the written part through the mid-term tests ONLY. | ** Final Mark: ** for the 12-credit exam, the final mark will be obtained as the |
| | average mark of DM1 and DM2. |
| |
| * A **project** consists in exercises that require the use of data mining tools for analysis of data. Exercises include: data understanding, clustering analysis, frequent pattern mining, and classification. The project has to be performed by min 3, max 4 people. It has to be performed by using Knime, Python or a combination of them. The results of the different tasks must be reported in a unique paper. The total length of this paper must be max 20 pages of text including figures. The paper must be emailed to [[datamining.unipi@gmail.com]]. Please, use “[DM 2018-2019] Project 2” in the subject. ** Students who decide to do the project during the winter or summer exam sessions will find the dataset of the project online after 31/05/2019. In this case the project must be delivered at least 2 days before the oral exam**. | ===== Exam Enrollment Instruction ===== |
| Tasks of the project: | * If you are a student of Data Science 1st year |
| - ** Data Understanding (Collective discussion on: 19/10/2018): ** Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations. (see Guidelines for details) | * Then register here: [[https://forms.gle/NceAgxW3FmqfSKhu7|here]] |
| - ** Clustering analysis (Collective discussion on: 21/11/2018): ** Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details) | * Else (not Data Science first year or other degrees like Digital Humanities or any other) register [[https://esami.unipi.it/|here]] |
| - ** Classification (Collective discussion on: 12/12/2018): ** Explore the dataset using classification trees and random forest. Use them to predict the target variable. (see Guidelines for details) | * Deadline: 01/02/2026 |
| - ** Association Rules (Collective discussion on: 12/12/2018): ** Explore the dataset using frequent pattern mining and association rules extraction. Then use them to predict a variable either for replacing missing values or to predict target variable. (see Guidelines for details) | * Oral Exams will start from the 06/02/2026 |
| | * Some days after the 01/02/2026 and before the 06/02/2026 all those registered will receive an email with a link to an Agenda to select the exam day and the time slot. |
| | |
| |
| | ===== Exam DM1 ===== |
| |
| * Project 1 | The exam is composed of two parts: |
| - Dataset: **Credit Card Default** | |
| - Assigned: 01/10/2018 | |
| - Deadline: <del>05/01/2019</del>, 09/01/2019 | |
| - Link: https://www.kaggle.com/t/5d7277746f8d45d6a10686506f602a9b | |
| |
| | * An **oral exam**, that includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory and practical exercises. |
| |
| * Project 2 | * A **project**, that consists in exercises requiring the use of data mining tools for analysis of data. Exercises include: data understanding, clustering analysis, pattern mining, and classification (guidelines will be provided for more details). The project has to be performed by min 2, max 3 people. It has to be performed by using Python or any other data mining software. The results of the different tasks must be reported in a unique paper. The total length of this paper must be max 20 pages of text including figures. The paper must be emailed to [[alessio.cascione@phd.unipi.it]] and [[riccardo.guidotti@unipi.it]]. Please, use “[DM1 2025-2026] Project” in the subject. |
| - Dataset: **Telco Customer Churn** | |
| - Assigned: 10/01/2019 | * **Dataset** |
| - Deadline: 31/05/2019 | - Assigned: 15/10/2025 |
| - Link: https://www.kaggle.com/blastchar/telco-customer-churn | - MidTerm Submission: 15/11/2025 (+0.5) (half project required, i.e., Data Understanding & Preparation and Clustering) |
| | - Final Submission: 31/12/2025 (+0.5) one week before the oral exam (complete project required). |
| | - Dataset: Download here {{ :dm:dm1_25_26_dataset.zip |}} |
| |
| | ** DM1 Project Guidelines ** |
| | See {{ :dm:dm1_project_guidelines_25_26.pdf |}} |
| |
| **Guidelines for the project are [[dm:start:guidelines|here]].** | |
| | |
| ===== Exam DM part II (DMA) ====== | |
| |
| The exam is composed of three parts: | ===== Exam DM2 ===== |
| |
| * A **written exam**, with exercises and questions about methods and algorithms presented during the classes. It can be substituted with the first and second mid-term tests of April and June. | The exam is composed of two parts: |
| |
| * A small **online test** for the data ethics part. The test can be taken at the following link: [[https://thinfi.com/2etq|Link to "First Aid for Data Scientist" web site]] (pwd: datamining_2018). Register, and enroll in the "First Aid for Data Scientist" course. Take the quizzes of the 3 units. Then, download your certificate and send it to [[mirco.nanni@isti.cnr.it]] before the oral exam. | * An **oral exam**, that includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory and practical exercises. |
| |
| * An **oral exam**, that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam. | * A **project**, that consists in exercises requiring the use of data mining tools for analysis of data. Exercises include: imbalanced learning, dimensionality reduction, outlier detection, advanced classification/regression methods, time series analysis/clustering/classification (guidelines will be provided for more details). The project has to be performed by min 1, max 3 people. It has to be performed by using Python or any other data mining software. The results of the different tasks must be reported in a unique paper. The total length of this paper must be max 30 pages of text including figures. The paper must be emailed to [[andrea.fedele@phd.unipi.it]] and [[riccardo.guidotti@unipi.it]]. Please, use “[DM2 2024-2025] Project” in the subject. |
| | |
| | * **Dataset** |
| | - Assigned: 18/02/2026 |
| | - MidTerm Submission: 07/05/2026 |
| | - Final Submission: one week before the oral exam (complete project required). |
| | - Dataset: TBD |
| |
| * A **project** consists in exercises that require the use of data mining tools for analysis of data. Exercises include: sequential patterns, time series, classification (alternative methods and validation), outlier detection. The project has to be performed by max 3 people. It has to be performed by using Knime, Python, other software or a combination of them. The results of the different tasks must be reported in a unique paper. The total length of this paper must be max 20 pages of text including figures. The project must be delivered at least 2 days before the oral exam: | ** DM2 Project Guidelines ** |
| * **Time series**: given the 50+ years long history of stock values of a company, split it into years, and study their similarities, also using clustering. **Objectives**: compare similarities, compute clustering. **Dataset**: {{ :dm:dm2018_project_1.zip |IBM stocks}} (source: [[https://finance.yahoo.com/quote/IBM/history?period1=-252378000&period2=1523656800&interval=1d&filter=history&frequency=1d&guccounter=1|Yahoo Finance]]), includes a Python snippet to read and split the data. Dataset obtained from Yahoo!Finance service. | See TBD. |
| * **Sequential patterns**: discover patterns over the stock value time series above. Before that, preprocess the data by splitting it into monthly time series and discretizing them in some way. **Objective**: find Motifs-like patterns (i.e. frequent contiguous subsequences) of length at least 4 days. **Dataset**: same as the point before. | |
| * **(Alternative) Classification methods**: test different classification methods over a simple classification problem. **Dataset**: the [[https://archive.ics.uci.edu/ml/datasets/Abalone|UCI Abalone dataset]], containing various features of abalones, including the age -- to be inferred from the number of rings. **Objective**: (i) discard the "Infant" abalones; (ii) discretize the attribute "Number of rings" into 2 classes; (iii) try at least 3 different classification methods (among those discussed in DM2, including ensemble methods) on the resulting dataset, using the discretized n. of rings as class, and evaluating them with cross-validation. | |
| * **Outlier detection**: from the Abalone dataset used above, identify the top 1% outliers. **Objective**: adopt at least two different methods belonging to different families (i.e. model-based, distance-based, density-based, angle-based, ...) to identify the 1% of input records with the highest likelihood of being outliers, and compare the results. **Dataset**: same as the point before. | |
| |
| ====== Exam sessions ====== | |
| |
| ===== Mid-term exams ===== | |
| |
| ^ ^ Date ^ Hour ^ Place ^ Notes ^ Marks ^ | |
| | DM1: First Mid-term 2018 | 30.10.2018 | 11-13 | Room C1, L1, N1 | Please, use the system for registration: https://esami.unipi.it/| {{ :dm:20181030-midterm-test.pdf | results }} | | |
| | DM1: Second Mid-term 2018 | 18.12.2018| 11-13 | Room C1, L1, N1 | Please, use the system for registration: https://esami.unipi.it/| | | |
| | DM2: First Mid-term 2019 | 04.04.2019 | 16-18 | Room A1, E | Please, use the system for registration: https://esami.unipi.it/ \\ {{ :dm:dm2_mid-term_exam.2019.04.04_solutions.pdf |Solutions}}| ongoing... | | |
| |
| ===== Appelli regolari / Exam sessions ===== | |
| ^ Session ^ Date ^ Time ^ Room ^ Notes ^ Marks ^ | |
| |1.|16.01.2019| 14:00 - 18:00| Room E | | | | |
| |2.|06.02.2019| 14:00 - 18:00| Room E | | | | |
| |
| ===== Appelli straordinari A.A. 2017/18 / Extra sessions A.A. 2017/18 ===== | ===== Past Exams ===== |
| | * Past exams texts can be found in old pages of the course. Please do not consider these exercises as a unique way of testing your knowledge. Exercises can be changed and updated every year and will be published together with the slides of the lectures. |
| |
| ^ Date ^ Time ^ Room ^ Notes ^ Results ^ | ===== Reading About the "Data Scientist" Job ===== |
| | |
| | ** ... a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the "sexiest" around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them. ** |
| | |
| | //Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.// |
| | |
| | * Data, data everywhere. The Economist, Feb. 2010 {{:dm:economist--010.pdf|download}} |
| | * Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 [[http://tech.fortune.cnn.com/2011/09/06/data-scientist-the-hot-new-gig-in-tech/|link]] |
| | * Welcome to the yotta world. The Economist, Sept. 2011 {{:dm:economist-2012-dm.pdf|download}} |
| | * Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 [[http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1|link]] |
| | * Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 [[http://www.ilsole24ore.com/art/tecnologie/2012-09-21/futuro-scritto-data-155044.shtml?uuid=AbOQCOhG|link]] |
| | * Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics {{:dm:crossroadsxrds2012fall-dl.pdf|download}} |
| | * Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: [[https://www.youtube.com/watch?v=mXLy3nkXQVM|YouTube video]] |
| | * Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. [[http://www.fusioncharts.com/whitepapers/downloads/Towards-Effective-Decision-Making-Through-Data-Visualization-Six-World-Class-Enterprises-Show-The-Way.pdf|download]] |
| |
| ====== Previous years ====== | ====== Previous years ====== |
| | * [[dm_ds2024-25]] |
| | * [[dm_ds2023-24]] |
| | * [[dm.2022-23ds]] |
| | * [[dm.2021-22ds]] |
| | * [[dm.2020-21]] |
| | * [[dm.2019-20]] |
| | * [[dm.2018-19]] |
| * [[dm.2017-18]] | * [[dm.2017-18]] |
| * [[dm.2016-17]] | * [[dm.2016-17]] |
| * [[dm.2012-13]] | * [[dm.2012-13]] |
| * [[dm.2011-12]] | * [[dm.2011-12]] |
| * [[dm.2010-11]] | |
| * [[dm.2009-10]] | |
| * [[dm.2008-09]] | |
| * [[dm.2007-08]] | |
| * [[dm.2006-07]] | |
| * [[PhDWorkshop2011]] | |
| * [[SNA.Ingegneria2011]] | |
| * [[SNA.IMT.2011]] | |
| * [[MAINS.SANTANNA.2011-12]] | |
| * [[MAINS.SANTANNA.DM4CRM.2012]] | |
| * [[MAINS.SANTANNA.DM4CRM.2016]] | |
| * [[MAINS.SANTANNA.DM4CRM.2017 | Data Mining for Customer Relationship Management 2017]] | |
| * [[MAINS.SANTANNA.DM4CRM.2018]] | |
| * [[MAINS.SANTANNA.DM4CRM.2019]] | |
| * [[SDM2018 | Instructions for camera ready and copyright transfer]] | |
| * [[DM-SAM | Storie dell'Altro Mondo]] | |