====== Data Mining (309AA) - 9 CFU A.Y. 2020/2021 ======

**Instructor:**
  * **Anna Monreale**
    * KDDLab, Università di Pisa
    * [[anna.monreale@unipi.it]]

**Teaching Assistant:**
  * **Francesca Naretto**
    * KDDLab, SNS, Pisa
    * [[francesca.naretto@sns.it]]

====== News ======

  * [01.10.2020] ** The lecture on 9.10.2020 is canceled. **
  * [09.09.2020] The course will be held online. Please use this link to join the class: https://teams.microsoft.com/l/team/19%3a8f6779bab74f4368ba7ce1c2b092346d%40thread.tacv2/conversations?groupId=8da15095-b6e5-41c1-a894-d418aed3983e&tenantId=c7456b31-a220-47f5-be52-473828670aa1

====== Learning Goals ======

  * Fundamental concepts of data knowledge and discovery
  * Data understanding
  * Data preparation
  * Clustering
  * Classification & Regression
  * Pattern Mining and Association Rules
  * Outlier Detection
  * Time Series Analysis
  * Sequential Pattern Mining
  * Ethical Issues

====== Hours and Rooms ======

**Classes**

^ Day of Week ^ Hour ^ Room ^
| Wednesday | 09:00 - 10:45 | Online |
| Thursday | 09:00 - 10:45 | Online |
| Friday | 11:00 - 12:45 | Online |

**Office hours:**
  * Anna Monreale: Wednesday, 11:00-13:00, online using Teams (appointment by email)
  * Francesca Naretto: Monday, 15:00-18:00, online using Teams (appointment by email)

====== Learning Material ======

===== Textbook =====

  * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, 2006
    * [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php]]
    * Chapters 4, 6 and 8 are also available at the publisher's Web site.
  * Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. **Guide to Intelligent Data Analysis**. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
  * Laura Igual et al. **Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications**. 1st Edition, 2017.
  * Jake VanderPlas. 
**[[http://shop.oreilly.com/product/0636920034919.do| Python Data Science Handbook: Essential Tools for Working with Data.]]** 1st Edition.

===== Slides =====

  * The slides used in the course will be added to the calendar after each class. Most of them come from the slides provided by the textbook's authors: [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4|Slides for "Introduction to Data Mining"]].

===== Software =====

  * Python - Anaconda (**version 3.7**): Anaconda is the leading open data science platform powered by Python. [[https://www.anaconda.com/distribution/| Download page]] (the following libraries are already included)
    * Scikit-learn: Python library with tools for data mining and data analysis. [[http://scikit-learn.org/stable/ | Documentation page]]
    * Pandas: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. [[http://pandas.pydata.org/ | Documentation page]]

====== Class Calendar (2020/2021) ======

===== First Semester =====

^ ^ Day ^ Topic ^ Learning material ^ References ^
|1.| 16.09 09:00-10:45 | Overview. Introduction to KDD | {{ :magistraleinformatica:dmi:1-overview.pdf |}} {{ :magistraleinformatica:dmi:1-intro-dm.pdf |}} | Chap. 1 Kumar Book |
|2.| 17.09 09:00-10:45 | Data Understanding | {{ :magistraleinformatica:dmi:2-data_understanding.pdf | Slides DU}} | Chap. 2 Kumar Book and additional resource of the Kumar Book: [[https://www-users.cs.umn.edu/~kumar001/dmbook/data_exploration_1st_edition.pdf|Exploring Data]]. If you have the first edition of the Kumar Book, this is Chap. 3 |
|3.| 18.09 09:00-10:45 | Data Preparation | {{ :magistraleinformatica:dmi:3-data_preparation.pdf |}} | Chap. 2 Kumar Book |
|4.| 23.09 09:00-10:45 | Data Preparation: Transformations & PCA | {{ :magistraleinformatica:dmi:3-data_preparation.pdf |}} | Chap. 2 Kumar Book, Appendix B Dimensionality Reduction (only PCA) |
|5.| 24.09 09:00-10:45 | Data Similarities. 
Introduction to Clustering. | {{ :magistraleinformatica:dmi:4-data_similarity.pdf |}} {{ :magistraleinformatica:dmi:5-basic_cluster_analysis-intro.pdf |}} | Data Similarity is in Chap. 2, while Clustering is in Chap. 7 |
|6.| 25.09 11:00-12:45 | Lab: Data Understanding in Python | {{ :magistraleinformatica:dmi:python_basics.ipynb.zip | Very basic notions on Python}} {{ :magistraleinformatica:dmi:tips_data_understanding.ipynb.zip |Notebook on Data Understanding}} {{ :magistraleinformatica:dmi:tipsdata.zip |}} | |
|7.| 30.09 09:00-10:45 | Center-based clustering: K-means | {{ :magistraleinformatica:dmi:6-basic_cluster_analysis-kmeans-variants.pdf |}} | Chap. 7 Kumar Book |
|8.| 01.10 09:00-10:45 | Center-based clustering: Bisecting K-means, X-means, EM | Same slides as the previous lecture | Chap. 7 Kumar Book, {{ :magistraleinformatica:dmi:clusteringmixturemodels.pdf | Clustering & Mixture Models}} {{ :magistraleinformatica:dmi:xmeans.pdf |}} |
|9.| 02.10 11:00-12:45 | Hierarchical clustering | {{ :magistraleinformatica:dmi:7.basic_cluster_analysis-hierarchical.pdf |}} {{ :magistraleinformatica:dmi:ex._hierarchical-clustering.pdf |}} | Chap. 7 Kumar Book |
|10.| 07.10 09:00-10:45 | Density-based clustering | {{ :magistraleinformatica:dmi:8.basic_cluster_analysis-dbscan-validity.pdf |}} | Chap. 7 Kumar Book |
|11.| 08.10 09:00-10:45 | Lab: Clustering + Project Assignment | {{ :magistraleinformatica:dmi:py-clustering.zip |}} | |
| | 09.10 11:00-12:45 | Lecture canceled | | |
|12.| 14.10 09:00-10:45 | Classification Problem + Decision Trees | {{ :magistraleinformatica:dmi:9.chap3_basic_classification-2020.pdf |}} | Chap. 3 Kumar Book |
|13.| 15.10 09:00-10:45 | Only 30 minutes of discussion on the project, due to connection problems | | Chap. 3 Kumar Book |
|14.| 16.10 11:00-12:45 | Decision Tree + Classifier Evaluation | | Chap. 
3 Kumar Book |
|15.| 21.10 09:00-10:45 | Evaluation Methods for Classification Models | {{ :magistraleinformatica:dmi:9.chap3_basic_classification-2020.pdf |}} | Chap. 3 Kumar Book + Chap. 4 Kumar Book |
|16.| 22.10 09:00-10:45 | Statistical tools for model evaluation + Rule-based classification | {{ :magistraleinformatica:dmi:10-rule-based-clussifiers.pdf |}} | Chap. 3 Kumar Book + Chap. 4 Kumar Book |
|17.| 23.10 11:00-12:45 | Rule-based classification + Instance-based classification | {{ :magistraleinformatica:dmi:11-knn.pptx |}} | Chap. 4 Kumar Book |
|18.| 28.10 09:00-10:45 | Naive Bayesian Classifier + Ensemble Classifiers | {{ :magistraleinformatica:dmi:12-naive_bayes.pdf |}} {{ :magistraleinformatica:dmi:13_ensemble_2020.pdf |}} | Chap. 4 Kumar Book |
|19.| 29.10 09:00-10:45 | SVM & NN | {{ :magistraleinformatica:dmi:14_svm_2020.pdf |}} {{ :magistraleinformatica:dmi:15_neural_networks_2020.pdf |}} | Chap. 4 Kumar Book |
|20.| 30.10 11:00-12:45 | MLNN & Lab on Classification | {{ :magistraleinformatica:dmi:classification.zip |Python notebook for classification}} | Chap. 
4 Kumar Book |
|21.| 04.11 09:00-10:45 | Regression & Association Rule Mining | {{ :magistraleinformatica:dmi:16_linear_regression.pdf |}} {{ :magistraleinformatica:dmi:17_association_analysis.pdf |}} | Regression: Appendix D in Kumar Book. Association Rules: Chap. 5 Kumar Book |
|22.| 05.11 09:00-10:45 | Association Rule Mining | | Association Rules: Chap. 5 Kumar Book |
|23.| 06.11 11:00-12:45 | Sequential Pattern Mining | {{ :magistraleinformatica:dmi:18_sequential_patterns_2020.pdf |}} | Chap. 6 Kumar Book |
|24.| 11.11 09:00-10:45 | Ethics in AI & Privacy | {{ :magistraleinformatica:dmi:19_ethics_privacy.pdf |}} | [[https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai|Report on Trustworthy AI]] |
|25.| 12.11 09:00-10:45 | Ethics in AI & Privacy | | {{ :dm:allegato1_chapter.pdf | Overview of Privacy}} {{ :magistraleinformatica:dmi:allegato11-cpdp13.pdf |}} {{ :dm:capprivacy.pdf | Privacy by design}} |
|26.| 13.11 11:00-12:45 | Ethics in AI & Privacy, Explainability | {{ :magistraleinformatica:dmi:20_explainability_2020.pdf |}} | |
|27.| 18.11 09:00-10:45 | Explainability | {{ :magistraleinformatica:dmi:20_explainability_2020.pdf |}} | Material: [[https://arxiv.org/pdf/1805.10820.pdf|LORE]] [[https://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf| LIME]] [[http://delivery.acm.org/10.1145/3240000/3236009/a93-guidotti.pdf?ip=94.38.73.6&id=3236009&acc=OA&key=4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35%2ED544636226B69D47&__acm__=1576196869_06b3353aae4fe3bd8ea30d9c9c5356eb|Survey]] {{ :magistraleinformatica:dmi:pkdd_2019_abele_cr.pdf |ABELE}} |
|28.| 19.11 09:00-10:45 | Anomaly Detection | {{ :magistraleinformatica:dmi:21_anomaly_detection_2020.pdf |}} | Chap. 9 of Kumar Book |
|29.| 20.11 11:00-12:45 | Anomaly Detection | {{ :magistraleinformatica:dmi:anomalydetection.ipynb.zip |}} | Chap. 
9 of Kumar Book |
|30.| 25.11 09:00-10:45 | Time Series Similarity | {{ :magistraleinformatica:dmi:22_time_series_similarity.pdf |}} | [[https://cs.gmu.edu/~jessica/BookChapterTSMining.pdf|Overview on DM for time series]], [[https://pdfs.semanticscholar.org/18f3/55d7ef4aa9f82bf5c00f84e46714efa5fd77.pdf|DTW paper by Sakoe and Chiba, 1978]] |
|31.| 26.11 09:00-10:45 | Time Series Clustering | {{ :magistraleinformatica:dmi:22_time_series_similarity.pdf |}} | |
|32.| 27.11 11:00-12:45 | Lab on Association Rules and Sequential Pattern Mining | {{ :magistraleinformatica:dmi:patterns.zip |}} | |
|33.| 02.12 09:00-10:45 | Time Series: Motif Discovery | {{ :magistraleinformatica:dmi:23_time_series_motif_shapelets.pdf |}} | {{ :magistraleinformatica:dmi:randomproj.pdf |}} {{ :magistraleinformatica:dmi:matrixprofile.pdf |}} |
|34.| 03.12 09:00-10:45 | Time Series: Shapelet Discovery + Exercises on DTW and Subsequences + Thesis Proposals | {{ :magistraleinformatica:dmi:23_time_series_motif_shapelets.pdf |}} {{ :magistraleinformatica:dmi:ex-dtw-sequences.pdf |}} {{ :magistraleinformatica:dmi:research_topics.pdf |Thesis Proposals}} | {{ :magistraleinformatica:dmi:shaplet.pdf |}} |
| | 04.12 11:00-12:45 | Lecture canceled | | |
|35.| 09.12 09:00-10:45 | Paper Presentation | | |
|36.| 10.12 09:00-10:45 | Paper Presentation | | |
|37.| 11.12 11:00-12:45 | Paper Presentation | | |

====== Exams ======

**Mid-term Project**

The project consists of data analyses carried out with data mining tools. It must be carried out by a team of 2 or 3 students, using Python. The guidelines specify the tasks to address. Results must be reported in a single paper of at most 20 pages of text, including figures. Students must deliver both the paper (single column) and well-commented Python notebooks.
  * The first part of the project consists of the **assignments** described here: {{ :magistraleinformatica:dmi:dm-projectdescriptionpart1.pdf | Project Description}}
    * **Dataset:** {{ :magistraleinformatica:dmi:customer_supermarket.csv.zip |}}
    * **Deadline**: the first part has to be delivered by November 5, 2020, postponed to ** November 12, 2020 **.
  * The second part of the project consists of the **assignment Task 3** described here: {{ :magistraleinformatica:dmi:project_description.pdf |Updated Project Description}}
    * **Deadline**: the second part has to be delivered by ** January 4, 2021 **
  * The third part of the project consists of the **assignment Task 4** described here: {{ :magistraleinformatica:dmi:dm-project_description.pdf | Final Project Description}}
    * **Deadline**: ** January 4, 2021 (strict) **

Prepare a single zip archive that also contains the material for the previously submitted tasks (even if they have already been submitted). Note that the project description file contains the detailed instructions for delivering all the tasks in the final submission.

** Project to be delivered during the exam sessions **

Students who did not deliver the above project by January 4, 2021 must ask the teacher by email for a new project.

** Paper Presentation (OPTIONAL) **

Students present a research paper (made available by the teacher) during the last week of the course. This presentation is OPTIONAL: students who give the paper presentation can skip the oral exam with open questions. They only need to present the project (see next point).

**Oral Exam**
  * **Project presentation** (with slides), 10 minutes: mandatory for all students
  * ** Open questions ** on the entire program: optional only for students who gave the paper presentation

====== Exam Dates ======

TBD

===== Exam Sessions =====

TBD

===== Reading About the "Data Scientist" Job =====

** ... 
a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the "sexiest" around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them. ** //Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.//

  * Data, data everywhere. The Economist, Feb. 2010 {{:dm:economist--010.pdf|download}}
  * Data scientist: The hot new gig in tech. CNN & Fortune, Sept. 2011 [[http://tech.fortune.cnn.com/2011/09/06/data-scientist-the-hot-new-gig-in-tech/|link]]
  * Welcome to the yotta world. The Economist, Sept. 2011 {{:dm:economist-2012-dm.pdf|download}}
  * Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept. 2012 [[http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1|link]]
  * Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept. 2012 [[http://www.ilsole24ore.com/art/tecnologie/2012-09-21/futuro-scritto-data-155044.shtml?uuid=AbOQCOhG|link]]
  * Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics {{:dm:crossroadsxrds2012fall-dl.pdf|download}}
  * Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct. 22, 2012: [[https://www.youtube.com/watch?v=mXLy3nkXQVM|YouTube video]]
  * Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. [[http://www.fusioncharts.com/whitepapers/downloads/Towards-Effective-Decision-Making-Through-Data-Visualization-Six-World-Class-Enterprises-Show-The-Way.pdf|download]]

====== Previous Years ======

[[DM-INF 2020-2021]] [[http://didawiki.cli.di.unipi.it/doku.php/dm/dm.2019-20|DM-2019/20]]