====== ICT for BI & CRM - Part III: Data Mining ======

  * **Dino Pedreschi** Università di Pisa, Knowledge Discovery and Data Mining Lab [[pedre@di.unipi.it]]

===== News =====

  * The data mining software Weka can be downloaded from [[http://www.cs.waikato.ac.nz/ml/weka/|here]].

====== Goals ======

Organizations and businesses are overwhelmed by the flood of data continuously collected into their data warehouses and arriving from external sources, the Web above all. Traditional exploratory techniques may fail to make sense of the data because of its inherent complexity and size. Data mining and knowledge discovery techniques emerged as an alternative approach, aimed at revealing patterns, rules and models hidden in the data, and at supporting analysts in developing descriptive and predictive models for a number of business problems, notably in the CRM domain.

====== Syllabus ======

  * Basic concepts of data mining and the knowledge discovery process.
  * Data and data sources.
  * Exploratory data analysis.
  * Fundamental data mining tasks and methods: clustering, classification and prediction, patterns and association rules.
  * Hints on descriptive and predictive analytics for CRM tasks: customer segmentation, churn analysis, promo redemption, product recommendation, market basket analysis.
  * Discussion of industrial data mining projects for CRM in retail, both traditional and online.

====== Textbooks ======

  * Slides (see Calendar).
  * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, 2006.
    * [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php]]
  * Gordon S. Linoff and Michael J. Berry. //Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management.// Wiley, 2011.

====== Reading about the "data analyst" job ======

  * Data, data everywhere. The Economist, Feb. 2010 {{:dm:economist--010.pdf|download}}
  * Data scientist: The hot new gig in tech. CNN & Fortune, Sept. 2011 [[http://tech.fortune.cnn.com/2011/09/06/data-scientist-the-hot-new-gig-in-tech/|link]]
  * Welcome to the yotta world. The Economist, Sept. 2011 {{:dm:economist-2012-dm.pdf|download}}

====== Calendar ======

^ ^ Date ^ Topic ^ Learning material ^
|1. |05.03.2013 - 11:00-13:00 | Introduction to Data Mining and the Knowledge Discovery Process | {{:dm:introductiondm.pdf|slides}} - Textbook: chap. 1 |
|2. |06.03.2013 - 09:00-13:00 | Data understanding. Introduction to Weka | {{:dm:chap2_data.pdf|slides}} - Textbook: chap. 2 (2.1, 2.2) and chap. 3 (3.1, 3.2, 3.3) |
|3. |06.03.2013 - 14:00-18:00 | Clustering analysis | {{:dm:clustering.pdf|slides}} - Textbook: chap. 8 (8.1, 8.2, 8.5) |
|4. |07.03.2013 - 09:00-13:00 and 14:00-18:00 | Classification and predictive analysis | {{:dm:dm.classification.pdf|slides}} - Textbook: chap. 4 (4.1, 4.2, 4.3, 4.4, 4.5) |

===== Exercise =====

  * **Breast Cancer Wisconsin (Diagnostic) Data Set. Assigned on: 07.03.2013. To be completed by: 22.03.2013. Send your report (3 pages max of text, figures excluded) by email to [[pedre@di.unipi.it]], cc: Fosca Giannotti [[fosca.giannotti@gmail.com]]. Use "[DM-MAINS] " in the subject. Group work is allowed, max 3 people per group; interdisciplinary competence is required in each group!**
  * **Instructions:** Download the [[http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29|Wisconsin Diagnostic Breast Cancer (WDBC) dataset]] from the UCI archive. The dataset contains 569 observations on samples of breast tissue, together with their classification as benign or malignant, as performed by histologists. You are expected to perform the following tasks:
    - Data understanding and exploratory analysis.
    - Clustering analysis (disregarding the class information), including a description of the best clusters discovered.
    - Classification analysis using decision trees, for the task of diagnosing a sample as benign or malignant.
  * Describe the process adopted to select the proposed clustering and decision tree, together with an evaluation of their quality.

====== Exams ======

The exam for the Data Mining module consists of an evaluation of the report on the assigned exercises. For students of the two-year LM-MAINS degree, the exam consists of the evaluation of the exercise report plus an individual oral exam devoted to discussing aspects that emerge from the exercises. The evaluation of the report is the same for all members of a group (max 3 students per group). The date of the first oral exam session for LM-MAINS students will be set by appointment.

====== 2012 Edition ======

[[Edizione2012|ICT for BI & CRM - Part III: Data Mining 2012]]