====== Statistics for Data Science (628PP) A.Y. 2024/25 ======

=====Instructors=====

  * **Francesco Giannini**
    * Università di Pisa
    * [[https://www.francescogiannini.eu/]]
    * [[francesco.giannini@unipi.it]]
    * **Office hours:** TBD or by appointment, at the Department of Computer Science, room 385/DO, or via Teams.
  * **Salvatore Ruggieri**
    * Università di Pisa
    * [[http://pages.di.unipi.it/ruggieri/]]
    * [[salvatore.ruggieri@unipi.it]]
    * **Office hours:** Tuesdays, 16:00 - 18:00, or by appointment, at the Department of Computer Science, room 321/DO, or via Teams.

=====Hours and rooms=====

^ Day of Week ^ Hour ^ Room ^
| Tuesday | 14:00 - 16:00 | Fib-C |
| Wednesday | 9:00 - 11:00 | Fib-C1 |
| Thursday | 9:00 - 11:00 | Fib-C1 |

=====Pre-requisites=====

Students should be comfortable with most of the calculus topics covered in:

  * **[P]** J. Ward, J. Abdey. **Mathematics and Statistics**. University of London, 2013. __Chapters 1-8 of Part 1__.

Extra lessons to refresh these notions may be scheduled in the first part of the course.

=====Mandatory Teaching Material=====

The following are //mandatory textbooks//:

  * **[T]** F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester. **A Modern Introduction to Probability and Statistics**. Springer, 2005.
  * **[R]** P. Dalgaard. **Introductory Statistics with R**. 2nd edition, Springer, 2008.
  * selected chapters of other books for advanced topics

=====Software=====

  * [[https://cran.r-project.org/|R]]
  * [[https://posit.co/download/rstudio-desktop/|RStudio]]

=====Preliminary program and calendar=====

  * [[https://unipi.coursecatalogue.cineca.it/corsi/2025/11357/insegnamenti/2025/53385_703139_77399/2025/53385_9386?schemaid=9386|Preliminary program]].
  * [[https://didattica.di.unipi.it/en/master-programme-in-data-science-and-business-informatics/academic-calendar-2025-2026/|Calendar of lessons]].

=====Exams=====

__//There are no mid-terms//.__

The exam consists of a written part and an oral part.
The written part consists of exercises and questions on the topics of the course. Each question is assigned a number of points; the points sum to 30. Example written texts: **{{ :mds:sds:sds_sample1.pdf | sample1}}**, **{{ :mds:sds:sds_sample2.pdf | sample2}}**. Students are admitted to the oral part if they score at least 18 points.

The oral part consists of a critical discussion of the written part, plus open questions and problem solving on the topics of the course (both theory and R programming). In particular, students must demonstrate that they can summarize both the theory and the software related to any of the lessons, using the slides and R scripts of the lessons.

Registration for exams is mandatory (**beware of the registration deadline!**): [[https://esami.unipi.it/esami2/|register here]].

**The dates below are only for the written test (normal exam). Dates for project discussion are included in the project description**.\\
^ Date ^ Hour ^ Room ^ Notes ^
| 29/1/2026 | 11:00 - 13:00 | FIB-C1 | |

=====Student project=====

  * The project replaces the written part of the examination.
  * The project description, rules, and Q&A will be published here in April.

=====Class calendar=====

A [[https://teams.microsoft.com/l/team/19%3ADJ6an3j-cIPJMvtGKes7oJZEKToOyr48E2290_chI5k1%40thread.tacv2/conversations?groupId=62728cdb-6f88-4134-98ac-3e0282384b1c&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams channel]] is used to post news, notes, Q&A, and other material related to the course.

The lectures will be held in person only and will **NOT** be live-streamed. Recordings from previous years are available for non-attending students (see the past years section below); however, these materials may not fully correspond to the content taught in the current academic year.

The calendar and material for future lessons refer to the previous academic year. The schedule, slides, and R scripts may be updated **after the classes** to align with the actual content of the current year and to correct typos.
//Be sure to download the updated versions.//

^ # ^ Date ^ Room ^ Topic ^ Mandatory teaching material ^
|01| 17/02 14-16 | Fib-C | Introduction. Probability and independence. | **[T]** Chpts. 1-3 {{:mds:sds:s4ds01.pdf|slides01 (.pdf)}} |
|02| 18/02 9-11 | Fib-C1 | R basics. | **[R]** Chpts. 1,2.1-2.3 {{:mds:sds:s4ds02.pdf|slides02 (.pdf)}}, {{:mds:sds:s4ds02.r|script02 (.R)}} |
|03| 20/02 9-11 | Fib-C1 | Bayes' rule and applications. | **[T]** Chpt. 3 {{:mds:sds:s4ds03.pdf|slides03 (.pdf)}}, {{:mds:sds:s4ds03.r|script03 (.R)}} |
|04| 24/02 14-16 | Fib-C | Discrete random variables. | **[T]** Chpts. 4, 9.1, 9.2, 9.4 **[R]** Chpt. 3 {{:mds:sds:s4ds04.pdf|slides04 (.pdf)}}, {{:mds:sds:s4ds04.r|script04 (.R)}} |
|05| 25/02 9-11 | Fib-C1 | Discrete random variables (continued). | |
|06| 27/02 9-11 | Fib-C1 | Refresher: derivatives and integrals. | **[P]** Chpts. 1-8 {{:mds:sds:s4ds06.pdf|slides06 (.pdf)}}, {{:mds:sds:s4ds06.r|script06 (.R)}} |
|07| 03/03 14-16 | Fib-C | R data access and programming. | **[R]** Chpts. 2.3,2.4 {{:mds:sds:s4ds07.zip|script07 (.zip)}} |
|08| 04/03 9-11 | Fib-C1 | Continuous random variables. | **[T]** Chpts. 5, 9.2-9.4 **[R]** Chpt. 3 {{:mds:sds:s4ds08.pdf|slides08 (.pdf)}}, {{:mds:sds:s4ds08.r|script08 (.R)}} |
|09| 06/03 9-11 | Fib-C1 | Expectation and variance. Computations with random variables. | **[T]** Chpts. 7,8 {{:mds:sds:s4ds09.pdf|slides09 (.pdf)}}, {{:mds:sds:s4ds09.r|script09 (.R)}} |
|10| 10/03 14-16 | Fib-C | Expectation and variance. Computations with random variables (continued). Moments. Functions of random variables. | **[T]** Chpts. 9-11 {{:mds:sds:s4ds10.pdf|slides10 (.pdf)}}, {{:mds:sds:s4ds10.zip|script10 (.zip)}} |
|11| 11/03 9-11 | Fib-C1 | Functions of random variables (continued). Distances between distributions. | {{:mds:sds:murphychpt6.pdf|Murphy's book}} Chpt. 6 {{:mds:sds:s4ds11.pdf|slides11 (.pdf)}}, {{:mds:sds:s4ds11.R|script11 (.R)}} |
|12| 13/03 9-11 | Fib-C1 | Simulation. | **[T]** Chpts. 6.1-6.2 {{:mds:sds:s4ds12.pdf|slides12 (.pdf)}}, {{:mds:sds:s4ds12.r|script12 (.R)}}, {{:mds:sds:s4ds12_sol07.r|script12_sol07 (.R)}} |
|13| 17/03 14-16 | Fib-C | Power laws and Zipf's law. | [[https://arxiv.org/pdf/cond-mat/0412004.pdf | Newman's paper]] Sect. I, II, III(A,B,E,F) {{:mds:sds:s4ds13.pdf|slides13 (.pdf)}}, {{:mds:sds:s4ds13.r|script13 (.R)}} |
|14| 18/03 9-11 | Fib-C1 | Law of large numbers. The central limit theorem. | **[T]** Chpts. 13-14 {{:mds:sds:s4ds14.pdf|slides14 (.pdf)}}, {{:mds:sds:s4ds14.R|script14 (.R)}} |
|15| 20/03 9-11 | Fib-C1 | Graphical summaries. Kernel Density Estimation. | **[T]** Chpt. 15, **[R]** Chpt. 4 {{:mds:sds:s4ds15.pdf|slides15 (.pdf)}}, {{:mds:sds:s4ds15.r|script15 (.R)}} |
|16| 24/03 14-16 | Fib-C | Numerical summaries. | **[T]** Chpt. 16, **[R]** Chpt. 4 {{:mds:sds:s4ds16.pdf|slides16 (.pdf)}}, {{:mds:sds:s4ds16.r|script16 (.R)}} |
|17| 25/03 9-11 | Fib-C1 | Data preprocessing in R. Estimators. | **[R]** Chpt. 10, **[T]** Chpts. 17.1-17.3 {{:mds:sds:s4ds17.r|script17 (.R)}}, {{ :mds:sds:dataprep.r | dataprep.R}} |
|18| 27/03 9-11 | Fib-C1 | Unbiased estimators. Efficiency and MSE. | **[T]** Chpts. 19, 20 {{:mds:sds:s4ds18.pdf|slides18 (.pdf)}}, {{:mds:sds:s4ds18.r|script18 (.R)}} |
|19| 31/03 14-16 | Fib-C | Maximum likelihood estimation. | **[T]** Chpt. 21 {{ :mds:sds:s4dsln.pdf |}} Chpt. 1 {{:mds:sds:s4ds19.pdf|slides19 (.pdf)}}, {{:mds:sds:s4ds19.r|script19 (.R)}} |
|20| 01/04 9-11 | Fib-C1 | Linear regression. Least squares estimation. | **[T]** Chpts. 17.4,22 **[R]** Chpt. 6 {{ :mds:sds:s4dsln.pdf |}} Chpt. 2 {{:mds:sds:s4ds20.pdf|slides20 (.pdf)}}, {{:mds:sds:s4ds20.r|script20 (.R)}} |
|21| 08/04 9-11 | Fib-C1 | Non-linear and multiple linear regression. | **[R]** Chpts. 12.1,13,16.1-16.2 {{ :mds:sds:s4dsln.pdf |}} Chpt. 2 {{:mds:sds:s4ds21.pdf|slides21 (.pdf)}}, {{:mds:sds:s4ds21.R|script21 (.R)}} |
|22| 10/04 9-11 | Fib-C1 | Issues with linear regression. Logistic regression. | **[R]** Chpts. 12.1,13,16.1-16.2 {{:mds:sds:s4ds22.pdf|slides22 (.pdf)}}, {{:mds:sds:s4ds22.zip|script22 (.zip)}} |
|23| 14/04 14-16 | Fib-C | Statistical decision theory. | {{ :mds:sds:s4dsln.pdf |}} Chpt. 4 {{:mds:sds:s4ds23.pdf|slides23 (.pdf)}}, {{:mds:sds:s4ds23.r|script23 (.R)}} |
|24| 15/04 9-11 | Fib-C1 | Statistical decision theory (continued). | |
|25| 17/04 9-11 | Fib-C1 | Statistical decision theory (continued). Project presentation. | |
|26| 21/04 14-16 | Fib-C | Confidence intervals: mean, proportion, linear regression. | **[T]** Chpts. 23.1,23.2,23.4,24.3,24.4 {{ :mds:sds:s4dsln.pdf |}} Chpt. 3 {{:mds:sds:s4ds26.pdf|slides26 (.pdf)}}, {{:mds:sds:s4ds26.r|script26 (.R)}} |
|27| 22/04 9-11 | Fib-C1 | Confidence intervals (continued). Bootstrap and resampling methods. | **[T]** Chpts. 18.1-18.3,23.3 {{:mds:sds:s4ds27.pdf|slides27 (.pdf)}}, {{:mds:sds:s4ds27.r|script27 (.R)}} |
|28| 24/04 9-11 | Fib-C1 | Bootstrap and resampling methods (continued). | |
|29| 28/04 14-16 | Fib-C | Hypothesis testing. One-sample tests of the mean and application to linear regression. | **[T]** Chpts. 25,26,27, **[R]** Chpts. 5.1,5.2 {{ :mds:sds:s4dsln.pdf |}} Chpt. 3.3 {{:mds:sds:s4ds29.pdf|slides29 (.pdf)}}, {{:mds:sds:s4ds29.r|script29 (.R)}} |
|30| 29/04 9-11 | Fib-C1 | One-sample tests of the mean and application to linear regression (continued). Classifier performance metrics in R. | {{:mds:sds:s4ds30.pdf|slides30 (.pdf)}}, {{:mds:sds:s4ds30.r|script30 (.R)}} |
|31| 05/05 14-16 | Fib-C | Two-sample tests of the mean and applications to classifier comparison. | **[T]** Chpt. 28, **[R]** Chpts. 5.3-5.7 {{:mds:sds:s4ds31.pdf|slides31 (.pdf)}}, {{:mds:sds:s4ds31.r|script31 (.R)}} |
|32| 06/05 9-11 | Fib-C1 | Multiple-sample tests of the mean and applications to classifier comparison. | **[R]** Chpt. 7 {{:mds:sds:s4ds32.pdf|slides32 (.pdf)}}, {{:mds:sds:s4ds32.r|script32 (.R)}} |
|33| 08/05 9-11 | Fib-C1 | Fitting distributions. Testing independence/association. | **[R]** Chpt. 8 {{ :mds:smd:ks.pdf | K-S}}, {{:mds:sds:s4ds33.pdf|slides33 (.pdf)}}, {{:mds:sds:s4ds33.r|script33 (.R)}} |
|34| 12/05 14-16 | Fib-C | Fitting distributions. Testing independence/association (continued). | |
|35| 13/05 9-11 | Fib-C1 | //Mandatory seminar:// TBD | |
|36| 15/05 9-11 | Fib-C1 | Project Q&A. | |

Lessons not held (due to strikes or other unforeseen circumstances) will be rescheduled on the following dates: 19, 20, 22, 26, 27, and 29 May.

=====Past years=====

  * [[mds:sds:2024|Statistics for Data Science A.Y. 2024/25]]
  * [[mds:sds:2023|Statistics for Data Science A.Y. 2023/24]]