dm:start:guidelines
Differences
These are the differences between the selected revision and the current version of the page.
dm:start:guidelines [15/11/2015 at 10:26 (9 years ago)] – [Guidelines for the homework on data understanding] Anna Monreale | dm:start:guidelines [23/11/2020 at 10:34 (4 years ago)] (current version) – [Guidelines for the task on Classification] Riccardo Guidotti | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Guidelines for the task on data understanding ====== | + | ====== Guidelines for the task on Data Understanding ====== |
- | * ** Data semantics (4 points)** | + | |
- | * ** Distribution of the variables and statistics (7 points)** | + | |
- | * ** Assessing data quality (missing values | + | - Distribution of the variables and statistics (7 points) |
- | * ** Pairwise correlations | + | - Assessing data quality (missing values, outliers) (7 points) |
- | * ** Presentation | + | - Variable transformations |
+ | - Pairwise correlations | ||
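The data quality and pairwise correlation items above can be illustrated with a minimal sketch. It uses only the Python standard library; the toy column `ages` and the helper names `missing_count`, `iqr_outliers`, and `pearson` are hypothetical, and the 1.5 × IQR rule is just one common outlier convention, not a requirement of the course.

```python
from statistics import mean, quantiles

# Hypothetical toy column with one missing value (None) and one extreme value.
ages = [23, 25, None, 27, 24, 26, 95]


def missing_count(values):
    """Number of missing (None) entries in a column."""
    return sum(v is None for v in values)


def iqr_outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], ignoring missing entries."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in present if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


print(missing_count(ages))
print(iqr_outliers(ages))
print(pearson([1, 2, 3], [2, 4, 6]))
```

On a real dataset the same checks would typically be run per attribute, with the outlier rule chosen to match the distribution of each variable.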
====== Guidelines for the task on clustering ====== | ====== Guidelines for the task on clustering ====== | ||
- | * **Clustering Analysis by K-means: (15 points)** | + | * Clustering Analysis by K-means: (13 points) |
- | * Identification of the best value of k | + | - Choice of attributes and distance function (1 point) |
- | * Characterization of the obtained clusters by using both analysis of the k centroids and comparison of the distribution of variables within the clusters and that in the whole dataset | + | - Identification of the best value of k (5 points) |
+ | | ||
+ | * Analysis by density-based clustering (9 points) | ||
+ | - Choice of attributes and distance function (2 points) | ||
+ | - Study of the clustering parameters (2 points) | ||
+ | - Characterization and interpretation of the obtained clusters (5 points) | ||
+ | * Analysis by hierarchical clustering (5 points) | ||
+ | - Choice of attributes and distance function (2 points) | ||
+ | - Show and discuss different dendrograms using different algorithms (3 points) | ||
+ | * Final evaluation of the best clustering approach and comparison of the clusterings obtained (3 points) | ||
+ | |||
+ | |||
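For the "Identification of the best value of k" item above, one common approach is to compare the sum of squared errors (SSE) across several values of k and look for the elbow. The following is a minimal sketch in pure Python, not the course's reference solution: the toy `points`, the helper names, and the deterministic farthest-first initialisation are all assumptions made to keep the example reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic initialisation: start at the first point, then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_sse(points, k, iterations=100):
    """Run Lloyd's algorithm and return the final SSE."""
    centroids = farthest_first(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Hypothetical toy data: three well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

sse_by_k = {k: kmeans_sse(points, k) for k in (1, 2, 3, 4)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

On this toy data the SSE drops sharply up to k = 3 and only marginally afterwards, which is the pattern the elbow heuristic looks for. The silhouette score is a frequently used alternative criterion.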
+ | ====== Guidelines for the task on Association Rules Mining ====== | ||
+ | * Frequent patterns extraction with different values of support and different types (i.e. frequent, closed, maximal) (6 points) | ||
+ | * Discussion of the most interesting frequent patterns and analysis of how the number of patterns changes w.r.t. the min_sup parameter (7 points) | ||
+ | * Association rules extraction with different values of confidence (6 points) | ||
+ | * Discussion of the most interesting rules and analysis of how the number of rules changes w.r.t. the min_conf parameter, histogram of rules' confidence and lift (7 points) | ||
+ | * Use the most meaningful rules to replace missing values and evaluate the accuracy (2 points) | ||
+ | * Use the most meaningful rules to predict the target variable and evaluate the accuracy (2 points) | ||
+ | |||
+ | |||
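The measures named in the items above (support, confidence, lift, and the effect of min_sup on the number of patterns) can be sketched on a tiny example. This is a brute-force illustration in pure Python under assumed toy `transactions`; the function names are hypothetical, and a real analysis would use an Apriori or FP-Growth implementation rather than enumerating every itemset.

```python
from itertools import combinations

# Hypothetical toy transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration of all itemsets with support >= min_sup."""
    items = sorted(set().union(*transactions))
    return [
        frozenset(candidate)
        for size in range(1, len(items) + 1)
        for candidate in combinations(items, size)
        if support(candidate, transactions) >= min_sup
    ]


# How the number of frequent patterns changes with min_sup.
for min_sup in (0.25, 0.5, 0.75):
    print(min_sup, len(frequent_itemsets(transactions, min_sup)))

print(confidence({"bread"}, {"milk"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))
```

A lift below 1, as for the rule bread → milk here, means that the antecedent makes the consequent less likely than its baseline frequency, so a rule can have reasonable confidence and still be uninteresting.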
+ | ====== Guidelines for the task on Classification ====== | ||
+ | * Learning of different decision trees/ | ||
+ | * Decision trees interpretation, | ||
+ | * Training of different KNN classifiers with different parameters with the objective of maximizing performance (6 points) | ||
+ | * Discussion of the best prediction model (6 points) | ||
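The KNN item above asks for a comparison across parameter settings. A minimal sketch of that idea follows, assuming a hypothetical toy `data` set, squared Euclidean distance, majority voting, and leave-one-out accuracy as the evaluation measure; none of these choices is prescribed by the guidelines.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_predict(train, query, k):
    """Majority label among the k training points nearest to the query."""
    neighbours = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)


# Hypothetical toy data: two well-separated classes of four points each.
data = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"), ((6, 6), "b"),
]

for k in (1, 3, 7):
    print(k, loo_accuracy(data, k))
```

With k = 7 every held-out point is outvoted by the four points of the other class, so accuracy collapses. This shows why the number of neighbours should be tuned on validation data rather than fixed in advance.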
- | * **Analysis | + | ====== Guidelines for the Project ====== |
- | | + | * The title page does not count toward the 20-page limit, i.e., you can have 20 pages + 1 title page. The page limit is strict: additional pages (21, 22, 23, etc.) will not be read or evaluated. |
- | | + | * The project size must not exceed 25 MB, i.e., you must be able to send it by email without compression. |
+ | * Only PDF files are allowed; you do not have to submit Python code or KNIME workflows. | ||
+ | * The final paper must be easily readable, i.e., it is better to use a font size larger than 9 pt. | ||
+ | * Use a readable font type and size, e.g. Arial, Times New Roman | ||
+ | * You can use multiple columns and change the margin size, but the project must be readable. | ||
+ | * It is NOT required to include Python code, KNIME workflows, or theoretical descriptions | ||
+ | | ||
+ | * You can get up to 3 extra points in the final mark according to the following criteria: | ||
+ | - Innovation (0.5 points) | ||
+ | - Experimentation (0.5 points) | ||
+ | - Performance (0.5 points) | ||
+ | - Appearance (0.5 points) | ||
+ | - Organization (0.5 points) | ||
+ | - Summary (0.5 points) | ||
- | * **Analysis by hierarchical clustering (5 points)** | ||
- | * Analysis to be performed on a sampling of the data for scalability reasons (if necessary) |
dm/start/guidelines.txt · Last modified: 23/11/2020 at 10:34 (4 years ago) by Riccardo Guidotti