====== Guidelines for the homework on clustering ======

  * **Data Understanding: a preliminary step to capture data properties that can help the clustering analysis (8 points)**
    * Distribution analysis and suitable transformation of variables
    * Elimination of redundant variables by correlation analysis
  * **Clustering Analysis by K-means (15 points)**
    * Identification of the best value of k
    * Characterization of the obtained clusters, using both an analysis of the k centroids and a comparison of the distribution of variables within each cluster against that in the whole dataset
  * **Analysis by density-based clustering (7 points)**
    * Study of the clustering parameters
    * Characterization and interpretation of the obtained clusters
  * **Analysis by hierarchical clustering (Optional - 3 points)**
    * Analysis to be performed on a sample of the data for scalability reasons

====== Description of the variables ======

For each car driver we observe the following quantities, measured over a certain time window of mobile activity:

  * Length = total traveled distance (m)
  * Duration = total time spent driving (s)
  * Count = number of different trips
  * Phighway = distance traveled on highways (m)
  * Pcity = distance traveled inside cities (m)
  * Length_arc_crowded = distance traveled on the 20% most crowded roads (m)
  * Pnight = distance traveled at night time (m)
  * Pover = distance traveled over the speed limit (m)
  * Profile = number of systematic trips, e.g., work-home
  * Radius_g = radius of gyration: spread of the driver's locations around their center of mass (mean position)
  * Radius_g_L1 = radius of gyration w.r.t. L1: spread of the driver's locations around the driver's most frequent location (e.g., home)
  * Avg_Dist_L1 = average distance from L1: average distance from the driver's most frequent location (e.g., home)
  * TimeL1L2 = % of time spent at locations L1 and L2 (most and second most preferred locations)
  * EntropyArc = entropy of road segment frequencies; measures the diversity of roads traveled
  * EntropyLocation = entropy of location frequencies; measures the diversity of places visited
  * EntropyTime = entropy over hours of the day; measures the diversity of daily patterns

Note that there are no missing values in the dataset, hence "0"s are actual "0"s, NOT missing values.
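
To make the radius-of-gyration and entropy variables more concrete, the sketch below shows one common way such quantities can be computed. It is an illustration only, not necessarily how the dataset was built: the function names are hypothetical, locations are assumed to be planar (x, y) coordinates in metres, the radius of gyration is taken as the root-mean-square distance from the chosen center, and the entropy is an unnormalized Shannon entropy in bits.

```python
import math
from collections import Counter


def radius_of_gyration(points, center=None):
    """Root-mean-square distance of the points from a center.

    points: list of (x, y) tuples, assumed planar coordinates in metres.
    center: reference point; defaults to the center of mass (as in Radius_g).
            Passing the most frequent location gives a Radius_g_L1-style value.
    """
    if center is None:
        n = len(points)
        center = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    cx, cy = center
    squared = [(x - cx) ** 2 + (y - cy) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(points))


def shannon_entropy(observations):
    """Shannon entropy (base 2) of the frequencies of the observed items.

    observations: one item per visit, e.g. road segment ids, location ids
    or hours of the day (assumed encoding, for illustration).
    """
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy example: three visits, two of them at the same place (taken as L1).
visits = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
l1 = Counter(visits).most_common(1)[0][0]
r_g = radius_of_gyration(visits)                # w.r.t. the center of mass
r_g_l1 = radius_of_gyration(visits, center=l1)  # w.r.t. the most frequent location
h_loc = shannon_entropy(visits)                 # diversity of places visited
```

A driver who always visits the same place has zero entropy and zero radius of gyration, while a driver whose visits are spread evenly over many distant places scores high on both. This is why these variables can help separate systematic from exploratory mobility profiles in the clustering analysis.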