
Big Data Analytics A.A. 2016/17

Instructors:

Exam days: 23-24 January / 16-20 February (send an email to register, and do not forget to submit the report 7 days before the exam)

Learning goals

Objective: In our digital society, every human activity is mediated by information technologies. Therefore, every activity leaves digital traces behind that can be stored in some repository: phone call records, transaction records, web search logs, movement trajectories, social media texts and tweets. Every minute, an avalanche of “big data” is produced by humans, consciously or not, that represents a novel, accurate digital proxy of social activities at global scale. Big data provide an unprecedented “social microscope”, a novel opportunity to understand the complexity of our societies, and a paradigm shift for the social sciences. The objective of the course is twofold: an introduction to the emergent field of big data analytics and social mining, aimed at acquiring and analyzing big data from multiple sources in order to discover the patterns and models of human behavior that explain social phenomena; and an introduction to the technological scenario of scalable analytics.

Course structure

The course is organized into three intertwined modules:

Module1: Big Data Analytics and Social Mining. The focus is on what can be learnt from big data in different domains (mobility and transportation, urban planning, demographics, economics, social relationships, opinion and sentiment, etc.) and on the analytical and mining methods that can be used.

Module2: Scalable Data Analytics Technologies. The focus is on managing the pipeline of the analytical process to build scalable, robust data science applications: introduction to Hadoop, Spark and Mahout. Managing scalability: real case examples.

Module3: Student Activities. Students are required to participate actively through individual seminars and team projects.

Virtual Machine:

Other resources:

Hours and rooms

Monday 16:00 - 18:00 Aula Fib N1

Friday 9:00 - 11:00 Aula Fib L1

Day | Topic | Materials | Notes | Instructor
1. Mon 26/09 | Course Presentation; Module1: Big Data Landscape: Opportunities, risks, big data sources, challenges. | http://goo.gl/b2syFA | 1st student assignment: “Big Data, Data Analyst, Crowdsourcing, Crowdsensing” (at least one) | Giannotti/Trasarti
2. Fri 30/09 | Module2: Introduction to Hadoop | https://goo.gl/0UiFg8 | | Trasarti
3. Mon 3/10 | Module1: Big Data Analytics scenario: New questions to be answered | | Round table | Giannotti
4. Fri 7/10 | Understanding dynamic of society with Mobile Phone Traces | https://goo.gl/kEt3m3 | Projects assignment https://goo.gl/DkGMJg | Giannotti
5. Mon 10/10 | Module2: Design Patterns | https://goo.gl/ksJQDJ | 2nd student assignment: Papers & Technologies | Trasarti
6. Fri 14/10 | Module2: Analyzing Big Data with Spark | https://goo.gl/On4B77 https://goo.gl/luhYzB | | Trasarti
Mon 17/10 | | | |
Fri 21/10 | | | |
7. Mon 24/10 | Module2: Data Mining with Spark | https://goo.gl/IWWkkc | | Trasarti
8. Fri 28/10 | Module3: Project formulation | | Formulations here: https://goo.gl/0iyKAM | Trasarti
9. Mon 7/11 | Module2: Technologies highlights I | | Student seminars (technologies 1/6). Technologies here: https://goo.gl/9tU1bT | Trasarti
10. Fri 11/11 | Module2: Technologies highlights II | | Student seminars (technologies 2/6) & Master Big Data presentation, register here! | Trasarti
11. Mon 14/11 | Module3: Mid-term Project presentations | | Slides here: https://goo.gl/2rCX4l | Trasarti
12. Fri 18/11 | Module1: Understanding Human Mobility with Big data | | | Giannotti
13. Mon 21/11 | Module1: Novel Demography with Phone Data | Resources: https://goo.gl/LiQkbn | | Giannotti
14. Fri 25/11 | Module1: Deep Learning | https://goo.gl/WUNR4S | Round table | Giannotti
15. Mon 28/11 | Module2: Realizing a scalable sociometer (ASAP Project) | https://goo.gl/2kQiBJ | Student seminars (papers 3/6). Papers here: https://goo.gl/gKm4w2 | Trasarti
16. Fri 02/12 | Module2: Realizing a classifier for GPS traces (Navionics Project) | https://goo.gl/esAhxd | Student seminars (papers 4/6) | Trasarti
Mon 05/12 | | | |
17. Fri 09/12 | Module1: Social media mining - Sentiment analysis | | Student seminars (papers 5/6) | Giannotti
18. Mon 12/12 | | | Student seminars (papers 6/6) | Trasarti
19. Fri 16/12 | Module3: Student pre-final Project Presentations | https://goo.gl/rerdUv | Groups 1,2,4,5,9 | Giannotti
20. Mon 19/12 | Module3: Student pre-final Project Presentations | https://goo.gl/rerdUv | Groups 3,6,7,8 | Giannotti

Exam

The exam consists of three parts:

  • The assignments during the course (papers and technologies) (20%).
  • A project (50%). The work done should be summarized in a report (max. 10 pages), to be sent to the teachers at least a week before the oral exam (project discussion).
  • An oral exam (30%): the discussion of the project with a group presentation (30 minutes for the whole group).

In addition, students may ask for additional questions on the course content to improve their grade.

Round Table

Each team of 2 students prepares a 3-minute presentation (~1-2 slides) on the round table topic. The presentation must be sent before the round table and should contain: the names of the students, the resources used (Web pages, books, papers, etc.) and the students' opinion on the topic.

Grades: (A) Excellent (B) Very Good (C) Good (D) Sufficient (E) Not Sufficient

Papers & Technologies assignment

Title | Students | Seminar day | Grade
1 - Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale | Giannella Raffaele | 3 | B
2 - Petuum: A New Platform for Distributed Machine Learning on Big Data | Francesco Scigliuzzo | 3 | D/E
3 - Forecasting Fine-Grained Air Quality Based on Big Data | Francesco La Perna | 3 | A+
4 - Untangling performance from success | Filippo Delle Macchie | 6 | A/B
5 - Panther: Fast Top-k Similarity Search on Large Networks | Giacinto Trafficante | 3 | A
6 - The Effectiveness of Marketing Strategies in Social Media: Evidence from Promotional Events | Maurizio Quintini | 3 | A/B
9 - Online Topic-based Social Influence Analysis for the Wimbledon Championships | Matteo Borghi | 4 | A
10 - E-commerce in Your Inbox: Product Recommendations at Scale | Nunzio Spontella | 4 | B
11 - Gender and Interest Targeting for Sponsored Post Advertising at Tumblr | Nicolò Dossena | 4 | A/B
12 - Traffic Measurement and Route Recommendation System for Mass Rapid Transit (MRT) | Tommaso Furlan | 5 | B
13 - Discovering Collective Narratives of Theme Parks from Large Collections of Visitors’ Photo Streams | Benedetta Iavarone | 6 | A
14 - Early Identification of Violent Criminal Gang Members | Baltakiene Margarita | 5 | A
16 - Building Discriminative User Profiles for Large-scale Content Recommendation | Rossi Maria Teresa | 6 | A/B
17 - An analytical framework to nowcast well-being using mobile phone data. | Pietro Gianluca Calamia | 6 | A
18 - Mobile Communication Signatures of Unemployment | Ada Gentile | 6 | A/B
19 - Dataveillance and the False-Positive Paradox | Fabrizio Rizzi | 6 | A
20 - On the Dominant Role of Returners’ Human Mobility Networks on Urban Energy Consumption | Giuseppe Di Modugno | 6 | A
22 - Do Street Fairs Boost Local Businesses? A Quasi-Experimental Analysis Using Social Network Data | Emiliano Fuccio | 6 | B
23 - No place to hide? The ethics and analytics of tracking mobility using mobile phone data | Martina Miliani | 6 | A
T1 - Hive: https://hive.apache.org/ | Antonio Loconte | 1 | A/B
T2 - Scala: http://www.scala-lang.org/ | Simona Ortolani | 1 | B
T4 - HBase: http://hbase.apache.org/ | Maria Francesca Montisci | 2 | B
T5 - Flume: https://flume.apache.org/ | Cristian Criscolo | 2 | D
T7 - Oozie: http://oozie.apache.org/ | Lapo Chirici | 6 | A+
T9 - ZooKeeper: https://zookeeper.apache.org/ | Andrea Meini | 1 | B/C
T11 - Julia: http://julialang.org/ | Maurizio Deidda | 2 | D
T12 - Docker: https://www.docker.com/ | Alessandro Romano | 1 | A

Repository for the papers: http://goo.gl/5BQ50o

Each student prepares a presentation of 10 minutes (~5 slides) for papers or 15 minutes (~8 slides) for technologies:

  • Paper presentations should contain: data description, problem statement, data manipulation, the analytical process and validation
  • Technology presentations should contain: technology objectives, features provided, limitations, examples of usage and documentation references

At the end of the presentation there will be 5 minutes of discussion and questions.

Students must use this link https://goo.gl/7QzR2V to express their preferences (do not change papers or technologies that are already taken); they will then be allocated to one of the seminar days. Deadline for expressing preferences: 13/10.

Project requirements

Each team of 3-4 students selects a dataset from the proposed ones and formulates the objectives of its project. After the formulation presentation, there will be two more presentations on the progress of the work during the course: mid-term and pre-final. These presentations are intended to gather feedback from the other students and the instructors in order to improve the final project. The team gives the final presentation of the project during the exam.

The final project presentation/report should include:

  • Formulation of the problem to be solved (also inspired by the proposed papers)
  • Data acquisition/pre-processing and data exploration
  • Formulation of the problem to be solved in terms of a data mining problem
  • Implementation of the proposed solution in a big data platform
  • Model construction and validation
  • Discussion of result exploitation
  • Ethical and Privacy issues

To express your preferences about the dataset and the composition of the groups, please use the following link: https://goo.gl/N9WiD1 *Remember to add your email address to receive the NDA to sign. Only when I receive the signed NDA from all the students in a group will I share the dataset!*

Shared Folder

All the documents, presentations or other materials produced by the students must be uploaded to the following shared folder. Create a folder with your surname(s) and put the files inside it. https://drive.google.com/drive/folders/0B_IBzUGc9jCPUmNCY0xudUtiR28?usp=sharing

Big Data Analytics 2015/2016 website

bigdataanalytics/bda/start.txt · Last modified: 26/01/2017 at 15:08 by Roberto Trasarti