Indice

Information Retrieval - Academic Year 2015/2016

General Information

Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:

Exam

The exam will consist of a written test, plus an oral discussion on the exercises.

Date Room Text
11/01/2016 L1 (9:00) text
01/02/2016 L1 (9:00) text
27/06/2016 L1 (9:00) text
19/07/2016 L1 (9:00) no participants
02/09/2016 L1 (9:30) text

Books

Content of the Lectures

Date Argument Refs
22/09/2015 Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The structure of a search engine. Slides
Chapt 1 of [MRS]
24/09/2015 Web search engine: difficulties in their design and their ephocs. The Web graph: some useful structural properties (such as Bow Tie). Crawling: problems and algorithmic structure. An example: Mercator. Slides,
Sections 19.1, 19.2, 19.4, 20.1, 20.2 of [MRS].
29/09/2015 Few useful algorithmic techniques for crawling the Web (and not only that!): Bloom Filter and Consistent Hashing. Slides.
Sect 20.3 and 20.4 of [MRS]. For doubts on Bloom Filter see paper.
01/10/2015 Compressed storage of the Web graph. Compressed storage of documents: LZ-based compression. Slides,
Sect 19.1 and 19.2 of [MRS], and Sect 1.1 and 2.2 of Ferragina's notes.
06/10/2015 MTF (with proof of its 2-optimality), RLE, and the Burrows-Wheeler Transform with bzip. Slides of the previous lecture,
Sect 2.1 and 2.3 of Ferragina's notes above.
08/10/2015 Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync), and Set Reconciliation. Slides.
13/10/2015 Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Slides.
Sect. 2.1, 2.2 and 5.1 of [MRS].
15/10/2015 The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Slides.
Chapter 4 of [MRS].
20/10/2015 Dynamic indexing. Posting list compression, codes: gamma, delta, variable bytes. PForDelta. Slides.
Sez 5.3 of [MRS] and Ferragina's notes (only the coders presented in class).
22/10/2015 Exact search: hashing with chaining, univeral hashing, cuckoo hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance via brute-force approach, or Dynamic Programming (possibly weighted) Slides.
27/10/2015 Overlap measure with k-gram index. Edit distance with k-gram index. One-error match. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. Slides.
Chap 3 of [MRS].
29/10/2015 Query processing: skip pointers (with solution based on dynamic programming), caching, phrase queries. Zone index and tiered index. Slide.
Sect. 2.3 and 2.4 of [MRS]
10/11/2015 The auto-complete problem and its solutions for the top-1, top-2, .., top-k strings. Rank/Select data structures: B untouched and Elias-Fano's approach with B compressed. Slide
12/11/2015 Exercises.
13/11/2015 Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Other exercises. Sect 6.2 and 6.3 from [MRS].
17/11/2015 Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Slides.
Chap 7 and 9 of [MRS]
19 and 20
11/2015
Lab on Lucene.
You need to configure your laptop as follows: Linux system (may be a virtual machine) with debian-like OS (e.g. Ubuntu 15.10), working Internet connection from the Polo's room, at least 5GB of free disk and 2GB RAM, httrack and pylucene installed (that can be done with sudo apt-get update and sudo apt-get install python-lucene httrack).
In collaboration with Marco Cornolti (cornolti@di.unipi.it).
Slides (crawling) Slides (Lucene)
24/11/2015 Performance measures: precision, recall, F1 and user happiness. Random Walks. Link-based ranking: pagerank and personalized pagerank. Slides.
Chap 8 and 21 from [MRS].
26/11/2015 CoSim Rank and HITS. Recommendation systems and Web advertising. Slides only.
27/11/2015 Projections to smaller spaces: Latent Semantic Indexing (LSI). Random Projections: Johnson-Linderstauss Lemma and its applications. Slides.
Chap 18 from [MRS].
01/12/2015 Semantic-annotation tools: basics, Wikipedia structure, TAGME and other annotators. How to evaluate those systems. Various approaches to text representation. Slides.
03/12/2015 More on topics annotators and their applications. Clustering: flat, hierarchical, soft, hard. K-means, optimal bisect, hierarchical - max, min, avg, centroid. Slides.
Chap 16 and 17 of [MRS].
10/12/2015 Locality-sensitive hashing: basics, hamming distance, Jaccard similarity, sketch of the main theorem. Slides.
Sect 19.6 of [MRS]
11/12/2015 Exercise
15/12/2015 Exercise