Indice

Information Retrieval - Academic Year 2022/2023

General Information


Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:


Schedule of the Lectures

Week Schedule
Day Time Slot Room
Monday 11:00 - 13:00 Room C (Polo Fibonacci)
Tuesday 9:00 - 11:00 Room C (Polo Fibonacci)

Exams

The exam will consist of a written test including two parts: exercises and “oral” questions. The exam is passed if in both parts the student gets a sufficient score (expressed as 20+10), which are then summed.

The first (exercises) and the second (theory questions) parts of the exam can be split into different exam dates, even of different exam sessions. The exam dates are the ones indicated in the calendar on ESAMI. In the case that the second part is not passed or the student abandons the exam, (s)he can keep the rank of the first exam, but this may occur just once. The second time this happens, the rank of the first part is dropped, and the student has to do both parts again.

Date Room Text Notes
17/01/23, start at 09:00 room E text, solution, results Correction will occur tomorrow Wednesday 18 January, at 16:00 (Ferragina's office). For registration only, just send me an email to accept the rank.
Students that have passed only the “exercises” part can repeat only the “theory” part on any of the following exam dates, they have to register on the portal “ESAMI” writing in the notes “only theory”. Moreover, they can come +45mins after the start of the exam, to join the class that did in the first hour the “exercises” part.
08/02/23, start at 11:00 room A1 text, results, solution Correction will occur Monday 13 January, at 11:00 (Ferragina's office). For registration only, just send me an email to accept the rank.
Students that have passed only the “exercises” part can repeat only the “theory” part on any of the following exam dates, they have to register on the portal “ESAMI” writing in the notes “only theory”. Moreover, they can come +45mins after the start of the exam, to join the class that did in the first hour the “exercises” part.
05/06/2023, start at 16:00 room C text
05/07/2023, start at 11:00 room A1 text
24/07/2023, start at 14:00 room C text
07/09/2023, start at 14:00 room A1 text

Materials for study


Lectures

Video-lectures of last year are available at the link and they are linked just for reference, if you wish to re-check something you listened in class. This year, lectures are in presence and the program of the course could be different.

Date Argument Refs
19.09.2022 Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. Slides.
Chapter 1 of [MRS]
20.09.2022 Skip pointers, Zone indexes, Web search engine: its structure, difficulties in their design and their epochs. The Web graph: some useful structural properties (such as Bow Tie). Slides.
Sections 19.1, 19.2, 19.4 of [MRS].
27.09.2022 Crawling: problems and algorithmic structure. An example: Mercator. The bloom filter: definition, time/space complexity and error bound. Spectral Bloom Filter. Slide.
Sections 20.1, 20.2 of [MRS].
For doubts on Bloom Filters see paper.
03.10.2022 Consistent Hashing. Web graph compression: properties of the web, power laws, and compressing the adjacency lists. Sect 19.1 and 19.2 of [MRS] and this page and note for consistent hashing.
Sect 20.3 and 20.4 of [MRS].
04.10.2022 Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. Use in an off-line and in an on-line setting. Comparison between LSH and K-means for the clustering problem. Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing. Slides 1, Slides 2.
Sect 19.6 of [MRS]
11.10.2022 Min-hashing (with prob property). LSH on integer vectors. Cosine-similarity among vectors of real-features. Exercises on Consistent hashing, Web graph, hamming distance, LSH and shingling.
17.10.2022 The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes. Slides.
Chapter 4 of [MRS].
18.10.2022 More on Dynamic indexing: a cascade of indexes. Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Keyword extraction: statistical, chi-square test. Slides.
Sect. 2.1, 2.2, 5.1 and 5.2 of [MRS].
24.10.2022 Rake tool. Compressed storage of documents: LZ-based compression. Storage and Transmission of single/group of file(s): Delta compression (Zdelta). Slide.
Suggest reading a paper.
25.10.2022 File Synchronization (rsync, zsync). LSH for cosine similarity estimation. Exercises on file sync and zdelta.
31.10.2022 Posting list compression, codes: gamma, variable bytes (t-nibble), Simple9, Group variable, PForDelta, Elias-Fano indexing. Sect. 5.3 of [MRS] and Ferragina's notes (only the coders presented in class).
Slides.
Video
07.11.2022 Query processing: soft-AND. Phrase queries, biword index and positional index. Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance with e-errors via brute-force approach, or Dynamic Programming (possibly weighted). Overlap measure with k-gram index. An index for e-error matches based on k-gram index (with false positives, no false negatives). Sect. 2.3 and 2.4 of [MRS].
Slides
08.11.2022 Caching and Tiered index. An efficient filter for one-error match (with false positives, no false negatives). Wild-card queries (permuterm, k-gram). Phonetic match. Scoring and ranking spelling errors.
14.11.2022 Text-based ranking: dice, jaccard, tf-idf. Vector space model and cosine similarity doc-doc and query-doc. Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query terms, clustering. Sect 6.2 and 6.3 and 7 from [MRS], video, and slides
15.11.2022 Fast top-k retrieval: fancy hits. Exact Top-K: WAND and blocked-WAND.
21.11.2022 Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1, DCG and NDCG. Sect 8.1-8.3 and 9 [MRS].
22.11.2022 Random Walks. Link-based ranking: PageRank. Chap 21 of [MRS]. Slides.
24.11.2022 Extra lecture (16-18, room L1): Lab on ElasticSearch, please bring your own laptop and make sure you have a working installation of a recent version of Python and Anaconda. Use them to create a clean environment specific for this course. The only required package is ElasticSearch: install it via pip install elasticsearch. All material of the lecture is in this github. Slides intro
Slides on ElasticSearch
28.11.2022 Extra lecture: Lab on ElasticSearch and TagMe, please bring your own laptop. Slides on TagMe
29.11.2022 Topic-based pagerank, personalized pagerank. HITS. Application to Text Summarization.
01.12.2022 Extra lecture (room L1, 16 - 18): Entity Linkers and Knowledge Graphs. Projections to smaller spaces: Latent Semantic Indexing (LSI). Chap 18 from [MRS].
Slides tagme and Slides LSI.
05.12.2022 Exercises
06.12.2022 Exercises
13.12.2022 Exercises.