Strumenti Utente

Strumenti Sito


magistraleinformatica:ir:ir17:start

Information Retrieval - Academic Year 2017/2018

General Information

  • Teacher : Paolo Ferragina
  • Course ID: 289AA
  • CFU: 6 (first semester)
  • Language: English
  • Lectures Schedule: Monday 11-13 (A1), Tuesday 09-11 (A1).
  • Question time: after lectures or appointment.
  • Official Lecture's Log: Here it is the registro.
  • News about this course will be distributed via a Tweeter-channel

Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:

  • study and analyze the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Query and Document annotator, Results Ranker;
  • dig into some basic algorithmic techniques which are now ubiquitous in any IR application for data compression, indexing and sketching;
  • describe few other IR tools which are used either as a component of a search engine or as independent tools and build up the previous algorithmic techniques, such as: Classification, Clustering, Recommendation, Random Sampling, Locality Sensitive Hashing.

Exam

The exam will consist of a written test plus an oral discussion on the exercises.

Date Room Text
30/10/2017, hr 11-13 C1 Midterm exam: text, results, solution
18/12/2017, hr 11-13 C1 Finalterm exam: text, results (registration in any following date of the exams), solution
15/01/2018, hr 09-11 C1 Exam: text, solution
15/02/2018, hr 09-11 C1 Exam: text
15/06/2018, hr 09-11 N1 Exam: text
16/07/2018, hr 09-11 N1 Exam: text
04/09/2018, hr 09-13 E Exam: text

Books

  • [MRS] C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
  • Some copies of papers pointed out below
  • Some notes by Prof. Paolo Ferragina.

Content of the Lectures

Date Argument Refs
18/09/2017 Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. Slides.
Chapter 1 of [MRS]
19/09/2017 Web search engine: its structure, difficulties in their design and their ephocs. The Web graph: some useful structural properties (such as Bow Tie). Crawling: problems and algorithmic structure. An example: Mercator. Slides.
Sections 19.1, 19.2, 19.4, 20.1, 20.2 of [MRS].
25/09/2017 Few useful algorithmic techniques for crawling the Web (and not only that!): Bloom Filter and Consistent Hashing. Web graph properties and its compressed storage. Slides.
Sect 19.1 and 19.2, 20.3 and 20.4 of [MRS]. For doubts on Bloom Filter see paper.
26/09/2017 Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. Slides
02/10/2017 Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing (with prob property), LSH on integer vectors, cosine-similarity among vectors of real-features. slides.
Sect 19.6 of [MRS]
03/10/2017 The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. slides.
Chapter 4 of [MRS].
09/10/2017 Compressed storage of documents: LZ-based compression. Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync), and Set Reconciliation. Slides.
Reading a paper.
10/10/2017 Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Keyword extraction: statistical, chi-square test, Rake tool. Slides.
Sect. 2.1, 2.2 and 5.1 of [MRS].
13/10/2017 Exercises.
16/10/2017 Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). Overlap measure with k-gram index. Slides.
Chap 3 of [MRS].
17/10/2017 Edit distance with k-gram index. One-error match. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. Query processing: soft-AND, skip pointers, caching, phrase queries. Zone index and tiered index. Slide.
Sect. 2.3 and 2.4 of [MRS]
24/10/2017 Exercises.
06/11/2017 Posting list compression, codes: gamma, variable bytes (t-nibble), PForDelta and Elias-Fano. Slides.
Sez 5.3 of [MRS] and Ferragina's notes (only the coders presented in class).
07/11/2017 Rank and Select data structures, two approaches: the case of B untouched and extra o(B) bits, and the case of Elias-Fano's approach with B compressed. Slides.
13/11/2017 Succinct representation of binary trees and its navigational operations: heap like notation and LOUDS. Slides
14/11/2017 Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Slides.
Sect 6.2 and 6.3 and 7 from [MRS].
20/11/2017 Exact Top-K: WAND and blocked-WAND. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness. Chap 8 and 9 of [MRS]
21/11/2017 Random Walks. Link-based ranking: pagerank, topic-based pagerank, personalized pagerank, CoSim rank. HITS. Slides. Chap 21 of [MRS]
27/11/2017 Description of the (optional) software projects (Paolo Cintia e Luca Pappalardo). slide
28/11/2017 Projections to smaller spaces: Latent Semantic Indexing (LSI). Random Projections: Johnson-Linderstauss Lemma and its applications. Slides.
Chap 18 from [MRS].
01/12/2017 Exercises text and solutions
04/12/2017 Semantic-annotation tools: basics, Wikipedia structure, TAGME and other annotators. How to evaluate those systems. Slides.
05/12/2017 Various approaches to text representation and their applications.
11/12/2017 No lecture
12/12/2017 No lecture
15/12/2017 Extra lecture: 16-18, room A1
magistraleinformatica/ir/ir17/start.txt · Ultima modifica: 04/09/2018 alle 07:24 (6 anni fa) da Paolo Ferragina