Information Retrieval - Academic Year 2009/2010

General Information

Teacher : Paolo Ferragina
Course ID: 346AA
Semester: first
CFU: 6 (theory + lab)
Language: English
Schedule: Monday 14-16 (room C1) and Thursday 9-11 (room I1)
News about this course will be distributed via a Tweeter-channel with hashtag #ir2010

Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. We will describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer; present and use in the Lab some effective Open-Source Tools (e.g. Lucene and Octave), and introduce some basic algorithmic techniques which are now ubiquitous in any IR application: data streaming, data sketching, data projection, etc..

Exam

Project assigned during the course, plus an oral discussion concerning with the project and the course material.

Books, notes, ...

C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
Chapter 2 of Managing Gigabytes, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999.
Possibly I'll distribute slides and other papers covering topics which are not dealt with in these two books

Lectures

# Hours	Argument	Refs	Speaker
2	Introduction: large data collections and new algorithmic challanges? I/O-efficient sorting as a universal tool to manage large datasets.	notes
2	The Web Graph: Properties, storage (compression).	BV's paper, 19.1-19.2 in [MRS]
2	The Web Graph: visits via (semi-)external algorithms.
2 (+4 home)	Lab: Web Graph Library	web site	Ugo Scaiella
2	Search Engines: structure, crawling.	chap 20 in [MRS]
2	The Bloom Filter and its Spectral variant.	paper
2	Search Engines: Inverted Lists = Dictionary + Postings (statistical models: Zipf, Luhn). Parsing. Construction.	chaps 2.1-2.2, 4, 5.1 in [MRS]
2	Search Engines: Dictionary indexing (Hashing vs trie). Front-coding and LPFC	chaps 5.1-5.2 in [MRS]
3	Search Engines: IL compressed storage, LZ-compression, ZSync et al	chaps 5.3 in [MRS] and notes
3	Search Engines: Document storage via Huffman, its canonical variant and Huffword. A simple word-encoder	pp. 21-41 in [WMB]
2	Search Engines: Text-based ranking. Recommendation Systems (sketch). Quality of the results: Precision, Recall, F-measure.	chap 6 and 8 in [MRS]
2 (+4 home)	Lab: Lucene in action	web site	Ugo Scaiella
4	Search Engines: Link-based ranking; PageRank, HITS, Salsa.	chap 21 in [MRS]	Gianna del Corso
2	Reputation Systems		Gianna del Corso
2	Latent Semantic Indexing and Random Projections.	chap 18 in [MRS]
4	Clustering: sketch of k-means, MST-based, max-cut, spectral.	chap 16 and 17 in [MRS]
2	Suffix Tree and Array for Text Mining	slides
1	Web SPAM and Web Advertising	slides	19.3 in [MRS]
2	Doc duplication and set similarity: shingles, sketches, min-wise permutations	chap 19.6 in [MRS]
8	Automated text categorization	paper	Fabrizio Sebastiani
00	Query-Log mining
00	Advanced data structures: Locality Preserving Hashing