Strumenti Utente

Strumenti Sito


magistraleinformatica:ir:ir09:start

Information Retrieval - Academic Year 2009/2010

General Information

  • Teacher : Paolo Ferragina
  • Course ID: 346AA
  • Semester: first
  • CFU: 6 (theory + lab)
  • Language: English
  • Schedule: Monday 14-16 (room C1) and Thursday 9-11 (room I1)
  • News about this course will be distributed via a Tweeter-channel with hashtag #ir2010

Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. We will describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer; present and use in the Lab some effective Open-Source Tools (e.g. Lucene and Octave), and introduce some basic algorithmic techniques which are now ubiquitous in any IR application: data streaming, data sketching, data projection, etc..

Exam

Project assigned during the course, plus an oral discussion concerning with the project and the course material.

Books, notes, ...

  • C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
  • Chapter 2 of Managing Gigabytes, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999.
  • Possibly I'll distribute slides and other papers covering topics which are not dealt with in these two books

Lectures

# Hours Argument Refs Speaker
2 Introduction: large data collections and new algorithmic challanges?
I/O-efficient sorting as a universal tool to manage large datasets.
notes
2 The Web Graph: Properties, storage (compression). BV's paper, 19.1-19.2 in [MRS]
2 The Web Graph: visits via (semi-)external algorithms.
2
(+4 home)
Lab: Web Graph Library web site Ugo Scaiella
2 Search Engines: structure, crawling. chap 20 in [MRS]
2 The Bloom Filter and its Spectral variant. paper
2 Search Engines: Inverted Lists = Dictionary + Postings
(statistical models: Zipf, Luhn). Parsing. Construction.
chaps 2.1-2.2, 4, 5.1 in [MRS]
2 Search Engines: Dictionary indexing (Hashing vs trie). Front-coding and LPFC chaps 5.1-5.2 in [MRS]
3 Search Engines: IL compressed storage, LZ-compression, ZSync et al chaps 5.3 in [MRS] and notes
3 Search Engines: Document storage via Huffman, its canonical variant and Huffword.
A simple word-encoder
pp. 21-41 in [WMB]
2 Search Engines: Text-based ranking. Recommendation Systems (sketch).
Quality of the results: Precision, Recall, F-measure.
chap 6 and 8 in [MRS]
2
(+4 home)
Lab: Lucene in action web site Ugo Scaiella
4 Search Engines: Link-based ranking; PageRank, HITS, Salsa. chap 21 in [MRS] Gianna del Corso
2 Reputation Systems Gianna del Corso
2 Latent Semantic Indexing and Random Projections. chap 18 in [MRS]
4 Clustering: sketch of k-means, MST-based, max-cut, spectral. chap 16 and 17 in [MRS]
2 Suffix Tree and Array for Text Mining slides
1 Web SPAM and Web Advertising slides 19.3 in [MRS]
2 Doc duplication and set similarity: shingles, sketches, min-wise permutations chap 19.6 in [MRS]
8 Automated text categorization paper Fabrizio Sebastiani
00 Query-Log mining
00 Advanced data structures: Locality Preserving Hashing
magistraleinformatica/ir/ir09/start.txt · Ultima modifica: 31/08/2010 alle 17:05 (14 anni fa) da peppe

Donate Powered by PHP Valid HTML5 Valid CSS Driven by DokuWiki