====== Information Retrieval - Academic Year 2012/2013 ====== ===== General Information ===== * ** Teacher **: [[http://www.di.unipi.it/~ferragin/|Paolo Ferragina]] * **Course ID**: 289AA * **CFU:** 6 (first semester) * **Language:** English * **Lectures Schedule:** Monday and Tuesday 14-16, both in room L1 * **Question time:** by appointment. * **Official Lecture's Log:** The schedule and content of the lectures is available with the official [[http://unimap.unipi.it/registri/dettregistriNEW.php?re=81981::::&ri=4673|registro]]. * News about this course will be distributed via a [[http://twitter.com/FerraginaTeach | Tweeter-channel]] ===== Goals ===== Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will: * describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer; * present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph; * introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching. ===== Exam ===== Project assigned during the course, plus an oral discussion concerning with the project and the course. ^ Date ^ Text of the exam ^ Results ^ | 10 January 2013 | {{:magistraleinformatica:ir:ir12:ir130110.docx|text}} | . | | 12 February 2013 | {{:magistraleinformatica:ir:ir12:ir130212.docx|text}} | . | | 25 June 2013 | {{:magistraleinformatica:ir:ir12:ir130625.docx|text}} | . | | 16 July 2013 | {{:magistraleinformatica:ir:ir12:ir130716.docx|text}} | . | ===== Books, notes, ... ===== * **[MRS]** C.D. Manning, P. Raghavan, H. Schutze. //Introduction to Information Retrieval//. Cambridge University Press, 2008. [ [[http://nlp.stanford.edu/IR-book/information-retrieval-book.html| link]] ] * **[MG]** {{:magistraleinformatica:ir:ir12:reading-mgcompress.pdf|Chapter 2}} "Text compression" of //Managing Gigabytes//, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999. ===== Content of the Lectures ===== ^ Date ^ Argument ^ Refs ^ | | | | | 24/09/2012 | Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The issue of hierarchical memories: I/O-model. | Chapter 1 of [MRS] | | 25/09/2012 | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. | Sect. 2.1, 2.2 and 5.1 of [MRS]. {{:magistraleinformatica:ir:ir12:lecture01.pptx|Slides}} of this day | | 01/10/2012 | **NO Lecture** | | | 02/10/2012 | Index construction: multi-way mergesort, BSBI and SPIMI. | Chapter 4 of [MRS]. {{:magistraleinformatica:ir:ir12:02-construction.ppt|Slides}}. | | 08/10/2012 | Distributed and dynamic indexing. Query processing: skip pointers, caching, phrase queries | Chapter 4, Sect. 2.3, 2.4 of [MRS]. {{:magistraleinformatica:ir:ir12:03-query.ppt|Slides}}. | | 09/10/2012 | wild-cards (permuterm, k-gram, dynamic programming). Zone indexes. | Sect. 3.2, 3.3, 6.1 of [MRS]. | | 15/10/2012 | Posting list compression, codes: gamma, delta, variable bytes, PForDelta. | Sect. 5.3 of [MRS]. Read also this {{:magistraleinformatica:ir:ir12:reading-integercodes.pdf|paper}}. | | 16/10/2012 | Notion of Entropy and prefix-free codes. Huffman and Arithmetic coding. LZ77 and gzip. Hashing with chaining, cuckoo hashing. | pag. 21-36, 52-56, 74-79 of [MG]. {{:magistraleinformatica:ir:ir12:04-compression.ppt|Slides}}. {{:magistraleinformatica:ir:ir12:reading-cuckoo.pdf|Paper}} on cuckoo, just description (no proof). | | 18/10/2012 | Bloom filter and its applications. Prefix-search: trie, front-coding and two-level indexing. Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. | Sect 6.2, 6.3 from [MRS]. Paper on {{:magistraleinformatica:ir:ir12:reading-bloomfilter.pdf|bloom filter}}. {{:magistraleinformatica:ir:ir12:05-dictsearch.ppt|Slides}}. | | 22/10/2012 | Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion.| Chap 7 and 9 from [MRS]. {{:magistraleinformatica:ir:ir12:06-ranking.ppt|Slides}}. | | 23/10/2012 | Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE, the storage of the web-graph. | Chap 19.1 and 19.2, 19.5, 20.4 from [MRS]. {{:magistraleinformatica:ir:ir12:07-theweb.ppt|Slides}}. | | 25/10/2012 | Crawling and consistent hashing. Link-based ranking: pagerank and HITS and weighted variants. Evaluation measures: precision, recall, F1. Recommendation Systems. | Chap 20.1, 20.2, 21 and 8 from [MRS]. {{:magistraleinformatica:ir:ir12:08-web-crawling-ranking.ppt|Slides}}. | | 29/10/2012 | Some notes on XML search engines and Web advertising. | chap 10 from [MRS]. {{:magistraleinformatica:ir:ir12:10-applications.ppt|Slides}}. | | 30/10/2012 | Latent Semantic Indexing and Random projections. Exact and near-document duplication: shingling, fingerprinting and min-wise permutations. Semantic-annotation tools. | chap 18 and 19.6 from [MRS]. {{:magistraleinformatica:ir:ir12:09-tools.ppt|Slides}}. | | 05/11/2012 | Software project | {{:magistraleinformatica:ir:ir12:progettino_2012c.pdf|description}} | | 08/11/2012 | Clustering: flat, hierarchical, soft, hard. K-means, optimal bisect, hierarchical - max, min, avg, centroid. Co-clustering. | Chap 16 and 17 of [MRS], {{:magistraleinformatica:ir:ir12:clustering.ppt|slides}}. | | 12/11/2012 | Framework map-reduce, hadoop, with examples. | By Claudio Lucchese, {{:magistraleinformatica:ir:ir12:slides_mapreduce.pdf|slides}} and some {{:magistraleinformatica:ir:ir12:mrconnectedcomponents.zip|code}}. | | 13/11/2012 | Beyond hadoop: pig and giraffe. | By Fabrizio Silvestri, {{:magistraleinformatica:ir:ir12:piggiraph1.pptx|slides}}. | | 19/11/2012 | Project discussion | | | 20/11/2012 | Lab at home | | | 26/11/2012 | Lab at home | | | 27/11/2012 | Lab at home | | | 03/12/2012 | Lab at home | | | 04/12/2012 | Lab at home | | | 10/11/2012 | Lab at home | | | 11/11/2012 | Lab at home | | | 17/11/2012 | Lab at home | | | 18/11/2012 | Lab at home | |