====== Information Retrieval - Academic Year 2013/2014 ====== ===== General Information ===== * ** Teacher **: [[http://www.di.unipi.it/~ferragin/|Paolo Ferragina]] * **Course ID**: 289AA * **CFU:** 6 (first semester) * **Language:** English * **Lectures Schedule:** Tuesday and Thursday 14-16, both in room C * **Question time:** by appointment. * **Official Lecture's Log:** The schedule and content of the lectures is available with the official [[http://unimap.unipi.it/registri/dettregistriNEW.php?re=108441::::&ri=9142|registro]]. * News about this course will be distributed via a [[http://twitter.com/FerraginaTeach | Tweeter-channel]] ===== Goals ===== Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will: * describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer; * present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph; * introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching. ===== Exam ===== The exam will consist of a written test, plus an oral discussion on the exercises. We have also proposed a project on [[http://acube.di.unipi.it/tagme/|TAGME]] whose details are given below. The project is optional, its details are described in [[IR Project 2013|this page]]. ^ Date ^ Room ^ Text ^ | 08-01-14, 9:00 | C | {{:magistraleinformatica:ir:ir13:ir140108.docx|written exam}}. | | 29-01-14, 9:00 | C | {{:magistraleinformatica:ir:ir13:ir140129.docx|written exam}} | | 10-02-14, 9:30 | Seminari Ovest, Dip. Informatica | Pitch of the projects, whose deadline is 09-02 at 12:00 | | 09-06-14, 9:00 | L1 | {{:magistraleinformatica:ir:ir13:ir140609.docx|written exam}} | | 30-06-14, 9:00 | F | {{:magistraleinformatica:ir:ir13:ir140630.docx|written exam}} | | 29-07-14, 9:00 | L1 | {{:magistraleinformatica:ir:ir13:ir140729.docx|written exam}} | ===== Books ===== * **[MRS]** C.D. Manning, P. Raghavan, H. Schutze. //Introduction to Information Retrieval//. Cambridge University Press, 2008. [ [[http://nlp.stanford.edu/IR-book/information-retrieval-book.html| link]] ] * **[MG]** {{:magistraleinformatica:ir:ir12:reading-mgcompress.pdf|Chapter 2}} "Text compression" of //Managing Gigabytes//, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, 1999. ===== Content of the Lectures (to be updated) ===== ^ Date ^ Argument ^ Refs ^ | | | | | 24/09/2013 | Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The issue of hierarchical memories: I/O-model. | Chapter 1 of [MRS] | | 26/09/2013 | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. | Sect. 2.1, 2.2 and 5.1 of [MRS]. {{:magistraleinformatica:ir:ir13:01-intro.ppt|Slides of this week}}. | | 01/10/2013 | Index construction: multi-way mergesort, BSBI and SPIMI. Distributed and dynamic indexing. | Chapter 4 of [MRS]. {{:magistraleinformatica:ir:ir13:02-construction.ppt|Slides.}} | | 03/10/2013 | Query processing: skip pointers (with solution based on dynamic programming), caching, phrase queries, wild-card queries (permuterm, k-gram). | Chapter 4 and Sect. 3.2 of [MRS]. {{:magistraleinformatica:ir:ir13:03-query.ppt|Slides}} | | 08/10/2013 | wild-card queries (dynamic programming), zone indexes. Notion of Entropy and prefix-free codes. | Sect. 3.3 and 6.1 of [MRS]. | | 10/10/2013 | Posting list compression, codes: gamma, delta, variable bytes, PForDelta. Rank and Select primitives: definition and their use. Elias-Fano code and its use for postings compression. | Read this {{:magistraleinformatica:ir:ir12:reading-integercodes.pdf|paper}} for the codes seen in class and Sez 5.3 of [MRS]. {{:magistraleinformatica:ir:ir13:04-compression_integers.ppt|Slides}} | | 15/10/2013 | Huffman and Arithmetic coding, with comparisons with Entropy and comment on their performance. | Read pag. 21-36, 52-56 of [MG]| | 17/10/2013 | LZ77 and gzip. Hashing with chaining, universal hashing, cuckoo hashing, Bloom Filter. | Read pag. 74-79 of [MG]. {{:magistraleinformatica:ir:ir12:reading-cuckoo.pdf|Paper}} on cuckoo, just description (no proof). Paper on {{:magistraleinformatica:ir:ir12:reading-bloomfilter.pdf|bloom filter}} (just things looked in class). | | 22/10/2013 | Prefix-search: trie, front-coding and two-level indexing. Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. | Sect 6.2, 6.3 from [MRS]. {{:magistraleinformatica:ir:ir13:05-dictsearch.ppt|Slides}} | | 24/10/2010 | Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion.| Chap 7 and 9 from [MRS]. {{:magistraleinformatica:ir:ir13:06-ranking.ppt|Slides}} | | 29/10/2013 | Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE. The storage of the web-graph. | Chap 19.1 and 19.2, 19.5, 20.4 from [MRS]. {{:magistraleinformatica:ir:ir13:07-theweb.ppt|Slides}} | | 31/10/2013 | Crawling and consistent hashing. | Chap 20.1, 20.2. {{:magistraleinformatica:ir:ir13:08a-web-crawling.ppt|Slides}} | | 12/11/2013 | Link-based ranking: pagerank and HITS and weighted variants. Latent Semantic Indexing. | Chap 21 and 18 from [MRS]. {{:magistraleinformatica:ir:ir13:08b-web-ranking.ppt|Slides ranking}}, {{:magistraleinformatica:ir:ir13:09-lsi.ppt|slide LSI}} | | 14/11/2013 | Random projections. Recommendation systems and Web advertising. | {{:magistraleinformatica:ir:ir13:11-applications.ppt|Slides}} | | 19/11/2013 | Semantic-annotation tools: basics and TAGME. | Slides about the optional {{:magistraleinformatica:ir:ir13:ir_project_2013_1_.pdf|project}}. | | 21/11/2013 | Semantic-annotation tools: advanced and some applications. | {{:magistraleinformatica:ir:ir13:12-topic_annotators.pptx|Slides}} | | 26/11/2013 | Clustering: flat, hierarchical, soft, hard. K-means, optimal bisect, hierarchical - max, min, avg, centroid. | Chap 16 and 17 of [MRS], {{:magistraleinformatica:ir:ir13:11-clustering.ppt|Slides}} | | 28/11/2013 | Learning to Rank: definition, BM25 and some of its variants, NDCG@K, RankNet, Genetic Algorithm, SVM. | Joint lecture with Claudio Lucchese. {{:magistraleinformatica:ir:ir13:unipi_-_learning_to_rank_i.pdf|Slides.}}| | 03/12/2013 | Advanced Algorithms for learning-to-rank: Gradient Boosted Regression Trees, Lambda-Mart. Using the implicit user feedback to evaluate a WSE and create a training set. | Joint lecture with Claudio Lucchese. {{:magistraleinformatica:ir:ir13:unipi_-_learning_to_rank_ii.pdf|Slides}}. Read this {{:magistraleinformatica:ir:ir13:2007_tois_feeback_from_clicks.pdf|paper}} and the parts in Chap 1-4 of this {{:magistraleinformatica:ir:ir13:1_-_learning_to_rank.pdf|survey}} dealt with in class. | | 05/12/2013 | Auto-completion search: definition, trie, top-k, cartesian tree, RMQ. | Joint lecture with Rossano Venturini. You can deepen the study at [[http://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformaticanetworking/ae/ae2012/chap8.pdf|Sect 8.4.1]] and Sect 2 and 3 of this [[http://www.cs.sunysb.edu/~bender/newpub/BenderFa00-lca.pdf|paper]].| | 10/12/2013 | Rank/select over binary arrays, succinct tree encodings (LOUDS, BP, DFUDS) and navigational operations over them, succinct Patricia trie. | Joint lecture with Rossano Venturini. {{:magistraleinformatica:ir:ir13:autocompletion2.pdf|Slides}} are enough.| | 12/12/2013 | Q&A and exercises | | | 17/12/2013 | Q&A and exercises | |