This document provides instructions for IR Homework #1, which involves building an inverted index for a text collection. The input is a portion of the ClueWeb09 dataset, which contains over 1 billion web pages. The output should be inverted index files: a dictionary file listing the vocabulary and postings lists recording term occurrences in documents. Optional functionality includes efficiency techniques, configurable tokenization, and support for multiple input formats. The program and documentation are due in two weeks and will be evaluated on correctness plus any optional features. Students submit their work electronically and may be asked to demo if the submission does not run properly.
2. Programming Exercise #1: Indexing
• Goal: to build an index for a text collection using an inverted index
• Input: a set of text documents
– (to be described later)
• Output: inverted index files
– (exact format to be described later)
3. Input: the Test Collection
• ClueWeb09 dataset
– http://lemurproject.org/clueweb09.php/
– 1,040,809,705 Web pages, in 10 languages
– 5 TB compressed (25 TB uncompressed)
– File format: WARC (Web ARChive file format)
• http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
• Each file contains about 40,000 Web pages (roughly 1 GB per file)
• Each team will be randomly allocated different files!
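Reading the WARC input can be sketched roughly as follows. This is a minimal sketch assuming well-formed, uncompressed records framed by their Content-Length headers; the raw bytes below are a hypothetical toy example, not real ClueWeb09 data, and a real indexer would more likely use a dedicated WARC library (e.g. warcio):

```python
import io

def read_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC stream.

    Minimal sketch: assumes well-formed records and relies on the
    Content-Length header to locate record boundaries.
    """
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {}
        while True:
            line = stream.readline().rstrip(b"\r\n")
            if not line:
                break  # blank line ends the header block
            key, _, value = line.partition(b":")
            headers[key.decode().strip().lower()] = value.decode().strip()
        payload = stream.read(int(headers.get("content-length", 0)))
        yield headers, payload

# Hypothetical two-record stream (toy data, for illustration only):
raw = (b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
       b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 5\r\n\r\nworld\r\n")
records = list(read_warc_records(io.BytesIO(raw)))
# records[0][1] == b"hello", records[1][1] == b"world"
```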
4. Other Test Collections
• Reuters-RCV1: (in the textbook)
http://trec.nist.gov/data/reuters/reuters.html
– About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed)
– Requires signing a usage agreement
• Reuters-21578:
http://www.daviddlewis.com/resources/testcollections/reuters21578/
– 21,578 news articles in 1987 (28.0MB uncompressed)
• Test collections held at University of Glasgow:
http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
– LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
– Ex: The Time Collection: 423 documents (1.5MB)
5. Output: Inverted Index
• Using the standard positional index (Chap. 1 & 2)
• Output format:
– Dictionary file: a sorted list of vocabulary terms, one per line
– Postings list: for each term, a list of its occurrences in the original text
• term_i, df_i: <doc_1, tf_i1: <pos_1, pos_2, …>; doc_2, tf_i2: <pos_1, pos_2, …>; …> (as in Fig. 2.11, Sec. 2.4, p. 38)
– df_i: document frequency of term_i
– tf_ij: term frequency of term_i in doc_j
• to, 993427:
<1, 6: <7, 18, 33, 72, 86, 231>;
2, 5: <1, 17, 74, 222, 255>; … >
• …
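The required output format can be illustrated with a small sketch. The function names and the in-memory dict-of-dicts layout are illustrative choices (an assumption), not the required implementation; positions are 1-based token offsets as in the example above:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build {term: {doc_id: [positions]}} from tokenized documents.

    docs: dict mapping doc_id -> list of body tokens.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def format_postings(term, postings):
    """Render one term in the slide's format:
    term, df: <doc1, tf1: <pos, ...>; doc2, tf2: <pos, ...>; ...>"""
    parts = "; ".join(
        f"{doc_id}, {len(positions)}: <{', '.join(map(str, positions))}>"
        for doc_id, positions in sorted(postings.items()))
    return f"{term}, {len(postings)}: <{parts}>"

docs = {1: ["to", "be", "or", "not", "to", "be"]}  # toy document
index = build_positional_index(docs)
line = format_postings("to", index["to"])
# line == "to, 1: <1, 2: <1, 5>>"  (df=1 doc, tf=2, positions 1 and 5)
```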
6. Implementation Issues
• Note: pos means the token position in the body of the document
– This facilitates easier implementation of later steps after indexing, for example proximity search
• Document preprocessing should be handled with care
– Different formats for different collections
– Digits, hyphens, punctuation marks, …
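One way to handle digits, hyphens, and punctuation during tokenization is sketched below. The specific policy (lowercasing, joining hyphenated words into one token, splitting on punctuation) is an illustrative assumption, not a required rule:

```python
import re

# Assumed policy: keep alphanumeric runs, treat hyphens as joiners
# ("state-of-the-art" -> "stateoftheart"), drop other punctuation.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text):
    return [m.group(0).replace("-", "").lower()
            for m in TOKEN_RE.finditer(text)]

tokens = tokenize("State-of-the-art systems index 1,000,000 docs!")
# tokens == ['stateoftheart', 'systems', 'index', '1', '000', '000', 'docs']
```

Note that commas inside numbers split "1,000,000" into three tokens under this policy; whether that is acceptable is exactly the kind of design decision worth documenting in your one-page description.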
7. Optional Functionality
• Efficiency issues
– A separate data structure (e.g. a trie) can be used to store the vocabulary and postings in your indexer, but the output should be in the designated format
– Skip pointers
• Tokenization
– Case folding
– Stopword removal
– Stemming
– Each should be able to be turned on/off by a parameter trigger
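A parameter trigger for these optional features might look like the following sketch. The flag names, the toy stopword list, and the crude suffix stripper are all illustrative assumptions; a real implementation would use a proper stemmer such as Porter's algorithm:

```python
import argparse

STOPWORDS = {"the", "a", "an", "of", "to", "and"}  # toy list, not exhaustive

def naive_stem(token):
    # Extremely crude suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(tokens, fold_case=False, remove_stopwords=False, stem=False):
    if fold_case:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens

def make_parser():
    p = argparse.ArgumentParser(description="indexer options (hypothetical flags)")
    p.add_argument("--fold-case", action="store_true")
    p.add_argument("--remove-stopwords", action="store_true")
    p.add_argument("--stem", action="store_true")
    return p

args = make_parser().parse_args(["--fold-case", "--remove-stopwords"])
out = preprocess(["The", "Indexing", "of", "Documents"],
                 fold_case=args.fold_case,
                 remove_stopwords=args.remove_stopwords,
                 stem=args.stem)
# out == ["indexing", "documents"]
```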
8. Submission
• Your submission *should* include
– The source code (and optionally your executable file)
– A one-page description that includes the following
• Major features in your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …)
• Major difficulties encountered
• Special requirements for execution environments (ex: Java Runtime Environment, special compilers, …)
• Team member list: each member's name and responsible parts should be clearly identified
• Due: two weeks (Mar. 29, 2012)
9. Submission Instructions
• Programs or homework in electronic files must be submitted directly on the submission site:
– Submission site: http://140.124.183.39/IR/
• Username: your student ID
• Password: (Please change your default password at your
first login)
– Prepare your submission as one single compressed file
• Remember to specify the names and student IDs of your team members in the files and documentation
– If you cannot successfully submit your work, please contact the TA (@ R1424, Technology Building)
10. Evaluation
• Minimum requirement: correctness
– The ClueWeb09 Test Collection (partial) will be used as the input, and the inverted index generated by your program will be checked
– Optional features will be considered as bonus
• You may be required to demo if the TA is unable to compile/run the submitted program