IR Homework #1

  By J. H. Wang
  Mar. 15, 2012
Programming Exercise #1: Indexing
• Goal: build an inverted index for a text
  collection
• Input: a set of text documents
  – (to be described later)
• Output: inverted index files
  – (exact format to be described later)
Input: the Test Collection
• ClueWeb09 dataset
  –   http://lemurproject.org/clueweb09.php/
  –   1,040,809,705 Web pages, in 10 languages
  –   5TB, compressed (25TB, uncompressed)
  –   File format: WARC (Web ARChive file
      format)
       • http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
       • Each file contains about 40,000 Web pages (about 1GB)
       • Each team will be randomly allocated different
         files!
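
Since the input arrives as WARC files, a first step is to split a file into records. The sketch below is a minimal reader, not a full implementation of the WARC specification: it assumes each record starts with a version line such as `WARC/0.18`, is followed by `Name: value` headers and a blank line, and carries a correct `Content-Length`. The names `iter_warc_records` and `make_record` are illustrative, and the sample data is synthetic.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Simplified sketch: trusts Content-Length and skips separator lines
    between records. Real files may need more defensive handling.
    """
    while True:
        line = stream.readline()
        if not line:                          # end of stream
            return
        if not line.startswith(b"WARC/"):     # blank lines between records
            continue
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line or line in (b"\r\n", b"\n"):
                break                         # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", "0"))
        yield headers, stream.read(length)


def make_record(rec_type: str, body: bytes) -> bytes:
    """Build one synthetic record for the demo (made-up content)."""
    head = f"WARC/0.18\r\nWARC-Type: {rec_type}\r\nContent-Length: {len(body)}\r\n\r\n"
    return head.encode("ascii") + body + b"\r\n\r\n"


sample = make_record("warcinfo", b"software: demo") + make_record("response", b"<html>hi</html>")
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], len(payload))
```

For a compressed file, the same function can probably be fed the object returned by `gzip.open(path, "rb")`. In real `response` records the payload also contains the HTTP headers, which should be stripped before tokenizing the page body.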
Other Test Collections
• Reuters-RCV1: (in the textbook)
  http://trec.nist.gov/data/reuters/reuters.html
   – About 810,000 English news stories from 1996/08/20 to
     1997/08/19 (2.5GB uncompressed)
   – Requires signing agreements
• Reuters-21578:
  http://www.daviddlewis.com/resources/testcollections/reuters21578/
   – 21,578 news articles from 1987 (28.0MB uncompressed)
• Test collections held at University of Glasgow:
  http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
   – LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
   – Ex: The Time Collection: 423 documents (1.5MB)
Output: Inverted Index
• Use the standard positional index (Chap. 1 & 2)
• Output format:
  – Dictionary file: a sorted list of vocabulary terms (one
    per line)
  – Postings list: for each term, a list of its occurrences in the
    original text
     • term_i, df_i: <doc_1, tf_i1: <pos_1, pos_2, …>; doc_2, tf_i2: <pos_1, pos_2,
       …>; …> (as in Fig. 2.11, Sec. 2.4, p.38)
         – df_i: document frequency of term_i
         – tf_ij: term frequency of term_i in doc_j
     • to, 993427:
       <1, 6: <7, 18, 33, 72, 86, 231>;
         2, 5: <1, 17, 74, 222, 255>; … >
     • …
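
The output format above can be sketched as follows. This is a minimal in-memory illustration, not a required design: the function names (`build_positional_index`, `format_postings`, `dictionary_lines`) are hypothetical, documents are assumed to be already tokenized, and positions are assumed to start at 1.

```python
from collections import defaultdict
from typing import Dict, List

PositionalIndex = Dict[str, Dict[int, List[int]]]


def build_positional_index(docs: Dict[int, List[str]]) -> PositionalIndex:
    """Map each term to {doc_id: [token positions]}; positions start at 1 (assumption)."""
    index: PositionalIndex = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id], start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def format_postings(term: str, postings: Dict[int, List[int]]) -> str:
    """Render one entry as 'term, df: <doc, tf: <pos, ...>; doc, tf: <pos, ...>>'."""
    parts = []
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        pos_text = ", ".join(str(p) for p in positions)
        parts.append(f"{doc_id}, {len(positions)}: <{pos_text}>")
    joined = "; ".join(parts)
    return f"{term}, {len(postings)}: <{joined}>"


def dictionary_lines(index: PositionalIndex) -> List[str]:
    """Sorted vocabulary, one term per line of the dictionary file."""
    return sorted(index)


# Two toy documents, already tokenized (illustrative data only).
docs = {1: ["to", "be", "or", "not", "to", "be"], 2: ["to", "do"]}
index = build_positional_index(docs)
for term in dictionary_lines(index):
    print(format_postings(term, index[term]))
```

Here df is the number of documents in a term's postings and tf is the length of each position list, so neither needs to be stored separately while the index is being built.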
Implementation Issues
• Note: pos is the token position within the
  body of a document
  – This simplifies later steps after indexing,
    for example, proximity search
• Document preprocessing should be
  handled with care
  – Different formats for different collections
  – Digits, hyphens, punctuation marks, …
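
To illustrate why token positions matter, here is a small sketch of one possible tokenization policy together with a proximity check. The policy is an assumption, not a requirement of the assignment: digits are kept as tokens, hyphenated words stay as one token, punctuation is dropped, and only ASCII letters are handled. The names `TOKEN_RE`, `tokenize_with_positions`, and `within_k` are illustrative.

```python
import re
from typing import List, Tuple

# Assumed policy: runs of ASCII letters/digits, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize_with_positions(body: str) -> List[Tuple[int, str]]:
    """Return (position, token) pairs; positions count tokens, starting at 1."""
    return [(pos, m.group(0)) for pos, m in enumerate(TOKEN_RE.finditer(body), start=1)]


def within_k(positions_a: List[int], positions_b: List[int], k: int) -> bool:
    """True if some occurrence of A and some occurrence of B are at most k tokens apart.

    Both lists must be sorted ascending, as they are in a positional index.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position to try a closer pair.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


tokens = tokenize_with_positions("State-of-the-art IR, since 2012!")
print(tokens)
print(within_k([7, 18, 33], [1, 17, 74], 1))
```

Because the check only walks the two sorted position lists once, a proximity query needs nothing beyond what the positional postings already store.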
Optional Functionality
• Efficiency issues
  – A separate data structure (e.g. trie) can be used to
    store the vocabularies and postings in your indexer,
    but the output should be in the designated format
  – Skip pointers
• Tokenization
  –   Case folding
  –   Stopword removal
  –   Stemming
  –   Each option can be turned on/off by a parameter
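
One way to make the tokenization options switchable is a small options object passed to the normalization step. The sketch below is only an illustration: `NormalizerOptions`, `normalize`, `DEMO_STOPWORDS`, and `strip_plural_s` are hypothetical names, the stopword list is a tiny made-up sample, and `strip_plural_s` is a toy stand-in for a real stemmer such as Porter's.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional

# Tiny demo list; a real indexer would load a fuller stopword list.
DEMO_STOPWORDS: FrozenSet[str] = frozenset({"a", "an", "and", "of", "the", "to"})


@dataclass(frozen=True)
class NormalizerOptions:
    case_folding: bool = True
    remove_stopwords: bool = False
    stemmer: Optional[Callable[[str], str]] = None   # None means stemming is off
    stopwords: FrozenSet[str] = DEMO_STOPWORDS


def normalize(tokens: List[str], opts: NormalizerOptions) -> List[str]:
    """Apply the enabled steps in order: case folding, stopword removal, stemming."""
    out: List[str] = []
    for tok in tokens:
        if opts.case_folding:
            tok = tok.lower()
        if opts.remove_stopwords and tok.lower() in opts.stopwords:
            continue
        if opts.stemmer is not None:
            tok = opts.stemmer(tok)
        out.append(tok)
    return out


def strip_plural_s(token: str) -> str:
    """Toy stemmer for the demo only: drops one trailing 's' from longer tokens."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


tokens = ["The", "Indexes", "of", "Reuters"]
print(normalize(tokens, NormalizerOptions()))
print(normalize(tokens, NormalizerOptions(remove_stopwords=True, stemmer=strip_plural_s)))
```

If stopwords are removed, it is probably safer to assign token positions before this step so that the gaps are preserved and proximity queries still reflect the original text.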
Submission
• Your submission *should* include
  – The source code (and optionally your executable file)
  – A one-page description that includes the following
     • Major features in your work (ex: high efficiency, low storage,
       multiple input formats, huge corpus, …)
     • Major difficulties encountered
     • Special requirements for execution environments (ex: Java
       Runtime Environment, special compilers, …)
     • Team member list: clearly identify each member's name and
       the parts he or she was responsible for
• Due: two weeks (Mar. 29, 2012)
Submission Instructions
• Programs or homework in electronic files must
  be submitted directly on the submission site:
  – Submission site: http://140.124.183.39/IR/
     • Username: your student ID
     • Password: (Please change your default password at your
       first login)
  – Prepare your submission as a single
    compressed file
     • Remember to specify the names and student IDs of your team
       members in the files and documentation
  – If you cannot successfully submit your work, please
    contact the TA (@ R1424, Technology Building)
Evaluation
• Minimum requirement: correctness
  – The inverted index generated by your program
    from the ClueWeb09 Test Collection (partial)
    will be checked
  – Optional features will be considered as bonus
• You may be required to give a demo if the TA
  is unable to compile or run the submitted
  program
Any Questions or Comments?
