
Searching The United States Code with Solr/Lucene - By Ronald Matamoros


See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

Published in: Technology, Sports

  1. Searching The United States Code with Solr/Lucene
      Paul Nelson / Ronald Matamoros, Search Technologies
      pnelson@searchtechnologies.com, 5/25/2011 [email_address]
  2. Searching the United States Code
      • Who are we:
          • Paul Nelson, Chief Architect
          • Ronald Matamoros, Lead Engineer
      • Our Mission: Replace Personal Librarian Search
          • A 20-Year-Old Search Engine!
      • Key Challenges
          • How to index this massive, complex, 85-year-old document?
          • How to replicate 20-Year-Old search features?
      • Government Documents are Fun!
  3. Search Technologies
      • The largest independent provider of enterprise search expertise and services
      • 80 full-time dedicated search engine experts
      • 200+ customers
      • Technology Neutral
          • (yeah, we know Sphinx too)
      • Offices All Over
          • DC, NY, CA, MD, OH, UK, CR…
  4. A Quick Civics Lesson…
      • The United States Code
          • The general & permanent laws of the U.S. Government – All in one place
          • 51 titles
              • Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
          • First Version: 1926
      • The Office of the Law Revision Counsel (OLRC)
          • 20 lawyers who author the U.S. Code
          • They report to the Speaker of the House of Representatives
      • Bonus Question: Which Title is the largest?
  5. Major Challenges
      • Document Parsing
          • A 50 Volume Table Of Contents!
      • Query Parsing
          • Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
      • Searching & Highlighting Fields
          • Some fields are embedded in the document
          • These fields must be highlighted in context
  6. screenshot
  7. screenshot
  8. screenshot
  9.
  10. Part The First: Document Processing
  11. Document Processing / Indexing
      [Pipeline diagram. Labels: USC Title, Parse & Granularize, Repository, Construct XHTML, Store, Xform & Index, Solr, Embed Refs]
  12. Field Type 1: Extracted to Index
      [Callout labels: Page Numbers Title Heading Source Credit]
        <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
        <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
        <!-- itemsortkey:140AAAD -->
        <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
        <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3> <!-- field-end:head -->
        <!-- field-start:statute -->
        <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
        <!-- field-end:statute -->
        <!-- field-start:sourcecredit -->
        <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
        <!-- field-end:sourcecredit -->
        <!-- field-start:notes -->
        <!-- field-start:historicalandrevision-note -->
        <h4 class="note-head">Historical and Revision Notes</h4>
        <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
        <!-- field-end:historicalandrevision-note -->
        <!-- field-start:amendment-note -->
        <h4 class="note-head">Amendments</h4>
        <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
        <!-- field-end:amendment-note -->
        <!-- field-start:effectivedate-amendment-note -->
        <h4 class="note-head">Effective Date of 2002 Amendment</h4>
        <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
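The leading comments in the sample carry the values that are extracted into index fields. As a rough illustration only, the sketch below pulls `key:value` metadata out of such comments. The function name `extract_index_fields` and the parsing rules (several space-separated pairs in one comment, or a single key whose value may contain spaces) are assumptions inferred from the sample, not the presenters' parser.

```python
import re

# Any XHTML comment; the body is captured without surrounding whitespace.
COMMENT = re.compile(r"<!--\s*(.*?)\s*-->", re.DOTALL)
# A single key:value token with no spaces in the value.
PAIR = re.compile(r"\w+:\S+")

def extract_index_fields(xhtml):
    """Collect key:value metadata from comments, skipping field markers."""
    fields = {}
    for body in COMMENT.findall(xhtml):
        if body.startswith(("field-start:", "field-end:")):
            continue  # these delimit in-document fields, not index metadata
        tokens = body.split()
        if tokens and all(PAIR.fullmatch(t) for t in tokens):
            # Several space-separated pairs in one comment (first sample line).
            for t in tokens:
                key, value = t.split(":", 1)
                fields[key] = value
        elif ":" in body:
            # One key whose value may contain spaces (itempath, expcite).
            key, value = body.split(":", 1)
            fields[key] = value
    return fields

sample = (
    "<!-- documentid:14_1 currentthrough:20080108 documentPDFPage:3 --> "
    "<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 --> "
    "<!-- itemsortkey:140AAAD --> "
    "<!-- field-start:head --><h3>Establishment of Coast Guard</h3>"
    "<!-- field-end:head -->"
)
fields = extract_index_fields(sample)
```

Here `fields["itempath"]` keeps its embedded spaces, while the first comment yields three separate entries.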
  13. Document Processing / Indexing
      [Tree diagram. Labels: Title 14; ch. 1, ch. 2, ch. 3; pt. A, pt. B, pt. C; sec. 1, sec. 2, sec. 3; …]
      [Pipeline diagram. Labels: USC Title, Parse & Granularize, Repository, Construct XHTML, Store, Xform & Index, Solr, Embed Refs]
  14. Field Type 2: Embedded Refs
      [Same Title 14, Sec. 1 XHTML sample as slide 12]
      [Callout labels: Public Law, Other USC Refs, Statute at Large, Public Law, Public Law]
  15. Document Processing / Indexing
      [Pipeline diagram. Labels: USC Title, Parse & Granularize, Repository, Construct XHTML, Store, Xform & Index, Solr, Embed Refs]
  16. Document Processing / Indexing
      [Pipeline diagram. Labels: USC Title, Parse & Granularize, Repository, Construct XHTML, Store, Xform & Index, Solr, Embed Refs]
      • /US-Code
          • /2010
              • /title2
                  • /USC-title2-section1532.htm
                  • /USC-title2-node3-rule5.htm
  17. Part The Second: Token Processing
  18. Token Processing 1
      • xhtml tag tokenizer
      Before: <!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of … <!-- field-end:amendment-note -->
      After: <!-- field-start:amendment-note --> <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of <!-- field-end:amendment-note -->
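A minimal sketch of what a tag-aware tokenizer like this might do: comments and tags survive as single tokens, words are split out, and entities and punctuation are dropped. The regular expression and the name `tokenize` are illustrative assumptions; the talk does not show the real Lucene tokenizer.

```python
import re

# Order matters: comments first, then tags, then entities, then words.
TOKEN = re.compile(r"<!--.*?-->|<[^>]+>|&#?\w+;|\w+", re.DOTALL)

def tokenize(xhtml):
    """Split XHTML into comment, tag, and word tokens; drop entities."""
    tokens = []
    for match in TOKEN.finditer(xhtml):
        text = match.group(0)
        if text.startswith("&"):
            continue  # entities such as &mdash; act only as separators
        tokens.append(text)
    return tokens

sample = (
    '<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> '
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted '
    '&ldquo;Department of <!-- field-end:amendment-note -->'
)
tokens = tokenize(sample)
```

Unlike the slide, this sketch leaves word case untouched, so `substituted` stays lowercase.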
  19. Field Type 3: Marked Within Doc
      [Same Title 14, Sec. 1 XHTML sample as slide 12, with the field-start / field-end comment pairs marking each field within the document]
  20. Token Processing 2
      • Mark Start and End Tags
      Before: <!-- field-start:amendment-note --> <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of <!-- field-end:amendment-note -->
      After: S/amendment <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of E/amendment
  21. Token Processing 3
      • Remove XHTML Tags
      Before: S/amendment <h4 class="note-head"> Amendments </h4> <p class="note-body"> 2002 Pub L 107 296 Substituted Department of E/amendment
      After: S/amendment Amendments 2002 Pub L 107 296 Substituted Department of E/amendment
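Steps 2 and 3 can be pictured as a single filter over the token stream. In the hypothetical sketch below, field-marker comments become `S/…` and `E/…` tokens and every other tag or comment is discarded. Dropping the trailing `-note` from the field name is an assumption made only to match the `S/amendment` token shown on the slides.

```python
import re

FIELD_MARK = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")

def mark_and_strip(tokens):
    """Turn field markers into S/ and E/ tokens, then drop remaining tags."""
    out = []
    for tok in tokens:
        match = FIELD_MARK.fullmatch(tok)
        if match:
            kind, name = match.groups()
            if name.endswith("-note"):
                name = name[: -len("-note")]  # assumed naming rule
            out.append(("S/" if kind == "start" else "E/") + name)
        elif tok.startswith("<"):
            continue  # any other comment or XHTML tag is removed
        else:
            out.append(tok)
    return out

marked = mark_and_strip([
    "<!-- field-start:amendment-note -->",
    '<h4 class="note-head">', "Amendments", "</h4>",
    '<p class="note-body">',
    "2002", "Pub", "L", "107", "296", "Substituted", "Department", "of",
    "<!-- field-end:amendment-note -->",
])
```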
  22. Token Processing 4
      • Tag Original Case & Lower Case
      Before: S/amendment Amendments 2002 Pub L 107 296 Substituted Department of E/amendment
      After: S/amendment O/Amendments L/amendments O/2002 L/2002 O/Pub L/pub O/L L/l O/107 L/107 O/296 L/296 O/Substituted L/substituted O/Department L/department O/of L/of E/amendment
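One way to picture the case-tagging step: each word yields an `O/` token in its original case and an `L/` token in lower case. The sketch stacks both variants in one list per token position, which is an assumption; the slides do not say how positions are assigned.

```python
def tag_case(tokens):
    """Return one list of token variants per position."""
    positions = []
    for tok in tokens:
        if tok.startswith(("S/", "E/")):
            positions.append([tok])  # field markers pass through unchanged
        else:
            positions.append(["O/" + tok, "L/" + tok.lower()])
    return positions

positions = tag_case(["S/amendment", "Amendments", "2002", "Pub", "E/amendment"])
```

Keeping both variants in the index is what lets an exact-case query match `O/` tokens while ordinary queries match `L/` tokens.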
  23. Token Processing 5
      • Lemmatize
      • Uses dictionary-based lemmatizer based on GCIDE and WordNet
      Before: S/amendment O/Amendments L/amendments O/2002 L/2002 O/Pub L/pub O/L L/l O/107 L/107 O/296 L/296 O/Substituted L/substituted O/Department L/department O/of L/of E/amendment
      After: S/amendment O/Amendments L/amendments amendment O/2002 L/2002 2002 O/Pub L/pub pub O/L L/l l O/107 L/107 107 O/296 L/296 296 O/Substituted L/substituted substitute O/Department L/department department O/of L/of of E/amendment
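As a sketch of the lemmatization step, the code below adds an untagged lemma next to each `L/` token. `TOY_LEMMAS` is a two-entry stand-in for the GCIDE/WordNet dictionary, which is not reproduced here; words missing from it are used as their own lemma, matching `2002` and `of` on the slide.

```python
# Hypothetical stand-in for the real dictionary-based lemmatizer.
TOY_LEMMAS = {"amendments": "amendment", "substituted": "substitute"}

def add_lemmas(positions):
    """Append the lemma of each L/ token to the same position."""
    out = []
    for variants in positions:
        extended = list(variants)
        for tok in variants:
            if tok.startswith("L/"):
                word = tok[2:]
                extended.append(TOY_LEMMAS.get(word, word))
        out.append(extended)
    return out

lemmatized = add_lemmas([
    ["S/amendment"],
    ["O/Amendments", "L/amendments"],
    ["O/2002", "L/2002"],
    ["E/amendment"],
])
```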
  24. Part The Third: Query Processing
  25. Query Processing
      [Pipeline diagram. Labels: parse, mark phrases, lemmatize, query template, build lucene query, mark exact:, Query String, search (not all stages shown)]
      • Communicates via generic QNode Class
          • Simpler to manipulate than Lucene operators
      • Can produce FAST FQL as well
          • (cue the derisive catcalls)
      • But most importantly:
          • It is a Query Processing Pipeline
              • Mix and match query processing modules
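The slides name a generic QNode class but do not show it. The sketch below is one plausible shape for such a node plus a pipeline runner; every attribute and function name here (`op`, `value`, `children`, `run_pipeline`, `map_terms`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class QNode:
    op: str                      # e.g. "and", "phrase", "term"
    value: str = ""              # term text, empty for operators
    children: List["QNode"] = field(default_factory=list)

Stage = Callable[[QNode], QNode]

def run_pipeline(root: QNode, stages: List[Stage]) -> QNode:
    """Apply each query-processing stage in order."""
    for stage in stages:
        root = stage(root)
    return root

def map_terms(fn: Callable[[str], str]) -> Stage:
    """Build a stage that rewrites the text of every term node."""
    def stage(node: QNode) -> QNode:
        if node.op == "term":
            return QNode("term", fn(node.value))
        return QNode(node.op, node.value, [stage(c) for c in node.children])
    return stage

tree = QNode("and", children=[QNode("term", "Top"), QNode("term", "Secret")])
result = run_pipeline(tree, [map_terms(str.lower), map_terms(lambda w: "L/" + w)])
```

Because each stage is just a function from tree to tree, stages can be reordered or swapped, which is the mix-and-match property the slide emphasizes.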
  26. Query Processing
      [Pipeline diagram. Labels: parse, mark lowercase, lemmatize, query template, build lucene query, mark original, Query String, search]
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
        and
          exact: |FOIA|
          phrase |top| |secret|
          amendment: |RECORDS|
  27. Query Processing
      [Pipeline diagram. Labels: parse, mark lowercase, lemmatize, query template, build lucene query, mark original, Query String, search]
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
        and
          O/FOIA
          phrase |top| |secret|
          amendment: |RECORDS|
  28. Query Processing
      [Pipeline diagram. Labels: parse, mark lowercase, lemmatize, query template, build lucene query, mark original, Query String, search]
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
        and
          O/FOIA
          phrase |L/top| |L/secret|
          amendment: |records|
  29. Query Processing
      [Pipeline diagram. Labels: parse, mark lowercase, lemmatize, query template, build lucene query, mark original, Query String, search]
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
        and
          O/FOIA
          phrase |L/top| |L/secret|
          amendment: |record|
  30. Query Processing
      [Pipeline diagram. Labels: parse, mark lowercase, lemmatize, query template, build lucene query, mark original, Query String, search]
      Query string: exact:FOIA “top secret” amendment:RECORDS
      Query tree:
        and
          O/FOIA
          phrase |L/top| |L/secret|
          between S/amendment E/amendment |record|
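The walk-through on slides 26 to 30 can be condensed into one recursive rewrite. This is a sketch only: the real pipeline runs these as separate stages, the tuple-based tree is a stand-in for QNode, and `TOY_LEMMAS` replaces the actual lemmatizer.

```python
# Hypothetical stand-in for the dictionary-based lemmatizer.
TOY_LEMMAS = {"records": "record"}

def rewrite(node):
    """Collapse the mark, lemmatize, and template stages into one pass."""
    op = node[0]
    if op == "exact":      # exact:FOIA becomes an original-case token
        return ("term", "O/" + node[1])
    if op == "phrase":     # phrase words become lower-case tokens
        return ("phrase", ["L/" + word.lower() for word in node[1]])
    if op == "term":       # plain words are lower-cased, then lemmatized
        word = node[1].lower()
        return ("term", TOY_LEMMAS.get(word, word))
    if op == "field":      # amendment:X becomes between(S/.., E/.., X)
        name, child = node[1], node[2]
        return ("between", "S/" + name, "E/" + name, rewrite(child))
    if op == "and":
        return ("and", [rewrite(child) for child in node[1]])
    raise ValueError("unknown operator: " + op)

# Parsed form of: exact:FOIA "top secret" amendment:RECORDS
parsed = ("and", [
    ("exact", "FOIA"),
    ("phrase", ["top", "secret"]),
    ("field", "amendment", ("term", "RECORDS")),
])
final_tree = rewrite(parsed)
```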
  31. The between() Operator
      • between(start-tag, end-tag, pos-clause, neg-clause)
      • start-tag: Starting tag, e.g. “S/amendment”
      • end-tag: Ending tag, e.g. “E/amendment”
      • pos-clause: words which must occur between start and end
          • Note: Requires a nested ScanAnd() operator
      • neg-clause: words which must not occur between start and end
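To make the semantics concrete, here is a toy evaluation of between() over a flat token list. It is not the Lucene operator from the talk and does not model the nested ScanAnd(); it assumes that every positive word must appear inside a single start/end span and that no negative word may appear in that span.

```python
def between(tokens, start_tag, end_tag, pos, neg=()):
    """True if one start..end span holds all pos words and no neg words."""
    inside = None  # words seen since the last start tag, or None if outside
    for tok in tokens:
        if tok == start_tag:
            inside = set()
        elif tok == end_tag and inside is not None:
            if set(pos) <= inside and not (set(neg) & inside):
                return True
            inside = None
        elif inside is not None:
            inside.add(tok)
    return False

doc = ["S/amendment", "amendment", "2002", "record", "E/amendment",
       "S/statute", "secret", "E/statute"]
hit = between(doc, "S/amendment", "E/amendment", ["record"])
miss = between(doc, "S/amendment", "E/amendment", ["secret"])
blocked = between(doc, "S/amendment", "E/amendment", ["record"], ["2002"])
```

The second call fails because `secret` occurs only in the statute span, which is the field-scoping behavior the operator exists to provide.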
  32. Part the Fourth: Hierarchical Navigation
  33. screenshot
  34. Hierarchies: Requirements
      • Any number of levels
          • Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
      • Levels vary across titles
          • Title 1: 3 levels
          • Title 26: 8 levels
      • Multiple views:
          • Children
          • Ancestors
          • Ancestor’s Siblings
      • Multiple search scopes:
          • Only children, all descendants, everything
  35. Hierarchies: Ancestor-Siblings
      • US-Code
          • Title 1
          • Title 2
              • Chapter 1
              • Chapter 2
                  • Part 1
                  • Part 2
                      • Section 2.1
                      • Section 2.2
                  • Part 3
                  • Part 4
              • Chapter 3
              • Chapter 4
          • Title 3
  36. Hierarchies: Fields
      • ancestors
          • Searching
              • USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
      • encodedAncestors – for display only
          • Where the node exists within the hierarchy
              • id;heading;subjectTitle//id;heading;subjectTitle//...
              • USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform// USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
      • parentId – ID of the parent node
              • USC-title2-chapter25-subchapter2
      • treesort – Hierarchical sort field, e.g. “13/000/0/00882”
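A small sketch of how the three hierarchy fields might be derived from a node's ancestor path. The input shape (a root-first list of `(id, heading, subjectTitle)` tuples) and the function name `hierarchy_fields` are assumptions; only the field names and separators come from the slide.

```python
def hierarchy_fields(ancestor_path):
    """Build ancestors, encodedAncestors, and parentId for one node.

    ancestor_path lists (id, heading, subjectTitle) tuples, root first.
    """
    ids = [node_id for node_id, _heading, _subject in ancestor_path]
    return {
        # Space-separated ids, searchable as individual tokens.
        "ancestors": " ".join(ids),
        # Display-only encoding: id;heading;subjectTitle joined by //.
        "encodedAncestors": "//".join(";".join(node) for node in ancestor_path),
        # The nearest ancestor is the parent.
        "parentId": ids[-1] if ids else "",
    }

path = [
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
]
node_fields = hierarchy_fields(path)
```

The example path starts at the chapter because the slide gives no heading text for the `USC` and `USC-title2` levels.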
  37. Hierarchies: Tree Sort
      • Sorting In Print Order
          • Front Matter → Titles → Tables → etc.
          • Everything padded to fixed-length
      Example: 01/011/1/02032
          • 01 = USC Title
          • 011 = Title 11
          • 1 = An Appendix
          • 02032 = Sequence # in file
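The fixed-length padding is what lets a plain string sort reproduce print order. A minimal sketch, with hypothetical parameter names and field widths taken from the example key:

```python
def treesort_key(section_code, title, is_appendix, sequence):
    """Build a zero-padded key so string order equals print order."""
    return f"{section_code:02d}/{title:03d}/{int(is_appendix)}/{sequence:05d}"

key = treesort_key(1, 11, True, 2032)

# Without padding, "01/11/..." would sort after "01/100/..." as a string.
keys = [
    treesort_key(1, 100, False, 7),
    treesort_key(1, 11, True, 2032),
    treesort_key(1, 2, False, 15),
]
in_print_order = sorted(keys)
```

After sorting, Title 2 comes first, then Title 11, then Title 100.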
  38. Hierarchies: Sample Searches
      • Assuming Node = “USC-title2-chapter25”
      • Search Children
          • parentId:USC-title2-chapter25
      • Search All Descendants
          • ancestors:USC-title2-chapter25
      • Ancestor Siblings
          • (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
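The three sample searches can be generated from a node ID alone, assuming (as the examples suggest, but the slides do not state) that each ID is its parent's ID plus one hyphen-separated segment. The helper names below are hypothetical.

```python
def children_query(node_id):
    """Match only the direct children of node_id."""
    return f"parentId:{node_id}"

def descendants_query(node_id):
    """Match every node that lists node_id among its ancestors."""
    return f"ancestors:{node_id}"

def ancestor_siblings_query(node_id):
    """Match the children of node_id and of each of its ancestors."""
    parts = node_id.split("-")
    # Assumed ID scheme: every hyphen-delimited prefix is an ancestor ID.
    prefixes = ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return "(" + " OR ".join(f"parentId:{p}" for p in prefixes) + ")"

node = "USC-title2-chapter25"
queries = {
    "children": children_query(node),
    "descendants": descendants_query(node),
    "ancestor_siblings": ancestor_siblings_query(node),
}
```

Where IDs do not follow this prefix scheme, the prefixes could instead be read from the node's own `ancestors` field.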
  39. Contact
      • Paul Nelson
          • [email_address]
      • Ronald Matamoros
          • [email_address]
      • Search Technologies
          • http://searchtechnologies.com
