TEXT TAGGING WITH FINITE STATE
TRANSDUCERS
David Smiley
Software Systems Engineer, Lead
Lucene/Solr Revolution 2013
© 2012 The MITRE Corporation. All rights reserved.
About David Smiley
• Working at MITRE for 13 years
• Web development, Java, search
• Published the first book on Solr, then its 2nd edition (2009, 2011)
• Apache Lucene / Solr committer and PMC member (2012)
• Presented at Lucene Revolution (2010) and the Basis O.S. Search Conference (2011, 2012)
• Taught Solr classes at MITRE (2010, 2011, 2012)
• Solr search consultant within MITRE and for its sponsors, and privately
What are “Text Tagging” and “FSTs”?
• First, I need to establish the context:
  • JIEDDO’s OpenSextant project
  • Though this presentation is not about OpenSextant or geotagging
• Ultimately, I want to convey how cool Lucene’s FSTs are
  • And you may have a need for a text tagger
  • Or a geotagger (like OpenSextant)
OpenSextant
A DoD Funded Project: JIEDDO/COIC & NGA
Open Source approval recently obtained
OpenSextant Project
• A geotagging solution for unstructured text
• Finds place name references in natural language
  • “… I live near Boston …”
  • Finds “Boston” along with its character offsets in the input
  • Often resolves to multiple gazetteer entries: “Boston” has 73
• What’s a gazetteer?
  • A dictionary of place names with metadata like latitude & longitude
How does it work?
The “Naïve” Tagger
• AKA “Text Tagger”
• Simply consults a dictionary/gazetteer; no fancy NLP
  • There’s nothing geospatial about it
  • Subsequent NLP processing eliminates low-confidence tags
• Actually, not so simple
  • Names vary in word length
  • Must find overlapping names, but not names within names
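One way to read “overlapping names, but not names within names” is sketched below in Python. This is an interpretation, not the OpenSextant code: the function name `drop_nested` and the word-index spans are made up for illustration. A span that lies entirely inside another span is dropped, while spans that merely overlap are both kept.

```python
def drop_nested(spans):
    """Keep (start, end) spans that are not fully contained in a different span."""
    kept = []
    for span in spans:
        contained = any(
            other != span and other[0] <= span[0] and span[1] <= other[1]
            for other in spans
        )
        if not contained:
            kept.append(span)
    return kept


# Hypothetical candidates over the words "New York City Hall" (end is exclusive):
# (0, 3) = "New York City", (1, 2) = "York", (2, 4) = "City Hall"
candidates = [(0, 3), (1, 2), (2, 4)]
result = drop_nested(candidates)
print(result)
```

Here “York” sits inside “New York City” and is discarded, while “City Hall” only overlaps it and survives.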
The Gazetteer
• 13 million place name records
• 8.1M distinct place names
  • Why not 13M?
    • Ambiguous names (e.g. San Diego)
    • Text analysis normalization (e.g. diacritic removal)
• 2.8M are single-word names (about 1/3)
• 2.3 words per name on average
• 14 characters per name on average
3 Naïve Tagger Implementations
• GATE’s Tagger
  • In-memory Aho-Corasick string-matching algorithm
  • Requires an estimated 80 GB RAM!! (for our data)
  • FAST
• A JIEDDO-developed MySQL-based Tagger
  • “Reasonable” RAM requirements: ~4 GB
  • SLOW (~15x to 20x slower? not certain); ~1 doc/second
• A JIEDDO-developed Solr/FST-based Tagger …
Finite State Transducers
Applied to text tagging
Finite State Automata (FSA)
• SortedSet<char[]>:
  • mop, moth, pop, slop, sloth, stop, top
Note: a “trie” data structure is similar but shares only prefixes; an FSA also shares suffixes
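To make the difference between sharing prefixes and sharing suffixes concrete, here is a minimal Python sketch. It is not Lucene code, and the helper names `build_trie`, `count_nodes`, and `canonical_id` are made up for illustration. It builds a plain trie over the seven example words, then merges nodes that accept exactly the same set of suffixes, which is the end result a minimal automaton reaches. Lucene builds its automaton incrementally from sorted input rather than by minimizing a full trie, so this only illustrates the outcome.

```python
def build_trie(words):
    """Plain trie: only common prefixes are shared."""
    root = {"final": False, "edges": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["edges"].setdefault(ch, {"final": False, "edges": {}})
        node["final"] = True
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node["edges"].values())


def canonical_id(node, registry):
    """Give nodes with identical outgoing behaviour the same id (bottom-up)."""
    signature = (
        node["final"],
        tuple(sorted((ch, canonical_id(child, registry))
                     for ch, child in node["edges"].items())),
    )
    return registry.setdefault(signature, len(registry))


words = ["mop", "moth", "pop", "slop", "sloth", "stop", "top"]
trie = build_trie(words)
registry = {}
canonical_id(trie, registry)

trie_nodes = count_nodes(trie)
automaton_states = len(registry)
print(trie_nodes, automaton_states)
```

For these seven words the trie needs 21 nodes, while merging equivalent suffixes leaves 8 states.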
Finite State Transducer (FST)
• Adds optional output to each arc
• SortedMap<char[],int>
  • mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
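The idea of putting outputs on arcs can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Lucene's FST class: it uses an unminimized trie, the names `Node`, `build`, `lookup`, and `get_by_output` are hypothetical, and it assumes the keys arrive sorted with non-negative, increasing values. The value of a key is the sum of the outputs along its path, and because the values increase with the sort order the same structure can also be walked in reverse, from value back to key.

```python
class Node:
    """One state; arcs maps a character to (output, next_node)."""
    def __init__(self):
        self.arcs = {}
        self.final = False


def build(sorted_pairs):
    """Build from (word, value) pairs sorted by word, with increasing values."""
    root = Node()
    for word, value in sorted_pairs:
        node, remaining = root, value
        for ch in word:
            if ch in node.arcs:
                output, child = node.arcs[ch]
                remaining -= output              # already carried by this arc
            else:
                child = Node()
                node.arcs[ch] = (remaining, child)  # new arc carries what is left
                remaining = 0
            node = child
        node.final = True
    return root


def lookup(root, word):
    """Sum the arc outputs along the path; None if the word is absent."""
    node, total = root, 0
    for ch in word:
        if ch not in node.arcs:
            return None
        output, node = node.arcs[ch]
        total += output
    return total if node.final else None


def get_by_output(root, target):
    """Reverse lookup: follow the arc with the largest output that still fits."""
    node, remaining, chars = root, target, []
    while True:
        if node.final and remaining == 0:
            return "".join(chars)
        fits = [(out, ch) for ch, (out, _) in node.arcs.items() if out <= remaining]
        if not fits:
            return None
        out, ch = max(fits)
        remaining -= out
        chars.append(ch)
        node = node.arcs[ch][1]


words = ["mop", "moth", "pop", "slop", "sloth", "stop", "top"]
fst = build([(w, i) for i, w in enumerate(words)])
print(lookup(fst, "sloth"), get_by_output(fst, 4))
```

In this sketch the arc for “s” out of the start state carries 3, and the “t” arc after “slo” carries 1, so “sloth” sums to 4.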
Lucene’s FST Implementation
• FST encoded as a byte[]
  • Memory efficient! And fast to load from disk.
• Write-once API (immutable)
• Build minimal, acyclic FST from pre-sorted inputs
  • Fast (linear time with input size), low memory
  • Optional two-pass packing can shrink by ~25%
• SortedMap<int[],T>: arcs are sorted by label
• getByOutput also possible if outputs are sorted
• http://s.apache.org/LuceneFSTs
Based on a Mihov & Maurel paper, 2001
FSTs and Text Tagging
• My approach involves two layers of FSTs:
  • A word dictionary FST to hold each unique word
    • Enables using integers as substitutes for char[]
    • Via getByOutput(12345) -> “New”
    • Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345
  • A word phrase FST composed of word id string keys
    • Ex: “New York City” -> [12345, 5522111, 345]
    • Values are arrays of gazetteer primary keys
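The two layers can be mimicked with ordinary Python collections to show how the keys are encoded. This is a sketch only: a sorted list and a dict stand in for the two FSTs, the gazetteer keys are made up, and `analyze`, `word_to_id`, and `id_to_word` are hypothetical helper names.

```python
from bisect import bisect_left

# Made-up gazetteer: place name -> primary keys of the matching records.
names = {
    "New York City": [1001],
    "New York": [1002, 1003],
    "York": [1004],
}


def analyze(text):
    """Stand-in for text analysis: lowercase and split on whitespace."""
    return text.lower().split()


# Layer 1: word dictionary. A word's id is its position in the sorted list.
words = sorted({w for name in names for w in analyze(name)})


def word_to_id(word):
    i = bisect_left(words, word)
    return i if i < len(words) and words[i] == word else None


def id_to_word(word_id):
    return words[word_id]


# Layer 2: phrases keyed by sequences of word ids, valued by gazetteer keys.
phrases = {}
for name, keys in names.items():
    phrase = tuple(word_to_id(w) for w in analyze(name))
    phrases.setdefault(phrase, []).extend(keys)

print(words)
print(phrases)
```

The phrase layer never stores characters, only small integers, which is why the word dictionary has to support lookups in both directions.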
Memory Use
• Word Dict FST:
  • 3.3M words with ordinal ids in 26 MB of RAM
• Name Phrase FST:
  • 8.1M word id phrases in 90 MB of RAM
  • Plus 82 MB of arrays of gazetteer primary key ids
• Total: 198 MB (compare to 80 GB for GATE’s Aho-Corasick)
• Building it consumes ~1.5 GB of Java heap, for 2 minutes
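The figures above can be checked with a little arithmetic. The sketch below assumes decimal units (1 MB = 1,000,000 bytes, 1 GB = 1,000 MB), which the slides do not state, so the per-entry numbers are approximate.

```python
MB = 1_000_000  # assumption: decimal megabytes

word_dict_fst = 26 * MB
phrase_fst = 90 * MB
key_arrays = 82 * MB

total = word_dict_fst + phrase_fst + key_arrays
bytes_per_word = word_dict_fst / 3.3e6      # 3.3M distinct words
bytes_per_phrase = phrase_fst / 8.1e6       # 8.1M word id phrases
gate_ratio = (80 * 1000 * MB) / total       # versus the 80 GB estimate

print(total // MB, round(bytes_per_word, 1), round(bytes_per_phrase, 1), round(gate_ratio))
```

Under these assumptions each word costs roughly 8 bytes and each phrase roughly 11 bytes in its FST, and the whole structure is about 400 times smaller than the 80 GB estimate.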
Experimental measurements
• Single FST Experiment
  • 1 FST of analyzed character word phrase -> int id
  • “new york city” -> 6344207
  • Theory: more than 2x the memory
  • Result: 69 MB! (compare to 26 + 90 = 116 MB): a 41% reduction
• Retrospective: What I would have done differently
  • Index a field of concatenated terms (custom TokenFilter). More disk needed but reduces build time & memory requirements. Unclear effect on tagging performance.
  • Potential to use MemoryPostingsFormat, a Lucene Codec that uses an FST internally + vInt doc ids, instead of custom FST code.
Tagging Algorithm
It’s complicated! Single-pass (streaming) algorithm
• For each input word, look up its ordinal id, then:
  1. Create an FST arc iterator for the name phrase FST
  2. Append the iterator onto a queue of active ones
  3. Try to advance all iterators
     • Remove those that don’t advance
Iterator linked-list queue:
  Head: New, York, City ✔
  Head+1: York, City
  Head+2: City …
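The three steps above can be sketched in Python as follows. This is a simplified illustration, not the SolrTextTagger implementation: a nested dict stands in for the name phrase FST, a plain list stands in for the linked-list queue, and the names `build_phrase_trie` and `tag` as well as the word ids and gazetteer keys are made up. It reports every match it finds and leaves any filtering of overlapping or nested matches to a later step.

```python
def build_phrase_trie(phrases):
    """phrases: dict mapping a tuple of word ids to a list of gazetteer keys."""
    root = {"next": {}, "ids": None}
    for key, ids in phrases.items():
        node = root
        for word_id in key:
            node = node["next"].setdefault(word_id, {"next": {}, "ids": None})
        node["ids"] = ids
    return root


def tag(words, word_ids, trie):
    """Single pass over the words; returns (start_word, end_word, ids) tuples."""
    matches = []
    active = []                                # cursors: (start index, trie node)
    for i, word in enumerate(words):
        word_id = word_ids.get(word)           # None for words not in the dictionary
        active.append((i, trie))               # steps 1 and 2: a new cursor joins the queue
        advanced = []
        for start, node in active:             # step 3: try to advance every cursor
            nxt = node["next"].get(word_id) if word_id is not None else None
            if nxt is None:
                continue                       # cursor did not advance: drop it
            if nxt["ids"] is not None:
                matches.append((start, i + 1, nxt["ids"]))
            advanced.append((start, nxt))
        active = advanced
    return matches


word_ids = {"city": 0, "new": 1, "york": 2}
trie = build_phrase_trie({
    (1, 2): [101],        # "new york"
    (1, 2, 0): [102],     # "new york city"
    (2,): [103],          # "york"
})
words = "i live near new york city".split()
matches = tag(words, word_ids, trie)
print(matches)
```

Each input word is examined once; at any moment the queue holds one cursor per phrase that could still be in progress, mirroring the Head, Head+1, Head+2 picture.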
Speed Benchmarks
Tagger                             Docs/Sec   RAM (GB)
OpenSextant: GATE Tagger           ?          80
OpenSextant: MySQL-based Tagger    1.1        4
OpenSextant: Solr/FST Tagger       15.9       2*
Measures single-threaded performance of geotagging 428 documents in the “ACE” collection. OpenSextant tests all had the same gazetteer.
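As a quick sanity check on the two measured rows, the arithmetic below derives the relative speedup and the approximate wall-clock time for the 428-document collection. It uses only the numbers in the table; the variable names are illustrative.

```python
docs = 428
mysql_rate = 1.1      # docs/sec, MySQL-based tagger
fst_rate = 15.9       # docs/sec, Solr/FST tagger

speedup = fst_rate / mysql_rate
mysql_seconds = docs / mysql_rate
fst_seconds = docs / fst_rate

print(round(speedup, 1), round(mysql_seconds, 1), round(fst_seconds, 1))
```

That works out to roughly a 14x speedup, or about six and a half minutes versus under half a minute for the whole collection.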
Integrated with Solr
• As a custom Solr Request Handler
• Builds the FSTs from the index (the gazetteer)
• Configurable
  • Text analysis (e.g. phonetic)
  • Exclude gazetteer docs by configured query
  • Optional partial word phrase matching
  • Optional sub-tags tagging
• Solr integration benefits
  • Solr as a taxonomy manager! Web-service, searchable, scalable, easy to update, …
~$ curl -XPOST 'http://localhost:8983/solr/tag?fl=*&wt=json&indent=2' -H 'Content-Type:text/plain' -d "I live near Boston"
{
  "responseHeader":{
    "status":0,
    "QTime":1898},
  "tagsCount":1,
  "tags":[[
    "startOffset",12,
    "endOffset",18,
    "ids",[1190927,
      1099063,
      2562742,
      2667203,
      2684629,
      2695904,
      2653982,
      2657690,
      2585165,
      2597292,
      …
      … 11890986,
      11891415]]],
  "matchingDocs":{"numFound":73,"start":0,"docs":[
    {
      "id":12719030,
      "place_id":"USGS1893700",
      "name":"Boston",
      "lat":65.01667,
      "lon":-163.28333,
      "feat_class":"L",
      "feat_code":"AREA",
      "FIPS_cc":"US",
      "ISO_cc":["US"],
      "cc":"US",
      "ISO3_cc":"USA",
      "adm1":"US02",
      "adm2":"US02.0180",
      "name_bias":0.0,
      "id_bias":0.04,
      "geo":"65.01667,-163.28333"},
    …
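In the response above, each entry of "tags" is a flat list that alternates keys and values rather than a JSON object. A client therefore has to pair them up before use. The Python sketch below shows one way to do that; it only parses a response string and does not contact a server. The function names are hypothetical, and the sample response is an abridged version of the one shown (only the first two ids are kept).

```python
import json


def pairs_to_dict(flat):
    """Turn ["k1", v1, "k2", v2, ...] into {"k1": v1, "k2": v2, ...}."""
    return dict(zip(flat[0::2], flat[1::2]))


def extract_tags(response_text, original_text):
    """Return one dict per tag, with the matched substring added as "text"."""
    body = json.loads(response_text)
    tags = []
    for flat in body.get("tags", []):
        tag = pairs_to_dict(flat)
        tag["text"] = original_text[tag["startOffset"]:tag["endOffset"]]
        tags.append(tag)
    return tags


# Abridged sample in the same shape as the response shown above.
sample = """
{"responseHeader": {"status": 0, "QTime": 1898},
 "tagsCount": 1,
 "tags": [["startOffset", 12, "endOffset", 18, "ids", [1190927, 1099063]]]}
"""
tags = extract_tags(sample, "I live near Boston")
print(tags)
```

The offsets index into the text that was posted, so slicing the original input with them recovers the matched name.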
Where can you get this?
• https://github.com/openSextant/SolrTextTagger
• An independent module of OpenSextant
  • Might seek incubator status at http://www.osgeo.org
• Includes documentation, tests
Concluding Remarks
• Lucene FSTs are awesome!
  • Great for storing large numbers of strings in memory
  • Or other string-like data: e.g. IP addresses, geohashes
  • The API is hard to use, however
• The text tagger should be useful independent of OpenSextant
  • Tag people/org names or special keywords
  • Might be ported to Lucene as an alternative to its synonym token filter
• I’ve got an idea on applying these concepts to Lucene “Shingling” as a codec to make it more scalable
CONTACT
David Smiley
dsmiley@mitre.org

More Related Content

What's hot

Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfSease
 
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015DevOpsDays Tel Aviv
 
より深く知るオプティマイザとそのチューニング
より深く知るオプティマイザとそのチューニングより深く知るオプティマイザとそのチューニング
より深く知るオプティマイザとそのチューニングYuto Hayamizu
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in NetflixDanny Yuan
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Log analysis with the elk stack
Log analysis with the elk stackLog analysis with the elk stack
Log analysis with the elk stackVikrant Chauhan
 
Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Lucidworks (Archived)
 
[db tech showcase Tokyo 2014] B25: [In-Memory DB: SAP HANA] 障害・災害対策のメカニズム by...
[db tech showcase Tokyo 2014] B25: [In-Memory DB: SAP HANA] 障害・災害対策のメカニズム  by...[db tech showcase Tokyo 2014] B25: [In-Memory DB: SAP HANA] 障害・災害対策のメカニズム  by...
[db tech showcase Tokyo 2014] B25: [In-Memory DB: SAP HANA] 障害・災害対策のメカニズム by...Insight Technology, Inc.
 
Planet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: BigdamPlanet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: BigdamSATOSHI TAGOMORI
 
Elasticsearchベースの全文検索システムFess
Elasticsearchベースの全文検索システムFessElasticsearchベースの全文検索システムFess
Elasticsearchベースの全文検索システムFessShinsuke Sugaya
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detailMIJIN AN
 
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021StreamNative
 
Xfs file system for linux
Xfs file system for linuxXfs file system for linux
Xfs file system for linuxAjay Sood
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Ten Reasons Why You Should Prefer PostgreSQL to MySQL
Ten Reasons Why You Should Prefer PostgreSQL to MySQLTen Reasons Why You Should Prefer PostgreSQL to MySQL
Ten Reasons Why You Should Prefer PostgreSQL to MySQLanandology
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 

What's hot (20)

Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
 
より深く知るオプティマイザとそのチューニング
より深く知るオプティマイザとそのチューニングより深く知るオプティマイザとそのチューニング
より深く知るオプティマイザとそのチューニング
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Log analysis with the elk stack
Log analysis with the elk stackLog analysis with the elk stack
Log analysis with the elk stack
 
Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues
 
[db tech showcase Tokyo 2014] B25: [In-Memory DB: SAP HANA] 障害・災害対策のメカニズム by...
[db tech showcase Tokyo 2014] B25: [In-Memory DB: SAP HANA] 障害・災害対策のメカニズム  by...[db tech showcase Tokyo 2014] B25: [In-Memory DB: SAP HANA] 障害・災害対策のメカニズム  by...
[db tech showcase Tokyo 2014] B25: [In-Memory DB: SAP HANA] 障害・災害対策のメカニズム by...
 
Planet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: BigdamPlanet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: Bigdam
 
Elasticsearchベースの全文検索システムFess
Elasticsearchベースの全文検索システムFessElasticsearchベースの全文検索システムFess
Elasticsearchベースの全文検索システムFess
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive(STG402) Amazon EBS Deep Dive
(STG402) Amazon EBS Deep Dive
 
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
Xfs file system for linux
Xfs file system for linuxXfs file system for linux
Xfs file system for linux
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Ten Reasons Why You Should Prefer PostgreSQL to MySQL
Ten Reasons Why You Should Prefer PostgreSQL to MySQLTen Reasons Why You Should Prefer PostgreSQL to MySQL
Ten Reasons Why You Should Prefer PostgreSQL to MySQL
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 

Viewers also liked

Class 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theoristsClass 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theoriststjcarter
 
Class 5 adult development theories___longer_version
Class 5 adult development theories___longer_versionClass 5 adult development theories___longer_version
Class 5 adult development theories___longer_versiontjcarter
 
Adult development theory
Adult development theoryAdult development theory
Adult development theorycccscoetc
 
類義語検索と類義語ハイライト
類義語検索と類義語ハイライト類義語検索と類義語ハイライト
類義語検索と類義語ハイライトShinichiro Abe
 
Current state and future state using VE
Current state and future state using VECurrent state and future state using VE
Current state and future state using VECharles Palus
 
VE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future statesVE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future statesCharles Palus
 
HMC Conference 2011 Scotland
HMC Conference 2011 ScotlandHMC Conference 2011 Scotland
HMC Conference 2011 ScotlandCharles Palus
 

Viewers also liked (10)

Adult Development
Adult DevelopmentAdult Development
Adult Development
 
Class 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theoristsClass 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theorists
 
Class 5 adult development theories___longer_version
Class 5 adult development theories___longer_versionClass 5 adult development theories___longer_version
Class 5 adult development theories___longer_version
 
Adult Development
Adult Development Adult Development
Adult Development
 
Adult development theory
Adult development theoryAdult development theory
Adult development theory
 
Automata Invasion
Automata InvasionAutomata Invasion
Automata Invasion
 
類義語検索と類義語ハイライト
類義語検索と類義語ハイライト類義語検索と類義語ハイライト
類義語検索と類義語ハイライト
 
Current state and future state using VE
Current state and future state using VECurrent state and future state using VE
Current state and future state using VE
 
VE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future statesVE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future states
 
HMC Conference 2011 Scotland
HMC Conference 2011 ScotlandHMC Conference 2011 Scotland
HMC Conference 2011 Scotland
 

Similar to Text tagging with finite state transducers

Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Christopher Biow
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"Jihyun Ahn
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
MongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' MeetupMongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' MeetupMongoDB
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfakAsfak Mahamud
 
eXtensible Markup Language (XML)
eXtensible Markup Language (XML)eXtensible Markup Language (XML)
eXtensible Markup Language (XML)Serhii Kartashov
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at nightMichael Yarichuk
 
Hatkit Project - Datafiddler
Hatkit Project - DatafiddlerHatkit Project - Datafiddler
Hatkit Project - Datafiddlerholiman
 
Clojure talk at Münster JUG
Clojure talk at Münster JUGClojure talk at Münster JUG
Clojure talk at Münster JUGAlex Ott
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network ProcessingRyousei Takano
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalSpark Summit
 
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPDictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPSujit Pal
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
MongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo SeattleMongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo SeattleMongoDB
 
Component Framework Primer for JSF Users
Component Framework Primer for JSF UsersComponent Framework Primer for JSF Users
Component Framework Primer for JSF UsersAndy Schwartz
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Ravi Okade
 

Similar to Text tagging with finite state transducers (20)

Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"
 
Basics of XML
Basics of XMLBasics of XML
Basics of XML
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
MongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' MeetupMongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfak
 
eXtensible Markup Language (XML)
eXtensible Markup Language (XML)eXtensible Markup Language (XML)
eXtensible Markup Language (XML)
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at night
 
Open source Technology
Open source TechnologyOpen source Technology
Open source Technology
 
Hatkit Project - Datafiddler
Hatkit Project - DatafiddlerHatkit Project - Datafiddler
Hatkit Project - Datafiddler
 
Clojure talk at Münster JUG
Clojure talk at Münster JUGClojure talk at Münster JUG
Clojure talk at Münster JUG
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit Pal
 
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPDictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
MongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo SeattleMongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo Seattle
 
MongoDB @ fliptop
MongoDB @ fliptopMongoDB @ fliptop
MongoDB @ fliptop
 
Component Framework Primer for JSF Users
Component Framework Primer for JSF UsersComponent Framework Primer for JSF Users
Component Framework Primer for JSF Users
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Text tagging with finite state transducers

  • 1. TEXT TAGGING WITH FINITE STATE TRANSDUCERS David Smiley Software Systems Engineer, Lead
  • 2. Text Tagging with Finite State Transducers David Smiley Lucene/Solr Revolution 2013 © 2012 The MITRE Corporation. All rights reserved.
  • 3. About David Smiley
      - Working at MITRE for 13 years
          - Web development, Java, search
      - Published 1st book on Solr; then 2nd edition (2009, 2011)
      - Apache Lucene / Solr committer/PMC member (2012)
      - Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)
      - Taught Solr classes at MITRE (2010, 2011, 2012)
      - Solr search consultant within MITRE and its sponsors, and privately
  • 4. What is “Text Tagging” and “FSTs”?
      - First, I need to establish the context:
          - JIEDDO’s OpenSextant project
          - Though this presentation is not about OpenSextant or geotagging
      - Ultimately, I want to convey how cool Lucene’s FSTs are
      - And you may have a need for a text tagger
          - Or a geotagger (like OpenSextant)
  • 5. OpenSextant A DoD Funded Project: JIEDDO/COIC & NGA Open Source approval recently obtained
  • 6. OpenSextant Project
      - A geotagging solution for unstructured text
      - Finds place name references in natural language
          - “… I live near Boston …”
          - Finds “Boston” with input character offset #s
          - Often resolves to multiple gazetteer entries: “Boston” has 73
      - What’s a Gazetteer?
          - A dictionary of place names with metadata like latitude & longitude
  • 7.
  • 8. How does it work?
  • 9. The “Naïve” Tagger
      - AKA “Text Tagger”
      - Simply consults a dictionary/gazetteer; no fancy NLP
          - There’s nothing geospatial about it
      - Subsequent NLP processing eliminates low-confidence tags
      - Actually, not so simple
          - Names vary in word length
          - Must find overlapping names
          - but not names within names
  • 10. The Gazetteer
      - 13 million place name records
      - 8.1M distinct place names
          - Why not 13M?
          - Ambiguous names (e.g. San Diego)
          - Text analysis normalization (e.g. diacritic removal, etc.)
      - 2.8M are single-word names (1/3rd)
      - 2.3 avg. words / name
      - 14 avg. chars / name
  • 11. 3 Naïve Tagger Implementations
      - GATE’s Tagger
          - In-memory Aho-Corasick string-matching algorithm
          - Requires an estimated 80 GB RAM !! (for our data)
          - FAST
      - A JIEDDO developed MySQL based Tagger
          - “Reasonable” RAM requirements ~4GB
          - SLOW (~15x, 20x? not certain). ~1 doc/second
      - A JIEDDO developed Solr/FST based Tagger …
  • 13. Finite State Automata (FSA)
      - SortedSet<char[]>:
          - mop, moth, pop, slop, sloth, stop, top
      - Note: a “Trie” data structure is similar but only shares prefixes
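To make the trie-versus-FSA note concrete, here is a minimal Python sketch (not Lucene code) that builds a plain trie for the seven example words and then merges equivalent subtrees, so that shared suffixes collapse as well as shared prefixes. The function names (`build_trie`, `count_trie_nodes`, `minimize`) are hypothetical, and minimizing after the fact is a simplification chosen for brevity; it is not how Lucene constructs its automata.

```python
WORDS = ["mop", "moth", "pop", "slop", "sloth", "stop", "top"]

def build_trie(words):
    """Plain trie: one node per distinct prefix (prefix sharing only)."""
    root = {"final": False, "arcs": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["arcs"].setdefault(ch, {"final": False, "arcs": {}})
        node["final"] = True
    return root

def count_trie_nodes(node):
    """Count every node in the trie, including the root."""
    return 1 + sum(count_trie_nodes(child) for child in node["arcs"].values())

def minimize(node, registry):
    """Assign one state id per distinct subtree shape.

    Two nodes are equivalent when they agree on finality and their
    labelled arcs lead to equivalent states, so identical suffixes
    end up sharing a single state.
    """
    signature = (
        node["final"],
        tuple(sorted((ch, minimize(child, registry))
                     for ch, child in node["arcs"].items())),
    )
    return registry.setdefault(signature, len(registry))

trie = build_trie(WORDS)
registry = {}
minimize(trie, registry)

trie_nodes = count_trie_nodes(trie)   # nodes when only prefixes are shared
fsa_states = len(registry)            # states when suffixes are shared too
print(trie_nodes, fsa_states)
```

For these seven words the trie needs 21 nodes, while the merged automaton needs 8 states, because endings such as "op" and "oth" are stored once.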
  • 14. Finite State Transducer (FST)
      - Adds optional output to each arc
      - SortedMap<char[],int>
          - mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
  • 15. Lucene’s FST Implementation
      - FST encoded as a byte[]
          - Memory efficient! And fast to load from disk.
      - Write-once API (immutable)
      - Build minimal, acyclic FST from pre-sorted inputs
          - Fast (linear time with input size), low memory
          - Optional two-pass packing can shrink by ~25%
      - SortedMap<int[],T>: arcs are sorted by label
          - getByOutput also possible if outputs are sorted
      - http://s.apache.org/LuceneFSTs
      - Based on a Mihov & Maurel paper, 2001
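The idea of "output on each arc" can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration, not Lucene's FST: it uses a trie (prefix sharing only), integer outputs that are summed along the path, and hypothetical names (`Node`, `insert`, `lookup`). When two keys share an arc, the arc keeps the smaller output and the difference is pushed further down, which is the usual way to keep the per-key sums correct.

```python
class Node:
    def __init__(self):
        self.arcs = {}          # label -> [arc_output, child Node]
        self.final = False
        self.final_output = 0   # output left over when a key ends here

def insert(root, word, value):
    """Add word -> value, keeping shared arcs consistent for older keys."""
    node = root
    for ch in word:
        if ch in node.arcs:
            arc = node.arcs[ch]
            common = min(arc[0], value)
            excess = arc[0] - common
            if excess:
                # Push the part not shared with the new key down one level.
                child = arc[1]
                for child_arc in child.arcs.values():
                    child_arc[0] += excess
                if child.final:
                    child.final_output += excess
                arc[0] = common
            value -= common
            node = arc[1]
        else:
            child = Node()
            node.arcs[ch] = [value, child]   # remaining output goes on the new arc
            value = 0
            node = child
    node.final = True
    node.final_output = value

def lookup(root, word):
    """Sum the outputs along the path; None if the word is not a key."""
    node, total = root, 0
    for ch in word:
        arc = node.arcs.get(ch)
        if arc is None:
            return None
        total += arc[0]
        node = arc[1]
    return total + node.final_output if node.final else None

WORDS = ["mop", "moth", "pop", "slop", "sloth", "stop", "top"]
root = Node()
for ordinal, word in enumerate(WORDS):
    insert(root, word, ordinal)

print(lookup(root, "sloth"))   # 3 on the shared "s" arc plus 1 on the "t" arc
```

With this layout, "slop", "sloth", and "stop" all reuse the output 3 stored on the shared "s" arc, and only the differences sit deeper in the structure.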
  • 16. FSTs and Text Tagging
      - My approach involves two layers of FSTs:
      - A word dictionary FST to hold each unique word
          - Enables using integers as substitutes for char[]
          - Via getByOutput(12345) -> “New”
          - Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345
      - A word phrase FST composed of word id string keys
          - Ex: “New York City” -> [12345, 5522111, 345]
          - Values are arrays of gazetteer primary keys
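The two-layer layout can be illustrated with ordinary Python dictionaries standing in for the two FSTs. This is only a sketch of the data model: the function name `build_layers`, the sample names, and the gazetteer keys are made up for illustration, and a plain list index plays the role that getByOutput plays on the slide.

```python
def build_layers(names):
    """names: mapping of place name -> list of gazetteer primary keys.

    Returns (word_to_id, id_to_word, phrase_to_keys):
      - word_to_id / id_to_word stand in for the word dictionary FST
      - phrase_to_keys stands in for the word phrase FST, keyed by word ids
    """
    vocabulary = sorted({word for name in names for word in name.split()})
    word_to_id = {word: i for i, word in enumerate(vocabulary)}
    id_to_word = vocabulary   # id -> word, analogous to a lookup by output
    phrase_to_keys = {
        tuple(word_to_id[word] for word in name.split()): keys
        for name, keys in names.items()
    }
    return word_to_id, id_to_word, phrase_to_keys

# Hypothetical gazetteer fragment; the primary keys are invented.
sample = {
    "New York City": [101, 102],
    "New York": [103],
    "York": [104],
}
word_to_id, id_to_word, phrase_to_keys = build_layers(sample)

phrase = tuple(word_to_id[w] for w in "New York City".split())
print(phrase, phrase_to_keys[phrase])
```

Storing each distinct word once and referring to it by a small integer is what lets the phrase layer stay compact when many names share words.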
  • 17. Memory Use
      - Word Dict FST:
          - 3.3M words with ordinal ids in 26MB of RAM
      - Name Phrase FST:
          - 8.1M word id phrases in 90 MB of RAM
          - Plus 82MB of arrays of gazetteer primary key ids
      - Total: 198 MB (compare to 80GB GATE Aho-Corasick)
      - Building it consumes ~1.5GB Java heap, for 2 minutes
  • 18. Experimental measurements
      - Single FST Experiment
          - 1 FST of analyzed character word phrase -> int id
          - “new york city” -> 6344207
          - Theory: more than 2x the memory
          - Result: 69 MB! (compare to 26+90) 41% reduction
      - Retrospective: What I would have done differently
          - Index a field of concatenated terms (custom TokenFilter).
          - More disk needed but reduces build time & memory requirements. Unclear effect on tagging performance.
          - Potential to use MemoryPostingsFormat, a Lucene Codec that uses an FST internally + vInt doc ids, instead of custom FST code.
  • 19. Tagging Algorithm
      - It’s complicated!
      - Single-pass (streaming) algorithm
      - For each input word, look up its ordinal id, then:
          1. Create an FST arc iterator for name phrase
          2. Append the iterator onto a queue of active ones
          3. Try to advance all iterators
              - Remove those that don’t advance
      - Iterator linked-list queue:
          - Head: New, York, City ✔
          - Head+1: York, City
          - Head+2: City
          - …
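The streaming loop can be sketched in Python under some loud assumptions: a nested-dict phrase trie stands in for the name phrase FST, plain words stand in for ordinal ids, a list of (start, node) cursors stands in for the queue of arc iterators, and spans are token indices rather than character offsets. The names `build_phrase_trie` and `tag` are hypothetical, and the final filter is one simple way to honor "overlapping names, but not names within names"; the real tagger may well do this differently.

```python
END = object()   # sentinel key marking that a complete phrase ends at this node

def build_phrase_trie(phrases):
    """phrases: mapping of word tuple -> list of gazetteer ids."""
    root = {}
    for words, ids in phrases.items():
        node = root
        for word in words:
            node = node.setdefault(word, {})
        node[END] = ids
    return root

def tag(tokens, trie):
    """Single pass over tokens; returns (start, end, ids) with end exclusive."""
    active = []    # cursors: (start token index, current trie node)
    matches = []
    for i, token in enumerate(tokens):
        active.append((i, trie))            # steps 1 and 2: new cursor at this word
        advanced = []
        for start, node in active:          # step 3: try to advance every cursor
            nxt = node.get(token)
            if nxt is None:
                continue                    # cursor could not advance: drop it
            if END in nxt:
                matches.append((start, i + 1, nxt[END]))
            advanced.append((start, nxt))
        active = advanced
    # Keep overlapping matches but drop any match contained in another one.
    return [
        m for m in matches
        if not any(o is not m and o[0] <= m[0] and m[1] <= o[1] for o in matches)
    ]

# Hypothetical phrases and ids, for illustration only.
trie = build_phrase_trie({
    ("new", "york"): [1],
    ("new", "york", "city"): [2],
    ("york",): [3],
    ("city", "hall"): [4],
})
tags = tag("i love new york city hall".split(), trie)
print(tags)
```

In this example "new york" and "york" are found but discarded because they sit inside "new york city", while "city hall" survives because it only overlaps that longer match.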
  • 20. Speed Benchmarks
        Tagger                            Docs/Sec   RAM (GB)
        OpenSextant: GATE Tagger          ?          80
        OpenSextant: MySQL based Tagger   1.1        4
        OpenSextant: Solr/FST Tagger      15.9       2*
      Measures single-threaded performance of geotagging 428 documents in the “ACE” collection. OpenSextant tests all had the same gazetteer.
  • 21. Integrated with Solr
      - As a custom Solr Request Handler
          - Builds the FSTs from the index (the gazetteer)
      - Configurable
          - Text analysis (e.g. phonetic)
          - Exclude gazetteer docs by configured query
          - Optional partial word phrase matching
          - Optional sub-tags tagging
      - Solr integration benefits
          - Solr as a taxonomy manager! Web-service, searchable, scalable, easy to update, …
  • 22.
      ~$ curl -XPOST 'http://localhost:8983/solr/tag?fl=*&wt=json&indent=2' \
           -H 'Content-Type:text/plain' -d "I live near Boston"
      {
        "responseHeader":{
          "status":0,
          "QTime":1898},
        "tagsCount":1,
        "tags":[[
            "startOffset",12,
            "endOffset",18,
            "ids",[1190927, 1099063, 2562742, 2667203, 2684629, 2695904,
              2653982, 2657690, 2585165, 2597292, … …
              11890986, 11891415]]],
        "matchingDocs":{"numFound":73,"start":0,"docs":[
            {
              "id":12719030,
              "place_id":"USGS1893700",
              "name":"Boston",
              "lat":65.01667,
              "lon":-163.28333,
              "feat_class":"L",
              "feat_code":"AREA",
              "FIPS_cc":"US",
              "ISO_cc":["US"],
              "cc":"US",
              "ISO3_cc":"USA",
              "adm1":"US02",
              "adm2":"US02.0180",
              "name_bias":0.0,
              "id_bias":0.04,
              "geo":"65.01667,-163.28333"},
            …
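In the response shown on this slide, each entry in "tags" is a flat list that alternates names and values ("startOffset", 12, "endOffset", 18, "ids", […]). A client might turn those into dictionaries and slice the tagged text out of the input. The Python sketch below assumes exactly that response shape and that endOffset is exclusive; `parse_tags` is a hypothetical helper, and the sample response is shortened to two of the ids from the slide.

```python
import json

def parse_tags(response_text):
    """Convert each flat [name, value, name, value, ...] tag into a dict."""
    body = json.loads(response_text)
    return [dict(zip(flat[0::2], flat[1::2])) for flat in body.get("tags", [])]

# Shortened sample modelled on the slide's response (only two ids kept).
sample_response = """
{
  "tagsCount": 1,
  "tags": [["startOffset", 12, "endOffset", 18, "ids", [1190927, 1099063]]]
}
"""
text = "I live near Boston"
tags = parse_tags(sample_response)
tagged = [text[t["startOffset"]:t["endOffset"]] for t in tags]
print(tags, tagged)
```

The character offsets line up with the input string, so the slice for the single tag is the word "Boston".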
  • 23. Where can you get this?
      - https://github.com/openSextant/SolrTextTagger
      - An independent module of OpenSextant
          - Might seek incubator status at http://www.osgeo.org
      - Includes documentation, tests
  • 24. Concluding Remarks
      - Lucene FSTs are awesome!
          - Great for storing large amounts of strings in-memory
          - Or other string-like data: e.g. IP addresses, geohashes
          - The API is hard to use, however
      - The text tagger should be useful independent of OpenSextant
          - Tag people/org names or special keywords
          - Might be ported to Lucene as an alternative to its synonym token filter
      - I’ve got an idea on applying these concepts to Lucene “Shingling” as a codec to make it more scalable
  • 25. CONFERENCE PARTY
      - The Tipsy Crow: 770 5th Ave
      - Starts after Stump The Chump
      - Your conference badge gets you in the door
      TOMORROW
      - Breakfast starts at 7:30
      - Keynotes start at 8:30
      CONTACT
      - David Smiley, dsmiley@mitre.org