SlideShare a Scribd company logo
1 of 24
© Copyright 2013
Open Source Search FTW!
Grant Ingersoll
Lucene/Solr Committer, Apache Soft.
Found.
CTO, LucidWorks
@gsingers
© 2013 LucidWorks
2
Preaching to the Converted!
• Embrace fuzziness!
• Search is a system building
block
• If the algorithms fit,
use them!
• Search use leads to search
abuse
• Scoring features are
everywhere
http://cheezburger.com/5243950080
© 2013 LucidWorks
3
Topics
• Quick Intro to Lucene and Solr
• What‘s new in Lucene and Solr 4.x?
- Lucene/Solr for Info Retrieval
• (Ab)Using Search Engine Tech. for Fun and Profit
© 2013 LucidWorks
4
Quick Intro to Lucene and Solr
© 2013 LucidWorks
Relax, You’re Among Friends
• Large, diverse search community with many non-traditional
search engine usages
- Object stores, Record linkage, Social, mobile -> web
• Open Dev. > Open Source
• ―The Apache Way‖
- Meritocracy – Those who do, decide!
• Always Be Testing
- Randomized system tests are all the rage
- http://vimeo.com/32087114
• Patches Welcome!
© 2013 LucidWorks
© Copyright 2013
© 2013 LucidWorks
Lucene: Speed and Memory
• Native Near Real Time (NRT) support
- Per segment
- FieldCache can be controlled to only load new segments
- Soft commit -- faster without fsync, allows quicker update
visibility
• DWPT (Document Writer per Thread)
- Faster more consistent index speed
• Faster fuzzy & wildcard query processing
• String -> BytesRef
- Much improved data structure
- … means less memory and less garbage collection effort
© 2013 LucidWorks
Up and to the Right
• http://people.apache.org/~mikemccand/lucenebench/in
dexing.html
9
© 2013 LucidWorks
Lucene: Flexibility
• Flexible Index Formats
- New posting list codecs: Block, Simple Text, Append (HDFS..),
etc
- Pulsing codec: improves performance of primary key searches,
inlining docs, positions, and payloads, saves disk seeks
• Pluggable Scoring
- Decoupled from TF/IDF
- Built in alternatives include BM25 & DFR, and others
» http://en.wikipedia.org/wiki/Okapi_BM25
» http://terrier.org/docs/v3.5/dfr_description.html
- Add your own
© 2013 LucidWorks
FS(A|T)
• Keys:
- byte[] – write-once
- Linear time build of min. automata (nlogn if not sorted)
- Compression
- Reverse lookups
- Weights (used for auto-suggest)
- Pluggable Algebra
• Uses:
- Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
- FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
- http://slidesha.re/vKtpVA
- http://bit.ly/Pkjyu0
- ―Smaller Representation of Finite State Automata‖
» Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011,
vol. 6807, 2011, pp. 118—192.
© Copyright 2013
© 2013 LucidWorks
Solr 4: New Features
• Search/Faceting/Relevance
- New Relevance Function Queries (tf, df, others)
- Pivot Faceting
- Pseudo-join
- Improved Spatial (more later)
- Full support for Lucene Codecs, pluggable scoring
• Indexing
- New Update Processors, including scripting option
- Near real time
• Codec/Similarity support from Lucene 4
• Other
- New Admin UI
© 2013 LucidWorks
Geospatial improvements
• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle
• Indexing:
- "geo‖:‖43.17614,-90.57341‖
- ―geo‖:‖Circle(4.56,1.23 d=0.0710)‖
- ―geo‖:‖POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))‖
• Searching:
- fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
- fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0
0, -10 30)))‖
© 2013 LucidWorks
Scaling Solr
• Distributed/sharded indexing & search
- Auto distributes updates and queries to appropriate shards
- Near Real Time (NRT) indexing capable
• Dynamically scalable
- New SolrCloud instances add indexing and query capacity
- Supports re-balancing
• Reliable
- No single point of failure
- Transactions logged
- Robust, automatic recover
• http://wiki.apache.org/solr/SolrCloud
© 2013 LucidWorks
16
New in 4.4 (just released)
• HDFS backed directory for storing index and
transaction logs in Apache Hadoop
• New Core discovery capabilities
• Schemaless/External Schema/Field Guessing
• Schema APIs
• Add documents from the Admin UI
© Copyright 2013
Hacking Search
Engines for Fun and
Profit
17
© 2013 LucidWorks
… Find your Keys, Store Your Content
• Lucene/Solr is a fast key-value
store
- Bonus: search your values!
• NoSQL before NoSQL was cool
• Solr: distributed key/value
- Durable, Isolated, Redundant, Fast,
Real-time
- Joins, Column Storage
• Solr or Tika + Lucene can index
popular office formats
• Solr can backup/replicate and
scale as content grows
© 2013 LucidWorks
… Find Love! Upsell! Cross-sell!
• Cross recommendation as search
- with search used to build cross recommendation!
• Recommend content to people who exhibit certain
behaviors (clicks, query terms, other)
• (Ab)use of a search engine
- but not as a search engine for content
- more like a search engine for behavior
• See Ted Dunning‘s talk from Berlin Buzzwords on Multi-
modal Recommendation Algorithms
- http://berlinbuzzwords.com/sessions/multi-modal-recommendation-
algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
© 2013 LucidWorks
20
… Avoid Delays
© 2013 LucidWorks
21
… Time travel?
• Leverage Solr‘s new
spatial capabilities to
index non-spatial data,
such as time ranges
- Useful for Open Hours, Shifts,
etc.
• Query using rectangle
intersections
- q = shift:"Intersects(0 19 23
365)‖
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
© 2013 LucidWorks
22
Boldly go forth and rank!
• Faster
• More Flexible
• Easier than ever scaling
• More reliable than ever
• Reduced cost of experimentation
© 2013 LucidWorks
• Lucene/Solr EU
Conference:
- Dublin, IE, November 4-7:
http://lucenerevolution.org/
- CFP Open Now
Where to Next?
• Lucene/Solr
- http://lucene.apache.org
- {java-user|solr-user}@lucene.apache.org
- SIGIR ‗12 Open Source Workshop
» http://opensearchlab.otago.ac.nz/paper_
10.pdf
• LucidWorks
- http://www.lucidworks.com
- Commercial support, products, etc. for
Lucene/Solr
• Me
- grant@lucidworks.com
- @gsingers on Twitter
- ―Taming Text‖ – Engineer‘s guide to open
source search and NLP
» http:///www.manning.com/ingersoll
23
© 2013 LucidWorks
24
Credits
• All of the Lucene/Solr committers and contributors
• Polar bear: http://gaijinexplorer.blogspot.ie/2012/12/its-all-just-relaxing.html
• Volunteers: http://www.poconohealthsystem.org/?id=228&sid=1
• Not Hiring: http://naijaguardianjobs.com/wp-content/uploads/2013/03/Not-Hiring-
The-American.jpg
• Keys: http://www.flickr.com/photos/crazyneighborlady/355232758/
• Love: http://www.msruntheus.com/above-all-love-each-other-deeply/
• TARDIS: http://2.bp.blogspot.com/-
ysN8JskY4WM/UEZNhBywQKI/AAAAAAAABdg/gXE0A9OO6Mk/s1600/13881_do
ctor_who.jpg

More Related Content

What's hot

Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Lucidworks (Archived)
 

What's hot (20)

Building a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patentsBuilding a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patents
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Spark: The Good, the Bad, and the Ugly
Spark: The Good, the Bad, and the UglySpark: The Good, the Bad, and the Ugly
Spark: The Good, the Bad, and the Ugly
 
Webinar: Solr & Fusion for Big Data
Webinar: Solr & Fusion for Big DataWebinar: Solr & Fusion for Big Data
Webinar: Solr & Fusion for Big Data
 
PyTorch 04 What's New in PyTorch Land
PyTorch 04 What's New in PyTorch LandPyTorch 04 What's New in PyTorch Land
PyTorch 04 What's New in PyTorch Land
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Dev Con 2014
Dev Con 2014Dev Con 2014
Dev Con 2014
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
 
Solr 101
Solr 101Solr 101
Solr 101
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
 
A Survey of Elasticsearch Usage
A Survey of Elasticsearch UsageA Survey of Elasticsearch Usage
A Survey of Elasticsearch Usage
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
 
Intro to Solr in Drupal
Intro to Solr in Drupal Intro to Solr in Drupal
Intro to Solr in Drupal
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
 
DatoConference2015
DatoConference2015DatoConference2015
DatoConference2015
 

Similar to Open Source Search FTW

What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
Lucidworks (Archived)
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
Rahul Jain
 

Similar to Open Source Search FTW (20)

What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
From Lucene to Solr 4 Trunk
From Lucene to Solr 4 TrunkFrom Lucene to Solr 4 Trunk
From Lucene to Solr 4 Trunk
 
Ease of use in Apache Solr
Ease of use in Apache SolrEase of use in Apache Solr
Ease of use in Apache Solr
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recallICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Intro to Apache Solr for Drupal
Intro to Apache Solr for DrupalIntro to Apache Solr for Drupal
Intro to Apache Solr for Drupal
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Elastic pivorak
Elastic pivorakElastic pivorak
Elastic pivorak
 
Solr 4
Solr 4Solr 4
Solr 4
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
EnterpriseSearch
EnterpriseSearchEnterpriseSearch
EnterpriseSearch
 
Apache Solr 5.0 and beyond
Apache Solr 5.0 and beyondApache Solr 5.0 and beyond
Apache Solr 5.0 and beyond
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Solr/Elasticsearch for CF Developers (and others)
Solr/Elasticsearch for CF Developers (and others)Solr/Elasticsearch for CF Developers (and others)
Solr/Elasticsearch for CF Developers (and others)
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, TruliaReal Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst Again
 

More from Grant Ingersoll

Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
Grant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Grant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Grant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
Grant Ingersoll
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
Grant Ingersoll
 

More from Grant Ingersoll (13)

Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Open Source Search FTW

  • 1. © Copyright 2013 Open Source Search FTW! Grant Ingersoll Lucene/Solr Committer, Apache Soft. Found. CTO, LucidWorks @gsingers
  • 2. © 2013 LucidWorks 2 Preaching to the Converted! • Embrace fuzziness! • Search is a system building block • If the algorithms fit, use them! • Search use leads to search abuse • Scoring features are everywhere http://cheezburger.com/5243950080
  • 3. © 2013 LucidWorks 3 Topics • Quick Intro to Lucene and Solr • What‘s new in Lucene and Solr 4.x? - Lucene/Solr for Info Retrieval • (Ab)Using Search Engine Tech. for Fun and Profit
  • 4. © 2013 LucidWorks 4 Quick Intro to Lucene and Solr
  • 5. © 2013 LucidWorks Relax, You’re Among Friends • Large, diverse search community with many non-traditional search engine usages - Object stores, Record linkage, Social, mobile -> web • Open Dev. > Open Source • ―The Apache Way‖ - Meritocracy – Those who do, decide! • Always Be Testing - Randomized system tests are all the rage - http://vimeo.com/32087114 • Patches Welcome!
  • 8. © 2013 LucidWorks Lucene: Speed and Memory • Native Near Real Time (NRT) support - Per segment - FieldCache can be controlled to only load new segments - Soft commit -- faster without fsync, allows quicker update visibility • DWPT (Document Writer per Thread) - Faster more consistent index speed • Faster fuzzy & wildcard query processing • String -> BytesRef - Much improved data structure - … means less memory and less garbage collection effort
  • 9. © 2013 LucidWorks Up and to the Right • http://people.apache.org/~mikemccand/lucenebench/in dexing.html 9
  • 10. © 2013 LucidWorks Lucene: Flexibility • Flexible Index Formats - New posting list codecs: Block, Simple Text, Append (HDFS..), etc - Pulsing codec: improves performance of primary key searches, inlining docs, positions, and payloads, saves disk seeks • Pluggable Scoring - Decoupled from TF/IDF - Built in alternatives include BM25 & DFR, and others » http://en.wikipedia.org/wiki/Okapi_BM25 » http://terrier.org/docs/v3.5/dfr_description.html - Add your own
  • 11. © 2013 LucidWorks FS(A|T) • Keys: - byte[] – write-once - Linear time build of min. automata (nlogn if not sorted) - Compression - Reverse lookups - Weights (used for auto-suggest) - Pluggable Algebra • Uses: - Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others - FuzzyQuery is 100x faster -- http://bit.ly/hgO65c • More: - http://slidesha.re/vKtpVA - http://bit.ly/Pkjyu0 - ―Smaller Representation of Finite State Automata‖ » Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118—192.
  • 13. © 2013 LucidWorks Solr 4: New Features • Search/Faceting/Relevance - New Relevance Function Queries (tf, df, others) - Pivot Faceting - Pseudo-join - Improved Spatial (more later) - Full support for Lucene Codecs, pluggable scoring • Indexing - New Update Processors, including scripting option - Near real time • Codec/Similarity support from Lucene 4 • Other - New Admin UI
  • 14. © 2013 LucidWorks Geospatial improvements • Index shapes other than points (circles, polygons, etc) • More complex interactions than point in a circle • Indexing: - "geo‖:‖43.17614,-90.57341‖ - ―geo‖:‖Circle(4.56,1.23 d=0.0710)‖ - ―geo‖:‖POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))‖ • Searching: - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)" - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))‖
  • 15. © 2013 LucidWorks Scaling Solr • Distributed/sharded indexing & search - Auto distributes updates and queries to appropriate shards - Near Real Time (NRT) indexing capable • Dynamically scalable - New SolrCloud instances add indexing and query capacity - Supports re-balancing • Reliable - No single point of failure - Transactions logged - Robust, automatic recover • http://wiki.apache.org/solr/SolrCloud
  • 16. © 2013 LucidWorks 16 New in 4.4 (just released) • HDFS backed directory for storing index and transaction logs in Apache Hadoop • New Core discovery capabilities • Schemaless/External Schema/Field Guessing • Schema APIs • Add documents from the Admin UI
  • 17. © Copyright 2013 Hacking Search Engines for Fun and Profit 17
  • 18. © 2013 LucidWorks … Find your Keys, Store Your Content • Lucene/Solr is a fast key-value store - Bonus: search your values! • NoSQL before NoSQL was cool • Solr: distributed key/value - Durable, Isolated, Redundant, Fast, Real-time - Joins, Column Storage • Solr or Tika + Lucene can index popular office formats • Solr can backup/replicate and scale as content grows
  • 19. © 2013 LucidWorks … Find Love! Upsell! Cross-sell! • Cross recommendation as search - with search used to build cross recommendation! • Recommend content to people who exhibit certain behaviors (clicks, query terms, other) • (Ab)use of a search engine - but not as a search engine for content - more like a search engine for behavior • See Ted Dunning‘s talk from Berlin Buzzwords on Multi- modal Recommendation Algorithms - http://berlinbuzzwords.com/sessions/multi-modal-recommendation- algorithms • Go get Mahout/Myrrix or just do it in y(our) search engine
  • 21. © 2013 LucidWorks 21 … Time travel? • Leverage Solr‘s new spatial capabilities to index non-spatial data, such as time ranges - Useful for Open Hours, Shifts, etc. • Query using rectangle intersections - q = shift:"Intersects(0 19 23 365)‖ https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
  • 22. © 2013 LucidWorks 22 Boldly go forth and rank! • Faster • More Flexible • Easier than ever scaling • More reliable than ever • Reduced cost of experimentation
  • 23. © 2013 LucidWorks • Lucene/Solr EU Conference: - Dublin, IE, November 4-7: http://lucenerevolution.org/ - CFP Open Now Where to Next? • Lucene/Solr - http://lucene.apache.org - {java-user|solr-user}@lucene.apache.org - SIGIR ‗12 Open Source Workshop » http://opensearchlab.otago.ac.nz/paper_ 10.pdf • LucidWorks - http://www.lucidworks.com - Commercial support, products, etc. for Lucene/Solr • Me - grant@lucidworks.com - @gsingers on Twitter - ―Taming Text‖ – Engineer‘s guide to open source search and NLP » http:///www.manning.com/ingersoll 23
  • 24. © 2013 LucidWorks 24 Credits • All of the Lucene/Solr committers and contributors • Polar bear: http://gaijinexplorer.blogspot.ie/2012/12/its-all-just-relaxing.html • Volunteers: http://www.poconohealthsystem.org/?id=228&sid=1 • Not Hiring: http://naijaguardianjobs.com/wp-content/uploads/2013/03/Not-Hiring- The-American.jpg • Keys: http://www.flickr.com/photos/crazyneighborlady/355232758/ • Love: http://www.msruntheus.com/above-all-love-each-other-deeply/ • TARDIS: http://2.bp.blogspot.com/- ysN8JskY4WM/UEZNhBywQKI/AAAAAAAABdg/gXE0A9OO6Mk/s1600/13881_do ctor_who.jpg

Editor's Notes

  1. Search Abuse Can discuss how I started just doing free text, but then a curious thing happened, started to see people using the engine for things like: key/value, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much more
  2. What is Lucene?What is Solr?
  3. Power users are often more likely to recoverTools for recovery:Auto-suggest, related searches, spelling suggestions
  4. Oh, BTW, it can do search over the valuesKeys can be anything, not just strings