luceneSolr = new LuceneSolr(4.x)

                                Grant Ingersoll
                                CTO, LucidWorks




Confidential © Copyright 2012
Search is dead, long live search


    • Embrace fuzziness!

    • Search is a system building
      block

    • If the algorithms fit, use them!

    • Search use leads to search
      abuse
        - Denormalization frees your mind
        - Scoring is just a sparse matrix
          multiply
                                            http://cheezburger.com/5243950080
    • Scoring features are
      everywhere
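A minimal sketch of the "scoring is just a sparse matrix multiply" idea, assuming a toy index where each posting list is one sparse row of a term-by-document weight matrix. The index contents, weights, and the `score` helper are all made up for illustration; this is not how Lucene stores or scores postings.

```python
from collections import defaultdict

# Hypothetical toy index: term -> {doc_id: weight}.
# Each posting list is one sparse row of a term-by-document matrix.
index = {
    "lucene": {0: 1.2, 2: 0.4},
    "solr": {0: 0.7, 1: 1.5},
    "search": {1: 0.3, 2: 0.9},
}


def score(query_weights, index):
    """Multiply a sparse query vector by the sparse term-document matrix."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        # Only documents that contain the term contribute a non-zero product.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return dict(scores)


scores = score({"lucene": 1.0, "solr": 2.0}, index)
ranked = sorted(scores, key=scores.get, reverse=True)  # best document first
```

Terms missing from the index simply contribute nothing, which is what keeps the multiply sparse.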
    Confidential and Proprietary
    © 2012 LucidWorks
Search (R)evolution


• “T’ain’t your father’s search engine”
    - Non-free-text usages abound

• NoSQL before NoSQL was cool
    - Many DB-like features


• Flexibility during indexing and scoring

• Finite State Transducers FTW!

• Scale


Agenda

• What’s new in Lucene 4?

• What’s new in Solr 4?

• Sneak Peek: what’s ahead?




Lucene 4




Up and to the Right




    • http://people.apache.org/~mikemccand/lucenebench/indexing.html

Lucene: Flexibility

• Flexible Index Formats
    - New posting list codecs: Block, Simple Text, Append (HDFS...), etc.
    - Pulsing codec: speeds up primary key searches by inlining docs,
      positions, and payloads, which saves disk seeks


• Pluggable Scoring
    - Decoupled from TF/IDF
    - Built in alternatives include BM25 & DFR
          » http://en.wikipedia.org/wiki/Okapi_BM25
          » http://terrier.org/docs/v3.5/dfr_description.html
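For readers who have not met BM25, here is a rough sketch of the per-term score described on the linked Wikipedia page. The function name and parameters are illustrative only, and `k1=1.2`, `b=0.75` are commonly cited defaults rather than a statement about any particular configuration. The IDF variant with `1 +` inside the logarithm is one of several in use and is an assumption here.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document with Okapi BM25.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document, in terms
    avg_doc_len -- average document length in the collection
    num_docs    -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Rare terms get a higher weight; the "1 +" keeps the value non-negative.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized via b.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf * tf * (k1 + 1.0) / (tf + norm)


short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, num_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=3, doc_len=200, avg_doc_len=100, num_docs=1000, doc_freq=10)
```

With the same term frequency, the shorter document scores higher, which is the length normalization at work.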




Lucene: Speed and Memory

• Native Near Real Time (NRT) support
    - Per segment
    - FieldCache can be controlled to only load new segments
• Soft commit
    - Faster without fsync, allows quicker update visibility
• DWPT (DocumentsWriterPerThread)
    - Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
    - Higher performance searching
• String -> BytesRef
    - Much improved data structure
    - … means less memory and less garbage collection effort

BytesRef memory management improvements


    • On a Wikipedia index (11M documents)
        - Time to perform the first query with sorting (no warmup queries)
          Solr 3x: 13 seconds, Solr 4: 6 seconds.

        - Memory consumption Solr 3x: 1,040M, Solr 4: 366M.

        - Number of objects on the heap. Solr 3x: 19.4M, Solr 4: 80K. No,
          that’s not a typo.

        - http://searchhub.org/2012/04/06/memory-comparisons-between-solr-3x-and-trunk/




FuzzyQuery

     • http://people.apache.org/~mikemccand/lucenebench/Fuzzy2.html
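To make "fuzzy" concrete: a fuzzy query matches terms within a small number of single-character edits of the query term. The sketch below uses plain dynamic-programming Levenshtein distance as a stand-in. It is not the automaton-based approach Lucene 4 uses to make this fast, and the helper names and term list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_matches(query, terms, max_edits=2):
    """Return the terms within max_edits of the query (brute-force scan)."""
    return [t for t in terms if edit_distance(query, t) <= max_edits]


hits = fuzzy_matches("lucene", ["lucene", "lucine", "lucent", "solr"])
```

Scanning every term like this is the slow path; the benchmark above tracks how much faster the real implementation is.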




QPS (primary key lookup)

     • http://people.apache.org/~mikemccand/lucenebench/PKLookup.html




Lucene: Features

• Doc Values
    - Store data in column order
    - Tradeoffs when using vs. FieldCache
    - http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
    - http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
• DirectSpellChecker
    - No more sidecar index!
• Geospatial improvements (more later)
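A small illustration of what "column order" means for Doc Values, assuming three made-up documents. The data and helper names are hypothetical and the sketch says nothing about the on-disk format; it only shows why sorting or faceting on one field is cheaper when that field's values sit together.

```python
# Row-oriented view: each document carries all of its fields together.
rows = [
    {"id": "a", "price": 30, "year": 2011},
    {"id": "b", "price": 10, "year": 2013},
    {"id": "c", "price": 20, "year": 2012},
]


def to_columns(rows, fields):
    """Pivot rows into one dense array per field, indexed by document number."""
    return {field: [row[field] for row in rows] for field in fields}


def sort_by(columns, field):
    """Return document numbers ordered by one field, reading only that column."""
    values = columns[field]
    return sorted(range(len(values)), key=values.__getitem__)


columns = to_columns(rows, ["price", "year"])
by_price = sort_by(columns, "price")
```

Sorting by `price` touches a single array and never looks at `year` or `id`.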



Solr 4




Solr 4: Features

• Search/Faceting/Relevance
    -   New Relevance Function Queries (tf, df, others)
    -   Pivot Faceting
    -   Pseudo-join
    -   DirectSpellChecker support
    -   Improved Spatial (more later)
• Indexing
    - New Update Processors, including scripting option
    - NRT
• Other
    - DocTransformer pluggability
    - New Admin UI


Geospatial improvements

• Multiple values per field
• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle

• Indexing:
    - "geo":"43.17614,-90.57341"
    - "geo":"Circle(4.56,1.23 d=0.0710)"
    - "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
    - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
    - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
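The filter queries above contain quotes, spaces, and parentheses, so they need URL encoding before they go on the wire. A minimal sketch using only the Python standard library follows. The `spatial_filter` helper and the `q=*:*` main query are assumptions for illustration; the field name `geo` and the shapes are taken from the slide.

```python
from urllib.parse import urlencode, parse_qs


def spatial_filter(field, shape, predicate="Intersects"):
    """Build a filter of the form field:"Predicate(shape)" (hypothetical helper)."""
    return f'{field}:"{predicate}({shape})"'


params = {
    "q": "*:*",  # assumed match-all main query
    "fq": spatial_filter("geo", "-74.093 41.042 -69.347 44.558"),
}
query_string = urlencode(params)  # percent-encodes quotes, spaces, parentheses

# Decoding the string gives back the original filter text.
decoded_fq = parse_qs(query_string)["fq"][0]
```

The same helper works for the polygon form by passing the `POLYGON((...))` text as the shape.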

/solr




SolrCloud

• Distributed/sharded indexing & search
    - Auto distributes updates and queries to appropriate shards
    - Near Real Time (NRT) indexing capable
• Dynamically scalable
    - New SolrCloud instances add indexing and query capacity
• Reliable
    - No single point of failure
    - Transactions logged
    - Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud



SolrCloud’s capabilities

• Transaction log
    - All updates are added to the transaction log (tlog). The tlog provides:
          » durability for updates that have not yet been committed
          » peer syncing
          » real-time get (retrieve documents by unique id), which is always up to date because it
            checks the tlog first and does not require opening a new searcher to see changes
• Near Real Time (NRT) indexing
    - Soft commits make updates visible
    - Hard commits make updates durable
• Durability
    - Updates to Solr may be in several different states: buffered in memory; flushed but not
      committed or viewable; soft committed (flushed and viewable); committed (durable)
    - The transaction log ensures data is not lost in any of these states if Solr crashes.
• Recovery
    - Solr uses the transaction log for recovery: on startup, Solr checks whether the tlog is in a
      committed state; if not, updates since the last commit are applied
• Optimistic locking
    - Solr maintains a document version (_version_ field); updates can now specify _version_;
      an update that specifies an incorrect version will fail
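The optimistic-locking bullet can be pictured with a toy in-memory store: every write bumps a version, and a writer may state which version it expects to overwrite. This is a simplified model for illustration only. The class and method names are invented, and Solr's real `_version_` rules cover more cases than the single exact-match check shown here.

```python
class VersionConflict(Exception):
    """Raised when an update names a version that is not the current one."""


class VersionedStore:
    """Toy in-memory document store with optimistic locking (not Solr code)."""

    def __init__(self):
        self._docs = {}   # doc id -> (version, fields)
        self._clock = 0   # source of increasing version numbers

    def get(self, doc_id):
        """Return (version, fields) for a document, or None if it is absent."""
        return self._docs.get(doc_id)

    def update(self, doc_id, fields, expected_version=None):
        """Write a document; fail if expected_version is given and is stale."""
        current = self._docs.get(doc_id)
        current_version = current[0] if current else None
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"{doc_id}: expected {expected_version}, found {current_version}")
        self._clock += 1
        self._docs[doc_id] = (self._clock, dict(fields))
        return self._clock


store = VersionedStore()
v1 = store.update("doc1", {"title": "first"})
v2 = store.update("doc1", {"title": "second"}, expected_version=v1)
```

A second writer still holding `v1` would now get a `VersionConflict` instead of silently overwriting the newer document, and can re-read and retry.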


SolrCloud details

     • “Leaders” and “replicas”
         - Leaders are automatically elected
     • A leader is just a replica with some coordination
       responsibilities for its associated replicas
     • If a leader goes down, one of the associated replicas is
       elected as the new leader
     • New nodes are automatically assigned a shard and
       role, and replicate/recover as needed
     • SolrJ’s CloudSolrServer
     • Replication in Solr 4
         - Used for new and recovering replicas
         - Or for traditional master/slave configuration

Solr as NoSQL

• Characteristics
    -   Non-traditional data stores
    -   Not designed for SQL type queries
    -   Distributed fault tolerant architecture
    -   Document oriented, data format agnostic (JSON, XML, CSV,
        binary)
• Update durability via transaction log
• Real-time /get fetches latest version w/o hard commit
• Versioning and optimistic locking
    - w/ Real Time GET, allows read/write/update w/o conflicts
• Atomic updates
    - Can add/remove/change and increment a field in an existing doc
      w/o the client re-sending the whole document
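As a rough sketch of what an atomic update looks like from the client side, the snippet below builds a JSON body in which each changed field carries a modifier (`set`, `add`, or `inc`) instead of a plain value. The document id, field names, and the `atomic_update` helper are hypothetical, and the unique key being called `id` is an assumption about the schema.

```python
import json

ALLOWED_OPS = ("set", "add", "inc")


def atomic_update(doc_id, **field_ops):
    """Build one atomic-update document: {"id": ..., field: {op: value}, ...}."""
    doc = {"id": doc_id}  # assumes the schema's unique key field is "id"
    for field, (op, value) in field_ops.items():
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported modifier: {op}")
        doc[field] = {op: value}
    return doc


payload = json.dumps([
    atomic_update(
        "book1",
        price=("set", 12.5),     # replace the current value
        copies=("inc", 1),       # increment a numeric field
        tags=("add", "search"),  # append to a multi-valued field
    )
])
```

Only the changed fields travel over the wire; fields not mentioned in the body are left as they were.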


Distributed Key / Value Pair Database

     • Real-time Get combined with SolrCloud makes a very
       powerful key/value pair database
         -   Durable (tlog)
         -   Isolated (Optimistic locking)
         -   Redundant (SolrCloud Replicas)
         -   Distributed & scalable (billions of keys, SolrCloud Sharding)
         -   Efficient Multi-tenant (SolrCloud document routing, Solr 4.1)
         -   Fast (millisecond response time, Pulsing Codec)
         -   Real-time (tlog)




Routing

     • Allows you to route documents and queries to a subset
       of shards
     • Provides efficient multi-tenancy
     • Indexing:
         - A shard key can be prepended to the unique document id:
           shard_key!unique_id
         - Documents with the same shard_key will reside on the same
           shard.
     • Querying: shard.keys=shard_key1!...
         - Much more efficient than searching the entire collection.
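A toy model of the routing idea: ids are written as `shard_key!unique_id`, and only the part before the `!` decides which shard a document lands on, so one tenant's documents stay together. The CRC32-modulo hash below is a stand-in chosen for brevity and is not Solr's hashing scheme. The helper names, tenant name, and shard count are all hypothetical.

```python
import zlib
from urllib.parse import urlencode, parse_qs


def routed_id(shard_key, unique_id):
    """Compose a document id of the form shard_key!unique_id."""
    return f"{shard_key}!{unique_id}"


def pick_shard(doc_id, num_shards):
    """Toy router: hash only the shard-key prefix (stand-in hash, not Solr's)."""
    shard_key = doc_id.split("!", 1)[0]
    return zlib.crc32(shard_key.encode("utf-8")) % num_shards


def query_params(q, shard_key):
    """Build query parameters that restrict a search to one shard key."""
    return urlencode({"q": q, "shard.keys": f"{shard_key}!"})


ids = [routed_id("tenantA", n) for n in ("doc1", "doc2", "doc3")]
shards = {pick_shard(doc_id, num_shards=4) for doc_id in ids}  # a single shard
params = parse_qs(query_params("title:solr", "tenantA"))
```

Because every `tenantA!...` id maps to the same shard, a query carrying `shard.keys=tenantA!` only needs to visit that shard.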




Looking ahead

• Automatic shard splitting
• Query parsing: rich query tree control via JSON/XML

• “Schemaless”
    - Marketing term meaning convention over configuration for fields


• More programmatic control over system

• Continually improving performance, scalability, and
  robustness


• Want to learn more?

     • Join us in San Diego April 29 – May 2, 2013

     • http://lucenerevolution.org/

     • http://lucenerevolution.org/2013/agenda




Resources

• Lucene/Solr
    - http://lucene.apache.org


• Me
    - @gsingers, grant@lucidworks.com
    - http://www.manning.com/ingersoll


• Company
    - http://www.lucidworks.com
    - http://www.searchhub.org
    - Products, Support, Training on and around Lucene and Solr



Confidential and Proprietary
© 2012 LucidWorks

More Related Content

What's hot

Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresmkorremans
 
Learning Oracle with Oracle VM VirtualBox Whitepaper
Learning Oracle with Oracle VM VirtualBox WhitepaperLearning Oracle with Oracle VM VirtualBox Whitepaper
Learning Oracle with Oracle VM VirtualBox WhitepaperLeighton Nelson
 
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...Frank Munz
 
Overview of some popular distributed databases
Overview of some popular distributed databasesOverview of some popular distributed databases
Overview of some popular distributed databasessagar chaturvedi
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersOwen O'Malley
 
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...Lucidworks
 
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous AvailabilityRamp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous AvailabilityPythian
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmaplucenerevolution
 
Lessons Learned from Dockerizing Spark Workloads
Lessons Learned from Dockerizing Spark WorkloadsLessons Learned from Dockerizing Spark Workloads
Lessons Learned from Dockerizing Spark WorkloadsBlueData, Inc.
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known FeaturesOracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known FeaturesTanel Poder
 
2008 2086 Gangler
2008 2086 Gangler2008 2086 Gangler
2008 2086 GanglerSecure-24
 
Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2PgTraining
 
Oracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with DockerOracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with DockerGuatemala User Group
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environmentlucenerevolution
 
Spil Games @ FOSDEM: Galera Replicator IRL
Spil Games @ FOSDEM: Galera Replicator IRLSpil Games @ FOSDEM: Galera Replicator IRL
Spil Games @ FOSDEM: Galera Replicator IRLspil-engineering
 
WebLogic on ODA - Oracle Open World 2013
WebLogic on ODA - Oracle Open World 2013WebLogic on ODA - Oracle Open World 2013
WebLogic on ODA - Oracle Open World 2013Michel Schildmeijer
 
01 upgrade to my sql8
01 upgrade to my sql8 01 upgrade to my sql8
01 upgrade to my sql8 Ted Wennmark
 
MySQL Performance - Best practices
MySQL Performance - Best practices MySQL Performance - Best practices
MySQL Performance - Best practices Ted Wennmark
 

What's hot (20)

Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-features
 
Learning Oracle with Oracle VM VirtualBox Whitepaper
Learning Oracle with Oracle VM VirtualBox WhitepaperLearning Oracle with Oracle VM VirtualBox Whitepaper
Learning Oracle with Oracle VM VirtualBox Whitepaper
 
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...
 
Overview of some popular distributed databases
Overview of some popular distributed databasesOverview of some popular distributed databases
Overview of some popular distributed databases
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop Clusters
 
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
 
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous AvailabilityRamp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
Ramp-Tutorial for MYSQL Cluster - Scaling with Continuous Availability
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
 
Lessons Learned from Dockerizing Spark Workloads
Lessons Learned from Dockerizing Spark WorkloadsLessons Learned from Dockerizing Spark Workloads
Lessons Learned from Dockerizing Spark Workloads
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known FeaturesOracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
 
2008 2086 Gangler
2008 2086 Gangler2008 2086 Gangler
2008 2086 Gangler
 
Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2
 
Oracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with DockerOracle WebLogic Server 12c with Docker
Oracle WebLogic Server 12c with Docker
 
Best Practices for Enterprise Continuous Delivery of Oracle Fusion Middlewa...
Best Practices for Enterprise Continuous Delivery of Oracle Fusion Middlewa...Best Practices for Enterprise Continuous Delivery of Oracle Fusion Middlewa...
Best Practices for Enterprise Continuous Delivery of Oracle Fusion Middlewa...
 
Database TCO
Database TCODatabase TCO
Database TCO
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environment
 
Spil Games @ FOSDEM: Galera Replicator IRL
Spil Games @ FOSDEM: Galera Replicator IRLSpil Games @ FOSDEM: Galera Replicator IRL
Spil Games @ FOSDEM: Galera Replicator IRL
 
WebLogic on ODA - Oracle Open World 2013
WebLogic on ODA - Oracle Open World 2013WebLogic on ODA - Oracle Open World 2013
WebLogic on ODA - Oracle Open World 2013
 
01 upgrade to my sql8
01 upgrade to my sql8 01 upgrade to my sql8
01 upgrade to my sql8
 
MySQL Performance - Best practices
MySQL Performance - Best practices MySQL Performance - Best practices
MySQL Performance - Best practices
 

Viewers also liked

Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopGrant Ingersoll
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and MahoutGrant Ingersoll
 
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Lucidworks
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Grant Ingersoll
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemGrant Ingersoll
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineGrant Ingersoll
 

Viewers also liked (12)

Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
Taming Text
Taming TextTaming Text
Taming Text
 

Similar to LuceneSolr Evolution and Revolution

Oslo Solr MeetUp March 2012 - Solr4 alpha
Oslo Solr MeetUp March 2012 - Solr4 alphaOslo Solr MeetUp March 2012 - Solr4 alpha
Oslo Solr MeetUp March 2012 - Solr4 alphaCominvent AS
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relationJay Bharat
 
From Lucene to Solr 4 Trunk
From Lucene to Solr 4 TrunkFrom Lucene to Solr 4 Trunk
From Lucene to Solr 4 Trunktdthomassld
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road maplucenerevolution
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Shalin Shekhar Mangar
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst AgainVarun Thacker
 
Ease of use in Apache Solr
Ease of use in Apache SolrEase of use in Apache Solr
Ease of use in Apache SolrAnshum Gupta
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered LuceneErik Hatcher
 
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchNetConstructor, Inc.
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrlucenerevolution
 
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, TruliaReal Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, TruliaLucidworks
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1Stefan Schmidt
 
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Sematext Group, Inc.
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr PerformanceLucidworks
 
"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - Chicago"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - ChicagoErik Hatcher
 

Similar to LuceneSolr Evolution and Revolution (20)

Solr 4
Solr 4Solr 4
Solr 4
 
Oslo Solr MeetUp March 2012 - Solr4 alpha
Oslo Solr MeetUp March 2012 - Solr4 alphaOslo Solr MeetUp March 2012 - Solr4 alpha
Oslo Solr MeetUp March 2012 - Solr4 alpha
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 
From Lucene to Solr 4 Trunk
From Lucene to Solr 4 TrunkFrom Lucene to Solr 4 Trunk
From Lucene to Solr 4 Trunk
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst Again
 
Ease of use in Apache Solr
Ease of use in Apache SolrEase of use in Apache Solr
Ease of use in Apache Solr
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered Lucene
 
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 
Building a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solrBuilding a near real time search engine & analytics for logs using solr
Building a near real time search engine & analytics for logs using solr
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, TruliaReal Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
Real Time Indexing and Search - Ashwani Kapoor & Girish Gudla, Trulia
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1
 
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
 
"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - Chicago"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - Chicago
 

More from Grant Ingersoll

Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopGrant Ingersoll
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Grant Ingersoll
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsGrant Ingersoll
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantGrant Ingersoll
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsGrant Ingersoll
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopGrant Ingersoll
 

More from Grant Ingersoll (11)

Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 


LuceneSolr Evolution and Revolution

  • 1. luceneSolr = new LuceneSolr(4.x) Grant Ingersoll CTO, LucidWorks Confidential © Copyright 2012
  • 2. Search is dead, long live search • Embrace fuzziness! • Search is a system building block • If the algorithms fit, use them! • Search use leads to search abuse - Denormalization frees your mind - Scoring is just a sparse matrix multiply http://cheezburger.com/5243950080 • Scoring features are everywhere Confidential and Proprietary 2 © 2012 LucidWorks
  • 3. Search (R)evolution • “T’ain’t your father’s search engine” - Non-free-text usages abound • NoSQL before NoSQL was cool - Many DB-like features • Flexibility during indexing and scoring • Finite State Transducers FTW! • Scale
  • 4. Agenda • What’s new in Lucene 4? • What’s new in Solr 4? • Sneak Peek: what’s ahead?
  • 5. Lucene 4
  • 6. Up and to the Right • http://people.apache.org/~mikemccand/lucenebench/indexing.html
  • 7. Lucene: Flexibility • Flexible Index Formats - New posting list codecs: Block, Simple Text, Append (HDFS..), etc. - Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads, which saves disk seeks • Pluggable Scoring - Decoupled from TF/IDF - Built-in alternatives include BM25 & DFR » http://en.wikipedia.org/wiki/Okapi_BM25 » http://terrier.org/docs/v3.5/dfr_description.html
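To make the "pluggable scoring" bullet concrete, here is a minimal sketch of the textbook Okapi BM25 score for a single query term. It is an illustration only, not Lucene's BM25 implementation; the function names and the default `k1`/`b` values are assumptions chosen for the example.

```python
import math


def bm25_idf(num_docs, doc_freq):
    """Inverse document frequency, in the always-positive log(1 + ...) form."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document with the classic BM25 formula.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document (in terms)
    avg_doc_len -- average document length in the collection
    num_docs    -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    k1, b       -- assumed defaults: term-frequency saturation and length normalization
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    tf_part = (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
    return bm25_idf(num_docs, doc_freq) * tf_part
```

Unlike raw TF/IDF, the term-frequency part saturates: each additional occurrence of the term adds less to the score than the previous one, and long documents are penalized through `b`.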
  • 8. Lucene: Speed and Memory • Native Near Real Time (NRT) support - Per segment - FieldCache can be controlled to only load new segments • Soft commit - Faster without fsync, allows quicker update visibility • DWPT (Document Writer per Thread) - Faster, more consistent indexing speed • Faster fuzzy & wildcard query processing - Higher performance searching • String -> BytesRef - Much improved data structure - … means less memory and less garbage collection effort
  • 9. BytesRef memory management improvements • On a Wikipedia index (11M documents) - Time to perform the first query with sorting (no warmup queries): Solr 3x: 13 seconds, Solr 4: 6 seconds. - Memory consumption: Solr 3x: 1,040M, Solr 4: 366M. - Number of objects on the heap: Solr 3x: 19.4M, Solr 4: 80K. No, that’s not a typo. - http://searchhub.org/2012/04/06/memory-comparisons-between-solr-3x-and-trunk/
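The figures on this slide are easier to compare as ratios. The short calculation below uses only the numbers quoted above; the dictionary keys are labels invented for the example.

```python
# Figures quoted on the slide for the 11M-document Wikipedia index.
solr3 = {"first_sorted_query_s": 13, "heap_mb": 1040, "heap_objects": 19_400_000}
solr4 = {"first_sorted_query_s": 6, "heap_mb": 366, "heap_objects": 80_000}

# Ratio of the Solr 3x value to the Solr 4 value for each measurement.
improvement = {name: solr3[name] / solr4[name] for name in solr3}

for name, factor in improvement.items():
    print(f"{name}: {factor:.1f}x lower in Solr 4")
```

In other words, the first sorted query is roughly twice as fast, heap use drops by a factor of about 2.8, and the object count drops by a factor of more than 240.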
  • 10. FuzzyQuery • http://people.apache.org/~mikemccand/lucenebench/Fuzzy2.html
  • 11. QPS (primary key lookup) • http://people.apache.org/~mikemccand/lucenebench/PKLookup.html
  • 12. Lucene: Features • Doc Values - Store data in column order - Tradeoffs when using vs. FieldCache - http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/ - http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene • DirectSpellChecker - No more sidecar index! • Geospatial improvements (more later)
  • 13. Solr 4
  • 14. Solr 4: Features • Search/Faceting/Relevance - New Relevance Function Queries (tf, df, others) - Pivot Faceting - Pseudo-join - DirectSpellChecker support - Improved Spatial (more later) • Indexing - New Update Processors, including scripting option - NRT • Other - DocTransformer pluggability - New Admin UI
  • 15. Geospatial improvements • Multiple values per field • Index shapes other than points (circles, polygons, etc.) • More complex interactions than point in a circle • Indexing: - "geo":"43.17614,-90.57341" - "geo":"Circle(4.56,1.23 d=0.0710)" - "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))" • Searching: - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)" - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
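As a rough sketch of how a client might assemble the `fq` filters shown on this slide, the helpers below build the filter string and URL-encode it as request parameters. The helper names and the `q` value in the usage note are assumptions; only the `field:"Intersects(shape)"` syntax comes from the slide.

```python
from urllib.parse import urlencode


def intersects_filter(field, shape):
    """Build a filter such as geo:"Intersects(<shape>)" for a spatial field."""
    return f'{field}:"Intersects({shape})"'


def select_params(query, field, shape):
    """URL-encode a main query plus one spatial filter query (hypothetical helper)."""
    return urlencode({"q": query, "fq": intersects_filter(field, shape)})
```

For example, `select_params("*:*", "geo", "-74.093 41.042 -69.347 44.558")` yields an encoded parameter string that could be appended to a select request; the quotes, parentheses, and spaces all need escaping, which is why building it by hand is error-prone.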
  • 17. SolrCloud • Distributed/sharded indexing & search - Auto distributes updates and queries to appropriate shards - Near Real Time (NRT) indexing capable • Dynamically scalable - New SolrCloud instances add indexing and query capacity • Reliable - No single point of failure - Transactions logged - Robust, automatic recovery • http://wiki.apache.org/solr/SolrCloud
  • 19. SolrCloud’s capabilities • Transaction log - All updates are added to the transaction log (tlog). The tlog provides durability for updates that have not yet been committed, peer syncing, and real-time get: retrieving a document by unique id is always up to date because it checks the tlog first, and it does not require opening a new searcher to see changes • Near Real Time (NRT) indexing - Soft commits make updates visible - Hard commits make updates durable • Durability - Updates to Solr may be in one of several states: buffered in memory; flushed but not committed or viewable; soft committed (flushed and viewable); or committed (durable) - The transaction log ensures data is not lost in any of these states if Solr crashes • Recovery - Solr uses the transaction log for recovery: on startup, Solr checks whether the tlog is in a committed state; if not, updates since the last commit are applied • Optimistic locking - Solr maintains a document version (_version_ field); updates can now specify _version_, and an update that specifies the wrong version will fail
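The interplay of the transaction log, real-time get, soft commits, and optimistic locking can be sketched with a toy in-memory model. This is not how Solr is implemented; `ToyStore` and all of its method names are invented for illustration. The handling of the special `_version_` values (0 for no check, 1 for "must exist", negative for "must not exist") follows the rules described in the Solr documentation rather than anything stated on the slide.

```python
class VersionConflict(Exception):
    """Raised when an update's version constraint is not satisfied."""


class ToyStore:
    def __init__(self):
        self.tlog = []           # append-only record of every accepted update
        self._latest = {}        # id -> (version, doc); what real-time get reads
        self._searchable = {}    # snapshot that searches see after a soft commit
        self._next_version = 2   # start above 1, because 1 has a special meaning

    def update(self, doc_id, doc, version=0):
        """Apply an update, enforcing the optimistic-locking constraint first."""
        current = self._latest.get(doc_id)
        if version > 1 and (current is None or current[0] != version):
            raise VersionConflict(f"{doc_id}: expected version {version}")
        if version == 1 and current is None:
            raise VersionConflict(f"{doc_id}: document must already exist")
        if version < 0 and current is not None:
            raise VersionConflict(f"{doc_id}: document must not exist")
        new_version = self._next_version
        self._next_version += 1
        self.tlog.append((new_version, doc_id, dict(doc)))
        self._latest[doc_id] = (new_version, dict(doc))
        return new_version

    def realtime_get(self, doc_id):
        """Always up to date: reads the latest logged update, committed or not."""
        return self._latest.get(doc_id)

    def search(self, doc_id):
        """Sees only what was visible at the last soft commit."""
        return self._searchable.get(doc_id)

    def soft_commit(self):
        """Make all updates so far visible to searches."""
        self._searchable = dict(self._latest)
```

A client that reads a document with `realtime_get`, modifies it, and writes it back with the version it read will either succeed or get a `VersionConflict` if another client updated the document in between, at which point it can re-read and retry.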
  • 20. SolrCloud details • “Leaders” and “replicas” - Leaders are automatically elected • A leader is just a replica with some coordination responsibilities for the associated replicas • If a leader goes down, one of the associated replicas is elected as the new leader • New nodes are automatically assigned a shard and role, and replicate/recover as needed • SolrJ’s CloudSolrServer • Replication in Solr 4 - Used for new and recovering replicas - Or for traditional master/slave configuration
  • 21. Solr as NoSQL • Characteristics - Non-traditional data stores - Not designed for SQL-type queries - Distributed fault-tolerant architecture - Document oriented, data format agnostic (JSON, XML, CSV, binary) • Update durability via transaction log • Real-time /get fetches latest version w/o hard commit • Versioning and optimistic locking - w/ Real Time GET, allows read/write/update w/o conflicts • Atomic updates - Can add/remove/change and increment a field in existing doc w/o re-indexing
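An atomic update is expressed by sending, for each field to change, a small map with a modifier instead of a plain value. The sketch below builds such a JSON body using the `set`, `add`, and `inc` modifiers; the helper name and the field names in the usage note are hypothetical, and the body is only constructed here, not sent anywhere.

```python
import json


def atomic_update(doc_id, set_fields=None, add_values=None, increments=None):
    """Build a JSON update body that changes fields of one existing document."""
    doc = {"id": doc_id}
    for field, value in (set_fields or {}).items():
        doc[field] = {"set": value}      # replace the field's value
    for field, value in (add_values or {}).items():
        doc[field] = {"add": value}      # append to a multi-valued field
    for field, amount in (increments or {}).items():
        doc[field] = {"inc": amount}     # add to a numeric field
    return json.dumps([doc])
```

For example, `atomic_update("book1", set_fields={"price": 9.99}, increments={"copies_sold": 1})` describes a price change and a counter bump without resending the rest of the document.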
  • 22. Distributed Key / Value Pair Database • Real-time Get combined with SolrCloud makes a very powerful key/value pair database - Durable (tlog) - Isolated (optimistic locking) - Redundant (SolrCloud replicas) - Distributed & scalable (billions of keys, SolrCloud sharding) - Efficient multi-tenant (SolrCloud document routing, Solr 4.1) - Fast (millisecond response time, Pulsing codec) - Real-time (tlog)
  • 23. Routing • Allows you to route documents and queries to a subset of shards • Provides efficient multi-tenancy • Indexing: - A shard key can be prepended to the unique document id: shard_key!unique_id - Documents with the same shard_key will reside on the same shard • Querying: shard.keys=shard_key1!... - Much more efficient than searching the entire collection
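The routing idea can be sketched in a few lines: the part of the id before the `!` decides the shard, so all documents sharing a shard key land together. The hash used below (`zlib.crc32`) is a stand-in chosen for the example and is not the hash Solr uses; both function names are hypothetical.

```python
import zlib


def composite_id(shard_key, unique_id):
    """Prepend the shard key to the document id: shard_key!unique_id."""
    return f"{shard_key}!{unique_id}"


def toy_shard_for(doc_id, num_shards):
    """Pick a shard from the shard-key prefix (stand-in hash, not Solr's)."""
    shard_key = doc_id.split("!", 1)[0]
    return zlib.crc32(shard_key.encode("utf-8")) % num_shards
```

With this scheme, `toy_shard_for(composite_id("tenantA", "1"), 4)` and `toy_shard_for(composite_id("tenantA", "2"), 4)` always agree, so a query restricted to tenantA only needs to visit one shard.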
  • 24. Looking ahead • Automatic shard splitting • Query parsing: rich query tree control via JSON/XML • “Schemaless” - Marketing term meaning convention over configuration for fields • More programmatic control over system • Continually improving performance, scalability, and robustness
  • 25. Want to learn more? • Join us in San Diego April 29 – May 2, 2013 • http://lucenerevolution.org/ • http://lucenerevolution.org/2013/agenda
  • 26. Resources • Lucene/Solr - http://lucene.apache.org • Me - @gsingers, grant@lucidworks.com - http://www.manning.com/ingersoll • Company - http://www.lucidworks.com - http://www.searchhub.org - Products, Support, Training on and around Lucene and Solr

Editor's Notes

  1. Search Abuse: Can discuss how I started out doing just free text, but then a curious thing happened: I started to see people using the engine for things like key/value, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage, and much, much more
  2. Search has added more DB features over the years. TED: We need to introduce the idea of *REVOLUTION* somewhere in here.
  3. Okapi BM25 & DFR (Divergence from Randomness)
  4. Power users are often more likely to recover. Tools for recovery: auto-suggest, related searches, spelling suggestions
  5. Characteristics. Conflicts from other clients
  6. Power users are often more likely to recover. Tools for recovery: auto-suggest, related searches, spelling suggestions
  7. Thanks, LinkedIn for sponsoring!