Solr, Lucene & Hadoop @ Etsy



Monday, May 14, 12
david@etsy
4 Years Lucene and Solr @ Etsy

History of Search @ Etsy
Hadoop + HBase Indexing (in development)
Replication

About Us

13MM Listings
39MM Unique Visitors
880K Shops / 150 Countries
100+ Engineers

Architecture Overview

Overview
                      Search       Web         Database
                     +n slaves    +n webs     +n db shards




                                 Memcached
                                  +n caches




Thrift

            Search                                Web

               slave                              web
                        query = hats for cats

               slave                              web
                        result = 402, 283, 837

           +n slaves                             +n webs




Hydration

    Database                 Web
      shard                  web
      shard                  web
    +n shards              +n webs

    Memcached
      cache
      cache
    +n caches
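
The Thrift and Hydration slides suggest that search returns only listing IDs and that the web tier then fills in the full records from Memcached and the DB shards. A minimal sketch of that flow follows, with plain dicts standing in for the caches and shards. The modulo shard lookup and the record fields are assumptions for illustration; the deck does not describe how Etsy maps an ID to a shard.

```python
def hydrate(ids, cache, shards):
    """Turn listing IDs from search into full records: cache first, then shards."""
    records = {}
    missing = []
    for listing_id in ids:
        if listing_id in cache:
            records[listing_id] = cache[listing_id]
        else:
            missing.append(listing_id)

    for listing_id in missing:
        # Hypothetical shard routing: simple modulo over the shard list.
        shard = shards[listing_id % len(shards)]
        row = shard.get(listing_id)
        if row is not None:
            cache[listing_id] = row  # warm the cache for the next request
            records[listing_id] = row

    # Preserve the ranking order returned by search; drop IDs with no record.
    return [records[i] for i in ids if i in records]


if __name__ == "__main__":
    cache = {402: {"id": 402, "title": "cat hat"}}
    shards = [{}, {283: {"id": 283, "title": "top hat"}}]
    print(hydrate([402, 283, 837], cache, shards))
```

Keeping the result in search order matters because the IDs arrive already ranked; hydration should not reorder them.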

The Results




History of Search

History of Search
2007
• 1 Million Listings
• A Single “Master” Postgres Database
• PHP > Twisted > Stored Proc > TSearch

History of Search
2008
• 2 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 4 Solr Slaves + 2

History of Search
2009
• 4 Million Listings
• A Single “Master” Postgres Database
• PHP > Solr
• 6 Solr Slaves + 2

History of Search
2010
• 7 Million Listings
• A Single “Master” Postgres Database
• PHP > Thrift > Solr
• 10 Solr Slaves + 1

History of Search
2011
• 10 Million Listings
• “Master” Postgres Database + DB SHARDS!
• PHP > Thrift > Solr
• 24 Solr Slaves + 1

Future of Search
2012
• ?? Million Listings
• MORE DB SHARDS!
• PHP > Thrift > Solr
• ?? Solr Slaves + 1 Master

What Did We Learn?

Lucene + Solr > TSearch
http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/

Love Lucene + Solr Trunk!


Run, Don’t Walk...



Deployinator
      Fork it: https://github.com/etsy/deployinator


Smoker


StatsD, Graph Everything!
Fork it: https://github.com/etsy/statsd


95th Percentile


start · build_query · perform_search · receive_search_ads ·
search_side_response · create_event_logger · set_tpl_vars · tpl_render ·
receive_search_ads_post_render
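
The checkpoint names above read like per-stage timers for a search request. As a rough sketch of how such stages could be reported to StatsD, the code below turns consecutive checkpoint timestamps into StatsD timer lines (`name:value|ms`) and sends them over UDP. The `search` metric prefix, the helper names and the sample timestamps are assumptions; the deck does not show Etsy's instrumentation code.

```python
import socket


def timer_lines(prefix, checkpoints):
    """Build one StatsD timer line per stage from (name, seconds) checkpoints.

    Each stage is timed as the gap since the previous checkpoint.
    """
    lines = []
    for (_, prev_t), (name, t) in zip(checkpoints, checkpoints[1:]):
        elapsed_ms = int(round((t - prev_t) * 1000))
        lines.append(f"{prefix}.{name}:{elapsed_ms}|ms")
    return lines


def send(lines, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; 8125 is the default StatsD port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical timestamps (seconds since request start).
    marks = [("start", 0.0), ("build_query", 0.002), ("perform_search", 0.050)]
    for line in timer_lines("search", marks):
        print(line)
```

Because timers are aggregated by StatsD, percentile graphs such as the 95th percentile can be derived per stage without extra work in the application.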

Solr Top Level Cache > Memcached

etsy-index.properties
    $ cat /search/data/person/index/etsy-index.properties
    #Tue Mar 27 13:05:51 EDT 2012
    max_update_time=2012-03-27T17:05:51.955Z

Check Index Size
Don’t Install if < 50% Current Size

Check if Index is Too Old
Don’t Update if > 10 Days Old
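
The two rules above, together with the `max_update_time` value stored in `etsy-index.properties`, can be sketched as simple guard functions. This is an illustration under assumptions: the function names are hypothetical, sizes are whatever unit the caller measures (bytes or document counts), and "too old" is read here as the age of `max_update_time` relative to the current time.

```python
from datetime import datetime, timedelta, timezone


def read_max_update_time(properties_text):
    """Parse max_update_time from the contents of a Java-style properties file."""
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key == "max_update_time":
            parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("max_update_time not found")


def safe_to_install(new_size, current_size, min_ratio=0.5):
    """Refuse to install a new index smaller than 50% of the current one."""
    return new_size >= current_size * min_ratio


def safe_to_update(max_update_time, now, max_age=timedelta(days=10)):
    """Refuse incremental updates on an index more than 10 days old."""
    return now - max_update_time <= max_age


if __name__ == "__main__":
    text = "#Tue Mar 27 13:05:51 EDT 2012\nmax_update_time=2012-03-27T17:05:51.955Z\n"
    stamp = read_max_update_time(text)
    print(stamp, safe_to_update(stamp, datetime.now(timezone.utc)))
```

Both checks are cheap sanity gates: a sudden drop in index size or a very stale index usually points to a broken build rather than a real change in the data.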

What Did We Learn?
Store Nothing


Keep Denormalized Data

DB Shard
DB Shard   >  PHP Denormalizer  >  JSON  >  Search Database
DB Shard
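
To make the denormalization step concrete, here is a small sketch of flattening rows that live on different shards into a single JSON document for indexing. The deck's denormalizer is written in PHP; Python is used here only for illustration, and every field name (`listing_id`, `title`, `shop_name`, `tags`) is hypothetical.

```python
import json


def denormalize(listing, shop, tags):
    """Flatten related rows into one self-contained JSON document."""
    doc = {
        "listing_id": listing["listing_id"],
        "title": listing["title"],
        "shop_name": shop["name"],  # copied in so the indexer needs no join
        "tags": sorted(tags),
    }
    return json.dumps(doc, sort_keys=True)


if __name__ == "__main__":
    listing = {"listing_id": 402, "title": "Hat for cats", "shop_id": 7}
    shop = {"shop_id": 7, "name": "CatHats"}
    print(denormalize(listing, shop, ["cat", "hat"]))
```

The point of the design is that the indexer reads one document per listing and never has to query several shards while building the index.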




Full Reindex  >  Apply Incremental  >  Install




Full Reindex  >  Apply Incremental  >  Apply Incremental  >  Install
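
The sequence above can be sketched as a full snapshot followed by one or more catch-up passes, each applying only the changes newer than the last applied update time. The data shapes are assumptions made for this example: the index is a dict of `id -> document`, each change is a `(modified_time, id, document)` tuple, and a `None` document means a delete.

```python
def apply_incremental(index, changes, since):
    """Apply changes newer than `since` in time order; return the new high-water mark."""
    newest = since
    for modified, doc_id, doc in sorted(changes, key=lambda c: c[0]):
        if modified <= since:
            continue  # already reflected in the index
        if doc is None:
            index.pop(doc_id, None)  # delete
        else:
            index[doc_id] = doc  # insert or overwrite
        newest = max(newest, modified)
    return newest


if __name__ == "__main__":
    index = {1: "a", 2: "b"}                      # result of the full reindex
    mark = apply_incremental(index, [(5, 2, "b2"), (7, 3, "c")], since=4)
    mark = apply_incremental(index, [(9, 1, None)], since=mark)
    print(index, mark)                            # ready to install
```

A second incremental pass is useful because the first one takes time to run, and new changes keep arriving while it is being applied.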




[Diagram: Indexer, Database]




HBase + Hadoop

HBase + Hadoop
Why HBase?


HBase + Hadoop

DB Shard
DB Shard   >  PHP Denormalizer  >  JSON  >  HBase
DB Shard




HBase + Hadoop

               listings_denormalized
              {NAME => 'listings_denormalized', FAMILIES
              => [{NAME => 'listing_data', BLOOMFILTER =>
              'ROW', REPLICATION_SCOPE => '0',
              COMPRESSION => 'SNAPPY', VERSIONS => '1',
              TTL => '-1', BLOCKSIZE => '65536',
              IN_MEMORY => 'false', BLOCKCACHE =>




HBase + Hadoop

               listings_denormalized_modified_index
              {NAME =>
              'listings_denormalized_modified_index',
              FAMILIES => [{NAME => 'pks', BLOOMFILTER
              => 'ROW', REPLICATION_SCOPE => '0',
              COMPRESSION => 'SNAPPY', VERSIONS => '1',
              TTL => '-1', BLOCKSIZE => '65536',




HBase + Hadoop

SOLR-1301
https://issues.apache.org/jira/browse/SOLR-1301

HBase + Hadoop

Solr Output Format  >  Disk, HDFS
• Solr Document Converter
• Solr Requires

HBase + Hadoop

• Not Great with Multi-Core Configs
• Added Solr Multi-Core Support
• Solr Config Issues
• Added ENV support

HBase + Hadoop

SolrInputDocumentWritable
    public class SolrInputDocumentWritable extends SolrInputDocument
        implements org.apache.hadoop.io.Writable {



HBase + Hadoop




                     Oozie


HBase + Hadoop



                     Oozie + HBase?


HBase + Hadoop

ScanStringGenerator
http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/

HBase + Hadoop

    Hadoop                      Indexer
    Oozie                       Start
    Map            HBase        Copy
    Reduce         HDFS         Merge
    Solr Output    Disk         Install




HBase + Hadoop

IndexerActionMain

HBase + Hadoop




                     Deployinator


HBase + Hadoop




                     IndexCompare


HBase + Hadoop

    $ ./compare

    ERROR: please provide two index directories

    example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
    options:
        -p --percent= percent of the index to check
        -i --id=      primary key id field in the index
        -h --hash=    comparison or hash field in the index
        <index> <index>




HBase + Hadoop
      $ ./compare \
      /search/data/person/index-1332867952588/ \
      /search/data/person/index-1335378487672

        id field: user_id
      hash field: hash
      percentage: 0.0010
           files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672

      /search/data/person/index-1332867952588 contains 1515512 docs
      /search/data/person/index-1335378487672 contains 14837972 docs
      1516 of 1516 documents are the same
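
The output above suggests that the tool samples a fraction of the documents in one index and compares a hash field, keyed by the ID field, against the other index. The sketch below shows that idea with plain dicts of `id -> hash` standing in for the two Lucene indexes. It is not the IndexCompare source; the function name, the fixed seed and the choice to sample from the first index are assumptions. As a sanity check on the numbers shown, 1,515,512 × 0.0010 rounds to 1516, the sample size in the output.

```python
import random


def compare_sample(old, new, fraction, seed=0):
    """Compare a random sample of documents by hash.

    old, new: dicts mapping document ID -> hash value.
    Returns (number_matching, number_checked).
    """
    if not old:
        return 0, 0
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    sample_size = min(len(old), max(1, round(len(old) * fraction)))
    sample = rng.sample(sorted(old), sample_size)
    same = sum(1 for doc_id in sample if new.get(doc_id) == old[doc_id])
    return same, sample_size


if __name__ == "__main__":
    old = {i: f"hash-{i}" for i in range(1000)}
    new = dict(old)
    same, checked = compare_sample(old, new, 0.01)
    print(f"{same} of {checked} documents are the same")
```

Sampling keeps the check fast enough to run before every install, at the cost of only catching differences that affect a noticeable share of documents.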




HBase + Hadoop




                     Copy and Merge


HBase + Hadoop




                     Open Source


Replication

Replication




Replication

                              Slaves

                     Master




                              +n slaves




BitTorrent Replication

BitTorrent

  Using BitTornado:




Replication
BitTorrent + Solr




Replication
BitTorrent + Solr




Replication

Fork of TTorrent: https://github.com/etsy/ttorrent
• Multi-File Support
• Large File Support

Fork BitTorrent: Coming Soon




Need a job?

Thanks!

david@etsy

 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 

Recently uploaded (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Solr, Lucene and Hadoop @ Etsy

  • 1. Solr, Lucene & Hadoop @ Etsy (Monday, May 14, 2012)
  • 2. david@etsy: 4 Years Lucene and Solr @ Etsy
  • 3. History of Search @ Etsy; Hadoop + HBase Indexing (in development); Replication
  • 4. About Us
  • 8. 13MM Listings; 39MM Unique Visitors; 880K Shops / 150 Countries; 100+ Engineers
  • 9. Architecture Overview
  • 10. Overview: Search (+n slaves); Web (+n webs); Database (+n db shards); Memcached (+n caches)
  • 11. Thrift: Search (slave, slave, +n slaves); Web (web, web, +n webs); query = hats for cats; result = 402, 283, 837
  • 12. Hydration: Database (shard, shard, +n shards); Web (web, web, +n webs); Memcached (cache, cache, +n caches)
  • 14. History of Search
  • 15. History of Search, 2007: 1 Million Listings; A Single “Master” Postgres Database; PHP > Twisted > Stored Proc > TSearch
  • 16. History of Search, 2008: 2 Million Listings; A Single “Master” Postgres Database; PHP > Solr; 4 Solr Slaves + 2
  • 17. History of Search, 2009: 4 Million Listings; A Single “Master” Postgres Database; PHP > Solr; 6 Solr Slaves + 2
  • 18. History of Search, 2010: 7 Million Listings; A Single “Master” Postgres Database; PHP > Thrift > Solr; 10 Solr Slaves + 1
  • 19. History of Search, 2011: 10 Million Listings; “Master” Postgres Database + DB SHARDS!; PHP > Thrift > Solr; 24 Solr Slaves + 1
  • 20. Future of Search, 2012: ?? Million Listings; MORE DB SHARDS!; PHP > Thrift > Solr; ?? Solr Slaves + 1 Master
  • 21. What Did We Learn?
  • 22. Lucene + Solr > TSearch: http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/
  • 23. Love Lucene + Solr Trunk!
  • 24. Run, Don’t Walk...
  • 25. Deployinator. Fork it: https://github.com/etsy/deployinator
  • 27. StatsD, Graph Everything! Fork it: https://github.com/etsy/statsd
  • 30. start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render
  • 31. Solr Top Level Cache > Memcached
  • 32. etsy-index.properties
        $ cat /search/data/person/index/etsy-index.properties
        #Tue Mar 27 13:05:51 EDT 2012
        max_update_time=2012-03-27T17:05:51.955Z
  • 33. Check Index Size: Don’t Install if < 50% Current Size
  • 34. Check if Index is Too Old: Don’t Update if > 10 Days Old
  • 35. What Did We Learn? Store Nothing
  • 36. Keep Denormalized Data
  • 37. DB Shards > PHP Denormalizer > JSON > Search Database
  • 38. Full Reindex > Install > Apply Incremental
  • 39. Full Reindex > Install > Apply Incremental > Apply Incremental
  • 40. Database, Indexer
  • 41. HBase + Hadoop
  • 42. HBase + Hadoop: Why HBase?
  • 43. HBase + Hadoop: DB Shards > PHP Denormalizer > JSON > HBase
  • 44. HBase + Hadoop: listings_denormalized
        {NAME => 'listings_denormalized', FAMILIES => [{NAME => 'listing_data', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>
  • 45. HBase + Hadoop: listings_denormalized_modified_index
        {NAME => 'listings_denormalized_modified_index', FAMILIES => [{NAME => 'pks', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536',
  • 46. HBase + Hadoop: SOLR-1301, https://issues.apache.org/jira/browse/SOLR-1301
  • 47. HBase + Hadoop: Solr Output Format; Solr Document Converter; HDFS; Disk; Solr Requires
  • 48. HBase + Hadoop: Not Great with Multi-Core Configs; Added Solr Multi-Core Support; Solr Config Issues; Added ENV support
  • 49. HBase + Hadoop: SolrInputDocumentWritable
        public class SolrInputDocumentWritable extends SolrInputDocument implements org.apache.hadoop.io.Writable {
  • 50. HBase + Hadoop: Oozie
  • 51. HBase + Hadoop: Oozie + HBase?
  • 52. HBase + Hadoop: ScanStringGenerator, http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/
  • 53. HBase + Hadoop: Hadoop Indexer Oozie Start Map HBase Copy Reduce HDFS Merge Solr Disk Install Output
  • 54. HBase + Hadoop: IndexerActionMain
  • 55. HBase + Hadoop: Deployinator
  • 56. HBase + Hadoop: IndexCompare
  • 57. HBase + Hadoop
        $ ./compare
        ERROR: please provide two index directories
        example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
        options:
          -p --percent=  percent of the index to check
          -i --id=       primary key id field in the index
          -h --hash=     comparison or hash field in the index
          <index> <index>
  • 58. HBase + Hadoop
        $ ./compare /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672
        id field: user_id
        hash field: hash
        percentage: 0.0010
        files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672
        /search/data/person/index-1332867952588 contains 1515512 docs
        /search/data/person/index-1335378487672 contains 14837972 docs
        1516 of 1516 documents are the same
  • 59. HBase + Hadoop: Copy and Merge
  • 60. HBase + Hadoop: Open Source
  • 63. Replication: Slaves (+n slaves); Master
  • 65. BitTorrent Replication
  • 66. BitTorrent: Using BitTornado:
  • 67. Replication: BitTorrent + Solr
  • 68. Replication: BitTorrent + Solr
  • 71. Replication: Fork of TTorrent: https://github.com/etsy/ttorrent; Multi-File Support; Large File Support; Fork BitTorrent: Coming Soon
  • 72. Need a job?
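
The index safety checks on slides 32 to 34 can be sketched roughly as follows. This is a minimal illustration, not Etsy's actual code: the function names, the byte-size inputs, and the reading of "too old" as "max_update_time is more than 10 days before now" are all assumptions. Only the `max_update_time` key, the 50% threshold, and the 10-day limit come from the slides.

```python
from datetime import datetime, timedelta, timezone


def parse_max_update_time(properties_text):
    """Return max_update_time from properties-style text as an aware UTC datetime."""
    for line in properties_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are comments (the timestamp header on slide 32).
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        if key.strip() == "max_update_time":
            # Assumption: some writers escape ':' as '\:'; undo that if present.
            value = value.strip().replace("\\:", ":")
            parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("max_update_time not found")


def size_check_passes(new_size_bytes, current_size_bytes, min_ratio=0.5):
    """Slide 33: refuse to install an index under min_ratio of the current size."""
    return new_size_bytes >= min_ratio * current_size_bytes


def age_check_passes(max_update_time, now, max_age=timedelta(days=10)):
    """Slide 34 (one reading): refuse to update an index older than max_age."""
    return now - max_update_time <= max_age


# Hypothetical usage with the file contents shown on slide 32.
props = "#Tue Mar 27 13:05:51 EDT 2012\nmax_update_time=2012-03-27T17:05:51.955Z\n"
stamp = parse_max_update_time(props)
ok_to_install = size_check_passes(new_size_bytes=60, current_size_bytes=100)
ok_to_update = age_check_passes(stamp, now=stamp + timedelta(days=3))
```

Keeping the two checks as separate pure functions means each threshold can be tested and tuned on its own; how the real pipeline measures index size or reacts to a failed check is not stated in the slides.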