ElasticSearch in Production: lessons learned

With ProQuest Udini, we have created the world's largest online article store, and aim to be the center for researchers all over the world. We connect to a 700M SOLR cluster for search, but have recently also implemented a search component with ElasticSearch. We will discuss how we did this, and how we want to use the 30M index for scientific citation recognition. We will highlight lessons learned in integrating ElasticSearch in our virtualized EC2 environments, and challenges aligning with our continuous deployment processes.

  1. ElasticSearch in Production: lessons learned
     Anne Veling, ApacheCon EU, November 6, 2012
  2. agenda
     • Introduction
     • ElasticSearch
     • Udini
     • Upcoming Tool
     • Lessons Learned
  3. introduction
     • Anne Veling, @anneveling
     • Self-employed contractor
         • Software Architect
         • Agile process management
         • Performance optimization
         • Lucene/SOLR/ElasticSearch implementations & training
  4. ElasticSearch
     • Apache Lucene
     • Started in 2010 by Shay Banon
     • Open Source, Apache License
     • A company was formed in 2012: ElasticSearch
         • Training, support and development
         • Careful feature development vs. build because you can
  5. ElasticSearch
     • Scalable
         • Distributed, Node Discovery
         • Automatic sharding
         • Query distribution
     • RESTful, HTTP API
         • With API wrappers for Ruby, Java, Scala, …
         • JSON in, JSON out
     • Document Model
         • Maps “schemaless” -> field type recognition
         • Keeps source, keeps “version” number
  6. ElasticSearch
     • Integrated faceting
         • With statistical aggregates (sum/avg/…) for free
     • Field types and analyzers
         • String, numerics, geo, attachment, …
         • Arrays, subdocuments, nested documents
     • Integrated sharding
         • Routing and alias
         • Cross-index searching / multi-document type
  7. udini.proquest.com
     • ProQuest
     • The World’s Article Store
     • Stack
         • Amazon EC2
         • Scala with Unfiltered
         • MongoDB, ElasticSearch
  8. architecture
     [Diagram; labels: Fulfillment, Udini, Summon, Providers, PDF pipeline, mongo, Elastic, SOLR, DB, Search]
  9. SOLR at Udini
     • Connecting to Summon API
         • 700M SOLR Cluster
     • In Udini, we serve a subset of 160M full text articles
         • Including fulfillment mechanisms
         • PDF and HTML5 viewing and annotation
  10. ElasticSearch at Udini
      • Local index to search your articles
      • Many small user libraries, searching only locally
          • User-id as sharding key
          • Include key in all queries
  11. Exciting new product
      • Developing for ProQuest
      • Exciting new research tool for scientific researchers
      • Creating a large ElasticSearch index for journal article canonicalization
      • Currently in private beta, launching in the coming months
  12. Lessons Learned
      • Very fast indexing
          • Bulk indexing ftw
          • Set up without replicas (replicas = 0, not 1)
          • Play with bulk size
          • Simply writing batches to disk and CURLing them in is very fast
          • 1M records in 40s

          for f in "${BATCH_DIR}"/batch-*.json
          do
            echo "about to index $f"
            curl --silent --show-error --request POST --data-binary "@$f" localhost:9200/_bulk > /dev/null
            echo
          done
  13. Lessons learned
      • Schema(less)?
      • Automatic field type recognition
          • Can miss types
          • Strict about types #duh
      • Mapping of subfields (doc.title vs doc.publication.title)
          • Version dependent
      • In reality
          • Schema still needed
          • Mapping changes still non-trivial
  14. Lessons learned
      • Learn to trust ElasticSearch
          • Analyzers: do not pretokenize queries yourself…
      • Difference between “term” and “text” type queries
          • tokenized or not
      • ElasticSearch probably already does what you want it to do
          • Search for it
          • Try it
  15. Lessons learned
      • Issues with automated testing and node discovery/startup
      • Start/stop hundreds of times during Jenkins test jobs or development boxes
          • Takes time
          • Locally sometimes picks up previous versions
      • Memory issues: ElasticSearch manages a large part of its memory outside of the heap
          • Do not simply increase -Xmx
  16. Lessons learned
      • New tools every month
      • waitForYellowStatus
      • Aliases, routing allow for clever control
  17. API
      • ElasticSearch is new, connection libraries still in infancy, documentation growing
      • Issues using the Java API in Scala
      • Happy with Scalastic now
          • synchronous, asynchronous, bulk, prepare
  18. #nodb
      • ElasticSearch used as a full NoSQL datastore?
      • Using “version” and optimistic locking scheme
      • Could replace MongoDB in our setup
      • ElasticSearch is actually a store optimized for getting stuff out, not for getting stuff in
          • With free faceting
          • Who needs multi-table transactions anyway?
  19. SOLR vs ElasticSearch
      • SOLR
          • Well-known, many tools, extensions
          • Feels clunky to configure
          • Manual document to lucene mapping
          • Replication and indexing in a cluster non-trivial
          • Old school ;-)
      • ElasticSearch
          • New kid on the block
          • Very easy to configure
          • Handles document to lucene mapping
          • Horizontally scalable
          • Easy replication
          • But: shard key
          • New school
  20. search evolution • Custom indexers • Inverted index • Segment merges • Custom analyzers • Faceting • Configuration of analyzers • Faceting, Geospatial • Document mapping • Sub-document queries • Replication • JSON document input • Faceting, complex queries just work
  21. conclusions
      • ElasticSearch benefits
          • Easy to set up
          • Very clever architecture
      • Drawbacks
          • Very new software, tool support limited
          • But lots of movement
          • Change sharding in a full index non-trivial
      • ElasticSearch
          • Clever architecture, fast, stable
          • Does exactly what you need
  22. thank you
      • Are you still using Solr?
      • Come on, it’s 2012 already ;-)
      • @anneveling
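
Slide 10 mentions using the user-id as sharding key and including that key in all queries. A minimal sketch of what such requests could look like, assuming a hypothetical index called "libraries", a hypothetical "user_id" field, and the 2012-era /index/type/id URL layout (none of these names come from the talk). It only builds the request path and body; it does not talk to a cluster.

```python
import json
from urllib.parse import quote, urlencode


def index_path(user_id, doc_id, index="libraries", doc_type="article"):
    """Path for indexing one document, routed by the owning user (names are hypothetical)."""
    return f"/{quote(index)}/{quote(doc_type)}/{quote(doc_id)}?" + urlencode({"routing": user_id})


def user_search_request(user_id, text, index="libraries"):
    """Return (path, body) for a search limited to one user's library."""
    # routing restricts the search to the shard that holds this user's documents.
    path = f"/{quote(index)}/_search?" + urlencode({"routing": user_id})
    body = {
        "query": {
            "bool": {
                "must": [
                    # Still required: several users can share the same shard.
                    {"term": {"user_id": user_id}},
                    {"query_string": {"query": text}},
                ]
            }
        }
    }
    return path, json.dumps(body)


path, body = user_search_request("u42", "lucene")
print(index_path("u42", "doc-1"))
print(path)
print(body)
```

The routing value only narrows which shard is searched, so the sketch keeps an explicit term clause on the user id as well.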
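
Slide 12 describes writing batches to disk and posting them to the _bulk endpoint. The batch files themselves are not shown in the talk; the following is a sketch of how they might be produced, assuming a hypothetical index "articles" and type "article". The _bulk format pairs each document with an action line, and every line has to end with a newline. The "_type" field reflects 2012-era ElasticSearch; newer versions no longer use document types.

```python
import json
from itertools import islice
from pathlib import Path
from tempfile import TemporaryDirectory


def bulk_lines(docs, index, doc_type):
    """Yield the two lines the _bulk endpoint expects per document."""
    for doc in docs:
        # Action line: says where the following source line should be indexed.
        yield json.dumps({"index": {"_index": index, "_type": doc_type}})
        # Source line: the document itself, serialized on a single line.
        yield json.dumps(doc)


def write_batches(docs, batch_dir, batch_size, index="articles", doc_type="article"):
    """Write docs to batch-NNNNN.json files holding at most batch_size documents each."""
    batch_dir = Path(batch_dir)
    batch_dir.mkdir(parents=True, exist_ok=True)
    docs = iter(docs)
    paths = []
    while True:
        chunk = list(islice(docs, batch_size))
        if not chunk:
            break
        path = batch_dir / f"batch-{len(paths):05d}.json"
        # Every line, including the last one, is terminated by a newline.
        path.write_text("".join(line + "\n" for line in bulk_lines(chunk, index, doc_type)))
        paths.append(path)
    return paths


with TemporaryDirectory() as tmp:
    sample = [{"title": f"Article {i}"} for i in range(5)]
    written = write_batches(sample, tmp, batch_size=2)
    batch_names = [p.name for p in written]
    first_batch = written[0].read_text()

print(batch_names)
print(first_batch, end="")
```

The batch_size argument is the knob behind "play with bulk size": the file names match the batch-*.json pattern that the curl loop on the slide iterates over.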
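
Slide 14 contrasts “term” and “text” type queries (tokenized or not). The toy inverted index below illustrates that difference only; it is not ElasticSearch code, and the analyzer is a deliberately simple stand-in (lowercase, split on non-alphanumerics).

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase, then split on anything that is not a-z or 0-9."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of documents containing it

    def add(self, doc_id, text):
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def term_query(self, term):
        # Not analyzed: the term must equal an indexed token exactly.
        return set(self.postings.get(term, set()))

    def text_query(self, text):
        # Analyzed: the query string goes through the same analyzer as the documents.
        # Documents matching any of the resulting tokens are returned.
        hits = set()
        for token in analyze(text):
            hits |= self.postings.get(token, set())
        return hits


index = ToyIndex()
index.add(1, "ElasticSearch in Production")
index.add(2, "Lessons learned with Lucene")

term_hits_raw = index.term_query("ElasticSearch")    # mixed case is never an indexed token
term_hits_token = index.term_query("elasticsearch")  # matches the analyzed token
text_hits = index.text_query("ElasticSearch Lucene")

print(term_hits_raw, term_hits_token, text_hits)
```

This is also why pretokenizing a query yourself tends to backfire: the analyzed query type already applies the index's own analysis, while the term type expects input that looks exactly like an indexed token.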
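
Slide 18 mentions using “version” and an optimistic locking scheme. The sketch below shows the general scheme with an in-memory stand-in for the datastore; VersionedStore and update_with_retry are hypothetical names, not part of any ElasticSearch client. The idea is that a write carries the version the writer last read and is rejected if the document has changed in the meantime.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """In-memory stand-in for a document store that keeps a version per document."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if someone else changed the document since it was read.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"{doc_id}: expected version {expected_version}, found {current_version}"
            )
        self._docs[doc_id] = (current_version + 1, dict(source))
        return current_version + 1


def update_with_retry(store, doc_id, change, attempts=3):
    """Read, modify, write; on a conflict re-read and try again."""
    for _ in range(attempts):
        version, source = store.get(doc_id)
        try:
            return store.put(doc_id, change(dict(source)), expected_version=version)
        except VersionConflict:
            continue
    raise VersionConflict(f"{doc_id}: gave up after {attempts} attempts")


store = VersionedStore()
store.put("a1", {"reads": 0})
stale_version, _ = store.get("a1")
store.put("a1", {"reads": 1}, expected_version=stale_version)  # another writer gets in first

try:
    store.put("a1", {"reads": 99}, expected_version=stale_version)
    conflict = False
except VersionConflict:
    conflict = True

final_version = update_with_retry(store, "a1", lambda doc: {**doc, "reads": doc["reads"] + 1})
print(conflict, final_version, store.get("a1"))
```

No locks are held between the read and the write, which is what makes the scheme optimistic: conflicts are detected at write time and handled by retrying.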