Elasticsearch has quickly become a leading open source technology for scaling search and building document services. Many software providers rely on it to serve the needs of high-performance, production applications.
In this talk, we'll go deep on lessons learned from three years in production, scaling from a few shards to more than 100 spread across hundreds of nodes on AWS, to serve real-time queries against hundreds of millions of documents.
Attendees will learn:
* How to capacity plan for ES on AWS
* How to scale and reshard on AWS with zero downtime
* What AWS and ES metrics to collect and alert on
* Tips on day to day ES operations
Session sponsored by SignalFx.
2. What to Expect from the Session
• Elasticsearch (ES) usage at SignalFx
• What we use ES for
• How ES is deployed on AWS
• Backup/restore of ES on Amazon S3
• Important ES/AWS metrics to monitor; what to alert on
• ES capacity planning
• Zero-downtime re-sharding
• SignalFx metadata storage architecture overview
• Scaling up and zero-downtime re-sharding on AWS
10. Key Detectors
• High CPU usage, low free disk space
• Sustained high heap usage
• Master node availability
• Cluster state (green/yellow/red)
• Unassigned shards
• Thread pool rejections (search, bulk, and index are the most critical)
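The detectors above can be sketched as a simple evaluation function over a cluster-health snapshot. This is a hypothetical illustration: the thresholds and the shape of the metric arguments are made-up assumptions, not SignalFx's actual detector configuration, though the `status` and `unassigned_shards` fields match what the `_cluster/health` API returns.

```python
# Hypothetical sketch: derive alert conditions (like the detectors above)
# from a _cluster/health response plus node-level metrics.
# Thresholds below are illustrative assumptions.

def evaluate_health(health, heap_used_pct, cpu_pct, disk_free_pct):
    """Return the list of alert names that should fire for this snapshot."""
    alerts = []
    if cpu_pct > 90:
        alerts.append("high-cpu")
    if disk_free_pct < 15:
        alerts.append("low-disk")
    if heap_used_pct > 85:          # sustained high heap usage
        alerts.append("high-heap")
    if health["status"] != "green":  # cluster state green/yellow/red
        alerts.append("cluster-" + health["status"])
    if health.get("unassigned_shards", 0) > 0:
        alerts.append("unassigned-shards")
    return alerts
```

In practice each condition would be a separate detector with its own duration window, rather than a single point-in-time check.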
13. Capacity Factors
• Indexing
• CPU/IO utilization can be considerable
• Merges are CPU/IO intensive. Improved in ES 2.0
• Queries
• CPU load
• Memory load
15. Sizing Shards
• Create an index with one shard
• Simulate what you expect your indexing load to be; measure CPU/IO load, find where it breaks
• Do the same with queries
• Determine disk consumption (average document size)
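Once the single-shard benchmark gives you a breaking point, the shard count falls out of simple arithmetic. A minimal sketch, assuming you size for both indexing throughput and on-disk volume (all the input numbers in the example are made up):

```python
# Back-of-the-envelope shard sizing from single-shard benchmark results,
# following the steps above.
import math

def shards_needed(target_docs_per_s, max_docs_per_s_per_shard,
                  total_docs, avg_doc_bytes, max_shard_bytes):
    """Take the larger of the indexing-driven and disk-driven shard counts."""
    for_indexing = math.ceil(target_docs_per_s / max_docs_per_s_per_shard)
    for_disk = math.ceil(total_docs * avg_doc_bytes / max_shard_bytes)
    return max(for_indexing, for_disk)

# Example: 50k docs/s target, one shard breaks at 8k docs/s,
# 500M docs averaging 2 KB, capping shards at 30 GiB of data each:
shards = shards_needed(50_000, 8_000, 500_000_000, 2_048, 30 * 2**30)
```

Repeat the same exercise for query load; whichever dimension demands the most shards wins.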
17. Why Re-shard?
• Required if you can’t scale up indexing by adding more nodes
• If the index is read-only, you could implement a simpler approach using aliases
• If the index is being written to, it’s more complicated
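For the read-only case, the alias approach boils down to: create a new index with the desired shard count, copy the documents over, then atomically swap the alias with a single `_aliases` call. A sketch of the swap body (the index and alias names here are made up):

```python
# Build the body for POST /_aliases. Putting the remove and add in one
# actions list makes the swap atomic: readers see either the old index
# or the new one, never neither.

def alias_swap_body(alias, old_index, new_index):
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# e.g. POST /_aliases with alias_swap_body("docs", "docs-v1", "docs-v2")
```

Clients only ever query the alias, so they never need to know a re-shard happened.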
35. Handling Failures
• Bulk re-indexing can fail (and it does); you don’t want to restart from scratch
• Use a “partition” field
• Migrate partition ranges
• Deletions can be a problem; we handle them by writing “deletion markers” instead of deleting, then cleaning up afterwards
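The partition-field idea above can be sketched as a resumable, range-by-range copy. This is an illustrative simulation, not the actual migration code: the partition count, range size, and document shape are assumptions.

```python
# Sketch of resumable bulk re-indexing using a "partition" field.
# Progress is tracked per range, so a failed run resumes at the last
# completed range instead of restarting from scratch.

def partition_ranges(num_partitions, range_size):
    """Yield (lo, hi) partition ranges to migrate one at a time."""
    for lo in range(0, num_partitions, range_size):
        yield lo, min(lo + range_size, num_partitions)

def migrate(docs, completed_ranges, copy_batch):
    """Copy docs range by range, skipping ranges already marked complete.
    Docs carrying a deletion marker are skipped (cleaned up later)."""
    for lo, hi in partition_ranges(1024, 128):
        if (lo, hi) in completed_ranges:
            continue  # already migrated before the failure
        batch = [d for d in docs
                 if lo <= d["partition"] < hi and not d.get("deleted")]
        copy_batch(batch)
        completed_ranges.add((lo, hi))
```

In the real system the range query would run against the source index and `copy_batch` would issue a bulk request to the target; persisting `completed_ranges` is what makes the restart cheap.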
36. Performance Considerations
• Migrate using partition ranges to avoid holding segments for a long time
• Add temporary nodes to handle the load
• Disable refreshes on the target index (well worth it!)
• Start with no replicas (or just one, to be safe)
• Avoid “hot” shards by sorting on a field (a timestamp, for example)
• Add throttling controls to limit indexing load
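The refresh and replica tweaks above map directly onto index settings you'd PUT during and after the migration. `refresh_interval` and `number_of_replicas` are real Elasticsearch index settings; the exact values are the ones the slide suggests, and the restore values assume the default refresh interval.

```python
# Index-settings bodies for PUT /<target-index>/_settings.

MIGRATION_SETTINGS = {
    "index": {
        "refresh_interval": "-1",   # disable refreshes during the bulk copy
        "number_of_replicas": 0,    # build replicas after the copy finishes
    }
}

RESTORE_SETTINGS = {
    "index": {
        "refresh_interval": "1s",   # back to the default
        "number_of_replicas": 1,
    }
}
```

Restoring the settings at the end triggers one replica build and one refresh instead of paying those costs continuously throughout the copy.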