SlideShare a Scribd company logo
1 of 71
Download to read offline
Large-Scale
Automated Storage
on Kubernetes
Matt Schallert @mattschallert
SRE @ Uber
About Me
● SRE @ Uber since 2016
● M3 - Uber’s OSS Metrics Platform
@mattschallert
Agenda
● Background: Metrics @ Uber
● Scaling Challenges
● Automation
● M3DB on Kubernetes
@mattschallert
2015
@mattschallert
Mid 2016
@mattschallert
● Cassandra stack scaled to needs but was expensive
○ Human cost: operational overhead
○ Infra cost: lots of high-IO servers
● Scaling challenges inevitable, needed to reduce
complexity
Mid 2016
@mattschallert
M3DB
● Open source distributed time series database
● Highly compressed (11x), fast, compaction-free storage
● Sharded, replicated (zone & rack-aware)
● Production late 2016
@mattschallert
M3DB
● Reduced operational overhead
● Runs on commodity hardware with SSDs
● Strongly consistent cluster membership backed by etcd
@mattschallert
m3cluster
● Reusable cluster management libraries created for M3DB
○ Topology representation, shard distribution algos
○ Service discovery, runtime config, leader election
○ Etcd (strongly consistent KV store) as source of truth
for all components
● Similar to Apache Helix
○ Helix is Java-heavy, we’re a Go-based stack
m3cluster
● Topologies represented as desired goal states, M3DB
converges
@mattschallert
m3cluster: Placement
@mattschallert
Late 2016
@mattschallert
Post-M3DB Deployment
● M3DB solved most pressing issue (Cassandra)
○ Improved stability, reduced costs
● System as a whole: still a high operational overhead
@mattschallert
Fragmented representation of components
@mattschallert
System Representation (2016)
@mattschallert
System Representation (2016)
System Representation (2016)
System Representation (2016)
System Representation (2016)
System Representation
● Systems represented as…
○ Lists fetched from Uber config service
○ Static lists deployed via Puppet
○ Static lists coupled with service deploys
○ m3cluster placements
@mattschallert
Fragmentation
● Had smaller components rather than one unifying view
of the system as a whole
● Difficult to reason about
○ “Where do I need to make this change?”
○ Painful onboarding for new engineers
● Impediment to automation
○ Replace host: update config, build N services, deploy
System Representation (2016)
System Representation: 2018
@mattschallert
m3cluster benefits
● Common libraries for all workloads: stateful (M3DB),
semi-stateful (aggregation tier), stateless (ingestion)
○ Implement “distribute shards across racks according
to disk size” once, reuse everywhere
● Single source of truth: everything stored in etcd
@mattschallert
Operational Improvements
● Easier to reason about the entire system
○ Unifying view: Placements
● Operations much easier, possible to automate
○ Just modifying goal state in etcd
● One config per deployment (no instance-specific)
○ Cattle vs. pets: instances aren’t special, treat them as
a whole
M3DB Goal State
● Shard assignments stored in etcd
○ M3DB nodes observe desired state, react, update
shard state
● Imperative actions such as “add a node” are really
changes in declarative state
○ Easy for both people or tooling to interact with
@mattschallert
M3DB: Adding a Node
@mattschallert
M3DB: Adding a Node
M3DB: Adding a Node
@mattschallert
M3DB: Adding a Node
@mattschallert
M3DB Goal State
● Goal-state primitives built with future automation in mind
● m3cluster interacts with single source of truth
● Vs. imperative changes:
○ No instance-level operations
○ No state or steps to keep track of: can always
reconcile state of world (restarts, etc. safe)
@mattschallert
System Representation: 2018
@mattschallert
M3DB Operations
● Retooled entire stack as goal state-driven, dynamic
○ Operations simplified, but still triggered by a person
● M3DB requires more care to manage than stateless
○ Takes time to stream data
@mattschallert
2016
Clusters
2
@mattschallert
2018
Clusters
40+
@mattschallert
Automating M3DB
● Wanted goal state at a higher level
○ Clusters as cattle
● M3DB can only do so much; needed to expand scope
● Chose to build on top of Kubernetes
○ Value for M3DB OSS community
○ Internal Uber use cases
@mattschallert
Kubernetes @ 35,000 ft
● Container orchestration system
● Support for stateful workloads
● Self-healing
● Extensible
○ CRD: define your own API resources
○ Operator: custom controller
@mattschallert
● Pod: base unit of work
○ Group of like containers deployed together
● Pods can come and go
○ Self-healing → pods move in response to failures
● No pods are special
Kubernetes Driving Principles
@mattschallert
Kubernetes Controllers
for {
desired := getDesiredState()
current := getCurrentState()
makeChanges(desired, current)
}
Writing Controllers Guide
M3DB & Kubernetes
M3DB:
● Goal state driven
● Single source of truth
● Nodes work to converge
Kubernetes:
● Goal state driven
● Single source of truth
● Components work to
converge
@mattschallert
M3DB Operator
● Manages M3DB clusters running on Kubernetes
○ Creation, deletion, scaling, maintenance
● Performs actions a person would have to do in full
cluster lifecycle
○ Updating M3DB clusters
○ Adding more instances
@mattschallert
M3DB Operator
● Users define desired M3DB clusters
○ “9 node cluster, spread across 3 zones, with remote
SSDs attached”
● Operator updates desired Kubernetes state, waits for
convergence, updates M3DB desired state
@mattschallert
M3DB Operator: Create Example
@mattschallert
@mattschallert
@mattschallert
@mattschallert
M3DB Operator: Day 2 Operations
● Scaling a cluster
○ M3DB adds 1-by-1, operation handles many
● Replacing instances, updating config, etc.
● All operations follow same pattern
○ Look at Kubernetes state, look at M3DB state, affect
change
@mattschallert
M3DB Operator: Success Factors
● Bridge between goal state-driven APIS (anti-footgun)
○ Can’t accidentally remove more pods than desired
○ Can’t accidentally remove two M3DB instances
○ Operator can be restarted, pick back up
● Share similar concepts, expanded scope
○ Failure domains, resource quantities, etc.
● Finding common abstractions made mental model easier
○ Implementation can be separate, interface the same
● + Goal-state driven approach simplified operations
○ External factors (failures, deploys) don’t disrupt steady
state
Lessons Learned: M3
@mattschallert
● Embracing Kubernetes principles in M3DB made
Kubernetes onboarding easier
○ No pods are special, may move around
○ Self-healing after failures
● Instances as cattle, not pets
○ No instance-level operations or config
Lessons Learned: Kubernetes
@mattschallert
● Any orchestration system implies complexity
○ Your own context: is it worth it?
○ How does your own system respond to failures,
rescheduling, etc.?
● Maybe not a good fit if have single special instances
○ M3DB can tolerate failures
Considerations
@mattschallert
● Requirements & guarantees of your system will influence
your requirements for automation
○ M3DB strongly consistent: wanted strongly
consistent cluster membership
Considerations
@mattschallert
● Building common abstractions for all our workloads
eased mental model
● Following goal-state driven approach eased automation
● Closely aligned with Kubernetes principles
Recap
@mattschallert
● github.com/m3db/m3
○ M3DB, m3cluster, m3aggregator
○ github.com/m3db/m3db-operator
● m3db.io & docs.m3db.io
● Please leave feedback on O’Reilly site!
Thank you!
@mattschallert
Pre-2015
Early 2015
M3DB Operator: Today
● [ THIS SLIDE IS SKIPPED ]
● https://github.com/m3db/m3db-operator
● Managing 1 M3DB clusters ~as easy as 10
Late 2015
Mid 2016
Post-M3DB Deployment
System Interfaces
System Interfaces
● Ingestion path used multiple text-based line protocols
○ Difficult to encode metric metadata
○ Redundant serialization costs
● Storage layers used various RPC formats
System Representation (2016)
M3DB Operator [SLIDE SKIPPED]
ROUGH DRAFT DIAGRAM
● [ Transition to Kubernetes, possibly using edge vs. level
triggered ]
○ Level triggered: reacting to current state. Actions
such as “replace this node” can’t be missed.
○ Edge triggered: reacting to events. Events can be
missed.
● [ Maybe something about idempotency requirement ? ]
Edge vs. Level Triggered Logic
m3cluster: Placement
message Instance {
string id
string isolation_group
string zone
uint32 weight
string endpoint
repeated Shard shards
...
}
message Placement {
map<string, Instance> instances
uint32 replica_factor
uint32 num_shards
bool is_sharded
}
m3cluster: Shard
enum ShardState {
INITIALIZING = 0;
AVAILABLE = 1;
LEAVING = 2;
}
message Shard {
uint32 id = 1;
ShardState state = 2;
string source_id = 3;
}
M3cluster: all
enum ShardState {
INITIALIZING = 0;
AVAILABLE = 1;
LEAVING = 2;
}
message Shard {
uint32 id = 1;
ShardState state = 2;
string source_id = 3;
}
message Instance {
string id = 1;
string isolation_group = 2;
string zone = 3;
uint32 weight = 4;
string endpoint = 5;
repeated Shard shards = 6;
...
}
message Placement {
map<string, Instance> instances = 1;
uint32 replica_factor = 2;
uint32 num_shards = 3;
bool is_sharded = 4;
}
● Consistency of operations
○ Many actors in the system
○ Failures can happen, may re-reconcile states
○ Kubernetes & m3cluster leverage etcd
Prerequisites for Orchestration
● [ Lessons learned orchestrating M3DB generally ]
● [ Learnings specific to Kubernetes ]
Lessons Learned

More Related Content

What's hot

Stateful streaming data pipelines
Stateful streaming data pipelinesStateful streaming data pipelines
Stateful streaming data pipelinesTimothy Farkas
 
MariaDB Auto-Clustering, Vertical and Horizontal Scaling within Jelastic PaaS
MariaDB Auto-Clustering, Vertical and Horizontal Scaling within Jelastic PaaSMariaDB Auto-Clustering, Vertical and Horizontal Scaling within Jelastic PaaS
MariaDB Auto-Clustering, Vertical and Horizontal Scaling within Jelastic PaaSJelastic Multi-Cloud PaaS
 
Cassandra - Tips And Techniques
Cassandra - Tips And TechniquesCassandra - Tips And Techniques
Cassandra - Tips And TechniquesKnoldus Inc.
 
Taskerman - a distributed cluster task manager
Taskerman - a distributed cluster task managerTaskerman - a distributed cluster task manager
Taskerman - a distributed cluster task managerRaghavendra Prabhu
 
PostgreSQL on AWS: Tips & Tricks (and horror stories)
PostgreSQL on AWS: Tips & Tricks (and horror stories)PostgreSQL on AWS: Tips & Tricks (and horror stories)
PostgreSQL on AWS: Tips & Tricks (and horror stories)Alexander Kukushkin
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedJ On The Beach
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...ScyllaDB
 
What Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoWhat Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoScyllaDB
 
Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management
Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster ManagementApache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management
Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster ManagementAnant Corporation
 
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...Anna Ossowski
 
FireEye & Scylla: Intel Threat Analysis Using a Graph Database
FireEye & Scylla: Intel Threat Analysis Using a Graph DatabaseFireEye & Scylla: Intel Threat Analysis Using a Graph Database
FireEye & Scylla: Intel Threat Analysis Using a Graph DatabaseScyllaDB
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
The architecture of SkySQL
The architecture of SkySQLThe architecture of SkySQL
The architecture of SkySQLMariaDB plc
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven MicroservicesFabrizio Fortino
 
Databases and how to choose them
Databases and how to choose themDatabases and how to choose them
Databases and how to choose themDatio Big Data
 
ScyllaDB @ Apache BigData, may 2016
ScyllaDB @ Apache BigData, may 2016ScyllaDB @ Apache BigData, may 2016
ScyllaDB @ Apache BigData, may 2016Tzach Livyatan
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...Codemotion Tel Aviv
 

What's hot (20)

Stateful streaming data pipelines
Stateful streaming data pipelinesStateful streaming data pipelines
Stateful streaming data pipelines
 
MariaDB Auto-Clustering, Vertical and Horizontal Scaling within Jelastic PaaS
MariaDB Auto-Clustering, Vertical and Horizontal Scaling within Jelastic PaaSMariaDB Auto-Clustering, Vertical and Horizontal Scaling within Jelastic PaaS
MariaDB Auto-Clustering, Vertical and Horizontal Scaling within Jelastic PaaS
 
Mongodb meetup
Mongodb meetupMongodb meetup
Mongodb meetup
 
Cassandra - Tips And Techniques
Cassandra - Tips And TechniquesCassandra - Tips And Techniques
Cassandra - Tips And Techniques
 
Taskerman - a distributed cluster task manager
Taskerman - a distributed cluster task managerTaskerman - a distributed cluster task manager
Taskerman - a distributed cluster task manager
 
PostgreSQL on AWS: Tips & Tricks (and horror stories)
PostgreSQL on AWS: Tips & Tricks (and horror stories)PostgreSQL on AWS: Tips & Tricks (and horror stories)
PostgreSQL on AWS: Tips & Tricks (and horror stories)
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous Speed
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
 
What Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and GoWhat Kiwi.com Has Learned Running ScyllaDB and Go
What Kiwi.com Has Learned Running ScyllaDB and Go
 
Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management
Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster ManagementApache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management
Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management
 
2013 05 ny
2013 05 ny2013 05 ny
2013 05 ny
 
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
 
FireEye & Scylla: Intel Threat Analysis Using a Graph Database
FireEye & Scylla: Intel Threat Analysis Using a Graph DatabaseFireEye & Scylla: Intel Threat Analysis Using a Graph Database
FireEye & Scylla: Intel Threat Analysis Using a Graph Database
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
The architecture of SkySQL
The architecture of SkySQLThe architecture of SkySQL
The architecture of SkySQL
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven Microservices
 
Databases and how to choose them
Databases and how to choose themDatabases and how to choose them
Databases and how to choose them
 
ScyllaDB @ Apache BigData, may 2016
ScyllaDB @ Apache BigData, may 2016ScyllaDB @ Apache BigData, may 2016
ScyllaDB @ Apache BigData, may 2016
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
Log Events @Twitter
Log Events @TwitterLog Events @Twitter
Log Events @Twitter
 

Similar to Large-Scale Automated Storage on Kubernetes - Matt Schallert OSCON 2019

Building RESTtful services in MEAN
Building RESTtful services in MEANBuilding RESTtful services in MEAN
Building RESTtful services in MEANMadhukara Phatak
 
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...Rob Skillington
 
Introducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLIntroducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLMariaDB plc
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
What to expect from MariaDB Platform X5, part 1
What to expect from MariaDB Platform X5, part 1What to expect from MariaDB Platform X5, part 1
What to expect from MariaDB Platform X5, part 1MariaDB plc
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series databasefelixbarny
 
Introducing MagnetoDB, a key-value storage sevice for OpenStack
Introducing MagnetoDB, a key-value storage sevice for OpenStackIntroducing MagnetoDB, a key-value storage sevice for OpenStack
Introducing MagnetoDB, a key-value storage sevice for OpenStackMirantis
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedShubham Tagra
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3LibbySchulze
 
MariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database ProxyMariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database ProxyMarkus Mäkelä
 
MariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale:  an Intelligent Database ProxyMariaDB MaxScale:  an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database ProxyMarkus Mäkelä
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedShubham Tagra
 
Evolution of DBA in the Cloud Era
 Evolution of DBA in the Cloud Era Evolution of DBA in the Cloud Era
Evolution of DBA in the Cloud EraMydbops
 
Design Summit - Automate roadmap - Madhu Kanoor
Design Summit - Automate roadmap - Madhu KanoorDesign Summit - Automate roadmap - Madhu Kanoor
Design Summit - Automate roadmap - Madhu KanoorManageIQ
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Databricks
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERShuyi Chen
 

Similar to Large-Scale Automated Storage on Kubernetes - Matt Schallert OSCON 2019 (20)

Building RESTtful services in MEAN
Building RESTtful services in MEANBuilding RESTtful services in MEAN
Building RESTtful services in MEAN
 
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
 
Introducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLIntroducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQL
 
Running Cassandra in AWS
Running Cassandra in AWSRunning Cassandra in AWS
Running Cassandra in AWS
 
Running MySQL in AWS
Running MySQL in AWSRunning MySQL in AWS
Running MySQL in AWS
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
What to expect from MariaDB Platform X5, part 1
What to expect from MariaDB Platform X5, part 1What to expect from MariaDB Platform X5, part 1
What to expect from MariaDB Platform X5, part 1
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
Introducing MagnetoDB, a key-value storage sevice for OpenStack
Introducing MagnetoDB, a key-value storage sevice for OpenStackIntroducing MagnetoDB, a key-value storage sevice for OpenStack
Introducing MagnetoDB, a key-value storage sevice for OpenStack
 
Megastore by Google
Megastore by GoogleMegastore by Google
Megastore by Google
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speed
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
 
MariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database ProxyMariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database Proxy
 
MariaDB MaxScale: an Intelligent Database Proxy
MariaDB MaxScale:  an Intelligent Database ProxyMariaDB MaxScale:  an Intelligent Database Proxy
MariaDB MaxScale: an Intelligent Database Proxy
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speed
 
Evolution of DBA in the Cloud Era
 Evolution of DBA in the Cloud Era Evolution of DBA in the Cloud Era
Evolution of DBA in the Cloud Era
 
Design Summit - Automate roadmap - Madhu Kanoor
Design Summit - Automate roadmap - Madhu KanoorDesign Summit - Automate roadmap - Madhu Kanoor
Design Summit - Automate roadmap - Madhu Kanoor
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
 

Recently uploaded

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Recently uploaded (20)

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

Large-Scale Automated Storage on Kubernetes - Matt Schallert OSCON 2019

  • 1. Large-Scale Automated Storage on Kubernetes Matt Schallert @mattschallert SRE @ Uber
  • 2. About Me ● SRE @ Uber since 2016 ● M3 - Uber’s OSS Metrics Platform @mattschallert
  • 3. Agenda ● Background: Metrics @ Uber ● Scaling Challenges ● Automation ● M3DB on Kubernetes @mattschallert
  • 6. ● Cassandra stack scaled to needs but was expensive ○ Human cost: operational overhead ○ Infra cost: lots of high-IO servers ● Scaling challenges inevitable, needed to reduce complexity Mid 2016 @mattschallert
  • 7. M3DB ● Open source distributed time series database ● Highly compressed (11x), fast, compaction-free storage ● Sharded, replicated (zone & rack-aware) ● Production late 2016 @mattschallert
  • 8. M3DB ● Reduced operational overhead ● Runs on commodity hardware with SSDs ● Strongly consistent cluster membership backed by etcd @mattschallert
  • 9. m3cluster ● Reusable cluster management libraries created for M3DB ○ Topology representation, shard distribution algos ○ Service discovery, runtime config, leader election ○ Etcd (strongly consistent KV store) as source of truth for all components ● Similar to Apache Helix ○ Helix is Java-heavy, we’re a Go-based stack
  • 10. m3cluster ● Topologies represented as desired goal states, M3DB converges @mattschallert
  • 13. Post-M3DB Deployment ● M3DB solved most pressing issue (Cassandra) ○ Improved stability, reduced costs ● System as a whole: still a high operational overhead @mattschallert
  • 14. Fragmented representation of components @mattschallert
  • 20. System Representation ● Systems represented as… ○ Lists fetched from Uber config service ○ Static lists deployed via Puppet ○ Static lists coupled with service deploys ○ m3cluster placements @mattschallert
  • 21. Fragmentation ● Had smaller components rather than one unifying view of the system as a whole ● Difficult to reason about ○ “Where do I need to make this change?” ○ Painful onboarding for new engineers ● Impediment to automation ○ Replace host: update config, build N services, deploy
  • 24. m3cluster benefits ● Common libraries for all workloads: stateful (M3DB), semi-stateful (aggregation tier), stateless (ingestion) ○ Implement “distribute shards across racks according to disk size” once, reuse everywhere ● Single source of truth: everything stored in etcd @mattschallert
  • 25. Operational Improvements ● Easier to reason about the entire system ○ Unifying view: Placements ● Operations much easier, possible to automate ○ Just modifying goal state in etcd ● One config per deployment (no instance-specific) ○ Cattle vs. pets: instances aren’t special, treat them as a whole
  • 26. M3DB Goal State ● Shard assignments stored in etcd ○ M3DB nodes observe desired state, react, update shard state ● Imperative actions such as “add a node” are really changes in declarative state ○ Easy for both people or tooling to interact with @mattschallert
  • 27. M3DB: Adding a Node @mattschallert
  • 29. M3DB: Adding a Node @mattschallert
  • 30. M3DB: Adding a Node @mattschallert
  • 31. M3DB Goal State ● Goal-state primitives built with future automation in mind ● m3cluster interacts with single source of truth ● Vs. imperative changes: ○ No instance-level operations ○ No state or steps to keep track of: can always reconcile state of world (restarts, etc. safe) @mattschallert
  • 33. M3DB Operations ● Retooled entire stack as goal state-driven, dynamic ○ Operations simplified, but still triggered by a person ● M3DB requires more care to manage than stateless ○ Takes time to stream data @mattschallert
  • 36. Automating M3DB ● Wanted goal state at a higher level ○ Clusters as cattle ● M3DB can only do so much; needed to expand scope ● Chose to build on top of Kubernetes ○ Value for M3DB OSS community ○ Internal Uber use cases @mattschallert
  • 37. Kubernetes @ 35,000 ft ● Container orchestration system ● Support for stateful workloads ● Self-healing ● Extensible ○ CRD: define your own API resources ○ Operator: custom controller @mattschallert
  • 38. ● Pod: base unit of work ○ Group of like containers deployed together ● Pods can come and go ○ Self-healing → pods move in response to failures ● No pods are special Kubernetes Driving Principles @mattschallert
  • 39. Kubernetes Controllers for { desired := getDesiredState() current := getCurrentState() makeChanges(desired, current) } Writing Controllers Guide
  • 40. M3DB & Kubernetes M3DB: ● Goal state driven ● Single source of truth ● Nodes work to converge Kubernetes: ● Goal state driven ● Single source of truth ● Components work to converge @mattschallert
  • 41. M3DB Operator ● Manages M3DB clusters running on Kubernetes ○ Creation, deletion, scaling, maintenance ● Performs actions a person would have to do in full cluster lifecycle ○ Updating M3DB clusters ○ Adding more instances @mattschallert
  • 42. M3DB Operator ● Users define desired M3DB clusters ○ “9 node cluster, spread across 3 zones, with remote SSDs attached” ● Operator updates desired Kubernetes state, waits for convergence, updates M3DB desired state @mattschallert
  • 43. M3DB Operator: Create Example @mattschallert
  • 47. M3DB Operator: Day 2 Operations ● Scaling a cluster ○ M3DB adds 1-by-1, operation handles many ● Replacing instances, updating config, etc. ● All operations follow same pattern ○ Look at Kubernetes state, look at M3DB state, affect change @mattschallert
  • 48. M3DB Operator: Success Factors ● Bridge between goal state-driven APIS (anti-footgun) ○ Can’t accidentally remove more pods than desired ○ Can’t accidentally remove two M3DB instances ○ Operator can be restarted, pick back up ● Share similar concepts, expanded scope ○ Failure domains, resource quantities, etc.
  • 49. ● Finding common abstractions made mental model easier ○ Implementation can be separate, interface the same ● + Goal-state driven approach simplified operations ○ External factors (failures, deploys) don’t disrupt steady state Lessons Learned: M3 @mattschallert
  • 50. ● Embracing Kubernetes principles in M3DB made Kubernetes onboarding easier ○ No pods are special, may move around ○ Self-healing after failures ● Instances as cattle, not pets ○ No instance-level operations or config Lessons Learned: Kubernetes @mattschallert
  • 51. ● Any orchestration system implies complexity ○ Your own context: is it worth it? ○ How does your own system respond to failures, rescheduling, etc.? ● Maybe not a good fit if have single special instances ○ M3DB can tolerate failures Considerations @mattschallert
  • 52. ● Requirements & guarantees of your system will influence your requirements for automation ○ M3DB strongly consistent: wanted strongly consistent cluster membership Considerations @mattschallert
  • 53. ● Building common abstractions for all our workloads eased mental model ● Following goal-state driven approach eased automation ● Closely aligned with Kubernetes principles Recap @mattschallert
  • 54. ● github.com/m3db/m3 ○ M3DB, m3cluster, m3aggregator ○ github.com/m3db/m3db-operator ● m3db.io & docs.m3db.io ● Please leave feedback on O’Reilly site! Thank you! @mattschallert
  • 55.
  • 58. M3DB Operator: Today ● [ THIS SLIDE IS SKIPPED ] ● https://github.com/m3db/m3db-operator ● Managing 1 M3DB clusters ~as easy as 10
  • 63. System Interfaces ● Ingestion path used multiple text-based line protocols ○ Difficult to encode metric metadata ○ Redundant serialization costs ● Storage layers used various RPC formats
  • 65. M3DB Operator [SLIDE SKIPPED] ROUGH DRAFT DIAGRAM
  • 66. ● [ Transition to Kubernetes, possibly using edge vs. level triggered ] ○ Level triggered: reacting to current state. Actions such as “replace this node” can’t be missed. ○ Edge triggered: reacting to events. Events can be missed. ● [ Maybe something about idempotency requirement ? ] Edge vs. Level Triggered Logic
  • 67. m3cluster: Placement message Instance { string id string isolation_group string zone uint32 weight string endpoint repeated Shard shards ... } message Placement { map<string, Instance> instances uint32 replica_factor uint32 num_shards bool is_sharded }
  • 68. m3cluster: Shard enum ShardState { INITIALIZING = 0; AVAILABLE = 1; LEAVING = 2; } message Shard { uint32 id = 1; ShardState state = 2; string source_id = 3; }
  • 69. M3cluster: all enum ShardState { INITIALIZING = 0; AVAILABLE = 1; LEAVING = 2; } message Shard { uint32 id = 1; ShardState state = 2; string source_id = 3; } message Instance { string id = 1; string isolation_group = 2; string zone = 3; uint32 weight = 4; string endpoint = 5; repeated Shard shards = 6; ... } message Placement { map<string, Instance> instances = 1; uint32 replica_factor = 2; uint32 num_shards = 3; bool is_sharded = 4; }
  • 70. ● Consistency of operations ○ Many actors in the system ○ Failures can happen, may re-reconcile states ○ Kubernetes & m3cluster leverage etcd Prerequisites for Orchestration
  • 71. ● [ Lessons learned orchestrating M3DB generally ] ● [ Learnings specific to Kubernetes ] Lessons Learned