Seattle 2018
@danielhochman / Engineer / Lyft
Instrumenting and Scaling Cloud-Native Databases with Envoy
Database outage
1. Disk I/O wait spikes briefly
2. Client opens more connections
3. Slowdown due to auth overhead of new connections
4. Client opens more connections
5. Hit max connection limit
Databases in the cloud
Instantly provision resilient, high-throughput infrastructure
No access to underlying VM and/or shared hardware
Limited access to telemetry
Limited access to configuration
Closed source or no ability to run custom binary
Cloud Native
Cloud native technologies empower organizations to build and run
scalable applications in modern, dynamic environments such as
public, private, and hybrid clouds.
Service Mesh topology
[Diagram: edge, service mesh, and discovery]
Envoy Proxy is deployed at every hop
Instance topology
The application communicates locally with Envoy, which proxies all traffic.
[Diagram: local listeners localhost:6001, localhost:6101, localhost:7000, … for internal services, third-party services, cloud services, and more]
Layer 3 / 4: Proxying TCP
[Diagram: localhost:7000 -> iot.us-east-1.amazonaws.com (174.217.14.202, 174.217.14.234)]
Stats
cx_active
cx_connect_fail
cx_idle_timeout
cx_total
cx_tx_bytes_total
cx_rx_bytes_total
Other benefits
- DNS aware
- Load balancing: round robin, least request, ring hash, random, etc.
- Impose an idle timeout
- Healthchecking
- Access logging
Layer 5 / 6: Offloading SSL
Stats
handshakes
tls_session_reused
fail_verify_no_cert
fail_verify_ca_error
fail_verify_san
cipher.<cipher>
days_until_cert_expires
Other benefits
- Efficient
- Up-to-date and secure (TLS 1.3)
- SNI, cert pinning, session resumption, etc.
- Easier to upgrade
[Diagram: localhost:7000 -> 172.217.14.202:443]
Layer 7: Managing HTTP
Stats
cx_http1_total
cx_http2_total
cx_protocol_error
rq_2xx
rq_4xx
rq_5xx
rq_retry
rq_time_ms (hist)
rq_timeout
Other benefits
- Transparent upgrade from HTTP/1 to HTTP/2 (multiplexed)
- Manage request retries and timeouts
- Access logging
- Offload GZIP decompression
[Diagram: HTTP/1 -> HTTP/2]
Statistics
TCP (L3/L4): cx_active, cx_connect_fail, cx_idle_timeout, cx_total, cx_tx_bytes_total, cx_rx_bytes_total, cx_length_ms (hist)
SSL (L5/L6): handshakes, tls_session_reused, fail_verify_no_cert, fail_verify_ca_error, fail_verify_san, cipher.<cipher>, days_until_cert_expires
HTTP (L7): cx_http1_total, cx_http2_total, cx_protocol_error, rq_2xx, rq_4xx, rq_5xx, rq_retry, rq_time_ms (hist), rq_timeout
and more!
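Envoy's admin interface serves counters and gauges as plain-text `name: value` lines, which makes them easy to scrape. The following is a minimal sketch of grouping such lines by cluster. The sample text, the cluster names `mongo_users` and `dynamodb`, and the helper `parse_cluster_stats` are assumptions for illustration; the slide's short stat names are shown here with a `cluster.<name>.upstream_` prefix.

```python
from collections import defaultdict

# Hypothetical sample of stats text ("name: value" per line).
SAMPLE_STATS = """\
cluster.mongo_users.upstream_cx_active: 12
cluster.mongo_users.upstream_cx_total: 340
cluster.mongo_users.upstream_cx_connect_fail: 2
cluster.dynamodb.upstream_cx_active: 4
cluster.dynamodb.upstream_rq_5xx: 1
"""


def parse_cluster_stats(text):
    """Group integer stats by cluster name: {cluster: {stat: value}}."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        parts = name.split(".")
        # Only keep lines shaped like cluster.<cluster_name>.<stat>.
        if not sep or len(parts) < 3 or parts[0] != "cluster":
            continue
        try:
            stats[parts[1]][".".join(parts[2:])] = int(value)
        except ValueError:
            continue  # skip values that are not plain integers
    return dict(stats)


by_cluster = parse_cluster_stats(SAMPLE_STATS)
print(by_cluster["mongo_users"]["upstream_cx_active"])
```

Because every hop emits the same stat names, the same parser works for any origin and destination pair.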
Dashboards
Live templating
or {% macro envoy_stats(origin, destination) %}
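The macro signature suggests a single reusable template that takes an origin and a destination and emits the same set of panels for every pair of services. Here is a rough sketch of that idea in plain Python, standing in for the Jinja-style macro. The metric path layout, the panel dictionary shape, and the service names are all hypothetical.

```python
def envoy_stats(origin, destination):
    """Return dashboard panel definitions for traffic from origin to destination.

    The metric path layout below is an assumption for illustration only.
    """
    prefix = f"envoy.{origin}.cluster.{destination}"
    stats = ["upstream_cx_active", "upstream_cx_connect_fail", "upstream_rq_5xx"]
    return [
        {"title": f"{origin} -> {destination}: {stat}", "query": f"{prefix}.{stat}"}
        for stat in stats
    ]


panels = envoy_stats("users", "mongo_users")
for panel in panels:
    print(panel["title"], "|", panel["query"])
```

Since the stats are identical for every destination, adding a dashboard for a new database is one more call with different arguments.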
Observability
Homogeneous telemetry data makes it easier to observe and correlate behavior in large systems.
Observability
Libraries are heterogeneous!
SSL ciphers? Status code metrics? Retry?
import pynamodb
use Aws\DynamoDb\DynamoDbClient;
import "github.com/aws/dynamodb"
e.g.
&aws.Config{
    Endpoint: aws.String("http://localhost:8000"),
}
Envoy provides standard access logs, stats, alarms, retry, etc.
Layer 7: Beyond HTTP
Envoy supports three other database-specific L7 protocols today
DynamoDB
- Protocol: JSON over HTTP
- CloudWatch telemetry
  - min, avg, max latency
  - per-table capacity unit throughput
  - per-minute
- Benefits of Envoy:
  - Histogram of latency (percentiles)
  - Custom windowing of metrics
  - Per-host, per-zone, and per-cluster statistics
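To see why a latency histogram adds information beyond min, avg, and max, consider a minimal sketch with made-up latencies where a small fraction of requests is slow. The data is hypothetical and the percentile calculation uses Python's standard library rather than Envoy's histogram implementation.

```python
import statistics

# Hypothetical request latencies in milliseconds: mostly fast, a few slow outliers.
latencies_ms = [5] * 95 + [200] * 5

average = statistics.mean(latencies_ms)
# quantiles(n=100) returns the 99 cut points p1..p99;
# method="inclusive" treats the data as the whole population.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p99 = cuts[49], cuts[98]

print(f"min={min(latencies_ms)} avg={average:.2f} max={max(latencies_ms)}")
print(f"p50={p50} p99={p99}")
```

The average sits between the two groups and describes neither, while p50 and p99 show that most requests are fast and the tail is slow.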
DynamoDB with codec
POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.GetItem
{
  "TableName": "pets",
  "Key": {
    "Name": {"S": "Patty"}
  }
}
dynamodb.table.pets.GetItem.upstream_rq_time
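The request and the stat name above show what the codec has to extract: the operation from the `X-Amz-Target` header and the table from the JSON body. The sketch below shows that derivation under stated assumptions. The function `dynamodb_stat_name` is hypothetical, it mirrors the stat name shown on the slide rather than Envoy's exact naming scheme, and it only handles operations with a top-level `TableName`.

```python
import json


def dynamodb_stat_name(headers, body):
    """Build a stat name like the slide's from a DynamoDB HTTP request.

    X-Amz-Target has the form "<api version>.<operation>";
    the table name is read from the JSON body.
    """
    operation = headers["X-Amz-Target"].split(".", 1)[1]
    table = json.loads(body)["TableName"]
    return f"dynamodb.table.{table}.{operation}.upstream_rq_time"


request_headers = {"X-Amz-Target": "DynamoDB_20120810.GetItem"}
request_body = '{"TableName": "pets", "Key": {"Name": {"S": "Patty"}}}'
stat = dynamodb_stat_name(request_headers, request_body)
print(stat)
```

Because the proxy sees every request, per-table and per-operation timings come for free regardless of which client library the application uses.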
DynamoDB
What was the per-30s p99 for write requests from the users-streamlistener canary to the pets table?
ts(
envoy.dynamodb.pets.PutItem.upstream_rq_time.p99,
window=30,
group=users-streamlistener,
canary=true,
)
MongoDB
- Protocol: Binary JSON (BSON)
- Benefits of Envoy in TCP mode:
  - Per-host, per-cluster, per-zone network I/O
- Benefits of Envoy with Mongo codec:
  - Per-operation latency
  - Count size and number of documents
  - Count scattered gets in sharded cluster
How did the number of documents returned by queries change in us-east-1a after the 3pm deploy of my service?
MongoDB at scale
Help! My Mongo database is experiencing outages:
- Disk I/O wait spikes briefly
- Client opens more connections
- Slowdown due to auth overhead of new connections
- Client opens more connections
- Hit max connection limit
Envoy will rate limit new connections to apply backpressure so that query times can recover.
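One common way to rate limit new connections is a token bucket. The sketch below is a local, single-process illustration of the backpressure idea, not Envoy's rate limiting implementation. The class name, rate, and burst values are assumptions, and time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allow at most `rate` new connections per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill tokens for the time elapsed since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=2, burst=3)
# Ten connection attempts arriving at the same instant: only the burst is admitted.
storm = [bucket.allow(now=0.0) for _ in range(10)]
# One second later, two tokens have been refilled.
later = [bucket.allow(now=1.0) for _ in range(3)]
print(storm.count(True), later.count(True))
```

Rejected attempts never reach the database, so the authentication overhead of a connection storm is bounded and existing connections can drain their queries.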
MongoDB at scale
Help! I deleted an index. I read the code, but it was in a 3,000-line class. The index was still in use and everything fell over until we could recreate it.
Envoy will efficiently log all Mongo queries in JSON format so that a week of logs can be audited for usage of the index's fields.
Have you tried the built-in query profiler?
Yes, it caused a serious outage because it's expensive and results in 3x CPU usage.
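Auditing JSON query logs for an index's fields can be a short script. This is a minimal sketch assuming one JSON object per log line with the query under `message.query`. The sample lines, field names, and helper functions are hypothetical, and a match only flags a candidate query rather than proving the index was used.

```python
import json

# Hypothetical log lines, one JSON object per line.
LOG_LINES = [
    '{"message": {"opcode": "OP_QUERY", "query": {"find": "user", "filter": {"email": "a@example.com"}}}}',
    '{"message": {"opcode": "OP_QUERY", "query": {"find": "user", "filter": {"_id": 903730}}}}',
    '{"message": {"opcode": "OP_QUERY", "query": {"find": "user", "filter": {"$and": [{"email": "b@example.com"}, {"age": 30}]}}}}',
]


def field_names(value):
    """Yield every dictionary key found anywhere inside a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield key
            yield from field_names(child)
    elif isinstance(value, list):
        for child in value:
            yield from field_names(child)


def queries_using(lines, index_fields):
    """Return the logged queries that mention any of the index's fields."""
    hits = []
    for line in lines:
        query = json.loads(line)["message"]["query"]
        if index_fields & set(field_names(query)):
            hits.append(query)
    return hits


hits = queries_using(LOG_LINES, {"email"})
print(len(hits), "of", len(LOG_LINES), "queries mention the index fields")
```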
MongoDB at scale
Envoy will:
- globally rate limit new connections
- efficiently log all Mongo queries
- track the number of queries with no timeout set
- parse the $comment field of a query so we can time and count queries of individual application methods, log how many records they returned, etc.
… for applications in 3 different languages across 8 clusters.
… 6 months and several outages later ...
/var/log/envoy/mongo/0.log
{
  "time": "2018-10-13T21:17:08.483Z",
  "upstream_host": "172.18.3.19:27817",
  "message": {
    "opcode": "OP_QUERY",
    "query": {
      "findAndModify": "user",
      "query": {"_id": 903730},
      "update": {"$set": {"stats.rating": 4.9}},
      "$comment": "{\"hostname\": \"users-3ae3r\", \"httpUniqueId\": \"91aaaaaf-4c3d-9400-bcbf-c4aaaaaaadb7\", \"callingFunction\": \"users.UpdateRating\"}"
    }
  }
}
envoy.mongo.callsite.users.UpdateRating.reply_time_ms
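The stat name above is derived from the `callingFunction` value that the application places in the query's `$comment`. The sketch below shows that mapping, assuming the comment is a JSON document encoded as a string. The function `callsite_stat` is hypothetical and is not Envoy's code.

```python
import json

# A log entry in the shape of the example; $comment carries JSON encoded as a string.
entry = {
    "message": {
        "opcode": "OP_QUERY",
        "query": {
            "findAndModify": "user",
            "$comment": json.dumps({
                "hostname": "users-3ae3r",
                "callingFunction": "users.UpdateRating",
            }),
        },
    }
}


def callsite_stat(log_entry):
    """Return the per-callsite stat name, or None when no usable $comment is present."""
    comment = log_entry["message"]["query"].get("$comment")
    if comment is None:
        return None
    try:
        calling_function = json.loads(comment)["callingFunction"]
    except (ValueError, KeyError, TypeError):
        return None  # comment is not JSON or lacks callingFunction
    return f"envoy.mongo.callsite.{calling_function}.reply_time_ms"


print(callsite_stat(entry))
```

Queries without a parsable comment return `None`, so untagged callers can be counted separately instead of breaking the pipeline.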
Redis partitioning proxy
Redis protocol + consistent hashing = Redis partitioning proxy
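Consistent hashing maps each key to a host so that adding or removing a host only relocates a fraction of the keys. The following is a minimal ring-hash sketch for illustration. The host names, the SHA-256 based hash, and the number of virtual nodes are assumptions, and this is not the exact algorithm Envoy uses.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the ring (first 8 bytes of its SHA-256 digest)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, hosts, vnodes=100):
        # Each host is placed on the ring at many virtual points to even out the load.
        self._ring = sorted(
            (_hash(f"{host}#{i}"), host) for host in hosts for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def host_for(self, key):
        # Walk clockwise to the first point at or after the key's hash, wrapping around.
        index = bisect.bisect_left(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


hosts = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]  # hypothetical backends
keys = [f"user:{n}" for n in range(1000)]

before = HashRing(hosts)
after = HashRing(hosts + ["redis-d:6379"])
moved = sum(before.host_for(k) != after.host_for(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding a host")
```

With naive modulo partitioning, adding a fourth host would remap most keys. With the ring, the only keys that move are the ones claimed by the new host.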
Redis at scale
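[Diagram: the application sends SET msg hello, INCR comm, and MGET lyft hello to localhost:6379; the proxy forwards SET msg hello, GET hello, INCR comm, and GET lyft to the Redis hosts behind it (…); replies shown are OK, 1, and nil]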
To the application, the proxy looks like a single instance of Redis.
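Presenting many hosts as a single instance means a multi-key command has to be split per key and its replies reassembled in order. The toy sketch below shows that fan-out with in-memory dictionaries standing in for Redis hosts. The `proxy` function, the CRC32 modulo partitioning, and the supported command subset are all assumptions for illustration.

```python
import zlib

# Hypothetical backends, each modeled as an in-memory dict.
backends = [{}, {}, {}]


def backend_for(key):
    """Pick a backend by hashing the key (simple modulo partitioning as a stand-in)."""
    return backends[zlib.crc32(key.encode()) % len(backends)]


def proxy(command, *args):
    """Handle a small subset of Redis commands the way a partitioning proxy might."""
    if command == "SET":
        key, value = args
        backend_for(key)[key] = value
        return "OK"
    if command == "GET":
        return backend_for(args[0]).get(args[0])
    if command == "INCR":
        store = backend_for(args[0])
        store[args[0]] = int(store.get(args[0], 0)) + 1
        return store[args[0]]
    if command == "MGET":
        # Split into one GET per key, then reassemble replies in the original key order.
        return [proxy("GET", key) for key in args]
    raise ValueError(f"unsupported command: {command}")


print(proxy("SET", "msg", "hello"))
print(proxy("INCR", "comm"))
print(proxy("MGET", "lyft", "hello"))
```

The caller never learns which backend holds which key, so hosts can be added behind the proxy without changing application code.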
Approaches
Bump-in-the-wire vs. fully routing
[Diagram: TCP, HTTP, …]
Future codecs
Roadmap
- More codecs
- Full L7 capability vs bump-in-the-wire
- Better integration of tracing
- More fault injection coverage
- Role-based access control
Thanks!
@danielhochman

Instrumenting and Scaling Databases with Envoy

  • 1. Seattle 2018 @danielhochman / Engineer / Lyft Instrumenting and Scaling Cloud-Native Databases with Envoy
  • 2. Seattle 2018 Database outage 1. Disk I/O wait spikes briefly 2. Client opens more connections 3. Slowdown due to auth overhead of new connections 4. Client opens more connections 5. Hit max connection limit
  • 3. Seattle 2018 Databases in the cloud Instantly provision resilient, high-throughput infrastructure No access to underlying VM and/or shared hardware Limited access to telemetry Limited access to configuration Closed source or no ability to run custom binary
  • 4. Cloud Native: Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds.
  • 5. Service Mesh topology: Envoy Proxy is deployed at every hop. (Diagram labels: Service mesh, Edge, Discovery)
  • 6. Instance topology: the application communicates locally with Envoy, which proxies all traffic.
      - localhost:6001 (internal services)
      - localhost:6101 (third-party services)
      - localhost:7000 (cloud services)
      - … and more!
  • 7. Layer 3 / 4: Proxying TCP
      Stats: cx_active, cx_connect_fail, cx_idle_timeout, cx_total, cx_tx_bytes_total, cx_rx_bytes_total
      Other benefits:
      - DNS aware
      - Load balancing: round robin, least request, ring hash, random, etc.
      - Impose an idle timeout
      - Health checking
      - Access logging
      Diagram labels: localhost:7000, iot.us-east-1.amazonaws.com, 174.217.14.202, 174.217.14.234
  • 8. Layer 5 / 6: Offloading SSL
      Stats: handshakes, tls_session_reused, fail_verify_no_cert, fail_verify_ca_error, fail_verify_san, cipher.<cipher>, days_until_cert_expires
      Other benefits:
      - Efficient
      - Up-to-date and secure (TLS 1.3)
      - SNI, cert pinning, session resumption, etc.
      - Easier to upgrade
      Diagram labels: localhost:7000, 172.217.14.202:443
  • 9. Layer 7: Managing HTTP
      Stats: cx_http1_total, cx_http2_total, cx_protocol_error, rq_2xx, rq_4xx, rq_5xx, rq_retry, rq_time_ms (hist), rq_timeout
      Other benefits:
      - Transparent upgrade from HTTP/1 to HTTP/2 (multiplexed)
      - Manage request retries and timeouts
      - Access logging
      - Offload GZIP decompression
      Diagram labels: HTTP/1, HTTP/2
  • 10. Statistics
      - TCP (L3/L4): cx_active, cx_connect_fail, cx_idle_timeout, cx_total, cx_tx_bytes_total, cx_rx_bytes_total, cx_length_ms (hist)
      - SSL (L5/L6): handshakes, tls_session_reused, fail_verify_no_cert, fail_verify_ca_error, fail_verify_san, cipher.<cipher>, days_until_cert_expires
      - HTTP (L7): cx_http1_total, cx_http2_total, cx_protocol_error, rq_2xx, rq_4xx, rq_5xx, rq_retry, rq_time_ms (hist), rq_timeout
      - and more!
  • 11. Dashboards: Live templating or {% macro envoy_stats(origin, destination) %}
  • 12. Observability: Homogeneous telemetry data makes it easier to observe and correlate behavior in large systems.
  • 13. Observability: Libraries are heterogeneous! SSL ciphers? Status code metrics? Retry?
      - import pynamodb
      - use Aws\DynamoDb\DynamoDbClient;
      - import "github.com/aws/dynamodb"
      &aws.Config{ Endpoint: aws.String("http://localhost:8000") }
      e.g. Envoy provides standard access logs, stats, alarms, retry, etc.
  • 14. Layer 7: Beyond HTTP. Envoy supports three other database-specific L7 protocols today.
  • 15. DynamoDB
      - Protocol: JSON over HTTP
      - CloudWatch telemetry
          - min, avg, max latency
          - per-table capacity unit throughput
          - per-minute
      - Benefits of Envoy:
          - Histogram of latency (percentiles)
          - Custom windowing of metrics
          - Per-host, per-zone, and per-cluster statistics
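To illustrate why a latency histogram is more useful than min, avg, and max alone, here is a minimal sketch in Python. The latency values are invented for illustration, and the nearest-rank `percentile` helper is an assumption of this sketch, not how Envoy or CloudWatch compute their figures.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [5] * 98 + [400, 900]

summary = {
    "min": min(latencies_ms),
    "avg": sum(latencies_ms) / len(latencies_ms),
    "max": max(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p99": percentile(latencies_ms, 99),
}
```

In this made-up sample the average sits well above what a typical request sees, and the max reflects a single outlier; the p50 and p99 together describe both the common case and the tail.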
  • 17. DynamoDB with codec
      POST / HTTP/1.1
      X-Amz-Target: DynamoDB_20120810.GetItem
      { "TableName": "pets", "Key": { "Name": {"S": "Patty"} } }
      Stat: dynamodb.table.pets.GetItem.upstream_rq_time
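The mapping from a DynamoDB request to a per-table, per-operation stat name can be sketched as follows. This is an illustrative Python sketch, not Envoy's filter (which is written in C++); the function name `dynamodb_stat_name` is hypothetical, and the only things taken from the slide are the `X-Amz-Target` header, the `TableName` field, and the resulting stat name format.

```python
import json


def dynamodb_stat_name(headers, body):
    """Build a stat name from the operation in X-Amz-Target and the
    TableName in the JSON body. Returns None when either is missing."""
    target = headers.get("X-Amz-Target", "")
    # "DynamoDB_20120810.GetItem" -> "GetItem"
    _, _, operation = target.partition(".")
    if not operation:
        return None
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    table = payload.get("TableName") if isinstance(payload, dict) else None
    if not table:
        # Some operations carry no single TableName; this sketch skips them.
        return None
    return f"dynamodb.table.{table}.{operation}.upstream_rq_time"


# The request shown on the slide.
example_headers = {"X-Amz-Target": "DynamoDB_20120810.GetItem"}
example_body = '{"TableName": "pets", "Key": {"Name": {"S": "Patty"}}}'
example_stat = dynamodb_stat_name(example_headers, example_body)
```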
  • 18. DynamoDB: What was the per-30s p99 for write requests from the users-streamlistener canary to the pets table?
      ts(
        envoy.dynamodb.pets.PutItem.upstream_rq_time.p99,
        window=30,
        group=users-streamlistener,
        canary=true,
      )
  • 19. MongoDB
      - Protocol: Binary JSON (BSON)
      - Benefits of Envoy in TCP mode:
          - Per-host, per-cluster, per-zone network I/O
      - Benefits of Envoy with Mongo codec:
          - Per-operation latency
          - Count size and number of documents
          - Count scattered gets in sharded cluster
      How did the number of documents returned by queries change in us-east-1a after the 3pm deploy of my service?
  • 20. MongoDB at scale
      Help! My Mongo database is experiencing outages:
      - Disk I/O wait spikes briefly
      - Client opens more connections
      - Slowdown due to auth overhead of new connections
      - Client opens more connections
      - Hit max connection limit
      Envoy will rate limit new connections to apply backpressure so that query times can recover.
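The idea of rate limiting new connections to break the reconnect spiral can be sketched with a token bucket. This is a minimal local sketch in Python under stated assumptions: the class name `ConnectionRateLimiter` and its parameters are hypothetical, and a token bucket is a stand-in technique chosen for illustration, not a description of how Envoy implements its rate limiting.

```python
import time


class ConnectionRateLimiter:
    """Token bucket: allows bursts of up to `burst` new connections and
    refills at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock  # injectable so the behavior can be tested
        self.last = clock()

    def allow_new_connection(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or delay the connection
```

When the database slows down and clients try to open many connections at once, only the burst allowance gets through immediately; the rest are refused until tokens refill, which caps the authentication load on the database.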
  • 21. MongoDB at scale
      Help! I deleted an index. I read the code but it was in a 3,000-line class. The index was still in use and everything fell over until we could recreate it.
      Envoy will efficiently log all Mongo queries in JSON format so that a week of logs can be audited for usage of the index's fields.
      Have you tried the built-in query profiler?
      Yes, it caused a serious outage because it's expensive and results in 3x CPU usage.
  • 22. MongoDB at scale
      Envoy will:
      - globally rate limit new connections
      - efficiently log all Mongo queries
      - track the number of queries with no timeout set
      - parse the $comment field of a query so we can time and count queries of individual application methods, log how many records they returned, etc.
      … for applications in 3 different languages across 8 clusters.
      … 6 months and several outages later ...
  • 23. /var/log/envoy/mongo/0.log
      {
        "time": "2018-10-13T21:17:08.483Z",
        "upstream_host": "172.18.3.19:27817",
        "message": {
          "opcode": "OP_QUERY",
          "query": {
            "findAndModify": "user",
            "query": {"_id": 903730},
            "update": {"$set": {"stats.rating": 4.9}},
            "$comment": "{\"hostname\": \"users-3ae3r\", \"httpUniqueId\": \"91aaaaaf-4c3d-9400-bcbf-c4aaaaaaadb7\", \"callingFunction\": \"users.UpdateRating\"}"
          }
        }
      }
      Stat: envoy.mongo.callsite.users.UpdateRating.reply_time_ms
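A rough sketch of how a call-site stat name could be derived from such a log entry, in Python. It assumes each log line is a valid JSON object and that `$comment` holds a JSON-encoded string with a `callingFunction` key, as in the slide; the function name `callsite_stat_name` is hypothetical, and this is not Envoy's own parsing code.

```python
import json


def callsite_stat_name(log_line):
    """Return the per-call-site stat name for one JSON log line, or None
    if the query carries no usable $comment."""
    entry = json.loads(log_line)
    message = entry.get("message") if isinstance(entry, dict) else None
    query = message.get("query") if isinstance(message, dict) else None
    comment = query.get("$comment") if isinstance(query, dict) else None
    if not comment:
        return None
    try:
        # $comment is itself a JSON document encoded as a string.
        meta = json.loads(comment)
    except (TypeError, ValueError):
        return None
    function = meta.get("callingFunction") if isinstance(meta, dict) else None
    if not function:
        return None
    return f"envoy.mongo.callsite.{function}.reply_time_ms"
```

Because the application writes its own method name into `$comment`, the same scheme works regardless of the client language, as long as every client follows the same comment convention.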
  • 24. Redis protocol + Consistent hashing = Redis partitioning proxy
  • 25. Redis at scale: To the application, the proxy looks like a single instance of Redis.
      Diagram labels:
      - localhost:6379 …
      - SET msg hello, INCR comm, MGET lyft hello
      - SET msg hello, GET hello, INCR comm, GET lyft
      - OK, 1, nil
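How a partitioning proxy might pick an upstream Redis host per key, and split a multi-key command across hosts, can be sketched with a small consistent-hash ring. This Python sketch is illustrative only: the `HashRing` class, the MD5-based hash, the virtual-node count, and the host names are all assumptions, not Envoy's actual hashing scheme.

```python
import bisect
import hashlib


def _hash(value):
    # First 8 bytes of MD5 as an integer position on the ring.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, hosts, vnodes=100):
        if not hosts:
            raise ValueError("at least one host is required")
        self._ring = sorted(
            (_hash(f"{host}#{i}"), host) for host in hosts for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def host_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def split_by_host(ring, keys):
    """Group the keys of a multi-key command (e.g. MGET) by owning host."""
    per_host = {}
    for key in keys:
        per_host.setdefault(ring.host_for(key), []).append(key)
    return per_host
```

With a ring like this, removing one host only remaps the keys that host owned; keys on the remaining hosts stay where they were, which is the property that makes consistent hashing attractive for a cache tier.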
  • 28. Roadmap
      - More codecs
      - Full L7 capability vs bump-in-the-wire
      - Better integration of tracing
      - More fault injection coverage
      - Role-based access control