SlideShare a Scribd company logo
1 of 23
Enabling Real-time Scenarios
At Scale Using Kafka
Nishant Gupta, Engineering@Salesforce
Who am I?
Nishant Gupta - Sr. Director, Engineering
1.5 years at Salesforce (15+ years in software development)
Focus: Distributed systems, data pipelines, data analytics
linkedin.com/in/nigupta
@ngupta77
medium.com/@ngupta
What are we talking today?
Overview of real time data analysis platform at Salesforce and zoom into how Kafka
fits into this.
Let’s start from the beginning...
We wanted to understand system health (at host level) across global data centers in
real time. I.e. generate system health insights from host level metric events.
Thus, the genesis of project Ajna.
Ajna - (or the third eye chakra) it is the sixth primary chakra in the body, according
to Hindu tradition. … it is believed to reveal insights…
More here - https://en.wikipedia.org/wiki/Ajna
What we achieved?
Mechanism for system health monitoring for all clusters across all Salesforce
global data centers.
Health monitoring and alerting for new clusters is config driven with zero code
change
From event to insight - E2E latency is under 10 sec (including network delays)
Kafka is the key technology that enabled it.
Vision
Build a multi-tenant platform to enable real time scenarios at scale
Architecture
Stream processing
Text Indexing
MR Jobs
Machine Learning
Raw Store
Etc.
Collection agentHost
Cluster level Kafka (data ingest)
Production Cluster
MirrorMaker
Local subscriber
DMZ
Aggregate Kafka
Let’s talk numbers
● # of clusters per production cluster: 1
● # of aggregate clusters in DMZ: 1
● # of topics: Data specific. Ranges from 1 to 100s.
● # of partitions: Data specific. Ranges from 1 to 16
● # of replications: 3 (across all data, all clusters)
● Data retention: Data specific. Ranges up to 4 days
● Version of Kafka: modified 0.8.2 and modified 0.9. Moving towards vanilla 0.9
● SSL enabled: Yes.
MirrorMaker
● Not all data is mirrored. Selective topics are white listed.
● Max message size - 2MB (working to reduce this)
● Batch.num.messages - 500
● Queueing.buffer.max.ms - 5000ms
● Fetch.max.message.size - 2MB
● Partition strategy - round robin
● Garbage collection - G1GC with max heap size of 28GB
● We have modified 0.8.2 MM to enable SSL
Scale of operation
30+ Kafka clusters
~300 MBPS Aggregate throughput of the system
100s of Billions Avg events/day
>30 sec P95 for latency
Scenarios we enable
● SR visibility into system health
● Transaction and performance visibility publically -
http://trust.salesforce.com/trust/performance
● App log analysis - application performance, business insights etc.
● Network monitoring
● Security and Compliance monitoring
● App’s event based communication and notification scenarios
● … more
Operational challenges
● Large number of clusters to manage
● Non-homogeneous hardware for brokers. Capacity planning is hard
● 0.8.x MM is buggy. Data loss is possible.
● No built-in support for QoS
● Operations on the box are manual - need to log into every box to get info
● Lack of traceability of data across system. We operate in Kafka → MM → Kafka
configuration
● No self onboarding for customers is possible.
Learnings
● Aggregate clusters rather than too many small clusters.
● Use homogenous hardware as much as possible.
● MM on 0.8.X does not split load evenly. Bugs in MM 0.8.x around data loss.
Upgrade ASAP!
● Use SSDs where possible! It will help in increasing throughput
● Consider using a dedicated cluster for latency sensitive scenarios
● Consider default number of partitions to be some multiple of number of
brokers. That way disk consumption is uniform.
● Monitor everything!
Monitoring Ajna
Service level checks
● SLA alert: Latency increases over x min for a given topic
● SLA alert: Throughput decreases under x bytes/sec for a given topic
Host checks
● CollectD: host level metrics → time series DB → alerts. E.g. disk capacity
● Jmxtrans-agent: JMX beans → time series DB → alerts. E.g. Kafka’s
BytesOneMinuteRatePerSec
● Nagios: process checks → alerts. E.g. number of processes for Kafka user
Alerts
● Argus → email notifications → pager duty integration.
Tools we use
● Ajna auditor - timestamps every event; calculates per topic latency, data loss,
throughput etc.
● Argus - Hbase based time series monitoring and alerting platform
(https://github.com/SalesforceEng/Argus)
● DCT (Dashboard creation tool) - for Argus and Graphite
● Funnel - HTTP endpoint for ingestion of metrics data
The road ahead
10X scale
Management API
Multi-tenancy
AuthZ/N
Throttling
Self-serve UI
Cluster Management
Ajna health monitoring using Prometheus or similar
?s

More Related Content

What's hot

Hyper-Converged Infrastructure: Concepts
Hyper-Converged Infrastructure: ConceptsHyper-Converged Infrastructure: Concepts
Hyper-Converged Infrastructure: ConceptsNick Scuola
 
Dallas Mulesoft Meetup - Log Aggregation and Elastic Stack on Anypoint Platform
Dallas Mulesoft Meetup - Log Aggregation and Elastic Stack on Anypoint PlatformDallas Mulesoft Meetup - Log Aggregation and Elastic Stack on Anypoint Platform
Dallas Mulesoft Meetup - Log Aggregation and Elastic Stack on Anypoint PlatformAdam DesJardin
 
Prometheus – a next-gen Monitoring System
Prometheus – a next-gen Monitoring SystemPrometheus – a next-gen Monitoring System
Prometheus – a next-gen Monitoring SystemFabian Reinartz
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaMetrics
 
Operating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with KubernetesOperating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with KubernetesJonathan Katz
 
promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...
promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...
promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...Tokuhiro Matsuno
 
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...HostedbyConfluent
 
Fiware IoT_IDAS_intro_ul20_v2
Fiware IoT_IDAS_intro_ul20_v2Fiware IoT_IDAS_intro_ul20_v2
Fiware IoT_IDAS_intro_ul20_v2FIWARE
 
Amazon Aurora Storage Demystified: How It All Works (DAT363) - AWS re:Invent ...
Amazon Aurora Storage Demystified: How It All Works (DAT363) - AWS re:Invent ...Amazon Aurora Storage Demystified: How It All Works (DAT363) - AWS re:Invent ...
Amazon Aurora Storage Demystified: How It All Works (DAT363) - AWS re:Invent ...Amazon Web Services
 
Modernizing Applications by Replacing F5 with the NGINX Application Delivery ...
Modernizing Applications by Replacing F5 with the NGINX Application Delivery ...Modernizing Applications by Replacing F5 with the NGINX Application Delivery ...
Modernizing Applications by Replacing F5 with the NGINX Application Delivery ...NGINX, Inc.
 
Centralised logging with ELK stack
Centralised logging with ELK stackCentralised logging with ELK stack
Centralised logging with ELK stackSimon Hanmer
 
Introducing github.com/open-cluster-management – How to deliver apps across c...
Introducing github.com/open-cluster-management – How to deliver apps across c...Introducing github.com/open-cluster-management – How to deliver apps across c...
Introducing github.com/open-cluster-management – How to deliver apps across c...Michael Elder
 
Privileged Access Management (PAM)
Privileged Access Management (PAM)Privileged Access Management (PAM)
Privileged Access Management (PAM)danb02
 
Building Event Driven Systems
Building Event Driven SystemsBuilding Event Driven Systems
Building Event Driven SystemsWSO2
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기Ji-Woong Choi
 
Edge Computing Architecture using GPUs and Kubernetes
Edge Computing Architecture using GPUs and KubernetesEdge Computing Architecture using GPUs and Kubernetes
Edge Computing Architecture using GPUs and KubernetesVirtualTech Japan Inc.
 
Hw09 Large Scale Transaction Analysis
Hw09   Large Scale Transaction AnalysisHw09   Large Scale Transaction Analysis
Hw09 Large Scale Transaction AnalysisCloudera, Inc.
 
Application Performance Monitoring (APM)
Application Performance Monitoring (APM)Application Performance Monitoring (APM)
Application Performance Monitoring (APM)Site24x7
 

What's hot (20)

API strategy with IBM API connect
API strategy with IBM API connectAPI strategy with IBM API connect
API strategy with IBM API connect
 
Hyper-Converged Infrastructure: Concepts
Hyper-Converged Infrastructure: ConceptsHyper-Converged Infrastructure: Concepts
Hyper-Converged Infrastructure: Concepts
 
Dallas Mulesoft Meetup - Log Aggregation and Elastic Stack on Anypoint Platform
Dallas Mulesoft Meetup - Log Aggregation and Elastic Stack on Anypoint PlatformDallas Mulesoft Meetup - Log Aggregation and Elastic Stack on Anypoint Platform
Dallas Mulesoft Meetup - Log Aggregation and Elastic Stack on Anypoint Platform
 
Prometheus – a next-gen Monitoring System
Prometheus – a next-gen Monitoring SystemPrometheus – a next-gen Monitoring System
Prometheus – a next-gen Monitoring System
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
 
Operating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with KubernetesOperating PostgreSQL at Scale with Kubernetes
Operating PostgreSQL at Scale with Kubernetes
 
promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...
promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...
promgen - prometheus managemnet tool / simpleclient_java hacks @ Prometheus c...
 
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...
How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...
 
Fiware IoT_IDAS_intro_ul20_v2
Fiware IoT_IDAS_intro_ul20_v2Fiware IoT_IDAS_intro_ul20_v2
Fiware IoT_IDAS_intro_ul20_v2
 
Amazon Aurora Storage Demystified: How It All Works (DAT363) - AWS re:Invent ...
Amazon Aurora Storage Demystified: How It All Works (DAT363) - AWS re:Invent ...Amazon Aurora Storage Demystified: How It All Works (DAT363) - AWS re:Invent ...
Amazon Aurora Storage Demystified: How It All Works (DAT363) - AWS re:Invent ...
 
Modernizing Applications by Replacing F5 with the NGINX Application Delivery ...
Modernizing Applications by Replacing F5 with the NGINX Application Delivery ...Modernizing Applications by Replacing F5 with the NGINX Application Delivery ...
Modernizing Applications by Replacing F5 with the NGINX Application Delivery ...
 
Centralised logging with ELK stack
Centralised logging with ELK stackCentralised logging with ELK stack
Centralised logging with ELK stack
 
Introducing github.com/open-cluster-management – How to deliver apps across c...
Introducing github.com/open-cluster-management – How to deliver apps across c...Introducing github.com/open-cluster-management – How to deliver apps across c...
Introducing github.com/open-cluster-management – How to deliver apps across c...
 
Privileged Access Management (PAM)
Privileged Access Management (PAM)Privileged Access Management (PAM)
Privileged Access Management (PAM)
 
Building Event Driven Systems
Building Event Driven SystemsBuilding Event Driven Systems
Building Event Driven Systems
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
 
Edge Computing Architecture using GPUs and Kubernetes
Edge Computing Architecture using GPUs and KubernetesEdge Computing Architecture using GPUs and Kubernetes
Edge Computing Architecture using GPUs and Kubernetes
 
Hw09 Large Scale Transaction Analysis
Hw09   Large Scale Transaction AnalysisHw09   Large Scale Transaction Analysis
Hw09 Large Scale Transaction Analysis
 
Application Performance Monitoring (APM)
Application Performance Monitoring (APM)Application Performance Monitoring (APM)
Application Performance Monitoring (APM)
 

Viewers also liked

Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesTodd Palino
 
Secure Kafka at Salesforce.com
Secure Kafka at Salesforce.comSecure Kafka at Salesforce.com
Secure Kafka at Salesforce.comRajasekar Elango
 
Durable Streaming and Enterprise Messaging
Durable Streaming and Enterprise MessagingDurable Streaming and Enterprise Messaging
Durable Streaming and Enterprise MessagingSalesforce Developers
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackAnirvan Chakraborty
 
Demystifying salesforce for developers
Demystifying salesforce for developersDemystifying salesforce for developers
Demystifying salesforce for developersHeitor Souza
 
Kafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringKafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringAnant Rustagi
 
Extreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes WebinarExtreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes WebinarSalesforce Developers
 
Embedded Mirror Maker
Embedded Mirror MakerEmbedded Mirror Maker
Embedded Mirror MakerSimon Suo
 
101 mistakes FINN.no has made with Kafka (Baksida meetup)
101 mistakes FINN.no has made with Kafka (Baksida meetup)101 mistakes FINN.no has made with Kafka (Baksida meetup)
101 mistakes FINN.no has made with Kafka (Baksida meetup)Henning Spjelkavik
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormHow Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormEdureka!
 
Kafka 0.9, Things you should know
Kafka 0.9, Things you should knowKafka 0.9, Things you should know
Kafka 0.9, Things you should knowRatish Ravindran
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by SalesforceThinqloud
 
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...Data Con LA
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudStreamsets Inc.
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
Understanding the Salesforce Architecture: How We Do the Magic We Do
Understanding the Salesforce Architecture: How We Do the Magic We DoUnderstanding the Salesforce Architecture: How We Do the Magic We Do
Understanding the Salesforce Architecture: How We Do the Magic We DoSalesforce Developers
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Spark Summit
 

Viewers also liked (20)

Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
 
Secure Kafka at Salesforce.com
Secure Kafka at Salesforce.comSecure Kafka at Salesforce.com
Secure Kafka at Salesforce.com
 
Durable Streaming and Enterprise Messaging
Durable Streaming and Enterprise MessagingDurable Streaming and Enterprise Messaging
Durable Streaming and Enterprise Messaging
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Demystifying salesforce for developers
Demystifying salesforce for developersDemystifying salesforce for developers
Demystifying salesforce for developers
 
Kafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroringKafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroring
 
Extreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes WebinarExtreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes Webinar
 
Embedded Mirror Maker
Embedded Mirror MakerEmbedded Mirror Maker
Embedded Mirror Maker
 
101 mistakes FINN.no has made with Kafka (Baksida meetup)
101 mistakes FINN.no has made with Kafka (Baksida meetup)101 mistakes FINN.no has made with Kafka (Baksida meetup)
101 mistakes FINN.no has made with Kafka (Baksida meetup)
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormHow Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and Storm
 
Kafka 0.9, Things you should know
Kafka 0.9, Things you should knowKafka 0.9, Things you should know
Kafka 0.9, Things you should know
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by Salesforce
 
Hbase at Salesforce.com
Hbase at Salesforce.comHbase at Salesforce.com
Hbase at Salesforce.com
 
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
Salesforce REST API
Salesforce  REST API Salesforce  REST API
Salesforce REST API
 
Understanding the Salesforce Architecture: How We Do the Magic We Do
Understanding the Salesforce Architecture: How We Do the Magic We DoUnderstanding the Salesforce Architecture: How We Do the Magic We Do
Understanding the Salesforce Architecture: How We Do the Magic We Do
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
 

Similar to Salesforce enabling real time scenarios at scale using kafka

OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Monal Daxini
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
QueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksQueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksClarotech_Events
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Omid Vahdaty
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaSridhar Kumar N
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaSteven Wu
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseBenjamin Bengfort
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLSeveralnines
 
Apache Samza Past, Present and Future
Apache Samza  Past, Present and FutureApache Samza  Past, Present and Future
Apache Samza Past, Present and FutureKartik Paramasivam
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerFederico Palladoro
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streamingdatamantra
 
Monitoring - deeper dive
Monitoring  - deeper diveMonitoring  - deeper dive
Monitoring - deeper diveRobert Kubiś
 
Apache samza past, present and future
Apache samza  past, present and futureApache samza  past, present and future
Apache samza past, present and futureEd Yakabosky
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
 
Amazon CloudWatch - Observability and Monitoring
Amazon CloudWatch - Observability and MonitoringAmazon CloudWatch - Observability and Monitoring
Amazon CloudWatch - Observability and MonitoringRick Hwang
 

Similar to Salesforce enabling real time scenarios at scale using kafka (20)

OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
QueueMetrics - Tips and Tricks
QueueMetrics - Tips and TricksQueueMetrics - Tips and Tricks
QueueMetrics - Tips and Tricks
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
 
Apache Samza Past, Present and Future
Apache Samza  Past, Present and FutureApache Samza  Past, Present and Future
Apache Samza Past, Present and Future
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Monitoring - deeper dive
Monitoring  - deeper diveMonitoring  - deeper dive
Monitoring - deeper dive
 
Apache samza past, present and future
Apache samza  past, present and futureApache samza  past, present and future
Apache samza past, present and future
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
Amazon CloudWatch - Observability and Monitoring
Amazon CloudWatch - Observability and MonitoringAmazon CloudWatch - Observability and Monitoring
Amazon CloudWatch - Observability and Monitoring
 

Recently uploaded

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROmotivationalword821
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 

Recently uploaded (20)

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTRO
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 

Salesforce enabling real time scenarios at scale using kafka

  • 1. Enabling Real-time Scenarios At Scale Using Kafka Nishant Gupta, Engineering@Salesforce
  • 2. Who am I? Nishant Gupta - Sr. Director, Engineering 1.5 years at Salesforce (15+ years in software development) Focus: Distributed systems, data pipelines, data analytics linkedin.com/in/nigupta @ngupta77 medium.com/@ngupta
  • 3. What are we talking today? Overview of real time data analysis platform at Salesforce and zoom into how Kafka fits into this.
  • 4. Let’s start from the beginning... We wanted to understand system health (at host level) across global data centers in real time. I.e. generate system health insights from host level metric events. Thus, the genesis of project Ajna. Ajna - (or the third eye chakra) it is the sixth primary chakra in the body, according to Hindu tradition. … it is believed to reveal insights… More here - https://en.wikipedia.org/wiki/Ajna
  • 5. What we achieved? Mechanism for system health monitoring for all clusters across all Salesforce global data centers. Health monitoring and alerting for new clusters is config driven with zero code change From event to insight - E2E latency is under 10 sec (including network delays) Kafka is the key technology that enabled it.
  • 6. Vision Build a multi-tenant platform to enable real time scenarios at scale
  • 7.
  • 9. Stream processing Text Indexing MR Jobs Machine Learning Raw Store Etc. Collection agentHost Cluster level Kafka (data ingest) Production Cluster MirrorMaker Local subscriber DMZ Aggregate Kafka
  • 10. Let’s talk numbers ● # of clusters per production cluster: 1 ● # of aggregate clusters in DMZ: 1 ● # of topics: Data specific. Ranges from 1 to 100s. ● # of partitions: Data specific. Ranges from 1 to 16 ● # of replications: 3 (across all data, all clusters) ● Data retention: Data specific. Ranges up to 4 days ● Version of Kafka: modified 0.8.2 and modified 0.9. Moving towards vanilla 0.9 ● SSL enabled: Yes.
  • 11. MirrorMaker ● Not all data is mirrored. Selective topics are white listed. ● Max message size - 2MB (working to reduce this) ● Batch.num.messages - 500 ● Queueing.buffer.max.ms - 5000ms ● Fetch.max.message.size - 2MB ● Partition strategy - round robin ● Garbage collection - G1GC with max heap size of 28GB ● We have modified 0.8.2 MM to enable SSL
  • 12. Scale of operation 30+ Kafka clusters ~300 MBPS Aggregate throughput of the system 100s of Billions Avg events/day >30 sec P95 for latency
  • 13. Scenarios we enable ● SR visibility into system health ● Transaction and performance visibility publically - http://trust.salesforce.com/trust/performance ● App log analysis - application performance, business insights etc. ● Network monitoring ● Security and Compliance monitoring ● App’s event based communication and notification scenarios ● … more
  • 14. Operational challenges ● Large number of clusters to manage ● Non-homogeneous hardware for brokers. Capacity planning is hard ● 0.8.x MM is buggy. Data loss is possible. ● No built-in support for QoS ● Operations on the box are manual - need to log into every box to get info ● Lack of traceability of data across system. We operate in Kafka → MM → Kafka configuration ● No self onboarding for customers is possible.
  • 16. ● Aggregate clusters rather than too many small clusters. ● Use homogenous hardware as much as possible. ● MM on 0.8.X does not split load evenly. Bugs in MM 0.8.x around data loss. Upgrade ASAP! ● Use SSDs where possible! It will help in increasing throughput ● Consider using a dedicated cluster for latency sensitive scenarios ● Consider default number of partitions to be some multiple of number of brokers. That way disk consumption is uniform. ● Monitor everything!
  • 18. Service level checks ● SLA alert: Latency increases over x min for a given topic ● SLA alert: Throughput decreases under x bytes/sec for a given topic Host checks ● CollectD: host level metrics → time series DB → alerts. E.g. disk capacity ● Jmxtrans-agent: JMX beans → time series DB → alerts. E.g. Kafka’s BytesOneMinuteRatePerSec ● Nagios: process checks → alerts. E.g. number of processes for Kafka user Alerts ● Argus → email notifications → pager duty integration.
  • 20. ● Ajna auditor - timestamps every event; calculates per topic latency, data loss, throughput etc. ● Argus - Hbase based time series monitoring and alerting platform (https://github.com/SalesforceEng/Argus) ● DCT (Dashboard creation tool) - for Argus and Graphite ● Funnel - HTTP endpoint for ingestion of metrics data
  • 22. 10X scale Management API Multi-tenancy AuthZ/N Throttling Self-serve UI Cluster Management Ajna health monitoring using Prometheus or similar
  • 23. ?s