Nishant Gupta from Salesforce talked about Ajna, a service for monitoring system health across global data centers in real time, and how Kafka is at the center of this system. The talk covers the scenario, key challenges, learnings and best practices.
2. Who am I?
Nishant Gupta - Sr. Director, Engineering
1.5 years at Salesforce (15+ years in software development)
Focus: Distributed systems, data pipelines, data analytics
linkedin.com/in/nigupta
@ngupta77
medium.com/@ngupta
3. What are we talking about today?
An overview of the real-time data analysis platform at Salesforce, and a closer look at how Kafka fits into it.
4. Let’s start from the beginning...
We wanted to understand system health (at the host level) across global data centers in real time, i.e. generate system health insights from host-level metric events.
Thus the genesis of project Ajna.
Ajna (or the third-eye chakra) is the sixth primary chakra in the body, according to Hindu tradition. … it is believed to reveal insights…
More here - https://en.wikipedia.org/wiki/Ajna
5. What did we achieve?
A mechanism for system health monitoring for all clusters across all Salesforce global data centers.
Health monitoring and alerting for new clusters is config-driven, with zero code change.
From event to insight: E2E latency is under 10 sec (including network delays).
Kafka is the key technology that enabled this.
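The "config driven with zero code change" onboarding can be pictured as deriving alert rules from a declarative cluster description. A minimal sketch; the config schema, field names, and cluster name below are hypothetical, not Salesforce's actual format:

```python
# Hypothetical sketch of config-driven onboarding: a new cluster is described
# purely by configuration, and alert rules are derived from it with no code change.

def build_alert_rules(cluster_config):
    """Derive health-alert rules from a declarative cluster description."""
    rules = []
    for metric, threshold in cluster_config["thresholds"].items():
        rules.append({
            "cluster": cluster_config["name"],
            "metric": metric,
            "alert_if_above": threshold,
        })
    return rules

# Onboarding a new cluster is just adding a config entry like this:
new_cluster = {
    "name": "na44-app",          # hypothetical cluster name
    "thresholds": {"cpu_pct": 90, "disk_pct": 85},
}

rules = build_alert_rules(new_cluster)
```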
9. Architecture
A collection agent on each host publishes into a cluster-level Kafka (data ingest) inside the production cluster; MirrorMaker copies data into an aggregate Kafka in the DMZ, where local subscribers feed stream processing, text indexing, MR jobs, machine learning, a raw store, etc.
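The data path above can be modeled as a timestamped event flowing from the collection agent to the local subscriber; summing the per-hop delays is how an E2E latency figure like "under 10 sec" would be accounted. Stage names and delay values here are purely illustrative:

```python
# Illustrative model of the Ajna data path: each stage stamps the event,
# and E2E latency is the difference between the first and last stamps.

PIPELINE = [
    "collection_agent",      # on the host
    "cluster_kafka",         # data ingest, production cluster
    "mirrormaker",           # crosses into the DMZ
    "aggregate_kafka",       # DMZ aggregate cluster
    "local_subscriber",      # stream processing, indexing, ML, raw store...
]

def run_through_pipeline(start_ts, hop_delays):
    """Return {stage: arrival_timestamp} for an event entering at start_ts."""
    stamps, ts = {}, start_ts
    for stage, delay in zip(PIPELINE, hop_delays):
        ts += delay
        stamps[stage] = ts
    return stamps

stamps = run_through_pipeline(0.0, [0.1, 0.5, 2.0, 1.0, 0.4])
e2e = stamps["local_subscriber"] - stamps["collection_agent"]  # 3.9 sec
```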
10. Let’s talk numbers
● # of clusters per production cluster: 1
● # of aggregate clusters in DMZ: 1
● # of topics: Data specific. Ranges from 1 to 100s.
● # of partitions: Data specific. Ranges from 1 to 16
● Replication factor: 3 (across all data, all clusters)
● Data retention: data specific; ranges up to 4 days
● Kafka version: modified 0.8.2 and modified 0.9; moving towards vanilla 0.9
● SSL enabled: Yes.
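With SSL enabled everywhere, a client's connection settings end up as configuration like the following kafka-python-style sketch. The broker address and certificate paths are placeholders, and the talk's 0.8.2/0.9-era clients used different property names; this only illustrates the shape of an SSL-enabled setup:

```python
# Sketch of SSL-enabled client settings in kafka-python style.
# Broker address and certificate paths are placeholders, not real values.
producer_config = {
    "bootstrap_servers": ["broker1:9093"],   # SSL listener port (placeholder)
    "security_protocol": "SSL",
    "ssl_cafile": "/etc/kafka/ca.pem",       # placeholder certificate paths
    "ssl_certfile": "/etc/kafka/client.pem",
    "ssl_keyfile": "/etc/kafka/client.key",
    "acks": "all",                           # wait for the full replica set
}
```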
11. MirrorMaker
● Not all data is mirrored; selective topics are whitelisted.
● Max message size: 2MB (working to reduce this)
● batch.num.messages: 500
● queueing.buffer.max.ms: 5000ms
● fetch.max.message.size: 2MB
● Partition strategy: round robin
● Garbage collection: G1GC with a max heap size of 28GB
● We have modified 0.8.2 MM to enable SSL
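The round-robin partition strategy can be sketched as a rotating counter over the partition count; this is a simplified stand-in for MirrorMaker's actual partitioner, spreading mirrored messages evenly regardless of key:

```python
import itertools

def round_robin_partitioner(num_partitions):
    """Return a callable that yields partition ids in strict rotation,
    as a round-robin strategy does."""
    counter = itertools.count()
    def next_partition():
        return next(counter) % num_partitions
    return next_partition

pick = round_robin_partitioner(4)
assignments = [pick() for _ in range(8)]   # [0, 1, 2, 3, 0, 1, 2, 3]
```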
12. Scale of operation
● 30+ Kafka clusters
● ~300 MBPS aggregate throughput of the system
● 100s of billions avg events/day
● <30 sec P95 latency
13. Scenarios we enable
● SR visibility into system health
● Transaction and performance visibility publicly -
http://trust.salesforce.com/trust/performance
● App log analysis - application performance, business insights etc.
● Network monitoring
● Security and Compliance monitoring
● App’s event based communication and notification scenarios
● … more
14. Operational challenges
● Large number of clusters to manage
● Non-homogeneous hardware for brokers. Capacity planning is hard
● 0.8.x MM is buggy. Data loss is possible.
● No built-in support for QoS
● Operations on the box are manual - need to log into every box to get info
● Lack of traceability of data across the system. We operate in a Kafka → MM → Kafka configuration.
● Self-service onboarding for customers is not possible.
16. Learnings and best practices
● Aggregate clusters rather than too many small clusters.
● Use homogeneous hardware as much as possible.
● MM on 0.8.x does not split load evenly, and has bugs around data loss. Upgrade ASAP!
● Use SSDs where possible! They help increase throughput.
● Consider using a dedicated cluster for latency-sensitive scenarios.
● Consider making the default number of partitions a multiple of the number of brokers. That way disk consumption is uniform.
● Monitor everything!
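The partition-count advice can be made concrete: pick the smallest multiple of the broker count that meets a desired minimum, so every broker ends up with the same number of partitions. A hypothetical helper:

```python
def default_partition_count(num_brokers, min_partitions):
    """Smallest multiple of num_brokers that is >= min_partitions, so
    partitions (and hence disk consumption) spread evenly across brokers."""
    multiples = -(-min_partitions // num_brokers)  # ceiling division
    return max(1, multiples) * num_brokers

# e.g. 6 brokers, at least 16 partitions wanted -> 18 (3 per broker)
```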
18. Service level checks
● SLA alert: Latency increases over x min for a given topic
● SLA alert: Throughput decreases under x bytes/sec for a given topic
Host checks
● CollectD: host level metrics → time series DB → alerts. E.g. disk capacity
● Jmxtrans-agent: JMX beans → time series DB → alerts. E.g. Kafka’s
BytesOneMinuteRatePerSec
● Nagios: process checks → alerts. E.g. number of processes for Kafka user
Alerts
● Argus → email notifications → PagerDuty integration.
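The two SLA checks reduce to threshold tests over a window of recent per-topic measurements. A minimal sketch; the window handling, alert names, and the choice to require the whole window to breach (rather than a single spike) are illustrative assumptions:

```python
def sla_alerts(latency_window_sec, throughput_window_bps,
               max_latency_sec, min_throughput_bps):
    """Return SLA alert names for a topic, given recent measurement windows.

    - latency alert fires if every sample in the window exceeds max_latency_sec
      (latency has stayed elevated for the whole window, not one spike)
    - throughput alert fires if every sample falls below min_throughput_bps
    """
    alerts = []
    if latency_window_sec and all(s > max_latency_sec for s in latency_window_sec):
        alerts.append("latency_sla")
    if throughput_window_bps and all(s < min_throughput_bps for s in throughput_window_bps):
        alerts.append("throughput_sla")
    return alerts
```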
20. ● Ajna auditor - timestamps every event; calculates per topic latency, data loss,
throughput etc.
● Argus - HBase-based time series monitoring and alerting platform
(https://github.com/SalesforceEng/Argus)
● DCT (Dashboard creation tool) - for Argus and Graphite
● Funnel - HTTP endpoint for ingestion of metrics data
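The Ajna auditor's per-topic accounting (latency, data loss, throughput) can be sketched from timestamped events. The field names and stats below are assumptions for illustration, not the auditor's real schema:

```python
def audit_topic(events, expected_count, window_sec):
    """Compute per-topic stats from auditor-timestamped events.

    Each event carries a source timestamp and an arrival timestamp
    (field names here are hypothetical).
    """
    latencies = [e["arrived_at"] - e["produced_at"] for e in events]
    return {
        "avg_latency_sec": sum(latencies) / len(latencies),
        "loss": expected_count - len(events),        # events that never arrived
        "events_per_sec": len(events) / window_sec,  # throughput over the window
    }

events = [
    {"produced_at": 0.0, "arrived_at": 4.0},
    {"produced_at": 1.0, "arrived_at": 7.0},
]
stats = audit_topic(events, expected_count=3, window_sec=10.0)
# avg latency 5.0 sec, 1 event lost, 0.2 events/sec
```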