Nishant Gupta from Salesforce talked about Ajna, a service for monitoring system health across global data centers in real time, and how Kafka is at the center of this system. The talk covers the scenario, key challenges, learnings and best practices.
2. Who am I?
Nishant Gupta - Sr. Director, Engineering
1.5 years at Salesforce (15+ years in software development)
Focus: Distributed systems, data pipelines, data analytics
linkedin.com/in/nigupta
@ngupta77
medium.com/@ngupta
3. What are we talking about today?
An overview of the real-time data analysis platform at Salesforce, and a closer look at how Kafka fits into it.
4. Let’s start from the beginning...
We wanted to understand system health (at the host level) across global data centers in real time, i.e. generate system health insights from host-level metric events.
Thus the genesis of project Ajna.
Ajna (or the third-eye chakra) is the sixth primary chakra in the body, according to Hindu tradition. … it is believed to reveal insights…
More here - https://en.wikipedia.org/wiki/Ajna
5. What did we achieve?
A mechanism for system health monitoring for all clusters across all Salesforce global data centers.
Health monitoring and alerting for new clusters is config-driven, with zero code change.
From event to insight: E2E latency is under 10 sec (including network delays).
Kafka is the key technology that enabled this.
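The "config driven with zero code change" onboarding can be pictured as deriving alert rules from a declarative cluster description. A minimal sketch; the config schema, field names, and cluster name below are hypothetical, not Salesforce's actual format:

```python
# Hypothetical sketch of config-driven onboarding: a new cluster is described
# purely by configuration, and alert rules are derived from it with no code change.

def build_alert_rules(cluster_config):
    """Derive health-alert rules from a declarative cluster description."""
    rules = []
    for metric, threshold in cluster_config["thresholds"].items():
        rules.append({
            "cluster": cluster_config["name"],
            "metric": metric,
            "alert_if_above": threshold,
        })
    return rules

# Onboarding a new cluster is just adding a config entry like this:
new_cluster = {
    "name": "na44-app",          # hypothetical cluster name
    "thresholds": {"cpu_pct": 90, "disk_pct": 85},
}

rules = build_alert_rules(new_cluster)
```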
9. Architecture
A collection agent on each host publishes into a cluster-level Kafka (data ingest) inside the production cluster; MirrorMaker copies data into an aggregate Kafka in the DMZ, where local subscribers feed stream processing, text indexing, MR jobs, machine learning, a raw store, etc.
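The data path above can be modeled as a timestamped event flowing from the collection agent to the local subscriber; summing the per-hop delays is how an E2E latency figure like "under 10 sec" would be accounted. Stage names and delay values here are purely illustrative:

```python
# Illustrative model of the Ajna data path: each stage stamps the event,
# and E2E latency is the difference between the first and last stamps.

PIPELINE = [
    "collection_agent",      # on the host
    "cluster_kafka",         # data ingest, production cluster
    "mirrormaker",           # crosses into the DMZ
    "aggregate_kafka",       # DMZ aggregate cluster
    "local_subscriber",      # stream processing, indexing, ML, raw store...
]

def run_through_pipeline(start_ts, hop_delays):
    """Return {stage: arrival_timestamp} for an event entering at start_ts."""
    stamps, ts = {}, start_ts
    for stage, delay in zip(PIPELINE, hop_delays):
        ts += delay
        stamps[stage] = ts
    return stamps

stamps = run_through_pipeline(0.0, [0.1, 0.5, 2.0, 1.0, 0.4])
e2e = stamps["local_subscriber"] - stamps["collection_agent"]  # 3.9 sec
```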
10. Let’s talk numbers
● # of clusters per production cluster: 1
● # of aggregate clusters in DMZ: 1
● # of topics: Data specific. Ranges from 1 to 100s.
● # of partitions: Data specific. Ranges from 1 to 16
● Replication factor: 3 (across all data, all clusters)
● Data retention: data specific; ranges up to 4 days
● Kafka version: modified 0.8.2 and modified 0.9; moving towards vanilla 0.9
● SSL enabled: Yes.
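With SSL enabled everywhere, a client's connection settings end up as configuration like the following kafka-python-style sketch. The broker address and certificate paths are placeholders, and the talk's 0.8.2/0.9-era clients used different property names; this only illustrates the shape of an SSL-enabled setup:

```python
# Sketch of SSL-enabled client settings in kafka-python style.
# Broker address and certificate paths are placeholders, not real values.
producer_config = {
    "bootstrap_servers": ["broker1:9093"],   # SSL listener port (placeholder)
    "security_protocol": "SSL",
    "ssl_cafile": "/etc/kafka/ca.pem",       # placeholder certificate paths
    "ssl_certfile": "/etc/kafka/client.pem",
    "ssl_keyfile": "/etc/kafka/client.key",
    "acks": "all",                           # wait for the full replica set
}
```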
11. MirrorMaker
● Not all data is mirrored; selective topics are whitelisted.
● Max message size: 2MB (working to reduce this)
● batch.num.messages: 500
● queueing.buffer.max.ms: 5000ms
● fetch.max.message.size: 2MB
● Partition strategy: round robin
● Garbage collection: G1GC with a max heap size of 28GB
● We have modified 0.8.2 MM to enable SSL
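The round-robin partition strategy can be sketched as a rotating counter over the partition count; this is a simplified stand-in for MirrorMaker's actual partitioner, spreading mirrored messages evenly regardless of key:

```python
import itertools

def round_robin_partitioner(num_partitions):
    """Return a callable that yields partition ids in strict rotation,
    as a round-robin strategy does."""
    counter = itertools.count()
    def next_partition():
        return next(counter) % num_partitions
    return next_partition

pick = round_robin_partitioner(4)
assignments = [pick() for _ in range(8)]   # [0, 1, 2, 3, 0, 1, 2, 3]
```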
12. Scale of operation
● 30+ Kafka clusters
● ~300 MBPS aggregate throughput of the system
● 100s of billions avg events/day
● <30 sec P95 latency
13. Scenarios we enable
● SR visibility into system health
● Transaction and performance visibility publicly -
http://trust.salesforce.com/trust/performance
● App log analysis - application performance, business insights etc.
● Network monitoring
● Security and Compliance monitoring
● App’s event based communication and notification scenarios
● … more
14. Operational challenges
● Large number of clusters to manage
● Non-homogeneous hardware for brokers. Capacity planning is hard
● 0.8.x MM is buggy. Data loss is possible.
● No built-in support for QoS
● Operations on the box are manual - need to log into every box to get info
● Lack of traceability of data across the system. We operate in a Kafka → MM → Kafka configuration.
● Self-service onboarding for customers is not possible.
16. Learnings and best practices
● Aggregate clusters rather than too many small clusters.
● Use homogeneous hardware as much as possible.
● MM on 0.8.x does not split load evenly, and has bugs around data loss. Upgrade ASAP!
● Use SSDs where possible! They help increase throughput.
● Consider using a dedicated cluster for latency-sensitive scenarios.
● Consider making the default number of partitions a multiple of the number of brokers. That way disk consumption is uniform.
● Monitor everything!
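The partition-count advice can be made concrete: pick the smallest multiple of the broker count that meets a desired minimum, so every broker ends up with the same number of partitions. A hypothetical helper:

```python
def default_partition_count(num_brokers, min_partitions):
    """Smallest multiple of num_brokers that is >= min_partitions, so
    partitions (and hence disk consumption) spread evenly across brokers."""
    multiples = -(-min_partitions // num_brokers)  # ceiling division
    return max(1, multiples) * num_brokers

# e.g. 6 brokers, at least 16 partitions wanted -> 18 (3 per broker)
```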
18. Service level checks
● SLA alert: Latency increases over x min for a given topic
● SLA alert: Throughput decreases under x bytes/sec for a given topic
Host checks
● CollectD: host level metrics → time series DB → alerts. E.g. disk capacity
● Jmxtrans-agent: JMX beans → time series DB → alerts. E.g. Kafka’s
BytesOneMinuteRatePerSec
● Nagios: process checks → alerts. E.g. number of processes for Kafka user
Alerts
● Argus → email notifications → PagerDuty integration.
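The two SLA checks reduce to threshold tests over a window of recent per-topic measurements. A minimal sketch; the window handling, alert names, and the choice to require the whole window to breach (rather than a single spike) are illustrative assumptions:

```python
def sla_alerts(latency_window_sec, throughput_window_bps,
               max_latency_sec, min_throughput_bps):
    """Return SLA alert names for a topic, given recent measurement windows.

    - latency alert fires if every sample in the window exceeds max_latency_sec
      (latency has stayed elevated for the whole window, not one spike)
    - throughput alert fires if every sample falls below min_throughput_bps
    """
    alerts = []
    if latency_window_sec and all(s > max_latency_sec for s in latency_window_sec):
        alerts.append("latency_sla")
    if throughput_window_bps and all(s < min_throughput_bps for s in throughput_window_bps):
        alerts.append("throughput_sla")
    return alerts
```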
20. ● Ajna auditor - timestamps every event; calculates per topic latency, data loss,
throughput etc.
● Argus - HBase-based time series monitoring and alerting platform
(https://github.com/SalesforceEng/Argus)
● DCT (Dashboard creation tool) - for Argus and Graphite
● Funnel - HTTP endpoint for ingestion of metrics data
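The Ajna auditor's per-topic accounting (latency, data loss, throughput) can be sketched from timestamped events. The field names and stats below are assumptions for illustration, not the auditor's real schema:

```python
def audit_topic(events, expected_count, window_sec):
    """Compute per-topic stats from auditor-timestamped events.

    Each event carries a source timestamp and an arrival timestamp
    (field names here are hypothetical).
    """
    latencies = [e["arrived_at"] - e["produced_at"] for e in events]
    return {
        "avg_latency_sec": sum(latencies) / len(latencies),
        "loss": expected_count - len(events),        # events that never arrived
        "events_per_sec": len(events) / window_sec,  # throughput over the window
    }

events = [
    {"produced_at": 0.0, "arrived_at": 4.0},
    {"produced_at": 1.0, "arrived_at": 7.0},
]
stats = audit_topic(events, expected_count=3, window_sec=10.0)
# avg latency 5.0 sec, 1 event lost, 0.2 events/sec
```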