At Stitch Fix, we maintain a distributed Kafka Connect cluster running several hundred connectors. Over the years, we've learned invaluable lessons for keeping our connectors running 24/7. As many conference-goers know, event-driven applications require a new way of thinking, and with this new paradigm come unique operational considerations, which I will delve into. Specifically, this talk is an overview of:

1) Our deployment model and use case: a large distributed Kafka Connect cluster that powers a self-service data integration platform tailored to the needs of our Data Scientists.
2) Our favorite operational tools for making things run smoothly: the jobs, alerts, and dashboards we find most useful, plus a quick rundown of the admin service we wrote that sits on top of Kafka Connect.
3) Our approach to end-to-end integrity monitoring: the tracer bullet system we built to constantly monitor all our sources and sinks.
4) Lessons learned from production issues and painful migrations: why, oh why did we not use schemas from the beginning?? Pausing connectors doesn't do what you think it does... rebalancing is tricky... jar hell problems are a thing of the past: upgrade and use plugin.path!
5) Future areas of improvement.

The target audience member is an engineer who is curious about Kafka Connect or currently maintains a small to medium sized Kafka Connect cluster. They should walk away from the talk with increased confidence in using and maintaining a large Kafka Connect cluster, armed with the hard-won experiences of our team. For the most part, we've been very happy with our Kafka Connect powered data integration platform, and we'd love to share our lessons learned with the community in order to drive adoption.
11. Jobs
● Restart crashed connectors 🎬
○ Post to slack, include crash reason
● Verify resources in external systems 🕵
● Health checks 😷
● Autoscale # of tasks ⚖
● Backup connect-configs topic 💾
○ Store in Github to easily review history
● Expire Elasticsearch indices ☠
● CI/CD jobs to build custom connectors 🚢
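The "restart crashed connectors" job can be sketched against the Kafka Connect REST API (`GET /connectors/<name>/status` and the per-task restart endpoint). This is a minimal sketch, not the production job: the `CONNECT_URL` value is a placeholder, and the Slack post is left as a comment.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # placeholder Connect REST endpoint

def failed_tasks(status):
    """Given a /connectors/<name>/status payload, return (task id, crash
    reason) pairs for every task in the FAILED state."""
    return [
        (t["id"], t.get("trace", "unknown"))
        for t in status.get("tasks", [])
        if t["state"] == "FAILED"
    ]

def restart_failed(name):
    """Poll one connector's status and restart any crashed tasks."""
    with urllib.request.urlopen(f"{CONNECT_URL}/connectors/{name}/status") as resp:
        status = json.load(resp)
    for task_id, reason in failed_tasks(status):
        req = urllib.request.Request(
            f"{CONNECT_URL}/connectors/{name}/tasks/{task_id}/restart",
            method="POST",
        )
        urllib.request.urlopen(req)
        # post `reason` to Slack here
```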

15. Single Message Transforms
Standard Transforms
● TimestampRouter
○ Used in Elasticsearch connectors
Custom Transforms
● RegexFieldFilter
○ Filter out messages by matching Regex to Json path
○ Skip corrupt data
● DropField
○ Delete fields, e.g. PII
○ Make Elasticsearch happy
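A real custom SMT is a Java class implementing Kafka Connect's `Transformation` interface; as a sketch of the core matching logic behind something like `RegexFieldFilter` (the function name and dotted-path convention here are hypothetical), the filter decision looks roughly like:

```python
import re

def regex_field_filter(record, path, pattern):
    """Return True if the field at the dotted `path` matches `pattern`,
    i.e. the record should be filtered out (e.g. corrupt data)."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return False  # path missing: keep the record
        value = value[key]
    return re.search(pattern, str(value)) is not None

# Skip corrupt rows whose payload.status field matches "CORRUPT"
record = {"payload": {"status": "CORRUPT_V1", "id": 7}}
assert regex_field_filter(record, "payload.status", r"CORRUPT")
```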
18. Admin Service
● Wraps Kafka Connect API 🎁
● Manages resources in external data systems 👉
○ Create tables in the Hive Metastore
○ Configure alerts
○ Create/delete topics
○ TODO: create index patterns in Kibana
● Sample topics in Kafka 🔬
21. API Wrapper
● Hard codes default values
○ AWS defaults
○ Schema related defaults
○ Security related defaults
○ Kafka hostnames
○ Kafka Connect class name configurations
● Injects custom fields
○ Connector owner metadata
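Layering hard-coded defaults under user-supplied settings and injecting owner metadata can be sketched as a simple config merge. The default values and the `owner` key below are illustrative, not our actual configuration:

```python
# Hypothetical defaults; the real values are site-specific.
DEFAULTS = {
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
}

def build_config(connector_class, owner, user_config):
    """Layer user-supplied settings over hard-coded defaults and inject
    owner metadata; user settings win on conflict."""
    config = dict(DEFAULTS)
    config.update(user_config)
    config["connector.class"] = connector_class
    config["owner"] = owner  # custom field, surfaced in alerts/dashboards
    return config
```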
22. API Wrapper Design
Requirements
● Adding new connectors should be very easy
● Adding new fields to connectors should be very easy
● Typos should be hard to make, ideally statically enforced by compiler
Design
● Redesigned 3 times
● Used Jackson Json Annotations for Json SerDe
● Very hierarchical OOP design
● Lots of tests...
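Our implementation is Java with Jackson annotations for JSON SerDe; as a rough sketch of the same "declare each field once so typos can't creep in" idea, here is a Python dataclass analog. The dotted keys it emits follow the Confluent S3 sink's config style (`s3.bucket.name`, `flush.size`):

```python
from dataclasses import dataclass, asdict

@dataclass
class S3SinkConfig:
    """Field names are declared exactly once, so a typo fails fast
    instead of silently producing an unknown config key."""
    topics: str
    s3_bucket_name: str
    flush_size: int = 1000

    def to_connect_config(self):
        # Map attribute names to Kafka Connect's dotted key style.
        return {k.replace("_", "."): v for k, v in asdict(self).items()}
```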
23. API Wrapper Lessons Learned
Pros 👍
● Greatly simplified the web UI React code
● Exposed simple API for other teams to script against
● Consolidated default configs in one place
● Totally abstracts Kafka Connect
Cons 👎
● Friction when installing new connectors
25. End-to-End Monitoring Requirements
● Monitor for
○ Delayed data
○ Duplicated data
○ Previously delivered data has disappeared
○ “Late arriving” data is not correctly delivered
○ Failures to publish data
● All ingresses and egresses are monitored 24/7
● Use user-level interfaces only
● ❗Very important❗: staging must have all the same monitoring as prod!!
27. Tracer Bullets Overview
● Send synthetic events (“tracer bullets”) to all ingress points
● Funnel all tracer bullets into tracer_bullet topic
● Create at least 1 of every kind of Sink Connector consuming from the tracer_bullet topic
● Create tracer bullets with logical timestamps
● Send current and “late arriving” data
● Persist metadata for tracer bullets in external system (i.e. Elasticsearch)
33. Tracer Bullet Payloads
Must include at least
● Unique id
● ❗Very important❗ LOGICAL timestamp (a.k.a event time)
34. Tracer Bullets Metadata
Tracer bullet metadata should include
● Timestamp sent
● Tracer bullet’s unique ID
● Ingress point
● Expected egress points
● Send status (SUCCESS, SEND_FAILED, etc.)
● # of duplicates found in sink
● Delivery timestamp (if delivered)
● Environment (dev/staging/prod)
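The fields above can be captured in one metadata record per bullet; the field and enum names in this sketch are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TracerBulletMetadata:
    bullet_id: str                # the tracer bullet's unique ID
    sent_at: str                  # wall-clock send timestamp
    ingress: str                  # e.g. "http_gateway"
    expected_egresses: List[str]  # e.g. ["s3", "elasticsearch"]
    environment: str              # dev / staging / prod
    send_status: str = "SUCCESS"  # or SEND_FAILED, ...
    duplicates_found: int = 0
    delivered_at: Optional[str] = None  # filled in by the polling job
```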
35. Tracer Bullet Polling
● Continuously queries every destination
○ Queries a sliding window of several hours
○ Queries the “late arriving” tracer bullets
● Check for duplicates
● Updates metadata when tracer bullets are delivered (or disappear!)
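The duplicate check reduces to counting bullet ids observed in a sink over the polling window:

```python
from collections import Counter

def find_duplicates(delivered_ids):
    """Given the tracer-bullet ids observed in a sink over the polling
    window, return the ids that appear more than once."""
    return sorted(i for i, n in Counter(delivered_ids).items() if n > 1)

assert find_duplicates(["a", "b", "a", "c"]) == ["a"]
```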
36. Tracer Bullet Alerting
● Poll the tracer bullet metadata and alert when:
○ Duplicates were found
○ Tracer bullets are delayed (according to your SLAs)
○ Previously delivered tracer bullets mysteriously disappear
○ Tracer bullets failed to send in the first place
● Alert if there is no tracer bullet metadata
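The alert conditions above can be evaluated per metadata record; the field names here follow a hypothetical metadata schema, and epoch-second timestamps are assumed for simplicity:

```python
def alerts_for(meta, now_epoch, sla_seconds):
    """Return the alert reasons for one tracer bullet's metadata record."""
    reasons = []
    if meta.get("duplicates_found", 0) > 0:
        reasons.append("duplicates")
    if meta.get("send_status") != "SUCCESS":
        reasons.append("send_failed")
    if meta.get("delivered_at") is None and now_epoch - meta["sent_at"] > sla_seconds:
        reasons.append("delayed")  # SLA-dependent
    if meta.get("disappeared"):
        reasons.append("previously_delivered_disappeared")
    return reasons
```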
38. Tracer Bullet Summary Job ❗Very Important❗
● Generate daily reports
○ Overall end-to-end latency
○ No. of tracer bullets sent per sink/source
● Indicates tracer bullets are functioning properly
39. Tracer Bullets Lessons Learned
● Make them extensible!
● Tracer bullets: just a big streaming application… write good tests!
● Dev environments are hard…
● Tracer bullets allowed for fewer integration tests
41. Pausing Connectors
● PUT /connectors/my-connector/pause
● What does pausing connectors actually do?
○ Pauses poll loop
● How to really stop a connector?
○ Delete and recreate
○ Shut down Kafka Connect
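The delete-and-recreate dance maps onto three REST calls; this sketch just plans them (snapshot the config first, since deleting discards it), rather than hitting a live worker:

```python
def stop_connector_plan(name):
    """Sketch the REST calls needed to *really* stop a connector:
    snapshot its config, delete it, and recreate it later from the
    saved config. (PUT .../pause only pauses the poll loop.)"""
    base = f"/connectors/{name}"
    return [
        ("GET", f"{base}/config"),  # save the config for later recreation
        ("DELETE", base),           # actually stops and deallocates tasks
        # later: PUT {base}/config with the saved body to recreate
    ]
```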
42. Jar Hell
[2018-02-18 10:28:33,290] INFO Loading plugin from: /home/ubuntu/kafka_2.11-1.0.0/plugins/lib (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:179)
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.collect.Sets$SetView.iterator()Lcom/google/common/collect/UnmodifiableIterator;
    at org.reflections.Reflections.expandSuperTypes(Reflections.java:380)
    at org.reflections.Reflections.<init>(Reflections.java:126)
    at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanPluginPath(DelegatingClassLoader.java:258)
    at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanUrlsAndAddPlugins(DelegatingClassLoader.java:201)
    at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:193)
    at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:153)
    at org.apache.kafka.connect.runtime.isolation.Plugins.<init>(Plugins.java:47)
    at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:70)
43. Jar Hell
[2018-02-18 10:28:33,290] INFO Loading plugin from: /home/ubuntu/kafka_2.11-1.0.0/plugins/lib
(org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:179)
Exception in thread "main" java.lang.NoSuchMethodError:
com.google.common.collect.Sets$SetView.iterator()Lcom/google/common/collect/UnmodifiableIterator;
at org.reflections.Reflections.expandSuperTypes(Reflections.java:380)
at org.reflections.Reflections.<init>(Reflections.java:126)
at
org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanPluginPath(DelegatingClassLoader.java:258)
at
org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanUrlsAndAddPlugins(DelegatingClassLoader.java:201)
at
org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:193)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:153)
at org.apache.kafka.connect.runtime.isolation.Plugins.<init>(Plugins.java:47)
at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:70)f
43
44. Jar Hell
● As of version 0.11.0.0 (CP 3.3.0), Kafka Connect has plugin.path
● USE PLUGIN.PATH!!!!
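With plugin.path, each connector's jars live under their own directory and get an isolated classloader, so dependency clashes like the Guava one above go away. A worker config fragment (the paths below are illustrative):

```properties
# connect-distributed.properties — isolate each connector's jars
# under its own directory instead of one shared classpath:
plugin.path=/opt/connect/plugins
#   /opt/connect/plugins/kafka-connect-s3/...
#   /opt/connect/plugins/kafka-connect-elasticsearch/...
```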
45. Schemas
● Migrating to schemas is really tough
● Use schemas from the beginning
● Some connectors require schemas
○ JDBC Sink Connector
46. Rebalancing
What is it
● Created/deleted connectors & added/removed workers trigger rebalances
Why is it bad
● API becomes unresponsive (responds with 409 status)
● Thundering rebalances caused instability in Elasticsearch
● S3 Connector + RegexFieldFilter transform caused big problems
● As of 2.3.0 (CP 5.3.0) rebalancing is much better!!! Upgrade, upgrade, upgrade!
○ Shout out to Konstantine Karantasis for Cooperative Incremental Rebalances
48. Conclusion
● Kafka Connect is awesome! 😄😄😄
● Metrics 📈
○ ❗Very Important❗: per topic message/sec❗AND❗ bytes/sec
● Write an admin service to tie together your whole system 🎁
● End-to-end monitoring 🔥🔥🔥
○ ❗Very Important❗: have monitoring in both prod and staging
○ ❗Very Important❗: generate daily summary reports
○ ❗Very Important❗: universal support for logical timestamps
● Upgrade early and upgrade often! 💯