At Stitch Fix, we maintain a distributed Kafka Connect cluster running several hundred connectors. Over the years, we've learned invaluable lessons for keeping our connectors running 24/7. As many conference-goers know, event-driven applications require a new way of thinking, and with this new paradigm come unique operational considerations, which I will delve into. Specifically, this talk is an overview of:

1) Our deployment model and use case: a large distributed Kafka Connect cluster that powers a self-service data integration platform tailored to the needs of our Data Scientists.
2) Our favorite operational tools for making things run smoothly: the jobs, alerts, and dashboards we find most useful, plus a quick rundown of the admin service we wrote that sits on top of Kafka Connect.
3) Our approach to end-to-end integrity monitoring: the tracer bullet system we built to constantly monitor all our sources and sinks.
4) Lessons learned from production issues and painful migrations: why, oh why did we not use schemas from the beginning?? Pausing connectors doesn't do what you think it does... rebalancing is tricky... jar hell problems are a thing of the past: upgrade and use plugin.path!
5) Future areas of improvement.

The target audience member is an engineer who is curious about Kafka Connect or currently maintains a small to medium sized Kafka Connect cluster. They should walk away from the talk with increased confidence in using and maintaining a large Kafka Connect cluster, armed with the hard-won experiences of our team. For the most part, we've been very happy with our Kafka Connect powered data integration platform, and we'd love to share our lessons learned with the community in order to drive adoption.
11. Jobs
● Restart crashed connectors 🎬
○ Post to slack, include crash reason
● Verify resources in external systems 🕵
● Health checks 😷
● Autoscale # of tasks ⚖
● Backup connect-configs topic 💾
○ Store in Github to easily review history
● Expire Elasticsearch indices ☠
● CI/CD jobs to build custom connectors 🚢
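The "restart crashed connectors" job can be sketched against the Kafka Connect REST API (`GET /connectors/<name>/status` and the per-task restart endpoint). This is a minimal sketch, not the production job: the `CONNECT_URL` value is a placeholder, and the Slack post is left as a comment.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # placeholder Connect REST endpoint

def failed_tasks(status):
    """Given a /connectors/<name>/status payload, return (task id, crash
    reason) pairs for every task in the FAILED state."""
    return [
        (t["id"], t.get("trace", "unknown"))
        for t in status.get("tasks", [])
        if t["state"] == "FAILED"
    ]

def restart_failed(name):
    """Poll one connector's status and restart any crashed tasks."""
    with urllib.request.urlopen(f"{CONNECT_URL}/connectors/{name}/status") as resp:
        status = json.load(resp)
    for task_id, reason in failed_tasks(status):
        req = urllib.request.Request(
            f"{CONNECT_URL}/connectors/{name}/tasks/{task_id}/restart",
            method="POST",
        )
        urllib.request.urlopen(req)
        # post `reason` to Slack here
```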

15. Single Message Transforms
Standard Transforms
● TimestampRouter
○ Used in Elasticsearch connectors
Custom Transforms
● RegexFieldFilter
○ Filter out messages by matching Regex to Json path
○ Skip corrupt data
● DropField
○ Delete fields, e.g. PII
○ Make Elasticsearch happy
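A real custom SMT is a Java class implementing Kafka Connect's `Transformation` interface; as a sketch of the core matching logic behind something like `RegexFieldFilter` (the function name and dotted-path convention here are hypothetical), the filter decision looks roughly like:

```python
import re

def regex_field_filter(record, path, pattern):
    """Return True if the field at the dotted `path` matches `pattern`,
    i.e. the record should be filtered out (e.g. corrupt data)."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return False  # path missing: keep the record
        value = value[key]
    return re.search(pattern, str(value)) is not None

# Skip corrupt rows whose payload.status field matches "CORRUPT"
record = {"payload": {"status": "CORRUPT_V1", "id": 7}}
assert regex_field_filter(record, "payload.status", r"CORRUPT")
```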
18. Admin Service
● Wraps Kafka Connect API 🎁
● Manages resources in external data systems 👉
○ Create tables in the Hive Metastore
○ Configure alerts
○ Create/delete topics
○ TODO: create index patterns in Kibana
● Sample topics in Kafka 🔬
21. API Wrapper
● Hard codes default values
○ AWS defaults
○ Schema related defaults
○ Security related defaults
○ Kafka hostnames
○ Kafka Connect class name configurations
● Injects custom fields
○ Connector owner metadata
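Layering hard-coded defaults under user-supplied settings and injecting owner metadata can be sketched as a simple config merge. The default values and the `owner` key below are illustrative, not our actual configuration:

```python
# Hypothetical defaults; the real values are site-specific.
DEFAULTS = {
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
}

def build_config(connector_class, owner, user_config):
    """Layer user-supplied settings over hard-coded defaults and inject
    owner metadata; user settings win on conflict."""
    config = dict(DEFAULTS)
    config.update(user_config)
    config["connector.class"] = connector_class
    config["owner"] = owner  # custom field, surfaced in alerts/dashboards
    return config
```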
22. API Wrapper Design
Requirements
● Adding new connectors should be very easy
● Adding new fields to connectors should be very easy
● Typos should be hard to make, ideally statically enforced by compiler
Design
● Redesigned 3 times
● Used Jackson Json Annotations for Json SerDe
● Very hierarchical OOP design
● Lots of tests...
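Our implementation is Java with Jackson annotations for JSON SerDe; as a rough sketch of the same "declare each field once so typos can't creep in" idea, here is a Python dataclass analog. The dotted keys it emits follow the Confluent S3 sink's config style (`s3.bucket.name`, `flush.size`):

```python
from dataclasses import dataclass, asdict

@dataclass
class S3SinkConfig:
    """Field names are declared exactly once, so a typo fails fast
    instead of silently producing an unknown config key."""
    topics: str
    s3_bucket_name: str
    flush_size: int = 1000

    def to_connect_config(self):
        # Map attribute names to Kafka Connect's dotted key style.
        return {k.replace("_", "."): v for k, v in asdict(self).items()}
```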
23. API Wrapper Lessons Learned
Pros 👍
● Greatly simplified the web UI React code
● Exposed simple API for other teams to script against
● Consolidated default configs in one place
● Totally abstracts Kafka Connect
Cons 👎
● Friction when installing new connectors
25. End-to-End Monitoring Requirements
● Monitor for
○ Delayed data
○ Duplicated data
○ Previously delivered data has disappeared
○ “Late arriving” data is not correctly delivered
○ Failures to publish data
● All ingresses and egresses are monitored 24/7
● Use user-level interfaces only
● ❗Very important❗: staging must have all the same monitoring as prod!!
27. Tracer Bullets Overview
● Send synthetic events (“tracer bullets”) to all ingress points
● Funnel all tracer bullets into tracer_bullet topic
● Create at least 1 of every kind of Sink Connector consuming from the tracer_bullet topic
● Create tracer bullets with logical timestamps
● Send current and “late arriving” data
● Persist metadata for tracer bullets in external system (i.e. Elasticsearch)
33. Tracer Bullet Payloads
Must include at least
● Unique id
● ❗Very important❗ LOGICAL timestamp (a.k.a event time)
34. Tracer Bullets Metadata
Tracer bullet metadata should include
● Timestamp sent
● Tracer bullet’s unique ID
● Ingress point
● Expected egress points
● Send status (SUCCESS, SEND_FAILED, etc.)
● # of duplicates found in sink
● Delivery timestamp (if delivered)
● Environment (dev/staging/prod)
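The fields above can be captured in one metadata record per bullet; the field and enum names in this sketch are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TracerBulletMetadata:
    bullet_id: str                # the tracer bullet's unique ID
    sent_at: str                  # wall-clock send timestamp
    ingress: str                  # e.g. "http_gateway"
    expected_egresses: List[str]  # e.g. ["s3", "elasticsearch"]
    environment: str              # dev / staging / prod
    send_status: str = "SUCCESS"  # or SEND_FAILED, ...
    duplicates_found: int = 0
    delivered_at: Optional[str] = None  # filled in by the polling job
```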
35. Tracer Bullet Polling
● Continuously queries every destination
○ Queries a sliding window of several hours
○ Queries the “late arriving” tracer bullets
● Check for duplicates
● Updates metadata when tracer bullets are delivered (or disappear!)
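The duplicate check reduces to counting bullet ids observed in a sink over the polling window:

```python
from collections import Counter

def find_duplicates(delivered_ids):
    """Given the tracer-bullet ids observed in a sink over the polling
    window, return the ids that appear more than once."""
    return sorted(i for i, n in Counter(delivered_ids).items() if n > 1)

assert find_duplicates(["a", "b", "a", "c"]) == ["a"]
```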
36. Tracer Bullet Alerting
● Poll the tracer bullet metadata and alert when:
○ Duplicates were found
○ Tracer bullets are delayed (according to your SLAs)
○ Previously delivered tracer bullets mysteriously disappear
○ Tracer bullets failed to send in the first place
● Alert if there is no tracer bullet metadata
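The alert conditions above can be evaluated per metadata record; the field names here follow a hypothetical metadata schema, and epoch-second timestamps are assumed for simplicity:

```python
def alerts_for(meta, now_epoch, sla_seconds):
    """Return the alert reasons for one tracer bullet's metadata record."""
    reasons = []
    if meta.get("duplicates_found", 0) > 0:
        reasons.append("duplicates")
    if meta.get("send_status") != "SUCCESS":
        reasons.append("send_failed")
    if meta.get("delivered_at") is None and now_epoch - meta["sent_at"] > sla_seconds:
        reasons.append("delayed")  # SLA-dependent
    if meta.get("disappeared"):
        reasons.append("previously_delivered_disappeared")
    return reasons
```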
38. Tracer Bullet Summary Job ❗Very Important❗
● Generate daily reports
○ Overall end-to-end latency
○ No. of tracer bullets sent per sink/source
● Indicates tracer bullets are functioning properly
39. Tracer Bullets Lessons Learned
● Make them extensible!
● Tracer bullets: just a big streaming application… write good tests!
● Dev environments are hard…
● Tracer bullets allowed for fewer integration tests
41. Pausing Connectors
● PUT /connectors/my-connector/pause
● What does pausing connectors actually do?
○ Pauses poll loop
● How to really stop a connector?
○ Delete and recreate
○ Shut down Kafka Connect
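The delete-and-recreate dance maps onto three REST calls; this sketch just plans them (snapshot the config first, since deleting discards it), rather than hitting a live worker:

```python
def stop_connector_plan(name):
    """Sketch the REST calls needed to *really* stop a connector:
    snapshot its config, delete it, and recreate it later from the
    saved config. (PUT .../pause only pauses the poll loop.)"""
    base = f"/connectors/{name}"
    return [
        ("GET", f"{base}/config"),  # save the config for later recreation
        ("DELETE", base),           # actually stops and deallocates tasks
        # later: PUT {base}/config with the saved body to recreate
    ]
```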
42. Jar Hell
[2018-02-18 10:28:33,290] INFO Loading plugin from: /home/ubuntu/kafka_2.11-1.0.0/plugins/lib (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:179)
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.collect.Sets$SetView.iterator()Lcom/google/common/collect/UnmodifiableIterator;
    at org.reflections.Reflections.expandSuperTypes(Reflections.java:380)
    at org.reflections.Reflections.<init>(Reflections.java:126)
    at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanPluginPath(DelegatingClassLoader.java:258)
    at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanUrlsAndAddPlugins(DelegatingClassLoader.java:201)
    at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:193)
    at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:153)
    at org.apache.kafka.connect.runtime.isolation.Plugins.<init>(Plugins.java:47)
    at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:70)
43. Jar Hell
[2018-02-18 10:28:33,290] INFO Loading plugin from: /home/ubuntu/kafka_2.11-1.0.0/plugins/lib
(org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:179)
Exception in thread "main" java.lang.NoSuchMethodError:
com.google.common.collect.Sets$SetView.iterator()Lcom/google/common/collect/UnmodifiableIterator;
at org.reflections.Reflections.expandSuperTypes(Reflections.java:380)
at org.reflections.Reflections.<init>(Reflections.java:126)
at
org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanPluginPath(DelegatingClassLoader.java:258)
at
org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanUrlsAndAddPlugins(DelegatingClassLoader.java:201)
at
org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:193)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:153)
at org.apache.kafka.connect.runtime.isolation.Plugins.<init>(Plugins.java:47)
at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:70)f
43
44. Jar Hell
● As of version 0.11.0.0 (CP 3.3.0), Kafka Connect has plugin.path
● USE PLUGIN.PATH!!!!
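With plugin.path, each connector's jars live under their own directory and get an isolated classloader, so dependency clashes like the Guava one above go away. A worker config fragment (the paths below are illustrative):

```properties
# connect-distributed.properties — isolate each connector's jars
# under its own directory instead of one shared classpath:
plugin.path=/opt/connect/plugins
#   /opt/connect/plugins/kafka-connect-s3/...
#   /opt/connect/plugins/kafka-connect-elasticsearch/...
```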
45. Schemas
● Migrating to schemas is really tough
● Use schemas from the beginning
● Some connectors require schemas
○ JDBC Sink Connector
46. Rebalancing
What is it
● Created/deleted connectors & added/removed workers trigger rebalances
Why is it bad
● API becomes unresponsive (responds with 409 status)
● Thundering rebalances caused instability in Elasticsearch
● S3 Connector + RegexFieldFilter transform caused big problems
● As of 2.3.0 (CP 5.3.0) rebalancing is much better!!! Upgrade, upgrade, upgrade!
○ Shout out to Konstantine Karantasis for Cooperative Incremental Rebalances
48. Conclusion
● Kafka Connect is awesome! 😄😄😄
● Metrics 📈
○ ❗Very Important❗: per topic message/sec❗AND❗ bytes/sec
● Write an admin service to tie together your whole system 🎁
● End-to-end monitoring 🔥🔥🔥
○ ❗Very Important❗: have monitoring in both prod and staging
○ ❗Very Important❗: generate daily summary reports
○ ❗Very Important❗: universal support for logical timestamps
● Upgrade early and upgrade often! 💯