1. Real-time Data Integration with Apache Flink & Kafka @Bouygues Telecom
Mohamed Amine ABDESSEMED
Flink Forward, Berlin, October 2015
2. About Me
• Software engineer & solution architect @ Bouygues Telecom
• My daily toolbox:
– Hadoop ecosystem
– Apache Flink
– Apache Kafka
– Elasticsearch
– Apache Camel
– And more.
• If you don't see me coding, I'm probably outside running.
3. Outline
• Who we are
• Logged User eXperience
• Challenges
• Typical Data Flow pipeline on Hadoop
• Real-time Data Integration
• Apache Flink: The Elegant
• Data Integration use case
• What we loved using Flink
• What's Next?
10. Challenges
1. Data movement
2. Data processing:
– Data is generally too raw to be used directly.
– How can we transform it?
– How can we make the results available as soon as possible?
11. Typical Data Flow pipeline on Hadoop
[Diagram] Clients and systems (1…x) push data from sources (1…x) into Hadoop over FTP/HTTP, using tools such as Sqoop, Flume, and HDFS put. Raw data lands on the DFS; successive batch jobs then produce enriched data sets (V1…Vx) on the DFS, which are queried through Impala, Hive, and HBase.
13. Real-time Data Integration
• Inspired by LinkedIn's Kafka data integration design pattern.
• Take all the data and push it into a central log repository for real-time subscription.
What if the data is too raw to be used, even binary-encoded, with no visible business-logic information?
14. Real-time Data Integration
Solution 01: Process data before pushing it to Kafka.
• Not viable:
– Data sources have limited computation resources dedicated to log collection.
– Not scalable.
– Too hard to maintain.
We have to push it raw.
15. Real-time Data Integration
Solution 01 (rejected): Process data before pushing it to Kafka.
• Solution 02: The consumers/data subscribers process data before using it.
– Drawbacks:
• All consumers must implement the same business logic and run it against the same data.
• Any change in the processing logic requires updating all the consumers.
We must provide a usable data format to each consumer/data subscriber.
16. Real-time Data Integration
Solution 01 (rejected): Process data before pushing it to Kafka.
Solution 02 (rejected): The consumers/data subscribers process data before using it.
Solution 03: Process Kafka's raw data and push it back in decoded/enriched format for subscribers.
Benefits:
– The business logic is implemented in one place.
– Resource efficient.
– Data subscribers can focus on their own business logic.
– Simple handling of source/client evolution.
Challenges:
– Keep the data moving in real time.
– The data processing pipeline must be very fast (see the sketch below).
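The deck shows no code for Solution 03; below is a minimal sketch of the central enrichment job, assuming a current Flink release with the Kafka connector. The broker address, topic names, and the decodeAndEnrich stand-in are hypothetical:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RawToEnrichedJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Subscribe to the raw topic that sources push into untouched.
            KafkaSource<String> raw = KafkaSource.<String>builder()
                    .setBootstrapServers("broker:9092")            // hypothetical broker
                    .setTopics("events_raw")                       // hypothetical topic
                    .setGroupId("central-enricher")
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            // Publish the decoded/enriched records back to Kafka for all subscribers.
            KafkaSink<String> enriched = KafkaSink.<String>builder()
                    .setBootstrapServers("broker:9092")
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("events_enriched")           // hypothetical topic
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    .build();

            env.fromSource(raw, WatermarkStrategy.noWatermarks(), "raw-events")
               .map(RawToEnrichedJob::decodeAndEnrich)   // business logic lives here, once
               .sinkTo(enriched);

            env.execute("raw-to-enriched");
        }

        // Hypothetical stand-in for the real binary decoding + enrichment logic.
        private static String decodeAndEnrich(String rawRecord) {
            return rawRecord.trim();
        }
    }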
18. Real-time Data Integration with Apache Kafka and Spark?
• Started a POC on Spark Streaming.
• It didn't meet our needs:
– Poor backpressure handling: jobs kept failing with OOM errors during busy hours.
– Micro-batching & latency.
– Too many configuration parameters.
– The tested version used an HDFS WAL as its fault-tolerance mechanism, but this should be handled by Kafka.
19. Apache Flink: The Elegant
An open source platform for distributed stream and batch data processing.
• True streaming, no more micro-batching!
• Nice backpressure handling.
• Fault tolerant, exactly-once processing.
• High throughput.
• Scalable.
• Rich functional APIs.
• (Almost) no constraints on serialization.
• Control of parallelism at all execution levels.
• Flexibility and ease of extension.
• And more nice stuff.
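The deck lists the features without code; here is a minimal sketch of the functional DataStream API and the per-operator parallelism control, assuming a current Flink release (the host/port of the text source are placeholders):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FunctionalApiSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4);                             // job-wide default parallelism

            DataStream<String> lines = env.socketTextStream("localhost", 9999);

            lines.map(String::toUpperCase).setParallelism(8)   // per-operator override
                 .filter(s -> !s.isEmpty())
                 .print();

            env.execute("functional-api-sketch");
        }
    }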
20. IoT / Mobile Applications
[Diagram] Events occur on devices → events are stored in a log (queue/log) → events are analyzed in a data streaming system (stream analysis).
22. LUX: Logged User eXperience
• 4 billion raw events/day
• 700 GB/day (raw data, Snappy-compressed)
• 100 data sources
• 6 main data types
• 26 Kafka topics (6 raw, 20 enriched)
• Kafka cluster: 2 brokers
• CDH5 cluster: 20 data nodes, 750 TB
23. Mobile CDR use case
[Architecture diagram] Network equipment (clients 1…x) each generate a binary file every 5 min or 2 MB. The Planck-Collector pushes these files to the Kafka topic CDR_BIN (archived as BINARY CDR on HDFS). A Binary Decoder produces CDR_DECODED (DECODED CDR on HDFS); a Common CDR Enricher, with lookups against live reference data (REFERENCE DATA on HDFS), produces CDR_ENRICHED (ENRICHED CDR on HDFS, historical data, also consumed by other IT systems: commercial, ...). An Elasticsearch Formater produces CDR_ENRICHED_ELASTIC, indexed into Elasticsearch via K2ES/Logstash (2 weeks retention, KPI views); 15-min window counters feed alarms/live counters in Zabbix. Kafka mirroring links the Kafka and Hadoop clusters.
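The diagram's "15-min window counters" map naturally onto Flink's windowing API. A minimal sketch, assuming a Flink 1.x release, a hypothetical EnrichedCdr type with a getType() accessor, and an upstream DataStream<EnrichedCdr> named cdrs:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Count enriched CDRs per type over 15-minute tumbling windows; the resulting
    // counters can feed a monitoring sink (Zabbix in the diagram above).
    DataStream<Tuple2<String, Long>> counters = cdrs
            .map(cdr -> Tuple2.of(cdr.getType(), 1L))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))  // lambdas erase tuple types
            .keyBy(t -> t.f0)
            .window(TumblingProcessingTimeWindows.of(Time.minutes(15)))
            .sum(1);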
24. And it Rocks !
• We ran stress tests on our biggest raw Kafka topic:
– A day of data.
– 2 billion events (480 GiB compressed).
– 10 Kafka partitions.
– 10 Flink TaskManagers (only 1 GB of memory each).
[Charts] Enrich rate (tickets/second), total processing time (ms), Kafka I/O duration (ms).
25. And it Rocks !
• Same stress test, the results:
– ~500,000 events/sec enrich rate: one day of data processed in 1 hour.
– Less than 200 ms processing time.
(Sanity check: 2 billion events at ~500,000 events/sec ≈ 4,000 s ≈ 1.1 hours, consistent with the observed run.)
26. What we loved using Flink / Notable features
• Development cost.
• Ease of testing & development:
– Works exactly the way you expect it to work.
– Local execution mode (see the sketch below).
• No more OOM errors.
• Efficient resource management.
• Excellent performance even with limited resources.
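"Local execution mode" means Flink can run a whole job in-process, which is what makes pipelines easy to test. A minimal sketch, assuming a current Flink release (the input elements are placeholders):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LocalModeSketch {
        public static void main(String[] args) throws Exception {
            // Runs the full pipeline inside this JVM: no cluster, no YARN,
            // same runtime semantics as a distributed deployment.
            StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

            env.fromElements("ticket-1", "ticket-2", "ticket-3")
               .map(String::toUpperCase)
               .print();

            env.execute("local-mode-sketch");
        }
    }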
27. What we loved using Flink / Notable features
Many thanks, Data-Artisans!
• True streaming from different sources, including Kafka:
– Exactly-once, low-latency, high-throughput stream processing (see the sketch below).
• YARN mode features:
– yarn.maximum-failed-containers
– YARN detached mode
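The deck names exactly-once as a feature rather than showing code. A minimal sketch, assuming a current Flink release: the Kafka source stores its offsets in Flink's checkpoints, so enabling checkpointing is what makes the pipeline exactly-once (the placeholder pipeline below is hypothetical):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ExactlyOnceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Checkpoint every 5 s: on failure, Flink restores operator state and
            // rewinds the Kafka source to the offsets of the last completed
            // checkpoint, instead of relying on an external WAL (cf. the Spark POC).
            env.enableCheckpointing(5_000);
            // Hypothetical placeholder pipeline; in LUX this would be the
            // Kafka decode/enrich job sketched earlier.
            env.fromElements("cdr-1", "cdr-2").print();
            env.execute("exactly-once-sketch");
        }
    }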
28. What's Next?
• Connect LUX to new sources.
• Use JobManager high availability.
• Archive data on HDFS using the new filesystem sink (see the sketch below).
• Index Elasticsearch data using the new Elasticsearch sink.
• Flink ML.
• Contributions to the Flink project.
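The "new filesystem sink" at the time was Flink's RollingSink; in current releases the equivalent is FileSink. A minimal sketch under that assumption, with a hypothetical HDFS path and a DataStream<String> named enriched taken as given:

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;

    // Writes rolling part files to HDFS; files are finalized on checkpoints,
    // so the archive stays consistent with the exactly-once pipeline.
    FileSink<String> archive = FileSink
            .forRowFormat(new Path("hdfs:///lux/archive"),          // hypothetical path
                          new SimpleStringEncoder<String>("UTF-8"))
            .build();

    enriched.sinkTo(archive);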