1. Real-time Data Integration with Apache Flink & Kafka @Bouygues Telecom
Mohamed Amine ABDESSEMED
Flink Forward, Berlin, October 2015
2. About Me
• Software engineer & solution architect @ Bouygues Telecom
• My daily toolbox:
– Hadoop ecosystem
– Apache Flink
– Apache Kafka
– Elasticsearch
– Apache Camel
– And more.
• If you don't see me coding, I'm probably outside running.
3. Outline
• Who we are
• Logged User eXperience
• Challenges
• Typical Data Flow pipeline on Hadoop
• Real-time Data Integration
• Apache Flink: The Elegant
• Data Integration use case
• What we loved using Flink
• What's Next?
10. Challenges
1. Data movement
2. Data processing:
– Data is generally too raw to be used directly.
– How can we transform it?
– How can we make the results available as soon as possible?
11. Typical Data Flow pipeline on Hadoop
[Diagram] Clients and systems (1…x) push data from sources (1…x) into Hadoop over FTP/HTTP, using tools such as Sqoop, Flume, and HDFS put. Raw data lands on the DFS; successive batch jobs then produce enriched data sets (V1…Vx) on the DFS, which are queried through Impala, Hive, and HBase.
13. Real-time Data Integration
• Inspired by LinkedIn's Kafka data integration design pattern.
• Take all the data and push it into a central log repository for real-time subscription.
What if the data is too raw to be used, even binary-encoded, with no visible business-logic information?
14. Real-time Data Integration
Solution 01: Process data before pushing it to Kafka.
• Not viable:
– Data sources have limited computation resources dedicated to log collection.
– Not scalable.
– Too hard to maintain.
We have to push it raw.
15. Real-time Data Integration
Solution 01 (rejected): Process data before pushing it to Kafka.
• Solution 02: The consumers/data subscribers process data before using it.
– Drawbacks:
• All consumers must implement the same business logic and run it against the same data.
• Any change in the processing logic requires updating all the consumers.
We must provide a usable data format to each consumer/data subscriber.
16. Real-time Data Integration
Solution 01 (rejected): Process data before pushing it to Kafka.
Solution 02 (rejected): The consumers/data subscribers process data before using it.
Solution 03: Process Kafka's raw data and push it back in decoded/enriched format for subscribers.
Benefits:
– The business logic is implemented in one place.
– Resource efficient.
– Data subscribers can focus on their own business logic.
– Simple handling of source/client evolution.
Challenges:
– Keep the data moving in real time.
– The data processing pipeline must be very fast (see the sketch below).
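The deck shows no code for Solution 03; below is a minimal sketch of the central enrichment job, assuming a current Flink release with the Kafka connector. The broker address, topic names, and the decodeAndEnrich stand-in are hypothetical:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RawToEnrichedJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Subscribe to the raw topic that sources push into untouched.
            KafkaSource<String> raw = KafkaSource.<String>builder()
                    .setBootstrapServers("broker:9092")            // hypothetical broker
                    .setTopics("events_raw")                       // hypothetical topic
                    .setGroupId("central-enricher")
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            // Publish the decoded/enriched records back to Kafka for all subscribers.
            KafkaSink<String> enriched = KafkaSink.<String>builder()
                    .setBootstrapServers("broker:9092")
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("events_enriched")           // hypothetical topic
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    .build();

            env.fromSource(raw, WatermarkStrategy.noWatermarks(), "raw-events")
               .map(RawToEnrichedJob::decodeAndEnrich)   // business logic lives here, once
               .sinkTo(enriched);

            env.execute("raw-to-enriched");
        }

        // Hypothetical stand-in for the real binary decoding + enrichment logic.
        private static String decodeAndEnrich(String rawRecord) {
            return rawRecord.trim();
        }
    }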
18. Real-time Data Integration with Apache Kafka and Spark?
• Started a POC on Spark Streaming.
• It didn't meet our needs:
– Poor backpressure handling: jobs kept failing with OOM errors during busy hours.
– Micro-batching & latency.
– Too many configuration parameters.
– The tested version used an HDFS WAL as its fault-tolerance mechanism, but this should be handled by Kafka.
19. Apache Flink: The Elegant
An open source platform for distributed stream and batch data processing.
• True streaming, no more micro-batching!
• Nice backpressure handling.
• Fault tolerant, exactly-once processing.
• High throughput.
• Scalable.
• Rich functional APIs.
• (Almost) no constraints on serialization.
• Control of parallelism at all execution levels.
• Flexibility and ease of extension.
• And more nice stuff.
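The deck lists the features without code; here is a minimal sketch of the functional DataStream API and the per-operator parallelism control, assuming a current Flink release (the host/port of the text source are placeholders):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FunctionalApiSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4);                             // job-wide default parallelism

            DataStream<String> lines = env.socketTextStream("localhost", 9999);

            lines.map(String::toUpperCase).setParallelism(8)   // per-operator override
                 .filter(s -> !s.isEmpty())
                 .print();

            env.execute("functional-api-sketch");
        }
    }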
20. IoT / Mobile Applications
[Diagram] Events occur on devices → events are stored in a log (queue/log) → events are analyzed in a data streaming system (stream analysis).
22. LUX: Logged User eXperience
• 4 billion raw events/day
• 700 GB/day (raw data, Snappy-compressed)
• 100 data sources
• 6 main data types
• 26 Kafka topics (6 raw, 20 enriched)
• Kafka cluster: 2 brokers
• CDH5 cluster: 20 data nodes, 750 TB
23. Mobile CDR use case
[Architecture diagram] Network equipment (clients 1…x) each generate a binary file every 5 min or 2 MB. The Planck-Collector pushes these files to the Kafka topic CDR_BIN (archived as BINARY CDR on HDFS). A Binary Decoder produces CDR_DECODED (DECODED CDR on HDFS); a Common CDR Enricher, with lookups against live reference data (REFERENCE DATA on HDFS), produces CDR_ENRICHED (ENRICHED CDR on HDFS, historical data, also consumed by other IT systems: commercial, ...). An Elasticsearch Formater produces CDR_ENRICHED_ELASTIC, indexed into Elasticsearch via K2ES/Logstash (2 weeks retention, KPI views); 15-min window counters feed alarms/live counters in Zabbix. Kafka mirroring links the Kafka and Hadoop clusters.
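The diagram's "15-min window counters" map naturally onto Flink's windowing API. A minimal sketch, assuming a Flink 1.x release, a hypothetical EnrichedCdr type with a getType() accessor, and an upstream DataStream<EnrichedCdr> named cdrs:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Count enriched CDRs per type over 15-minute tumbling windows; the resulting
    // counters can feed a monitoring sink (Zabbix in the diagram above).
    DataStream<Tuple2<String, Long>> counters = cdrs
            .map(cdr -> Tuple2.of(cdr.getType(), 1L))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))  // lambdas erase tuple types
            .keyBy(t -> t.f0)
            .window(TumblingProcessingTimeWindows.of(Time.minutes(15)))
            .sum(1);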
24. And it Rocks !
• We ran stress tests on our biggest raw Kafka topic:
– A day of data.
– 2 billion events (480 GiB compressed).
– 10 Kafka partitions.
– 10 Flink TaskManagers (only 1 GB of memory each).
[Charts] Enrich rate (tickets/second), total processing time (ms), Kafka I/O duration (ms).
25. And it Rocks !
• Same stress test, the results:
– ~500,000 events/sec enrich rate: one day of data processed in 1 hour.
– Less than 200 ms processing time.
(Sanity check: 2 billion events at ~500,000 events/sec ≈ 4,000 s ≈ 1.1 hours, consistent with the observed run.)
26. What we loved using Flink / Notable features
• Development cost.
• Ease of testing & development:
– Works exactly the way you expect it to work.
– Local execution mode (see the sketch below).
• No more OOM errors.
• Efficient resource management.
• Excellent performance even with limited resources.
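"Local execution mode" means Flink can run a whole job in-process, which is what makes pipelines easy to test. A minimal sketch, assuming a current Flink release (the input elements are placeholders):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LocalModeSketch {
        public static void main(String[] args) throws Exception {
            // Runs the full pipeline inside this JVM: no cluster, no YARN,
            // same runtime semantics as a distributed deployment.
            StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

            env.fromElements("ticket-1", "ticket-2", "ticket-3")
               .map(String::toUpperCase)
               .print();

            env.execute("local-mode-sketch");
        }
    }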
27. What we loved using Flink / Notable features
Many thanks, Data-Artisans!
• True streaming from different sources, including Kafka:
– Exactly-once, low-latency, high-throughput stream processing (see the sketch below).
• YARN mode features:
– yarn.maximum-failed-containers
– YARN detached mode
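The deck names exactly-once as a feature rather than showing code. A minimal sketch, assuming a current Flink release: the Kafka source stores its offsets in Flink's checkpoints, so enabling checkpointing is what makes the pipeline exactly-once (the placeholder pipeline below is hypothetical):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ExactlyOnceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Checkpoint every 5 s: on failure, Flink restores operator state and
            // rewinds the Kafka source to the offsets of the last completed
            // checkpoint, instead of relying on an external WAL (cf. the Spark POC).
            env.enableCheckpointing(5_000);
            // Hypothetical placeholder pipeline; in LUX this would be the
            // Kafka decode/enrich job sketched earlier.
            env.fromElements("cdr-1", "cdr-2").print();
            env.execute("exactly-once-sketch");
        }
    }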
28. What's Next?
• Connect LUX to new sources.
• Use JobManager high availability.
• Archive data on HDFS using the new filesystem sink (see the sketch below).
• Index Elasticsearch data using the new Elasticsearch sink.
• Flink ML.
• Contributions to the Flink project.
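The "new filesystem sink" at the time was Flink's RollingSink; in current releases the equivalent is FileSink. A minimal sketch under that assumption, with a hypothetical HDFS path and a DataStream<String> named enriched taken as given:

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;

    // Writes rolling part files to HDFS; files are finalized on checkpoints,
    // so the archive stays consistent with the exactly-once pipeline.
    FileSink<String> archive = FileSink
            .forRowFormat(new Path("hdfs:///lux/archive"),          // hypothetical path
                          new SimpleStringEncoder<String>("UTF-8"))
            .build();

    enriched.sinkTo(archive);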