A major cause of dissatisfaction among passengers is the irregularity of train schedules.
SNCF (the French National Railway Company) has deployed a network of beacons over its 32,000 km of train tracks, with each train passage triggering a flow of events. In this talk, we will present how we built real-time processing on this data to monitor traffic and map the propagation of train delays.
Building a Streaming Data Pipeline for Train Delay Processing
1. Building a Streaming Data Pipeline for Train Delay Processing with Spark and Delta Lake
BERGERE Alexandre & GHRIBI Kaoula
Cloud Architect & Data Engineer at SNCF
6. SNCF & ITNOVEM
DATA & AI FACTORY MISSIONS: accelerate innovation at SNCF
▪ Centralising the projects of the different entities
▪ Helping SNCF business lines to explore data to extract value
▪ Supporting teams from modelling to the industrialisation of the solution
8. BREHAT
Where does our data come from? What does it contain?
▪ BREHAT beacons: installed along the rails at national level, these beacons detect train passages.
▪ Why not use a GPS system?
  ▪ Historical infrastructure
  ▪ Network coverage area
  ▪ Underground stations
▪ Observations: a flow of events triggered at each train passage, containing the real and theoretical times of passage. The observations are either:
  ▪ generated by automatic tracking systems, or
  ▪ entered manually by traffic officers.
▪ Incidents: an abnormal situation that could disrupt rail operations.
10. BREHAT
PROJECT'S CHALLENGES
▪ Build real-time data processing to monitor traffic and map the propagation of train delays.
▪ Expose the output in the form best suited to each consumer: Power BI, REST API, or Delta files.
11. BREHAT
PROJECT EXAMPLES
▪ Denfert: model driving behaviour between two train stations
▪ Prev. retard: predict train delays
13. Architecture
[Architecture diagram] INGEST (MQ Series, NiFi, Event Hub) → STORE (ADLS Gen2) → PREP AND TRAIN (Azure Databricks) → MODEL & SERVE (Azure Databricks connector), exposing both real-time and on-demand information.
14. Why Delta Architecture?
STRUCTURED STREAMING + DELTA LAKE
• ACID Transactions & Full DML Support: Delta Lake supports standard DML, including UPDATE, DELETE, and MERGE INTO, giving developers more control over their big datasets.
• Unified Batch and Streaming Source and Sink: streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
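A minimal sketch of what these two properties look like in practice with PySpark on Databricks (the table paths and column names here are illustrative, not from the talk):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Full DML support: upsert a batch of corrected observations (MERGE INTO).
target = DeltaTable.forPath(spark, "/delta/observations")            # assumed path
updates = spark.read.format("delta").load("/delta/observations_fix") # assumed path

(target.alias("t")
 .merge(updates.alias("u"),
        "t.train_id = u.train_id AND t.beacon_id = u.beacon_id")     # assumed keys
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Unified batch and streaming: the very same table also serves as a
# streaming source, with a Delta sink at the other end.
(spark.readStream.format("delta").load("/delta/observations")
 .writeStream.format("delta")
 .option("checkpointLocation", "/checkpoints/observations")
 .start("/delta/observations_enriched"))
```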
15. Data processing
Streaming process
P1 - Data Quality Management
P2 - Keep the most recent version of events that satisfies a list of constraints
P3 - Enrichment
P4 - Joins & aggregations
[Pipeline diagram] Spark Streaming jobs chain Delta tables: the OBSERVATIONS and CIRCULATIONS sources are deduplicated (keep the most recent version) and GPS-enriched into an Observation Delta table (mode: update); a join updating origin and destination writes to the Silver table (mode: append); the Gold table keeps origin and destination and adds business enrichment (mode: update).
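A hedged sketch of step P2 (keep the most recent version) using foreachBatch with a Delta MERGE; the paths and the keys train_id, beacon_id, event_time are hypothetical:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

def upsert_latest(batch_df, batch_id):
    # Within each micro-batch, keep only the most recent event per passage.
    w = (Window.partitionBy("train_id", "beacon_id")
               .orderBy(F.col("event_time").desc()))
    latest = (batch_df.withColumn("rn", F.row_number().over(w))
                      .filter("rn = 1").drop("rn"))
    # Merge into the target so older versions never overwrite newer ones.
    (DeltaTable.forPath(spark, "/delta/observation").alias("s")  # assumed path
     .merge(latest.alias("l"),
            "s.train_id = l.train_id AND s.beacon_id = l.beacon_id")
     .whenMatchedUpdateAll(condition="l.event_time > s.event_time")
     .whenNotMatchedInsertAll()
     .execute())

(spark.readStream.format("delta").load("/delta/bronze/observations")
 .writeStream.foreachBatch(upsert_latest)
 .option("checkpointLocation", "/checkpoints/observation")
 .start())
```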
16. Data processing
Technical constraints
• Why should we separate streams?
Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DataFrame) are not yet supported on streaming Datasets.
• Why did we add a Silver table?
Only append output mode is supported after a join query.
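A hedged sketch of the resulting layout under those two constraints (paths and column names are assumptions): the stream-stream join is persisted to Silver in append mode, and a second, independent stream reads Silver back and maintains the aggregated Gold table, so no single query ever chains two stateful operations.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
observations = spark.readStream.format("delta").load("/delta/observation")
circulations = spark.readStream.format("delta").load("/delta/circulation")

# Stream 1: the join. Append is the only output mode allowed after a join,
# which is exactly what the Silver table provides.
(observations.join(circulations, "train_id")
 .writeStream.format("delta")
 .outputMode("append")
 .option("checkpointLocation", "/checkpoints/silver")
 .start("/delta/silver"))

# Stream 2: a separate query reads Silver back and aggregates into Gold,
# avoiding an unsupported chain of streaming aggregations.
(spark.readStream.format("delta").load("/delta/silver")
 .groupBy("origin", "destination")
 .agg(F.avg("delay_minutes").alias("avg_delay"))
 .writeStream.format("delta")
 .outputMode("complete")  # the Delta sink does not accept update mode directly
 .option("checkpointLocation", "/checkpoints/gold")
 .start("/delta/gold"))
```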
18. Exposure
Visualization – Azure Databricks connector
How do we connect Delta Lake to Power BI?
• Spark connector: slow and not optimized
• Simba driver: fast but complex to install
• Azure Databricks connector (September 2020): fast and native
20. Lessons Learned
Streaming process
• One stream = one cluster
• A dedicated cluster for the Power BI connection
-> Try virtual clusters & Photon in order to reduce cost & optimize latency
• Add time constraints to stream-to-stream joins
• Watermarks: without a watermark, state can grow until you run out of memory
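A minimal sketch of the last two points (the event-time columns, key names, and the 15-minute bound are assumptions): watermarks on both inputs plus an explicit time-range condition in the join let Spark discard old join state instead of buffering it indefinitely.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

obs = (spark.readStream.format("delta").load("/delta/observation")
       .withWatermark("obs_time", "10 minutes"))
circ = (spark.readStream.format("delta").load("/delta/circulation")
        .withWatermark("circ_time", "20 minutes"))

# The time constraint bounds how long each side is kept in state;
# without it (and the watermarks) the join buffers input forever.
joined = obs.join(
    circ,
    F.expr("""
        obs_train_id = circ_train_id AND
        circ_time >= obs_time AND
        circ_time <= obs_time + interval 15 minutes
    """))
```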
22. Databricks support
SNCF & Databricks
• Usage optimization
• Ability to scale through best practices and expertise
-> Make patterns and architectures usable for the whole company
-> Increase productivity and mitigate risks