A major cause of dissatisfaction among passengers is the irregularity of train schedules.
SNCF (the French National Railway Company) has deployed a network of beacons over its 32,000 km of train tracks, with each train passage triggering a flow of events. In this talk, we will present how we built real-time processing on this data to monitor traffic and map the propagation of train delays.
Building a Streaming Data Pipeline for Train Delay Processing
1. Building a Streaming Data Pipeline for Train Delay Processing with Spark and Delta Lake
BERGERE Alexandre & GHRIBI Kaoula
Cloud Architect & Data Engineer at SNCF
6. SNCF & ITNOVEM
DATA & AI FACTORY MISSIONS: accelerate innovation at SNCF
▪ Centralising the projects of the different entities
▪ Helping SNCF business lines to explore data to extract value
▪ Supporting teams from modelling to the industrialisation of the solution
8. BREHAT
Where does our data come from? What does it contain?
▪ BREHAT beacons: installed along the rails at national level, these beacons detect train passages.
▪ Why not use a GPS system?
  ▪ Historical infrastructure
  ▪ Network coverage area
  ▪ Underground stations
▪ Observations: a flow of events triggered at each train passage, containing the real and theoretical times of passage. The observations are either:
  ▪ generated by automatic tracking systems, or
  ▪ entered manually by traffic officers.
▪ Incidents: an abnormal situation that could disrupt rail operations.
10. BREHAT
PROJECT'S CHALLENGES
▪ Build real-time data processing to monitor traffic and map the propagation of train delays.
▪ Expose the output in the form best suited to each consumer: Power BI, REST API, or Delta files.
11. BREHAT
PROJECT EXAMPLES
▪ Denfert: model driving behaviour between two train stations
▪ Prev. retard: predict train delays
13. Architecture
[Architecture diagram] INGEST (MQ Series, NiFi, Event Hub) → STORE (ADLS Gen2) → PREP AND TRAIN (Azure Databricks) → MODEL & SERVE (Azure Databricks connector), exposing both real-time and on-demand information.
14. Why Delta Architecture?
STRUCTURED STREAMING + DELTA LAKE
• ACID Transactions & Full DML Support: Delta Lake supports standard DML, including UPDATE, DELETE, and MERGE INTO, giving developers more control over their big datasets.
• Unified Batch and Streaming Source and Sink: streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
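A minimal sketch of what these two properties look like in practice with PySpark on Databricks (the table paths and column names here are illustrative, not from the talk):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Full DML support: upsert a batch of corrected observations (MERGE INTO).
target = DeltaTable.forPath(spark, "/delta/observations")            # assumed path
updates = spark.read.format("delta").load("/delta/observations_fix") # assumed path

(target.alias("t")
 .merge(updates.alias("u"),
        "t.train_id = u.train_id AND t.beacon_id = u.beacon_id")     # assumed keys
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Unified batch and streaming: the very same table also serves as a
# streaming source, with a Delta sink at the other end.
(spark.readStream.format("delta").load("/delta/observations")
 .writeStream.format("delta")
 .option("checkpointLocation", "/checkpoints/observations")
 .start("/delta/observations_enriched"))
```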
15. Data processing
Streaming process
P1 - Data Quality Management
P2 - Keep the most recent version of events that satisfies a list of constraints
P3 - Enrichment
P4 - Joins & aggregations
[Pipeline diagram] Spark Streaming jobs chain Delta tables: the OBSERVATIONS and CIRCULATIONS sources are deduplicated (keep the most recent version) and GPS-enriched into an Observation Delta table (mode: update); a join updating origin and destination writes to the Silver table (mode: append); the Gold table keeps origin and destination and adds business enrichment (mode: update).
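A hedged sketch of step P2 (keep the most recent version) using foreachBatch with a Delta MERGE; the paths and the keys train_id, beacon_id, event_time are hypothetical:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

def upsert_latest(batch_df, batch_id):
    # Within each micro-batch, keep only the most recent event per passage.
    w = (Window.partitionBy("train_id", "beacon_id")
               .orderBy(F.col("event_time").desc()))
    latest = (batch_df.withColumn("rn", F.row_number().over(w))
                      .filter("rn = 1").drop("rn"))
    # Merge into the target so older versions never overwrite newer ones.
    (DeltaTable.forPath(spark, "/delta/observation").alias("s")  # assumed path
     .merge(latest.alias("l"),
            "s.train_id = l.train_id AND s.beacon_id = l.beacon_id")
     .whenMatchedUpdateAll(condition="l.event_time > s.event_time")
     .whenNotMatchedInsertAll()
     .execute())

(spark.readStream.format("delta").load("/delta/bronze/observations")
 .writeStream.foreachBatch(upsert_latest)
 .option("checkpointLocation", "/checkpoints/observation")
 .start())
```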
16. Data processing
Technical constraints
• Why should we separate streams?
Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DataFrame) are not yet supported on streaming Datasets.
• Why did we add a Silver table?
Only append output mode is supported after a join query.
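A hedged sketch of the resulting layout under those two constraints (paths and column names are assumptions): the stream-stream join is persisted to Silver in append mode, and a second, independent stream reads Silver back and maintains the aggregated Gold table, so no single query ever chains two stateful operations.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
observations = spark.readStream.format("delta").load("/delta/observation")
circulations = spark.readStream.format("delta").load("/delta/circulation")

# Stream 1: the join. Append is the only output mode allowed after a join,
# which is exactly what the Silver table provides.
(observations.join(circulations, "train_id")
 .writeStream.format("delta")
 .outputMode("append")
 .option("checkpointLocation", "/checkpoints/silver")
 .start("/delta/silver"))

# Stream 2: a separate query reads Silver back and aggregates into Gold,
# avoiding an unsupported chain of streaming aggregations.
(spark.readStream.format("delta").load("/delta/silver")
 .groupBy("origin", "destination")
 .agg(F.avg("delay_minutes").alias("avg_delay"))
 .writeStream.format("delta")
 .outputMode("complete")  # the Delta sink does not accept update mode directly
 .option("checkpointLocation", "/checkpoints/gold")
 .start("/delta/gold"))
```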
18. Exposure
Visualization – Azure Databricks connector
How do we connect Delta Lake to Power BI?
• Spark connector: slow and not optimized
• Simba driver: fast but complex to install
• Azure Databricks connector (September 2020): fast and native
20. Lessons Learned
Streaming process
• One stream = one cluster
• A dedicated cluster for the Power BI connection
-> Try virtual clusters & Photon in order to reduce cost & optimize latency
• Add time constraints to stream-to-stream joins
• Watermarks: without a watermark, state can grow until you run out of memory
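A minimal sketch of the last two points (the event-time columns, key names, and the 15-minute bound are assumptions): watermarks on both inputs plus an explicit time-range condition in the join let Spark discard old join state instead of buffering it indefinitely.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

obs = (spark.readStream.format("delta").load("/delta/observation")
       .withWatermark("obs_time", "10 minutes"))
circ = (spark.readStream.format("delta").load("/delta/circulation")
        .withWatermark("circ_time", "20 minutes"))

# The time constraint bounds how long each side is kept in state;
# without it (and the watermarks) the join buffers input forever.
joined = obs.join(
    circ,
    F.expr("""
        obs_train_id = circ_train_id AND
        circ_time >= obs_time AND
        circ_time <= obs_time + interval 15 minutes
    """))
```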
22. Databricks support
SNCF & Databricks
• Usage optimization
• Ability to scale through best practices and expertise
-> Make patterns and architectures usable for the whole company
-> Increase productivity and mitigate risks