Building a Streaming Data Pipeline for Trains Delays Processing with Spark and Delta Lake
BERGERE Alexandre & GHRIBI Kaoula
Cloud Architect & Data Engineer at SNCF
SNCF & ITNOVEM
Alexandre BERGERE
Data Architect – ITNOVEM
Kaoula GHRIBI
Data Engineer – ITNOVEM
Agenda
▪ Introduction
  • Team ITNOVEM & SNCF
▪ Data and use case
▪ Data processing
▪ Data exposure
▪ Lessons learned
▪ What's next?
SNCF & ITNOVEM
270,000 employees
14,000,000 travellers every day worldwide
30,000 km of railway lines, including 2,600 km of high-speed lines
15,000 trains / day
€33.5 bn revenue (+4.2% increase)
Photo credit: © Maxime Huriez
Photo credit: © Matthieu Raffard
Photo credit: © Sébastien Godefroy
SNCF & ITNOVEM
DATA & IA FACTORY MISSIONS
▪ Centralising the projects of the different entities
▪ Accelerating innovation at SNCF
▪ Helping SNCF business lines explore data to extract value
▪ Supporting teams from modelling to the industrialisation of the solution
BREHAT : Trains Delays Processing
BREHAT
Where does our data come from? What does it contain?
▪ BREHAT beacons: installed along the rails nationwide; these beacons detect train passages.
▪ Why not use a GPS system?
  ▪ Historical infrastructure
  ▪ Network coverage area
  ▪ Underground stations
▪ Observations: a flow of events triggered at each train passage, containing the real and the theoretical time of passage. The observations are either:
  ▪ Generated by automatic tracking systems
  ▪ Entered manually by traffic officers
▪ Incidents: abnormal situations that could disrupt rail operations.
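To make the observation events concrete, here is a minimal Python sketch of one event and the delay derived from it. The slides only state that an observation carries a real and a theoretical time of passage; every field name and value below is a hypothetical illustration, not the actual BREHAT schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Observation:
    """One train passage at a beacon (all field names are assumptions)."""
    train_id: str
    beacon_id: str
    theoretical_time: datetime  # scheduled time of passage
    real_time: datetime         # observed time of passage
    source: str                 # "automatic" (tracking system) or "manual" (traffic officer)

    @property
    def delay(self) -> timedelta:
        # Positive when the train passes later than scheduled.
        return self.real_time - self.theoretical_time


# Hypothetical event: a train passing a beacon 7 minutes late.
obs = Observation(
    train_id="TRAIN-001",
    beacon_id="BEACON-042",
    theoretical_time=datetime(2021, 5, 27, 8, 30),
    real_time=datetime(2021, 5, 27, 8, 37),
    source="automatic",
)
delay_minutes = obs.delay.total_seconds() / 60
```

Here `delay_minutes` is the difference between the two timestamps expressed in minutes; a negative value would mean the train passed ahead of schedule.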
BREHAT
[Figure: clean and enriched BREHAT beacon data]
BREHAT
PROJECT'S CHALLENGES
▪ Build real-time data processing to monitor traffic and map the propagation of train delays.
▪ Expose the output in the way that best suits each consumer: Power BI, REST API or Delta files.
BREHAT
PROJECT EXAMPLES
▪ Denfert: model driving behaviour between two train stations
▪ Prev. retard: predict train delays
Architecture & Data Processing
Architecture
[Architecture diagram. Stages: INGEST, STORE, PREP AND TRAIN, MODEL & SERVE. Components: MQ Series, NiFi, Event Hub, ADLS Gen2, Azure Databricks, connector. Outputs: real-time information, on-demand information.]
Why Delta Architecture?
Structured Streaming + Delta Lake
• ACID transactions & full DML support: Delta Lake supports standard DML, including UPDATE, DELETE and MERGE INTO, giving developers more control over their big datasets.
• Unified batch and streaming source and sink: streaming data ingest, batch historic backfill and interactive queries all work out of the box.
Data processing
Streaming process
P1 - Data quality management
P2 - Keep the most recent version of each event that satisfies a list of constraints
P3 - Enrichment
P4 - Joins & aggregations
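Steps P1 and P2 can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not the Spark job from the talk: the constraints in `is_valid`, the `updated_at` version field and the `(train_id, beacon_id)` key are all assumptions made for illustration. In the pipeline itself, an upsert of this kind would presumably rely on Delta Lake's MERGE INTO, which the slides mention.

```python
from datetime import datetime


def is_valid(event: dict) -> bool:
    """P1 (assumed constraints): required fields must be present."""
    return all(event.get(f) is not None for f in ("train_id", "beacon_id", "real_time"))


def keep_latest(events: list) -> dict:
    """P2: per (train_id, beacon_id), keep the valid event with the newest version."""
    latest = {}
    for event in events:
        if not is_valid(event):
            continue  # rejected by data quality management
        key = (event["train_id"], event["beacon_id"])
        current = latest.get(key)
        if current is None or event["updated_at"] > current["updated_at"]:
            latest[key] = event
    return latest


# Hypothetical input: two versions of the same passage plus one invalid event.
events = [
    {"train_id": "T1", "beacon_id": "B1", "real_time": datetime(2021, 5, 27, 8, 37),
     "updated_at": datetime(2021, 5, 27, 8, 38)},
    {"train_id": "T1", "beacon_id": "B1", "real_time": datetime(2021, 5, 27, 8, 36),
     "updated_at": datetime(2021, 5, 27, 8, 40)},  # later correction of the same passage
    {"train_id": "T2", "beacon_id": None, "real_time": datetime(2021, 5, 27, 9, 0),
     "updated_at": datetime(2021, 5, 27, 9, 1)},   # missing beacon: fails P1
]
latest = keep_latest(events)
```

Only the corrected version of the `("T1", "B1")` passage survives; the event without a beacon is dropped before the version comparison.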
[Pipeline diagram. Labels: Source (Observation); Spark Streaming job (keep most recent version + GPS enrichment); Silver Delta tables OBSERVATIONS and CIRCULATIONS (Mode: Append, Mode: Update); Gold Delta table (keep origin & destination + business enrichment; update origin & destination; Mode: Update).]
Data processing
Technical constraints
• Why should we separate the streams?
  Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DataFrame) are not yet supported on streaming Datasets.
• Why did we add a Silver table?
  Only append mode is supported after a join query.
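The two constraints above suggest splitting the work into two stages with an intermediate table between them. The following plain-Python toy model shows that shape only; it is not Spark code, and the function names, field names and the "maximum delay per train" aggregation are assumptions chosen for illustration.

```python
def join_stage(observations: list, circulations: dict) -> list:
    """Stage 1 (toy): join observations with circulations; rows are only appended."""
    silver_rows = []
    for obs in observations:
        circulation = circulations.get(obs["train_id"])
        if circulation is not None:
            silver_rows.append({**obs, **circulation})
    return silver_rows


def aggregate_stage(silver_rows: list, gold: dict) -> dict:
    """Stage 2 (toy): aggregate the Silver rows, updating the Gold result in place."""
    for row in silver_rows:
        train = row["train_id"]
        gold[train] = max(gold.get(train, 0), row["delay_min"])
    return gold


# Hypothetical data: delays in minutes per passage, and one known circulation.
observations = [
    {"train_id": "T1", "delay_min": 3},
    {"train_id": "T1", "delay_min": 7},
    {"train_id": "T9", "delay_min": 12},  # no matching circulation: dropped by the join
]
circulations = {"T1": {"origin": "Paris", "destination": "Lyon"}}

silver = join_stage(observations, circulations)  # appended rows only
gold = aggregate_stage(silver, {})               # updated aggregates
```

Because the join output is append-only and the aggregation runs as a separate step over the intermediate rows, neither stage needs to chain an aggregation after a join within a single query.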
Demo
Structured Streaming + Delta Lake
Exposure
Visualization – Azure Databricks connector
How to connect Delta Lake to Power BI?
• Spark: slow & not optimized
• Simba: fast but complex to install
• Azure Databricks (September 2020): fast & native
Lessons Learned
Streaming process
• One stream = one cluster
• A dedicated cluster for the Power BI connection
-> Try Virtual cluster & Photon to reduce cost and optimize latency
• Add time constraints to stream-stream joins
• Watermarks: without a watermark, state keeps growing, potentially to the point where you run out of memory
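The watermark lesson can be illustrated with a small plain-Python model of bounded state. It is a simplified sketch, not Spark's implementation: the class name, the eviction rule and the 10-minute lateness threshold are assumptions, and the watermark is taken as the maximum event time seen minus the allowed lateness.

```python
from datetime import datetime, timedelta


class WatermarkedState:
    """Toy keyed state that evicts entries older than the watermark."""

    def __init__(self, max_lateness: timedelta):
        self.max_lateness = max_lateness
        self.max_event_time = None
        self.state = {}  # key -> event time

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_lateness

    def add(self, key: str, event_time: datetime) -> bool:
        """Store an event; return False if it is older than the watermark."""
        if self.watermark is not None and event_time < self.watermark:
            return False  # too late: dropped instead of growing the state
        self.state[key] = event_time
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Evict everything that fell behind the (possibly advanced) watermark.
        self.state = {k: t for k, t in self.state.items() if t >= self.watermark}
        return True


buf = WatermarkedState(max_lateness=timedelta(minutes=10))
buf.add("a", datetime(2021, 5, 27, 8, 0))
buf.add("b", datetime(2021, 5, 27, 8, 5))
buf.add("c", datetime(2021, 5, 27, 8, 20))  # watermark moves to 08:10, evicting a and b
accepted_late = buf.add("d", datetime(2021, 5, 27, 8, 0))  # older than the watermark
```

Without the eviction step, `state` would keep one entry per key forever, which is the unbounded growth the lesson warns about.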
What's next?
Databricks support
SNCF & Databricks
• Usage optimization
• Ability to scale through best practices and expertise
-> Make patterns and architectures usable across the whole company
-> Increase productivity and mitigate risks
Next steps
• EXPOSURE: build an API to increase project adoption
• BUSINESS DEVELOPMENT: increase project development & data usage

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 

Recently uploaded (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 

Building a Streaming Data Pipeline for Trains Delays Processing

  • 1. Building a Streaming Data Pipeline for Trains Delays Processing with Spark and Delta Lake. BERGERE Alexandre & GHRIBI Kaoula, Cloud Architect & Data Engineer at SNCF
  • 2. SNCF & ITNOVEM: Alexandre BERGERE, Data Architect – ITNOVEM; Kaoula GHRIBI, Data Engineer – ITNOVEM
  • 3. Agenda: Introduction (Team ITNOVEM & SNCF); Data and Use case; Data Processing; Data Exposure; Lessons Learned; What’s next?
  • 5. SNCF & ITNOVEM: 270,000 employees; 14,000,000 travellers every day worldwide; 30,000 km of railway lines, including 2,600 km for high-speed trains; 15,000 trains / day; €33.5 bn revenue, +4.2% increase. Photo credits: © Maxime Huriez, © Matthieu Raffard, © Sébastien Godefroy
  • 6. SNCF & ITNOVEM, DATA & IA FACTORY MISSIONS: centralising the projects of the different entities; accelerating innovation at SNCF; helping SNCF business lines to explore data to extract value; supporting teams from modeling to the industrialisation of the solution
  • 7. BREHAT : Trains Delays Processing
  • 8. BREHAT: Where does our data come from? What does it contain? ▪ BREHAT beacons: installed along the rails (at national level); these beacons detect train passages. ▪ Why not use a GPS system? Historical infrastructure; network coverage area; underground stations. ▪ Observations: a flow of events triggered at each train passage, containing the real and theoretical time of passage. Observations are either generated by automatic tracking systems or entered by traffic officers. ▪ Incidents: an abnormal situation that could disrupt rail operations.
  • 10. BREHAT, PROJECT’S CHALLENGES ▪ Build a real-time data processing pipeline to monitor traffic and map the propagation of train delays. ▪ Expose the output in the best way depending on the consumer: Power BI, REST API or Delta files.
  • 11. BREHAT, PROJECT’S EXAMPLE ▪ Denfert: model driving behaviour between two train stations ▪ Prev. retard: predict train delays
  • 12. Architecture & Data Processing
  • 13. Architecture (diagram). Stages: INGEST, STORE, PREP AND TRAIN, MODEL & SERVE. Components: NiFi, MQ Series, EventHub, ADLS Gen2, Azure Databricks connector. Flows: real-time information, on-demand information.
  • 14. Why Delta Architecture? Structured Streaming + Delta Lake • ACID Transactions & Full DML Support: Delta Lake supports standard DML including UPDATE, DELETE and MERGE INTO, giving developers more control to manage their big datasets • Unified Batch and Streaming Source and Sink: streaming data ingest, batch historic backfill, and interactive queries all just work out of the box
  • 15. Data processing: streaming process. P1: Data Quality Management. P2: Keep most recent version of events that satisfies a list of constraints. P3: Enrichment. P4: Join & Aggregations. Diagram labels: Spark Streaming job; Observation Delta table (source); OBSERVATIONS and CIRCULATIONS streams; keep most recent version + GPS enrichment; keep origin & destination + business enrichment; update origin & destination; Silver and Gold tables; write modes Append and Update.
  • 16. Data processing: technical constraints • Why should we separate streams? Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DataFrame) are not yet supported on streaming Datasets. • Why did we add a Silver table? Only append mode is supported after a join query.
  • 18. Exposure: Visualization – Azure Databricks connector. How to connect Delta Lake to Power BI? • Spark: slow & not optimized • Simba: fast but complex to install • Azure Databricks (September 2020): fast & native
  • 19. Exposure Visualization – Azure Databricks connector
  • 20. Lessons Learned: streaming process • One stream = one cluster • A dedicated cluster for the Power BI connection -> Try Virtual cluster & Photon in order to reduce cost & optimize latency • Add time constraints to stream-to-stream joins • Watermarks: without a watermark, state keeps growing, potentially to the point where you run out of memory
  • 22. Databricks support: SNCF & Databricks • Usage optimization • Ability to scale through best practice and expertise -> Make patterns and architectures usable for the whole company -> Increase productivity and mitigate risks
  • 23. Next steps: exposure and business development. Increase project’s development & data usage; build API to increase project’s adoption
  • 24. Databricks simplifies data and AI so data teams can innovate faster
  • 25. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
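
Slides 15 and 20 describe keeping the most recent version of each event and bounding streaming state with a watermark. The plain-Python sketch below simulates that idea outside Spark, to make the state-growth argument concrete. It is not the authors' job: the `Observation` fields (`train_id`, `beacon_id`, `version`, `delay_s`), the class name `LatestVersionState` and the 10-minute lateness used in the usage note are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Observation:
    # All field names are hypothetical; the slides do not give the schema.
    train_id: str
    beacon_id: str
    event_time: datetime  # time of the train passage
    version: int          # higher value = more recent correction
    delay_s: int          # real minus theoretical passage time, in seconds


class LatestVersionState:
    """Keeps the latest version per (train, beacon) and evicts old keys."""

    def __init__(self, max_lateness: timedelta) -> None:
        self.max_lateness = max_lateness
        self.max_event_time: Optional[datetime] = None
        self.state: Dict[Tuple[str, str], Observation] = {}

    @property
    def watermark(self) -> Optional[datetime]:
        # Watermark = latest event time seen so far minus the allowed lateness.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_lateness

    def process_batch(self, batch: Iterable[Observation]) -> List[Observation]:
        """Apply one micro-batch; return the rows inserted or updated."""
        rows = list(batch)
        watermark = self.watermark  # computed from earlier batches only
        changed: Dict[Tuple[str, str], Observation] = {}
        for obs in rows:
            if watermark is not None and obs.event_time < watermark:
                continue  # too late: ignored instead of reopening old state
            key = (obs.train_id, obs.beacon_id)
            current = self.state.get(key)
            if current is None or obs.version > current.version:
                self.state[key] = obs
                changed[key] = obs
        # Advance the watermark, then drop state older than it.
        batch_max = max((o.event_time for o in rows), default=None)
        if batch_max is not None and (
            self.max_event_time is None or batch_max > self.max_event_time
        ):
            self.max_event_time = batch_max
        watermark = self.watermark
        if watermark is not None:
            self.state = {
                k: v for k, v in self.state.items() if v.event_time >= watermark
            }
        return list(changed.values())
```

With `LatestVersionState(timedelta(minutes=10))`, two versions of the same passage in one batch yield only the higher version, and a later batch whose event times are 30 minutes ahead evicts the earlier key. Without the eviction step, `state` would hold one entry per (train, beacon) pair ever seen, which is the memory growth slide 20 warns about. In Structured Streaming the corresponding building blocks are `withWatermark` on the streaming DataFrame and, for writing updates into a Delta table, `MERGE INTO`.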