Revealing the Power of Legacy Machine-Data
Oliver Lemp
ENGEL Austria GmbH
ENGEL: Injection Moulding Machines
▪ Austria’s largest machine manufacturer
▪ Market leader for injection moulding machines
▪ Machines to manufacture plastic products
Customer Proximity
Business Units
Automotive
Medical
Packaging
Technical Moulding
Teletronics
Setting the Scene
The ENGEL Customer Service
Customer
Report a problem at the machine
1st level support
Field Engineer
Problem cannot be solved immediately, send a field engineer
Repair & Maintenance
• Collect error reports
• Analyse & fix errors
Collect feedback
Analysis in the Past
Excel as the tool of choice
An Idea Popped Into Our Heads
The Use Case
Goal: Self-Service Tools
Customer
1st level support
Field Engineer
Repair & Maintenance
DIY error analysis
The Use Case
▪ Use data science to assist customer support
▪ Classified error documentation (symptoms, errors, solutions)
▪ Detect error patterns automatically / rule-based
▪ Reduce maintenance/repair times
▪ Predict future errors
▪ Detect / discover serial defects
▪ Generate sustainable knowledge
▪ Fast onboarding of new employees
▪ Focus on fixing the problems efficiently
▪ Creating data-driven solutions
Fault Discovery Assistance
Challenges of the Use Case
Starting situation
Challenges of the Use Case
▪ Zipped collection of (serialised) logfiles
▪ Snapshot of the machine’s parameters
▪ Last X errors on the machine
▪ Fault discovery & documentation
▪ Customer support can derive wrong settings from the reports
▪ Different data formats for different control generations
▪ Legacy data (no standard in chosen data formats)
▪ Collected since approx. 1990
▪ Ranging from simple text files to recursive archives
ENGEL Error Reports
Challenges of the Use Case
Recursive Archive Structure
▪ Logfiles (partially binary serialised)
▪ Memory Dumps
▪ Parameter Snapshots
▪ …
Issues
▪ 13 different timestamp formats (see the parsing sketch below)
▪ Different structure for each control generation
▪ Broken archives
▪ Missing files
▪ …
Report Structure
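To make the timestamp issue concrete, here is a minimal sketch of a tolerant parser that tries a list of candidate patterns in turn; the three patterns shown are assumptions for illustration, not the actual 13 formats found in the reports.

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

public class LegacyTimestampParser {
    // Candidate patterns are examples only; the real reports use 13 different formats.
    private static final List<DateTimeFormatter> FORMATS = List.of(
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"),
        DateTimeFormatter.ofPattern("dd.MM.yyyy HH:mm:ss"),
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss"));

    public static Optional<LocalDateTime> parse(String raw) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDateTime.parse(raw.trim(), format));
            } catch (DateTimeParseException ignored) {
                // try the next candidate format
            }
        }
        return Optional.empty(); // unknown format: keep the raw value for later inspection
    }
}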
Prototyping a Solution
▪ The Hortonworks stack looked promising
▪ Apache NiFi, Spark, Kafka and HDFS as our core components
▪ Starting with a small cluster
▪ 5x Raspberry Pis
▪ Establishing a production environment on dedicated hardware
▪ On-premises hosting
Hadoop seemed to be in fashion
Prototyping a Solution
Lambda Architecture on Hortonworks HDP
Upload report
Apache NiFi
Data ingestion & routing
Write meta attributes + file path to Kafka (message example below)
Apache Kafka
(Event) Stream
Process metadata
Process parameters
(Parquet)
Store raw data blob
HDFS Batch & Stream Processing
BI, web & mobile apps
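In the actual flow, NiFi publishes the metadata message to Kafka; the sketch below only illustrates the shape of the message the downstream Spark job expects. The topic name and broker address are assumptions, while fabNr and hdfs.filepath mirror the fields used later in the streaming job.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReportMetadataPublisher {
    public static void publish(String fabNr, String hdfsPath) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Meta attributes + file path as a small JSON message; the report itself stays in HDFS.
        String message = "{\"fabNr\":\"" + fabNr + "\",\"hdfs.filepath\":\"" + hdfsPath + "\"}";
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("error-reports", fabNr, message)); // assumed topic name
        }
    }
}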
New Difficulties Arise
▪ Maintaining separate streaming and batch jobs
▪ Kafka and large files
▪ Reading from multiple systems (Kafka + HDFS)
▪ Hadoop and small files
▪ Legacy binary deserialisation
▪ Pascal(!) JNA Wrapper
▪ Unpredictable (parameter) data
▪ Binaries with 200,000 and up to 3 million variables per error report
Non-standard use case
// Consume the metadata messages from Kafka, then load and process the
// referenced binary reports from HDFS within the same micro-batch.
JavaDStream<String> json = kafkaStream.map(ConsumerRecord::value);
json.foreachRDD(rdd -> {
    Dataset<Row> df = sparkSession.read().json(rdd);
    if (df.count() >= 1) {
        // Collect the HDFS paths referenced by this micro-batch
        List<String> hdfsPaths = df
            .select("`hdfs.filepath`")
            .javaRDD()
            .map(row -> row.getString(0))
            .collect();
        String joinedPaths = String.join(",", hdfsPaths);
        SampleProcessor sampleProcessor = new SampleProcessor();
        // Read the reports as binary files and deserialise the nested tar archives
        JavaRDD<String> binaries = javaSparkContext.binaryFiles(joinedPaths)
            .map(report -> new TarArchiveInputStream(report._2.open()))
            .map(sampleProcessor::call);
    } else {
        Log.info("No records in this batch");
    }
});
High complexity and a workaround for streaming large binaries
Partitioning System Variables
▪ Spark and Parquet files to store system variables
▪ No efficient and economical database was found
Flattening the tree structure
Timestamp,FabNr,VarName,Unit,IntValue,DoubleValue,StringValue,BoolValue
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.ai_Pressure,,,0.174087137,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.ai_Pressure_sim,,,,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.ai_Pressure_stat,,,,,false
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.do_AccuChargeMainPump,,,,,false
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.do_AccuInject,,,,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.do_AccuOff,,,,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.do_AccuSafety,,,,,false
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.er_AccuPressMin,,,,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.evAnaDisEn,,,,,
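As a rough sketch of the flattening step (ParameterNode is a hypothetical stand-in for the deserialised parameter tree), every leaf of the tree becomes one typed row of the flat schema shown above:

import java.util.ArrayList;
import java.util.List;

public class ParameterFlattener {

    public record FlatVariable(String timestamp, String fabNr, String varName,
                               String unit, Long intValue, Double doubleValue,
                               String stringValue, Boolean boolValue) {}

    public interface ParameterNode {        // hypothetical tree interface
        String name();
        List<ParameterNode> children();     // empty list for leaves
        Object value();                     // null for inner nodes
        String unit();
    }

    public static List<FlatVariable> flatten(ParameterNode root, String timestamp, String fabNr) {
        List<FlatVariable> out = new ArrayList<>();
        walk(root, "", timestamp, fabNr, out);
        return out;
    }

    private static void walk(ParameterNode node, String prefix, String ts, String fabNr,
                             List<FlatVariable> out) {
        // Build the dotted path, e.g. "AccuGeneral1.ai_Pressure"
        String path = prefix.isEmpty() ? node.name() : prefix + "." + node.name();
        if (node.children().isEmpty()) {
            Object v = node.value();
            out.add(new FlatVariable(ts, fabNr, path, node.unit(),
                    v instanceof Long l ? l : null,
                    v instanceof Double d ? d : null,
                    v instanceof String s ? s : null,
                    v instanceof Boolean b ? b : null));
        } else {
            for (ParameterNode child : node.children()) {
                walk(child, path, ts, fabNr, out);
            }
        }
    }
}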
Partitioning System Variables
▪ Ideally by function unit (the first part of the variable name)
▪ Grouped by machine components
▪ Custom hash-based partitioning (see the sketch after the query examples)
▪ Not time series data
▪ Very good for point lookups – not so much for regex
Optimising for point lookups
SELECT *
FROM variables
WHERE varName = "AccuGeneral1.ai_Pressure"
AND fabNr = "XXX"
SELECT *
FROM variables
WHERE varName = "AccuGeneral1.ai_Pressure"
Query Examples
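A minimal sketch of the hash-based layout on the Hortonworks side, assuming the flattened Parquet data shown earlier (paths, column names and bucket count are assumptions): the function unit is the prefix before the first dot of the variable name, hashed into a fixed number of buckets that becomes the Parquet partition column.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

SparkSession spark = SparkSession.builder().appName("partition-system-variables").getOrCreate();
Dataset<Row> variables = spark.read().parquet("/data/variables_flat"); // assumed input path

int numBuckets = 64; // assumed bucket count
Dataset<Row> bucketed = variables
    // function unit = prefix before the first dot, e.g. "AccuGeneral1"
    .withColumn("functionUnit", split(col("VarName"), "\\.").getItem(0))
    .withColumn("bucket", pmod(hash(col("functionUnit")), lit(numBuckets)));

bucketed.write()
    .partitionBy("bucket")
    .mode("overwrite")
    .parquet("/data/variables_partitioned"); // assumed output path

A point lookup first computes the same hash for the requested variable name and filters on the bucket column, which prunes most partitions; a regex search over variable names cannot do that, which is why this layout works well for point lookups and poorly for pattern matching.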
Issues in This Architecture
▪ Upgrades and Migrations
▪ Job Monitoring / Ganglia Metrics
▪ Repartitioning Job
▪ Merging Streaming and Batch Files
▪ Unpredictable errors in batch jobs
▪ Memory Bombs / Issues
▪ Spark 2.x has no binary file data source, so we had to work with RDDs
▪ This architecture feels like a big workaround
The Real Show Stoppers
WARN TaskSetManager: Lost task 53.0 in stage 49.0 (TID 32715,
XXXXXXXXXX):
ExecutorLostFailure (executor 23 exited caused by one of the running
tasks)
Reason: Container killed by YARN for exceeding memory limits. 12.4 GB of
12 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
Working Towards an Optimal Solution?
Discovering Azure & Databricks
Upload report
Azure Data Lake Storage
Auto Loader
Process metadata
Process parameters
(Parquet)
Store raw data blob
BI, web & mobile apps
Azure Cosmos DB
Working Towards an Optimal Solution?
▪ Equi-distant range partitioning
▪ Based on the lexical order of parameter names
▪ Roughly the same number of parameters per partition (see the sketch below)
▪ Allows searching for variables in the same component / root node
▪ Better data skipping
▪ OPTIMIZE instead of repartitioning jobs
Parameter Partitioning
A.xxx
B.xxx
C.xxx
D.xxx
E.xxx
F.xxx
Partition 1
Partition 2
….
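One way to express the equi-distant range partitioning, as a hedged sketch rather than the exact production job: sort the distinct parameter names lexically, cut them into roughly equally sized ranges with ntile, and write the range id as the partition column of a Delta table (paths and the partition count are assumptions).

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.*;

SparkSession spark = SparkSession.builder().appName("range-partition-parameters").getOrCreate();
Dataset<Row> variables = spark.read().parquet("/data/variables_flat"); // assumed input path

int numRanges = 32; // assumed number of ranges
// ntile over the globally sorted distinct names yields ranges with roughly the
// same number of parameters; variables of one component / root node stay adjacent.
Dataset<Row> nameToRange = variables
    .select("VarName").distinct()
    .withColumn("rangeId", ntile(numRanges).over(Window.orderBy(col("VarName"))));

variables.join(nameToRange, "VarName")
    .write()
    .partitionBy("rangeId")
    .mode("overwrite")
    .format("delta")
    .save("/delta/system_variables"); // assumed Delta table path; compacted later via OPTIMIZE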
Working Towards an Optimal Solution?
▪ Reduced Complexity
▪ A single configurable Spark job
▪ Kafka replaced by Auto Loader
▪ Unified batch & streaming (see the sketch below)
▪ ….
▪ Reduced memory pressure
▪ Micro-batches instead of full batches
▪ Stable Jobs
▪ Monitoring
▪ JVM / Ganglia Metrics
Azure & Databricks Benefits
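For illustration, a stripped-down Auto Loader ingestion of the raw report archives could look like the snippet below; the storage path, checkpoint location and output location are assumptions, and the real job replaces the stand-in Parquet append with the deserialisation and flattening logic described earlier.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

SparkSession spark = SparkSession.builder().getOrCreate();

Dataset<Row> reports = spark.readStream()
    .format("cloudFiles")                                  // Databricks Auto Loader
    .option("cloudFiles.format", "binaryFile")             // raw report archives as binary files
    .load("abfss://reports@<storage-account>.dfs.core.windows.net/incoming"); // assumed path

StreamingQuery query = reports.writeStream()
    .option("checkpointLocation", "/checkpoints/error-reports") // assumed checkpoint location
    .foreachBatch((Dataset<Row> batch, Long batchId) ->
        // stand-in for the real processing: append the raw micro-batch as Parquet
        batch.write().mode("append").parquet("/datalake/reports_raw")) // assumed output path
    .start();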
The Current State
Self-service tools instead of manual work
Currently
▪ Established self-service tools
▪ Send a PDF summary to the field engineers
▪ Steps that could solve the problem
▪ Things to also consider at the machine
In future
▪ Automatically detect and classify errors
Key Takeaways
▪ Don’t underestimate the effort to process legacy data
▪ The unknown in this data jungle is quite daunting
▪ Unforeseeable things will happen
▪ Change management in a traditional company is really demanding
▪ Running an unmanaged cluster without dedicated resources leads to pure frustration
▪ Moving to the cloud significantly reduced the complexity of our pipelines
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.