ENGEL, founded in 1945, is now the world's leading manufacturer of injection moulding machines. Since then, and especially in the current era, the amount of data has grown immensely and has become more and more heterogeneous with each new generation of machine controls. A closer look at the conglomeration of every machine's log files reveals 13 different timestamp formats, different archive types and further peculiarities of each control generation. Unsurprisingly, this has made automatically processing and analysing the data difficult.
4. Setting the Scene
The ENGEL Customer Service
Workflow: the customer reports a problem at the machine to 1st level support; if the problem cannot be solved immediately, a field engineer is sent out for repair & maintenance (collect error reports, analyse & fix errors); feedback is collected afterwards.
7. The Use Case
Goal: Self Service Tools
Workflow (target state): the customer performs DIY error analysis via self-service tools before involving 1st level support, field engineers and repair & maintenance.
8. The Use Case
▪ Use data science to assist customer support
▪ Classified error documentation (symptoms, errors, solutions)
▪ Detect error patterns automatically / rule-based
▪ Reduce maintenance/repair times
▪ Predict future errors
▪ Detect / discover serial defects
▪ Generate sustainable knowledge
▪ Fast onboarding of new employees
▪ Focus on fixing the problems efficiently
▪ Create data-driven solutions
Fault Discovery Assistance
10. Challenges of the Use Case
▪ Zipped collection of (serialised) logfiles
  ▪ Snapshot of the machine's parameters
  ▪ Last X errors on the machine
▪ Fault discovery & documentation
  ▪ Customer support can derive wrong settings from the reports
▪ Different data formats for different control generations
▪ Legacy data (no standard in chosen data formats)
  ▪ Collected since approx. 1990
  ▪ Ranging from simple text files to recursive archives
ENGEL Error Reports
11. Challenges of the Use Case
Recursive Archive Structure
▪ Logfiles (partially binary serialised)
▪ Memory Dumps
▪ Parameter Snapshots
▪ …
Issues
▪ 13 different timestamp formats
▪ Different structure for each control generation
▪ Broken archives
▪ Missing files
▪ …
Report Structure
12. Prototyping a Solution
▪ The Hortonworks stack was promising
  ▪ Apache NiFi, Spark, Kafka and HDFS as our core components
▪ Starting with a small cluster
  ▪ 5x Raspberry Pis
▪ Establishing a production environment on dedicated hardware
  ▪ On-premises hosting
Hadoop seemed to be in fashion
13. Prototyping a Solution
Lambda Architecture on Hortonworks HDP
Pipeline: Upload report → Apache NiFi (data ingestion & routing) writes meta attributes + file path to Kafka → Apache Kafka (event stream) → HDFS batch & stream processing (process metadata, process parameters into Parquet, store the raw data blob) → BI, web & mobile apps.
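To make the hand-off concrete, here is a minimal sketch of the kind of metadata message the ingestion layer publishes to Kafka. The broker address, topic name and all JSON fields except "hdfs.filepath" (which the Spark consumer on the next slide selects) are illustrative assumptions, not the actual NiFi configuration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092"); // broker address is an assumption
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// One small JSON message per uploaded report: meta attributes plus the
// HDFS path of the raw blob, so only metadata flows through Kafka.
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    String value = "{\"hdfs.filepath\": \"/reports/raw/report-123.tar\", \"fabNr\": \"200153\"}";
    producer.send(new ProducerRecord<>("error-reports", value)); // topic name is an assumption
}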
14. New Difficulties Arise
▪ Maintaining streaming + batch jobs
  ▪ Kafka and large files
  ▪ Reading from multiple systems (Kafka + HDFS)
  ▪ Hadoop and small files
▪ Legacy binary deserialisation
  ▪ Pascal(!) JNA wrapper
▪ Unpredictable (parameter) data
  ▪ Binaries with 200,000 and up to 3 million variables per error report
Non-standard use case
import java.util.List;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.streaming.api.java.JavaDStream;

// Extract the JSON payload from each Kafka record.
JavaDStream<String> json = kafkaStream.map(ConsumerRecord::value);
json.foreachRDD(rdd -> {
    Dataset<Row> df = sparkSession.read().json(rdd);
    if (df.count() >= 1) {
        // Collect the HDFS paths referenced by the metadata messages;
        // the backticks escape the dot in the column name.
        List<String> hdfsPaths = df
                .select("`hdfs.filepath`")
                .javaRDD()
                .map(row -> row.getString(0))
                .collect();
        String joinedPaths = String.join(",", hdfsPaths);
        SampleProcessor sampleProcessor = new SampleProcessor();
        // Load the raw report archives from HDFS and deserialise them.
        JavaRDD<String> binaries = javaSparkContext.binaryFiles(joinedPaths)
                .map(report -> new TarArchiveInputStream(report._2.open()))
                .map(sampleProcessor::call);
    } else {
        Log.info("No records in this batch");
    }
});
High complexity and workaround for streaming large binaries
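The Pascal JNA wrapper mentioned above bound the legacy deserialisation routine into the JVM. A minimal sketch of what such a binding can look like; the library name, function name and signature are all hypothetical, assuming the Pascal library exports a C-compatible function:

import com.sun.jna.Library;
import com.sun.jna.Native;

// Hypothetical JNA binding to the legacy Pascal deserialisation library;
// library name, function name and signature are illustrative assumptions.
public interface LegacyDeserialiser extends Library {
    LegacyDeserialiser INSTANCE =
            Native.load("legacydeserialiser", LegacyDeserialiser.class);

    // Deserialises a raw report buffer into the caller-supplied output
    // buffer and returns the number of bytes written.
    int deserialiseReport(byte[] input, int inputLen, byte[] output, int outputLen);
}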
15. Partitioning System Variables
▪ Spark and Parquet files to store system variables
▪ No efficient and economical database was found
Flattening the tree structure
Timestamp,FabNr,VarName,Unit,IntValue,DoubleValue,StringValue,BoolValue
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.ai_Pressure,,,0.174087137,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.ai_Pressure_sim,,,,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.ai_Pressure_stat,,,,,false
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.do_AccuChargeMainPump,,,,,false
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.do_AccuInject,,,,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.do_AccuOff,,,,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.do_AccuSafety,,,,,false
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.er_AccuPressMin,,,,,
2020-01-23T14:14:57.000+01:00,200153,AccuGeneral1.evAnaDisEn,,,,,
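A minimal sketch of the flattening step that produces rows like the ones above; the ParameterNode tree type is a hypothetical stand-in for the deserialised parameter structure:

import java.util.ArrayList;
import java.util.List;

// Hypothetical node of the deserialised parameter tree; the real
// structure coming out of the report is an assumption here.
class ParameterNode {
    String name;
    Object value; // set on leaves only
    List<ParameterNode> children = new ArrayList<>();
}

class TreeFlattener {
    // Walk the tree depth-first and emit one (VarName, value) pair per
    // leaf, joining node names into the dotted path used above,
    // e.g. "AccuGeneral1.ai_Pressure".
    static void flatten(ParameterNode node, String prefix, List<String[]> out) {
        String path = prefix.isEmpty() ? node.name : prefix + "." + node.name;
        if (node.children.isEmpty()) {
            out.add(new String[] { path, String.valueOf(node.value) });
        } else {
            for (ParameterNode child : node.children) {
                flatten(child, path, out);
            }
        }
    }
}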
16. Partitioning System Variables
▪ Ideally partitioned by function unit (the first segment of the variable name)
  ▪ Grouped by machine components
▪ Custom hash-based partitioning (see the sketch after this list)
▪ Not time series data
▪ Very good for point lookups – not so much for regex searches
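A minimal sketch of the hash-based layout, assuming a "variables" DataFrame with the flattened schema shown earlier; the bucket count and output path are illustrative assumptions:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.hash;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.pmod;
import static org.apache.spark.sql.functions.split;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Derive the function unit from the first segment of the variable name,
// e.g. "AccuGeneral1" from "AccuGeneral1.ai_Pressure".
Dataset<Row> withUnit = variables.withColumn(
        "functionUnit", split(col("VarName"), "\\.").getItem(0));

// Hash each function unit into a fixed number of buckets so that all
// variables of one unit land in the same partition; a point lookup can
// then compute the bucket and skip everything else.
Dataset<Row> bucketed = withUnit.withColumn(
        "bucket", pmod(hash(col("functionUnit")), lit(64)));

bucketed.write().partitionBy("bucket").parquet("/data/variables");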
Optimising for point lookups
-- Point lookup for a single variable on a specific machine
SELECT *
FROM variables
WHERE varName = "AccuGeneral1.ai_Pressure"
  AND fabNr = "XXX"

-- Point lookup for the same variable across all machines
SELECT *
FROM variables
WHERE varName = "AccuGeneral1.ai_Pressure"
Query Examples
17. Issues in This Architecture
▪ Upgrades and Migrations
▪ Job Monitoring / Ganglia Metrics
▪ Repartitioning Job
▪ Merging Streaming and Batch Files
▪ Unpredictable errors in batch jobs
▪ Memory Bombs / Issues
▪ Spark 2.x has no binary file support => working with RDDs
▪ This architecture feels like a big workaround
The Real Show Stoppers
WARN TaskSetManager: Lost task 53.0 in stage 49.0 (TID 32715, XXXXXXXXXX):
ExecutorLostFailure (executor 23 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits.
12.4 GB of 12 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
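As the message suggests, one mitigation is to raise the off-heap overhead allowance per executor. A minimal sketch using the Spark 2.x property name from the log; the 4096 MB value and the app name are illustrative assumptions, not recommendations:

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        .setAppName("error-report-processing")
        // Extra off-heap headroom per executor (in MB) so YARN does not
        // kill the container when native allocations exceed the heap.
        .set("spark.yarn.executor.memoryOverhead", "4096");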
18. Working Towards an Optimal Solution?
Discovering Azure & Databricks
Pipeline: Upload report → Azure Data Lake Storage → Autoloader → process metadata, process parameters (Parquet), store the raw data blob → Azure Cosmos DB → BI, web & mobile apps.
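A minimal Auto Loader sketch for the ingestion step, assuming a SparkSession "spark" and reports arriving as files in an ADLS container; the storage account, paths and checkpoint location are illustrative assumptions:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Auto Loader ("cloudFiles") incrementally picks up new report files as
// they land in the lake, replacing the NiFi + Kafka hand-off.
Dataset<Row> reports = spark.readStream()
        .format("cloudFiles")
        .option("cloudFiles.format", "binaryFile") // ingest raw report archives
        .load("abfss://reports@<storage-account>.dfs.core.windows.net/incoming");

// Micro-batch the raw blobs into a Delta table.
reports.writeStream()
        .option("checkpointLocation", "/checkpoints/raw-reports")
        .start("/delta/raw_reports");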
19. Working Towards an Optimal Solution?
▪ Equi-distant range partitioning
  ▪ Based on the lexical order of parameter names
  ▪ Roughly the same number of parameters per partition
▪ Allows searching for variables in the same component / root node
▪ Better data skipping
▪ OPTIMIZE instead of repartitioning jobs (see the sketch below)
Parameter Partitioning
Example: A.xxx, B.xxx, C.xxx → Partition 1; D.xxx, E.xxx, F.xxx → Partition 2; …
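A minimal sketch of this layout, assuming the "variables" DataFrame from before is written to a Delta table; partition count, paths and the ZORDER column are illustrative assumptions:

import static org.apache.spark.sql.functions.col;

// repartitionByRange splits the rows into ranges of roughly equal size
// ordered lexically by parameter name, so variables that share a root
// node stay physically close together.
variables
        .repartitionByRange(64, col("VarName"))
        .write().format("delta").mode("overwrite").save("/delta/variables");

// Instead of a dedicated repartitioning job, Delta's OPTIMIZE compacts
// and re-clusters the files in place.
spark.sql("OPTIMIZE delta.`/delta/variables` ZORDER BY (VarName)");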
20. Working Towards an Optimal Solution?
▪ Reduced complexity
  ▪ One single configurable Spark job
  ▪ Kafka replaced by Autoloader
  ▪ Unified batch & streaming
  ▪ ….
▪ Reduced memory pressure
  ▪ Micro-batches instead of full batches
▪ Stable jobs
▪ Monitoring
  ▪ JVM / Ganglia metrics
Azure & Databricks Benefits
21. The Current State
Self-service tools instead of manual work
Currently
▪ Established self-service tools
▪ Send a PDF summary to the field engineers
  ▪ Steps that could solve the problem
  ▪ Things to also consider at the machine
In future
▪ Automatically detect and classify errors
22. Key Takeaways
▪ Don’t underestimate the effort to process legacy data
▪ The unknown in this data jungle is quite daunting
▪ Unforeseeable things will happen
▪ Change management in a traditional company is really demanding
▪ Running an unmanaged cluster without dedicated resources leads to pure frustration
▪ Moving to the cloud reduced the complexity in our pipelines by a lot