Building accurate machine learning models has long been an art practiced by data scientists: algorithm selection, hyperparameter tuning, feature selection, and so on. Recently, efforts to break through these "black arts" have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters, and features without any manual work. In this talk, we share how the automation system is designed to exploit the attractive advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which packs 272 CPU cores, 2TB of memory, and 17TB of SSD into a 3U chassis. We also share open challenges in training such a massive number of models on Spark, particularly from the reliability and stability standpoints. This talk covers the presentation already shown at Spark Summit SF'17 (#SFds5), but from a more technical perspective.
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark
Marcin Kulka and Michał Kaczmarczyk
1. No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark
Marcin Kulka and Michał Kaczmarczyk
9LivesData
Oct/26/2017
#EUai9
2. Who are we?
• Marcin Kulka – Senior Software Engineer
• Michał Kaczmarczyk (Ph.D.) – Software Architect, Team Leader and Project Manager
3. Who are we? (9LivesData)
• An advanced software R&D company (Warsaw, Poland)
• 75+ scientists and software engineers
• Specializing in scalable storage, distributed and big data systems
• Cooperating with partners all around the world
5. Our partners at NEC
• Masato Asahara (Ph.D.) – Researcher, NEC Data Science Research Laboratory
• Ryohei Fujimaki (Ph.D.) – Research Fellow, NEC Data Science Research Laboratory
6. Agenda
• Typical use case for a predictive modeling problem
• Our technology: Automatic Predictive Modeling
• Design challenges
• Evaluation results
• Our observations
14. Predictive model design
[Diagram: the three design decisions]
• Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather)
• Algorithm selection: accuracy vs. transparency, black box vs. white box
• Hyperparameter tuning: finding the best balance
15.-17. Predictive model design (same diagram, captions added step by step)
• A lot of effort, many models…
• Many iterations, weeks…
• Sophisticated knowledge…
18.-19. Automatic predictive modeling (same diagram)
• Feature selection, algorithm selection and hyperparameter tuning performed automatically
• Highly accurate results in a short time!
23.-26. Exploring massive modeling possibilities
[Diagram: a tree of yes/no modeling choices]
• Data preprocessing strategies
• Feature selection
• Algorithms
• Hyperparameter tuning
Combined, these choices multiply into 1000s of models!
27. Automating and accelerating with Spark
[Same search-space diagram] Automated on Spark, exploring the whole space can complete in hours!
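To make the combinatorics concrete, here is a minimal sketch of how preprocessing strategies, feature subsets, algorithms and hyperparameter grids multiply into a large candidate set. All the concrete choices and names below are hypothetical illustrations, not the actual (non-public) search space of the system.

```python
from itertools import product

# Hypothetical search space; the real system's choices are not public.
preprocessing = ["none", "standardize", "log-transform"]
feature_subsets = ["all", "price+location", "price+weather"]
algorithms = {
    "logistic_regression": {"C": [0.01, 0.1, 1, 10]},
    "random_forest": {"n_estimators": [50, 100, 200]},
    "svm": {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
}

# Enumerate every (preprocessing, features, algorithm, hyperparameters) combination.
candidates = []
for prep, feats in product(preprocessing, feature_subsets):
    for algo, grid in algorithms.items():
        keys = list(grid)
        for values in product(*grid.values()):
            candidates.append((prep, feats, algo, dict(zip(keys, values))))

print(len(candidates))  # 3 * 3 * (4 + 3 + 6) = 117 candidates from even this tiny grid
```

Even three options per dimension already yields over a hundred models; realistic grids reach the "1000s of models" the slides describe.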
29. Modeling flow = training + validation
[Flow: training data → training models → models; validation data + validation criteria → validating models → best model; test data is held out]
30. Modeling and prediction flow
[The flow above, extended: best model + test data → prediction → best prediction]
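The train/validate/predict flow above can be sketched with toy stand-ins; the "models" here are simple callables rather than real ML engines, and all names are illustrative.

```python
# Toy stand-in for the train -> validate -> select-best -> predict flow.

def train(algorithm, training_data):
    # "Training" just closes over a slope parameter in this sketch.
    slope = algorithm["slope"]
    return lambda x: slope * x

def validation_error(model, validation_data):
    # Mean absolute error on held-out (x, y) pairs: the validation criterion.
    return sum(abs(model(x) - y) for x, y in validation_data) / len(validation_data)

training_data = [(1, 2), (2, 4), (3, 6)]       # underlying relation: y = 2x
validation_data = [(4, 8), (5, 10)]
test_data = [6, 7]

models = [train({"slope": s}, training_data) for s in (1, 2, 3)]
best_model = min(models, key=lambda m: validation_error(m, validation_data))

predictions = [best_model(x) for x in test_data]
print(predictions)  # [12, 14] -- the slope-2 model wins on validation data
```

The key point is the separation: validation data selects the best model, and only that model touches the test data.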
32.-33. Challenges to achieve high execution performance
• Using native ML engines in Spark
• Parameter-aware scheduling (θ1, θ2, θ3)
• Predictive work balancing
36.-40. Comparison of Spark and native ML engines

                      Spark (+ Spark ML)        Native ML engines
Scalability           Yes                       No (or very limited;
                                                fine if data fits a
                                                single server)
Choice of algorithms  Some                      Many (+ possibly some
                                                custom, very efficient;
                                                accuracy advantage)
Performance           Medium (distributed       Extremely high
                      nature, synchronization
                      overhead)

• We would like to combine Spark and ML engines
41. Combining Spark and ML engines for training
[Flow begins: training data (parquet) on HDFS → … → models]
44.-45. Combining Spark and ML engines for training
[Flow: training data (parquet, HDFS) → data preprocessing (MapReduce) → machine learning (map operation) → models]
• A 'single ML engine' runs on a single executor
• Each engine has input requirements: size & format
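The "single ML engine per executor" idea corresponds to Spark's pattern of handing an entire partition to one function call (as `mapPartitions` does), rather than processing rows one by one. A pure-Python simulation, with a hypothetical stand-in for a native engine:

```python
# Simulation of running one native ML engine per executor: each partition
# is handed, whole, to a single engine call (Spark's mapPartitions pattern).

def native_engine_fit(rows):
    # Hypothetical stand-in for a native engine: "fits" the mean of each column.
    n = len(rows)
    cols = len(rows[0])
    return [sum(r[c] for r in rows) / n for c in range(cols)]

def map_partitions(partitions, fn):
    # One engine invocation per partition, as on a single executor.
    return [fn(part) for part in partitions]

partitions = [
    [[1.0, 2.0], [3.0, 4.0]],     # partition 0
    [[5.0, 6.0], [7.0, 8.0]],     # partition 1
]
models = map_partitions(partitions, native_engine_fit)
print(models)  # [[2.0, 3.0], [6.0, 7.0]]
```

This is why the input requirements (size and format) matter: each partition must fit, in the expected layout, in the memory of a single executor's engine.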
47.-49. Combining Spark and ML engines for training
[Flow: training data (parquet, HDFS) → data preprocessing (MapReduce) → converting to RDD[Matrix] → machine learning (map operation) → 1000s of models, written back to HDFS]
• RDD[Matrix]: an RDD of huge, efficiently stored objects optimized for ML computations!
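The "convert to RDD[Matrix]" step packs a partition's many small row records into one contiguous matrix object, so the native engine receives a single dense block instead of millions of tiny rows. A minimal sketch, using a plain Python dict in place of the real matrix type:

```python
# Sketch of the row-records-to-dense-matrix conversion. A real implementation
# would use an off-heap or numpy-style buffer; the dict here is illustrative.

def rows_to_matrix(rows):
    # Flatten the partition's rows into one row-major buffer plus its shape.
    n_rows, n_cols = len(rows), len(rows[0])
    buffer = [v for row in rows for v in row]
    return {"shape": (n_rows, n_cols), "data": buffer}

partition = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
matrix = rows_to_matrix(partition)
print(matrix["shape"], matrix["data"])  # (2, 3) [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

A contiguous layout like this is what lets the downstream linear algebra and native engines run at full speed.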
50. Combining Spark and ML engines for validation
[Flow begins: validation data (parquet) on HDFS → …]
56. Combining Spark and ML engines for prediction
[Flow: test data (parquet, HDFS) → data preprocessing (MapReduce) → convert to RDD[Matrix] → predict (map operation) → prediction results (parquet, HDFS)]
• Computations run only for the selected models
64. Machine learning: the most work-intensive and time-consuming part
[Same training flow diagram, with 1000s of models]
• We must ensure a good balance of parallelized work
66. Naive balancing of models to compute
• With naive scheduling, one worker may receive two complicated models (5 min each) while another receives two decision-tree models (1 min each)
• The fast worker then waits 8 minutes for the slow one
67. Predictive balancing
• Balance complex and simple models, based on previously estimated costs
• Schedule complex models first
• With the same workload, each worker gets one 5-minute and one 1-minute model and both finish in 6 minutes
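"Complex models first, onto the least-loaded worker" is the classic longest-processing-time-first heuristic. A minimal sketch (the talk does not disclose the exact scheduler, so this is one plausible reading of the slide):

```python
import heapq

def predictive_balance(estimated_minutes, n_workers):
    # Longest-processing-time-first: sort jobs by estimated cost, complex
    # first, and always assign to the currently least-loaded worker
    # (tracked with a min-heap of worker loads).
    workers = [(0, i, []) for i in range(n_workers)]  # (load, id, assigned jobs)
    heapq.heapify(workers)
    for cost in sorted(estimated_minutes, reverse=True):
        load, i, assigned = heapq.heappop(workers)
        heapq.heappush(workers, (load + cost, i, assigned + [cost]))
    return workers

# Slide 66's example: two 5-minute and two 1-minute models, 2 workers.
workers = predictive_balance([5, 1, 5, 1], n_workers=2)
print(sorted(load for load, _, _ in workers))  # [6, 6] -- no 8-minute wait
```

Naive order-of-arrival scheduling could give loads of 10 and 2; cost-aware scheduling balances both workers at 6 minutes.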
69. Evaluation – targeting Top-10%
• Prediction problem
  – Comparing Top-10% precision when targeting potentially positive samples
• Comparing against manual predictive modeling
  – Done with scikit-learn v0.18.1
  – Selected algorithms (Logistic Regression, SVM, Random Forests)
  – Selected preprocessing strategies
  – All algorithm parameters left at default values
    • except Random Forest (n_estimators = 200)
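Top-10% precision ranks all samples by predicted score, takes the top 10%, and measures what fraction of them are truly positive. A small self-contained implementation (the function name and toy data are ours, not from the talk):

```python
def top_k_percent_precision(scores, labels, k_percent=10):
    # Rank samples by predicted score, keep the top k%, and compute the
    # fraction of true positives among them (the targeting metric).
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    cutoff = max(1, len(ranked) * k_percent // 100)
    top = ranked[:cutoff]
    return sum(label for _, label in top) / cutoff

# 20 samples, so the top 10% is 2 samples; the 2 highest-scored are positive.
scores = [i / 20 for i in range(20)]
labels = [0] * 18 + [1, 1]          # the two highest scores are the positives
print(top_k_percent_precision(scores, labels))  # 1.0
```

This metric fits targeting use cases (e.g. marketing campaigns) where only the highest-scored fraction of the population will actually be contacted.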
70. Evaluation – data sets
• KDDCUP 2014 competition data
  – 557K records for training and validation data
  – 62K records for test data
  – Features: 500
• KDDCUP 2015 competition data
  – 108K records for training and validation data
  – 12K records for test data
  – Features: 500
• IJCAI 2015 competition data
  – 87K records for training, validation and test data
  – Features: 500
71. Evaluation – cluster specification
Scalable Modular Server (DX2000)
• Size: 3U
• Server modules: 34
• CPU: 272 cores (Intel Xeon D 2.1GHz)
  – 128 cores used in the evaluation
• RAM: 2TB
• Storage: 34TB SSD
• Internal network: 10GbE
• Spark v1.6.0, Hadoop v2.7.3
73. Evaluation results and conclusions
• Competitive results with good accuracy

Top-10% precision results:
Data          Our technology  Logistic regression  SVM     Random Forests
KDDCUP 2014   15.6%           13.5%                12.0%   14.8%
KDDCUP 2015   97.1%           95.5%                93.1%   97.2%
IJCAI 2015    8.2%            8.3%                 8.1%    8.2%
74. Evaluation results and conclusions
• Short execution time
• Full automation of the whole process
• Handling data of any size

Execution time:
Data          Our technology
KDDCUP 2014   172 minutes
KDDCUP 2015   45 minutes
IJCAI 2015    36 minutes
76.-77. Our observations
• Using an RDD of huge but compact objects optimized for ML computations
• Limiting execution time overhead in tests on YARN
• Stable execution on YARN
78. [Same training flow diagram; the converted RDD is an RDD[DenseMatrix]]
79. RDD[DenseMatrix]
• Spark used for parallelization
• All the data needed for a single execution is kept in memory without overhead
• Performance-critical operations are executed:
  – on objects with optimized linear algebra operations
  – by fast native ML algorithms
81. Limiting execution overhead in tests
• Submitting a Spark application takes time
[Timeline: Spark submit → test, Spark submit → test, Spark submit → test]
84. Stable execution on YARN
• The default configuration sometimes fails with out-of-memory errors
• The Spark Web UI shows plenty of memory granted to Spark, yet the application still fails
• A known problem in Spark
85. Stable execution on YARN
• JVM system memory suddenly spikes over the YARN limitation (*)
[Plot: memory (GB) over time; a spike of JVM system memory usage crosses the 6GB YARN limitation]
(*) Shivnath and Mayuresh. "Understanding Memory Management In Spark For Fun And Profit", Spark Summit 2016.
86. Stable execution on YARN
• Tip: spark.yarn.executor.memoryOverhead must be configured carefully
• The documentation recommends a 6-10% overhead (http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
• A 15% overhead was required in our case
• The right value must be thoroughly investigated
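For illustration, here is what setting that overhead might look like at submit time. The memory sizes and application name are hypothetical; only the property name `spark.yarn.executor.memoryOverhead` (the Spark 1.x-era name, matching the Spark 1.6.0 used in this evaluation) comes from the talk.

```shell
# Illustrative spark-submit for an 8 GB executor with ~15% off-heap overhead,
# the share that proved necessary here (docs suggest only ~6-10%).
# 15% of 8192 MB is ~1229 MB; rounded up to 1280 MB to be safe.
spark-submit \
  --master yarn \
  --conf spark.executor.memory=8g \
  --conf spark.yarn.executor.memoryOverhead=1280 \
  your-modeling-app.jar
```

If the overhead is too small, YARN kills executors whose JVM system memory spikes past the container limit, exactly as shown on the previous slide.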
88. Summary
• The predictive modeling problem
  – Requires sophisticated knowledge
  – Takes a long time
• Our technology: Automatic Predictive Modeling
  – Combines Spark with native ML engines
  – Fully automates the whole process
  – Provides highly accurate results
  – Takes at most hours
  – Handles data of any size
89. Future work
• Extending to other models (e.g. deep learning)
• Speeding up with GPUs
• Reducing YARN memory overhead