Distributed machine learning
on the JVM without a Ph.D.
Mateusz Dymczyk
Mateusz Dymczyk, 23rd October 2015
Mateusz Dymczyk Prague, 23rd October 2015
@mdymczyk
Say who?
BIG data
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making,
and process automation.” — Gartner
3Vs
[Chart: estimated data processed per day, circa 2014, for NSA, Baidu, eBay and Google; the values shown are 10-100 PB and 100 PB]
Current state
[Diagram: data source, data collection, data storage, data processing, simple analytics]
What can we do?
Machine what?
“The field of machine learning is concerned with the question of how to construct computer
programs that automatically improve with experience.” — Mitchell, Tom M., “Machine Learning”
ML is extremely broad and involves several domains:
• computer science
• probability and statistics
• optimisation
• linear algebra
Machine learning
• Observation - an object used for learning or evaluation (e.g. a house)
• Features - a representation of the observation (e.g. square meters, number of rooms, location)
• Labels - values assigned to observations (not always used)
• System - a set of related objects forming a complex whole (e.g. a set of observations)
• Model (math) - a description of a system using mathematical concepts/language
• Data:
  • training set: gets us our candidate parameters =>
  • validation set (optional): gets us the optimal parameter set =>
  • test set: checks how good the model is
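The three data roles above can be sketched as a simple split. This is a minimal illustration in plain Scala (no Spark); the `DataSplit` object, its parameter names and the 60/20/20 fractions used in the usage line are assumptions made for the example.

```scala
import scala.util.Random

object DataSplit {
  // Shuffle with a fixed seed, then cut the data into
  // training / validation / test parts by the given fractions.
  // Whatever is left after training and validation becomes the test set.
  def split[A](data: Seq[A], trainFrac: Double, validFrac: Double, seed: Long): (Seq[A], Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    val nTrain = (data.size * trainFrac).toInt
    val nValid = (data.size * validFrac).toInt
    val (train, rest) = shuffled.splitAt(nTrain)
    val (valid, test) = rest.splitAt(nValid)
    (train, valid, test)
  }
}
```

Usage: `DataSplit.split((1 to 10).toSeq, 0.6, 0.2, seed = 42L)` returns three disjoint parts that together contain every input element once.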
Basic terminology
• e.g. regression, when you want to predict a real number
• e.g. clustering, when you want to cluster or have too much data
• e.g. classification, when you want to assign to a category
• e.g. association analysis, when you want to find relations between data
Why?
•Recommendation engines
•Customer churn (dissatisfaction)
•Customer segmentation
•Fraud detection
•Sentiment analysis
•Similar document search
•Demand forecasting
•Spam detection
•Store layout
•Much more: https://www.kaggle.com/wiki/DataScienceUseCases
But don’t take my word for it…
• Lack of distributed/scalable solutions
• Not enough data and/or computing power
• False conviction that we need to:
  • read hard research papers
  • use “weird” programming languages
So what’s the problem?
ML and JVM?
Nothing new!
Still not good enough…
•Not designed for big data
•Didn’t fit machine learning computation models
vs
ML, JVM and an (iterative) distribution?
New (distributed) kids on the block
•MLlib (+Spark)
• TridentML (+Storm)
• Apache FlinkML (+Flink)
• Mahout Samsara
• Mahout R-like DSL
• Mahout on Spark
• H2O
• back-end agnostic (but with native APIs)
• open-source machine learning platform
What is Spark?
• Distributed, fast, in-memory computational framework
• Based on the RDD (Resilient Distributed Dataset: an abstract, immutable, distributed, easily rebuilt data format)
• Support for Scala, Java, Python and R
• Focuses on well-known methods (map(), flatMap(), filter(), reduce() …)
What is Spark?
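import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("Spark App")
val sc = new SparkContext(conf)
// Read lines, split them into words and count the occurrences of each word
val textFile: RDD[String] = sc.textFile("hdfs://...")
val counts: RDD[(String, Int)] = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(s"Found ${counts.count()} distinct words")
counts.saveAsTextFile("hdfs://...")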
What is Spark?
[Diagram: Spark SQL, Spark Streaming, MLlib and GraphX sit on top of Apache Spark (core), which runs on Mesos/YARN/Standalone (cluster management)]
What is MLlib?
• Machine learning library for Spark (scalable by definition)
• Since September 2013, initially created at AMPLab (UC Berkeley)
• Contains common, well-established machine learning algorithms and utilities
Is it for me?
PROS
• extensive community, part of Spark (Databricks support)
• Java, Scala, Python, R APIs
• solid implementation of the most popular algorithms
• easy to use, well documented, multitude of examples
• fast and robust
CONS
• only Spark
• very young, still missing algorithms
• still pretty “low level”
Any problems left?
• Young projects, still require a lot of work
• Plenty of ML algorithms are inherently hard to distribute
• Simply throwing more machines at the problem won’t always work (e.g. too much data movement, too many operations)
What can we do?
1. Go to Spark’s JIRA
2. Add a ticket to MLlib
3. Relax
Go smart(er)
• Compromise:
  • Approximate
  • Lambda architecture
• Compose algorithms:
  • e.g. clustering + actual similarity check
• Use different algorithms:
  • for instance, instead of a closed-form solution use an iterative one
• Come up with new algorithms :-)
[Diagram: lambda architecture, from data in to the serving layer]
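To make the "closed form vs. iterative" point concrete, here is a minimal sketch in plain Scala. It assumes the simplest possible model, y ≈ w * x with one weight and no intercept; the object name, learning rate and step count are illustrative choices, not something taken from the talk.

```scala
object IterativeVsClosedForm {
  // Closed form for least squares on y ≈ w * x: w = Σ(x*y) / Σ(x*x)
  def closedForm(xs: Seq[Double], ys: Seq[Double]): Double =
    xs.zip(ys).map { case (x, y) => x * y }.sum / xs.map(x => x * x).sum

  // Iterative alternative: gradient descent on the mean squared error.
  // Every step only needs a sum over the data, i.e. a map followed by a reduce.
  def gradientDescent(xs: Seq[Double], ys: Seq[Double], learningRate: Double, steps: Int): Double = {
    val n = xs.size
    (1 to steps).foldLeft(0.0) { (w, _) =>
      val gradient = xs.zip(ys).map { case (x, y) => 2 * x * (w * x - y) }.sum / n
      w - learningRate * gradient
    }
  }
}
```

The per-step sum is what makes the iterative version a natural fit for a distributed map/reduce engine, whereas closed-form solutions for larger models typically involve matrix operations that are harder to spread across machines.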
Examples!
What we’ll see
•End to end example: similarity search
•Built-in algorithm/util examples:
•Clustering
•Recommender systems (collaborative filtering)
•Logistic regression
•Model evaluation
Similarity search
• Problem: given an object (document, image), find all objects in a given set which are similar to it.
• Solution: similarity is a well-researched topic in mathematics!
•Why:
•Find most popular objects.
•Aggregate similar objects to declutter view.
•Find k most similar objects.
50 shades of similarity
•Distance based:
•Manhattan distance (L1 norm)
•Euclidean distance (L2 norm)
•Angular similarity:
•Cosine
•Set similarity (vectorization not necessary):
•Jaccard
•Dice
•Hamming
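Most of the measures listed above fit in a few lines each. The following is a minimal sketch in plain Scala; the `Similarity` object and its method names are hypothetical, and vectors are assumed to have equal length.

```scala
object Similarity {
  // Distance-based measures: smaller means more similar
  def manhattan(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Angular measure: cosine of the angle between the vectors (1.0 = same direction)
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Set measures work directly on token sets, no vectorization needed
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 1.0 else (a intersect b).size.toDouble / (a union b).size

  def dice[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 1.0 else 2.0 * (a intersect b).size / (a.size + b.size)
}
```

For example, the token sets {"short", "test"} and {"long", "test"} share one of three distinct tokens, so their Jaccard similarity is 1/3.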
Similarity search - pipeline
Input data → Data preprocessing (e.g. tokenization, text normalization) → Vectorization → Similarity check (similarity algorithm) → Result
“This’s a Short test” → [“short”, “test”]
“This’s a not so long Test” → [“long”, “test”]
[1,1,0], [1,0,1] …
Similarity search - distributed pipeline
[Diagram: input data → data preprocessing → vectorization → similarity check → result, with each stage running on several nodes of a cluster]
Similarity search I
•Brute force solution:
•pre-process text
•vectorize (in our case TF-IDF)
•compute all possible pairs
•compute cosine similarity between each pair
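On a single machine, the last two steps of the brute-force approach look roughly like this. It is a sketch in plain Scala collections rather than Spark, with a hypothetical `BruteForce` object; inputs are assumed to be already vectorized.

```scala
object BruteForce {
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Compare every unordered pair (i < j) exactly once:
  // n * (n - 1) / 2 comparisons for n vectors.
  def allPairs(vectors: IndexedSeq[Seq[Double]]): Seq[(Int, Int, Double)] =
    for {
      i <- vectors.indices
      j <- vectors.indices
      if i < j
    } yield (i, j, cosine(vectors(i), vectors(j)))
}
```

The `i < j` guard plays the same role as filtering a cartesian product by document id: it drops self-pairs and mirrored duplicates.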
Vectorization: TF-IDF
• Term Frequency–Inverse Document Frequency:
  • how important a word is for a document in a collection
  • higher when the word occurs often in a document
  • lower when the word is also common in the whole collection
“This’s a Short test” → [“short”, “test”]
“This’s a not so long Test” → [“long”, “test”]
[1/6, 1/3, 0], [1/6, 0, 1/3] …
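A hand-rolled version of the weighting can make the idea easier to follow. This sketch assumes raw counts for term frequency and a smoothed inverse document frequency, ln((N + 1) / (df + 1)); with that choice the weights differ from the illustrative fractions above. The `TfIdfSketch` object is hypothetical and works on already tokenized documents.

```scala
object TfIdfSketch {
  // tf  = number of times the term occurs in the document
  // idf = ln((N + 1) / (df + 1)), where N is the number of documents
  //       and df is the number of documents containing the term
  def tfIdf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val n = docs.size
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }
    docs.map { doc =>
      doc.groupBy(identity).map { case (term, occ) =>
        term -> occ.size * math.log((n + 1.0) / (df(term) + 1.0))
      }
    }
  }
}
```

With the two example documents, "test" appears in both and so gets weight 0 under this formula, while "short" and "long" each get the same positive weight.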
TF-IDF
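import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// One document per line, split into terms
val documents: RDD[Seq[String]] = sc.textFile("...")
  .map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
// tf is used twice (fit and transform), so cache it
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)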
Similarity search I
type DocSimilarity = (String, Seq[(String, Double)])
case class Document(id: Long, doc: String)

// Normalizer, TfIdf and cosine are helpers not shown here
def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
  val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
  val vectorized: RDD[(Document, Vector)] = TfIdf.vectorize(normalized).cache()
  // Brute-force similarity: compare every unordered pair exactly once
  vectorized
    .cartesian(vectorized)
    .filter { case (doc1, doc2) => doc1._1.id < doc2._1.id }
    .flatMap { case (doc1, doc2) =>
      val similarity: Double = cosine(doc1._2, doc2._2)
      // Emit the pair in both directions so each document collects all its neighbours
      Seq(
        (doc1._1.doc, (doc2._1.doc, similarity)),
        (doc2._1.doc, (doc1._1.doc, similarity))
      )
    }
    .combineByKey[Seq[(String, Double)]](
      (x: (String, Double)) => Seq(x),
      (acc: Seq[(String, Double)], y: (String, Double)) => y +: acc,
      (acc1: Seq[(String, Double)], acc2: Seq[(String, Double)]) => acc1 ++ acc2
    )
}
Similarity search I - problems
•Compute all-pairs similarity:
•O(n^2) comparisons
•10^6 documents
•~5*10^11 comparisons =>
•~6 days (10^3 comp/ms)
•Data shuffle size O(nL^2)
•Largest reduce-key: O(n)
n — # docs, L — # of unique words in a doc
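The numbers above can be checked with a few lines of arithmetic. A small sketch in Scala, using a hypothetical `AllPairsCost` object and the rate quoted above (10^3 comparisons per millisecond):

```scala
object AllPairsCost {
  // Unordered pairs among n documents: n * (n - 1) / 2
  def comparisons(n: Long): Long = n * (n - 1) / 2

  // Wall-clock days needed at a given rate of comparisons per millisecond
  def days(count: Long, perMillisecond: Long): Double =
    count.toDouble / perMillisecond / 1000 / 60 / 60 / 24
}
```

For n = 10^6 this gives 499,999,500,000 comparisons (about 5*10^11) and a little under 6 days on a single worker.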
[Diagram: similarity check running on several nodes of the cluster]
Why is data shuffle so bad?
[Diagram: data transfer rates: 50 GB/s, 100 MB/s, 100-600 MB/s, 1 GB/s, 0.3 GB/s]
Similarity search II
[Diagram: input data → data preprocessing → vectorization → group by feature(s) → similarity check → result, with each stage running on several nodes of a cluster]
Similarity search II
• Problems:
  • What if there are no features to group by?
  • What if it produces clusters that are too big?
• Solution: cluster anyway, but smartly!
Locality sensitive hashing
• Similar objects land in the same bucket (maximizes the % of collisions)
• Group of algorithms (for different similarity measures):
  • random projection for cosine
  • min-hash for Jaccard
  • …
• Problems:
  • possibility of false positives and false negatives
  • double check the former, minimize the latter
  • might produce duplicate pairs!
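As an illustration of the random projection variant, the sketch below hashes each vector to one bit per random hyperplane and groups vectors by the resulting signature. It is plain Scala with a hypothetical `RandomProjection` object; the number of bits and the seed are arbitrary assumptions, and a real setup would usually combine several such hash tables to reduce false negatives.

```scala
import scala.util.Random

object RandomProjection {
  // Draw `bits` random hyperplanes (Gaussian normal vectors) for `dim`-dimensional input.
  def hyperplanes(bits: Int, dim: Int, seed: Long): Seq[Seq[Double]] = {
    val rnd = new Random(seed)
    Seq.fill(bits)(Seq.fill(dim)(rnd.nextGaussian()))
  }

  // One bit per hyperplane: which side of the plane the vector falls on.
  def signature(v: Seq[Double], planes: Seq[Seq[Double]]): Seq[Boolean] =
    planes.map(p => p.zip(v).map { case (a, b) => a * b }.sum >= 0.0)

  // Group vector ids by signature; only ids sharing a bucket become candidate pairs.
  def buckets(vectors: Map[Int, Seq[Double]], planes: Seq[Seq[Double]]): Map[Seq[Boolean], Set[Int]] =
    vectors.toSeq
      .groupBy { case (_, v) => signature(v, planes) }
      .map { case (sig, members) => sig -> members.map(_._1).toSet }
}
```

Vectors pointing in the same direction always share a signature, and the smaller the angle between two vectors, the more likely they are to collide. Candidates from a bucket still need an exact cosine check to weed out false positives.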
Similarity search III
type DocSimilarity = (String, Seq[(String, Double)])
case class Document(id: Long, doc: String)

// Normalizer, TfIdf and cosines are helpers not shown here
def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
  val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
  val vectorized: RDD[(Document, Vector)] = TfIdf.extract(normalized).cache()
  // Hash similar vectors into the same cluster, then compare only within clusters
  val lsh = new LSH(data = vectorized, p = 65537, m = 1000, numRows = 1000, numBands = 25, minClusterSize = 2)
  val model = lsh.run
  val clusters: RDD[(Long, Iterable[SparseVector])] = model.clusters
  clusters.map { case (id, cluster) => cosines(cluster) }
}
• Sample implementations:
• https://github.com/mrsqueeze/spark-hash (min-hash)
• https://github.com/marufaytekin/lsh-spark (Charikar’s LSH for cosine)
Similarity search - results
INPUT
“パウダーファンデーションのパフがすぐに汚れてしまう。” (“Powder foundation’s puff gets dirty really fast”)
OUTPUT
0.80 “パウダーをつけるパフがすぐに汚れる。” (“The puff gets dirty really fast after applying the powder.”)
0.53 “パフがすぐに汚くなってしまう。” (“The puff gets dirty really fast.”)
0.30 “パウダリーファンデーションをつけるためのスポンジというかパフ、すぐに汚れて、ファンデをつける時にきれいに伸ばせなくなる。” (“The sponge for applying the powdery foundation gets dirty really fast, when using the foundation it doesn’t spread nicely.”)
Built-ins
Clustering
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("...")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
val clusters = KMeans.train(parsedData, 2, 20) // k = 2 clusters, at most 20 iterations
val prediction = clusters.predict(point) // point: the Vector to assign to a cluster
• Unsupervised learning problem which tries to group subsets of objects with one another based on some notion of similarity.
• Supported algorithms: K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA)
Recommender systems
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val data = sc.textFile("...")
// Each line: user,item,rating
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})
// rank = 1, iterations = 20, lambda = 0.01
val model = ALS.train(ratings, 1, 20, 0.01)
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
• Collaborative filtering
• User/product matrix predictions
(Logistic) Regression
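• Iterative algorithm - greatly benefits from caching
• Often used for binary classification (can be generalised)
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// <label> <idx1>:<val1> <idx2>:<val2> ...
val data = MLUtils.loadLibSVMFile(sc, "...").cache()
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(data)
model.predict(pointToPredict) // pointToPredict: the feature Vector to classify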
Is everything OK?
Supervised learning workflow
[Diagram: raw data → cleaned/scaled data → training set and validation set → model creation → validation → final model, which is then applied to incoming new data]
Model evaluation
• Certain ML algorithms create models
• How do we know if the model we got is good (enough)?
• Different types of evaluation depending on the ML algorithm type:
  • classification: precision and recall (based on true/false positives/negatives)
  • regression: different methods based on the difference between predicted and actual values
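For classification, precision and recall follow directly from counting true positives, false positives and false negatives. A minimal sketch in plain Scala, with a hypothetical `BinaryMetrics` object; returning 0.0 when a denominator is zero is a convention chosen for this example.

```scala
object BinaryMetrics {
  // Each pair is (predicted, actual); true means the positive class.
  def precisionRecall(results: Seq[(Boolean, Boolean)]): (Double, Double) = {
    val tp = results.count { case (p, a) => p && a }   // true positives
    val fp = results.count { case (p, a) => p && !a }  // false positives
    val fn = results.count { case (p, a) => !p && a }  // false negatives
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    (precision, recall)
  }
}
```

Precision answers "of everything predicted positive, how much was right?", recall answers "of everything actually positive, how much did we find?".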
Model evaluation
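import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "...")
// 60% of the data for training, 40% for testing
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)
// Return raw scores instead of 0/1 labels so metrics can be computed per threshold
model.clearThreshold()
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}
val recall = metrics.recallByThreshold
recall.foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}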
To remember…
Common pitfalls
1. Try to avoid groupByKey()
  • instead try reduceByKey()
2. Don’t collect all the data in the driver:
  • collect() will copy all the elements to the driver node
  • instead persist it (file, DB)
3. Use cache()/persist() where necessary (check Spark’s web UI)!
4. Code for failure and handle malformed input!
5. Remember that anything shipped to the executors must be Serializable!
Performance recap
1. Parallelising (not concurrency!) makes us faster
2. Network traffic makes us (really) slow
  1. keep data close to the processing units (stay local)
  2. take note of operation order
  3. don’t iterate more than necessary
3. In-memory computation/caching helps a lot (especially in case of iterative machine learning!)
Where to go from here
• Get ideas: https://www.kaggle.com/wiki/DataScienceUseCases
• Get started with Spark:
• http://spark.apache.org/docs/latest/quick-start.html
• https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
• Get started with MLlib:
• http://spark.apache.org/docs/latest/mllib-guide.html
• https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• Try out other frameworks and courses:
• https://github.com/h2oai/sparkling-water
• https://www.coursera.org/course/mmds
• Learn the basics:
• https://www.coursera.org/learn/machine-learning
• Practical books:
• “Advanced Analytics with Spark” — Sandy Ryza et al. O’Reilly Media
• “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark” — Mahmoud Parsian, O’Reilly Media
Q&A
Streams
Can I has stream?
• Linear models (regression) can be trained in a streaming fashion (1.1+)
• Clustering can be done on streams (with k-means)
• What if the data changes over time? MLlib supports “forgetfulness”
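One way to picture "forgetfulness" is a cluster center kept as a weighted average in which old data is discounted by a decay factor before each new batch is mixed in. The sketch below is plain Scala with a hypothetical `DecayedCenter` object; that MLlib's internals follow exactly this formula is an assumption, and the batch is assumed to be non-empty.

```scala
object DecayedCenter {
  // center:     current cluster center
  // weight:     number of points the center has absorbed so far
  // batchMean:  mean of the new points assigned to this cluster
  // batchCount: number of those new points (assumed > 0)
  // decay:      1.0 keeps all history, 0.0 keeps only the latest batch
  def update(center: Seq[Double], weight: Double,
             batchMean: Seq[Double], batchCount: Double,
             decay: Double): Seq[Double] = {
    val discounted = weight * decay
    center.zip(batchMean).map { case (c, x) =>
      (c * discounted + x * batchCount) / (discounted + batchCount)
    }
  }
}
```

With a center at 0.0 built from 10 points and a new batch of 10 points averaging 10.0, a decay of 1.0 moves the center to 5.0, while a decay of 0.0 jumps straight to 10.0.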
Can I has stream?
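import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// ssc: an existing StreamingContext
val trainingData = ssc.textFileStream("...").map(Vectors.parse)
val testData = ssc.textFileStream("...").map(LabeledPoint.parse)
val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0) // 1.0 = use all data seen so far
  .setRandomCenters(3, 0.0) // 3-dimensional data, initial weight 0
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()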
Still sounds like work….
Seldon.io
• open predictive platform
• provides content recommendation and predictive functionality
Prediction.io
• open source ML server for building predictive engines
• event collection, algorithms, evaluation and querying predictive results via REST
• uses Hadoop, HBase, Spark and Elasticsearch

Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 

Distributed Machine Learning on the JVM Without a PhD

  • 1. Distributed machine learning on the JVM without a Ph.D. Mateusz Dymczyk Mateusz Dymczyk, 23rd October 2015
  • 2. Mateusz Dymczyk Prague, 23rd October 2015 @mdymczyk Say who?
  • 3. BIG data Mateusz Dymczyk Prague, 23rd October 2015
  • 4. “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” (Gartner)
      The 3Vs. NSA, Baidu 10-100 PB, eBay 100 PB, Google 100 PB (estimated data processed per day, circa 2014)
  • 5. Mateusz Dymczyk Prague, 23rd October 2015
  • 6. Mateusz Dymczyk Prague, 23rd October 2015 Current state
  • 7. Mateusz Dymczyk Prague, 23rd October 2015 Data source Data collection Data storage Simple analytics Data processing
  • 8. Mateusz Dymczyk Prague, 23rd October 2015 What can we do?
  • 9. Machine what? Mateusz Dymczyk Prague, 23rd October 2015
  • 10. Mateusz Dymczyk Prague, 23rd October 2015
  • 11. Machine learning
      “The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” (Mitchell, Tom M., “Machine Learning”)
      ML is extremely broad and involves several domains:
      • computer science
      • probability and statistics
      • optimisation
      • linear algebra
  • 12. Basic terminology
      • Observation - object which is used for learning or evaluation (e.g. a house)
      • Features - representation of the observation (e.g. square meters, number of rooms, location)
      • Labels - a value assigned to an observation (not always used)
      • System - set of related objects forming a complex whole (e.g. set of observations)
      • Model (math) - description of a system using mathematical concepts/language
      • Data:
          • training gets us our candidate parameters =>
          • validation (optional) gets us the optimal parameter set =>
          • test checks how good the model is
  • 13. Types of machine learning problems:
      • e.g. regression, when you want to predict a real number
      • e.g. clustering, when you want to cluster or have too much data
      • e.g. classification, when you want to assign to a category
      • e.g. association analysis, when you want to find relations between data
  • 14. Why?
      • Recommendation engines
      • Customer churn (dissatisfaction)
      • Customer segmentation
      • Fraud detection
      • Sentiment analysis
      • Similar document search
      • Demand forecasting
      • Spam detection
      • Store layout
      • Much more: https://www.kaggle.com/wiki/DataScienceUseCases
  • 15. Mateusz Dymczyk Prague, 23rd October 2015 But don’t take my word for it…
  • 16. So what’s the problem?
      • Lack of distributed/scalable solutions
      • Not enough data and/or computing power
      • False conviction that we:
          • Need to read hard research papers
          • Need to use “weird” programming languages
  • 17. ML and JVM? Mateusz Dymczyk Prague, 23rd October 2015
  • 18. Mateusz Dymczyk Prague, 23rd October 2015 Nothing new!
  • 19. Still not good enough…
      • Not designed for big data
      • Didn’t fit machine learning computation models
  • 20. ML, JVM and a (iterative) distribution? Mateusz Dymczyk Prague, 23rd October 2015
  • 21. New (distributed) kids on the block
      • MLlib (+Spark)
      • TridentML (+Storm)
      • Apache FlinkML (+Flink)
      • Mahout Samsara
          • Mahout R-like DSL
          • Mahout on Spark
      • H2O
          • back-end agnostic (but with native APIs)
          • open-source machine learning platform
  • 22. What is Spark?
      • Distributed, fast, in-memory computational framework
      • Based on RDDs (Resilient Distributed Dataset: an abstract, immutable, distributed, easily rebuilt data format)
      • Support for Scala, Java, Python and R
      • Focuses on well-known methods (map(), flatMap(), filter(), reduce() …)
  • 23. What is Spark?
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.rdd.RDD

      val conf = new SparkConf().setAppName("Spark App")
      val sc = new SparkContext(conf)
      // Read the input, split each line into words and count every word
      val textFile: RDD[String] = sc.textFile("hdfs://...")
      val counts: RDD[(String, Int)] = textFile
        .flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      // counts.count() is the number of distinct words
      println(s"Found ${counts.count()}")
      counts.saveAsTextFile("hdfs://...")
  • 24. What is Spark?
      • Libraries: SparkSQL, Spark Streaming, MLlib, GraphX
      • Built on: Apache Spark (core)
      • Running on: Mesos/YARN/Standalone (cluster management)
  • 25. What is MLlib?
      • Machine learning library for Spark (scalable by definition)
      • Since September 2013, initially created at AMPLab (UC Berkeley)
      • Contains common, well-established machine learning algorithms and utilities
  • 26. Is it for me?
      PROS
      • extensive community, part of Spark (Databricks support)
      • Java, Scala, Python, R APIs
      • solid implementation of most popular algorithms
      • easy to use, well documented, multitude of examples
      • fast and robust
      CONS
      • only Spark
      • very young, still missing algorithms
      • still pretty “low level”
  • 27. Any problems left?
      • Young projects, still require a lot of work
      • Plenty of ML algorithms do not lend themselves to distribution by design
      • Simply throwing more machines at the problem won’t always work (e.g. too much data movement, too many operations)
  • 28. What can we do?
      1. Go to Spark’s JIRA
      2. Add a ticket to MLlib
      3. Relax
  • 29. Go smart(er)
      • Compromise:
          • Approximate
          • Lambda architecture
      • Compose algorithms:
          • e.g. clustering + actual similarity check
      • Use different algorithms:
          • for instance, instead of a closed-form solution use iterative solutions
      • Come up with new algorithms :-)
  • 30. Mateusz Dymczyk Prague, 23rd October 2015 Examples!
  • 31. What we’ll see
      • End-to-end example: similarity search
      • Built-in algorithm/util examples:
          • Clustering
          • Recommender systems (collaborative filtering)
          • Logistic regression
          • Model evaluation
  • 32. Similarity search
      • Problem: given an object (document, image), find all objects in a given set which are similar to it.
      • Solution: similarity is a well-researched topic in mathematics!
      • Why:
          • Find most popular objects.
          • Aggregate similar objects to declutter the view.
          • Find the k most similar objects.
  • 33. 50 shades of similarity
      • Distance based:
          • Manhattan distance (L1 norm)
          • Euclidean distance (L2 norm)
      • Angular similarity:
          • Cosine
      • Set similarity (vectorization not necessary):
          • Jaccard
          • Dice
          • Hamming
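A minimal sketch of four of the measures listed on this slide, in plain Scala without Spark. The object and method names are illustrative and not taken from the talk, and returning 0.0 for a zero vector in `cosine` is a convention assumed here.

```scala
object SimilarityMeasures {
  // Manhattan (L1) distance: sum of absolute coordinate differences.
  def manhattan(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

  // Euclidean (L2) distance: square root of the summed squared differences.
  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine similarity: dot product divided by the product of the norms.
  // Returns 0.0 when either vector is all zeros (assumed convention).
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Jaccard similarity: size of the intersection over size of the union.
  // Two empty sets are treated as identical (assumed convention).
  def jaccard[T](a: Set[T], b: Set[T]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size
}
```

With the two example vectors used later in the talk, `SimilarityMeasures.cosine(Seq(1.0, 1.0, 0.0), Seq(1.0, 0.0, 1.0))` evaluates to 0.5, and the Jaccard similarity of the token sets {short, test} and {long, test} is 1/3.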
  • 34. Similarity search - pipeline
      Input data → Data preprocessing (e.g. tokenization, text normalization) → Vectorization → Similarity check (similarity algorithm) → Result
      Example:
      “This’s a Short test” → [“short”, “test”]
      “This’s a not so long Test” → [“long”, “test”]
      Vectors: [1,1,0], [1,0,1] …
  • 35. Mateusz Dymczyk Prague, 23rd October 2015 Similarity search - distributed pipeline Input data Result Data preprocessing Node Data preprocessing Node Vectorization Node Vectorization Node Similarity check Node Similarity check Node Cluster
  • 36. Similarity search I
      • Brute-force solution:
          • pre-process text
          • vectorize (in our case TF-IDF)
          • compute all possible pairs
          • compute cosine similarity between each pair
  • 37. Vectorization: TF-IDF
      • Term Frequency–Inverse Document Frequency:
          • how important a word is for a document in a collection
          • higher when the word occurs often in a document
          • lower when the word is also common in the whole collection
      Example:
      “This’s a Short test” → [“short”, “test”]
      “This’s a not so long Test” → [“long”, “test”]
      Vectors: [1/6, 1/3, 0], [1/6, 0, 1/3] …
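To make the two opposing effects concrete, here is a small plain-Scala sketch of TF-IDF. It assumes term frequency normalised by document length and the smoothed inverse document frequency log((N + 1) / (df + 1)); other variants exist, so it is not meant to reproduce the exact numbers on the slide. All names are illustrative.

```scala
object TfIdfSketch {
  // Term frequency: share of the document's tokens that equal `term`.
  def tf(term: String, doc: Seq[String]): Double =
    if (doc.isEmpty) 0.0 else doc.count(_ == term).toDouble / doc.size

  // Smoothed inverse document frequency: log((N + 1) / (df + 1)),
  // where N is the corpus size and df the number of documents containing `term`.
  def idf(term: String, corpus: Seq[Seq[String]]): Double = {
    val df = corpus.count(_.contains(term))
    math.log((corpus.size + 1.0) / (df + 1.0))
  }

  // TF-IDF weight of `term` in `doc`, relative to `corpus`.
  def tfIdf(term: String, doc: Seq[String], corpus: Seq[Seq[String]]): Double =
    tf(term, doc) * idf(term, corpus)
}
```

For the corpus `Seq(Seq("short", "test"), Seq("long", "test"))`, the word "test" appears in every document and gets weight 0, while "short" gets 0.5 * ln(1.5) in the first document.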
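  • 38. TF-IDF
      import org.apache.spark.mllib.feature.{HashingTF, IDF}
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.rdd.RDD

      // One document per line, tokenized on spaces
      val documents: RDD[Seq[String]] = sc.textFile("...")
        .map(_.split(" ").toSeq)
      val hashingTF = new HashingTF()
      val tf: RDD[Vector] = hashingTF.transform(documents)
      // tf is used twice (fit and transform), so cache it
      tf.cache()
      val idf = new IDF().fit(tf)
      val tfidf: RDD[Vector] = idf.transform(tf)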
  • 39. Similarity search I
      type DocSimilarity = (String, Seq[(String, Double)])
      case class Document(id: Long, doc: String)

      // Normalizer, TfIdf and cosine are helpers that are not shown on the slide
      def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
        val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
        val vectorized: RDD[(Document, Vector)] = TfIdf.vectorize(normalized).cache()
        // Brute-force similarity: compare every unordered pair exactly once
        vectorized
          .cartesian(vectorized)
          .filter { case (doc1, doc2) => doc1._1.id < doc2._1.id }
          .flatMap { case (doc1, doc2) =>
            val similarity: Double = cosine(doc1._2, doc2._2)
            // Emit the pair in both directions, keyed by document text
            Seq(
              (doc1._1.doc, (doc2._1.doc, similarity)),
              (doc2._1.doc, (doc1._1.doc, similarity))
            )
          }
          // Collect all (other document, similarity) entries per document
          .combineByKey[Seq[(String, Double)]](
            (x: (String, Double)) => Seq(x),
            (acc: Seq[(String, Double)], y: (String, Double)) => y +: acc,
            (acc1: Seq[(String, Double)], acc2: Seq[(String, Double)]) => acc1 ++ acc2
          )
      }
      Example: “This’s a Short test” → [“short”, “test”] → [1,1,0], [1,0,1] …
  • 40. Similarity search I - problems
      • Compute all-pairs similarity:
          • O(n^2) comparisons
          • 10^6 documents
          • ~5*10^11 comparisons =>
          • ~6 days (10^3 comp/ms)
      • Data shuffle size: O(nL^2)
      • Largest reduce-key: O(n)
      n = number of docs, L = number of unique words in a doc
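The figures on this slide can be checked with a little arithmetic: n documents yield n(n - 1)/2 unordered pairs, and the running time follows from the assumed throughput of 10^3 comparisons per millisecond. A small sketch (the names are illustrative):

```scala
object AllPairsCost {
  // Number of unordered pairs among n documents: n * (n - 1) / 2.
  def pairs(n: Long): Long = n * (n - 1) / 2

  // Days needed for `comparisons` at `perMs` comparisons per millisecond.
  def days(comparisons: Long, perMs: Long): Double =
    comparisons.toDouble / perMs / 1000 / 60 / 60 / 24
}
```

`AllPairsCost.pairs(1000000L)` gives 499,999,500,000 (about 5*10^11), and `AllPairsCost.days(AllPairsCost.pairs(1000000L), 1000L)` comes to roughly 5.8 days on a single worker, which is the "~6 days" on the slide.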
  • 41. Why is data shuffle so bad?
      [Diagram of data transfer bandwidths; values shown: 50 GB/s, 100 MB/s, 100-600 MB/s, 1 GB/s, 0.3 GB/s]
  • 42. Mateusz Dymczyk Prague, 23rd October 2015 Similarity search II Input data Result Data preprocessing Node Data preprocessing Node Vectorization Node Vectorization Node Cluster Similarity check Node Similarity check Node Cluster Group by feature(s)
  • 43. Similarity search II
      • Problems:
          • What if there are no features to group by?
          • What if grouping produces clusters that are too big?
      • Solution: cluster anyway, but be smart about it!
  • 44. Locality sensitive hashing
      • Similar objects = same bucket (maximizes the % of collisions)
      • Group of algorithms (different similarity measures):
          • random projection for cosine
          • min-hash for Jaccard
          • …
      • Problems:
          • possibility of false positives and false negatives
          • double check the former, minimize the latter
          • might produce duplicate pairs!
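A minimal plain-Scala sketch of min-hash with banding, to illustrate the "similar objects land in the same bucket" idea for Jaccard similarity. It uses seeded MurmurHash3 functions as the hash family, which is an assumption made here; the linked Spark implementations may differ. All names are illustrative, and token sets must be non-empty.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

object MinHashSketch {
  // One seed per hash function; a fixed RNG seed keeps signatures reproducible.
  def seeds(numHashes: Int, rngSeed: Long = 42L): Vector[Int] = {
    val rng = new Random(rngSeed)
    Vector.fill(numHashes)(rng.nextInt())
  }

  // Min-hash signature: for every seeded hash function keep the minimum
  // hash value over the (non-empty) token set.
  def signature(tokens: Set[String], seeds: Vector[Int]): Vector[Int] =
    seeds.map(seed => tokens.map(t => MurmurHash3.stringHash(t, seed)).min)

  // The fraction of matching signature positions estimates Jaccard similarity.
  def estimate(a: Vector[Int], b: Vector[Int]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.size

  // Banding: cut the signature into bands of `rowsPerBand` values.
  // Two documents become a candidate pair if any (band index, rows) key matches.
  def bandKeys(sig: Vector[Int], rowsPerBand: Int): Vector[(Int, Vector[Int])] =
    sig.grouped(rowsPerBand).zipWithIndex.map { case (rows, i) => (i, rows) }.toVector
}
```

Candidate pairs found through shared band keys still need an exact similarity check to weed out false positives, and the same pair can show up under several bands, which is where the duplicate pairs mentioned on the slide come from.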
  • 45. Similarity search III
      type DocSimilarity = (String, Seq[(String, Double)])
      case class Document(id: Long, doc: String)

      // Normalizer, TfIdf and cosines are helpers that are not shown on the slide
      def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
        val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
        val vectorized: RDD[(Document, Vector)] = TfIdf.extract(normalized).cache()
        val lsh = new LSH(data = vectorized, p = 65537, m = 1000, numRows = 1000,
          numBands = 25, minClusterSize = 2)
        val model = lsh.run
        val clusters: RDD[(Long, Iterable[SparseVector])] = model.clusters
        // Exact cosine similarity only within each LSH cluster
        clusters.map { case (id, cluster) => cosines(cluster) }
      }
      • Sample implementations:
          • https://github.com/mrsqueeze/spark-hash (min-hash)
          • https://github.com/marufaytekin/lsh-spark (Charikar’s LSH for cosine)
  • 46. Similarity search - results
      INPUT
      “パウダーファンデーションのパフがすぐに汚れてしまう。” (“Powder foundation’s puff gets dirty really fast”)
      OUTPUT
      0.80 “パウダーをつけるパフがすぐに汚れる。” (“The puff gets dirty really fast after applying the powder.”)
      0.53 “パフがすぐに汚くなってしまう。” (“The puff gets dirty really fast.”)
      0.30 “パウダリーファンデーションをつけるためのスポンジというかパフ、すぐに汚れて、ファンデをつける時にきれいに伸ばせなくなる。” (“The sponge for applying the powdery foundation gets dirty really fast, when using the foundation it doesn’t spread nicely.”)
  • 47. Mateusz Dymczyk Prague, 23rd October 2015 Built-ins
  • 48. Clustering
      • Unsupervised learning problem which tries to group subsets of objects with one another based on some notion of similarity.
      • Supported algorithms: K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA)
      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      val data = sc.textFile("...")
      // KMeans.train expects an RDD[Vector], so wrap every parsed row
      val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
      // k = 2 clusters, at most 20 iterations
      val clusters = KMeans.train(parsedData, 2, 20)
      // point: the Vector to assign to a cluster
      val prediction = clusters.predict(point)
  • 49. Recommender systems
      • Collaborative filtering
      • User/product matrix predictions
      import org.apache.spark.mllib.recommendation.{ALS, Rating}

      // Input lines have the form user,item,rating
      val data = sc.textFile("...")
      val ratings = data.map(_.split(',') match {
        case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
      })
      // rank = 1, iterations = 20, lambda = 0.01
      val model = ALS.train(ratings, 1, 20, 0.01)
      val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
      val predictions = model.predict(usersProducts)
  • 50. (Logistic) Regression
      • iterative algorithm - greatly benefits from caching
      • often used for binary classification (can be generalised)
      import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
      import org.apache.spark.mllib.util.MLUtils

      // LIBSVM format: <label> <idx1>:<val1> <idx2>:<val2> ...
      val data = MLUtils.loadLibSVMFile(sc, "...").cache()
      val model = new LogisticRegressionWithLBFGS()
        .setNumClasses(10)
        .run(data)
      // pointToPredict: the feature Vector to classify
      model.predict(pointToPredict)
  • 51. Mateusz Dymczyk Prague, 23rd October 2015 Is everything OK?
  • 52. Mateusz Dymczyk Prague, 23rd October 2015 Supervised learning workflow Raw data Cleaned/ scaled data Training set Validating set Model creation Final model Validation Incoming new data
  • 53. Model evaluation
      • Certain ML algorithms create models
      • How do we know if the model we got is good (enough)?
      • Different types of evaluation depending on the ML algorithm type:
          • classification: precision and recall (based on true/false positives/negatives)
          • regression: metrics based on the difference between the predicted and the actual values
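As a rough illustration of how precision and recall follow from true/false positives and negatives, here is a plain-Scala sketch that works on (score, label) pairs. The names, and the conventions of treating label 1.0 as positive and returning 0.0 for an empty denominator, are assumptions made for this example.

```scala
object BinaryMetricsSketch {
  final case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int)

  // Count true/false positives/negatives for a given score threshold.
  // A score >= threshold counts as a positive prediction; label 1.0 is positive.
  def confusion(scoreAndLabels: Seq[(Double, Double)], threshold: Double): Confusion =
    scoreAndLabels.foldLeft(Confusion(0, 0, 0, 0)) { case (c, (score, label)) =>
      (score >= threshold, label == 1.0) match {
        case (true, true)   => c.copy(tp = c.tp + 1)
        case (true, false)  => c.copy(fp = c.fp + 1)
        case (false, true)  => c.copy(fn = c.fn + 1)
        case (false, false) => c.copy(tn = c.tn + 1)
      }
    }

  // Precision: share of positive predictions that are correct.
  def precision(c: Confusion): Double =
    if (c.tp + c.fp == 0) 0.0 else c.tp.toDouble / (c.tp + c.fp)

  // Recall: share of actual positives that were found.
  def recall(c: Confusion): Double =
    if (c.tp + c.fn == 0) 0.0 else c.tp.toDouble / (c.tp + c.fn)
}
```

Raising the threshold usually trades recall for precision, which is why it helps to look at both metrics per threshold.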
  • 54. Model evaluation
      import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
      import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.util.MLUtils

      val data = MLUtils.loadLibSVMFile(sc, "...")
      // 60% of the data for training, 40% for testing
      val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
      training.cache()
      val model = new LogisticRegressionWithLBFGS()
        .setNumClasses(2)
        .run(training)
      // Clear the threshold so that predict() returns raw scores
      model.clearThreshold
      val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
        val prediction = model.predict(features)
        (prediction, label)
      }
      val metrics = new BinaryClassificationMetrics(predictionAndLabels)
      val precision = metrics.precisionByThreshold
      precision.foreach { case (t, p) => println(s"Threshold: $t, Precision: $p") }
      val recall = metrics.recallByThreshold
      recall.foreach { case (t, r) => println(s"Threshold: $t, Recall: $r") }
  • 55. Mateusz Dymczyk Prague, 23rd October 2015 To remember…
  • 56. Common pitfalls
      1. Try to avoid groupByKey()
          • instead try reduceByKey()
      2. Don’t collect all the data in the driver:
          • collect() will copy all the elements to the driver node
          • instead persist it (file, DB)
      3. Use cache()/persist() where necessary (check Spark’s web UI)!
      4. Code for failure and handle malformed input!
      5. Remember about Serializable!
  • 57. Performance recap
      1. Parallelising (not concurrency!) makes us faster
      2. Network traffic makes us (really) slow
          1. keep data close to the processing units (stay local)
          2. take note of operation order
          3. don’t iterate more than necessary
      3. In-memory computation/caching helps a lot (especially in case of iterative machine learning!)
  • 58. Where to go from here
      • Get ideas: https://www.kaggle.com/wiki/DataScienceUseCases
      • Get started with Spark:
          • http://spark.apache.org/docs/latest/quick-start.html
          • https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
      • Get started with MLlib:
          • http://spark.apache.org/docs/latest/mllib-guide.html
          • https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
      • Try out other frameworks and courses:
          • https://github.com/h2oai/sparkling-water
          • https://www.coursera.org/course/mmds
      • Learn the basics:
          • https://www.coursera.org/learn/machine-learning
      • Practical books:
          • “Advanced Analytics with Spark”, Sandy Ryza et al., O’Reilly Media
          • “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark”, Mahmoud Parsian, O’Reilly Media
  • 59. Q&A Mateusz Dymczyk Prague, 23rd October 2015
  • 60. Mateusz Dymczyk Prague, 23rd October 2015 Streams
  • 61. Can I has stream?
      • Linear models (regression) can be trained in a streaming fashion (Spark 1.1+)
      • Clustering can be done on streams (with k-means)
      • What if the data changes over time? MLlib supports “forgetfulness”
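The “forgetfulness” is controlled by a decay factor. The sketch below shows the centre update rule as I understand it from the MLlib streaming k-means documentation: the old centre c with weight n is discounted by a decay factor a and blended with the mean x of the m points newly assigned to it. It is written in plain Scala with illustrative names, assumes m > 0, and is not the library's code.

```scala
object DecayedCentroid {
  // c: current centre, n: weight of the centre (points seen so far),
  // x: mean of the new batch's points assigned to this centre, m: their count,
  // a: decay factor (1.0 = remember everything, 0.0 = use only the newest batch).
  // Returns the updated centre and the updated weight n + m.
  def update(c: Vector[Double], n: Double,
             x: Vector[Double], m: Double, a: Double): (Vector[Double], Double) = {
    val denom = n * a + m
    val next = c.zip(x).map { case (ci, xi) => (ci * n * a + xi * m) / denom }
    (next, n + m)
  }
}
```

With a = 1.0 the update is a plain running mean over all data seen so far; with a = 0.0 the centre jumps to the mean of the latest batch, so older data is forgotten completely.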
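  • 62. Can I has stream?
      import org.apache.spark.mllib.clustering.StreamingKMeans
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint

      // ssc: an existing StreamingContext
      val trainingData = ssc.textFileStream("...").map(Vectors.parse)
      val testData = ssc.textFileStream("...").map(LabeledPoint.parse)
      val model = new StreamingKMeans()
        .setK(2)
        .setDecayFactor(1.0)       // 1.0 = use all data from the beginning
        .setRandomCenters(3, 0.0)  // 3-dimensional random centers with weight 0.0
      model.trainOn(trainingData)
      model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
      ssc.start()
      ssc.awaitTermination()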
  • 63. Mateusz Dymczyk Prague, 23rd October 2015 Still sounds like work….
  • 64. Seldon.io
      • open predictive platform
      • provides content recommendation and predictive functionality
  • 65. Prediction.io
      • open source ML server for building predictive engines
      • event collection, algorithms, evaluation and querying predictive results via REST
      • uses Hadoop, HBase, Spark and Elasticsearch