Apache Spark
Large-scale recommendations with Apache Spark and Python
Christian S. Perone
christian.perone@gmail.com
AGENDA
INTRODUCTION
Big Data
The Elephant
APACHE SPARK
Apache Spark Introduction
Resilient Distributed Datasets
Data Frames
Spark and Machine Learning
COLLABORATIVE FILTERING
Introduction
Factorization
Practice time
Q&A
WHO AM I
Christian S. Perone
Machine Learning/Software Engineer
Blog
http://blog.christianperone.com
Open-source projects
https://github.com/perone
Twitter @tarantulae
WHAT IS BIG DATA?
The future is data-driven
User-generated content
Online / streaming
Internet of Things (IoT)
We want to be able to handle this data: query it, build models,
make predictions, etc.
THE CASE AGAINST THE ELEPHANT
The truth is that Map-Reduce as a processing paradigm continues to be
severely restrictive, and is no more than a subset of richer processing
systems.
—Paper Trail, The Elephant was a Trojan Horse – 2014
(...) we don’t really use MapReduce anymore.
—Urs Hölzle, Google I/O Keynote (see context) – 2014
Every real distributed machine learning (ML) researcher/engineer knows
that MR is bad. ML algorithms are iterative and MR is not suited for
iterative algorithms, which is due to unnecessary frequent I/O (...).
—Kenneth Tran, On the imminent decline of MapReduce – 2014
The Mahout community decided to move its codebase onto modern data
processing systems that offer a richer programming model and more
efficient execution than Hadoop MapReduce. Mahout will therefore reject
new MapReduce algorithm implementations from now on (...).
—Mahout, Goodbye MapReduce – 2014
WHAT IS APACHE SPARK?
Apache Spark is a fast and expressive cluster computing system
compatible with Apache Hadoop.
It improves computation performance by means of:
In-memory computing primitives
General computation graphs
Spark has a rich API and bindings for Scala/Python/Java/R,
including an interactive shell for Python and Scala.
We will focus on the Python API.
APACHE SPARK - CONCEPTS
The main goal of Spark is to provide the user with an API to work
with distributed collections of data as if they were local. These
collections are called RDDs (Resilient Distributed Datasets).
Immutable collections of objects spread across a cluster
Built using parallel transformations (map, reduce, filter, group,
etc.)
RDDs can be rebuilt upon failure, and they are evaluated lazily
Controllable persistence for reuse (including caching in RAM)
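A minimal sketch of these properties in the PySpark shell (sc is the
SparkContext the shell provides; the data is just illustrative):
>>> rdd = sc.parallelize(range(1000))
>>> doubled = rdd.map(lambda x: x * 2).cache()  # lazy; marked for RAM reuse
>>> doubled.sum()  # the first action triggers the actual computation
999000
>>> doubled.count()  # a second action reuses the cached partitions
1000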
APACHE SPARK - TRANSFORMATIONS VS ACTIONS
The operations that can be applied on RDDs come in two main types:
TRANSFORMATIONS
Lazy operations that create new RDDs from other RDDs. Examples:
map, filter, union, distinct, etc.
ACTIONS
Operations that actually perform computation and return results or
write them to disk. Examples:
count, collect, first
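To see the laziness in practice, a small sketch with toy data in the
PySpark shell:
>>> rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> evens = rdd.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
>>> # nothing has been computed so far; collect() is the action
>>> # that makes Spark execute the whole chain:
>>> evens.collect()
[2, 4, 6, 8, 10]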
SPARK INTERACTIVE SHELL
./bin/pyspark --master local[4]
Creating an RDD from a list:
>>> data = [1, 2, 3, 4, 5, 6, 7, 8]
>>> rdd = sc.parallelize(data)
Creating an RDD from a file:
>>> rdd = sc.textFile("data.txt")
TRANSFORMATIONS AND ACTIONS
Filtering and counting a big log:
>>> rdd_log = sc.textFile('nginx_access.log')
>>> rdd_log.filter(lambda l: 'x.html' in l).count()
238
Collecting the interesting lines:
>>> rdd_log = sc.textFile('nginx_access.log')
>>> lines = rdd_log.filter(lambda l: 'x.html' in l).collect()
>>> lines
['201.140.8.128 [19/Jun/2012:09:17:31 +0100]
"GET /x.html HTTP/1.1"', (...)]
Breaking it down into separate steps:
>>> filter_rdd = rdd_log.filter(lambda l: 'x.html' in l)
>>> filter_rdd.count()
238
APACHE SPARK - RDDS VS DATAFRAMES
RDDs are usually not very intuitive to read for complex
computations; they describe how Spark is going to do the
computation rather than what you want to do.
They also miss some important optimizations, especially in
PySpark.
That's why DataFrames are so awesome.
APACHE SPARK - DATAFRAMES
DataFrames provide a DSL for structured data manipulation.
Very similar to Pandas DataFrames (they also contain methods for
converting between the two).
Can load data from JSON/Parquet/libsvm/etc.
The optimizer is able to look inside operations.
./bin/pyspark --master local[4]
Creating a DataFrame from a JSON file:
>>> df = spark.read.json("example.json")
Filter by a column:
>>> df.filter(df["User"]=="Perone").count()
120
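A few more operations from the same DSL, as a sketch against the same
hypothetical example.json (the "User" column is carried over from the
example above):
>>> df.printSchema()  # schema inferred from the JSON input
>>> df.select("User").distinct().count()  # number of distinct users
>>> df.groupBy("User").count().show()  # rows per user, printed as a table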
APACHE SPARK - CATALYST BASICS
Powers both SQL queries and the DataFrame API.
Extensible query optimizer.
Catalyst represents expressions and query plans as trees and
optimizes them by applying pattern-matching transformation rules
(Scala):
Add(Attribute(x), Add(Literal(1), Literal(2)))
tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
  case Add(left, Literal(0)) => left
  case Add(Literal(0), right) => right
}
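From PySpark you can ask Catalyst to show its work: explain(True)
prints the parsed, analyzed, and optimized logical plans plus the
physical plan. A sketch reusing the hypothetical df from the earlier
DataFrame example:
>>> df.filter(df["User"] == "Perone").explain(True)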
APACHE SPARK - SPARK.ML VS SPARK.MLLIB
As of Spark 2.0, the RDD-based APIs in the spark.mllib package
have entered maintenance mode. The primary Machine Learning
API for Spark is now the DataFrame-based API in the spark.ml
package.
—http://spark.apache.org/docs/latest/ml-guide.html
MLlib will still support the RDD-based API with bug fixes.
No new features will be added to the RDD-based API.
In the Spark 2.x releases, MLlib will add features to the
DataFrames-based API to reach feature parity with the RDD-based API.
After reaching feature parity (roughly estimated for Spark 2.2), the
RDD-based API will be deprecated.
The RDD-based API is expected to be removed in Spark 3.0.
APACHE SPARK - ML
The goal of spark.ml is to make practical machine learning scalable
and easy. At a high level, it provides tools such as:
ML Algorithms: common learning algorithms such as classification,
regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation,
dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML
pipelines (see the sketch after this list)
Persistence: saving and loading models, pipelines, etc.
Utilities: linear algebra, statistics, data handling, etc.
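As an illustration of the Pipelines item, a minimal sketch of a
spark.ml text-classification pipeline; Tokenizer, HashingTF, and
LogisticRegression are standard spark.ml stages, while train_df and
test_df are assumed DataFrames with "text" and "label" columns:
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import Tokenizer, HashingTF
>>> from pyspark.ml.classification import LogisticRegression
>>> tokenizer = Tokenizer(inputCol="text", outputCol="words")
>>> hashing_tf = HashingTF(inputCol="words", outputCol="features")
>>> lr = LogisticRegression(maxIter=10)
>>> pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
>>> model = pipeline.fit(train_df)  # fits all stages in order
>>> model.transform(test_df).select("text", "prediction").show()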
Word2vec example using spark.ml:
>>> from pyspark.ml.feature import Word2Vec
>>> documents = [
... ("Hi I heard about Spark".split(" "), ),
... ("I wish Java could use case classes".split(" "), ),
... ("Logistic regression models are neat".split(" "), )
... ]
>>> documentDF = spark.createDataFrame(documents, ["text"])
>>> documentDF.take(1)
[Row(text=[u'Hi', u'I', u'heard', u'about', u'Spark'])]
>>> word2Vec = Word2Vec(vectorSize=3, minCount=0,
... inputCol="text", outputCol="result")
>>> model = word2Vec.fit(documentDF)
>>> result = model.transform(documentDF)
>>> result.take(1)
[Row(text=[u'Hi', u'I', u'heard', u'about', u'Spark'],
result=DenseVector([-0.0168, 0.0042, -0.0308]))]
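The fitted model can also be queried: findSynonyms returns the
vectors closest to a given word (outputs omitted here, since they
depend on the tiny training corpus above):
>>> model.findSynonyms("Spark", 2).show()
>>> model.getVectors().show()  # the learned word vectors as a DataFrame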
COLLABORATIVE FILTERING
Collaborative filtering methods are based on collecting and
analyzing large amounts of information on users' behaviors,
activities, or preferences, and predicting what users will like
based on their similarity to other users.
Doesn't rely on item content, unlike content-based methods
(an advantage for complex items)
Doesn't need item/user metadata
Suffers from the "new item" problem
Cold start
EXPLICIT FACTORIZATION
Approximate the ratings matrix by the product of two low-rank factor
matrices (user factors x and item factors y):

[Figure: a users × items ratings matrix with many missing "?" entries
(rows are users such as "Christian", columns are items such as AC/DC's
"Back in Black") ≈ the product of a user-factor matrix x and an
item-factor matrix y.]

OPTIMIZATION

\min_{x,y} \sum_{u,i} \left( r_{ui} - x_u^\top y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

* biases omitted
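In spark.ml this factorization is implemented by ALS (Alternating
Least Squares). A minimal sketch, assuming hypothetical training and
test ratings DataFrames with userId/movieId/rating columns (the
column names are assumptions); regParam corresponds to the λ
regularization term in the objective above:
>>> from pyspark.ml.recommendation import ALS
>>> from pyspark.ml.evaluation import RegressionEvaluator
>>> als = ALS(rank=10, maxIter=10, regParam=0.1,
...           userCol="userId", itemCol="movieId", ratingCol="rating")
>>> model = als.fit(training)  # learns the user and item factor matrices
>>> predictions = model.transform(test)  # adds a "prediction" column
>>> evaluator = RegressionEvaluator(metricName="rmse",
...                                 labelCol="rating",
...                                 predictionCol="prediction")
>>> evaluator.evaluate(predictions)  # RMSE on the held-out ratings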
LET'S DO IT
Practice time!
Notebook at: https://github.com/perone/spark-als-intro
Load/parse data
Pandas integration, sampling, plotting
Spark SQL
Split data (train/test)
Build model
Train model
Evaluate model
Have fun!