Spark - Philly JUG

Spark
Brian O’Neill (@boneill42)
Monetate

Agenda
● History / Context
○ Hadoop
○ Lambda
● Spark Basics
○ RDDs, DataFrames, SQL, Streaming
● Play along / Demo

We work at Monetate...
[Architecture diagram. Labels: consumer, Client (e.g. Retailer), Decision Engine, Data, Analytics Engine, marketer, Dashboard, Warehouse, Meta, Observations]

We call it a...
Personalization Platform
Not so hard until...
m’s → B’s (sessions / month)
100ms’s → 10ms’s (response times)
days → minutes (analytics lag)

map / reduce
tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]

word count
The Code
# Ruby-style pseudocode; emit is assumed to be provided by the framework.
def map(doc)
  doc.split.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values)
  sum = values.inject(0) { |acc, x| acc + x }
  emit(key, sum)
end
The Run
doc1 = "boy meets girl"
doc2 = "girl likes boy"
map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
map(doc2) -> (girl, 1), (likes, 1), (boy, 1)
reduce(boy, [1, 1]) -> (boy, 2)
reduce(girl, [1, 1]) -> (girl, 2)
reduce(likes, [1]) -> (likes, 1)
reduce(meets, [1]) -> (meets, 1)
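
The same run can be sketched in plain Java, with no Hadoop or Spark involved. This is only an illustration under assumptions: the class name WordCountSketch, the Tuple type, and the in-memory grouping step (standing in for the framework's shuffle) are all hypothetical, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // A (key, value) tuple, as on the map / reduce slide. Hypothetical type.
    static final class Tuple {
        final String key;
        final int value;
        Tuple(String key, int value) { this.key = key; this.value = value; }
    }

    // map(doc) -> tuple[]: emit (word, 1) for every word in the document.
    static List<Tuple> map(String doc) {
        List<Tuple> out = new ArrayList<>();
        for (String word : doc.trim().split("\\s+")) {
            if (!word.isEmpty()) out.add(new Tuple(word, 1));
        }
        return out;
    }

    // reduce(key, value[]) -> tuple: sum the counts collected for one key.
    static Tuple reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return new Tuple(key, sum);
    }

    // Driver: map every document, group values by key (an in-memory
    // stand-in for the shuffle), then reduce each group.
    static Map<String, Integer> run(List<String> docs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String doc : docs) {
            for (Tuple t : map(doc)) {
                grouped.computeIfAbsent(t.key, k -> new ArrayList<>()).add(t.value);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Tuple t = reduce(e.getKey(), e.getValue());
            counts.put(t.key, t.value);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {boy=2, girl=2, likes=1, meets=1}
        System.out.println(run(Arrays.asList("boy meets girl", "girl likes boy")));
    }
}
```

In a real cluster the map calls and the reduce calls run on different machines, and the grouping in run() is the part the framework does for you.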

Let’s invent some terminology...

Lambda on Spark (e.g.)
[Diagram labels: S3, Kafka, MySQL, RDD, RDD, DataFrame, Druid]

Concept : RDDs
“Spark revolves around the concept of a resilient distributed
dataset (RDD), which is a fault-tolerant collection of
elements that can be operated on in parallel. There are two
ways to create RDDs: parallelizing an existing collection in
your driver program, or referencing a dataset in an external
storage system, such as a shared filesystem, HDFS,
HBase, or any data source offering a Hadoop InputFormat.”
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds

Concept : Transformations & Actions
Transformation:
RDD(s) → RDD
e.g. map, filter, groupBy, etc.
Action:
RDD → value
e.g. reduce, count, etc.
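
Spark transformations are lazy: they only describe a new RDD, and nothing runs until an action asks for a value. Java's own streams behave in a similar way, so the split can be sketched without a cluster. This is an analogy only, not Spark's API; the class LazyPipelineSketch and its mapCalls counter are hypothetical names made up for the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyPipelineSketch {
    // Counts how many elements the map step has actually processed.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // "Transformations": Stream -> Stream. Building the pipeline computes nothing.
    static Stream<Integer> transform(List<Integer> data) {
        return data.stream()
                .map(x -> { mapCalls.incrementAndGet(); return x * x; })
                .filter(x -> x % 2 == 0);
    }

    // "Action": Stream -> value. This is what triggers the work.
    static int sum(Stream<Integer> pipeline) {
        return pipeline.reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = transform(Arrays.asList(1, 2, 3, 4));
        System.out.println("map calls before action: " + mapCalls.get()); // 0
        System.out.println("sum of even squares: " + sum(pipeline));      // 20
        System.out.println("map calls after action: " + mapCalls.get());  // 4
    }
}
```

One difference worth keeping in mind: a Java stream can be consumed once, while an RDD can be reused by several actions.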

Code: RDDs
JavaPairRDD<Integer, Product> productsRDD = javaFunctions(sc)
        .cassandraTable("java_api", "products", productReader)
        .keyBy(new Function<Product, Integer>() {
            @Override
            public Integer call(Product product) throws Exception {
                return product.getId();
            }
        });

Concept : DataFrames
DataFrames = RDD + Schema
“A DataFrame is a distributed collection of data organized
into named columns. It is conceptually equivalent to a table
in a relational database or a data frame in R/Python, but
with richer optimizations under the hood. DataFrames can
be constructed from a wide array of sources such as:
structured data files, tables in Hive, external databases, or
existing RDDs.”
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes

Concept : Spark SQL
SELECT
    min(event_time) AS start_time,
    max(event_time) AS end_time,
    account_id
FROM events
GROUP BY account_id
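
To make clear what that query computes, here is a rough plain-Java equivalent over an in-memory list. It is a sketch under assumptions: the Event class is a hypothetical stand-in for a row of the events table, event_time is assumed to be a numeric timestamp, and none of this is how Spark SQL executes the query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SessionBoundsSketch {
    // Hypothetical row of the events table: only the two columns the query uses.
    static final class Event {
        final String accountId;
        final long eventTime;
        Event(String accountId, long eventTime) {
            this.accountId = accountId;
            this.eventTime = eventTime;
        }
    }

    // GROUP BY account_id, tracking min and max of event_time per group.
    static Map<String, LongSummaryStatistics> bounds(List<Event> events) {
        return events.stream().collect(Collectors.groupingBy(
                (Event e) -> e.accountId,
                TreeMap::new,
                Collectors.summarizingLong((Event e) -> e.eventTime)));
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("a", 5), new Event("a", 1),
                new Event("b", 7), new Event("a", 9));
        for (Map.Entry<String, LongSummaryStatistics> e : bounds(events).entrySet()) {
            // start_time = min(event_time), end_time = max(event_time)
            System.out.println(e.getKey() + " start_time=" + e.getValue().getMin()
                    + " end_time=" + e.getValue().getMax());
        }
    }
}
```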

Code: SQL + DataFrames
StructType schema = configuration.getSchemaForProduct();
DataFrame dataFrame = sqlContext.createDataFrame(productsRDD, schema);
sqlContext.registerDataFrameAsTable(dataFrame, "products");

And remember Uncle Ben…
“With great power, comes great responsibility.”

Concept : Streaming
.foreachRDD

Code: Streaming
JavaStreamingContext streamingContext = new JavaStreamingContext(getSparkConf(),
        SessionizerState.getConfig().getSparkStreamingBatchDuration());
JavaReceiverInputDStream<byte[]> kinesisStream = KinesisUtils.createStream(...);
kinesisStream.foreachRDD(new VoidFunction<JavaRDD<byte[]>>() {
    @Override
    public void call(JavaRDD<byte[]> rdd) throws Exception {
        // Decode each raw Kinesis record as UTF-8 text before processing the batch.
        JavaRDD<String> lines = rdd.map(new Function<byte[], String>() {
            @Override
            public String call(byte[] bytes) throws IOException {
                return new String(bytes, Charset.forName("UTF-8"));
            }
        });
        processRdd(lines);
    }
});
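
The shape of that loop (a stream arrives as a sequence of small batches, and the same handler runs once per batch) can be sketched in plain Java. This is a toy model under assumptions, not Spark Streaming: MicroBatchSketch, decode and forEachBatch are hypothetical names, and the batches here are just in-memory lists rather than RDDs.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class MicroBatchSketch {
    // Decode one batch of raw records to UTF-8 strings (the rdd.map step).
    static List<String> decode(List<byte[]> batch) {
        List<String> lines = new ArrayList<>();
        for (byte[] bytes : batch) {
            lines.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return lines;
    }

    // Hypothetical stand-in for foreachRDD: hand each decoded batch,
    // in arrival order, to the same handler.
    static void forEachBatch(List<List<byte[]>> batches, Consumer<List<String>> handler) {
        for (List<byte[]> batch : batches) {
            handler.accept(decode(batch));
        }
    }

    public static void main(String[] args) {
        List<List<byte[]>> batches = Arrays.asList(
                Arrays.asList("boy meets girl".getBytes(StandardCharsets.UTF_8)),
                Arrays.asList("girl likes boy".getBytes(StandardCharsets.UTF_8)));
        forEachBatch(batches, lines -> System.out.println("batch: " + lines));
    }
}
```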

Basic Architecture
http://spark.apache.org/docs/latest/cluster-overview.html

Kinesis / Streaming Architecture

Get stuff...
Get Spark...
http://spark.apache.org/downloads.html
Get Cassandra…
http://cassandra.apache.org/download/
Get Code…
https://github.com/boneill42/spark-on-cassandra-quickstart

Configure stuff...
$spark/conf> cp spark-env.sh.template spark-env.sh
$spark/conf> echo "SPARK_MASTER_IP=127.0.0.1" >> spark-env.sh

Start stuff...
# Start Master
$spark> sbin/start-master.sh
$spark> tail -f logs/*
# Start Worker
$spark> bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://127.0.0.1:7077

Build and launch stuff...
# Build
$code> mvn clean install
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
# Launch
$code> spark-submit --class com.github.boneill42.JavaDemo \
    --master spark://127.0.0.1:7077 \
    target/spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://127.0.0.1:7077 127.0.0.1

Advertisements...
https://github.com/monetate/koupler
https://github.com/monetate/ectou-metadata
https://github.com/monetate/ectou-export
