Spark
Brian O’Neill (@boneill42)
Monetate
Agenda
● History / Context
  ○ Hadoop
  ○ Lambda
● Spark Basics
  ○ RDDs, Dataframe, SQL, Streaming
● Play along / Demo
We work at Monetate...
[Diagram labels: Client (e.g. Retailer), Decision Engine, Data, Analytics Engine, consumer, marketer, Dashboard, Warehouse, Meta, Observations]
We call it a...
Personalization Platform
Not so hard until...
m’s → B’s (sessions / month)
100ms’s → 10ms’s (response times)
days → minutes (analytics lag)
HISTORY
history - hadoop
map / reduce
tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]
word count
The Code
# emit(key, value) is provided by the MapReduce framework
def map(doc)
  doc.split.each do |word|
    emit(word, 1)
  end
end
def reduce(key, values)
  sum = values.inject(0) { |acc, x| acc + x }
  emit(key, sum)
end
The Run
doc1 = "boy meets girl"
doc2 = "girl likes boy"
map (doc1) -> (boy, 1), (meets, 1), (girl, 1)
map (doc2) -> (girl, 1), (likes, 1), (boy, 1)
reduce (boy, [1, 1]) -> (boy, 2)
reduce (girl, [1, 1]) -> (girl, 2)
reduce (likes, [1]) -> (likes, 1)
reduce (meets, [1]) -> (meets, 1)
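The same run can be sketched in plain Java, with no Hadoop involved. This is only an illustration of the map / shuffle / reduce steps above: the class name `WordCountSketch` and the `shuffle` helper are made up for this sketch, and a real framework would distribute each phase across machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {

    // map phase: emit one (word, 1) tuple per word in the document
    static List<Map.Entry<String, Integer>> map(String doc) {
        List<Map.Entry<String, Integer>> tuples = new ArrayList<>();
        for (String word : doc.trim().split("\\s+")) {
            tuples.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return tuples;
    }

    // shuffle (hypothetical helper): group emitted values by key,
    // which the framework normally does between the two phases
    static Map<String, List<Integer>> shuffle(List<String> docs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String doc : docs) {
            for (Map.Entry<String, Integer> t : map(doc)) {
                grouped.computeIfAbsent(t.getKey(), k -> new ArrayList<>()).add(t.getValue());
            }
        }
        return grouped;
    }

    // reduce phase: sum the values collected for one key
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // run all three steps over a list of documents
    static Map<String, Integer> wordCount(List<String> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(docs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {boy=2, girl=2, likes=1, meets=1}
        System.out.println(wordCount(Arrays.asList("boy meets girl", "girl likes boy")));
    }
}
```

The TreeMap is used only so the output order is stable; it is not part of the map / reduce model.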
Jobs on top of jobs...
Real-time? Different hammer.
Let’s invent some terminology...
Traditional lambda...
Can we collapse the lambda?
Spark - FTW!
Lambda on Spark (e.g.)
[Diagram labels: S3, Kafka, MySQL, RDD, RDD, Dataframe, Druid]
SPARK BASICS
Concept : RDDs
“Spark revolves around the concept of a resilient distributed
dataset (RDD), which is a fault-tolerant collection of
elements that can be operated on in parallel. There are two
ways to create RDDs: parallelizing an existing collection in
your driver program, or referencing a dataset in an external
storage system, such as a shared filesystem, HDFS,
HBase, or any data source offering a Hadoop InputFormat.”
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
Concept : Transformations & Actions
Transformation:
RDD(s) → RDD
e.g. map, filter, groupBy, etc.
Action:
RDD → value
e.g. reduce, count, etc.
Code: RDDs
JavaPairRDD<Integer, Product> productsRDD = javaFunctions(sc)
        .cassandraTable("java_api", "products", productReader)
        .keyBy(new Function<Product, Integer>() {
            @Override
            public Integer call(Product product) throws Exception {
                return product.getId();
            }
        });
DAGs
Lazily evaluated!
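Lazy evaluation is easy to see without a cluster. The sketch below is an analogy, not Spark code: it uses plain Java streams, where intermediate operations (map, filter) play the role of transformations and a terminal operation (reduce) plays the role of an action. The class name `LazyPipelineSketch` and the `log` list are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class LazyPipelineSketch {

    // Records every element the pipeline has actually touched.
    static final List<String> log = new ArrayList<>();

    // "Transformations": describe the work, but run nothing yet.
    static Stream<Integer> pipeline(List<Integer> data) {
        return data.stream()
                .map(x -> { log.add("map " + x); return x * 2; })
                .filter(x -> x > 2);
    }

    public static void main(String[] args) {
        Stream<Integer> doubled = pipeline(Arrays.asList(1, 2, 3));
        // prints: after transformations: 0 elements processed
        System.out.println("after transformations: " + log.size() + " elements processed");

        // "Action": the terminal operation forces the whole chain to execute.
        int sum = doubled.reduce(0, Integer::sum);
        // prints: sum = 10, elements processed: 3
        System.out.println("sum = " + sum + ", elements processed: " + log.size());
    }
}
```

Spark goes further than this analogy: it records the chain of transformations as a DAG and can plan the whole job before running it.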
Concept : DataFrames
DataFrames = RDD + Schema
“A DataFrame is a distributed collection of data organized
into named columns. It is conceptually equivalent to a table
in a relational database or a data frame in R/Python, but
with richer optimizations under the hood. DataFrames can
be constructed from a wide array of sources such as:
structured data files, tables in Hive, external databases, or
existing RDDs.”
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
Concept : Spark SQL
SELECT
min(event_time) AS start_time,
max(event_time) AS end_time,
account_id
FROM events GROUP BY account_id
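To make clear what the query computes, here is the same aggregation as a plain Java sketch over an in-memory list. It does not use Spark; the `Event` class, its two fields, and the sample values are assumptions made for the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SessionBoundsSketch {

    // Hypothetical row of the events table: only the columns the query uses.
    static class Event {
        final int accountId;
        final long eventTime;

        Event(int accountId, long eventTime) {
            this.accountId = accountId;
            this.eventTime = eventTime;
        }
    }

    // GROUP BY account_id, tracking min and max event_time per group.
    static Map<Integer, LongSummaryStatistics> bounds(List<Event> events) {
        return events.stream().collect(Collectors.groupingBy(
                e -> e.accountId,
                TreeMap::new,
                Collectors.summarizingLong(e -> e.eventTime)));
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(1, 100L), new Event(2, 250L),
                new Event(1, 400L), new Event(2, 200L));
        // prints:
        // account_id=1 start_time=100 end_time=400
        // account_id=2 start_time=200 end_time=250
        bounds(events).forEach((account, stats) -> System.out.println(
                "account_id=" + account
                + " start_time=" + stats.getMin()
                + " end_time=" + stats.getMax()));
    }
}
```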
Code: SQL + Dataframes
StructType schema = configuration.getSchemaForProduct();
DataFrame dataFrame = sqlContext.createDataFrame(productsRDD, schema);
sqlContext.registerDataFrameAsTable(dataFrame, "products");
And remember Uncle Ben…
“With great power, comes great responsibility.”
Concept : Streaming
.foreachRDD
Code: Streaming
JavaStreamingContext streamingContext = new JavaStreamingContext(getSparkConf(),
        SessionizerState.getConfig().getSparkStreamingBatchDuration());
JavaReceiverInputDStream<byte[]> kinesisStream = KinesisUtils.createStream(...);
kinesisStream.foreachRDD(new VoidFunction<JavaRDD<byte[]>>() {
    @Override
    public void call(JavaRDD<byte[]> rdd) throws Exception {
        // Decode each raw Kinesis record into a UTF-8 string
        JavaRDD<String> lines = rdd.map(new Function<byte[], String>() {
            @Override
            public String call(byte[] bytes) throws IOException {
                return new String(bytes, Charset.forName("UTF-8"));
            }
        });
        processRdd(lines);
    }
});
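The shape of that callback can be tried out without Spark or Kinesis. The sketch below is a stand-in, not the streaming API: each inner list plays the role of one micro-batch (one RDD handed to foreachRDD), and `processBatch` does the same bytes-to-string decoding. The class name `MicroBatchSketch` and the sample records are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class MicroBatchSketch {

    // Per-record step from the slide: raw bytes to a UTF-8 string.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Stand-in for the body of foreachRDD: handle one batch of records.
    static List<String> processBatch(List<byte[]> batch) {
        List<String> lines = new ArrayList<>();
        for (byte[] record : batch) {
            lines.add(decode(record));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two hypothetical micro-batches, as if cut from the stream by the batch duration.
        List<List<byte[]>> stream = Arrays.asList(
                Arrays.asList("a,1".getBytes(StandardCharsets.UTF_8),
                              "b,2".getBytes(StandardCharsets.UTF_8)),
                Collections.singletonList("c,3".getBytes(StandardCharsets.UTF_8)));
        // prints [a,1, b,2] and then [c,3]
        for (List<byte[]> batch : stream) {
            System.out.println(processBatch(batch));
        }
    }
}
```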
DEPLOYMENT
Basic Architecture
http://spark.apache.org/docs/latest/cluster-overview.html
Kinesis / Streaming Architecture
Amazon’s EMR
Play along demo.
Get stuff...
Get Spark...
http://spark.apache.org/downloads.html
Get Cassandra…
http://cassandra.apache.org/download/
Get Code…
https://github.com/boneill42/spark-on-cassandra-quickstart
Configure stuff...
$spark/conf> cp spark-env.sh.template spark-env.sh
$spark/conf> echo "SPARK_MASTER_IP=127.0.0.1" >> spark-env.sh
Start stuff...
# Start Master
$spark> sbin/start-master.sh
$spark> tail -f logs/*
# Start Worker
$spark> bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://127.0.0.1:7077
Build and launch stuff...
# Build
$code> mvn clean install
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
# Launch
$code> spark-submit --class com.github.boneill42.JavaDemo \
    --master spark://127.0.0.1:7077 \
    target/spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    spark://127.0.0.1:7077 127.0.0.1
A message from our sponsor
Advertisements...
https://github.com/monetate/koupler
https://github.com/monetate/ectou-metadata
https://github.com/monetate/ectou-export

Spark - Philly JUG


Editor's Notes

  1. https://gist.github.com/cirla/030a1a4e2b6075d2fc5d