Of all the developers’ delights, none is more attractive than a set of APIs that makes developers productive, that is easy to use, and that is intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing, letting you operate on large datasets in languages such as Scala, Java, Python, and R for distributed big data processing at scale. In this talk, I will explore the evolution of the three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set, as best practices; 2) their performance and optimization benefits; and 3) the scenarios in which to use DataFrames and Datasets instead of RDDs for your distributed big data processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and how to interoperate among them. (This is a vocalization of the blog post, along with the latest developments in the Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
16. Some code to read Wikipedia
val rdd = sc.textFile("/mnt/wikipediapagecounts.gz")
// Each line is whitespace-separated: project page numRequests contentSize
val parsedRDD = rdd.flatMap {
  line => line.split("""\s+""") match {
    case Array(project, page, numRequests, _) => Some((project, page, numRequests.toLong))
    case _ => None
  }
}
// Filter only English pages; count requests per page and print 100 of them.
parsedRDD.filter { case (project, page, numRequests) => project == "en" }
  .map { case (_, page, numRequests) => (page, numRequests) }
  .reduceByKey(_ + _)
  .take(100)
  .foreach { case (page, requests) => println(s"$page: $requests") }
17. When to Use RDDs?
• ... Low-level API & control of the dataset
• ... Dealing with unstructured data (media streams or texts)
• ... Manipulating data with lambda functions rather than a DSL (see the sketch after this list)
• ... Don’t care about the schema or structure of the data
• ... Willing to sacrifice optimization and performance, and accept some inefficiency
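As an illustration of the lambda-driven style above, here is a minimal sketch (the path /mnt/notes.txt is hypothetical; sc is the usual SparkContext) of manipulating unstructured text with plain Scala functions rather than a DSL:
// Word count over free-form text using only RDD lambdas.
val lines = sc.textFile("/mnt/notes.txt")   // hypothetical unstructured text file
val counts = lines
  .flatMap(line => line.toLowerCase.split("""\W+"""))
  .filter(word => word.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(20).foreach { case (word, n) => println(s"$word: $n") }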
19. What’s the problem?
• ... Express the how-to of a solution, not the what-to
• ... Not optimized by Spark
• ... Slow for non-JVM languages like Python
• ... Inadvertent inefficiencies
20. Inadvertent inefficiencies in RDDs
parsedRDD.filter { case (project, page, numRequests) => project == "en" }
  .map { case (_, page, numRequests) => (page, numRequests) }
  .reduceByKey(_ + _)
  .filter { case (page, _) => !isSpecialPage(page) }   // filtering only after the shuffle has aggregated every page
  .take(100)
  .foreach { case (page, requests) => println(s"$page: $requests") }
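In the chain above, the special-page filter runs only after reduceByKey has already shuffled and aggregated every page. With RDDs you must spot and fix that yourself; a minimal hand-reordered sketch (same parsedRDD and user-supplied isSpecialPage as above) applies both filters before the shuffle:
parsedRDD
  .filter { case (project, page, _) => project == "en" && !isSpecialPage(page) }   // filter before shuffling
  .map { case (_, page, numRequests) => (page, numRequests) }
  .reduceByKey(_ + _)
  .take(100)
  .foreach { case (page, requests) => println(s"$page: $requests") }
DataFrames and Datasets let Catalyst perform this kind of reordering for you.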
25. DataFrame API code.
// Convert RDD -> DF with column names (needs import spark.implicits._)
import org.apache.spark.sql.functions.sum
val df = parsedRDD.toDF("project", "page", "numRequests")
// filter, groupBy, and sum inside agg()
df.filter($"project" === "en")
  .groupBy($"page")
  .agg(sum($"numRequests").as("count"))
  .limit(100)
  .show(100)
project  page  numRequests
en       23    45
en       24    200
26. Take DataFrame → SQL Table → Query
df.createOrReplaceTempView("edits")
val results = spark.sql("""SELECT page, sum(numRequests) AS count
  FROM edits WHERE project = 'en' GROUP BY page LIMIT 100""")
results.show(100)
project  page  numRequests
en       23    45
en       24    200
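Because spark.sql() returns a DataFrame, SQL and the DataFrame DSL interoperate freely. A minimal sketch, assuming the results DataFrame from the query above and import spark.implicits._ for the $ column syntax:
import org.apache.spark.sql.functions.desc
results
  .filter($"count" > 100)    // keep pages with more than 100 requests
  .orderBy(desc("count"))
  .show(10)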
27. Easy to write code... Believe it!
from pyspark.sql.functions import avg
dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])
dataDF = dataRDD.toDF(["name", "age"])
# Using RDD code to compute aggregate average
(dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        .map(lambda kv: (kv[0], kv[1][0] / kv[1][1])))
# Using DataFrame
dataDF.groupBy("name").agg(avg("age"))
name  age
Jim   20
Anne  31
Jim   30
28. Why structured APIs?
RDD
data.map { case (dept, age) => dept -> (age, 1) }
  .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
  .map { case (dept, (age, c)) => dept -> age / c }
DataFrame
data.groupBy("dept").avg("age")
SQL
select dept, avg(age) from data group by 1
29. Using Catalyst in Spark SQL
[Catalyst pipeline diagram: a SQL AST, DataFrame, or Dataset becomes an Unresolved Logical Plan; Analysis (consulting the Catalog) resolves it to a Logical Plan; Logical Optimization yields an Optimized Logical Plan; Physical Planning generates candidate Physical Plans, from which a Cost Model picks the Selected Physical Plan; Code Generation turns it into RDDs.]
Analysis: analyze the logical plan to resolve references, using the Catalog
Logical Optimization: apply rule-based optimizations to the logical plan
Physical Planning: generate one or more physical plans and select one with the cost model
Code Generation: compile parts of the query to Java bytecode
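You can watch Catalyst at work on your own query with explain(). A minimal sketch, assuming the df DataFrame and the sum import from the earlier DataFrame slide:
df.filter($"project" === "en")
  .groupBy($"page")
  .agg(sum($"numRequests").as("count"))
  .explain(true)   // extended = true prints the logical plans and the selected physical plan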
30. DataFrame Optimization: Physical Plan with Predicate Pushdown and Column Pruning
users.join(events, users("id") === events("uid"))
  .filter(events("date") > "2015-01-01")
[Plan diagram: the logical plan joins a scan of the users table with a scan of the events file and filters afterward; the selected physical plan pushes the filter below the join into optimized scans of users and events, pruning unneeded columns.]
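A minimal sketch of how you might observe the pushdown, assuming users and events are DataFrames read from hypothetical Parquet paths with the id/uid/date columns used above:
val users  = spark.read.parquet("/mnt/users.parquet")    // hypothetical path
val events = spark.read.parquet("/mnt/events.parquet")   // hypothetical path
val joined = users.join(events, users("id") === events("uid"))
  .filter(events("date") > "2015-01-01")
// The physical plan should show the date predicate pushed into the events scan
// (e.g. as PushedFilters on the Parquet scan) rather than applied after the join.
joined.explain(true)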
31. Dataset API in Spark 2.x
Type-safe: operate on domain objects with compiled lambda functions
case class Person(name: String, age: Int)

val df = spark.read.json("people.json")
// Convert data to domain objects (needs import spark.implicits._).
val ds: Dataset[Person] = df.as[Person]
val filterDS = ds.filter(p => p.age > 3)
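To make the type-safety point concrete, here is a minimal sketch using the Person case class and the ds/df values defined above: mistakes in Dataset lambdas are caught by the Scala compiler, while the equivalent DataFrame mistake only fails at runtime during Catalyst analysis:
// Compiles: p is a Person, so fields and their types are checked at compile time.
val names = ds.map(p => p.name.toUpperCase)
// Does NOT compile: 'salary' is not a member of Person.
// val bad = ds.map(p => p.salary)
// Compiles, but fails only at runtime when Catalyst cannot resolve the column.
// val badDF = df.select("salary")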
43. About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple