Spark SQL
10 Things You Need to Know
About Algebraix
● Located in Encinitas, CA & Austin, TX
● We work on a technology called Data Algebra
● We hold nine patents in this technology
● Create turnkey performance enhancement for db engines
● We’re working on a product called Algebraix Query Accelerator
● The first public release of the product focuses on Apache Spark
● The product will be available in Amazon Web Services and Azure
About Me
● I’m a long time developer turned product person
● I love music (playing music and listening to music)
● I grew up in Palm Springs, CA
● Development: Full stack developer, UX developer, API dev, etc
● Product: worked as UX Lead, Director of Product Innovation, VP Product
● Certs: AWS Solutions Architect, AWS Developer
● Education: UCI - Political Science, Chemistry
About You
● Does anyone here use:
○ Spark or Spark SQL? In Production?
○ Amazon Web Services?
○ SQL Databases?
○ NoSQL Databases?
○ Distributed Databases?
● What brings you in here?
○ Want to become a rockstar at work?
○ Trying to do analytics at a larger scale?
○ Want exposure to the new tools in the industry?
Goals of This Talk
For you to be able to get up and running quickly with real-world data and produce
applications, while avoiding the major pitfalls that most newcomers to Spark SQL
encounter.
1. Spark SQL use cases
2. Loading data: in the cloud vs locally, RDDs vs DataFrames
3. SQL vs the DataFrame API. What’s the difference?
4. Schemas: implicit vs explicit schemas, data types
5. Loading & saving results
6. What SQL functionality works and what doesn’t?
7. Using SQL for ETL
8. Working with JSON data
9. Reading from and writing to external SQL databases
10. Testing your work in the real world
Spark SQL: 10 things you should know
Notes
How to read what I’m writing
When I write >>> it means that I’m talking to the pyspark console and what follows immediately afterward is the output.
>>> aBunchOfData.count()
901150
Otherwise, I’m just demonstrating pyspark code.
When I show you this guy ^^ it means we’re done with this section, and feel free to ask questions before we move on. =D
1
Spark SQL Use Cases
What, when, why
● Ad-hoc querying of data in files
● ETL capabilities alongside familiar SQL
● Live SQL analytics over streaming data
● Interaction with external databases
● Scalable query performance with larger clusters
2
Loading Data
Cloud vs local, RDD vs DF

You can load data directly into a DataFrame and begin querying it relatively quickly. Otherwise you’ll need to load data into an RDD and transform it first.

# loading data into an RDD in Spark 2.0
sc = spark.sparkContext
oneSysLog = sc.textFile("file:/var/log/system.log")
allSysLogs = sc.textFile("file:/var/log/system.log*")
allLogs = sc.textFile("file:/var/log/*.log")

# let's count the lines in each RDD
>>> oneSysLog.count()
8339
>>> allSysLogs.count()
47916
>>> allLogs.count()
546254

That’s great, but you can’t query this. You’ll need to convert the data to Rows, add a schema, and convert it to a DataFrame.

# import Row, map the RDD, and create a DataFrame
from pyspark.sql import Row
sc = spark.sparkContext
allSysLogs = sc.textFile("file:/var/log/system.log*")
logsRDD = allSysLogs.map(lambda logRow: Row(log=logRow))
logsDF = spark.createDataFrame(logsRDD)

Once the data is converted to at least a DataFrame with a schema, you can talk SQL to the data.

# write some SQL
logsDF = spark.createDataFrame(logsRDD)
logsDF.createOrReplaceTempView("logs")
>>> spark.sql("SELECT * FROM logs LIMIT 1").show()
+--------------------+
|                 log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+

But you can also load certain types of data and store it directly as a DataFrame, which lets you get to SQL quickly. Both JSON and Parquet formats can be loaded as a DataFrame straightaway because they contain enough schema information to do so.

# load parquet straight into a DF, and write some SQL
logsDF = spark.read.parquet("file:/logs.parquet")
logsDF.createOrReplaceTempView("logs")
>>> spark.sql("SELECT * FROM logs LIMIT 1").show()
+--------------------+
|                 log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+

In fact, Spark now even has support for querying parquet files directly! Easy peasy!

# query a parquet file directly with SQL
>>> spark.sql("""
SELECT * FROM parquet.`path/to/logs.parquet` LIMIT 1
""").show()
+--------------------+
|                 log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+

That’s one aspect of loading data. The other aspect is using the protocols for cloud storage (i.e. s3://). In some cloud ecosystems, support for their storage protocol comes installed already.

# i.e. on AWS EMR, s3:// is installed already.
sc = spark.sparkContext
decemberLogs = sc.textFile("s3://acme-co/logs/2016/12/")

# count the lines in all of the December 2016 logs in S3
>>> decemberLogs.count()
910125081250
# wow, such logs. Ur poplar.

Sometimes you actually need to provide support for those protocols yourself if your VM’s OS doesn’t have it already.

my-linux-shell$ pyspark --packages \
    com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1

>>> rdd = sc.textFile("s3a://acme-co/path/to/files")
>>> rdd.count()
# note: "s3a" and not "s3" -- "s3" is specific to AWS EMR.

Now you should have several ways to load data to quickly start writing SQL with Apache Spark.
3
Dataframe Functions vs SQL
What is a Dataframe?
SQL-like functions in the Dataframe API (a small sketch comparing the two follows below)
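The original slides for this section were visual, so here’s a minimal sketch of the idea, assuming the peopleDF DataFrame built in the Schemas section below: the same query can be written in SQL or with the DataFrame API’s SQL-like functions (select, filter, groupBy, agg, orderBy).

# a minimal sketch (assumes peopleDF from the Schemas section): SQL vs DataFrame API
from pyspark.sql import functions as F

peopleDF.createOrReplaceTempView("people")

# SQL version
spark.sql("""
  SELECT state, COUNT(*) AS n
  FROM people
  GROUP BY state
  ORDER BY n DESC
""").show()

# DataFrame API version -- each method mirrors a SQL clause
(peopleDF
    .groupBy("state")
    .agg(F.count("*").alias("n"))
    .orderBy(F.col("n").desc())
    .show())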
4
Schemas
Inferred vs explicit

Schemas can be inferred, i.e. guessed, by Spark. With inferred schemas, you usually end up with a bunch of strings and ints. If you have more specific needs, supply your own schema.

# sample data - "people.txt"
1|Kristian|Algebraix Data|San Diego|CA
2|Pat|Algebraix Data|San Diego|CA
3|Lebron|Cleveland Cavaliers|Cleveland|OH
4|Brad|Self Employed|Hollywood|CA

# load as RDD and map it to a row with multiple fields
rdd = sc.textFile("file:/people.txt")
def mapper(line):
    s = line.split("|")
    return Row(id=s[0], name=s[1], company=s[2], state=s[4])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
# full syntax: .createDataFrame(peopleRDD, schema)

# we didn't actually pass anything into that 2nd param.
# yet, behind the scenes, there's still a schema.
>>> peopleDF.printSchema()
root
 |-- company: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

Spark SQL can certainly handle queries where `id` is a string, but what if we don’t want it to be?

# cast the id field in the mapper and the inferred schema follows along
rdd = sc.textFile("file:/people.txt")
def mapper(line):
    s = line.split("|")
    return Row(id=int(s[0]), name=s[1], company=s[2], state=s[4])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
>>> peopleDF.printSchema()
root
 |-- company: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

You can also provide a schema yourself, which will be more authoritative.

# supply an explicit schema; return the fields as a tuple in schema order
# so each field lines up with (and type-checks against) the schema below
import pyspark.sql.types as types
rdd = sc.textFile("file:/people.txt")
def mapper(line):
    s = line.split("|")
    return (int(s[0]), s[1], s[2], s[4])
schema = types.StructType([
    types.StructField('id', types.IntegerType(), False)
    , types.StructField('name', types.StringType())
    , types.StructField('company', types.StringType())
    , types.StructField('state', types.StringType())
])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD, schema)

>>> peopleDF.printSchema()
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- company: string (nullable = true)
 |-- state: string (nullable = true)

And what are the available types?

# http://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/types.html
__all__ = ["DataType", "NullType", "StringType",
    "BinaryType", "BooleanType", "DateType", "TimestampType",
    "DecimalType", "DoubleType", "FloatType", "ByteType",
    "IntegerType", "LongType", "ShortType", "ArrayType",
    "MapType", "StructField", "StructType"]

**Gotcha alert** Spark doesn’t seem to care when you leave dates as strings.

# Spark SQL handles this just fine as if they were
# legit date objects.
spark.sql("""
SELECT * FROM NewHires n WHERE n.start_date > "2016-01-01"
""").show()

Now you know about inferred and explicit schemas, and the available types you can use. (If you do want real date semantics, see the sketch below.)
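As a minimal sketch of working around that gotcha: the NewHires view and its start_date column come from the slide above, everything else here is illustrative. You can cast the string column to a real date and compare against a date value.

# cast the string column to a real DateType column, then compare against a date
from pyspark.sql import functions as F

newHires = spark.table("NewHires")
typed = newHires.withColumn("start_date", F.to_date("start_date"))
typed.createOrReplaceTempView("NewHiresTyped")
spark.sql("""
  SELECT * FROM NewHiresTyped
  WHERE start_date > CAST('2016-01-01' AS DATE)
""").show()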
5
Loading & Saving Results
Types, performance, & considerations

Loading and saving is fairly straightforward. Save your DataFrames in your desired format.

# picking up where we left off
peopleDF = spark.createDataFrame(peopleRDD, schema)
peopleDF.write.save("s3://acme-co/people.parquet",
    format="parquet")  # format= defaults to parquet if omitted
# formats: json, parquet, jdbc, orc, libsvm, csv, text

When you read, some types preserve schema. Parquet keeps the full schema, JSON has inferrable schema, and JDBC pulls in schema.

# read from stored parquet
peopleDF = spark.read.parquet("s3://acme-co/people.parquet")

# read from stored JSON
peopleDF = spark.read.json("s3://acme-co/people.json")
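One more of the formats listed above, as a minimal sketch; the path and options here are illustrative, not from the talk.

# read CSV with a header row and inferred column types
peopleCSV = (spark.read
    .option("header", "true")       # first line holds column names
    .option("inferSchema", "true")  # guess types (costs an extra pass over the data)
    .csv("file:/people.csv"))
peopleCSV.printSchema()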
6
SQL Function Coverage
What works and what doesn’t

Spark 1.6
● Limited support for subqueries and various other noticeable SQL functionality
● Runs roughly half of the 99 TPC-DS benchmark queries
● More SQL support in HiveContext

Spark 2.0, in Databricks’ words:
● SQL2003 support
● Runs all 99 of the TPC-DS benchmark queries
● A native SQL parser that supports both ANSI-SQL as well as Hive QL
● Native DDL command implementations
● Subquery support, including
  ○ Uncorrelated scalar subqueries
  ○ Correlated scalar subqueries
  ○ NOT IN predicate subqueries (in WHERE/HAVING clauses)
  ○ IN predicate subqueries (in WHERE/HAVING clauses)
  ○ (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
● View canonicalization support
● In addition, when building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.

http://www.slideshare.net/databricks/apache-spark-20-faster-easier-and-smarter
7
Using Spark SQL for ETL
Tips and tricks

These are some of the things we learned to start doing after working with Spark for a while.

Tip 1: In production, break your applications into smaller apps as steps, i.e. the “pipeline pattern”.

Tip 2: When tinkering locally, save a small version of the dataset via Spark and test against that (see the sketch below).

Tip 3: If using EMR, create a cluster with your desired steps, prove it works, then export a CLI command to reproduce it, and run it in Data Pipeline to start recurring pipelines / jobs.
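A minimal sketch of Tip 2; the paths, sampling fraction, and dataset names are illustrative, not from the talk.

# sample a big dataset once, save the small copy locally, and develop against that
bigDF = spark.read.parquet("s3a://acme-co/big-dataset.parquet")
smallDF = bigDF.sample(withReplacement=False, fraction=0.001, seed=42)
smallDF.write.mode("overwrite").parquet("file:/tmp/small-dataset.parquet")

# later, while tinkering locally:
devDF = spark.read.parquet("file:/tmp/small-dataset.parquet")
devDF.createOrReplaceTempView("dev_data")
spark.sql("SELECT COUNT(*) FROM dev_data").show()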
8
Working with JSON
Quick and easy

JSON data is most easily read in as line-delimited JSON objects*
{"n":"sarah","age":29}
{"n":"steve","age":45}

Schema is inferred upon load. Unlike other lazy operations, this will cause some work to be done.

Access nested objects with dot syntax:
SELECT field.subfield FROM json

Access arrays with inline array syntax:
SELECT col[1], col[3] FROM json

(A small sketch of both appears at the end of this section.)

If you want to flatten your JSON data, use the explode method (works in both the DF API and SQL).

# json explode example
>>> spark.read.json("file:/json.explode.json").createOrReplaceTempView("json")
>>> spark.sql("SELECT * FROM json").show()
+----+----------------+
|   x|               y|
+----+----------------+
|row1| [1, 2, 3, 4, 5]|
|row2|[6, 7, 8, 9, 10]|
+----+----------------+
>>> spark.sql("SELECT x, explode(y) FROM json").show()
+----+---+
|   x|col|
+----+---+
|row1|  1|
|row1|  2|
|row1|  3|
|row1|  4|
|row1|  5|
|row2|  6|
|row2|  7|
|row2|  8|
|row2|  9|
|row2| 10|
+----+---+

* For multi-line JSON files, you’ve got to do much more:

# a list of data from files.
import re
files = sc.wholeTextFiles("data.json")
# each tuple is (path, jsonData)
rawJSON = files.map(lambda x: x[1])
# sanitize the data: strip whitespace (including newlines)
cleanJSON = rawJSON.map(
    lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE)
)
# finally, you can then read that in as "JSON"
spark.read.json(cleanJSON)
# PS -- the same goes for XML.
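A minimal sketch of the dot and array syntax mentioned above; the nested.json file and its fields are hypothetical.

# assumes a line-delimited file with records like:
# {"name":"sarah","address":{"city":"San Diego","state":"CA"},"scores":[90,85,77]}
nestedDF = spark.read.json("file:/nested.json")
nestedDF.createOrReplaceTempView("nested")

# dot syntax for nested objects, [] syntax for arrays
spark.sql("""
  SELECT name, address.city, scores[0] AS first_score
  FROM nested
""").show()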
9
External Databases
Reading and Writing

To read from an external database, you’ve got to have your JDBC connectors (jar) handy. To pass a jar package into Spark, use the --jars flag when starting pyspark or spark-submit.

# pass the JDBC driver jar when starting the shell
my-linux-shell$ pyspark \
    --jars /path/to/mysql-jdbc.jar

# note: you can also add the path to your jar in the
# spark.defaults config file via these settings:
#   spark.driver.extraClassPath
#   spark.executor.extraClassPath

Once you’ve got your connector jars successfully imported, you can read an existing database into your Spark application or Spark shell as a DataFrame.

# line broken for readability
sqlURL = ("jdbc:mysql://<db-host>:<port>"
    "?user=<user>"
    "&password=<pass>"
    "&rewriteBatchedStatements=true"  # omg use this.
    "&continueBatchOnError=true")     # increases performance.
df = spark.read.jdbc(url=sqlURL, table="<db>.<table>")
df.createOrReplaceTempView("myDB")
spark.sql("SELECT * FROM myDB").show()

If you’ve done some work and built or manipulated a DataFrame, you can write it back to a database with df.write.jdbc (see the sketch below). Be prepared, it can take a while.

Also, be warned, save modes in Spark can be a bit destructive. “Overwrite” doesn’t just overwrite your data, it overwrites your schemas too. Say goodbye to those precious indices.
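A minimal sketch of that write path; the host, credentials, and table name are placeholders.

# write a DataFrame out over JDBC; "overwrite" drops and recreates the table,
# schema and indices included, so "append" is often the safer choice
df.write.jdbc(
    url="jdbc:mysql://<db-host>:<port>/<db>?user=<user>&password=<pass>",
    table="people_out",
    mode="append"
)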
10
Real World Testing
Things to consider

● If testing locally, do not load data from S3 or other similar types of cloud storage.
● Construct your applications as much as you can in advance. Cloudy clusters are expensive.
● You can test a lot of your code reliably with a 1-node cluster.
● Get really comfortable using .parallelize() to create dummy data (see the sketch below).
● If you’re using big data, and many nodes, don’t use .collect() unless you intend to pull all of that data back to the driver.
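A minimal sketch of the .parallelize() tip; the Row fields here are illustrative.

# create dummy data on the driver and turn it into a queryable DataFrame
from pyspark.sql import Row

dummyRDD = spark.sparkContext.parallelize([
    Row(id=1, name="test-user-1", state="CA"),
    Row(id=2, name="test-user-2", state="OH"),
])
dummyDF = spark.createDataFrame(dummyRDD)
dummyDF.createOrReplaceTempView("dummy")
spark.sql("SELECT COUNT(*) FROM dummy WHERE state = 'CA'").show()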
Final
Questions?
AMA
Thank you!
Bonus: AQA Demo
Hiring: Big Data Engineer
More Related Content

What's hot

DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalystdatamantra
 
Spark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesSpark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesTodd McGrath
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Haoyuan Li
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons Provectus
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 

What's hot (20)

DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Spark sql
Spark sqlSpark sql
Spark sql
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Spark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesSpark SQL with Scala Code Examples
Spark SQL with Scala Code Examples
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
 
Spark etl
Spark etlSpark etl
Spark etl
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Spark sql
Spark sqlSpark sql
Spark sql
 

Viewers also liked

Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Linked Data Driven Data Virtualization for Web-scale Integration
Linked Data Driven Data Virtualization for Web-scale IntegrationLinked Data Driven Data Virtualization for Web-scale Integration
Linked Data Driven Data Virtualization for Web-scale Integrationrumito
 
Virtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data Sources
Virtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data SourcesVirtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data Sources
Virtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data Sourcesrumito
 
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OOVirtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OOPaolo Cristofaro
 
Oracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagridOracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagridEmiliano Pecis
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyWes McKinney
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAiougVizagChapter
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesAlfredo Abate
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesNAYATech
 
Cloudera Showcase: SQL-on-Hadoop
Cloudera Showcase: SQL-on-HadoopCloudera Showcase: SQL-on-Hadoop
Cloudera Showcase: SQL-on-HadoopCloudera, Inc.
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 

Viewers also liked (20)

Spark sql meetup
Spark sql meetupSpark sql meetup
Spark sql meetup
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Linked Data Driven Data Virtualization for Web-scale Integration
Linked Data Driven Data Virtualization for Web-scale IntegrationLinked Data Driven Data Virtualization for Web-scale Integration
Linked Data Driven Data Virtualization for Web-scale Integration
 
Virtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data Sources
Virtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data SourcesVirtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data Sources
Virtuoso Sponger - RDFizer Middleware for creating RDF from non RDF Data Sources
 
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OOVirtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
 
Oracle Coherence
Oracle CoherenceOracle Coherence
Oracle Coherence
 
Ayasdi strata
Ayasdi strataAyasdi strata
Ayasdi strata
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Oracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagridOracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagrid
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
 
Cloudera Showcase: SQL-on-Hadoop
Cloudera Showcase: SQL-on-HadoopCloudera Showcase: SQL-on-Hadoop
Cloudera Showcase: SQL-on-Hadoop
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 

Similar to Spark SQL - 10 Things You Need to Know

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Spark SQL - 10 Things You Need to Know

  • 19. Loading Data You can load data directly into a DataFrame, and begin querying it relatively quickly. Otherwise you’ll need to load data into an RDD and transform it first. Cloud vs local, RDD vs DF 2
  • 20. Loading Data You can load data directly into a DataFrame, and begin querying it relatively quickly. Otherwise you’ll need to load data into an RDD and transform it first. Cloud vs local, RDD vs DF # loading data into an RDD in Spark 2.0 sc = spark.sparkContext oneSysLog = sc.textFile(“file:/var/log/system.log”) allSysLogs = sc.textFile(“file:/var/log/system.log*”) allLogs = sc.textFile(“file:/var/log/*.log”) 2
  • 21. Loading Data Cloud vs local, RDD vs DF # loading data into an RDD in Spark 2.0 sc = spark.sparkContext oneSysLog = sc.textFile(“file:/var/log/system.log”) allSysLogs = sc.textFile(“file:/var/log/system.log*”) allLogs = sc.textFile(“file:/var/log/*.log”) # lets count the lines in each RDD >>> oneSysLog.count() 8339 >>> allSysLogs.count() 47916 >>> allLogs.count() 546254 You can load data directly into a DataFrame, and begin querying it relatively quickly. Otherwise you’ll need to load data into an RDD and transform it first. 2
  • 22. Loading Data Cloud vs local, RDD vs DF # loading data into an RDD in Spark 2.0 sc = spark.sparkContext oneSysLog = sc.textFile(“file:/var/log/system.log”) allSysLogs = sc.textFile(“file:/var/log/system.log*”) allLogs = sc.textFile(“file:/var/log/*.log”) # lets count the lines in each RDD >>> oneSysLog.count() 8339 >>> allSysLogs.count() 47916 >>> allLogs.count() 546254 That’s great, but you can’t query this. You’ll need to convert the data to Rows, add a schema, and convert it to a dataframe. you’ll need to load data into an RDD and transform it first. 2
  • 23. Loading Data Cloud vs local, RDD vs DF allLogs = sc.textFile(“file:/var/log/*.log”) # lets count the lines in each RDD >>> oneSysLog.count() 8339 >>> allSysLogs.count() 47916 >>> allLogs.count() 546254 That’s great, but you can’t query this. You’ll need to convert the data to Rows, add a schema, and convert it to a dataframe. # import Row, map the rdd, and create dataframe from pyspark.sql import Row sc = spark.sparkContext allSysLogs = sc.textFile(“file:/var/log/system.log*”) logsRDD = allSysLogs.map(lambda logRow: Row(log=logRow)) logsDF = spark.createDataFrame(logsRDD) 2
  • 24. Loading Data Cloud vs local, RDD vs DF >>> allLogs.count() 546254 That’s great, but you can’t query this. You’ll need to convert the data to Rows, add a schema, and convert it to a dataframe. # import Row, map the rdd, and create dataframe from pyspark.sql import Row sc = spark.sparkContext allSysLogs = sc.textFile(“file:/var/log/system.log*”) logsRDD = allSysLogs.map(lambda logRow: Row(log=logRow)) logsDF = spark.createDataFrame(logsRDD) Once the data is converted to at least a DataFrame with a schema, now you can talk SQL to the data. 2
  • 25. Loading Data Cloud vs local, RDD vs DF from pyspark.sql import Row sc = spark.sparkContext allSysLogs = sc.textFile(“file:/var/log/system.log*”) logsRDD = allSysLogs.map(lambda logRow: Row(log=logRow)) logsDF = spark.createDataFrame(logsRDD) Once the data is converted to at least a DataFrame with a schema, now you can talk SQL to the data. # write some SQL logsDF = spark.createDataFrame(logsRDD) logsDF.createOrReplaceTempView(“logs”) >>> spark.sql(“SELECT * FROM logs LIMIT 1”).show() +--------------------+ | log| +--------------------+ |Jan 6 16:37:01 (...| +--------------------+ 2
  • 26. Loading Data Cloud vs local, RDD vs DF Once the data is converted to at least a DataFrame with a schema, now you can talk SQL to the data. # write some SQL logsDF = spark.createDataFrame(logsRDD) logsDF.createOrReplaceTempView(“logs”) >>> spark.sql(“SELECT * FROM logs LIMIT 1”).show() +--------------------+ | log| +--------------------+ |Jan 6 16:37:01 (...| +--------------------+ But, you can also load certain types of data and store it directly as a DataFrame. This allows you to get to SQL quickly. 2
  • 27. Loading Data Cloud vs local, RDD vs DF # write some SQL logsDF = spark.createDataFrame(logsRDD) logsDF.createOrReplaceTempView(“logs”) >>> spark.sql(“SELECT * FROM logs LIMIT 1”).show() +--------------------+ | log| +--------------------+ |Jan 6 16:37:01 (...| +--------------------+ But, you can also load certain types of data and store it directly as a DataFrame. This allows you to get to SQL quickly. Both JSON and Parquet formats can be loaded as a DataFrame straightaway because they contain enough schema information to do so. 2
  • 28. Loading Data Cloud vs local, RDD vs DF # load parquet straight into DF, and write some SQL logsDF = spark.read.parquet(“file:/logs.parquet”) logsDF.createOrReplaceTempView(“logs”) >>> spark.sql(“SELECT * FROM logs LIMIT 1”).show() +--------------------+ | log| +--------------------+ |Jan 6 16:37:01 (...| +--------------------+ But, you can also load certain types of data and store it directly as a DataFrame. This allows you to get to SQL quickly. Both JSON and Parquet formats can be loaded as a DataFrame straightaway because they contain enough schema information to do so. 2
  • 29. Loading Data Cloud vs local, RDD vs DF # load parquet straight into DF, and write some SQL logsDF = spark.read.parquet(“file:/logs.parquet”) logsDF.createOrReplaceTempView(“logs”) >>> spark.sql(“SELECT * FROM logs LIMIT 1”).show() +--------------------+ | log| +--------------------+ |Jan 6 16:37:01 (...| +--------------------+ Both JSON and Parquet formats can be loaded as a DataFrame straightaway because they contain enough schema information to do so. In fact, now they even have support for querying parquet files directly! Easy peasy! 2
  • 30. Loading Data Cloud vs local, RDD vs DF >>> spark.sql(“SELECT * FROM logs LIMIT 1”).show() +--------------------+ | log| +--------------------+ |Jan 6 16:37:01 (...| +--------------------+ In fact, now they even have support for querying parquet files directly! Easy peasy! # load parquet straight into DF, and write some SQL >>> spark.sql(“”” SELECT * FROM parquet.`path/to/logs.parquet` LIMIT 1 ”””).show() +--------------------+ | log| +--------------------+ |Jan 6 16:37:01 (...| +--------------------+ 2
  • 31. Loading Data Cloud vs local, RDD vs DF In fact, now they even have support for querying parquet files directly! Easy peasy! # load parquet straight into DF, and write some SQL >>> spark.sql(“”” SELECT * FROM parquet.`path/to/logs.parquet` LIMIT 1 ”””).show() +--------------------+ | log| +--------------------+ |Jan 6 16:37:01 (...| +--------------------+ That’s one aspect of loading data. The other aspect is using the protocols for cloud storage (i.e. s3://). In some cloud ecosystems, support for their storage protocol comes installed already. 2
  • 32. Loading Data Cloud vs local, RDD vs DF +--------------------+ |Jan 6 16:37:01 (...| +--------------------+ That’s one aspect of loading data. The other aspect is using the protocols for cloud storage (i.e. s3://). In some cloud ecosystems, support for their storage protocol comes installed already. # i.e. on AWS EMR, s3:// is installed already. sc = spark.sparkContext decemberLogs = sc.textFile(“s3://acme-co/logs/2016/12/”) # count the lines in all of the december 2016 logs in S3 >>> decemberLogs.count() 910125081250 # wow, such logs. Ur popular. 2
  • 33. Loading Data Cloud vs local, RDD vs DF aspect is using the protocols for cloud storage (i.e. s3://). In some cloud ecosystems, support for their storage protocol comes installed already. # i.e. on AWS EMR, s3:// is installed already. sc = spark.sparkContext decemberLogs = sc.textFile(“s3://acme-co/logs/2016/12/”) # count the lines in all of the december 2016 logs in S3 >>> decemberLogs.count() 910125081250 # wow, such logs. Ur popular. Sometimes you actually need to provide support for those protocols if your VM’s OS doesn’t have it already. 2
  • 34. Loading Data Cloud vs local, RDD vs DF # count the lines in all of the december 2016 logs in S3 >>> decemberLogs.count() 910125081250 # wow, such logs. Ur popular. Sometimes you actually need to provide support for those protocols if your VM’s OS doesn’t have it already. my-linux-shell$ pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 demo2.py >>> rdd = sc.textFile(“s3a://acme-co/path/to/files”) rdd.count() # note: “s3a” and not “s3” -- “s3” is specific to AWS EMR. 2
  • 35. Loading Data Cloud vs local, RDD vs DF Sometimes you actually need to provide support for those protocols if your VM’s OS doesn’t have it already. my-linux-shell$ pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 >>> rdd = sc.textFile(“s3a://acme-co/path/to/files”) rdd.count() # note: “s3a” and not “s3” -- “s3” is specific to AWS EMR. Now you should have several ways to load data to quickly start writing SQL with Apache Spark. 2
  • 36. Loading Data Cloud vs local, RDD vs DF my-linux-shell$ pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 >>> rdd = sc.textFile(“s3a://acme-co/path/to/files”) rdd.count() # note: “s3a” and not “s3” -- “s3” is specific to AWS EMR. Now you should have several ways to load data to quickly start writing SQL with Apache Spark. 2
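To tie the loading pieces together, here is a minimal end-to-end sketch; the bucket, prefix, and credential values are hypothetical, and setting the Hadoop configuration through pyspark’s internal _jsc handle is a common workaround rather than an official API.

# configure s3a credentials (skip this on EMR, where s3:// is already wired up)
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

# text files -> RDD of Rows -> DataFrame -> temp view -> SQL
from pyspark.sql import Row
logsRDD = spark.sparkContext.textFile("s3a://acme-co/logs/2016/12/")
logsDF = spark.createDataFrame(logsRDD.map(lambda line: Row(log=line)))
logsDF.createOrReplaceTempView("logs")
spark.sql("SELECT count(*) AS line_count FROM logs").show()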
  • 50. Schemas Inferred vs explicit Schemas can be inferred, i.e. guessed, by spark. With inferred schemas, you usually end up with a bunch of strings and ints. If you have more specific needs, supply your own schema. 4
  • 51. Schemas Inferred vs explicit Schemas can be inferred, i.e. guessed, by spark. With inferred schemas, you usually end up with a bunch of strings and ints. If you have more specific needs, supply your own schema. # sample data - “people.txt” 1|Kristian|Algebraix Data|San Diego|CA 2|Pat|Algebraix Data|San Diego|CA 3|Lebron|Cleveland Cavaliers|Cleveland|OH 4|Brad|Self Employed|Hollywood|CA 4
  • 52. Schemas Inferred vs explicit # load as RDD and map it to a row with multiple fields rdd = sc.textFile(“file:/people.txt”) def mapper(line): s = line.split(“|”) return Row(id=s[0],name=s[1],company=s[2],state=s[4]) peopleRDD = rdd.map(mapper) peopleDF = spark.createDataFrame(peopleRDD) # full syntax: .createDataFrame(peopleRDD, schema) With inferred schemas, you usually end up with a bunch of strings and ints. If you have more specific needs, supply your own schema. # sample data - “people.txt” 1|Kristian|Algebraix Data|San Diego|CA 2|Pat|Algebraix Data|San Diego|CA 3|Lebron|Cleveland Cavaliers|Cleveland|OH 4|Brad|Self Employed|Hollywood|CA 4
  • 53. Schemas Inferred vs explicit # load as RDD and map it to a row with multiple fields rdd = sc.textFile(“file:/people.txt”) def mapper(line): s = line.split(“|”) return Row(id=s[0],name=s[1],company=s[2],state=s[4]) peopleRDD = rdd.map(mapper) peopleDF = spark.createDataFrame(peopleRDD) # full syntax: .createDataFrame(peopleRDD, schema) 3|Lebron|Cleveland Cavaliers|Cleveland|OH 4|Brad|Self Employed|Hollywood|CA # we didn’t actually pass anything into that 2nd param. # yet, behind the scenes, there’s still a schema. >>> peopleDF.printSchema() Root |-- company: string (nullable = true) |-- id: string (nullable = true) |-- name: string (nullable = true) |-- state: string (nullable = true) 4
  • 54. Schemas Inferred vs explicit def mapper(line): s = line.split(“|”) return Row(id=s[0],name=s[1],company=s[2],state=s[4]) peopleRDD = rdd.map(mapper) peopleDF = spark.createDataFrame(peopleRDD) # full syntax: .createDataFrame(peopleRDD, schema) # we didn’t actually pass anything into that 2nd param. # yet, behind the scenes, there’s still a schema. >>> peopleDF.printSchema() Root |-- company: string (nullable = true) |-- id: string (nullable = true) |-- name: string (nullable = true) |-- state: string (nullable = true) Spark SQL can certainly handle queries where `id` is a string, but it should be an int. 4
  • 55. Schemas Inferred vs explicit Spark SQL can certainly handle queries where `id` is a string, but what if we don’t want it to be? # load as RDD and map it to a row with multiple fields rdd = sc.textFile(“file:/people.txt”) def mapper(line): s = line.split(“|”) return Row(id=int(s[0]),name=s[1],company=s[2],state=s[4]) peopleRDD = rdd.map(mapper) peopleDF = spark.createDataFrame(peopleRDD) >>> peopleDF.printSchema() Root |-- company: string (nullable = true) |-- id: long (nullable = true) |-- name: string (nullable = true) |-- state: string (nullable = true) 4
  • 56. Schemas Inferred vs explicit # load as RDD and map it to a row with multiple fields rdd = sc.textFile(“file:/people.txt”) def mapper(line): s = line.split(“|”) return Row(id=int(s[0]),name=s[1],company=s[2],state=s[4]) peopleRDD = rdd.map(mapper) peopleDF = spark.createDataFrame(peopleRDD) >>> peopleDF.printSchema() Root |-- company: string (nullable = true) |-- id: long (nullable = true) |-- name: string (nullable = true) |-- state: string (nullable = true) You can actually provide a schema, too, which will be more authoritative. 4
  • 57. Schemas Inferred vs explicit You can actually provide a schema, too, which will be more authoritative. # load as RDD and map it to a row with multiple fields import pyspark.sql.types as types rdd = sc.textFile(“file:/people.txt”) def mapper(line): s = line.split(“|”) return Row(id=s[0],name=s[1],company=s[2],state=s[4]) schema = types.StructType([ types.StructField(‘id',types.IntegerType(), False) ,types.StructField(‘name',types.StringType()) ,types.StructField(‘company',types.StringType()) ,types.StructField(‘state',types.StringType()) ]) peopleRDD = rdd.map(mapper) peopleDF = spark.createDataFrame(peopleRDD, schema) 4
  • 58. Schemas Inferred vs explicit rdd = sc.textFile(“file:/people.txt”) def mapper(line): s = line.split(“|”) return Row(id=s[0],name=s[1],company=s[2],state=s[4]) schema = types.StructType([ types.StructField(‘id',types.IntegerType(), False) ,types.StructField(‘name',types.StringType()) ,types.StructField(‘company',types.StringType()) ,types.StructField(‘state',types.StringType()) ]) peopleRDD = rdd.map(mapper) peopleDF = spark.createDataFrame(peopleRDD, schema) >>> peopleDF.printSchema() Root |-- id: integer (nullable = false) |-- name: string (nullable = true) |-- company: string (nullable = true) |-- state: string (nullable = true) 4
  • 59. Schemas Inferred vs explicit return Row(id=s[0],name=s[1],company=s[2],state=s[4]) schema = types.StructType([ types.StructField(‘id',types.IntegerType(), False) ,types.StructField(‘name',types.StringType()) ,types.StructField(‘company',types.StringType()) ,types.StructField(‘state',types.StringType()) ]) peopleRDD = rdd.map(mapper) peopleDF = spark.createDataFrame(peopleRDD, schema) >>> peopleDF.printSchema() Root |-- id: integer (nullable = false) |-- name: string (nullable = true) |-- company: string (nullable = true) |-- state: string (nullable = true) And what are the available types? 4
  • 60. Schemas available types peopleRDD = rdd.map(mapper) peopleDF = spark.createDataFrame(peopleRDD, schema) >>> peopleDF.printSchema() Root |-- id: integer (nullable = false) |-- name: string (nullable = true) |-- company: string (nullable = true) |-- state: string (nullable = true) And what are the available types? # http://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/types.html __all__ = ["DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType", "TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"] 4
  • 61. # Spark SQL handles this just fine as if they were # legit date objects. spark.sql(“”” SELECT * FROM NewHires n WHERE n.start_date > “2016-01-01” “””).show() Schemas available types And what are the available types? # http://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/types.html __all__ = ["DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType", "TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"] **Gotcha alert** Spark doesn’t seem to care when you leave dates as strings. 4
  • 62. # Spark SQL handles this just fine as if they were # legit date objects. spark.sql(“”” SELECT * FROM NewHires n WHERE n.start_date > “2016-01-01” “””).show() Schemas available types "BinaryType", "BooleanType", "DateType", "TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"] **Gotcha alert** Spark doesn’t seem to care when you leave dates as strings. Now you know about inferred and explicit schemas, and the available types you can use. 4
  • 63. # Spark SQL handles this just fine as if they were # legit date objects. spark.sql(“”” SELECT * FROM NewHires n WHERE n.start_date > “2016-01-01” “””).show() Schemas available types **Gotcha alert** Spark doesn’t seem to care when you leave dates as strings. Now you know about inferred and explicit schemas, and the available types you can use. 4
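To sidestep that gotcha, you can make the date type explicit yourself. A minimal sketch (the NewHires rows here are made up) that casts a string column to a real DateType before querying it:

from pyspark.sql import Row
from pyspark.sql.functions import col

hiresRDD = sc.parallelize([
    Row(name="sarah", start_date="2016-03-15"),
    Row(name="steve", start_date="2015-11-02")
])
hiresDF = spark.createDataFrame(hiresRDD)

# cast the string column so comparisons use real dates, not string ordering
hiresDF = hiresDF.withColumn("start_date", col("start_date").cast("date"))
hiresDF.createOrReplaceTempView("NewHires")

spark.sql("SELECT * FROM NewHires WHERE start_date > '2016-01-01'").show()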
  • 65. Loading & Saving Results Types, performance, & considerations Loading and Saving is fairly straightforward. Save your dataframes in your desired format. 5
  • 66. Loading & Saving Results Types, performance, & considerations # picking up where we left off peopleDF = spark.createDataFrame(peopleRDD, schema) peopleDF.write.save("s3://acme-co/people.parquet", format="parquet") # format= defaults to parquet if omitted # formats: json, parquet, jdbc, orc, libsvm, csv, text Loading and Saving is fairly straightforward. Save your dataframes in your desired format. 5
  • 67. Loading & Saving Results Types, performance, & considerations # picking up where we left off peopleDF = spark.createDataFrame(peopleRDD, schema) peopleDF.write.save("s3://acme-co/people.parquet", format="parquet") # format= defaults to parquet if omitted # formats: json, parquet, jdbc, orc, libsvm, csv, text Loading and Saving is fairly straightforward. Save your dataframes in your desired format. When you read, some types preserve schema. Parquet keeps the full schema, JSON has inferrable schema, and JDBC pulls in schema. 5
  • 68. # read from stored parquet peopleDF = spark.read.parquet(“s3://acme-co/people.parquet”) # read from stored JSON peopleDF = spark.read.json(“s3://acme-co/people.json”) Loading & Saving Results Types, performance, & considerations # picking up where we left off peopleDF = spark.createDataFrame(peopleRDD, schema) peopleDF.write.save("s3://acme-co/people.parquet", format="parquet") # format= defaults to parquet if omitted # formats: json, parquet, jdbc, orc, libsvm, csv, text When you read, some types preserve schema. Parquet keeps the full schema, JSON has inferrable schema, and JDBC pulls in schema. 5
  • 69. # read from stored parquet peopleDF = spark.read.parquet(“s3://acme-co/people.parquet”) # read from stored JSON peopleDF = spark.read.json(“s3://acme-co/people.json”) Loading & Saving Results Types, performance, & considerations format="parquet") # format= defaults to parquet if omitted # formats: json, parquet, jdbc, orc, libsvm, csv, text When you read, some types preserve schema. Parquet keeps the full schema, JSON has inferrable schema, and JDBC pulls in schema. 5
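A minimal sketch of the round trip (the local paths are hypothetical): write the same dataframe out in two formats with an explicit save mode, then read both back and compare what survives.

# write with an explicit mode: "error" (the default), "append", "overwrite", or "ignore"
peopleDF.write.mode("overwrite").save("file:/tmp/people.parquet", format="parquet")
peopleDF.write.mode("overwrite").json("file:/tmp/people.json")

# parquet comes back with the exact schema; JSON gets re-inferred on read
fromParquet = spark.read.parquet("file:/tmp/people.parquet")
fromJSON = spark.read.json("file:/tmp/people.json")
fromParquet.printSchema()
fromJSON.printSchema()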
  • 71. ● Limited support for subqueries and various other notable SQL features ● Runs roughly half of the 99 TPC-DS benchmark queries ● More SQL support in HiveContext SQL Function Coverage Spark 1.6 6 Spark 1.6 http://www.slideshare.net/databricks/apache-spark-20-faster-easier-and-smarter
  • 72. SQL Function Coverage Spark 2.0 6 Spark 2.0 In DataBricks’ Words ● SQL2003 support ● Runs all 99 of TPC-DS benchmark queries ● A native SQL parser that supports both ANSI-SQL as well as Hive QL ● Native DDL command implementations ● Subquery support, including ○ Uncorrelated Scalar Subqueries ○ Correlated Scalar Subqueries ○ NOT IN predicate Subqueries (in WHERE/HAVING clauses) ○ IN predicate subqueries (in WHERE/HAVING clauses) ○ (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses) ● View canonicalization support ● In addition, when building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms. http://www.slideshare.net/databricks/apache-spark-20-faster-easier-and-smarter
  • 73. SQL Function Coverage Spark 2.0 6 Spark 2.0 In DataBricks’ Words ● SQL2003 support ● Runs all 99 of TPC-DS benchmark queries ● A native SQL parser that supports both ANSI-SQL as well as Hive QL ● Native DDL command implementations ● Subquery support, including ○ Uncorrelated Scalar Subqueries ○ Correlated Scalar Subqueries ○ NOT IN predicate Subqueries (in WHERE/HAVING clauses) ○ IN predicate subqueries (in WHERE/HAVING clauses) ○ (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses) ● View canonicalization support ● In addition, when building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.
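For example, an IN predicate subquery in a WHERE clause, which Spark 1.6 generally could not handle, parses and runs in 2.0. This sketch assumes both people and offices have been registered as temp views; the offices table is made up for illustration.

spark.sql("""
  SELECT name, company
  FROM people
  WHERE state IN (SELECT state FROM offices WHERE region = 'West')
""").show()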
  • 75. Using Spark SQL for ETL Tips and tricks 7 These are some of the things we learned to start doing after working with Spark for a while.
  • 76. Using Spark SQL for ETL Tips and tricks 7 Tip 1: In production, break your applications into smaller apps as steps. I.e. “Pipeline pattern” These are some of the things we learned to start doing after working with Spark for a while.
  • 77. Using Spark SQL for ETL Tips and tricks 7 Tip 2: When tinkering locally, save a small version of the dataset via spark and test against that. These are some of the things we learned to start doing after working with Spark for a while. Tip 1: In production, break your applications into smaller apps as steps. I.e. “Pipeline pattern”
  • 78. Using Spark SQL for ETL Tips and tricks 7 These are some of the things we learned to start doing after working with Spark for a while. Tip 1: In production, break your applications into smaller apps as steps. I.e. “Pipeline pattern” Tip 2: When tinkering locally, save a small version of the dataset via spark and test against that. Tip 3: If using EMR, create a cluster with your desired steps, prove it works, then export a CLI command to reproduce it, and run it in Data Pipeline to start recurring pipelines / jobs.
  • 79. Using Spark SQL for ETL Tips and tricks 7 Tip 3: If using EMR, create a cluster with your desired steps, prove it works, then export a CLI command to reproduce it, and run it in Data Pipeline to start recurring pipelines / jobs. Tip 1: In production, break your applications into smaller apps as steps. I.e. “Pipeline pattern” Tip 2: When tinkering locally, save a small version of the dataset via spark and test against that.
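A minimal sketch of Tip 2 (the paths and sample fraction are made up): carve off a small, representative slice of the real dataset once, write it locally, and develop against that instead of the full cloud copy.

bigDF = spark.read.parquet("s3://acme-co/logs/2016/")
sampleDF = bigDF.sample(withReplacement=False, fraction=0.001, seed=42)
sampleDF.write.mode("overwrite").parquet("file:/tmp/logs-sample.parquet")

# later, while tinkering on your laptop
devDF = spark.read.parquet("file:/tmp/logs-sample.parquet")
devDF.createOrReplaceTempView("logs")
spark.sql("SELECT count(*) FROM logs").show()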
  • 81. Working with JSON Quick and easy 8 JSON data is most easily read in as line-delimited JSON objects* {“n”:”sarah”,”age”:29} {“n”:”steve”,”age”:45}
  • 82. Working with JSON Quick and easy 8 JSON data is most easily read in as line-delimited JSON objects* {“n”:”sarah”,”age”:29} {“n”:”steve”,”age”:45} Schema is inferred upon load. Unlike other lazy operations, this will cause some work to be done.
  • 83. Working with JSON Quick and easy 8 JSON data is most easily read in as line-delimited JSON objects* {“n”:”sarah”,”age”:29} {“n”:”steve”,”age”:45} Access arrays with inline array syntax SELECT col[1], col[3] FROM json Schema is inferred upon load. Unlike other lazy operations, this will cause some work to be done.
  • 84. Working with JSON Quick and easy 8 JSON data is most easily read in as line-delimited JSON objects* {“n”:”sarah”,”age”:29} {“n”:”steve”,”age”:45} If you want to flatten your JSON data, use the explode method (works in both DF API and SQL) Access arrays with inline array syntax SELECT col[1], col[3] FROM json Schema is inferred upon load. Unlike other lazy operations, this will cause some work to be done.
  • 85. Working with JSON Quick and easy 8 JSON data must be read in as line-delimited JSON objects {“n”:”sarah”,”age”:29} {“n”:”steve”,”age”:45} If you want to flatten your JSON data, use the explode method (works in both DF API and SQL) # json explode example >>> spark.read.json("file:/json.explode.json").createOrReplaceTempView("json") >>> spark.sql("SELECT * FROM json").show() +----+----------------+ | x| y| +----+----------------+ |row1| [1, 2, 3, 4, 5]| |row2|[6, 7, 8, 9, 10]| +----+----------------+ >>> spark.sql("SELECT x, explode(y) FROM json").show() +----+---+ | x|col| +----+---+ |row1| 1| |row1| 2| |row1| 3| |row1| 4| |row1| 5| |row2| 6| |row2| 7| |row2| 8| |row2| 9| |row2| 10| +----+---+
  • 87. Working with JSON Quick and easy 8 JSON data is most easily read in as line-delimited JSON objects* {“n”:”sarah”,”age”:29} {“n”:”steve”,”age”:45} If you want to flatten your JSON data, use the explode method (works in both DF API and SQL) Access nested-objects with dot syntax SELECT field.subfield FROM json Access arrays with inline array syntax SELECT col[1], col[3] FROM json Schema is inferred upon load. Unlike other lazy operations, this will cause some work to be done.
  • 89. Working with JSON Quick and easy 8 If you want to flatten your JSON data, use the explode method (works in both DF API and SQL) Access nested-objects with dot syntax SELECT field.subfield FROM json Access arrays with inline array syntax SELECT col[1], col[3] FROM json Schema is inferred upon load. Unlike other lazy operations, this will cause some work to be done. For multi-line JSON files, you’ve got to do much more: import re # a list of data from files. files = sc.wholeTextFiles("data.json") # each tuple is (path, jsonData) rawJSON = files.map(lambda x: x[1]) # sanitize the data cleanJSON = rawJSON.map( lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE) ) # finally, you can then read that in as “JSON” spark.read.json( cleanJSON ) # PS -- the same goes for XML.
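Putting the dot and bracket syntax together — a minimal sketch over a made-up, line-delimited file where each record has a nested address object and a scores array:

# people.jsonl, one JSON object per line, e.g.:
# {"name":"sarah","address":{"city":"Encinitas","state":"CA"},"scores":[98,87,91]}
nestedDF = spark.read.json("file:/tmp/people.jsonl")
nestedDF.createOrReplaceTempView("people_json")

spark.sql("""
  SELECT name,
         address.city AS city,      -- nested object via dot syntax
         scores[0] AS first_score   -- array element via bracket syntax
  FROM people_json
""").show()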
  • 94. External Databases Reading and Writing 9 To read from an external database, you’ve got to have your JDBC connectors (jar) handy. In order to pass a jar package into spark, you’d use the --jars flag when starting pyspark or spark-submit.
  • 95. External Databases Reading and Writing 9 To read from an external database, you’ve got to have your JDBC connectors (jar) handy. In order to pass a jar package into spark, you’d use the --jars flag when starting pyspark or spark-submit. # pass the connector jar when starting pyspark my-linux-shell$ pyspark --jars /path/to/mysql-jdbc.jar --packages # note: you can also add the path to your jar in the spark-defaults.conf file to these settings: spark.driver.extraClassPath spark.executor.extraClassPath
  • 96. Once you’ve got your connector jars successfully imported, now you can read an existing database into your spark application or spark shell as a dataframe. External Databases Reading and Writing 9 have your JDBC connectors (jar) handy. In order to pass a jar package into spark, you’d use the --jars flag when starting pyspark or spark-submit. # pass the connector jar when starting pyspark my-linux-shell$ pyspark --jars /path/to/mysql-jdbc.jar --packages # note: you can also add the path to your jar in the spark-defaults.conf file to these settings: spark.driver.extraClassPath spark.executor.extraClassPath
  • 97. Once you’ve got your connector jars successfully imported, now you can read an existing database into your spark application or spark shell as a dataframe. External Databases Reading and Writing 9 spark.driver.extraClassPath spark.executor.extraClassPath # line broken for readability sqlURL = “jdbc:mysql://<db-host>:<port> ?user=<user> &password=<pass> &rewriteBatchedStatements=true &continueBatchOnError=true” df = spark.read.jdbc(url=sqlURL, table="<db>.<table>") df.createOrReplaceTempView(“myDB”) spark.sql(“SELECT * FROM myDB”).show()
  • 98. dataframe. External Databases Reading and Writing 9 # line broken for readability sqlURL = “jdbc:mysql://<db-host>:<port> ?user=<user> &password=<pass> &rewriteBatchedStatements=true # omg use this. &continueBatchOnError=true” # increases performance. df = spark.read.jdbc(url=sqlURL, table="<db>.<table>") df.createOrReplaceTempView(“myDB”) spark.sql(“SELECT * FROM myDB”).show() If you’ve done some work and created or manipulated a dataframe, you can write it to a database by using the dataframe’s write.jdbc method. Be prepared, it can take a while.
  • 99. External Databases Reading and Writing 9 df = spark.read.jdbc(url=sqlURL, table="<db>.<table>") df.createOrReplaceTempView(“myDB”) spark.sql(“SELECT * FROM myDB”).show() If you’ve done some work and created or manipulated a dataframe, you can write it to a database by using the dataframe’s write.jdbc method. Be prepared, it can take a while. Also, be warned, save modes in spark can be a bit destructive. “Overwrite” doesn’t just overwrite your data, it overwrites your schemas too. Say goodbye to those precious indices.
  • 100. External Databases Reading and Writing 9 If you’ve done some work and created or manipulated a dataframe, you can write it to a database by using the dataframe’s write.jdbc method. Be prepared, it can take a while. Also, be warned, save modes in spark can be a bit destructive. “Overwrite” doesn’t just overwrite your data, it overwrites your schemas too. Say goodbye to those precious indices.
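A minimal sketch of the write side; the connection details, table name, and resultsDF dataframe are all hypothetical. Note the explicit mode="append", which adds rows without dropping and recreating the table the way "overwrite" would.

connProps = {
    "user": "etl_user",
    "password": "********",
    "driver": "com.mysql.jdbc.Driver"
}
resultsDF.write.jdbc(
    url="jdbc:mysql://db-host:3306/analytics",
    table="daily_rollups",
    mode="append",      # "overwrite" would also wipe the table definition
    properties=connProps
)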
  • 102. Real World Testing Things to consider If testing locally, do not load data from S3 or other similar types of cloud storage. 10
  • 103. Real World Testing Things to consider If testing locally, do not load data from S3 or other similar types of cloud storage. 10 Construct your applications as much as you can in advance. Cloudy clusters are expensive.
  • 104. Real World Testing Things to consider If testing locally, do not load data from S3 or other similar types of cloud storage. 10 Construct your applications as much as you can in advance. Cloudy clusters are expensive. You can test a lot of your code reliably with a 1-node cluster.
  • 105. Real World Testing Things to consider If testing locally, do not load data from S3 or other similar types of cloud storage. 10 Construct your applications as much as you can in advance. Cloudy clusters are expensive. You can test a lot of your code reliably with a 1-node cluster. Get really comfortable using .parallelize() to create dummy data.
  • 106. Real World Testing Things to consider If testing locally, do not load data from S3 or other similar types of cloud storage. 10 Construct your applications as much as you can in advance. Cloudy clusters are expensive. You can test a lot of your code reliably with a 1-node cluster. Get really comfortable using .parallelize() to create dummy data. If you’re using big data, and many nodes, don’t use .collect() unless you intend to pull all of that data down to the driver.
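A minimal sketch of the .parallelize() habit (the rows are obviously fake): build tiny dummy data right in the driver, register it under the same view name production uses, and run the same SQL you would run against the real dataset.

from pyspark.sql import Row

dummyRDD = sc.parallelize([
    Row(id=1, name="Kristian", state="CA"),
    Row(id=2, name="Lebron", state="OH"),
    Row(id=3, name="Pat", state="CA")
])
dummyDF = spark.createDataFrame(dummyRDD)
dummyDF.createOrReplaceTempView("people")

spark.sql("SELECT state, COUNT(*) AS cnt FROM people GROUP BY state").show()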