Migrating to Spark 2.0 - Part 2
Moving to the next generation of Spark
https://github.com/phatak-dev/spark-two-migration
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consulting in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● What’s New in Spark 2.0
● Recap of Part 1
● Sub Queries
● Catalog API
● Hive Catalog
● Refresh Table
● Check Point for Iteration
● References
What’s new in 2.0?
● Dataset is the new single user-facing abstraction
● The RDD abstraction is used only at runtime
● Higher performance with whole-stage code generation
● Significant changes to the streaming abstraction with Spark Structured Streaming
● Incorporates learnings from 4 years of production use
● Spark ML is replacing MLlib as the de facto ML library
● Breaks API compatibility for better APIs and features
Need for Migration
● A lot of real-world code is written on the 1.x series of Spark
● As the fundamental abstractions have changed, all this code needs to migrate to make use of the performance and API benefits
● More and more ecosystem projects require Spark 2.0
● The 1.x series will go out of maintenance very soon, so no more bug fixes
Recap of Part 1
● Choosing Scala Version
● New Connectors
● Spark Session Entry Point
● Built-in CSV Connector
● Moving from DF/RDD API to Dataset
● Cross Joins
● Custom ML Transforms
SubQueries in Spark
SubQueries
● A query inside another query is known as a subquery
● It is a standard feature of SQL
● Example from MySQL
SELECT AVG(sum_column1)
FROM (SELECT SUM(column1) AS sum_column1
FROM t1 GROUP BY column1) AS t1;
● Highly useful, as they allow us to combine multiple different types of aggregation in one query
Types of SubQuery
● In the select clause (Scalar)
SELECT employee_id,
age,
(SELECT MAX(age) FROM employee) max_age
FROM employee
● In the from clause (Derived Tables)
SELECT AVG(sum_column1)
FROM (SELECT SUM(column1) AS sum_column1
FROM t1 GROUP BY column1) AS t1;
● In the where clause (Predicate)
SELECT * FROM t1 WHERE column1 = (SELECT column1 FROM t2);
SubQuery support in Spark 1.x
● Subquery support in Spark 1.x mimics the support available in Hive 0.12
● Hive only supported subqueries in the from clause, so Spark only supported the same
● Subqueries in the from clause are fairly limited in what they can do
● To support advanced querying in Spark SQL, subqueries beyond the from clause had to be added in 2.0
SubQuery support in Spark 2.x
● Spark has greatly improved its SQL dialect support in the 2.0 version
● Most of the standard features of the SQL-92 standard have been added
● Full-fledged SQL parser, no more dependency on Hive
● Runs all 99 TPC-DS queries natively
● Makes Spark a full-fledged OLAP query engine
Scalar SubQueries
● Scalar subqueries are subqueries which return a single (scalar) result
● There are two kinds of scalar subqueries
○ Uncorrelated subqueries: the ones which don’t depend on the outer query
○ Correlated subqueries: the ones which depend on the outer query
Uncorrelated Scalar SubQueries
● Add the maximum sales amount to each row of the sales data
● This normally helps us understand how far a given transaction is from the maximum sale we have made
● In Spark 1.x: sparkone.SubQueries
● In Spark 2.x: sparktwo.SubQueries (see the sketch below)
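As a rough illustration (not the exact code from the repository), a Spark 2.x version of the uncorrelated case could look like the sketch below; the sales view and its columns are made-up for the example.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("scalar-subqueries").master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical sales data: (item_id, item_category, amount)
val sales = Seq((1, "books", 100.0), (2, "books", 250.0), (3, "toys", 80.0))
  .toDF("item_id", "item_category", "amount")
sales.createOrReplaceTempView("sales")

// uncorrelated scalar subquery in the select clause, run natively by Spark 2.x
spark.sql(
  """SELECT item_id, amount,
    |       (SELECT MAX(amount) FROM sales) AS max_amount
    |FROM sales""".stripMargin).show()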
Correlated SubQueries
● Add the maximum sales amount, within each item category, to each row of the sales data
● This normally helps us understand how far a given transaction is from the maximum sale we have made in that category
● In Spark 1.x: sparkone.SubQueries
● In Spark 2.x: sparktwo.SubQueries (see the sketch below)
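A sketch of the correlated variant, reusing the hypothetical sales view built in the previous sketch; the inner query refers to the outer row’s item_category.

// correlated scalar subquery: the inner query references the outer query's category
spark.sql(
  """SELECT item_id, item_category, amount,
    |       (SELECT MAX(s2.amount)
    |        FROM sales s2
    |        WHERE s2.item_category = s1.item_category) AS max_in_category
    |FROM sales s1""".stripMargin).show()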
Catalog API
Catalog in SQL
● A catalog is a metadata store which contains the metadata of all the objects in a SQL system
● Typical contents of catalog are
○ Databases
○ Tables
○ Table Metadata
○ Functions
○ Partitions
○ Buckets
Catalog in Spark 1.x
● By default, Spark uses an in-memory catalog which keeps track of Spark temp tables
● It is not persistent
● For any persistent operations, Spark advocated use of the Hive metastore
● There was no standard API to query metadata information from the in-memory / Hive metastore
● Ad-hoc functions were added to SQLContext over time to fix this
Need of Catalog API
● Many interactive applications, like notebook systems, often need an API to query the metastore to show relevant information to the user
● Whenever we integrate with Hive, without a catalog API we have to resort to running HQL queries and parsing their output to get the metadata
● The Hive metastore cannot be manipulated directly from the Spark API
● The API needs to evolve to support more metastores in the future
Catalog API in 2.x
● Spark 2.0 has added a full-fledged catalog API to the Spark session
● It lives in the sparkSession.catalog namespace
● This catalog has APIs to create, read and delete elements from the in-memory store and also from the Hive metastore
● Having this standard API to interact with the catalog makes developers’ lives much easier than before
● If we were using the non-standard APIs before, it’s time to migrate
Catalog API Migration
● Migrate from the sqlContext APIs to the sparkSession.catalog API
● Use sparkSession rather than HiveContext to get access to the special operations
● Spark 1.x: sparkone.CatalogExample
● Spark 2.x: sparktwo.CatalogExample (see the sketch below)
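A rough sketch of what the migration looks like in code; the table name is illustrative and the 1.x calls are shown only as comments for contrast.

// `spark` is an existing SparkSession (see the earlier sketch)

// Spark 1.x style: ad-hoc methods on SQLContext / HiveContext
// sqlContext.tableNames().foreach(println)
// sqlContext.dropTempTable("sales")

// Spark 2.x style: the standard catalog API on SparkSession
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
spark.catalog.listColumns("sales").show()   // "sales" is a hypothetical temp view
spark.catalog.dropTempView("sales")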
Hive Integration
Hive Integration in Spark 1.x
● Spark SQL has had native support for Hive from the beginning
● In the beginning, Spark SQL used the Hive query parser for parsing and the Hive metastore for persistent storage
● To integrate with Hive, one had to create a HiveContext, which is separate from SQLContext
● Some APIs were only available on SQLContext and some Hive-specific ones only on HiveContext
● No support for manipulating the Hive metastore
Hive Integration in 2.x
● No more separate HiveContext
● SparkSession has an enableHiveSupport API to enable Hive support
● This makes the Spark SQL and Hive APIs consistent
● The Spark 2.0 catalog API also supports the Hive metastore
● Example: sparktwo.CatalogHiveExample (see the sketch below)
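A minimal sketch of enabling Hive support on a SparkSession, assuming the Hive dependencies are on the classpath; the warehouse directory is just an example value.

import org.apache.spark.sql.SparkSession

val sparkWithHive = SparkSession.builder()
  .appName("hive-integration")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")   // example location
  .enableHiveSupport()
  .getOrCreate()

// the same catalog API now also sees hive databases and tables
sparkWithHive.catalog.listTables("default").show()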
Refresh Table API
Need of Refresh Table API
● In Spark, we cache a table (dataset) for performance reasons
● Spark caches the metadata in its metastore and the actual data in the block manager
● If the underlying file/table changes, there was no direct API in Spark to force a table refresh
● If you just uncache/recache, it only reflects the change in data, not in metadata
● So we need a standard way to refresh the table
Refresh Table and By Path
● Spark 2.0 provides two APIs for refreshing datasets in Spark
● The refreshTable API, carried over from HiveContext, is used for registered temp tables or Hive tables
● The refreshByPath API is used for refreshing datasets without having to register them as tables beforehand
● Spark 1.x: sparkone.RefreshExample
● Spark 2.x: sparktwo.RefreshExample (see the sketch below)
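A sketch of the two refresh calls in 2.x; the table name and path are placeholders.

// `spark` is an existing SparkSession

// for a registered temp table or hive table
spark.catalog.refreshTable("sales")

// for datasets read directly from files, without registering a table
spark.catalog.refreshByPath("/data/sales")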
CheckPoint API
Iterative Programming in Spark
● Spark is one of the first big data frameworks to have great native support for iterative programming
● Iterative programs go over the data again and again to compute their results
● Spark ML is one of the iterative frameworks in Spark
● Even though caching and the RDD mechanisms worked great for iterative programming, moving to DataFrame has created new challenges
Spark iterative processing
Iteration in Dataframe API
● As every step of the iteration creates a new DF, the logical plan keeps growing
● As Spark needs to keep the complete query plan for recovery, the overhead of analysing the plan increases with the number of iterations
● This overhead is compute bound and incurred at the driver
● As this overhead increases, it makes iteration very slow
● Ex: sparkone.CheckPoint (see the sketch of the issue below)
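A small sketch of the issue (not the repository example itself): each pass wraps the previous DataFrame, so the logical plan grows with the number of iterations; the column name and iteration count are illustrative.

import org.apache.spark.sql.DataFrame

// `spark` is an existing SparkSession
var df: DataFrame = spark.range(0, 1000000).toDF("value")
for (i <- 1 to 100) {
  // every withColumn yields a new DataFrame whose plan contains all previous steps
  df = df.withColumn("value", df("value") + 1)
}
df.count()   // planning this action analyses the whole deep plan on the driver and gets slower as iterations grow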
Solution to Query Plan Issue
● To solve the ever-growing query plan (lineage), we need to truncate it to make iteration faster
● Whenever we truncate the query plan, we lose the ability to recover
● To avoid that, we need to store the intermediate data before we truncate the query plan
● Saving intermediate data along with truncating the query plan results in faster performance
Dataset Persistence API
● In Spark 2.1, there is a new checkpoint API on Dataset
● It is analogous to the RDD checkpoint API
● In RDDs, checkpoint persists the RDD data and then truncates its lineage
● Similarly, in the case of a Dataset, checkpoint persists the dataset and then truncates its query plan
● Make sure the checkpoint times are much lower than the overhead you are facing
● Ex: sparktwo.CheckPoint (see the sketch below)
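A sketch of truncating the plan during iteration, assuming Spark 2.1+ where Dataset.checkpoint() is available; the checkpoint directory and the every-10-iterations interval are arbitrary choices for the example.

// `spark` is an existing SparkSession
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   // example location

var df = spark.range(0, 1000000).toDF("value")
for (i <- 1 to 100) {
  df = df.withColumn("value", df("value") + 1)
  if (i % 10 == 0) {
    // materialises the intermediate data and replaces the plan with a fresh scan
    df = df.checkpoint()
  }
}
df.count()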
Migrating Best Practices
Best Practice Migration
● As the fundamental abstractions have changed in Spark 2.0, we need to rethink our best practices
● Many best practices were centred around the RDD abstraction, which is no longer the central abstraction
● Also, many optimisations in Catalyst are now done for us by the platform
● So let’s look at some best practices of 1.x and see how they change
Choice of Serializer
● Use the Kryo serializer over Java serialization and register classes with Kryo
● This best practice was devised for efficient caching and transfer of RDD data
● But in Spark 2.0, Dataset uses a custom code-generated serialization framework (encoders) for most code and data
● So unless there is heavy use of RDDs in your project, you don’t need to worry about the serializer in 2.0
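If a job is still RDD-heavy, the old advice can be kept, roughly as sketched below; MyRecord stands in for whatever classes the job actually shuffles or caches.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class MyRecord(id: Long, name: String)   // placeholder for your own classes

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))

val sparkWithKryo = SparkSession.builder().config(conf).getOrCreate()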
Cache Format
● RDD uses MEMORY_ONLY as the default, and it’s the most efficient caching for RDDs
● DataFrame/Dataset uses MEMORY_AND_DISK as the default rather than MEMORY_ONLY
● Recomputing a Dataset and converting it to the custom serialization format is often costly
● So use MEMORY_AND_DISK as the default format over MEMORY_ONLY
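A small sketch of this default in use, with an illustrative parquet path.

import org.apache.spark.storage.StorageLevel

// `spark` is an existing SparkSession
val salesDs = spark.read.parquet("/data/sales")   // hypothetical path
salesDs.persist(StorageLevel.MEMORY_AND_DISK)     // same level salesDs.cache() uses for Datasets
salesDs.count()                                   // materialise the cache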
Use of BroadCast variables
● Use broadcast variables for optimising lookups and joins
● Broadcast variables played an important role in making joins efficient in the RDD world
● These variables don’t have much scope in Dataset API land
● By configuring the broadcast threshold, Spark SQL will do the broadcasting automatically
● Don’t use them unless there is a reason (see the sketch below)
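A sketch of relying on automatic broadcast joins instead of hand-rolled broadcast variables; the threshold value and DataFrames are illustrative.

import org.apache.spark.sql.functions.broadcast

// `spark` is an existing SparkSession
import spark.implicits._
val salesDf = Seq((1, 100.0), (2, 250.0)).toDF("item_id", "amount")
val itemsDf = Seq((1, "books"), (2, "toys")).toDF("item_id", "item_category")

// let spark sql broadcast any table smaller than ~10 MB automatically
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// or hint explicitly on a specific join when you know one side is small
val joined = salesDf.join(broadcast(itemsDf), "item_id")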
Choice of Clusters
● Use YARN/Mesos for production; standalone is mostly for simple apps
● Users were encouraged to use a dedicated cluster manager over the standalone one shipped with Spark
● With Databricks cloud putting its weight behind the standalone cluster manager, it has become production ready
● Many companies run their Spark applications on standalone clusters today
● Choose standalone if you run only Spark applications
Use of HiveContext
● Use HiveContext over SQLContext for using Spark SQL
● In Spark 1.x, Spark SQL was simplistic and heavily depended on Hive for query parsing
● In Spark 2.0, Spark SQL is enriched and is now more powerful than Hive itself
● Most of the Hive UDFs are now rewritten in Spark SQL and code generated
● Unless you want to use the Hive metastore, use SparkSession without Hive support (see the sketch below)
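A minimal sketch of the 2.x default: a plain SparkSession is enough for Spark SQL, and enableHiveSupport is added only when the Hive metastore is actually needed.

import org.apache.spark.sql.SparkSession

// no HiveContext and no hive dependency needed for plain spark sql
val plainSpark = SparkSession.builder()
  .appName("sql-without-hive")
  .master("local[*]")
  .getOrCreate()

plainSpark.sql("SELECT 1 + 1 AS two").show()   // handled by the native parser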
References
● http://blog.madhukaraphatak.com/categories/spark-two-migration-series/
● http://www.spark.tc/migrating-applications-to-apache-spark-2-0-2/
● http://blog.madhukaraphatak.com/categories/spark-two/
● https://www.youtube.com/watch?v=jyXEUXCYGwo
● https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html