Jean-Georges Perrin • @jgperrin
Apache Spark v3.0.0
What’s new? A very personal view.
Jean-Georges Perrin
Software since 1983 $>0 1995
Big Data since 1984 $>0 2006
AI since 1994 $>0 2010
x12
@jgperrin • http://jgp.ai
Sources:

Wikipedia: https://en.wikipedia.org/wiki/Big_data

IBM: https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
• volume 

• variety 

• velocity 

• variability

• value
3
V4
5
Biiiiiiiig Data
Data is considered big when it needs more than one computer to be processed
An analytics operating system?
[Diagram: layered stacks, from a single machine (Hardware, OS, Apps) to a cluster (Hardware and OS per node, then Distributed OS, Analytics OS, Apps).]
DATA Engineer
Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps.
• Match architecture with business needs.
• Develop processes for data modeling, mining, and pipelines.
• Improve data reliability and quality.

DATA Scientist
Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepare data for predictive models.
• Explore data to find hidden gems and patterns.
• Tell stories to key stakeholders.
Sources:

Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
DATA
Engineer
DATA
Scientist
SQL
Sources:

Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
IBM Watson Studio
Sources:

Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc

Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
Python rules in
Notebooks
A few more figures
Who does not like performance figures?
• Databricks:

• Processes >5T records/day with Structured Streaming (introduced in Spark v2.0, stable in Spark v2.2)

• >90% of all Spark API calls run through Spark SQL, regardless of the language used

• Community:

• Spark v3.0 is roughly two times faster than Spark v2.4 in the TPC-DS 30TB
benchmark
5,000,000,000,001
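To put the 5T records/day figure in perspective, here is a quick back-of-the-envelope conversion to records per second. It assumes the load is spread evenly over 24 hours, which real traffic rarely is; the class and method names are made up for this illustration.

```java
// Back-of-the-envelope check of the ">5T records/day" figure.
// Assumption: an even spread over the day (real workloads have peaks).
class StreamingThroughput {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average records per second for a given daily volume (integer division).
    static long recordsPerSecond(long recordsPerDay) {
        return recordsPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long perDay = 5_000_000_000_000L; // 5T records/day, the lower bound quoted above
        System.out.println(recordsPerSecond(perDay) + " records/second on average");
    }
}
```

That works out to roughly 58 million records per second, on average.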
Sources:

Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc

Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html

Spark v3.0.0 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
What’s new in v3?
Sources:

Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html

Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
3400+ Jira tickets
Highlights in a nutshell
• Python

• Python v3 only (so long Python v2)

• Better error handling

• Koalas offers better pandas support (close to 80% of the pandas API)

• SQL

• Better ANSI SQL compliance

• Core

• Adaptive query execution and dynamic partition pruning

• Java v11 support, Scala v2.12 only

• Hydrogen - hardware & accelerator aware scheduler
GPU support for model training
Optimizing the optimizer
Got to love Catalyst
• v1.x: rule

• v2.x: rule & cost (thanks to IBM)

• v3.x: rule & cost & runtime (thanks to Databricks & Intel)
• Dynamically coalescing shuffle partitions

• Dynamically switching join strategies

• Dynamically optimizing skew joins
Sources:

Databricks blog, https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
Adaptive query execution (AQE)
Yet another 3-letter acronym
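The first of the three AQE techniques, coalescing shuffle partitions, can be pictured with a toy model in plain Java. This is only a sketch of the idea (merge small neighbouring partitions once their real sizes are known), not Spark's implementation; the class name, the greedy rule, and the 64 MB target used in `main` are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "dynamically coalescing shuffle partitions".
// Not Spark's code: names and the greedy rule are illustrative assumptions.
class CoalesceSketch {

    // Merges adjacent partitions (given by their sizes in bytes) as long as the
    // running total stays at or below targetBytes. Returns the merged sizes.
    static List<Long> coalesce(long[] partitionBytes, long targetBytes) {
        List<Long> merged = new ArrayList<>();
        long current = 0;
        boolean open = false; // true once the current group holds a partition
        for (long size : partitionBytes) {
            if (open && current + size > targetBytes) {
                merged.add(current); // close the group before it overshoots
                current = 0;
            }
            current += size;
            open = true;
        }
        if (open) {
            merged.add(current);
        }
        return merged;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Sizes measured after the shuffle, not estimated at planning time.
        long[] sizes = {10 * mb, 5 * mb, 40 * mb, 3 * mb, 2 * mb, 70 * mb};
        System.out.println(coalesce(sizes, 64 * mb));
    }
}
```

With a 64 MB target, the six partitions collapse into two: 60 MB and 70 MB. A partition that is already larger than the target is left alone.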
// AQE is switched on or off for the entire session
SparkSession spark = SparkSession.builder()
    .appName("Join using AQE")
    .master("local[*]")
    .config("spark.sql.adaptive.enabled", useAqe)
    .getOrCreate();
…
// Join institutions to counties on the ZIP code
Dataset<Row> institPerCountyDf = higherEdDf.join(
    countyZipDf,
    higherEdDf.col("zip").equalTo(countyZipDf.col("zip")),
    "inner");
// Attach census data; the left join keeps institutions without a census match
institPerCountyDf = institPerCountyDf.join(
    censusDf,
    institPerCountyDf.col("county").equalTo(censusDf.col("countyId")),
    "left");
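For joins like these, "dynamically switching join strategies" boils down to re-checking a size threshold once real sizes are known. The sketch below is a simplified model in plain Java, not Spark's planner: the class, enum, and method names are hypothetical, and the only Spark fact used is that `spark.sql.autoBroadcastJoinThreshold` defaults to 10 MB.

```java
// Toy model of "dynamically switching join strategies".
// Hypothetical names; the real planner weighs more than a single size.
class JoinStrategySketch {
    enum Strategy { BROADCAST_HASH_JOIN, SORT_MERGE_JOIN }

    // 10 MB, the default value of spark.sql.autoBroadcastJoinThreshold.
    static final long BROADCAST_THRESHOLD_BYTES = 10L * 1024 * 1024;

    // Broadcast the smaller side when it fits under the threshold.
    static Strategy choose(long smallerSideBytes) {
        return smallerSideBytes <= BROADCAST_THRESHOLD_BYTES
            ? Strategy.BROADCAST_HASH_JOIN
            : Strategy.SORT_MERGE_JOIN;
    }

    public static void main(String[] args) {
        long estimatedAtPlanning = 200L * 1024 * 1024; // guess made before any filter ran
        long measuredAtRuntime = 4L * 1024 * 1024;     // size observed after the stage finished
        System.out.println("planned:    " + choose(estimatedAtPlanning));
        System.out.println("re-planned: " + choose(measuredAtRuntime));
    }
}
```

The point is the second call: the same rule gives a different answer when fed runtime statistics instead of planning-time estimates.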
/jgperrin/net.jgp.books.spark.ch12
Chapter 12
Lab #302
In a free companion book
• sinh, cosh, tanh, asinh, acosh, atanh (SPARK-28133)

• any, every, some (SPARK-19851)

• bit_and, bit_or (SPARK-27879)

• bit_count (SPARK-29491)

• bit_xor (SPARK-29545)

• bool_and, bool_or (SPARK-30184)

• count_if (SPARK-27425)

• date_part (SPARK-28690)

• extract (SPARK-23903)

• forall (SPARK-27905)

• from_csv (SPARK-25393)

• make_date (SPARK-28432)

• make_interval (SPARK-29393)

• make_timestamp (SPARK-28459)

• map_entries (SPARK-23935)

• map_filter (SPARK-23937)

• map_zip_with (SPARK-23938)

• max_by, min_by (SPARK-27653)

• schema_of_csv (SPARK-25672)

• to_csv (SPARK-25638)

• transform_keys (SPARK-23939)

• transform_values (SPARK-23940)

• typeof (SPARK-29961)

• version (SPARK-29554)

• xxhash64 (SPARK-27099)
Sources:

Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
New static functions
http://jgp.ai/sia
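For readers who think in Java rather than SQL, a few of the functions listed above have close plain-Java analogs. The sketch below only illustrates the semantics of `count_if`, `max_by`, and `bit_count` on small in-memory lists; it does not call Spark, and the class and method names are made up for the example.

```java
import java.util.List;
import java.util.Map;

// Plain-Java analogs of a few of the new SQL functions; illustrative only.
class NewFunctionsSketch {

    // count_if(value > threshold): number of rows for which the condition holds.
    static long countIf(List<Integer> values, int threshold) {
        return values.stream().filter(v -> v > threshold).count();
    }

    // max_by(key, value): the key found on the row where value is largest.
    static String maxBy(List<Map.Entry<String, Integer>> rows) {
        return rows.stream()
            .max((a, b) -> Integer.compare(a.getValue(), b.getValue()))
            .map(Map.Entry::getKey)
            .orElse(null);
    }

    // bit_count(n): number of bits set to 1 in n.
    static int bitCount(long n) {
        return Long.bitCount(n);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sales =
            List.of(Map.entry("Paris", 3), Map.entry("Durham", 7));
        System.out.println(countIf(List.of(1, 5, 10, 20), 4));
        System.out.println(maxBy(sales));
        System.out.println(bitCount(7L));
    }
}
```

In the same spirit, `every` and `any` behave like `allMatch` and `anyMatch` on a stream of booleans.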
Always a soup
• Finally a reference guide

• http://jgp.ai/sparksql

• EXPLAIN can be FORMATTED

• Proleptic Gregorian calendar, based on the Java 8 date and time API

• Overflow check

• ANSI compatibility through a configuration flag (spark.sql.ansi.enabled)
SQL
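Two of these SQL changes are easy to feel with the JDK alone. The sketch below contrasts the proleptic Gregorian calendar of `java.time` with the hybrid Julian/Gregorian calendar of `java.util`, and a checked addition with a silently wrapping one. It illustrates the semantics only and does not run any Spark code; the class and method names are assumptions for the example.

```java
import java.time.Year;
import java.util.GregorianCalendar;

// JDK-only illustration of two behaviors; not Spark code.
class SqlSemanticsSketch {

    // Proleptic Gregorian (java.time): the Gregorian leap-year rule for every year.
    static boolean leapInProlepticGregorian(int year) {
        return Year.isLeap(year);
    }

    // Hybrid calendar (java.util): the Julian rule applies before the 1582 cutover.
    static boolean leapInHybridCalendar(int year) {
        return new GregorianCalendar().isLeapYear(year);
    }

    // Overflow check: fail loudly instead of wrapping around.
    static int checkedAdd(int a, int b) {
        return Math.addExact(a, b); // throws ArithmeticException on overflow
    }

    public static void main(String[] args) {
        System.out.println("1500 leap, proleptic Gregorian: " + leapInProlepticGregorian(1500));
        System.out.println("1500 leap, hybrid calendar:     " + leapInHybridCalendar(1500));
        System.out.println("unchecked: " + (Integer.MAX_VALUE + 1));
        try {
            checkedAdd(Integer.MAX_VALUE, 1);
        } catch (ArithmeticException e) {
            System.out.println("checked:   " + e.getMessage());
        }
    }
}
```

The year 1500 is a leap year in the hybrid calendar but not in the proleptic Gregorian one, which is the kind of difference that can shift very old dates after an upgrade.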
Ingestion
Who needs a push down?
• Already available in databases

• Lets you filter what you ingest, before you ingest it

• Equivalent to, but easier than, ingesting first and filtering afterward
// The three-table join runs in MySQL, not in Spark
String sqlQuery =
    "select actor.first_name, actor.last_name, film.title, "
        + "film.description "
        + "from actor, film_actor, film "
        + "where actor.actor_id = film_actor.actor_id "
        + "and film_actor.film_id = film.film_id";
// The query is wrapped as an aliased subquery and passed where a table name is expected
Dataset<Row> df = spark.read().jdbc(
    "jdbc:mysql://localhost:3306/sakila",
    "(" + sqlQuery + ") actor_film_alias",
    props);
Will only ingest the result of the MySQL query
/jgperrin/net.jgp.books.spark.ch08
Chapter 8
Lab #310
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
| 5| 2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the ...| 04/23/2017|http://amzn.to/2i3mthT|
| 6| 2|Development Tools in 2006: any Room for a 4GL-style Language? An i...| 12/28/2016|http://amzn.to/2vBxOe1|
| 7| 3| Adventures of Huckleberry Finn| 05/26/1994|http://amzn.to/2wOeOav|
…
Dataset<Row> df = spark.read().format("csv")
    …
    .load("data/books.csv")
    .filter("authorId = 1");
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
Will only ingest books where authorId is 1
/jgperrin/net.jgp.books.spark.ch07
Chapter 7
Lab #201
Migration tips
Yes, they are needed
• Compilation will catch some issues (for example, a new exception in Structured Streaming)

• Runtime will throw you off:

• Parsing dates

• Data sources (v2 on the way)

• Reference

• https://spark.apache.org/docs/latest/migration-guide.html
org.apache.spark.SparkUpgradeException: You may get a different result due to
the upgrading of Spark 3.0: Fail to parse '2015-10-6' in the new parser. You can
set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before
Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .getOrCreate();
// Restore the pre-3.0 date parser once the session exists
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY");
or:
// Set the same policy while building the session
SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
    .getOrCreate();
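The '2015-10-6' failure in the exception above can be reproduced with the JDK alone, which helps explain why the LEGACY policy makes it go away. The sketch assumes the pattern in play is `yyyy-MM-dd` (the exception message does not say) and compares the older `SimpleDateFormat` with the `java.time` formatter that Spark 3.0 builds on; the class and method names are made up for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// JDK-only reproduction of the lenient vs. strict parsing difference.
class DateParsingSketch {
    static final String PATTERN = "yyyy-MM-dd"; // assumed pattern, not taken from the error

    // Older API: lenient about the number of digits in each field.
    static boolean parsesWithLegacyApi(String text) {
        try {
            new SimpleDateFormat(PATTERN).parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    // java.time: "dd" means exactly two digits.
    static boolean parsesWithJavaTime(String text) {
        try {
            LocalDate.parse(text, DateTimeFormatter.ofPattern(PATTERN));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String text : new String[] {"2015-10-6", "2015-10-06"}) {
            System.out.println(text
                + " legacy=" + parsesWithLegacyApi(text)
                + " java.time=" + parsesWithJavaTime(text));
        }
    }
}
```

The single-digit day is accepted by the older API and rejected by `java.time`. Padding the data to '2015-10-06' is the alternative to switching the policy back to LEGACY.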
Chapter 3
Lab #320
Lab #321
/jgperrin/net.jgp.books.spark.ch03
Mi-figue, mi-raisin
Mixed bags
The lakehouse is a full ecosystem
Or is it an operating system?
[Diagram: three layers, bottom to top. Data sources: streams, systems, files, other databases. Processing & storage: Delta Lake & Delta Engine, TBA? Outcome: business, data science, data engineering.]
Takeaways
• Apache Spark v3 is a major update, 3400+ patches

• Foundation for a rich data ecosystem

• Python increasingly popular, beats Scala

• Cornerstone for the lakehouse concept
Thank you! http://jgp.ai/sia
Join me for DataFriday:
http://jgp.ai/datafriday
Backup
Credits
• World of Watson by Jean-Georges Perrin CC BY-SA 4.0

• Digital Garage by Jean-Georges Perrin CC BY-SA 4.0

• Figs, grapes and rosehips by Marco Verch Professional Photographer and
Speaker, Flickr

• Soup by Valeria Boltneva from Pexels

Apache Spark v3.0.0

  • 1. Jean-Georges Perrin • @jgperrin Apache Spark v3.0.0 What’s new? A very personal view.
  • 2. Jean-Georges Perrin Software since 1983 $>0 1995 Big Data since 1984 $>0 2006 AI since 1994 $>0 2010 x12 @jgperrin • http://jgp.ai
  • 3.
  • 4.
  • 7.
  • 8. Apps Analytics Distrib. Hardware OS Apps HardwareHardware OS OS Distributed OS Analytics OS Apps HardwareHardware OS OS An analytics operating system?
  • 9. HardwareHardware OS OS Distributed OS Analytics OS Apps { An analytics operating system?
  • 10. HardwareHardware OS OS Distributed OS Analytics OS Apps { An analytics operating system?
  • 11. DATA Engineer DATA Scientist Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps. Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations. Match architecture with business needs. Develop processes for data modeling, mining, and pipelines. Improve data reliability and quality. Prepare data for predictive models. Explore data to find hidden gems and patterns. Tells stories to key stakeholders. Sources: Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
  • 14. Python rules in Notebooks Sources: Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
  • 15. A few more figures Who does not like performance figures? • Databricks: • Processes >5T records/day with Structured Streaming (introduced in Spark v2.0, stable in Spark v2.2) • >90% of all Spark API calls run through Spark SQL, regardless of the language used • Community: • Spark v3.0 is roughly two times faster than Spark v2.4 in the TPC-DS 30TB benchmark 5,000,000,000,001 Sources: Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html Spark v3.0.0 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
  • 17. 3400+ Jira tickets Sources: Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
  • 18. Highlights in a nutshell • Python • Python v3 only (so long, Python v2) • Better error handling • Koalas offers better pandas API coverage (close to 80%) • SQL • Better ANSI SQL compliance • Core • Adaptive query execution, plus dynamic partition pruning • Java v11 support, Scala v2.12 only • Hydrogen: hardware- and accelerator-aware scheduler, GPU support for model training
  • 19. Optimizing the optimizer Got to love Catalyst • v1.x: rule-based • v2.x: rule- and cost-based (thanks to IBM) • v3.x: rule-, cost-, and runtime-based (thanks to Databricks & Intel)
  • 20. Adaptive query execution (AQE) Yet another 3-letter acronym • Dynamically coalescing shuffle partitions • Dynamically switching join strategies • Dynamically optimizing skew joins Sources: Databricks blog, https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
  • 21. Chapter 12, Lab #302 (/jgperrin/net.jgp.books.spark.ch12)
        SparkSession spark = SparkSession.builder()
            .appName("Join using AQE")
            .master("local[*]")
            .config("spark.sql.adaptive.enabled", useAqe) // for the entire session
            .getOrCreate();
        …
        Dataset<Row> institPerCountyDf = higherEdDf.join(
            countyZipDf,
            higherEdDf.col("zip").equalTo(countyZipDf.col("zip")),
            "inner");
        institPerCountyDf = institPerCountyDf.join(
            censusDf,
            institPerCountyDf.col("county").equalTo(censusDf.col("countyId")),
            "left");
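To make "dynamically coalescing shuffle partitions" more concrete, here is a toy sketch in plain Java. It is not Spark's implementation: the class name `CoalesceSketch`, the greedy strategy, and the sizes are all assumptions made for illustration. The idea it models is that, once the real shuffle partition sizes are known at runtime, small neighbouring partitions can be merged until each group is close to a target size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Spark's AQE logic is more involved.
class CoalesceSketch {

  /**
   * Greedily merges adjacent partitions. A group is closed as soon as
   * adding the next partition would push it past targetBytes.
   */
  static List<Long> coalesce(long[] partitionBytes, long targetBytes) {
    List<Long> merged = new ArrayList<>();
    long current = 0;
    for (long size : partitionBytes) {
      if (current > 0 && current + size > targetBytes) {
        merged.add(current);
        current = 0;
      }
      current += size;
    }
    if (current > 0) {
      merged.add(current);
    }
    return merged;
  }

  public static void main(String[] args) {
    // Made-up shuffle partition sizes, as they might be observed at runtime.
    long[] shuffleSizes = {10, 20, 5, 70, 8, 2, 90};
    System.out.println(coalesce(shuffleSizes, 64)); // prints [35, 70, 10, 90]
  }
}
```

Seven partitions become four: the three small ones at the start are merged, as are the two small ones in the middle, while partitions already above the target are left alone.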
  • 22. New static functions In a free companion book • sinh, cosh, tanh, asinh, acosh, atanh (SPARK-28133) • any, every, some (SPARK-19851) • bit_and, bit_or (SPARK-27879) • bit_count (SPARK-29491) • bit_xor (SPARK-29545) • bool_and, bool_or (SPARK-30184) • count_if (SPARK-27425) • date_part (SPARK-28690) • extract (SPARK-23903) • forall (SPARK-27905) • from_csv (SPARK-25393) • make_date (SPARK-28432) • make_interval (SPARK-29393) • make_timestamp (SPARK-28459) • map_entries (SPARK-23935) • map_filter (SPARK-23937) • map_zip_with (SPARK-23938) • max_by, min_by (SPARK-27653) • schema_of_csv (SPARK-25672) • to_csv (SPARK-25638) • transform_keys (SPARK-23939) • transform_values (SPARK-23940) • typeof (SPARK-29961) • version (SPARK-29554) • xxhash64 (SPARK-27099) Sources: Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html http://jgp.ai/sia
  • 23. SQL Always a soup • Finally a reference guide • http://jgp.ai/sparksql • EXPLAIN can be FORMATTED • Proleptic Gregorian calendar, based on the Java 8 date and time API • Overflow check • ANSI compatibility through a configuration flag (spark.sql.ansi.enabled)
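The calendar change is easy to overlook, so here is a small plain-Java sketch of what "proleptic Gregorian" means in practice. It does not use Spark; the class name `CalendarSketch` is made up for illustration. It compares the legacy `java.util.GregorianCalendar`, a hybrid that follows Julian rules before the October 1582 cutover, with the Java 8 `java.time` API, which applies Gregorian rules to every year.

```java
import java.time.Year;
import java.util.GregorianCalendar;

// Hypothetical illustration of hybrid versus proleptic calendar rules.
class CalendarSketch {

  /** Legacy hybrid calendar: Julian leap-year rule before the 1582 cutover. */
  static boolean isLeapHybrid(int year) {
    return new GregorianCalendar().isLeapYear(year);
  }

  /** java.time: Gregorian leap-year rule extended backwards to all years. */
  static boolean isLeapProleptic(int year) {
    return Year.isLeap(year);
  }

  public static void main(String[] args) {
    // 1500 is divisible by 100 but not by 400.
    System.out.println("hybrid:    " + isLeapHybrid(1500));    // true
    System.out.println("proleptic: " + isLeapProleptic(1500)); // false
  }
}
```

The two APIs disagree on whether February 29, 1500 exists, which is why very old dates can shift or fail after an upgrade, while modern dates are unaffected.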
  • 24. Ingestion Who needs a push down? • Already available in databases • Lets you filter what you ingest, before you ingest it • Equivalent to, but easier than, ingesting everything and filtering afterwards
  • 25. Chapter 8, Lab #310 (/jgperrin/net.jgp.books.spark.ch08)
        String sqlQuery =
            "select actor.first_name, actor.last_name, film.title, "
                + "film.description "
                + "from actor, film_actor, film "
                + "where actor.actor_id = film_actor.actor_id "
                + "and film_actor.film_id = film.film_id";
        // Will only ingest the result of the MySQL query
        Dataset<Row> df = spark.read().jdbc(
            "jdbc:mysql://localhost:3306/sakila",
            "(" + sqlQuery + ") actor_film_alias",
            props);
  • 26. Chapter 7, Lab #201 (/jgperrin/net.jgp.books.spark.ch07)
        Source file:
        +---+--------+----------------------------------------------------------------------+-----------+----------------------+
        | id|authorId|                                                                 title|releaseDate|                  link|
        +---+--------+----------------------------------------------------------------------+-----------+----------------------+
        |  1|       1|      Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
        |  2|       1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
        |  3|       1|         The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
        |  4|       1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
        |  5|       2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the ...| 04/23/2017|http://amzn.to/2i3mthT|
        |  6|       2| Development Tools in 2006: any Room for a 4GL-style Language? An i...| 12/28/2016|http://amzn.to/2vBxOe1|
        |  7|       3|                                        Adventures of Huckleberry Finn| 05/26/1994|http://amzn.to/2wOeOav|
        …
        // Will only ingest books where authorId is 1
        Dataset<Row> df = spark.read().format("csv")
            …
            .load("data/books.csv")
            .filter("authorId = 1");
        Resulting dataframe:
        +---+--------+----------------------------------------------------------------------+-----------+----------------------+
        | id|authorId|                                                                 title|releaseDate|                  link|
        +---+--------+----------------------------------------------------------------------+-----------+----------------------+
        |  1|       1|      Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
        |  2|       1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
        |  3|       1|         The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
        |  4|       1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
        +---+--------+----------------------------------------------------------------------+-----------+----------------------+
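The push-down idea can be modelled without Spark. The toy below is only a sketch: the class `PushDownSketch`, its `Book` type, and the `ingest` method are hypothetical names, and the data is just the `id` and `authorId` columns of the seven rows shown in the books example. It contrasts ingesting everything and filtering afterwards with handing the filter to the reader, so that only matching rows are ever kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical toy model of filter push-down; not Spark's data source API.
class PushDownSketch {

  /** Minimal stand-in for a row of books.csv (two columns only). */
  static final class Book {
    final int id;
    final int authorId;

    Book(int id, int authorId) {
      this.id = id;
      this.authorId = authorId;
    }
  }

  /** {id, authorId} pairs for the seven rows of the books example. */
  static final int[][] SOURCE = {{1, 1}, {2, 1}, {3, 1}, {4, 1}, {5, 2}, {6, 2}, {7, 3}};

  /** Reads the source and keeps only the rows accepted by the pushed-down filter. */
  static List<Book> ingest(Predicate<Book> pushedFilter) {
    List<Book> ingested = new ArrayList<>();
    for (int[] raw : SOURCE) {
      Book book = new Book(raw[0], raw[1]);
      if (pushedFilter.test(book)) {
        ingested.add(book);
      }
    }
    return ingested;
  }

  public static void main(String[] args) {
    // Without push-down: ingest all rows, then filter.
    List<Book> everything = ingest(book -> true);
    long filteredLater = everything.stream().filter(book -> book.authorId == 1).count();

    // With push-down: the reader applies the filter while ingesting.
    List<Book> pushedDown = ingest(book -> book.authorId == 1);

    System.out.println("ingested without push-down: " + everything.size()); // 7
    System.out.println("kept after filtering:       " + filteredLater);     // 4
    System.out.println("ingested with push-down:    " + pushedDown.size()); // 4
  }
}
```

Both paths end with the same four rows; the difference is how many rows had to be held along the way, which is what matters when the source is large.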
  • 27. Migration tips Yes, they are needed • Compilation will detect some (new exception in Structured Streaming) • Runtime will throw you off: • Parsing dates • Data sources (v2 on the way) • Reference • https://spark.apache.org/docs/latest/migration-guide.html
  • 28. Chapter 3, Lab #320, Lab #321 (/jgperrin/net.jgp.books.spark.ch03)
        org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2015-10-6' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
        SparkSession spark = SparkSession.builder()
            .appName("CSV to dataframe to Dataset<Book> and back")
            .master("local")
            .getOrCreate();
        spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY");
        or:
        SparkSession spark = SparkSession.builder()
            .appName("CSV to dataframe to Dataset<Book> and back")
            .master("local")
            .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
            .getOrCreate();
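Why does '2015-10-6' fail in the new parser? A likely explanation can be reproduced in plain Java, since the Spark migration guide states that Spark 3.0 parses datetime patterns with `java.time.format.DateTimeFormatter` under the hood, where earlier versions relied on `java.text.SimpleDateFormat`. The sketch below assumes the pattern in play is `yyyy-MM-dd` (the slide does not show it), and `DateParsingSketch` is a hypothetical class name.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical illustration of lenient (legacy) versus strict (java.time) parsing.
class DateParsingSketch {

  /** Legacy API: number fields are parsed leniently, so one-digit days pass. */
  static boolean parsesWithLegacy(String text, String pattern) {
    try {
      new SimpleDateFormat(pattern).parse(text);
      return true;
    } catch (ParseException e) {
      return false;
    }
  }

  /** Java 8 API: "dd" requires exactly two digits. */
  static boolean parsesWithJavaTime(String text, String pattern) {
    try {
      LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
      return true;
    } catch (DateTimeParseException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String text = "2015-10-6";
    System.out.println(parsesWithLegacy(text, "yyyy-MM-dd"));   // true
    System.out.println(parsesWithJavaTime(text, "yyyy-MM-dd")); // false
    System.out.println(parsesWithJavaTime(text, "yyyy-M-d"));   // true
  }
}
```

If the assumption about the pattern holds, this suggests an alternative to the LEGACY policy: a pattern with single-letter fields such as `yyyy-M-d` accepts both one- and two-digit months and days under the new parser.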
  • 30. The lakehouse is a full ecosystem Or is it an operating system? [Diagram, three layers: Data sources (streams, systems, files, other databases, TBA?); Processing & Storage (Delta Lake & Delta Engine); Outcome (business, data science, data engineering)]
  • 31. Takeaways • Apache Spark v3 is a major update, 3400+ patches • Foundation for a rich data ecosystem • Python increasingly popular, beats Scala • Cornerstone for the lakehouse concept
  • 32. Thank you! http://jgp.ai/sia Join me for DataFriday: http://jgp.ai/datafriday
  • 34. Credits • World of Watson by Jean-Georges Perrin CC BY-SA 4.0 • Digital Garage by Jean-Georges Perrin CC BY-SA 4.0 • Figs, grapes and rosehips by Marco Verch Professional Photographer and Speaker, Flickr • Soup by Valeria Boltneva from Pexels