Pavel Hardak (Product Manager, Workday)
Jianneng Li (Software Engineer, Workday)
Lessons Learned Using Apache Spark
for Self-Service Data Prep (and More)
in SaaS World
#UnifiedAnalytics #SparkAISummit
This presentation may contain forward-looking statements for which there are risks, uncertainties, and
assumptions. If the risks materialize or assumptions prove incorrect, Workday’s business results and directions
could differ materially from results implied by the forward-looking statements. Forward-looking statements
include any statements regarding strategies or plans for future operations; any statements concerning new
features, enhancements or upgrades to our existing applications or plans for future applications; and any
statements of belief. Further information on risks that could affect Workday’s results is included in our filings
with the Securities and Exchange Commission which are available on the Workday investor relations
webpage: www.workday.com/company/investor_relations.php
Workday assumes no obligation for and does not intend to update any forward-looking statements. Any
unreleased services, features, functionality or enhancements referenced in any Workday document, roadmap,
blog, our website, press release or public statement that are not currently available are subject to change at
Workday’s discretion and may not be delivered as planned or at all.
Customers who purchase Workday, Inc. services should make their purchase decisions upon services,
features, and functions that are currently available.
Safe Harbor Statement
Agenda
● Workday - Finance and HCM in the cloud
● Workday Platform - “Power of One”
● Prism Analytics - Powered by Apache Spark
● Production Stories & Lessons Learned
● Questions
● “Pure” SaaS apps suite
○ Finance and HCM
● Customers: 2,500+
○ 200+ of Fortune 500
● Revenue: $2.82B
○ Growth: 32% YoY
[Figure: Workday suite. Plan, Execute, Analyze across Planning, Financial Management, Human Capital Management, and Prism Analytics and Reporting]
One Platform
One Source for Data | One Security Model | One Experience | One Community
[Figure: platform components: Business Process Framework, Object Data Model, Reporting and Analytics, Security, Integration, Cloud, Machine Learning]
● Object Data Model
○ Durable
○ Metadata
○ Extensible
● Security
○ Encryption
○ Privacy and Compliance
○ Trust
● Reporting and Analytics
○ Dashboards
○ Collaboration
○ Distribution
Prism Analytics
● Integrate 3rd Party Data
● Data Management
● Data Preparation
● Data Discovery
● Report Publishing
[Figure: Workday suite: Workday Planning, Workday Financial Management, Workday Human Capital Management, Workday Prism Analytics and Reporting]
Workday Prism Analytics
The full spectrum of Finance and HCM insights, all within Workday.
Workday Data + Non-Workday Data
Prism Analytics Workflow
● Data sources: Finance, HCM, Operational, Industry systems, Legacy systems, CRM, Service ticketing, Surveys, Point of Sale, Stock grants, More…
● Acquisition: Map, Ingest
● Preparation: Cleanse and Transform, Blend Datasets, Apply Security Permissions, Publish Data Source
● Analysis: Reporting, Worksheets, Data Discovery
Spark in Prism Analytics
[Figure: Prism services (Query Engine, Interactive Data Prep, Data Prep Publishing), each with a Spark driver and Spark executors; YARN; HDFS / S3 storage]
Interactive Data Prep in Prism
● Powered by Spark
[Screenshot callouts: Transform Stages, Number of samples, Examples and statistics, Edit Transform]
Data Prep Publishing in Prism
● Also powered by Spark
Data Prep: Interactive vs. Publishing
            Interactive          Publishing
Data size   100 - 100K rows      Billions of rows
Sampling    Yes                  No
Caching     Yes                  No
Latency     Seconds              Minutes to hours
Result      Returned in memory   Written to disk
SLA         Best effort          Consistent performance
Same plan!
Prism Logical Model
• Superset of SQL operators
• Compiles to Spark plans through Spark SQL
• Implements custom Catalyst rules and strategies
Example: Interactive Data Prep Operators
• Prism Logical Plan: IngestSampler
• Spark Logical Plan: LogicalIngestSampler
• Spark Physical Plan: IngestSamplerExec
• RDD: IngestSamplerRDD
Prism Data Types
Implementing Additional Data Types
• Prism has a richer type system than Catalyst
• Uses StructType and StructField to implement additional data types
Example: Prism Currency Type
import org.apache.spark.sql.types._

// A currency value is modeled as a struct: a decimal amount plus a currency code.
object CurrencyType extends StructType(
  Array(
    StructField("amount", DecimalType(26, 6)),
    StructField("code", StringType)))
>> { "amount": 1000.000000, "code": "USD" }
>> { "amount": -999.000000, "code": "YEN" }
Lessons Learned
Lesson #1: Nested SQL
• SQL requires computed columns that depend on other computed columns to be nested (an alias cannot be referenced in the same SELECT list that defines it)
– SELECT 1 as c1, c1 + 1 as c2; /* ✗ */
– SELECT c1 + 1 as c2 FROM (SELECT 1 as c1); /* ✓ */
• First version: one nesting per computed column
– Does not scale to 100s of columns
– Takes a long time to compile and optimize
Lesson #1: Example Dependency Graph
[first.name], [last.name], [income],
concat([first.name], ".", [last.name]) as [full.name],
[income] * 0.28 as [federal.tax],
[income] * 0.10 as [state.tax],
concat([full.name], "@workday.com") as [email]
[Figure: dependency graph. Base columns: first.name, last.name, income. 1st level: full.name, federal.tax, state.tax. 2nd level: email]
Lesson #1: SQL Before Optimization
4 levels of nested SQL
select [income] * 0.10 as [state.tax], *
from (select [income] * 0.28 as [federal.tax], *
      from (select concat([full.name], "@workday.com") as [email], *
            from (select concat([first.name], ".", [last.name]) as [full.name], *
                  from (select [first.name], [last.name], [income] from base_table))))
Lesson #1: SQL After Optimization
2 levels of nested SQL
select concat([full.name], "@workday.com") as [email], *
from (select concat([first.name], ".", [last.name]) as [full.name],
             [income] * 0.28 as [federal.tax],
             [income] * 0.10 as [state.tax], *
      from (select [first.name], [last.name], [income] from base_table))
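One way to arrive at the two-level form is to group computed columns by dependency depth: a column's level is one more than the deepest column it references, and each level becomes a single nested SELECT. The sketch below only illustrates that idea and is not Prism's implementation. The `column_levels` helper and the dictionary encoding of the example are assumptions made here, and the dependency graph is assumed to be acyclic.

```python
from functools import lru_cache

# Computed columns from the example, mapped to the columns they reference.
# Base columns (first.name, last.name, income) never appear as keys.
DEPENDENCIES = {
    "full.name": ["first.name", "last.name"],
    "federal.tax": ["income"],
    "state.tax": ["income"],
    "email": ["full.name"],
}


def column_levels(dependencies):
    """Group computed columns by nesting level (1 = uses base columns only)."""

    @lru_cache(maxsize=None)
    def level(column):
        if column not in dependencies:  # base column, needs no nesting
            return 0
        return 1 + max((level(dep) for dep in dependencies[column]), default=0)

    levels = {}
    for column in dependencies:
        levels.setdefault(level(column), []).append(column)
    return [sorted(levels[depth]) for depth in sorted(levels)]


if __name__ == "__main__":
    for depth, columns in enumerate(column_levels(DEPENDENCIES), start=1):
        print(f"level {depth}: {', '.join(columns)}")
```

For the example columns this yields two levels, so the number of nested SELECTs follows the depth of the dependency graph rather than the number of computed columns.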
Lesson #2: Plan Blowup
• Generated plans can have duplicate operators
• E.g. self joins and self unions
• Need to de-duplicate to improve performance
Lesson #2: Deduping Prism Logical Plan
Original plan:
Union(
  Sample(k=100,
    Parse("Dataset A")),
  Join(
    Sample(k=100,
      Parse("Dataset A")),
    Parse("Dataset B")),
  Join(
    Sample(k=100,
      Parse("Dataset A")),
    Parse("Dataset B"))
)
Intermediate plan (first repeated Sample cached as ID=1):
Union(
  Cache(ID=1,
    Sample(k=100,
      Parse("Dataset A"))),
  Join(
    Cache(ID=1, ∅),
    Parse("Dataset B")),
  Join(
    Sample(k=100,
      Parse("Dataset A")),
    Parse("Dataset B"))
)
Deduplicated plan:
Union(
  Cache(ID=1,
    Sample(k=100,
      Parse("Dataset A"))),
  Cache(ID=2,
    Join(
      Cache(ID=1, ∅),
      Parse("Dataset B"))),
  Cache(ID=2, ∅)
)
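The rewrite from the original plan to the deduplicated one can be sketched as follows. This is not Prism's code: plans are encoded here as nested tuples, duplicates are detected by structural equality, and a subtree is cached only when it occurs more often than its parent, so the largest repeated subtree gets the cache. All three choices are assumptions made for illustration, and `None` stands in for the empty body (∅) of a cache reference.

```python
from collections import Counter

SAMPLE_A = ("Sample", "k=100", ("Parse", "Dataset A"))
JOIN_AB = ("Join", SAMPLE_A, ("Parse", "Dataset B"))
PLAN = ("Union", SAMPLE_A, JOIN_AB, JOIN_AB)


def count_subtrees(plan, counts=None):
    """Count how often each structurally identical subtree occurs."""
    counts = Counter() if counts is None else counts
    counts[plan] += 1
    for child in plan[1:]:
        if isinstance(child, tuple):
            count_subtrees(child, counts)
    return counts


def dedupe(plan):
    """Wrap the first copy of a repeated subtree in Cache and reference it later."""
    counts = count_subtrees(plan)
    cache_ids = {}

    def rebuild(node):
        # Keep the operator and its parameters, rewrite only child operators.
        return (node[0],) + tuple(
            rewrite(child, counts[node]) if isinstance(child, tuple) else child
            for child in node[1:]
        )

    def rewrite(node, parent_count):
        if counts[node] <= parent_count:
            return rebuild(node)  # not repeated beyond its parent
        if node in cache_ids:
            return ("Cache", f"ID={cache_ids[node]}", None)  # reuse earlier result
        cache_ids[node] = len(cache_ids) + 1
        return ("Cache", f"ID={cache_ids[node]}", rebuild(node))

    return rewrite(plan, parent_count=1)


if __name__ == "__main__":
    print(dedupe(PLAN))
```

On `PLAN` this produces the same shape as the deduplicated plan above: the Sample subtree becomes cache 1, the Join becomes cache 2, and the second Join collapses to a reference.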
Lesson #2: Deduping Spark Tree String
Original tree string:
(1) Project #2
+- (2) Join #2
   :- (3) Project #1
   :  +- (4) Join #1
   :     :- (5) Scan #1
   :     +- (6) Scan #1
   +- (7) Project #1
      +- (8) Join #1
         :- (9) Scan #1
         +- (10) Scan #1
Deduplicated tree string:
(1) Project #2
+- (2) Join #2
   :- (3) Project #1
   :  +- (4) Join #1
   :     :- (5) Scan #1
   :     +- (6) Lines 5-5
   +- (7) Lines 3-6
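The deduplicated tree string can be produced with a single pre-order walk that remembers the line range of the first copy of each subtree and prints a back-reference for later copies. The sketch below is a stand-alone illustration and does not call Spark. The `(label, children)` tuple encoding and the `dedupe_tree_string` name are assumptions; only the `+-` / `:-` connectors are copied from the output shown above.

```python
SCAN = ("Scan #1", ())
JOIN_1 = ("Join #1", (SCAN, SCAN))
PROJECT_1 = ("Project #1", (JOIN_1,))
TREE = ("Project #2", (("Join #2", (PROJECT_1, PROJECT_1)),))


def dedupe_tree_string(root):
    """Render a numbered tree string, replacing repeated subtrees by line ranges."""
    lines = []
    first_seen = {}  # subtree -> (first line, last line) of its first rendering

    def visit(node, ancestors_last):
        label, children = node
        number = len(lines) + 1
        # Indentation for outer levels, then the connector for this node.
        prefix = "".join("   " if last else ":  " for last in ancestors_last[:-1])
        if ancestors_last:
            prefix += "+- " if ancestors_last[-1] else ":- "
        if node in first_seen:
            start, end = first_seen[node]
            lines.append(f"{prefix}({number}) Lines {start}-{end}")
            return
        lines.append(f"{prefix}({number}) {label}")
        for index, child in enumerate(children):
            visit(child, ancestors_last + [index == len(children) - 1])
        first_seen[node] = (number, len(lines))

    visit(root, [])
    return lines


if __name__ == "__main__":
    print("\n".join(dedupe_tree_string(TREE)))
```

For the ten-line tree above, the walk emits seven lines, which is where the saving on large plans with many self joins comes from.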
Lesson #3: Broadcast Join Tuning
Lesson #3: Broadcast Join Review
[Figure: broadcast join across two nodes. Before: Node 1 holds A 1, B 3, C 6, D 7 and AA 2, BB 5, CC 9; Node 2 holds E 2, F 4, G 5, H 8 and EE 3, FF 8. After broadcast: both nodes hold AA 2, BB 5, CC 9, EE 3, FF 8 and join locally against their own rows.]
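As a reminder of the mechanics, the sketch below replays the two-node picture in plain Python: the small side is gathered from every node into one hash table, and each node then probes that table with its own rows without any shuffle of the large side. It assumes the numbers in the figure are join keys and the letters are row labels, which the slide does not state. The `broadcast_hash_join` helper is hypothetical and does not use Spark.

```python
# Rows are (label, join key); one list per node.
LARGE_SIDE = [
    [("A", 1), ("B", 3), ("C", 6), ("D", 7)],  # Node 1
    [("E", 2), ("F", 4), ("G", 5), ("H", 8)],  # Node 2
]
SMALL_SIDE = [
    [("AA", 2), ("BB", 5), ("CC", 9)],  # Node 1
    [("EE", 3), ("FF", 8)],             # Node 2
]


def broadcast_hash_join(large_partitions, small_partitions):
    """Join each large-side partition against a full copy of the small side."""
    # Broadcast step: collect the small side from all nodes into one hash table.
    table = {}
    for partition in small_partitions:
        for label, key in partition:
            table.setdefault(key, []).append(label)
    # Probe step: every node joins its local rows against the same table.
    return [
        [(label, key, match) for label, key in rows for match in table.get(key, [])]
        for rows in large_partitions
    ]


if __name__ == "__main__":
    for node, rows in enumerate(broadcast_hash_join(LARGE_SIDE, SMALL_SIDE), start=1):
        print(f"Node {node}: {rows}")
```

The cost is that every node holds the whole small side in memory, which is why the size of the broadcast side matters in the slides that follow.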
Lesson #3: Spark Broadcast
• Spark’s broadcasting mechanism is inefficient
– Broadcasted data goes through the driver
– No global limit on broadcasted data
– Complex jobs can make driver run out of memory
[Figure: Driver, Executor 1, Executor 2]
(1) Driver collects broadcasted data from executors
(2) Driver sends broadcasted data to executors
Lesson #3: Disabling Broadcast Joins
• Initially disabled broadcast joins for stability
• Expectation: small number of joins, all large joins
spark.sql.autoBroadcastJoinThreshold = -1
Lesson #3: Re-Enabling Broadcast Joins
• Reality: large number of joins, many are small
• Re-enabled broadcast join with a low threshold
• 2-10x runtime improvement
spark.sql.autoBroadcastJoinThreshold = 1000000
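The effect of the two settings can be sketched with a small decision function. Spark documents `spark.sql.autoBroadcastJoinThreshold` as the maximum size in bytes of a table that will be broadcast, with -1 disabling broadcasting. The function below models only that documented rule; the function name and the strategy labels are hypothetical, and Spark's real planner also considers hints, join types, and statistics quality.

```python
DISABLED = -1            # first setting on the slides
LOW_THRESHOLD = 1000000  # second setting on the slides, roughly 1 MB


def choose_join_strategy(left_bytes, right_bytes, threshold_bytes):
    """Pick a join strategy from estimated input sizes (simplified model)."""
    smaller = min(left_bytes, right_bytes)
    # With a threshold of -1 no non-negative size can qualify.
    if 0 <= smaller <= threshold_bytes:
        return "broadcast_hash_join"
    return "shuffle_join"


if __name__ == "__main__":
    sizes = (50_000_000_000, 200_000)  # a large table joined to a small one
    print(choose_join_strategy(*sizes, DISABLED))
    print(choose_join_strategy(*sizes, LOW_THRESHOLD))
```

A low threshold keeps each broadcast small, which limits the pressure on the driver while still letting the many small joins avoid a shuffle.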
Lesson #4: Case-Insensitive Grouping
Lesson #4: Spark in Query Engine
[Figure: Spark in Prism Analytics architecture diagram (Query Engine, Interactive Data Prep, Data Prep Publishing, YARN, HDFS / S3)]
Lesson #4: Grouping on String Columns
Sum of Billing Amount per Billing Location
SELECT BillingLocation,
       SUM(BillingAmount) AS TotalBillingAmount
FROM InsuranceClaims
GROUP BY BillingLocation
ORDER BY TotalBillingAmount DESC
Input:
BillingLocation   BillingAmount
CALIFORNIA        100000
california        50000
california        40000
Illinois          25000
Texas             15000
TeXas             60000
texas             5000
Result:
BillingLocation   TotalBillingAmount
CALIFORNIA        100000
california        90000
TeXas             60000
Illinois          25000
Texas             15000
texas             5000
In Workday, grouping on string columns is case insensitive
Sum of Billing Amount per Billing Location
SELECT MIN(BillingLocation) AS BillingLocation,
       SUM(BillingAmount) AS TotalBillingAmount
FROM InsuranceClaims
GROUP BY UPPER(BillingLocation)
ORDER BY TotalBillingAmount DESC
Result:
BillingLocation   TotalBillingAmount
CALIFORNIA        190000
TeXas             80000
Illinois          25000
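The semantics of the rewritten query can be checked with a few lines of plain Python: rows are grouped by the uppercased location, the amounts are summed, and the displayed label is the smallest original spelling in the group, as with `MIN(BillingLocation)`. This is a sketch of the query's behavior on the sample rows, not of how Spark executes it. The helper name is hypothetical, and Python's code-point string ordering is assumed to agree with the SQL `MIN` for these values.

```python
CLAIMS = [
    ("CALIFORNIA", 100000),
    ("california", 50000),
    ("california", 40000),
    ("Illinois", 25000),
    ("Texas", 15000),
    ("TeXas", 60000),
    ("texas", 5000),
]


def group_case_insensitive(rows):
    """Sum amounts per location, treating locations that differ only by case as equal."""
    groups = {}
    for location, amount in rows:
        key = location.upper()  # GROUP BY UPPER(BillingLocation)
        label, total = groups.get(key, (location, 0))
        # MIN(BillingLocation) picks one representative spelling per group.
        groups[key] = (min(label, location), total + amount)
    return sorted(groups.values(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for location, total in group_case_insensitive(CLAIMS):
        print(location, total)
```

The extra `UPPER` call per row and the string-valued `MIN` are what make this form more expensive than a plain `GROUP BY`.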
Lesson #4: Case-Insensitive Grouping is Costly
GROUP BY stringField
➔ GROUP BY UPPER(stringField) + MIN(stringField)
~7x regression
Lesson #4: Aggregation on String Columns
For aggregation on strings, Spark uses the SortAggregate operator
➔ Modified Spark’s HashAggregate to support strings
Regression reduced to ~3x
[Figure: SortAggregate vs. HashAggregate]
Lesson #4: Reducing Function Evaluations
In Spark’s HashAggregate operator, functions used in the grouping expressions were evaluated twice
➔ Evaluate UPPER only once instead of twice
Regression reduced to ~2x
Lesson #4: Optimizing Spark’s UPPER Function
Precompute uppercase for all characters
➔ replace toUpperCase() on each char by a simple array lookup
Regression reduced to ~1.5x (and want to decrease more...)
[Figure: UPPER vs. Optimized UPPER]
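The lookup-table idea can be sketched as follows. The table here covers only ASCII and falls back to the regular conversion for everything else, which is a simplification chosen for this example; the slide says all characters are precomputed. The sketch shows the technique only and makes no performance claim for Python, where the built-in `str.upper()` is already implemented natively.

```python
# Uppercase form of every ASCII character, computed once up front.
ASCII_UPPER = [chr(code).upper() for code in range(128)]


def lookup_upper(text):
    """Uppercase a string using a precomputed table for ASCII characters."""
    table = ASCII_UPPER
    return "".join(
        table[ord(ch)] if ord(ch) < 128 else ch.upper()  # fallback beyond ASCII
        for ch in text
    )


if __name__ == "__main__":
    for value in ("CALIFORNIA", "california", "TeXas"):
        print(value, "->", lookup_upper(value))
```

In a JVM implementation the same idea replaces a per-character method call with an array index, which is the saving described above.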
And one more thing...
Current – Single-Tenanted Spark Clusters
[Figure: Tenants 1-4, each with its own Prism instance (Prism 1-4) and its own Spark cluster, over HDFS / S3]
Future – Multi-Tenanted Spark Clusters
[Figure: Tenants 1-8 served by Prism 1-3 and three Spark clusters, over HDFS / S3]
Questions?
workday.com/careers

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 

Recently uploaded (20)

Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 

"Lessons learned using Apache Spark for self-service data prep in SaaS world"

  • 1. Pavel Hardak (Product Manager, Workday) Jianneng Li (Software Engineer, Workday) Lessons Learned Using Apache Spark for Self-Service Data Prep (and More) in SaaS World #UnifiedAnalytics #SparkAISummit
  • 2. Safe Harbor Statement
      This presentation may contain forward-looking statements for which there are risks, uncertainties, and assumptions. If the risks materialize or assumptions prove incorrect, Workday’s business results and directions could differ materially from results implied by the forward-looking statements. Forward-looking statements include any statements regarding strategies or plans for future operations; any statements concerning new features, enhancements or upgrades to our existing applications or plans for future applications; and any statements of belief. Further information on risks that could affect Workday’s results is included in our filings with the Securities and Exchange Commission which are available on the Workday investor relations webpage: www.workday.com/company/investor_relations.php
      Workday assumes no obligation for and does not intend to update any forward-looking statements. Any unreleased services, features, functionality or enhancements referenced in any Workday document, roadmap, blog, our website, press release or public statement that are not currently available are subject to change at Workday’s discretion and may not be delivered as planned or at all.
      Customers who purchase Workday, Inc. services should make their purchase decisions upon services, features, and functions that are currently available.
  • 3. Agenda
      ● Workday - Finance and HCM in the cloud
      ● Workday Platform - “Power of One”
      ● Prism Analytics - Powered by Apache Spark
      ● Production Stories & Lessons Learned
      ● Questions
  • 4.
      ● “Pure” SaaS apps suite
        ○ Finance and HCM
      ● Customers: 2,500+
        ○ 200+ of Fortune 500
      ● Revenue: $2.82B
        ○ Growth: 32% YoY
      [Diagram: Plan, Execute, Analyze across Planning, Financial Management, Human Capital Management, and Prism Analytics and Reporting]
  • 6. One Platform [diagram]: Business Process Framework, Object Data Model, Reporting and Analytics, Security, Integration Cloud, Machine Learning. One Source for Data | One Security Model | One Experience | One Community
  • 7. One Platform, Object Data Model: Durable, Metadata, Extensible
  • 8. One Platform, Security: Encryption, Privacy and Compliance, Trust
  • 9. One Platform, Reporting and Analytics: Dashboards, Collaboration, Distribution
  • 10. [Diagram: Plan, Execute, Analyze across Planning, Financial Management, Human Capital Management, and Prism Analytics and Reporting]
  • 11. Prism Analytics: Integrate 3rd Party Data, Data Management, Data Preparation, Data Discovery, Report Publishing [shown alongside Workday Planning, Workday Financial Management, Workday Human Capital Management, and Workday Prism Analytics and Reporting]
  • 12. Workday Prism Analytics: The full spectrum of Finance and HCM insights, all within Workday. Workday Data + Non-Workday Data
  • 13. Prism Analytics Workflow [diagram]
      Sources: Finance, HCM; Operational; Industry systems; Legacy systems; More…; CRM; Service ticketing; Surveys; Point of Sale; Stock grants
      Stages: Acquisition, Preparation, Analysis
      Steps: Map; Ingest; Cleanse and Transform; Blend Datasets; Apply Security Permissions; Publish Data Source; Reporting; Worksheets; Data Discovery
  • 14. Spark in Prism Analytics [diagram: three Prism instances; Query Engine, Interactive Data Prep, and Data Prep Publishing, each with a Spark Driver and Spark Executors; YARN; HDFS / S3]
  • 15. Interactive Data Prep in Prism [screenshot callouts: Transform Stages; Number of samples; Examples and statistics]
  • 16. Interactive Data Prep in Prism
  • 17. Interactive Data Prep in Prism: Powered by Spark [callout: Edit Transform]
  • 18. Data Prep Publishing in Prism: Also powered by Spark
  • 19. Data Prep: Interactive vs. Publishing
                   Interactive          Publishing
      Data size    100 - 100K rows      Billions of rows
      Sampling     Yes                  No
      Caching      Yes                  No
      Latency      Seconds              Minutes to hours
      Result       Returned in memory   Written to disk
      SLA          Best effort          Consistent performance
  • 20. Data Prep: Interactive vs. Publishing: Same plan!
  • 22. Prism Logical Model
      • Superset of SQL operators
      • Compiles to Spark plans through Spark SQL
      • Implements custom Catalyst rules and strategies
  • 23. Example: Interactive Data Prep Operators
      IngestSampler (Prism Logical Plan) → LogicalIngestSampler (Spark Logical Plan) → IngestSamplerExec (Spark Physical Plan) → IngestSamplerRDD (RDD)
  • 25. Implementing Additional Data Types
      • Prism has a richer type system than Catalyst
      • Uses StructType and StructField to implement additional data types
  • 26. Example: Prism Currency Type
      import org.apache.spark.sql.types._

      // A currency value is a struct holding a decimal amount and a currency code.
      object CurrencyType extends StructType(
        Array(
          StructField("amount", DecimalType(26, 6)),
          StructField("code", StringType)))

      >> { "amount": 1000.000000, "code": "USD" }
      >> { "amount": -999.000000, "code": "YEN" }
  • 28. Lesson #1: Nested SQL
  • 29. Lesson #1: Nested SQL
      • SQL requires computed columns to be nested
        – SELECT 1 as c1, c1 + 1 as c2; /* ✗ */
        – SELECT c1 + 1 as c2 FROM (SELECT 1 as c1); /* ✓ */
      • First version: one nesting per computed column
        – Does not scale to hundreds of columns
        – Takes a long time to compile and optimize
  • 30. Lesson #1: Example Dependency Graph
      [first.name],
      [last.name],
      [income],
      concat([first.name], ".", [last.name]) as [full.name],
      [income] * 0.28 as [federal.tax],
      [income] * 0.10 as [state.tax],
      concat([full.name], "@workday.com") as [email]
      Base columns: first.name, last.name, income
      1st level: full.name, federal.tax, state.tax
      2nd level: email
  • 31. Lesson #1: SQL Before Optimization (4 levels of nested SQL)
      select [income] * 0.10 as [state_tax], *
      from (select [income] * 0.28 as [federal_tax], *
            from (select concat([full.name], "@workday.com") as [email], *
                  from (select concat([first.name], ".", [last.name]) as [full.name], *
                        from (select [first.name], [last.name], [income]
                              from base_table))))
  • 32. Lesson #1: SQL After Optimization (2 levels of nested SQL)
      select concat([full.name], "@workday.com") as [email], *
      from (select concat([first.name], ".", [last.name]) as [full.name],
                   [income] * 0.28 as [federal_tax],
                   [income] * 0.10 as [state_tax], *
            from (select [first.name], [last.name], [income]
                  from base_table))
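The drop from four nesting levels to two follows from grouping computed columns by their depth in the dependency graph: every column whose inputs are already available can share one SELECT. A minimal Python sketch of that grouping, assuming each computed column lists the columns it reads. The function name `nesting_levels` and the input dictionary are hypothetical illustrations, not Prism's implementation:

```python
from functools import lru_cache


def nesting_levels(base_columns, computed):
    """Group computed columns by dependency depth.

    base_columns: names available directly from the base table (depth 0).
    computed: dict mapping a computed column to the columns it reads.
    Returns a list of lists; entry i holds the columns for nesting level i + 1.
    Assumes the dependency graph has no cycles.
    """
    base = set(base_columns)

    @lru_cache(maxsize=None)
    def depth(column):
        if column in base:
            return 0
        # A column sits one level above its deepest dependency.
        return 1 + max((depth(dep) for dep in computed[column]), default=0)

    by_depth = {}
    for column in computed:
        by_depth.setdefault(depth(column), []).append(column)
    return [by_depth[d] for d in sorted(by_depth)]


# Columns from the dependency graph example (hypothetical encoding).
example = {
    "full.name": ["first.name", "last.name"],
    "federal.tax": ["income"],
    "state.tax": ["income"],
    "email": ["full.name"],
}
levels = nesting_levels(["first.name", "last.name", "income"], example)
print(levels)
```

Each inner list corresponds to one SELECT in the generated SQL, so the number of nesting levels equals the longest dependency chain rather than the number of computed columns.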
  • 33. Lesson #2: Plan Blowup
  • 34. Lesson #2: Plan Blowup
      • Generated plans can have duplicate operators
      • E.g. self joins and self unions
      • Need to de-duplicate to improve performance
  • 35. Lesson #2: Deduping Prism Logical Plan
      Union(
        Sample(k=100, Parse("Dataset A")),
        Join(
          Sample(k=100, Parse("Dataset A")),
          Parse("Dataset B")),
        Join(
          Sample(k=100, Parse("Dataset A")),
          Parse("Dataset B"))
      )
  • 36. Lesson #2: Deduping Prism Logical Plan (intermediate step; original plan as on slide 35)
      Union(
        Cache(ID=1, Sample(k=100, Parse("Dataset A"))),
        Join(
          Cache(ID=1, ∅),
          Parse("Dataset B")),
        Join(
          Sample(k=100, Parse("Dataset A")),
          Parse("Dataset B"))
      )
  • 37. Lesson #2: Deduping Prism Logical Plan (final result; original plan as on slide 35)
      Union(
        Cache(ID=1, Sample(k=100, Parse("Dataset A"))),
        Cache(ID=2, Join(
          Cache(ID=1, ∅),
          Parse("Dataset B"))),
        Cache(ID=2, ∅)
      )
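The rewrite shown on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plan nodes are modeled as nested tuples whose tuple-valued fields are child plans, `None` stands in for ∅, and the names `dedupe`, `count`, and `rewrite` are hypothetical rather than Prism's actual rule.

```python
from collections import Counter


def dedupe(plan):
    """Wrap the first occurrence of a repeated subtree in Cache(id, subtree)
    and replace later occurrences with Cache(id, None)."""
    refs = Counter()

    def count(node):
        refs[node] += 1
        if refs[node] == 1:  # do not descend again into a subtree already counted
            for field in node:
                if isinstance(field, tuple):
                    count(field)

    cache_ids = {}

    def rewrite(node):
        if refs[node] > 1:
            if node in cache_ids:
                return ("Cache", cache_ids[node], None)  # reference only
            cache_ids[node] = len(cache_ids) + 1
        rebuilt = tuple(rewrite(f) if isinstance(f, tuple) else f for f in node)
        if refs[node] > 1:
            return ("Cache", cache_ids[node], rebuilt)
        return rebuilt

    count(plan)
    return rewrite(plan)


# The plan from the slides, encoded as tuples (hypothetical encoding).
sample_a = ("Sample", 100, ("Parse", "Dataset A"))
join_ab = ("Join", sample_a, ("Parse", "Dataset B"))
deduped = dedupe(("Union", sample_a, join_ab, join_ab))
print(deduped)
```

Counting references without descending into already-seen subtrees keeps `Parse("Dataset A")` from getting its own cache entry, since it is only reachable through the cached `Sample`.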
  • 38. Lesson #2: Deduping Spark Tree String
      (1) Project #2
      +- (2) Join #2
         :- (3) Project #1
         :  +- (4) Join #1
         :     :- (5) Scan #1
         :     +- (6) Scan #1
         +- (7) Project #1
            +- (8) Join #1
               :- (9) Scan #1
               +- (10) Scan #1
  • 39. Lesson #2: Deduping Spark Tree String (after deduplication; original tree as on slide 38)
      (1) Project #2
      +- (2) Join #2
         :- (3) Project #1
         :  +- (4) Join #1
         :     :- (5) Scan #1
         :     +- (6) Lines 5-5
         +- (7) Lines 3-6
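One way to produce such a shortened tree string is to remember the line range of each subtree the first time it is rendered and emit a back-reference on every repeat. The Python sketch below assumes a tree of `(label, children)` tuples and a hypothetical function name `render_deduped`; it mimics the `:-` / `+-` prefixes of the slide but is not Spark's own tree-string code.

```python
def render_deduped(root):
    """Render a (label, children) tree as numbered lines, collapsing repeated subtrees."""
    lines = []
    first_seen = {}  # subtree -> (first line, last line) of its first rendering

    def render(node, prefix, child_prefix):
        number = len(lines) + 1
        if node in first_seen:
            start, end = first_seen[node]
            lines.append(f"{prefix}({number}) Lines {start}-{end}")
            return
        label, kids = node
        lines.append(f"{prefix}({number}) {label}")
        for index, kid in enumerate(kids):
            last = index == len(kids) - 1
            render(
                kid,
                child_prefix + ("+- " if last else ":- "),
                child_prefix + ("   " if last else ":  "),
            )
        first_seen[node] = (number, len(lines))

    render(root, "", "")
    return lines


scan = ("Scan #1", ())
project_1 = ("Project #1", (("Join #1", (scan, scan)),))
tree = ("Project #2", (("Join #2", (project_1, project_1)),))
tree_lines = render_deduped(tree)
print("\n".join(tree_lines))
```

For a plan with many self joins the rendered string grows with the number of distinct subtrees instead of the number of operators.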
  • 40. Lesson #3: Broadcast Join Tuning
  • 41. Lesson #3: Broadcast Join Review
      [Diagram: before the broadcast, Node 1 holds large-table rows (A 1, B 3, C 6, D 7) and small-table rows (AA 2, BB 5, CC 9); Node 2 holds large-table rows (E 2, F 4, G 5, H 8) and small-table rows (EE 3, FF 8). After the broadcast, each node keeps its own large-table rows and holds the full small table (AA 2, BB 5, CC 9, EE 3, FF 8).]
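The review diagram can be mimicked in plain Python: collect the small table from every node into one lookup table, hand each node a copy, and let each node join its local rows without moving the large table. The sketch assumes the number next to each letter is the join key, and the function name `broadcast_join` is hypothetical.

```python
def broadcast_join(large_partitions, small_partitions):
    """Join each large-table partition against a full copy of the small table."""
    # Step 1: gather every small-table row into one lookup keyed by join key.
    lookup = {}
    for partition in small_partitions:
        for name, key in partition:
            lookup.setdefault(key, []).append(name)
    # Step 2: each node probes its local large-table rows against its copy.
    return [
        [(name, key, match) for name, key in partition for match in lookup.get(key, [])]
        for partition in large_partitions
    ]


# Rows as (name, key), taken from the diagram; index 0 is Node 1, index 1 is Node 2.
large = [[("A", 1), ("B", 3), ("C", 6), ("D", 7)], [("E", 2), ("F", 4), ("G", 5), ("H", 8)]]
small = [[("AA", 2), ("BB", 5), ("CC", 9)], [("EE", 3), ("FF", 8)]]
joined = broadcast_join(large, small)
print(joined)
```

Only the small table crosses the network, which is why the approach pays off when one side of the join is much smaller than the other.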
  • 42. Lesson #3: Spark Broadcast
      • Spark’s broadcasting mechanism is inefficient
        – Broadcasted data goes through the driver
        – No global limit on broadcasted data
        – Complex jobs can make the driver run out of memory
      [Diagram: (1) Driver collects broadcasted data from executors; (2) Driver sends broadcasted data to executors]
  • 43. Lesson #3: Disabling Broadcast Joins
      • Initially disabled broadcast joins for stability
      • Expectation: small number of joins, all large joins
      spark.sql.autoBroadcastJoinThreshold = -1
  • 44. Lesson #3: Re-Enabling Broadcast Joins
      • Reality: large number of joins, many are small
      • Re-enabled broadcast join with a low threshold
      • 2-10x runtime improvement
      spark.sql.autoBroadcastJoinThreshold = 1000000
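The effect of the two settings can be illustrated with a simplified model of the decision. `spark.sql.autoBroadcastJoinThreshold` is a size in bytes, and -1 disables automatic broadcast joins; the function `should_broadcast` below is a hypothetical stand-in for the planner's check and ignores join hints and join types.

```python
def should_broadcast(estimated_size_bytes, threshold_bytes):
    """Simplified model: broadcast a table only if its size estimate fits the threshold."""
    return 0 <= estimated_size_bytes <= threshold_bytes


small_table = 200_000     # hypothetical size estimates, in bytes
large_table = 5_000_000

print(should_broadcast(small_table, -1))         # threshold -1: never broadcast
print(should_broadcast(small_table, 1_000_000))  # low threshold: small table qualifies
print(should_broadcast(large_table, 1_000_000))  # large table still uses a shuffle join
```

A low threshold keeps each broadcast small, which limits the memory pressure on the driver while still speeding up the many small joins.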
  • 45. Lesson #4: Case-Insensitive Grouping
  • 46. Lesson #4: Spark in Query Engine [diagram: three Prism instances; Query Engine, Interactive Data Prep, and Data Prep Publishing, each with a Spark Driver and Spark Executors; YARN; HDFS / S3]
  • 48. Lesson #4: Grouping on String Columns
      Sum of Billing Amount per Billing Location
      SELECT BillingLocation, SUM(BillingAmount) AS TotalBillingAmount
      FROM InsuranceClaims
      GROUP BY BillingLocation
      ORDER BY TotalBillingAmount DESC
      Input:
        BillingLocation   BillingAmount
        CALIFORNIA        100000
        california        50000
        california        40000
        Illinois          25000
        Texas             15000
        TeXas             60000
        texas             5000
      Result:
        BillingLocation   TotalBillingAmount
        CALIFORNIA        100000
        california        90000
        TeXas             60000
        Illinois          25000
        Texas             15000
        texas             5000
  • 49. Lesson #4: Grouping on String Columns
      Sum of Billing Amount per Billing Location
      In Workday, grouping on string columns is case insensitive
      SELECT MIN(BillingLocation) AS BillingLocation, SUM(BillingAmount) AS TotalBillingAmount
      FROM InsuranceClaims
      GROUP BY UPPER(BillingLocation)
      ORDER BY TotalBillingAmount DESC
      Input:
        BillingLocation   BillingAmount
        CALIFORNIA        100000
        california        50000
        california        40000
        Illinois          25000
        Texas             15000
        TeXas             60000
        texas             5000
      Result:
        BillingLocation   TotalBillingAmount
        CALIFORNIA        190000
        TeXas             80000
        Illinois          25000
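The semantics of the case-insensitive query can be checked with a small Python sketch over the same rows: the uppercased value is the grouping key, and the smallest original spelling (by code point order, as `MIN` on strings behaves here) is the label shown for the group. The function name `group_case_insensitive` is hypothetical; results are sorted from largest to smallest total, the order in which the slide lists them.

```python
def group_case_insensitive(rows):
    """Sum amounts per location, ignoring case; label each group with its MIN spelling."""
    totals = {}
    for location, amount in rows:
        key = location.upper()  # evaluated once per row and reused as the group key
        if key in totals:
            label, total = totals[key]
            totals[key] = (min(label, location), total + amount)
        else:
            totals[key] = (location, amount)
    return sorted(totals.values(), key=lambda pair: pair[1], reverse=True)


claims = [
    ("CALIFORNIA", 100000),
    ("california", 50000),
    ("california", 40000),
    ("Illinois", 25000),
    ("Texas", 15000),
    ("TeXas", 60000),
    ("texas", 5000),
]
grouped = group_case_insensitive(claims)
print(grouped)
```

The dictionary plays the role of a hash aggregate: one pass over the rows, with no sort needed before grouping.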
  • 50. Lesson #4: Case-Insensitive Grouping is Costly
      GROUP BY stringField vs. GROUP BY UPPER(stringField) + MIN(stringField): ~7x regression
  • 51. Lesson #4: Aggregation on String Columns
      For aggregation on strings, Spark uses the SortAggregate operator
      ➔ Modified Spark’s HashAggregate to support strings
      Regression reduced to ~3x
      [Chart: SortAggregate vs. HashAggregate]
  • 52. Lesson #4: Reducing Function Evaluations
      In Spark’s HashAggregate operator, functions used in the grouping expressions were evaluated twice
      Regression reduced to ~2x
      [Chart: UPPER evaluated twice vs. UPPER evaluated only once]
  • 53. Lesson #4: Optimizing Spark’s UPPER Function
      Precompute uppercase for all characters ➔ replace toUpperCase() on each char with a simple array lookup
      Regression reduced to ~1.5x (and want to decrease more...)
      [Chart: UPPER vs. Optimized UPPER]
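The lookup idea can be shown in a few lines of Python. This sketch precomputes the table for ASCII only and falls back to `str.upper()` for everything else; the table size and the name `fast_upper` are assumptions for illustration, and the slide does not say how large the real table is. In CPython the built-in `str.upper()` is already fast, so the sketch demonstrates the technique rather than a speedup.

```python
# Built once: the uppercase form of every ASCII code point.
ASCII_UPPER = [chr(code).upper() for code in range(128)]


def fast_upper(text):
    """Uppercase text by table lookup, falling back to str.upper() outside ASCII."""
    return "".join(
        ASCII_UPPER[ord(ch)] if ord(ch) < 128 else ch.upper() for ch in text
    )


print(fast_upper("TeXas"))
```

The per-character work becomes a bounds check and an index into a list, with the general conversion kept only for characters the table does not cover.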
  • 54. And one more thing...
  • 55. Current – Single-Tenanted Spark Clusters [diagram: Prism 1 to Prism 4 each serve one tenant (Tenant 1 to Tenant 4) and each has its own Spark Cluster, over HDFS / S3]
  • 56. Future – Multi-Tenanted Spark Clusters [diagram: Tenant 1 to Tenant 8 share Prism 1 to Prism 3 and three Spark Clusters, over HDFS / S3]