This presentation discusses Workday's use of Apache Spark for self-service data preparation and analytics within its SaaS platform. It covers Workday's unified analytics platform powered by Spark, how Prism uses Spark for interactive data prep and publishing, and lessons learned in areas like nested SQL optimization, plan deduplication, broadcast join tuning, and case-insensitive string grouping. The presentation aims to share Workday's production experiences leveraging Spark for analytics in a multi-tenant SaaS environment.
1. Lessons Learned Using Apache Spark for Self-Service Data Prep (and More) in SaaS World
Pavel Hardak (Product Manager, Workday)
Jianneng Li (Software Engineer, Workday)
#UnifiedAnalytics #SparkAISummit
2. Safe Harbor Statement
This presentation may contain forward-looking statements for which there are risks, uncertainties, and assumptions. If the risks materialize or assumptions prove incorrect, Workday’s business results and directions could differ materially from results implied by the forward-looking statements. Forward-looking statements include any statements regarding strategies or plans for future operations; any statements concerning new features, enhancements or upgrades to our existing applications or plans for future applications; and any statements of belief. Further information on risks that could affect Workday’s results is included in our filings with the Securities and Exchange Commission, which are available on the Workday investor relations webpage: www.workday.com/company/investor_relations.php
Workday assumes no obligation for and does not intend to update any forward-looking statements. Any unreleased services, features, functionality or enhancements referenced in any Workday document, roadmap, blog, our website, press release or public statement that are not currently available are subject to change at Workday’s discretion and may not be delivered as planned or at all.
Customers who purchase Workday, Inc. services should make their purchase decisions upon services, features, and functions that are currently available.
3. Agenda
● Workday - Finance and HCM in the cloud
● Workday Platform - “Power of One”
● Prism Analytics - Powered by Apache Spark
● Production Stories & Lessons Learned
● Questions
4.
● “Pure” SaaS application suite
○ Finance and HCM
● Customers: 2,500+
○ 200+ of the Fortune 500
● Revenue: $2.82B
○ Growth: 32% YoY
[Diagram: Plan (Planning) → Execute (Financial Management, Human Capital Management) → Analyze (Prism Analytics and Reporting)]
6. One Platform
One Source for Data | One Security Model | One Experience | One Community
[Diagram: one platform comprising Business Process Framework, Object Data Model, Reporting and Analytics, Security, Integration, Machine Learning, and Cloud]
7. One Platform: Object Data Model
[Same platform diagram, highlighting the Object Data Model: Durable, Extensible, Metadata]
8. One Platform: Security
[Same platform diagram, highlighting Security: Encryption, Privacy and Compliance, Trust]
9. One Platform: Reporting and Analytics
[Same platform diagram, highlighting Reporting and Analytics: Dashboards, Distribution, Collaboration]
11. Prism Analytics
● Integrate 3rd-party data
● Data Management
● Data Preparation
● Data Discovery
● Report Publishing
[Diagram: Plan (Workday Planning) → Execute (Workday Financial Management, Workday Human Capital Management) → Analyze (Workday Prism Analytics and Reporting)]
12. Workday Prism Analytics
The full spectrum of Finance and HCM insights, all within Workday.
Workday Data + Non-Workday Data
13. Prism Analytics Workflow
● Acquisition: ingest and map data from Finance, HCM, operational systems, industry systems, legacy systems, CRM, service ticketing, surveys, point of sale, stock grants, and more
● Preparation: cleanse and transform, blend datasets, apply security permissions, publish data source
● Analysis: reporting, Worksheets, data discovery
17. Interactive Data Prep in Prism
Powered by Spark
[Screenshot: Edit Transform]
18. Data Prep Publishing in Prism
Also powered by Spark
19. Data Prep: Interactive vs. Publishing

            Interactive          Publishing
Data size   100 - 100K rows      Billions of rows
Sampling    Yes                  No
Caching     Yes                  No
Latency     Seconds              Minutes to hours
Result      Returned in memory   Written to disk
SLA         Best effort          Consistent performance
25. Implementing Additional Data Types
• Prism has a richer type system than Catalyst
• Uses StructType and StructField to implement additional data types
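The deck does not show Prism's actual encoding, but the general pattern is to lower a richer logical type onto a struct of primitive fields (what Spark models with StructType/StructField) and convert at the boundaries. A minimal pure-Python sketch of that idea, with an invented "currency" type as the example:

```python
from collections import namedtuple

# Illustrative logical type only -- not Prism's real type system.
Currency = namedtuple("Currency", ["amount", "currency_code"])

def to_struct(value):
    """Encode the logical type as a plain struct the engine understands,
    analogous to a StructType with two StructFields."""
    return {"amount": value.amount, "currency_code": value.currency_code}

def from_struct(row):
    """Decode the struct back into the logical type at the boundary."""
    return Currency(row["amount"], row["currency_code"])

usd = Currency(99.5, "USD")
assert from_struct(to_struct(usd)) == usd  # round-trips losslessly
```

The engine only ever sees the struct; the richer semantics live in the encode/decode layer.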
29. Lesson #1: Nested SQL
• SQL requires computed columns to be nested
– SELECT 1 as c1, c1 + 1 as c2; /* ✗ */
– SELECT c1 + 1 as c2 FROM (SELECT 1 as c1); /* ✓ */
• First version: one nesting per computed column
– Does not scale to 100s of columns
– Takes a long time to compile and optimize
30. Lesson #1: Example Dependency Graph
[first.name], [last.name], [income],
concat([first.name], ".", [last.name]) as [full.name],
[income] * 0.28 as [federal.tax],
[income] * 0.10 as [state.tax],
concat([full.name], "@workday.com") as [email]
[Dependency graph: base columns first.name, last.name, income; 1st level: full.name, federal.tax, state.tax; 2nd level: email]
31. Lesson #1: SQL Before Optimization
4 levels of nested SQL

select [income] * 0.10 as [state_tax], *
from (select [income] * 0.28 as [federal_tax], *
from (select concat([full.name], "@workday.com") as [email], *
from (select concat([first.name], ".", [last.name]) as [full.name], *
from (select [first.name], [last.name], [income] from base_table))))
32. Lesson #1: SQL After Optimization
2 levels of nested SQL

select concat([full.name], "@workday.com") as [email], *
from (select concat([first.name], ".", [last.name]) as [full.name],
[income] * 0.28 as [federal_tax],
[income] * 0.10 as [state_tax], *
from (select [first.name], [last.name], [income] from base_table))
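The optimization above amounts to grouping computed columns by their depth in the dependency graph: every column whose inputs are already available shares one nesting level, so the number of levels is the longest dependency chain, not the number of columns. A hypothetical sketch of the level assignment (column names from the example slide):

```python
# Assign each column a level: base columns are 0, and a computed column is
# one deeper than the deepest column it references. Columns sharing a level
# can live in the same SELECT, so nesting depth == max level.
def column_levels(deps):
    """deps maps column name -> set of columns it references."""
    levels = {}
    def level(col):
        if col not in levels:
            ds = deps.get(col, set())
            levels[col] = 0 if not ds else 1 + max(level(d) for d in ds)
        return levels[col]
    for c in deps:
        level(c)
    return levels

deps = {
    "first.name": set(), "last.name": set(), "income": set(),
    "full.name": {"first.name", "last.name"},
    "federal.tax": {"income"},
    "state.tax": {"income"},
    "email": {"full.name"},
}
lv = column_levels(deps)
assert lv["full.name"] == lv["federal.tax"] == lv["state.tax"] == 1
assert lv["email"] == 2  # two nested SELECTs suffice, not four
```

This matches the slides: four computed columns collapse from four nesting levels to two.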
34. Lesson #2: Plan Blowup
• Generated plans can have duplicate operators
• E.g. self joins and self unions
• Need to de-duplicate to improve performance
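One way to de-duplicate (a hedged sketch, not Workday's or Spark's actual code) is to hash-cons the plan tree: structurally identical subplans, such as both sides of a self-join, become one shared node that can be evaluated once.

```python
# De-duplicate a plan tree represented as nested tuples:
# (operator, child1, child2, ...). Identical subtrees are interned in
# `seen`, so a self-join ends up referencing one shared subplan object.
def dedup(plan, seen=None):
    if seen is None:
        seen = {}
    if not isinstance(plan, tuple):
        return plan  # leaf value, e.g. a table name
    canonical = (plan[0],) + tuple(dedup(c, seen) for c in plan[1:])
    return seen.setdefault(canonical, canonical)

scan = ("Scan", "base_table")
self_join = ("Join", ("Filter", scan), ("Filter", scan))
deduped = dedup(self_join)
# Both join inputs are now literally the same shared object:
assert deduped[1] is deduped[2]
```

With sharing in place, the engine (or a caching layer) can compute the duplicated branch once instead of once per occurrence.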
41. Lesson #3: Broadcast Join Review
[Diagram: before the broadcast join, Node 1 holds large-table rows (A 1, B 3, C 6, D 7) and small-table rows (AA 2, BB 5, CC 9), while Node 2 holds large-table rows (E 2, F 4, G 5, H 8) and small-table rows (EE 3, FF 8). After the broadcast, each node has the complete small table (AA 2, BB 5, CC 9, EE 3, FF 8) to join against its local large-table rows.]
42. Lesson #3: Spark Broadcast
• Spark’s broadcasting mechanism is inefficient
– Broadcast data goes through the driver
– There is no global limit on broadcast data
– Complex jobs can make the driver run out of memory
[Diagram: (1) the driver collects the broadcast data from the executors; (2) the driver sends the broadcast data to all executors]
43. Lesson #3: Disabling Broadcast Joins
• Initially disabled broadcast joins for stability
• Expectation: a small number of joins, all of them large

spark.sql.autoBroadcastJoinThreshold = -1
44. Lesson #3: Re-Enabling Broadcast Joins
• Reality: a large number of joins, many of them small
• Re-enabled broadcast joins with a low threshold
• 2-10x runtime improvement

spark.sql.autoBroadcastJoinThreshold = 1000000
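The win comes from the shape of a broadcast (map-side) hash join: the small side is hashed once and shipped to every worker, so the large side streams through in a single pass with no shuffle. A minimal illustrative sketch (names are ours, not Spark's API):

```python
# Broadcast hash join: build a hash table on the small side, then probe
# it while streaming the large side. In Spark, `built` is what gets
# broadcast to every executor.
def broadcast_hash_join(big_rows, small_rows, key):
    built = {}
    for row in small_rows:              # build phase (small side)
        built.setdefault(row[key], []).append(row)
    out = []
    for row in big_rows:                # probe phase (one pass, no shuffle)
        for match in built.get(row[key], []):
            out.append({**row, **match})
    return out

big = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
small = [{"id": 1, "name": "x"}]
assert broadcast_hash_join(big, small, "id") == [
    {"id": 1, "v": "a", "name": "x"},
    {"id": 1, "v": "c", "name": "x"},
]
```

When the small side fits under the threshold, this replaces a shuffle of both sides with one cheap broadcast, which is where the reported 2-10x speedup comes from.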
48. Lesson #4: Grouping on String Columns
Sum of billing amount per billing location:

SELECT BillingLocation,
       SUM(BillingAmount) AS TotalBillingAmount
FROM InsuranceClaims
GROUP BY BillingLocation
ORDER BY TotalBillingAmount

Input (InsuranceClaims):
BillingLocation   BillingAmount
CALIFORNIA        100000
california        50000
california        40000
Illinois          25000
Texas             15000
TeXas             60000
texas             5000

Result (case-sensitive grouping):
BillingLocation   TotalBillingAmount
CALIFORNIA        100000
california        90000
TeXas             60000
Illinois          25000
Texas             15000
texas             5000
49. Lesson #4: Grouping on String Columns
In Workday, grouping on string columns is case-insensitive:

SELECT MIN(BillingLocation) AS BillingLocation,
       SUM(BillingAmount) AS TotalBillingAmount
FROM InsuranceClaims
GROUP BY UPPER(BillingLocation)
ORDER BY TotalBillingAmount

Result (case-insensitive grouping over the same InsuranceClaims table):
BillingLocation   TotalBillingAmount
CALIFORNIA        190000
TeXas             80000
Illinois          25000
50. Lesson #4: Case-Insensitive Grouping Is Costly

GROUP BY stringField
  → GROUP BY UPPER(stringField) + MIN(stringField)

~7x regression
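The rewrite the slides describe can be sketched in a few lines: group by the uppercased key, keep MIN of the original strings as the displayed label, and sum the amounts. This pure-Python version reproduces the slide's result (it is an illustration of the semantics, not of Spark's execution):

```python
# Case-insensitive grouping with MIN(original) as the group label,
# mirroring: GROUP BY UPPER(BillingLocation) + MIN(BillingLocation).
def group_case_insensitive(rows):
    groups = {}
    for location, amount in rows:
        key = location.upper()
        label, total = groups.get(key, (location, 0))
        groups[key] = (min(label, location), total + amount)
    return sorted(groups.values(), key=lambda g: -g[1])

claims = [
    ("CALIFORNIA", 100000), ("california", 50000), ("california", 40000),
    ("Illinois", 25000), ("Texas", 15000), ("TeXas", 60000), ("texas", 5000),
]
assert group_case_insensitive(claims) == [
    ("CALIFORNIA", 190000), ("TeXas", 80000), ("Illinois", 25000),
]
```

Note that MIN here follows codepoint order, so uppercase letters sort before lowercase ones; that is why "CALIFORNIA" and "TeXas" win as labels in the slide's result table.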
51. Lesson #4: Aggregation on String Columns
For aggregation on strings, Spark uses the SortAggregate operator
➔ Modified Spark’s HashAggregate to support strings
Regression reduced to ~3x
[Chart: SortAggregate vs. HashAggregate]
52. Lesson #4: Reducing Function Evaluations
In Spark’s HashAggregate operator, functions used in the grouping expressions were evaluated twice
Regression reduced to ~2x
[Chart: UPPER evaluated twice vs. UPPER evaluated only once]
53. Lesson #4: Optimizing Spark’s UPPER Function
Precompute the uppercase form of all characters
➔ replace toUpperCase() on each char with a simple array lookup
Regression reduced to ~1.5x (and we want to reduce it further...)
[Chart: UPPER vs. optimized UPPER]
54. And one more thing...