Spark Development Lifecycle at Workday - ApacheCon 2020

Apache Spark Development Lifecycle @ Workday
Pavel Hardak – Eren Avsarogullari

Agenda
• What is Workday?
• “Power of One” and Prism Analytics
• How does Apache Spark fit in?
• Custom Spark Upgrade Model
• Runtime Metrics Pipeline
• What is next?

About Workday
Enterprise Business Applications for a Changing World
• FY20 Revenue $3.6B
• ~28% Y/Y Growth
• >7,700 customers
• >45% of Fortune 500
• >12,300 employees
• NASDAQ: WDAY
• Human Capital, Financials, Planning, Analytics
• Cloud native, multi-tenant
• 30% revenue re-invested in product each year
• >40 Advisory Partners
• >200 Software Partners
Planning | Financial Management | Human Capital Management | Analytics & Benchmarking

Planning | Financial Management | Human Capital Management | Analytics

One Platform
Business Process Framework | Object Data Model | Reporting and Analytics | Security | Integration | Cloud | Machine Learning
One Source for Data | One Security Model | One Experience | One Community

One Platform: Object Data Model
Durable | Metadata | Extensible

One Platform: Security
Encryption | Privacy and Compliance | Trust

One Platform: Reporting and Analytics
Exploratory | Descriptive | Augmented

The Leading Enterprise Cloud for Finance and HR
• 37 Million+ workers
• 100 Billion+ transactions per year
• 96.1% of transactions < 1 second
• 99.9% actual availability
• 200+ companies
• #1 Future 50, Fortune
• #2 40 Best Workplaces in Technology, Fortune
• 10 Thousand+ certified resources in the ecosystem

Workday Maintains Your Data Gravity
Financial | Employees | GL | HR & Payroll | Third-Party HR & FIN | Industry & Homegrown | CRM | Marketing | Service | Subsidiaries | Contract Labor

Workday Prism Analytics
The full spectrum of workforce, financial, and operational insights, all within Workday.
Workday Data | Non-Workday Data

Prism Analytics Momentum - 100% YoY growth
Over 500 Prism Analytics Customers

Apache Spark as foundational technology
Accounting Center | People Analytics | DBFR Analytics Platform | Cosmos | DD4A
Prism UI and APIs
Table Ingestion | Data Prep | Examples Engine | Lens Build Engine | Query Engine and Mercury
Workday Spark Runtime Engine
Compute (YARN) and Storage (HDFS/S3)

Prism Tenants - Deployment (simplified)
Tenant 01 | Prism 01 | Spark Cluster
Tenant 02 | Prism 02 | Spark Cluster
Tenant 03 | Prism 03 | Spark Cluster
Tenant 04 | Prism 04 | Spark Cluster
HDFS / S3

Workday in the Cloud
Sites: ASH | PDX | ATL | DUB | AMS | ORE | MTL | COL
Site roles shown on the map: PROD & NPRD | ENG | SALES | NPRD | PROD | DR for PDX | DR for ASH | DR for DUB

Prism-enabled Tenant - Today
Diagram components: Prism | Data Prep Interactive | Lens Build Phase 1 | Lens Build Phase 2 | Query Engine | ADS | Spark Drivers and Spark Executors | YARN | HDFS / S3

Workday Spark = Apache Spark ++
Apache Spark
• Autonomous
• Operational Stability
• Core Stability
• Complex Application logic as Spark Plans
• Performance & Scalability for batch processing
• Serviceability
• Multi-tenancy
• Ingest Latency
• Interactive Query Performance

With this scale, complexity, dependencies…
How can you do Spark version upgrades?

Custom Spark Upgrade Model
Spark upgrade challenges:
‒ high number of tenants,
‒ long-running Spark applications,
‒ progressive roll-out,
‒ rollback case,
‒ maintaining a custom Spark fork
Previous Approach: Spark single-version support against a single repo
Custom Repo | Spark Version
New Approach: Spark multi-version support against a single repo
Custom Repo | Shim API | Spark Current Version + Spark Next Version
This upgrade model is not specific to Spark upgrades, so it can be applied to any internal or external API upgrade that faces these kinds of challenges.
This upgrade model is also used for major and minor Spark version upgrades.

Custom Spark Release Preparation
• Remove PII Data from Logs: Obfuscation of Spark query plans and DataFrame schemas.
• Catalyst Optimizer: Additional optimization rules for aggregations and large case statements.
• Extension for Physical Plan: Enable correlation between Physical Operators and their runtime metrics.
• REST APIs: SQL REST API improvements to query and aggregate metrics at the physical operator level.
• Benchmark Module: Additional module to run benchmark tests on newly introduced Spark patches by using standard TPC-H and custom queries.
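The schema obfuscation item above can be illustrated with a minimal sketch. The slides do not describe how the obfuscation is implemented, so everything here is an assumption for illustration: the function names, the `col_` token format, the salt, and the use of SHA-256 are all hypothetical. The idea shown is only that each column name is replaced by a stable token before the schema reaches a log line.

```python
import hashlib


def obfuscate_name(name: str, salt: str = "demo-salt") -> str:
    """Map a column name to a stable, non-reversible token (hypothetical scheme)."""
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
    return "col_" + digest[:8]


def obfuscate_schema(schema: dict) -> dict:
    """Replace column names with tokens; column types are kept as-is."""
    return {obfuscate_name(name): dtype for name, dtype in schema.items()}


# Hypothetical schema: column name -> type string.
schema = {"employee_ssn": "string", "salary": "decimal(18,2)"}
masked = obfuscate_schema(schema)
print(masked)
```

Because the token is deterministic, the same column gets the same token across log lines, which keeps logs comparable without exposing the original name.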

Shim API
SparkShim Interface
‒ SparkShimImpl for Spark v2.3.0
‒ SparkShimImpl for Spark v2.4.4
Spark API diffs between the two versions may introduce compile-time issues (e.g. invalid type) and/or runtime issues (e.g. NoSuchMethodError).
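The shape of the Shim API can be sketched as one interface with one implementation per Spark version, chosen at runtime. The original is a JVM codebase, so this Python sketch only mirrors the structure. The method `describe_plan` and the function `load_shim` are hypothetical stand-ins for whatever version-specific calls the real shim wraps.

```python
from abc import ABC, abstractmethod


class SparkShim(ABC):
    """Interface the application codes against, independent of the Spark version."""

    @abstractmethod
    def spark_version(self) -> str:
        ...

    @abstractmethod
    def describe_plan(self, plan: str) -> str:
        """Hypothetical operation whose underlying Spark API differs by version."""


class SparkShimImplV230(SparkShim):
    def spark_version(self) -> str:
        return "2.3.0"

    def describe_plan(self, plan: str) -> str:
        return f"[2.3.0 API] {plan}"


class SparkShimImplV244(SparkShim):
    def spark_version(self) -> str:
        return "2.4.4"

    def describe_plan(self, plan: str) -> str:
        return f"[2.4.4 API] {plan}"


# Registry of available shim implementations, keyed by Spark version.
_SHIMS = {"2.3.0": SparkShimImplV230, "2.4.4": SparkShimImplV244}


def load_shim(runtime_version: str) -> SparkShim:
    """Return the shim matching the Spark version found at runtime."""
    try:
        return _SHIMS[runtime_version]()
    except KeyError:
        raise ValueError(f"no shim registered for Spark {runtime_version}") from None


shim = load_shim("2.4.4")
print(shim.describe_plan("Filter -> Scan"))
```

Application code only ever sees `SparkShim`, so version-specific API differences stay inside the two implementation classes.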

Compile-time & Runtime Version Selections
Selected Spark versions by classpath types:
Classpath | Types | Description
Compile-Time | compileClasspath + testCompileClasspath | Spark compile-time version is the current version.
Runtime | runtimeClasspath + testRuntimeClasspath | Spark runtime version is selected by feature toggle as current or next version.
[Slide image: sample Gradle build script snippet selecting the Spark and Shim compile-time and runtime classpath versions]
Feature Toggle is used to select the Spark version at:
- Build Time (runtime version selection for classpath)
- Test Pipelines (to run UT, IT and Perf Tests by Spark version)
- Environment (to enable Spark version at env level: test, preprod or prod)
Shim API artifacts are shipped in addition to Spark artifacts (by version)
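The selection rule above can be written down as a small function. This is a sketch of the rule only, not the Gradle build script from the slide. The function name and the toggle parameter are hypothetical, and treating 2.3.0 as "current" and 2.4.4 as "next" is an assumption based on the two shim implementations named earlier.

```python
CURRENT_VERSION = "2.3.0"  # assumption: the "current" Spark version
NEXT_VERSION = "2.4.4"     # assumption: the "next" Spark version


def select_spark_versions(use_next_spark: bool) -> dict:
    """Return the Spark version for each classpath type.

    Compile-time classpaths always use the current version; runtime
    classpaths follow the feature toggle.
    """
    runtime = NEXT_VERSION if use_next_spark else CURRENT_VERSION
    return {
        "compileClasspath": CURRENT_VERSION,
        "testCompileClasspath": CURRENT_VERSION,
        "runtimeClasspath": runtime,
        "testRuntimeClasspath": runtime,
    }


print(select_spark_versions(use_next_spark=True))
```

Keeping the compile-time version fixed while toggling only the runtime version is what makes the shim necessary: code compiled against the current API must still run against the next one.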

Verification & Progressive Roll-out & Cleanup
Verification Phase
Verify the following test pipelines against both Spark versions:
• Automated Regression Testing: Running Unit & Integration Test Pipelines
• Performance Testing:
‒ Spark Benchmark Pipeline: Spark current vs. new version perf tests (executing standard TPC-H and custom queries) + Hadoop
‒ End2End Perf Pipeline: Custom applications + Spark + Hadoop
Progressive Roll-out Phase
WAVE I
Scope: Single Tenant (Internal)
Duration: 2 Weeks
WAVE II
Scope: Multiple Tenants (Impl / NonProd)
Duration: 2 Weeks
WAVE III
Scope: All Tenants (Internal/Impl/Prod)
Duration: 4 Weeks
Cleanup Phase
• Previous Spark version:
‒ Fork,
‒ Artifacts from Artifactory / mvn repository
• Shim API
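The wave durations listed for the roll-out can be turned into a simple schedule. This sketch assumes the waves run back to back with no gaps and counts weeks from 0 at the start of WAVE I; the slides state only scopes and durations, so the helper names and the week numbering are hypothetical.

```python
# (name, scope, duration in weeks), taken from the roll-out slide.
WAVES = [
    ("WAVE I", "Single Tenant (Internal)", 2),
    ("WAVE II", "Multiple Tenants (Impl / NonProd)", 2),
    ("WAVE III", "All Tenants (Internal/Impl/Prod)", 4),
]


def wave_schedule(waves=WAVES):
    """Return (name, scope, start_week, end_week) with end_week exclusive."""
    schedule, start = [], 0
    for name, scope, weeks in waves:
        schedule.append((name, scope, start, start + weeks))
        start += weeks
    return schedule


def wave_for_week(week: int, waves=WAVES):
    """Return the wave active in the given week, or None once roll-out is done."""
    for name, _scope, start, end in wave_schedule(waves):
        if start <= week < end:
            return name
    return None


for name, scope, start, end in wave_schedule():
    print(f"{name}: weeks {start}-{end - 1}, scope: {scope}")
```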

Spark SQL Engine - Query Planning & Execution
Inputs: SQL | Dataset | DataFrame
Logical Planning: Unresolved Logical Plan -> (Analysis) -> Logical Plan -> (Optimizations) -> Optimized Logical Plan
Physical Planning: (Physical Plans Generation) -> Physical Plan -> (Cost Model) -> Selected Physical Plan
Execution: DAG Execution -> SQL Metrics
Application | Job | Stage | Task
Spark UI | Rest APIs | Event Logs

Runtime Metrics Pipeline Architecture
Components: Proton (Application Server) | Spark Applications (Data Acquisition, Data Preparation, Query Engine) | Hadoop Cluster | HDFS / S3 | Spark History Server | Stats App | Data Warehouse
Spark Rest APIs
• Application
• Job
• Stage
• Task
• Executors
• SQL (New)
Spark Hive Tables
• app_metrics
• job_metrics
• stage_metrics
• task_metrics
• executor_metrics
• sql_metrics
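One way to picture the pipeline is a routing table from each Hive table to the Spark REST endpoint that could feed it. The endpoint paths below follow the Spark monitoring REST API as I understand it (the `/sql` endpoint requires Spark 3.1.0 or later). The pairing of endpoints with tables is inferred from the two parallel lists on the slide, and the base URL (History Server on `localhost:18080`) and the helper name are assumptions.

```python
# Assumption: a Spark History Server listening on its default port.
BASE_URL = "http://localhost:18080/api/v1"

# Hive table -> REST endpoint template (pairing inferred from the slide).
ENDPOINTS = {
    "app_metrics": "/applications",
    "job_metrics": "/applications/{app_id}/jobs",
    "stage_metrics": "/applications/{app_id}/stages",
    "task_metrics": "/applications/{app_id}/stages/{stage_id}/{stage_attempt_id}/taskList",
    "executor_metrics": "/applications/{app_id}/executors",
    "sql_metrics": "/applications/{app_id}/sql",
}


def endpoint_url(table: str, base_url: str = BASE_URL, **ids) -> str:
    """Build the REST URL whose response would populate the given table."""
    return base_url + ENDPOINTS[table].format(**ids)


print(endpoint_url("sql_metrics", app_id="app-20200929-0001"))
print(endpoint_url("task_metrics", app_id="app-20200929-0001",
                   stage_id=3, stage_attempt_id=0))
```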

New Spark SQL Rest API [coming with v3.1.0]
New SQL Rest Endpoints

Comparison of new Spark SQL Rest API JSON Outputs
Improved Version | Older Version (Cherry-picked from OSS)
Improvements
1. Correlation between physical operators and their runtime metrics
2. wholeStageCodegenId support across multiple physical operators
3. Normalization of metric values to be able to run aggregations
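Improvement 3 matters because display-formatted metric strings cannot be summed or averaged directly. The slides do not show the actual normalization rules, so the sketch below is an assumption: it handles only two hypothetical input shapes, counts with thousands separators and durations with a unit suffix, and converts them to plain numbers.

```python
import re

# Conversion factors to milliseconds for the unit suffixes handled here.
_DURATION_TO_MS = {"ms": 1.0, "s": 1000.0, "m": 60000.0, "h": 3600000.0}
_DURATION_RE = re.compile(r"\s*([\d.,]+)\s*(ms|s|m|h)\s*")


def normalize_count(value: str) -> int:
    """Turn a formatted count such as '3,061' into an integer."""
    return int(value.strip().replace(",", ""))


def normalize_duration_ms(value: str) -> float:
    """Turn a formatted duration such as '1.5 s' into milliseconds."""
    match = _DURATION_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    number, unit = match.groups()
    return float(number.replace(",", "")) * _DURATION_TO_MS[unit]


print(normalize_count("3,061"))
print(normalize_duration_ms("1.5 s"))
```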

Sample Queries on Spark SQL Metrics
• What are the top 25 tenants running Join, Filter, Sort (etc.) operations?
• What are the most used operations by tenant, application, date?
• What is the total number of operations by tenant, application, date?
• ...
File Scan Operation
• What is the total number of loaded input/output rows by file type, tenant, application, date?
• What is the number of files by file type, tenant, application, date?
• What is the total scan time and total metadata time (min, med, max) by file type, tenant, application, date?
• ...
Join
• What are the top 25 tenants having max broadcasted data size (GB)?
• What is the total number of joins, BroadcastHashJoin or SortMergeJoin, across all tenants by day?
• What are the top 25 tenants having max time to collect during broadcast (minutes)?
• What are the top 25 tenants having max time to broadcast (minutes) or to build during broadcast (minutes)?
• ...
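One of the join questions above (top tenants by max broadcasted data size) can be sketched as a SQL aggregation. The real `sql_metrics` table lives in Hive and its schema is not shown in the slides, so the column names, the metric name `data_size_bytes`, and the sample rows below are all hypothetical. An in-memory SQLite table stands in for Hive only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per operator metric.
conn.execute(
    "CREATE TABLE sql_metrics ("
    "tenant TEXT, day TEXT, operator TEXT, metric_name TEXT, metric_value REAL)"
)
conn.executemany(
    "INSERT INTO sql_metrics VALUES (?, ?, ?, ?, ?)",
    [
        ("tenant_a", "2020-09-01", "BroadcastHashJoin", "data_size_bytes", 4.0e9),
        ("tenant_b", "2020-09-01", "BroadcastHashJoin", "data_size_bytes", 1.0e9),
        ("tenant_a", "2020-09-01", "SortMergeJoin", "output_rows", 100),
        ("tenant_c", "2020-09-02", "BroadcastHashJoin", "data_size_bytes", 2.5e9),
    ],
)


def top_tenants_by_broadcast_size(connection, limit=25):
    """Top tenants by their largest broadcasted data size, in GB."""
    query = """
        SELECT tenant, MAX(metric_value) / 1e9 AS max_gb
        FROM sql_metrics
        WHERE operator = 'BroadcastHashJoin'
          AND metric_name = 'data_size_bytes'
        GROUP BY tenant
        ORDER BY max_gb DESC
        LIMIT ?
    """
    return connection.execute(query, (limit,)).fetchall()


print(top_tenants_by_broadcast_size(conn))
```

Queries like this only work when metric values are stored as numbers, which is why the normalization of metric values is listed among the REST API improvements.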

Correlation between Physical Operators & SQL Metrics
• We also integrated our physical plans with runtime SQL metrics.
• This lets us correlate Physical Operators with their Runtime Metrics in application logs for troubleshooting and debugging.
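The correlation can be sketched as flattening one SQL execution into rows that pair each physical operator with each of its metrics. The field names (`nodes`, `nodeId`, `nodeName`, `wholeStageCodegenId`, `metrics`) follow the Spark 3.1.0 SQL REST API output as I understand it and should be checked against the Spark monitoring documentation. The sample payload and the output column names are made up for illustration.

```python
def flatten_execution(execution: dict) -> list:
    """Return one row per (physical operator, metric) pair."""
    rows = []
    for node in execution.get("nodes", []):
        for metric in node.get("metrics", []):
            rows.append({
                "execution_id": execution.get("id"),
                "node_id": node.get("nodeId"),
                "node_name": node.get("nodeName"),
                # Absent when the operator is not part of whole-stage codegen.
                "whole_stage_codegen_id": node.get("wholeStageCodegenId"),
                "metric_name": metric.get("name"),
                "metric_value": metric.get("value"),
            })
    return rows


# Illustrative payload only; not captured from a real Spark application.
sample = {
    "id": 0,
    "nodes": [
        {"nodeId": 1, "nodeName": "Filter", "wholeStageCodegenId": 1,
         "metrics": [{"name": "number of output rows", "value": "3"}]},
        {"nodeId": 2, "nodeName": "Scan parquet",
         "metrics": [{"name": "number of files read", "value": "2"},
                     {"name": "number of output rows", "value": "10"}]},
    ],
}

for row in flatten_execution(sample):
    print(row)
```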

Backported Patches to Spark OSS Repo [v3.1.0]
Developed patches were also backported to OSS repo for community usage:
•[SPARK-31440][SQL] Improve SQL Rest API
https://github.com/apache/spark/pull/28208
•[SPARK-32548][SQL] - Add Application attemptId support to SQL Rest API
•[SPARK-31566][SQL][DOCS] Add SQL Rest API Documentation

What is next?
• Spark 3.x Upgrade (+ Scala, JDK, Hadoop)
• Performance, Troubleshooting and Debugging Improvements
• Multi-Tenancy Support
Spark 3.0 introduced the following features:
‒ Adaptive Query Execution (SPARK-31412)
‒ Dynamic Partition Pruning (SPARK-11150)
‒ Scala 2.12 Support (SPARK-26132)
‒ JDK 11 Support (SPARK-24417)
‒ Hadoop 3 Support (SPARK-23534)

Prism Deployment - Today
Tenant 01 | Prism 01 | Spark Cluster
Tenant 02 | Prism 02 | Spark Cluster
Tenant 03 | Prism 03 | Spark Cluster
Tenant 04 | Prism 04 | Spark Cluster
HDFS / S3

Prism Deployment - “Multiverse”
Tenants: Tenant 01 | Tenant 02 | Tenant 03 | Tenant 04 | Tenant 05 | Tenant 06 | Tenant 07 | Tenant 08
Prism 01 | Prism 02 | Prism 03
Spark Cluster | Spark Cluster | Spark Cluster
HDFS / S3

Thank You!
Q & A
