At wetter.com we build analytical B2B data products and make heavy use of Spark and AWS technologies for data processing and analytics. I explain why we moved from AWS EMR to Databricks and Delta and share our experiences from different angles such as architecture, application logic and user experience. We will look at how security, cluster configuration, resource consumption and workflow changed by using Databricks clusters, as well as how using Delta tables simplified our application logic and data operations.
3. About us
METEONOMIQS - B2B Data Solutions by wetter.com - https://www.meteonomiqs.com
CONSUMER: #1 B2C weather portal in DACH with up to 20 million monthly unique users (AGOF)
MEDIA: Full service, cross-media; >50 daily TV/video productions
DATA: Weather and geo-based solutions for advertisers, FMCG & retailers …
• Weather Data Impact
• Weather Based Advertising
• Weather Driven Forecasting
• Footfall & Location Insights
• Weather API
4. Example product: GfK Retail Crowd Monitor (in cooperation with GfK)
Comparison of visit frequencies in major German city centers
Carsten Herbe - Technical Lead Data Products
» Architecture, Dev & Ops for Infrastructure & Data Pipelines
» Today: #aws #spark #databricks
» Past: #hadoop #kafka #dwh #bi
6. High level architecture & tech stack
How our world looked before Databricks
[Architecture diagram. Stages: ingestion (Ingestions API), data analytics, serving (REST API, Athena), delivery (customer, partner). Data on S3: RAW data, CLEAN data, CALC data. Processing: EMR cleanse, EMR calc, EMR calc & aggregate, EMR analyze; Athena for adhoc & reports. Supporting services: AWS Glue Data Catalog, StepFunctions, CloudWatch, Lambda.]
7. Motivation for moving to Databricks & Delta
It’s not about performance but productivity
• GDPR deletes in minutes without downtime instead of hours/days with downtime
• Data corrections / enhancements without downtime instead of whole days of downtime
Delta
• Improved usability & productivity using Databricks Notebooks
• Cluster Sharing (80% of EMR costs were from dedicated dev & analytics clusters)
Databricks
• Improve collaboration between Data Scientists and Data Engineers / Dev Team
• Replace custom solution (that has been started)
MLflow
8. Welcome to the world of Databricks & Delta
Infrastructure view
10. Architecture: How Databricks fits in
Just plug & play?
[Architecture diagram. Same stages as before: ingestion (Ingestions API), data analytics, serving (REST API, Athena), delivery (customer, partner). Data on S3: RAW data, CLEAN data, CALC data. Processing now on Databricks: cleanse, calc, calc & aggregate, analyze. Unchanged: AWS Glue Data Catalog, StepFunctions, CloudWatch, Lambda, Athena for adhoc & reports.]
11. Databricks Workspaces
We use one Workspace as all our projects already share one AWS account and are logically separated
• One workspace for all stages/projects
• must be created manually by Databricks
Single classic
Workspace
• dedicated workspaces for stages/projects
• must all be created manually by Databricks
• No SSO means user management per workspace
Multiple classic
Workspaces
• dedicated workspaces for stages/projects
• use the recently published Account API
• you can create workspaces yourself (check out the Databricks Terraform provider)
Multiple
Workspaces
with Account API
12. Workspace setup: what Databricks requires & creates
We could reuse our IAM policies and roles with slight modifications.
• S3 bucket for the Databricks workspace (e.g. notebooks)
• IAM role to be able to deploy from Databricks account into your account
• EC2 instance role(s) for deploying/starting a Databricks cluster
Requires
• VPC & related AWS resources
• you cannot reuse an existing VPC
• so you must do new VPC peerings
• NOTE: this changes with the new Account API
• you can provide your specific IP range to Databricks before workspace creation
Creates
13. Security: Combining Databricks & AWS IAM
We can now share one cluster per project, and later, with SSO & IAM passthrough, just one cluster in total
• Each user must have a valid mail address → same for technical users!
• You can create tokens for users → API access
• You can restrict access to clusters based on user or group
• launch (both analytical & job) clusters only with specific EC2 instance role
Users, groups,
permissions
• Use Databricks cluster with your AWS IAM role
• requires SSO
• allows you to share one cluster between projects
IAM
passthrough
14. Cluster configurations
init scripts, instance types, spot instances
• Init scripts (pip install …) quite similar to EMR init scripts
• one EC2 instance type for workers
• no instance fleets as with EMR
• mix of on-demand and spot instances (with fallback to on-demand)
• for analytics & dev we use 1 on-demand for driver and spot otherwise
• Autoscaling for shared (dev) clusters
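The cluster settings above can be sketched as a cluster specification in the shape used by the Databricks Clusters API. This is a minimal sketch, not our actual configuration: the runtime version, instance type, region, bucket path and instance profile ARN are placeholder assumptions.

```python
def build_cluster_spec(name, instance_profile_arn, init_script_path,
                       min_workers=1, max_workers=4):
    """Return a cluster spec dict shaped like a Databricks Clusters API request.

    All concrete values (runtime, node type, region) are illustrative assumptions.
    """
    return {
        "cluster_name": name,
        "spark_version": "7.3.x-scala2.12",  # assumption: use your own runtime
        "node_type_id": "i3.xlarge",         # one EC2 instance type for all workers
        "autoscale": {                        # autoscaling for shared (dev) clusters
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
        "aws_attributes": {
            "first_on_demand": 1,                  # driver on-demand, spot otherwise
            "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand
            "instance_profile_arn": instance_profile_arn,
        },
        "init_scripts": [
            # init script (e.g. containing "pip install ...") stored on S3
            {"s3": {"destination": init_script_path, "region": "eu-central-1"}}
        ],
    }


spec = build_cluster_spec(
    name="dev-shared",
    instance_profile_arn="arn:aws:iam::123456789012:instance-profile/my-project",
    init_script_path="s3://my-bucket/init/install_packages.sh",
)
print(spec["aws_attributes"]["availability"])
```

Keeping the spec in one function makes it easy to reuse the same settings for the shared dev cluster and for job clusters with different worker limits.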
16. From Parquet tables to Delta tables
Conversion was easy. Some regular housekeeping required.
• CONVERT TO DELTA mytab PARTITIONED BY (...)
• requires some compute resources to analyze existing data
Parquet to
Delta
• UPDATE/DELETE generates a lot of small files
• → configure as table property → done as part of DML
• or → periodical housekeeping
OPTIMIZE
• Improves selective queries with WHERE on one or more columns
• Cannot be configured as table property → housekeeping
• we keep PARTITION BY substr(zip,1,1) + sortWithinPartitions('h3_index')
ZORDER
• Deleted files stay on S3 but are referenced as old versions (→ Delta TimeTravel)
• After retention period they should be deleted using VACUUM
• → periodical housekeeping required
VACUUM
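The housekeeping steps above could be collected in a small helper that only builds the SQL statements. This is a sketch under assumptions: the table name, the Z-order column and the 168-hour retention are examples, and the auto-optimize table properties are the Databricks ones as far as we know them. On a cluster each statement would be passed to spark.sql(...).

```python
def auto_optimize_statement(table):
    """Table properties so that compaction happens as part of DML (Databricks)."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "delta.autoOptimize.optimizeWrite = true, "
        "delta.autoOptimize.autoCompact = true)"
    )


def housekeeping_statements(table, zorder_cols=(), retain_hours=168):
    """Periodic housekeeping: compact small files, then drop unreferenced files."""
    if zorder_cols:
        optimize = f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})"
    else:
        optimize = f"OPTIMIZE {table}"
    vacuum = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    return [optimize, vacuum]


# Hypothetical table and column names, for illustration only.
for stmt in housekeeping_statements("mytab", zorder_cols=["h3_index"]):
    print(stmt)  # on a cluster: spark.sql(stmt)
```

VACUUM runs last so that files made obsolete by OPTIMIZE become eligible for deletion once the retention period has passed.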
17. AWS Glue Data Catalog vs Databricks Hive Metastore
We continued with Glue. Now we have two external tables: one for Delta, one for Athena …
Glue Data Catalog
• can be accessed by multiple Databricks workspaces
• can be accessed by other tools like Athena
• officially not supported with Databricks Connect
Databricks Metastore
• can only be accessed by the Databricks workspace that owns it
• cannot be used by Athena
• supported with Databricks Connect
Native Delta Tables
• CREATE TABLE … USING DELTA [LOCATION …]
• Full Delta features: DML, TimeTravel
• Used by our Spark applications and interactive analytics
“Symlink” Tables
• GENERATE symlink_format_manifest FOR TABLE …
• CREATE EXTERNAL TABLE … LOCATION ‘…/symlink_format_manifest/’
• Only SELECT
• Used for reporting with Tableau & Athena
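The two statements behind a "Symlink" table can be sketched as follows. Table names, columns and the S3 location are hypothetical. The SerDe and input/output format classes, and the manifest folder name _symlink_format_manifest, follow the Delta Lake documentation for Presto/Athena as far as we know. The sketch assumes an unpartitioned table; a partitioned table additionally needs a PARTITIONED BY clause and a partition repair in Athena.

```python
def symlink_table_statements(delta_table, athena_table, columns, delta_location):
    """Build the manifest generation (Spark) and the external table DDL (Athena).

    columns: list of (name, type) tuples, e.g. [("zip", "string")].
    """
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    manifest_location = delta_location.rstrip("/") + "/_symlink_format_manifest/"

    # Run in Spark on Databricks: writes manifest files next to the Delta data.
    generate = f"GENERATE symlink_format_manifest FOR TABLE {delta_table}"

    # Run in Athena: read-only table that follows the manifest files.
    create = (
        f"CREATE EXTERNAL TABLE {athena_table} ({column_list})\n"
        "ROW FORMAT SERDE "
        "'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT "
        "'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT "
        "'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{manifest_location}'"
    )
    return generate, create


generate_sql, create_sql = symlink_table_statements(
    delta_table="mydb.visits",
    athena_table="mydb.visits_athena",
    columns=[("zip", "string"), ("h3_index", "string"), ("visits", "bigint")],
    delta_location="s3://my-bucket/calc/visits",
)
print(generate_sql)
print(create_sql)
```

The manifest has to be regenerated after the Delta table changes, otherwise Athena keeps reading the old set of files.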
18. Workflows (1/2): From single steps on EMR …
Detailed graphical overview of the workflow
StepFunctions before Databricks
Create or use existing EMR cluster (latter is quite
handy for dev)
Lambda for adding step to EMR cluster
access control by IAM role
One step = one Spark job
Good (graphical) overview of pipeline
19. Workflows (2/2): … to bulk steps with Databricks Job clusters
cheaper Job clusters vs more expensive All-Purpose Clusters
StepFunctions with Databricks
You can run only one job on a Job cluster
Using an All-Purpose cluster is more expensive.
One Job cluster for each step: a lot of time spent on cluster creation → painful for dev
One step = one sequence of Spark jobs → workflow code on driver required!
Lambda for launching Spark job on Databricks cluster
Databricks token for authentication
Only partial picture of the pipeline
(but Python workflow is easier to implement)
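A Lambda function that launches a Spark job on a new Job cluster could look roughly like this. It is a sketch, not our production code: it assumes the Jobs API endpoint /api/2.0/jobs/runs/submit with a spark_python_task, and the environment variable names, script path and event fields are hypothetical.

```python
import json
import os
import urllib.request


def build_run_submit_request(host, token, run_name, python_file, parameters,
                             cluster_spec):
    """Build (but do not send) the HTTP request that submits a one-time run."""
    payload = {
        "run_name": run_name,
        "new_cluster": cluster_spec,  # Job cluster created just for this run
        "spark_python_task": {
            "python_file": python_file,  # driver script holding the workflow code
            "parameters": parameters,
        },
    }
    return urllib.request.Request(
        url=host.rstrip("/") + "/api/2.0/jobs/runs/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # Databricks token
            "Content-Type": "application/json",
        },
        method="POST",
    )


def lambda_handler(event, context):
    """Entry point called by StepFunctions; event fields are assumptions."""
    request = build_run_submit_request(
        host=os.environ["DATABRICKS_HOST"],    # hypothetical variable name
        token=os.environ["DATABRICKS_TOKEN"],  # hypothetical variable name
        run_name=event["step_name"],
        python_file=event["python_file"],
        parameters=event.get("parameters", []),
        cluster_spec=event["cluster_spec"],
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())  # expected to contain the run id
```

Separating request construction from sending keeps the Lambda testable without network access; StepFunctions can then poll the returned run until it finishes.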
20. Adapting our (Spark) applications
Hardly any work required in existing Spark applications
• Athena cannot write to Delta
• Migrate to Spark
Athena steps
• replace with DML (DELETE/MERGE/UPDATE)
• instead of deleting objects in mytab/year=2020/month=10/day=1:
DELETE FROM mytab
WHERE year=2020 AND month=10 AND day=1
• works even faster than S3 deletes!
S3 object
deletions
• Replace df.write.mode("append").format("parquet")
with df.write.mode("append").format("delta")
• No more explicit adding of partitions after writing to an S3 folder
df.write
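The move from S3 object deletions to DML can be illustrated with a small helper that turns a partition path such as mytab/year=2020/month=10/day=1 into the matching DELETE statement. The helper name is hypothetical, and it assumes the first path element is the table name and all following elements are column=value pairs.

```python
def delete_statement_for_partition(partition_path):
    """Translate 'table/col=value/...' into a DELETE with one condition per level."""
    table, *parts = partition_path.strip("/").split("/")
    conditions = []
    for part in parts:
        column, separator, value = part.partition("=")
        if not (column and separator and value):
            raise ValueError(f"not a partition element: {part!r}")
        # Numbers stay unquoted; other values become quoted SQL string literals.
        if value.isdigit():
            literal = value
        else:
            literal = "'" + value.replace("'", "''") + "'"
        conditions.append(f"{column} = {literal}")
    if not conditions:
        # Never emit a DELETE without WHERE: that would empty the whole table.
        raise ValueError("partition path contains no partition columns")
    return f"DELETE FROM {table} WHERE " + " AND ".join(conditions)


print(delete_statement_for_partition("mytab/year=2020/month=10/day=1"))
# DELETE FROM mytab WHERE year = 2020 AND month = 10 AND day = 1
```

On a cluster the resulting string would be executed with spark.sql(...) against the Delta table.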
22. Databricks Notebooks
Here we highlight the differences to EMR Notebooks
• Notebooks run on the same binaries as the cluster
• i.e. all packages installed by the init script are also available on the notebook host
• this way we could use direct visualization package like folium in the notebook
Python
packages
• You can access notebooks without a running cluster
• Attaching notebooks to another cluster is just a click
• no more restarts as with EMR Notebooks
Accessibility
• Code Completion
• Quick visualization of SQL results
Usability
• If you prefer to write Python programs instead of Notebooks
• submit a local script from your laptop to a running All-Purpose cluster
• Beware of different time zones on your machine and the cloud instance!
Databricks
Connect
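The time zone pitfall can be shown without Spark. The sketch below assumes a laptop on UTC+2 and a cloud instance on UTC (both are assumptions for illustration): the same instant yields different processing dates, which matters as soon as a script derives a partition value from the local clock.

```python
from datetime import datetime, timedelta, timezone

LAPTOP_TZ = timezone(timedelta(hours=2))  # assumption: laptop on UTC+2 (e.g. CEST)
CLUSTER_TZ = timezone.utc                 # assumption: cloud instance on UTC


def local_processing_date(instant, local_tz):
    """Date a host would derive from its own wall clock at the given instant."""
    return instant.astimezone(local_tz).date().isoformat()


def utc_processing_date(instant):
    """Date derived explicitly in UTC, independent of where the code runs."""
    return instant.astimezone(timezone.utc).date().isoformat()


instant = datetime(2020, 10, 1, 22, 30, tzinfo=timezone.utc)
print(local_processing_date(instant, LAPTOP_TZ))   # 2020-10-02
print(local_processing_date(instant, CLUSTER_TZ))  # 2020-10-01
print(utc_processing_date(instant))                # 2020-10-01
```

Deriving dates explicitly in one fixed zone avoids the discrepancy regardless of which machine runs the driver code.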
23. Collaboration & Git/Bitbucket
Our team members tried different ways, but we all ended up using batch workspace export/import
• Collaborative Notebook editing (like Office 365) at the same time
Built-in
• You cannot "sync" a repository or a folder from Git/Bitbucket to your Databricks workspace
• You must manually link an existing notebook in your workspace to Git/Bitbucket
Direct Git
integration
• install Databricks CLI (you want to do this anyway)
• import into workspace: databricks workspace import_dir ...
• export to local machine: databricks workspace export_dir ...
• git add / commit / push
workspace
export/import
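The export/import routine above can be scripted. The sketch below only assembles the commands for the Databricks CLI (export_dir / import_dir with the -o overwrite flag, as in the legacy CLI to our knowledge) and for git; the workspace path, local folder and commit message are hypothetical.

```python
import subprocess


def export_plan(workspace_dir, local_dir, message):
    """Commands to pull notebooks from the workspace and push them to Git."""
    return [
        ["databricks", "workspace", "export_dir", "-o", workspace_dir, local_dir],
        ["git", "add", local_dir],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def import_plan(local_dir, workspace_dir):
    """Commands to update the local checkout and load it into the workspace."""
    return [
        ["git", "pull"],
        ["databricks", "workspace", "import_dir", "-o", local_dir, workspace_dir],
    ]


def run_plan(commands):
    """Execute the commands in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


for command in export_plan("/Shared/my-project", "notebooks", "Export notebooks"):
    print(" ".join(command))
```

Calling run_plan(export_plan(...)) requires an installed and configured Databricks CLI and a Git checkout in the current directory.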
25. Summary
We already investigated performance & costs during PoC. So we felt comfortable with the migration.
• was not a driver for us
• we observed factor 2 for dedicated steps
• functionality of our Pipeline increases, so hard to compare
Performance
• Per instance costs for Databricks + EC2 are higher than for EMR + EC2
• we save resources by sharing autoscale clusters
• DML capabilities reduce ops costs
Costs
• Mainly for changing workflows to use Job clusters
• Having automated integration tests in place helped a lot
• Hardly any work for notebooks
Migration
Effort
26. Conclusion & outlook
• MLflow
• Autoloader
• Account API
More features
www.meteonomiqs.com
Carsten Herbe
carsten.herbe@meteonomiqs.com
T +49 89 412 007-289 M +49 151 4416 5763