Moving to Databricks & Delta


At wetter.com we build analytical B2B data products and heavily use Spark and AWS technologies for data processing and analytics. I explain why we moved from AWS EMR to Databricks and Delta and share our experiences from different angles like architecture, application logic and user experience. We will look how security, cluster configuration, resource consumption and workflow changed by using Databricks clusters as well as how using Delta tables simplified our application logic and data operations.


  1. Moving to Databricks. Carsten Herbe, Technical Lead Data Products at wetter.com GmbH, 19.11.2020
  2. About wetter.com & METEONOMIQS: What do we need Spark for?
  3. About us: METEONOMIQS, B2B Data Solutions by wetter.com (https://www.meteonomiqs.com). MEDIA: #1 B2C weather portal in DACH with up to 20 mil monthly UU (AGOF); full service, cross-media; >50 daily TV/video productions. CONSUMER DATA: weather and geo-based solutions for advertisers, FMCG & retailers. METEONOMIQS portfolio: Weather Data Impact, Weather Based Advertising, Weather Driven Forecasting, Footfall & Location Insights, Weather API.
  4. Example product: GfK Retail Crowd Monitor (in cooperation with GfK), a comparison of visit frequencies in major German city centers. Speaker: Carsten Herbe, Technical Lead Data Products » Architecture, Dev & Ops for infrastructure & data pipelines » Today: #aws #spark #databricks » Past: #hadoop #kafka #dwh #bi
  5. Our world before Databricks … Architecture & Motivation
  6. High level architecture & tech stack: how our world looked before Databricks. [Architecture diagram: an Ingestion API writes RAW data to S3; EMR jobs cleanse (RAW → CLEAN data), calc & aggregate (CLEAN → CALC data), and analyze; serving via a REST API (Lambda) delivers to customers; ad-hoc analytics and reports for partners run on Athena; AWS Glue Data Catalog, Step Functions and CloudWatch support the pipeline.]
  7. Motivation for moving to Databricks & Delta: it's not about performance but productivity. Delta: GDPR deletes in minutes without downtime instead of hours/days with downtime; data corrections/enhancements without downtime instead of whole days of downtime. Databricks: improved usability & productivity using Databricks notebooks; cluster sharing (80% of EMR costs were from dedicated dev & analytics clusters). MLflow: improve collaboration between data scientists and data engineers/dev team; replace a custom solution (that had been started).
  8. Welcome to the world of Databricks & Delta: Infrastructure view
  9. High level architecture & tech stack: how our world looked before Databricks (recap of the architecture diagram from slide 6).
  10. Architecture: how Databricks fits in. Just plug & play? [Diagram: the same architecture as before, but each EMR job (cleanse, calc, calc & aggregate, analyze) is replaced by a Databricks job; the S3 data zones (RAW, CLEAN, CALC), Ingestion API, REST API serving via Lambda, Athena, AWS Glue Data Catalog, Step Functions and CloudWatch stay unchanged.]
  11. Databricks Workspaces. We use one workspace, as all our projects already share one AWS account and are logically separated. Single classic workspace: one workspace for all stages/projects; must be created manually by Databricks. Multiple classic workspaces: dedicated workspaces per stage/project; must all be created manually by Databricks; no SSO means user management per workspace. Multiple workspaces with the Account API: dedicated workspaces per stage/project; use the recently published Account API; you can create workspaces yourself (check out the Databricks Terraform provider).
  12. Workspace setup: what Databricks requires & creates. We could reuse our IAM policies and roles with slight modifications. Requires: an S3 bucket for the Databricks workspace (e.g. notebooks); an IAM role so Databricks can deploy from the Databricks account into your account; EC2 instance role(s) for deploying/starting a Databricks cluster. Creates: a VPC & related AWS resources; you cannot reuse an existing VPC, so you must set up new VPC peerings (NOTE: this changes with the new Account API); you can provide your specific IP range to Databricks before workspace creation.
  13. Security: combining Databricks & AWS IAM. We can now share one cluster per project, and later, with SSO & IAM passthrough, just one cluster in total. Users, groups, permissions: each user must have a valid mail address (the same holds for technical users!); you can create tokens for users, which enables API access; you can restrict access to clusters based on user or group; launch (both analytical & job) clusters only with a specific EC2 instance role. IAM passthrough: use a Databricks cluster with your AWS IAM role; requires SSO; allows you to share one cluster between projects.
  14. Cluster configurations: init scripts, instance types, spot instances. Init scripts (pip install …) are quite similar to EMR init scripts. One EC2 instance type for workers; no instance fleets as with EMR. Mix of on-demand and spot instances (with fallback to on-demand); for analytics & dev we use one on-demand instance for the driver and spot instances otherwise. Autoscaling for shared (dev) clusters.
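A cluster setup along these lines can be sketched as a Clusters API spec. This is a minimal sketch, assuming the Databricks Clusters API on AWS; the instance type, runtime version, instance profile ARN and init-script path are illustrative placeholders, not the values from the talk.

```python
# Sketch of a shared dev cluster spec matching the setup above: one
# on-demand driver, spot workers with fallback to on-demand, autoscaling,
# and a pip-install init script. Field names follow the Databricks
# Clusters API; concrete values are made up for illustration.
cluster_spec = {
    "cluster_name": "shared-dev",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "i3.xlarge",  # one EC2 instance type for all workers
    "autoscale": {"min_workers": 1, "max_workers": 8},
    "aws_attributes": {
        "first_on_demand": 1,                  # driver runs on-demand
        "availability": "SPOT_WITH_FALLBACK",  # spot workers, on-demand fallback
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/databricks-ec2-role",
    },
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/init/install-packages.sh"}}
    ],
}
```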
  15. Migrating our Spark Pipelines: Data & Application
  16. From Parquet tables to Delta tables. Conversion was easy; some regular housekeeping is required. Parquet to Delta: CONVERT TO DELTA mytab PARTITIONED BY (...); requires some compute resources to analyze the existing data. OPTIMIZE: UPDATE/DELETE generates a lot of small files; either configure optimization as a table property (then it runs as part of DML) or run OPTIMIZE as periodic housekeeping. ZORDER: improves selective queries with a WHERE on one or more columns; cannot be configured as a table property, so it belongs to housekeeping; we keep PARTITION BY substr(zip,1,1) plus sortWithinPartitions('h3_index'). VACUUM: deleted files stay on S3 and are referenced as old versions (Delta Time Travel); after the retention period they should be deleted using VACUUM; periodic housekeeping required.
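The periodic housekeeping described above can be sketched as a small helper that renders the OPTIMIZE and VACUUM statements for a table. This is a hypothetical helper, not code from the talk: it only builds SQL strings, and on Databricks each one would be passed to spark.sql(...); the 168-hour default matches Delta's standard 7-day retention.

```python
# Hypothetical helper rendering the Delta housekeeping statements described
# above: OPTIMIZE (optionally with ZORDER BY) followed by VACUUM. Returns
# plain SQL strings; run each via spark.sql(...) on the cluster.
def housekeeping_sql(table, zorder_cols=None, retain_hours=168):
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    vacuum = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    return [optimize, vacuum]

stmts = housekeeping_sql("mytab", zorder_cols=["h3_index"])
# → ["OPTIMIZE mytab ZORDER BY (h3_index)", "VACUUM mytab RETAIN 168 HOURS"]
```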
  17. AWS Glue Data Catalog vs Databricks Hive Metastore. We continued with Glue. Now we have two external tables per dataset: one for Delta, one for Athena. Glue Data Catalog: can be accessed by multiple Databricks workspaces; can be accessed by other tools like Athena; officially not supported with Databricks Connect. Databricks Metastore: can only be accessed by the Databricks workspace that owns it; cannot be used by Athena; supported with Databricks Connect. Native Delta tables (CREATE TABLE … USING DELTA [LOCATION …]): full Delta features (DML, Time Travel); used by our Spark applications and interactive analytics. "Symlink" tables (GENERATE symlink_format_manifest FOR TABLE …; CREATE EXTERNAL TABLE … LOCATION '…/symlink_format_manifest/'): SELECT only; used for reporting with Tableau & Athena.
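The two-table pattern can be spelled out as the three statements involved. This is a sketch under assumptions: table names, columns and S3 paths are made up, and the SerDe/input-format combination follows the generally documented symlink-manifest pattern for Presto/Athena rather than anything shown in the talk. On Databricks each string would be executed via spark.sql(...).

```python
# Sketch of registering one dataset twice: a native Delta table for Spark,
# and a symlink external table for Athena. Names, columns and S3 paths are
# illustrative; the SerDe incantation is the documented Athena/Presto
# pattern for reading Delta symlink manifests.
register_delta = (
    "CREATE TABLE analytics.visits USING DELTA "
    "LOCATION 's3://my-bucket/calc/visits'"
)
generate_manifest = (
    "GENERATE symlink_format_manifest FOR TABLE analytics.visits"
)
register_for_athena = (
    "CREATE EXTERNAL TABLE analytics.visits_athena (visit_count BIGINT) "
    "PARTITIONED BY (day STRING) "
    "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' "
    "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' "
    "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' "
    "LOCATION 's3://my-bucket/calc/visits/_symlink_format_manifest'"
)
```

After any write to the Delta table, the manifest must be regenerated (or configured to auto-update) so the Athena view stays current.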
  18. Workflows (1/2): from single steps on EMR … Step Functions before Databricks: detailed graphical overview of the workflow; create a new EMR cluster or use an existing one (the latter is quite handy for dev); a Lambda adds a step to the EMR cluster; access control by IAM role; one step = one Spark job; good (graphical) overview of the pipeline.
  19. Workflows (2/2): … to bulk steps with Databricks. Cheaper Job clusters vs more expensive All-Purpose clusters. Step Functions with Databricks: you can run just one job on a Job cluster, and using an All-Purpose cluster is more expensive; one Job cluster per step costs a lot of time for cluster creation, which is painful for dev; so one step = one sequence of Spark jobs, which requires workflow code on the driver! A Lambda launches the Spark job on the Databricks cluster, using a Databricks token for authentication. Step Functions now shows only a partial picture (but the Python workflow is easier to implement).
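The Lambda-launches-a-job step can be sketched as building a Runs Submit request. This is a minimal sketch, assuming the Databricks Jobs Runs Submit endpoint (POST /api/2.0/jobs/runs/submit); the cluster sizing, notebook path and workspace host are placeholders, not the talk's actual configuration.

```python
# Sketch of what a Step Functions Lambda could send to launch one sequence
# of Spark jobs (driven by a notebook) on a fresh Job cluster via the
# Databricks Runs Submit API. All concrete values are illustrative.
def build_run_submit_payload(step_name, notebook_path):
    return {
        "run_name": step_name,
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,
        },
        "notebook_task": {"notebook_path": notebook_path},
    }

payload = build_run_submit_payload("calc", "/pipelines/calc_driver")
# The Lambda would then POST this payload, authenticated with a Databricks
# token, e.g.:
#   requests.post("https://<workspace-host>/api/2.0/jobs/runs/submit",
#                 headers={"Authorization": f"Bearer {token}"}, json=payload)
```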
  20. Adapting our (Spark) applications. Hardly any work was required in the existing Spark applications. Athena steps: Athena cannot write to Delta, so migrate those steps to Spark. S3 object deletions: replace them with DML (DELETE/MERGE/UPDATE); instead of deleting objects in mytab/year=2020/month=10/day=1, run DELETE FROM mytab WHERE year=2020 AND month=10 AND day=1; this even works faster than S3 deletes! df.write: replace df.write.mode("append").format("parquet") with df.write.mode("append").format("delta"); no more explicit adding of partitions after writing to an S3 folder.
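The partition-delete replacement can be captured in a tiny helper. This is a hypothetical helper for illustration, not code from the talk: it renders the DELETE statement for a set of partition values, which on Databricks would then be run via spark.sql(...).

```python
# Hypothetical helper: instead of deleting S3 objects under a partition
# prefix like mytab/year=2020/month=10/day=1, build the equivalent Delta
# DELETE statement for those partition values.
def partition_delete_sql(table, **partition):
    predicate = " AND ".join(f"{col}={val}" for col, val in partition.items())
    return f"DELETE FROM {table} WHERE {predicate}"

sql = partition_delete_sql("mytab", year=2020, month=10, day=1)
# → "DELETE FROM mytab WHERE year=2020 AND month=10 AND day=1"
```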
  21. Analyzing, developing & collaborating with Databricks: Developer & Data Scientist view
  22. Databricks Notebooks. Here we highlight the differences to EMR Notebooks. Python packages: notebooks run on the same binaries as the cluster, i.e. all packages installed by the init script are also available on the notebook host; this way we could use visualization packages like folium directly in the notebook. Accessibility: you can access notebooks without a running cluster; attaching a notebook to another cluster is just a click; no more restarts as with EMR Notebooks. Usability: code completion; quick visualization of SQL results. Databricks Connect: if you prefer to write Python programs instead of notebooks, submit a local script from your laptop to a running All-Purpose cluster; beware of different time zones on your machine and the cloud instance!
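The time-zone caveat above in miniature: this is a small illustrative sketch (not from the talk) of why code that runs partly on your laptop and partly on the cluster should pin timestamps to UTC rather than rely on the machine's local zone.

```python
from datetime import datetime, timezone

# datetime.now() with no argument yields a naive local timestamp, so the
# same expression can produce different dates on your laptop and on the
# cloud instance. Passing an explicit zone makes the result unambiguous.
local_naive = datetime.now()            # depends on the machine's time zone
portable = datetime.now(timezone.utc)   # the same instant everywhere

# Naive timestamps carry no zone information at all:
assert local_naive.tzinfo is None
assert portable.tzinfo is timezone.utc
```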
  23. Collaboration & Git/Bitbucket. Our team members tried different ways, but we all ended up using batch workspace export/import. Built-in: collaborative notebook editing (like Office 365) at the same time. Direct Git integration: you cannot "sync" a repository or a folder from Git/Bitbucket to your Databricks workspace; you must manually link each existing notebook in your workspace to Git/Bitbucket. Workspace export/import: install the Databricks CLI (you want to do this anyway); import into the workspace with databricks workspace import_dir ...; export to the local machine with databricks workspace export_dir ...; then git add / commit / push.
  24. Summary
  25. Summary. We already investigated performance & costs during the PoC, so we felt comfortable with the migration. Performance: it was not a driver for us; we observed a factor of 2 for dedicated steps; the functionality of our pipeline increased, so it is hard to compare. Costs: per-instance costs for Databricks + EC2 are higher than for EMR + EC2; we save resources by sharing autoscaling clusters; DML capabilities reduce ops costs. Migration effort: mainly for changing workflows to use Job clusters; having automated integration tests in place helped a lot; hardly any work for notebooks.
  26. Conclusion & outlook. Performance: was not a driver for us; we observed a factor of 2 for dedicated steps; the complexity of our pipeline increased, so it is hard to compare. Costs: per-instance costs for Databricks + EC2 are higher than for EMR + EC2; we save resources by sharing autoscaling clusters; DML capabilities reduce ops costs. Migration effort: mainly for changing workflows to use Job clusters; automated integration tests helped a lot; hardly any work for notebooks. More features to explore: MLflow, Autoloader, Account API. Contact: www.meteonomiqs.com; Carsten Herbe, carsten.herbe@meteonomiqs.com; T +49 89 412 007-289; M +49 151 4416 5763.
  27. Feedback. Your feedback is important to us. Don't forget to rate and review the sessions.
