Moving to Databricks
Carsten Herbe
Technical Lead Data Products at wetter.com GmbH
19.11.2020
About wetter.com & METEONOMIQS
What do we need Spark for?
About us
METEONOMIQS - B2B Data Solutions by wetter.com - https://www.meteonomiqs.com
Business areas: CONSUMER | MEDIA | DATA
• #1 B2C weather portal in DACH with up to 20 million monthly unique users (AGOF)
• Full service, cross-media; >50 daily TV/video productions
• Weather and geo-based solutions for advertisers, FMCG & retailers …
METEONOMIQS products: Weather Data Impact, Weather Based Advertising, Weather Driven Forecasting, Footfall & Location Insights, Weather API
Example product: GfK Retail Crowd Monitor (in cooperation with GfK)
Comparison of visit frequencies in major German city centers
Carsten Herbe - Technical Lead Data Products
» Architecture, Dev & Ops for Infrastructure & Data Pipelines
» Today: #aws #spark #databricks
» Past: #hadoop #kafka #dwh #bi
Our world before Databricks …
Architecture & Motivation
High level architecture & tech stack
How our world looked before Databricks
(Architecture diagram) Ingestion: an Ingestion API lands data in the RAW data layer on S3. Data analytics: EMR jobs cleanse RAW into CLEAN data, an EMR calc step and an EMR calc & aggregate step produce the CALC data, and a separate EMR cluster is used for analysis. Serving: a REST API delivers data to customers, Athena serves ad-hoc queries & reports as well as partners. All tables are registered in the AWS Glue Data Catalog; orchestration is done with Step Functions, CloudWatch, and Lambda.
Motivation for moving to Databricks & Delta
It’s not about performance but productivity
• GDPR deletes in minutes without downtime instead of hours/days with downtime
• Data corrections / enhancements without downtime instead of whole days of
downtime
Delta
• Improved usability & productivity using Databricks Notebooks
• Cluster Sharing (80% of EMR costs were from dedicated dev & analytics clusters)
Databricks
• Improve collaboration between Data Scientists and Data Engineers / Dev Team
• Replace a custom solution (that had already been started)
MLflow
Welcome to the world of Databricks & Delta
Infrastructure view
High level architecture & tech stack
How our world looked before Databricks
(Recap of the pre-Databricks architecture diagram: Ingestion API → RAW data → EMR cleanse → CLEAN data → EMR calc → CALC data → EMR calc & aggregate, served via a REST API and Athena, with all data on S3 in the AWS Glue Data Catalog and orchestration by Step Functions, CloudWatch, and Lambda.)
Architecture: How Databricks fits in
Just plug & play?
(Architecture diagram) The EMR clusters are swapped out for Databricks: Databricks now runs the cleanse, calc, calc & aggregate, and analyze steps against the RAW, CLEAN, and CALC data layers. Everything else stays as before: Ingestion API, S3, AWS Glue Data Catalog, Step Functions, CloudWatch, Lambda, Athena, and the REST API delivery to customers and partners.
Databricks Workspaces
We use one Workspace as all our projects already share one AWS account and are logically separated
• One workspace for all stages/projects
• must be created manually by Databricks
Single classic
Workspace
• dedicated workspaces for stages/projects
• must all be created manually by Databricks
• No SSO means user management per workspace
Multiple classic
Workspaces
• dedicated workspaces for stages/projects
• use the recently published Account API
• you can create workspaces yourself (check out the Databricks Terraform provider); a sketch of a workspace-creation call follows below
Multiple
Workspaces
with Account API
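For illustration, a minimal Python sketch of what a workspace-creation call against the Account API could look like. The endpoint, the basic-auth scheme, and the payload fields (credentials_id, storage_configuration_id, network_id) are assumptions based on the public Account API documentation, not code from our pipeline; verify against the current Databricks docs before use.

```python
# Hedged sketch: create an E2 workspace via the Databricks Account API.
# Endpoint, auth scheme and payload fields are assumptions -- verify first.
import requests

ACCOUNT_ID = "<databricks-account-id>"
BASE = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}"

payload = {
    "workspace_name": "meteonomiqs-dev",                 # illustrative name
    "aws_region": "eu-central-1",
    "credentials_id": "<credentials-id>",                # cross-account IAM role registration
    "storage_configuration_id": "<storage-config-id>",   # root S3 bucket registration
    "network_id": "<network-id>",                        # customer-managed VPC with your own IP range
}

resp = requests.post(
    f"{BASE}/workspaces",
    json=payload,
    auth=("<account-owner-email>", "<password>"),        # Account API used basic auth at the time
)
resp.raise_for_status()
print(resp.json())
```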
Workspace setup: what Databricks requires & creates
We could reuse our IAM policies and roles with slight modifications.
• S3 bucket for the Databricks workspace (e.g. notebooks)
• IAM role to be able to deploy from Databricks account into your account
• EC2 instance role(s) for deploying/starting a Databricks cluster
Requires
• VPC & related AWS resources
• you cannot reuse an existing VPC
• so you must set up new VPC peerings
• NOTE: this changes with the new Account API
• you can provide your specific IP range to Databricks before workspace creation
Creates
Security: Combining Databricks & AWS IAM
We can now share one cluster per project, and later, with SSO & IAM passthrough, just one cluster in total
• Each user must have a valid e-mail address → the same holds for technical users!
• You can create tokens for users → API access
• You can restrict access to clusters based on user or group
• launch clusters (both analytical & job) only with a specific EC2 instance role
Users, groups,
permissions
• Use Databricks cluster with your AWS IAM role
• requires SSO
• allows you to share one cluster between projects
IAM
passthrough
Cluster configurations
init scripts, instance types, spot instances
• Init scripts (pip install …) quite similar to EMR init scripts
• one EC2 instance type for all workers; no instance fleets as with EMR
• mix of on-demand and spot instances (with fallback to on-demand)
• for analytics & dev we use one on-demand instance for the driver and spot instances otherwise
• Autoscaling for shared (dev) clusters
(A cluster-spec sketch follows below.)
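A minimal sketch of what such a cluster definition could look like against the Clusters API (api/2.0/clusters/create). Field names follow the Databricks API on AWS; the runtime version, instance type, ARNs, and paths are placeholders, not our actual configuration.

```python
# Hedged sketch of a shared dev/analytics cluster spec for the Clusters API.
import requests

HOST = "https://<workspace>.cloud.databricks.com"
TOKEN = "<databricks-token>"

cluster_spec = {
    "cluster_name": "shared-dev",
    "spark_version": "7.3.x-scala2.12",           # LTS runtime at the time of the talk
    "node_type_id": "i3.xlarge",                  # one instance type for all workers
    "autoscale": {"min_workers": 1, "max_workers": 8},
    "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::<account>:instance-profile/<role>",
        "first_on_demand": 1,                     # on-demand driver, spot workers
        "availability": "SPOT_WITH_FALLBACK",     # fall back to on-demand if no spot capacity
    },
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/init/install-python-libs.sh"}}  # pip install ...
    ],
    # "spark_conf": {"spark.databricks.passthrough.enabled": "true"},   # IAM passthrough (needs SSO)
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```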
Migrating our Spark Pipelines
Data & Application
From Parquet tables to Delta tables
Conversion was easy. Some regular housekeeping required.
• CONVERT TO DELTA mytab PARTITIONED BY (...)
• requires some compute resources to analyze existing data
Parquet to
Delta
• UPDATE/DELETE generates a lot of small files
• → configure as a table property → done as part of DML
• or → periodic housekeeping
OPTIMIZE
• Improves selective queries with WHERE on one or more columns
• Cannot be configured as a table property → housekeeping
• we keep PARTITION BY substr(zip,1,1) + sortWithinPartition('h3_index')
ZORDER
• Deleted files stay on S3 but are referenced as old versions (→ Delta Time Travel)
• After the retention period they should be deleted using VACUUM
• → periodic housekeeping required (a housekeeping sketch follows below)
VACUUM
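A minimal sketch of the conversion and housekeeping commands mentioned above, not our exact jobs: it assumes a `spark` session (e.g. a Databricks notebook or job) and an illustrative table `mytab` with an `h3_index` column.

```python
# One-time: convert an existing partitioned Parquet table in place
spark.sql("CONVERT TO DELTA mytab PARTITIONED BY (year INT, month INT, day INT)")

# Periodic housekeeping:
spark.sql("OPTIMIZE mytab")                         # compact small files left by UPDATE/DELETE
# spark.sql("OPTIMIZE mytab ZORDER BY (h3_index)")  # optional: co-locate rows on a filter column
spark.sql("VACUUM mytab RETAIN 168 HOURS")          # drop unreferenced files older than 7 days
```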
AWS Glue Data Catalog vs Databricks Hive Metastore
We continued with Glue. Now we have two table definitions for the same data: a native Delta table and a symlink table for Athena …
Glue Data Catalog vs Databricks Metastore:
• Glue Data Catalog: can be accessed by multiple Databricks workspaces; can be accessed by other tools like Athena; officially not supported with Databricks Connect
• Databricks Metastore: can only be accessed by the Databricks workspace that owns it; cannot be used by Athena; supported with Databricks Connect

Native Delta Tables vs "Symlink" Tables:
• Native Delta tables: CREATE TABLE … USING DELTA [LOCATION …]; full Delta features (DML, Time Travel); used by our Spark applications and interactive analytics
• "Symlink" tables: GENERATE symlink_format_manifest FOR TABLE … followed by CREATE EXTERNAL TABLE … LOCATION '…/symlink_format_manifest/'; SELECT only; used for reporting with Tableau & Athena
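A minimal sketch of registering the same Delta data twice in this way, once as a native Delta table for Spark and once as a symlink table for Athena. Bucket, table, and column names are placeholders, and the Athena DDL is only shown as a string because it is executed in Athena, not in Spark.

```python
path = "s3://my-bucket/clean/mytab"

# Native Delta table in the Glue Data Catalog: full DML + Time Travel, used by Spark
spark.sql(f"CREATE TABLE IF NOT EXISTS mytab USING DELTA LOCATION '{path}'")

# Generate/refresh the manifest files Athena needs (re-run after data changes,
# or enable the delta.compatibility.symlinkFormatManifest.enabled table property)
spark.sql(f"GENERATE symlink_format_manifest FOR TABLE delta.`{path}`")

# SELECT-only external table for Athena (partitioned tables additionally need
# a PARTITIONED BY clause plus MSCK REPAIR TABLE):
athena_ddl = f"""
CREATE EXTERNAL TABLE mytab_athena (id BIGINT, zip STRING, h3_index STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '{path}/_symlink_format_manifest/'
"""
```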
Workflows (1/2): From single steps on EMR …
Detailed graphical overview of the workflow
StepFunctions before Databricks
Create or use existing EMR cluster (latter is quite
handy for dev)
Lambda for adding step to EMR cluster
access control by IAM role
One step = one Spark job
Good (graphical) overview of pipeline
Workflows (2/2): … to bulk steps with Databricks Job clusters
cheaper Job clusters vs more expensive All-Purpose Clusters
StepFunctions with Databricks
You can run only one job on a Job cluster
Using an All-Purpose cluster is more expensive.
One Job cluster for each step: a lot of time spent on cluster creation → painful for dev
One step = one sequence of Spark jobs → workflow code on the driver required!
Lambda for launching the Spark job on a Databricks cluster (a sketch follows below)
Databricks token for authentication
Step Functions now only shows a partial picture of the pipeline (but the Python workflow code is easier to implement)
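A minimal sketch of such a Lambda, assuming the Jobs API runs-submit endpoint (api/2.0/jobs/runs/submit) and a token stored in environment variables; names, paths, and the cluster spec are placeholders rather than our production code.

```python
# Hedged sketch: Lambda that submits a one-off run on a Job cluster.
import json
import os
import urllib.request

HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]  # Databricks token for authentication

def handler(event, context):
    run_spec = {
        "run_name": event.get("step_name", "cleanse"),
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,
            "aws_attributes": {"availability": "SPOT_WITH_FALLBACK", "first_on_demand": 1},
        },
        # one "step" = a driver program that runs a whole sequence of Spark jobs
        "spark_python_task": {
            "python_file": "dbfs:/apps/pipeline/main.py",
            "parameters": [event.get("date", "")],
        },
    }
    req = urllib.request.Request(
        f"{HOST}/api/2.0/jobs/runs/submit",
        data=json.dumps(run_spec).encode(),
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # contains the run_id for Step Functions to poll
```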
Adapting our (Spark) applications
Hardly any work required in existing Spark applications
• Athena cannot write to Delta
• Migrate to Spark
Athena steps
• replace with DML (DELETE/MERGE/UPDATE)
• instead of deleting objects in mytab/year=2020/month=10/day=1:
DELETE FROM mytab
WHERE year=2020 AND month=10 AND day=1
• works even faster than S3 deletes!
S3 object
deletions
• Replace df.write.mode("append").format("parquet")
with df.write.mode("append").format("delta")
• No more explicit adding of partitions after writing to an S3 folder
df.write
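A minimal sketch of the two application changes on this slide, assuming a DataFrame `df` and a `spark` session; the path and table/column names are illustrative.

```python
# Write Delta instead of Parquet (same append semantics):
df.write.mode("append").format("delta").save("s3://my-bucket/clean/mytab")

# Drop a day partition as DML instead of deleting the S3 objects under
# mytab/year=2020/month=10/day=1:
spark.sql("DELETE FROM mytab WHERE year = 2020 AND month = 10 AND day = 1")
```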
Analyzing, developing & collaborating with Databricks
Developer & Data Scientist view
Databricks Notebooks
Here we highlight the differences to EMR Notebooks
• Notebooks run on the same binaries as the cluster
• i.e. all packages installed by the init script are also available on the notebook host
• this way we could use visualization packages like folium directly in the notebook
Python
packages
• You can access notebooks without a running cluster
• Attaching notebooks to another cluster is just a click
• no more restarts as with EMR Notebooks
Accessibility
• Code Completion
• Quick visualization of SQL results
Usability
• If you prefer to write Python programs instead of Notebooks
• submit a local script from your laptop to a running All-Purpose cluster
• Beware of different time zones between your machine and the cloud instance!
Databricks
Connect
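A minimal sketch of a local script run through (classic) Databricks Connect against a running All-Purpose cluster. It assumes `pip install databricks-connect` and `databricks-connect configure` were done with workspace URL, token, and cluster id; the table name is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()   # attaches to the remote cluster

df = spark.table("mytab")
daily = df.groupBy("year", "month", "day").agg(F.count("*").alias("rows"))

# Note: timestamps may render in the cluster's time zone, not your laptop's
daily.show()
```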
Collaboration & Git/Bitbucket
Our team members tried different ways, but we all ended up using batch workspace export/import
• Collaborative Notebook editing (like Office 365) at the same time
Built-in
• You cannot "sync" a repository or a folder from Git/Bitbucket to your Databricks workspace
• You must manually link an existing notebook in your workspace to Git/Bitbucket
Direct Git
integration
• install Databricks CLI (you want to do this anyway)
• import into workspace: databricks workspace import_dir ...
• export to local machine: databricks workspace export_dir ...
• git add / commit / push
workspace
export/import
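A minimal sketch of this export/import round trip as a small helper script; it simply shells out to the Databricks CLI and git, and the workspace folder and local paths are placeholders.

```python
import subprocess

WORKSPACE_DIR = "/Shared/pipeline"   # folder in the Databricks workspace (illustrative)
LOCAL_DIR = "notebooks"              # folder inside the git repository (illustrative)

def pull_from_workspace():
    # export notebooks to the local repo, then commit & push
    subprocess.run(["databricks", "workspace", "export_dir", "-o", WORKSPACE_DIR, LOCAL_DIR], check=True)
    subprocess.run(["git", "add", LOCAL_DIR], check=True)
    subprocess.run(["git", "commit", "-m", "sync notebooks from workspace"], check=True)
    subprocess.run(["git", "push"], check=True)

def push_to_workspace():
    # overwrite the workspace folder with the checked-out repo state
    subprocess.run(["databricks", "workspace", "import_dir", "-o", LOCAL_DIR, WORKSPACE_DIR], check=True)
```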
Summary
Summary
We already investigated performance & costs during PoC. So we felt comfortable with the migration.
• was not a driver for us
• we observed a factor of 2 for dedicated steps
• the functionality of our pipeline has increased, so it is hard to compare
Performance
• Per instance costs for Databricks + EC2 are higher than for EMR + EC2
• we save resources by sharing autoscale clusters
• DML capabilities reduce ops costs
Costs
• Mainly for changing workflows to use Job clusters
• Having automated integration tests in place helped a lot
• Hardly any work for notebooks
Migration
Effort
Conclusion & outlook
We already investigated performance & costs during PoC. So we felt pretty comfortable with the migration.
• was not a driver for us
• we observed a factor of 2 for dedicated steps
• the complexity of our pipeline has increased, so it is hard to compare
Performance
• Per instance costs for Databricks + EC2 are higher than for EMR + EC2
• we save resources by sharing autoscale clusters
• DML capabilities reduce ops costs
Costs
• Mainly for changing workflows to use Job clusters
• Having automated integration tests in place helped a lot
• Hardly any work for notebooks
Migration
Effort
• MLflow
• Autoloader
• Account API
More features
www.meteonomiqs.com
Carsten Herbe
carsten.herbe@meteonomiqs.com
T +49 89 412 007-289 M +49 151 4416 5763
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.