SlideShare a Scribd company logo
1 of 50
Download to read offline
The Future? of Data Science/ML at Scale: A
Look at Lakehouse, Delta Lake & MLflow
Jules S. Damji
Databricks, Inc
2/10/2021 @ UCB iSchool
http://dbricks.co/ucbi-webinar
@2twitme
Talk Outline
§ What is the data problem?
§ What impedes advanced analytics?
§ Past & present data architecture to look to the future?
§ What’s “Lakehouse” paradigm
§ Delta Lake & MLflow
About
Cloud data platform for analytics,
engineering and data science
Runs a fleet of millions of VMs to
process exabytes of data/day
>7000 customers
The biggest challenges with data today:
data quality, staleness , data volume and scale
How to grapple data beyond 2020 . . .
Data Analyst Survey
60% reported data quality as top challenge
86% of analysts had to use stale data, with
41% using data that is >2 months old
90% regularly had unreliable data sources
over the last 12 months
ō
ō
ō
Data Scientist Survey
75%
51%
42%
Getting high-quality, timely data is hard…
but it’s partly a problem of our own making!
The Evolution of
Data Management
1980s: Data Warehouses
§ ETL data directly from operational
database systems
§ Purpose-built for SQL analytics & BI:
schemas, indexes, caching, etc.
§ Powerful management features such as
ACID transactions and time travel
ETL
Operational Data
Data Warehouses
BI Reports
2010s: New Problems for Data Warehouses
§ Could not support rapidly growing
unstructured and semi-structured data:
time series, logs, images, documents, etc.
§ High cost to store large datasets
§ No support for data science & ML
ETL
Operational Data
Data Warehouses
BI Reports
2010s: Data Lakes
§ Low-cost storage to hold all raw data
(e.g., Amazon S3, HDFS)
▪ $12/TB/month for S3 infrequent tier!
§ ETL jobs then load specific data into
warehouses, possibly for further ELT
§ Directly readable in ML libraries (e.g.,
TensorFlow, PyTorch) due to open file
format
BI Data
Science
Machine
Learning
Structured, Semi-Structured & Unstructured Data
Data Lake
Real-Time
Database
Reports
Data Warehouses Data
Preparation
ETL
Problems with Today’s Data Lakes
Cheap to store all the data, but system architecture is much more
complex!
Data reliability suffers:
§ Multiple storage systems with different
semantics, SQL dialects, etc.
§ Extra ETL steps that can go wrong
Timeliness suffers & High Cost:
§ Extra ETL steps before data is available
in data warehouses
§ Continuous ETL, duplicated storage
BI Data
Science
Machine
Learning
Structured, Semi-Structured & Unstructured Data
Data Lake
Real-Time
Database
Reports
Data Warehouses Data
Preparation
ETL
Streaming
Analytics
BI Data
Science
Machine
Learning
Structured, Semi-Structured & Unstructured Data
Lakehouse Vision
Data lake storage for all data
Single platform for every use case
Management features
(transactions, versioning, etc.)
Lakehouse Systems
Implement data warehouse management and performance features
on top of directly-accessible data in open formats
Structured, Semi & Unstructured Data
Data Lake Storage
BI
Data
Science
Machine
Learning
Reports
Management &
Performance Layer
ETL
Can we get state-of-the-art performance &
governance features with this design?
Cheap storage
in open formats
for all data
Direct access
to data files
SQL
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Declarative access for data science & ML
Metadata Layers for Data Lakes
§ Track which files are part of a table version to offer
rich management features like transactions
▪ Clients can then access the underlying files at high speed
▪ Optimistic Concurrency
§ Implemented in multiple systems:
§ Examples:
ACID
Client Application
Metadata Layer
Data Lake
Which files
are part of
table v1?
f1, f2 f3
f1 f2 f3 f4
Example with
file1.parquet
file2.parquet
file3.parquet
“events” table
_delta_log / v1.parquet
/ v2.parquet
Query: delete all events data about customer #17
file1b.parquet
file3b.parquet
rewrite
rewrite
track which files are part of
each version of the table
(e.g., v2 = file1, file2, file3)
_delta_log / v3.parquet
atomically add new log file
v3 = file1b, file2, file3b
Clients now always read a
consistent table version!
• If a client reads v2 of log, it sees
file1, file2, file3 (no delete)
• If a client reads v3 of log, it sees
file1b, file2, file3b (all deleted)
See our VLDB 2020 paper for details
Other Management Features with
§ Streaming I/O: treat a table as a
stream of changes to remove need
for message buses like Kafka
§ INSERT, UPSERT, DELETE & MERGE
§ Time travel to an old table version
§ Schema enforcement & evolution
§ Expectations for data quality
CREATE TABLE orders (
product_id INTEGER NOT NULL,
quantity INTEGER CHECK(quantity > 0),
list_price DECIMAL CHECK(list_price > 0),
discount_price DECIMAL
CHECK(discount_price > 0 AND
discount_price <= list_price)
);
spark.readStream
.format("delta")
.table("events")
Adoption
§ Already > 50% of Databricks I/O workload (exabytes/day)
§ Broad industry support
Ingest
from
Query
from
Store data in
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
Lakehouse Engine Optimizations
Directly-accessible file storage optimizations can enable high SQL
performance:
§ Caching hot data in RAM/SSD, possibly transcoded
§ Data layout within files to cluster co-accessed data
(e.g., sorting or multi-dimensional clustering,)
§ Auxiliary data structures like statistics and indexes
§ Vectorized execution engines for modern CPUs
Minimize I/O for cold data,
which is the dominant cost
Match DWs on hot data
New query engines such as Databricks Delta Engine use these ideas
Example: Databricks Delta Engine
Vectorized engine for Spark SQL that uses SSD caching, multi-dimensional
clustering and zone maps over Parquet files
2996
7143 5793
37283
3302
0
10000
20000
30000
40000
DW1 DW2 DW3 DW4 Delta Engine
(on-demand)
TPC-DS 30TB Power Test Duration (s)
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
DS/ML/DL over a Lakehouse
§ ML frameworks already support reading Parquet, ORC, etc.
§ New declarative interfaces for I/O enable further optimization
§ Example: Spark DataFrame API
compiles to relational algebra
...
model.fit(train_set)
Lazily evaluated queryplan
Optimized execution using
cache, statistics, index, etc
users
SELECT(kind = “buyer”)
PROJECT(date, zip, …)
PROJECT(NULL → 0)
users = spark.table(“users”)
buyers = users[users.kind == “buyer”]
train_set = buyers[“date”, “zip”, “price”]
.fillna(0)
User program
client library
Summary
Lakehouse systems combine the benefits of data warehouses & lakes
while simplifying enterprise data architectures
We believe they’ll take over in industry, as
most enterprise data is already in lakes
Structured, Semi & Unstructured Data
BI
Data
Science
Machine
Learning
Reports
Summary
Structured, Semi & Unstructured Data
BI
Data
Science
Machine
Learning
Reports
Result: simplify data architectures
to improve both reliability &
freshness
Machine Learning
Development is Complex
Traditional Software vs. Machine Learning
§ Goal: Meet a functional specification
§ Quality depends only on code
§ Typically pick one software stack w/
fewer libraries and tools
§ Limited deployment environments
§ Goal: Optimize metric (e.g., accuracy.
Constantly experiment to improve it
§ Quality depends on input data and
tuning parameters
§ Over time data changes; models drift…
§ Compare + combine many libraries,
model
§ Diverse deployment environments
Machine Learning
Traditional Software
But Building ML Applications is Complex
Data Prep
Training
Deployment
Raw Data
ML Engineer
Application Developer
Data Engineer
▪ Continuous, iterative process
▪ Dependent on reliable data
▪ Constantly update data & metrics
▪ Many teams and systems involved
μ
λ θ Tuning
Scale
μ
λ θ Tuning
Scale
Scale
Scale
Model
Exchange
Governance
: An Open-Source ML Platform
Experiment
management
Model management
Reproducible runs Model packaging
and deployment
T R A C K I N G P R O J E C T S M O D E L R E G I S T R Y
M O D E L S
Training
Deployment
Raw Data
Data Prep
ML Engineer
Application
Developer
Data Engineer
Any Language
Any ML Library
Key Concepts in MLflow Tracking
Parameters: key-value inputs to your code
Metrics: numeric values (can update over time)
Tags and Notes: information about a run
Artifacts: files, data, and models
Source: what code ran?
Version: what of the code?
Run: an instance of code that runs by MLflow
Experiment: {Run, … Run}
$ mlflow ui
Model Development with MLflow is Simple!
import mlflow
data = load_text(file_name=file)
ngrams = extract_ngrams(data, N=n)
model = train_model(ngrams,
learning_rate=lr)
score = compute_accuracy(model)
with mlflow.start_run():
mlflow.log_param(“data_file”, file)
mlflow.log_param(“n”, n)
mlflow.log_param(“learn_rate”, lr)
mlflow.log_metric(“score”, score)
mlflow.sklearn.log_model(model)
Track parameters, metrics,
artifacts, output files & code version
Search using UI or API
Tracking for ML Experiments
Easily track parameters, metrics, and artifacts in popular ML libraries
Library integrations:
MLflow Components
Tracking
Record and query
experiments: code,
data, config, and results
Projects
Package data
science code in a
format that enables
reproducible runs
on many platform
Models
Deploy machine
learning models in
diverse serving
environments
Model
Registry
Store, annotate
and manage
models in a
central repository
mlflow.org github.com/mlflow twitter.com/MLflow
databricks.com
/mlflow
Project Spec
Code
Data
Config
Local Execution
Remote Execution
MLflow Projects
Dependencies
MLflow Components
Tracking
Record and query
experiments: code,
data, config, and results
Projects
Package data
science code in a
format that enables
reproducible runs
on any platform
Models
Deploy machine
learning models in
diverse serving
environments
Model
Registry
Store, annotate
and manage
models in a
central repository
mlflow.org github.com/mlflow twitter.com/MLflow
databricks.com
/mlflow
Model Format
Flavor 2
Flavor 1
ML Frameworks
Inference Code
Batch & Stream
Scoring
Serving Tools
Standard for ML models
MLflow Models
Model Flavors Example
model = mlflow.pyfunc.load_model(model_uri)
model.predict(pandas.input_dataframe)
….
MLflow Components
Tracking
Record and query
experiments: code,
data, config, and results
Projects
Package data
science code in a
format that enables
reproducible runs
on any platform
Models
Deploy machine
learning models in
diverse serving
environments
Model
Registry
Store, annotate
and manage
models in a
central repository
mlflow.org github.com/mlflow twitter.com/MLflow
databricks.com
/mlflow
The Model Management Problem
When you work in a large organization with many models,
many data teams, management becomes a major
challenge:
• Where can I find the best version of this model?
• How was this model trained?
• How can I track docs for each model?
• How can I review models?
• How can I integrate with CI/CD?
MODEL
DEVELOPER
REVIEWER
MODEL
USER
???
Automated Jobs
REST Serving
Downstream
Users
Reviewers + CI/CD Tools
Staging Production Archived
Model Registry
Data Scientists Deployment Engineers
Parameters Metrics Artifacts
Models
Metadata
Tracking Server
Model Registry
VISION: Centralized and collaborative model lifecycle management
MLflow Model Registry
• Repository of named, versioned
models with controlled Access to Models
• Track each model’s stage: none,
staging, production, or archived
• Easily inspect a specific version and its run info
• Easily load a specific version
• Provides model description, lineage and activities
Model Registry Workflow API
Model Registry
MODEL
DEVELOPER
DOWNSTREAM
USERS
AUTOMATED JOBS
REST SERVING
REVIEWERS,
CI/CD TOOLS
mlflow.register_model(model_uri,"WeatherForecastModel")
mlflow.sklearn.log_model(model,
artifact_path=”sklearn_model”,
registered_model_name= “WeatherForecastModel”)
client = mlflow.tracking.Mlflowclient()
client.transition_model_version_stage(name=”WeatherForecastModel”,
version=5,
stage="Production")
model_uri= "models:/{model_name}/production".format(
model_name="WeatherForecastModel")
model_prod = mlflow.sklearn.load_model(model_uri)
model_prod.predict(data)
Databricks Webhooks allow setting
callbacks on registry events like
stage transitions to run CI/CD tools
Model Registry: Webhooks
j
u
s
t
l
a
u
n
c
h
e
d
Staging Production Archived
Data Scientists Deployment Engineers
Model Registry
Human
Reviewers
CI/CD Tools
Batch
Scoring
Real-time
Serving
v2
v3 v1
VERSION_REGISTERED: MyModel, v2
TRANSITION_REQUEST: MyModel, v2, Staging→Production
TAG_ADDED: MyModel, v2, BacktestPassed
MLflow Model Registry Recap
• Central Repository: Unique named registered models for
discovery across data teams
• Model Registry Workflow: Provides UI and API for registry
operations
• Model Versioning: Allow multiple versions of model in
different stages
• Model Stages: Allow stage transition: none, staging,
production, or archived
• CI/CD Integration: Easily load a specific version for testing
and inspection and webbooks for events notification
• Model Lineage: Provides model description, lineage and
activities
Staging Production Archived
Model Registry
Data Scientists Deployment Engineers
Summary
§ Lakehouse systems combine the benefits of data warehouses & lakes
while simplifying enterprise data architectures
§ With simplified architecture and use of Delta Lake as part of Lakehouse
storage layer and MLflow help to scale Advanced Analytics workloads
§ Other tools include Koalas (scalable EDA)
Learn More
§ Download and learn Delta Lake at delta.io
§ Download and learn MLflow at mlflow.org
§ Download and learn Koalas at Koalas GitHub
Resources & Fun to Read
§ Lakehouse: A New Generation of Open Platforms that Unify Data
Warehousing and Advanced Analytics
§ What is Lakehouse and why
§ What is Delta Lake and why
§ What is MLflow and Why
§ We don’t need data scientists, we need data engineers
§ Data Science is different now
Thank you! J
Q & A
jules@databricks.com
@2twitme
https://www.linkedin.com/in/dmatrix/

More Related Content

What's hot

Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPDatabricks
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Adrien Blind
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningDatabricks
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on AzureTrivadis
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksDatabricks
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?confluent
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptxWasm1953
 

What's hot (20)

Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
 
Data Mesh
Data MeshData Mesh
Data Mesh
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 

Similar to The Future of Data Science and Machine Learning at Scale: A Look at MLflow, Delta Lake, and Emerging Tools

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformDatabricks
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldRob Gillen
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveTorsten Steinbach
 
Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Paris Carbone
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Databricks
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...Tao Cheng
 

Similar to The Future of Data Science and Machine Learning at Scale: A Look at MLflow, Delta Lake, and Emerging Tools (20)

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
r4
r4r4
r4
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
 
Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...
 

More from Databricks

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityDatabricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueDatabricks
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentDatabricks
 

More from Databricks (20)

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 

Recently uploaded

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 

Recently uploaded (20)

Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 

The Future of Data Science and Machine Learning at Scale: A Look at MLflow, Delta Lake, and Emerging Tools

  • 1. The Future? of Data Science/ML at Scale: A Look at Lakehouse, Delta Lake & MLflow Jules S. Damji Databricks, Inc 2/10/2021 @ UCB iSchool http://dbricks.co/ucbi-webinar @2twitme
  • 2.
  • 3. Talk Outline § What is the data problem? § What impedes advanced analytics? § Past & present data architecture to look to the future? § What’s “Lakehouse” paradigm § Delta Lake & MLflow
  • 4. About Cloud data platform for analytics, engineering and data science Runs a fleet of millions of VMs to process exabytes of data/day >7000 customers
  • 5. The biggest challenges with data today: data quality, staleness , data volume and scale How to grapple data beyond 2020 . . .
  • 6. Data Analyst Survey 60% reported data quality as top challenge 86% of analysts had to use stale data, with 41% using data that is >2 months old 90% regularly had unreliable data sources over the last 12 months ≈ç ≈ç ≈ç
  • 8. Getting high-quality, timely data is hard… but it’s partly a problem of our own making!
  • 10. 1980s: Data Warehouses § ETL data directly from operational database systems § Purpose-built for SQL analytics & BI: schemas, indexes, caching, etc. § Powerful management features such as ACID transactions and time travel ETL Operational Data Data Warehouses BI Reports
  • 11. 2010s: New Problems for Data Warehouses § Could not support rapidly growing unstructured and semi-structured data: time series, logs, images, documents, etc. § High cost to store large datasets § No support for data science & ML ETL Operational Data Data Warehouses BI Reports
  • 12. 2010s: Data Lakes § Low-cost storage to hold all raw data (e.g., Amazon S3, HDFS) ▪ $12/TB/month for S3 infrequent tier! § ETL jobs then load specific data into warehouses, possibly for further ELT § Directly readable in ML libraries (e.g., TensorFlow, PyTorch) due to open file format BI Data Science Machine Learning Structured, Semi-Structured & Unstructured Data Data Lake Real-Time Database Reports Data Warehouses Data Preparation ETL
  • 13. Problems with Today’s Data Lakes Cheap to store all the data, but system architecture is much more complex! Data reliability suffers: § Multiple storage systems with different semantics, SQL dialects, etc. § Extra ETL steps that can go wrong Timeliness suffers & High Cost: § Extra ETL steps before data is available in data warehouses § Continuous ETL, duplicated storage BI Data Science Machine Learning Structured, Semi-Structured & Unstructured Data Data Lake Real-Time Database Reports Data Warehouses Data Preparation ETL
  • 14. Streaming Analytics BI Data Science Machine Learning Structured, Semi-Structured & Unstructured Data Lakehouse Vision Data lake storage for all data Single platform for every use case Management features (transactions, versioning, etc.)
  • 15. Lakehouse Systems Implement data warehouse management and performance features on top of directly-accessible data in open formats Structured, Semi & Unstructured Data Data Lake Storage BI Data Science Machine Learning Reports Management & Performance Layer ETL Can we get state-of-the-art performance & governance features with this design? Cheap storage in open formats for all data Direct access to data files SQL
  • 16. Key Technologies Enabling Lakehouse 1. Metadata layers for data lakes: add transactions, versioning & more 2. New query engine designs: great SQL performance on data lake storage systems and file formats 3. Declarative access for data science & ML
  • 17. Metadata Layers for Data Lakes § Track which files are part of a table version to offer rich management features like transactions ▪ Clients can then access the underlying files at high speed ▪ Optimistic Concurrency § Implemented in multiple systems: § Examples: ACID Client Application Metadata Layer Data Lake Which files are part of table v1? f1, f2 f3 f1 f2 f3 f4
  • 18. Example with file1.parquet file2.parquet file3.parquet “events” table _delta_log / v1.parquet / v2.parquet Query: delete all events data about customer #17 file1b.parquet file3b.parquet rewrite rewrite track which files are part of each version of the table (e.g., v2 = file1, file2, file3) _delta_log / v3.parquet atomically add new log file v3 = file1b, file2, file3b Clients now always read a consistent table version! • If a client reads v2 of log, it sees file1, file2, file3 (no delete) • If a client reads v3 of log, it sees file1b, file2, file3b (all deleted) See our VLDB 2020 paper for details
  • 19. Other Management Features with § Streaming I/O: treat a table as a stream of changes to remove need for message buses like Kafka § INSERT, UPSERT, DELETE & MERGE § Time travel to an old table version § Schema enforcement & evolution § Expectations for data quality CREATE TABLE orders ( product_id INTEGER NOT NULL, quantity INTEGER CHECK(quantity > 0), list_price DECIMAL CHECK(list_price > 0), discount_price DECIMAL CHECK(discount_price > 0 AND discount_price <= list_price) ); spark.readStream .format("delta") .table("events")
  • 20. Adoption § Already > 50% of Databricks I/O workload (exabytes/day) § Broad industry support Ingest from Query from Store data in
  • 21. Key Technologies Enabling Lakehouse 1. Metadata layers for data lakes: add transactions, versioning & more 2. New query engine designs: great SQL performance on data lake storage systems and file formats 3. Optimized access for data science & ML
  • 22. Lakehouse Engine Optimizations Directly-accessible file storage optimizations can enable high SQL performance: § Caching hot data in RAM/SSD, possibly transcoded § Data layout within files to cluster co-accessed data (e.g., sorting or multi-dimensional clustering,) § Auxiliary data structures like statistics and indexes § Vectorized execution engines for modern CPUs Minimize I/O for cold data, which is the dominant cost Match DWs on hot data New query engines such as Databricks Delta Engine use these ideas
  • 23. Example: Databricks Delta Engine Vectorized engine for Spark SQL that uses SSD caching, multi-dimensional clustering and zone maps over Parquet files 2996 7143 5793 37283 3302 0 10000 20000 30000 40000 DW1 DW2 DW3 DW4 Delta Engine (on-demand) TPC-DS 30TB Power Test Duration (s)
  • 24. Key Technologies Enabling Lakehouse 1. Metadata layers for data lakes: add transactions, versioning & more 2. New query engine designs: great SQL performance on data lake storage systems and file formats 3. Optimized access for data science & ML
  • 25. DS/ML/DL over a Lakehouse § ML frameworks already support reading Parquet, ORC, etc. § New declarative interfaces for I/O enable further optimization § Example: Spark DataFrame API compiles to relational algebra ... model.fit(train_set) Lazily evaluated queryplan Optimized execution using cache, statistics, index, etc users SELECT(kind = “buyer”) PROJECT(date, zip, …) PROJECT(NULL → 0) users = spark.table(“users”) buyers = users[users.kind == “buyer”] train_set = buyers[“date”, “zip”, “price”] .fillna(0) User program client library
  • 26. Summary Lakehouse systems combine the benefits of data warehouses & lakes while simplifying enterprise data architectures We believe they’ll take over in industry, as most enterprise data is already in lakes Structured, Semi & Unstructured Data BI Data Science Machine Learning Reports
  • 27. Summary Structured, Semi & Unstructured Data BI Data Science Machine Learning Reports Result: simplify data architectures to improve both reliability & freshness
  • 29. Traditional Software vs. Machine Learning § Goal: Meet a functional specification § Quality depends only on code § Typically pick one software stack w/ fewer libraries and tools § Limited deployment environments § Goal: Optimize metric (e.g., accuracy. Constantly experiment to improve it § Quality depends on input data and tuning parameters § Over time data changes; models drift… § Compare + combine many libraries, model § Diverse deployment environments Machine Learning Traditional Software
  • 30. But Building ML Applications is Complex Data Prep Training Deployment Raw Data ML Engineer Application Developer Data Engineer ▪ Continuous, iterative process ▪ Dependent on reliable data ▪ Constantly update data & metrics ▪ Many teams and systems involved μ λ θ Tuning Scale μ λ θ Tuning Scale Scale Scale Model Exchange Governance
  • 31. : An Open-Source ML Platform Experiment management Model management Reproducible runs Model packaging and deployment T R A C K I N G P R O J E C T S M O D E L R E G I S T R Y M O D E L S Training Deployment Raw Data Data Prep ML Engineer Application Developer Data Engineer Any Language Any ML Library
  • 32. Key Concepts in MLflow Tracking Parameters: key-value inputs to your code Metrics: numeric values (can update over time) Tags and Notes: information about a run Artifacts: files, data, and models Source: what code ran? Version: what of the code? Run: an instance of code that runs by MLflow Experiment: {Run, … Run}
  • 33. $ mlflow ui Model Development with MLflow is Simple! import mlflow data = load_text(file_name=file) ngrams = extract_ngrams(data, N=n) model = train_model(ngrams, learning_rate=lr) score = compute_accuracy(model) with mlflow.start_run(): mlflow.log_param(“data_file”, file) mlflow.log_param(“n”, n) mlflow.log_param(“learn_rate”, lr) mlflow.log_metric(“score”, score) mlflow.sklearn.log_model(model) Track parameters, metrics, artifacts, output files & code version Search using UI or API
  • 34. Tracking for ML Experiments Easily track parameters, metrics, and artifacts in popular ML libraries Library integrations:
  • 35. MLflow Components Tracking Record and query experiments: code, data, config, and results Projects Package data science code in a format that enables reproducible runs on many platform Models Deploy machine learning models in diverse serving environments Model Registry Store, annotate and manage models in a central repository mlflow.org github.com/mlflow twitter.com/MLflow databricks.com /mlflow
  • 36. Project Spec Code Data Config Local Execution Remote Execution MLflow Projects Dependencies
  • 37. MLflow Components Tracking Record and query experiments: code, data, config, and results Projects Package data science code in a format that enables reproducible runs on any platform Models Deploy machine learning models in diverse serving environments Model Registry Store, annotate and manage models in a central repository mlflow.org github.com/mlflow twitter.com/MLflow databricks.com /mlflow
  • 38. Model Format Flavor 2 Flavor 1 ML Frameworks Inference Code Batch & Stream Scoring Serving Tools Standard for ML models MLflow Models
  • 39. Model Flavors Example model = mlflow.pyfunc.load_model(model_uri) model.predict(pandas.input_dataframe) ….
  • 40. MLflow Components Tracking Record and query experiments: code, data, config, and results Projects Package data science code in a format that enables reproducible runs on any platform Models Deploy machine learning models in diverse serving environments Model Registry Store, annotate and manage models in a central repository mlflow.org github.com/mlflow twitter.com/MLflow databricks.com /mlflow
  • 41. The Model Management Problem When you work in a large organization with many models, many data teams, management becomes a major challenge: • Where can I find the best version of this model? • How was this model trained? • How can I track docs for each model? • How can I review models? • How can I integrate with CI/CD? MODEL DEVELOPER REVIEWER MODEL USER ???
  • 42. Automated Jobs REST Serving Downstream Users Reviewers + CI/CD Tools Staging Production Archived Model Registry Data Scientists Deployment Engineers Parameters Metrics Artifacts Models Metadata Tracking Server Model Registry VISION: Centralized and collaborative model lifecycle management
  • 43. MLflow Model Registry • Repository of named, versioned models with controlled Access to Models • Track each model’s stage: none, staging, production, or archived • Easily inspect a specific version and its run info • Easily load a specific version • Provides model description, lineage and activities
  • 44. Model Registry Workflow API Model Registry MODEL DEVELOPER DOWNSTREAM USERS AUTOMATED JOBS REST SERVING REVIEWERS, CI/CD TOOLS mlflow.register_model(model_uri,"WeatherForecastModel") mlflow.sklearn.log_model(model, artifact_path=”sklearn_model”, registered_model_name= “WeatherForecastModel”) client = mlflow.tracking.Mlflowclient() client.transition_model_version_stage(name=”WeatherForecastModel”, version=5, stage="Production") model_uri= "models:/{model_name}/production".format( model_name="WeatherForecastModel") model_prod = mlflow.sklearn.load_model(model_uri) model_prod.predict(data)
  • 45. Databricks Webhooks allow setting callbacks on registry events like stage transitions to run CI/CD tools Model Registry: Webhooks j u s t l a u n c h e d Staging Production Archived Data Scientists Deployment Engineers Model Registry Human Reviewers CI/CD Tools Batch Scoring Real-time Serving v2 v3 v1 VERSION_REGISTERED: MyModel, v2 TRANSITION_REQUEST: MyModel, v2, Staging→Production TAG_ADDED: MyModel, v2, BacktestPassed
  • 46. MLflow Model Registry Recap • Central Repository: Unique named registered models for discovery across data teams • Model Registry Workflow: Provides UI and API for registry operations • Model Versioning: Allow multiple versions of model in different stages • Model Stages: Allow stage transition: none, staging, production, or archived • CI/CD Integration: Easily load a specific version for testing and inspection and webbooks for events notification • Model Lineage: Provides model description, lineage and activities Staging Production Archived Model Registry Data Scientists Deployment Engineers
  • 47. Summary § Lakehouse systems combine the benefits of data warehouses & lakes while simplifying enterprise data architectures § With simplified architecture and use of Delta Lake as part of Lakehouse storage layer and MLflow help to scale Advanced Analytics workloads § Other tools include Koalas (scalable EDA)
  • 48. Learn More § Download and learn Delta Lake at delta.io § Download and learn MLflow at mlflow.org § Download and learn Koalas at Koalas GitHub
  • 49. Resources & Fun to Read § Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics § What is Lakehouse and why § What is Delta Lake and why § What is MLflow and Why § We don’t need data scientists, we need data engineers § Data Science is different now
  • 50. Thank you! J Q & A jules@databricks.com @2twitme https://www.linkedin.com/in/dmatrix/