Many have dubbed the 2020s the decade of data; this is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach in which data plays a central role in our everyday lives.
As the volume and variety of data garnered from myriad sources continue to grow at an astronomical scale, and as cloud computing offers cheap compute and storage resources at scale, data platforms must keep pace in their ability to process, analyze, and visualize data at scale, at speed, and with ease. This demands paradigm shifts in how data is processed and stored, and in the programming frameworks that let developers access and work with these data platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale: how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they help future data scientists get started quickly.
In particular, we will examine in detail two open-source tools: MLflow (for managing the machine learning life cycle) and Delta Lake (for reliable storage of structured and unstructured data).
We will also touch on other emerging tools, such as Koalas, which helps data scientists do exploratory data analysis at scale in a language and framework they are already familiar with, as well as emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open-source tools are at your disposal for doing data science and machine learning at scale.
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, Delta Lake, and Emerging Tools
1. The Future? of Data Science/ML at Scale: A Look at Lakehouse, Delta Lake & MLflow
Jules S. Damji
Databricks, Inc
2/10/2021 @ UCB iSchool
http://dbricks.co/ucbi-webinar
@2twitme
3. Talk Outline
§ What is the data problem?
§ What impedes advanced analytics?
§ Past & present data architectures: what can they tell us about the future?
§ What is the “Lakehouse” paradigm?
§ Delta Lake & MLflow
4. About Databricks
Cloud data platform for analytics, engineering and data science
Runs a fleet of millions of VMs to process exabytes of data/day
>7000 customers
5. The biggest challenges with data today: data quality, staleness, data volume and scale
How to grapple with data beyond 2020 . . .
6. Data Analyst Survey
60% reported data quality as top challenge
86% of analysts had to use stale data, with 41% using data that is >2 months old
90% regularly had unreliable data sources over the last 12 months
10. 1980s: Data Warehouses
§ ETL data directly from operational database systems
§ Purpose-built for SQL analytics & BI: schemas, indexes, caching, etc.
§ Powerful management features such as ACID transactions and time travel
[Diagram: Operational Data → ETL → Data Warehouses → BI Reports]
11. 2010s: New Problems for Data Warehouses
§ Could not support rapidly growing unstructured and semi-structured data: time series, logs, images, documents, etc.
§ High cost to store large datasets
§ No support for data science & ML
[Diagram: Operational Data → ETL → Data Warehouses → BI Reports]
12. 2010s: Data Lakes
§ Low-cost storage to hold all raw data (e.g., Amazon S3, HDFS)
▪ $12/TB/month for the S3 infrequent-access tier!
§ ETL jobs then load specific data into warehouses, possibly for further ELT
§ Directly readable in ML libraries (e.g., TensorFlow, PyTorch) due to open file formats (see the sketch after this slide)
[Diagram: Structured, Semi-Structured & Unstructured Data → Data Lake → Data Preparation/ETL → Data Warehouses and a Real-Time Database → BI, Reports, Data Science, Machine Learning]
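To illustrate the point about open file formats: data in a lake can be read straight into Python tooling with no warehouse ETL in between. A minimal sketch, assuming pandas with a Parquet engine installed; the bucket path is illustrative (reading from S3 also assumes s3fs).

    # Read raw Parquet files directly from the data lake (path illustrative).
    import pandas as pd

    df = pd.read_parquet("s3://my-bucket/events/")
    print(df.head())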
13. Problems with Today’s Data Lakes
Cheap to store all the data, but the system architecture is much more complex!
Data reliability suffers:
§ Multiple storage systems with different semantics, SQL dialects, etc.
§ Extra ETL steps that can go wrong
Timeliness suffers & costs are high:
§ Extra ETL steps before data is available in data warehouses
§ Continuous ETL, duplicated storage
[Diagram: the same data-lake architecture as on slide 12]
15. Lakehouse Systems
Implement data warehouse management and performance features on top of directly-accessible data in open formats
[Diagram: Structured, Semi- & Unstructured Data → Data Lake Storage (cheap storage in open formats for all data) → Management & Performance Layer → ETL and SQL for BI & Reports, plus direct access to data files for Data Science & Machine Learning]
Can we get state-of-the-art performance & governance features with this design?
16. Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake storage systems and file formats
3. Declarative access for data science & ML
17. Metadata Layers for Data Lakes
§ Track which files are part of a table version to offer rich management features like transactions
▪ Clients can then access the underlying files at high speed
▪ Optimistic concurrency
§ Implemented in multiple systems; examples include Delta Lake, Apache Iceberg, and Apache Hudi
[Diagram: a client application asks the metadata layer “which files are part of table v1?” and gets back f1, f2, f3 with ACID guarantees, then reads those files directly from the data lake, which holds f1, f2, f3, f4]
18. Example with Delta Lake
The “events” table consists of file1.parquet, file2.parquet, and file3.parquet; the _delta_log (v1.parquet, v2.parquet) tracks which files are part of each version of the table (e.g., v2 = file1, file2, file3).
Query: delete all events data about customer #17
▪ Rewrite file1.parquet → file1b.parquet and file3.parquet → file3b.parquet
▪ Atomically add a new log file: _delta_log/v3.parquet, with v3 = file1b, file2, file3b
Clients now always read a consistent table version!
• If a client reads v2 of the log, it sees file1, file2, file3 (no delete)
• If a client reads v3 of the log, it sees file1b, file2, file3b (all deleted)
See our VLDB 2020 paper for details
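For reference, here is how such a delete might look through the open-source Delta Lake Python API (delta-spark); a minimal sketch, assuming an active SparkSession named spark, with an illustrative table and column name.

    from delta.tables import DeltaTable

    # Look up the Delta table by name (assumes it is registered in the metastore).
    events = DeltaTable.forName(spark, "events")

    # Rewrites only the affected Parquet files, then atomically commits a new
    # version (e.g., v3) to the _delta_log.
    events.delete("customer_id = 17")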
19. Other Management Features with Delta Lake
§ Streaming I/O: treat a table as a stream of changes to remove the need for message buses like Kafka
§ INSERT, UPSERT, DELETE & MERGE
§ Time travel to an old table version (see the sketch after this slide)
§ Schema enforcement & evolution
§ Expectations for data quality

    CREATE TABLE orders (
      product_id INTEGER NOT NULL,
      quantity INTEGER CHECK(quantity > 0),
      list_price DECIMAL CHECK(list_price > 0),
      discount_price DECIMAL
        CHECK(discount_price > 0 AND
              discount_price <= list_price)
    );

    spark.readStream
      .format("delta")
      .table("events")
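Time travel in Delta Lake can be exercised through a reader option; a minimal sketch, where the path and version number are illustrative.

    # Read an older snapshot of a Delta table by path.
    df_v2 = (spark.read.format("delta")
             .option("versionAsOf", 2)
             .load("/data/events"))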
20. Adoption
§ Already > 50% of Databricks I/O workload (exabytes/day)
§ Broad industry support
[Logos: broad ecosystem of partner tools to ingest from, query from, and store data in]
21. Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake storage systems and file formats
3. Optimized access for data science & ML
22. Lakehouse Engine Optimizations
Directly-accessible file storage optimizations can enable high SQL performance:
§ Caching hot data in RAM/SSD, possibly transcoded
§ Data layout within files to cluster co-accessed data (e.g., sorting or multi-dimensional clustering)
§ Auxiliary data structures like statistics and indexes
§ Vectorized execution engines for modern CPUs
Goal: minimize I/O for cold data, which is the dominant cost, and match data warehouses on hot data.
New query engines such as Databricks Delta Engine use these ideas
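As a concrete example of multi-dimensional clustering, Databricks exposes an OPTIMIZE ... ZORDER BY command on Delta tables; a hedged sketch, where the table and column names are illustrative, and the command is a Databricks/Delta feature rather than core Apache Spark.

    # Cluster co-accessed data so per-file statistics let the engine skip
    # files whose min/max ranges cannot match a query predicate.
    spark.sql("OPTIMIZE events ZORDER BY (event_date, customer_id)")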
23. Example: Databricks Delta Engine
Vectorized engine for Spark SQL that uses SSD caching, multi-dimensional clustering and zone maps over Parquet files
[Bar chart of TPC-DS 30TB Power Test Duration in seconds: DW1: 2996, DW2: 7143, DW3: 5793, DW4: 37283, Delta Engine (on-demand): 3302]
24. Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake storage systems and file formats
3. Optimized access for data science & ML
25. DS/ML/DL over a Lakehouse
§ ML frameworks already support reading Parquet, ORC, etc.
§ New declarative interfaces for I/O enable further optimization
§ Example: the Spark DataFrame API compiles to relational algebra

User program (client library):

    users = spark.table("users")
    buyers = users[users.kind == "buyer"]
    train_set = buyers.select("date", "zip", "price").fillna(0)
    ...
    model.fit(train_set)

This builds a lazily evaluated query plan:

    users → SELECT(kind = "buyer") → PROJECT(date, zip, …) → PROJECT(NULL → 0)

which enables optimized execution using cache, statistics, indexes, etc.
26. Summary
Lakehouse systems combine the benefits of data warehouses & lakes while simplifying enterprise data architectures
We believe they’ll take over in industry, as most enterprise data is already in lakes
[Diagram: Structured, Semi- & Unstructured Data feeding BI, Reports, Data Science, and Machine Learning]
27. Summary
[Diagram: the same Lakehouse architecture as above]
Result: simplify data architectures to improve both reliability & freshness
29. Traditional Software vs. Machine Learning
Traditional Software:
§ Goal: meet a functional specification
§ Quality depends only on code
§ Typically pick one software stack with fewer libraries and tools
§ Limited deployment environments
Machine Learning:
§ Goal: optimize a metric (e.g., accuracy); constantly experiment to improve it
§ Quality depends on input data and tuning parameters
§ Over time data changes; models drift…
§ Compare + combine many libraries and models
§ Diverse deployment environments
30. But Building ML Applications is Complex
[Diagram: Raw Data → Data Prep → Training (tuning μ, λ, θ at scale) → Deployment, involving Data Engineers, ML Engineers, and Application Developers, with model exchange and governance across the pipeline]
▪ Continuous, iterative process
▪ Dependent on reliable data
▪ Constantly update data & metrics
▪ Many teams and systems involved
31. MLflow: An Open-Source ML Platform
§ TRACKING: experiment management
§ PROJECTS: reproducible runs
§ MODELS: model packaging and deployment
§ MODEL REGISTRY: model management
Works with any language and any ML library, across the Raw Data → Data Prep → Training → Deployment pipeline (Data Engineers, ML Engineers, Application Developers).
32. Key Concepts in MLflow Tracking
Parameters: key-value inputs to your code
Metrics: numeric values (can update over time)
Tags and Notes: information about a run
Artifacts: files, data, and models
Source: what code ran?
Version: what version of the code?
Run: an instance of code executed by MLflow
Experiment: a collection of runs {Run, …, Run}
33. Model Development with MLflow is Simple!

    import mlflow

    data = load_text(file_name=file)
    ngrams = extract_ngrams(data, N=n)
    model = train_model(ngrams, learning_rate=lr)
    score = compute_accuracy(model)

    with mlflow.start_run():
        mlflow.log_param("data_file", file)
        mlflow.log_param("n", n)
        mlflow.log_param("learn_rate", lr)
        mlflow.log_metric("score", score)
        mlflow.sklearn.log_model(model, "model")

Track parameters, metrics, artifacts, output files & code version. Search using the UI or API; launch the UI with:

    $ mlflow ui
34. Tracking for ML Experiments
Easily track parameters, metrics, and artifacts in popular ML libraries
Library integrations: [logos of supported ML libraries]; a minimal autologging sketch follows.
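A minimal sketch of auto-instrumentation with mlflow.autolog(), which patches supported libraries to log parameters, metrics, and models on fit; the scikit-learn model and dataset here are illustrative.

    import mlflow
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    mlflow.autolog()  # enable autologging for supported libraries

    X, y = load_iris(return_X_y=True)
    with mlflow.start_run():
        # Parameters, metrics, and the fitted model are logged automatically.
        LogisticRegression(max_iter=200).fit(X, y)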
35. MLflow Components
§ Tracking: record and query experiments: code, data, config, and results
§ Projects: package data science code in a format that enables reproducible runs on any platform
§ Models: deploy machine learning models in diverse serving environments
§ Model Registry: store, annotate and manage models in a central repository
mlflow.org | github.com/mlflow | twitter.com/MLflow | databricks.com/mlflow
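To illustrate the Projects component, a project run can also be launched programmatically; a minimal sketch using the public mlflow/mlflow-example repository, where the parameter value is illustrative.

    import mlflow

    # Fetch the project, resolve its environment, and execute its main
    # entry point with the given parameters.
    mlflow.projects.run(
        uri="https://github.com/mlflow/mlflow-example",
        parameters={"alpha": 0.4},
    )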
37. MLflow Components (recap)
38. MLflow Models: a standard format for ML models
[Diagram: ML frameworks produce a model in the MLflow Models format, which exposes multiple “flavors” (Flavor 1, Flavor 2) consumed by inference code, batch & stream scoring, and serving tools]
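The “flavors” idea means a model logged with any framework flavor can also be loaded generically through the python_function flavor; a minimal sketch, where the runs:/ URI and the feature frame are illustrative placeholders.

    import pandas as pd
    import mlflow.pyfunc

    # Substitute a real run ID; "model" is the artifact path used at logging time.
    model = mlflow.pyfunc.load_model("runs:/<run_id>/model")

    # Illustrative feature frame; columns depend on the trained model.
    input_df = pd.DataFrame({"feature_1": [0.5], "feature_2": [1.2]})
    predictions = model.predict(input_df)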
40. MLflow Components (recap)
41. The Model Management Problem
When you work in a large organization with many models and many data teams, model management becomes a major challenge:
• Where can I find the best version of this model?
• How was this model trained?
• How can I track docs for each model?
• How can I review models?
• How can I integrate with CI/CD?
[Diagram: model developer, reviewer, and model user with no clear process connecting them]
42. VISION: Centralized and collaborative model lifecycle management
[Diagram: a Tracking Server (parameters, metrics, artifacts, models, metadata) feeds the Model Registry, where data scientists and deployment engineers move models through Staging, Production, and Archived stages; reviewers and CI/CD tools gate transitions, and registered models serve automated jobs, REST serving, and downstream users]
43. MLflow Model Registry
• Repository of named, versioned models with controlled access
• Track each model’s stage: none, staging, production, or archived
• Easily inspect a specific version and its run info (see the sketch after this list)
• Easily load a specific version
• Provides model description, lineage and activities
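A minimal sketch of inspecting a registered model version with the MLflow client API; the model name and version number are illustrative.

    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    mv = client.get_model_version(name="WeatherForecastModel", version="5")
    print(mv.current_stage, mv.run_id, mv.description)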
44. Model Registry Workflow API
[Diagram: the Model Registry connecting model developers, reviewers and CI/CD tools, downstream users, automated jobs, and REST serving]

    # Register a model logged under a run
    mlflow.register_model(model_uri, "WeatherForecastModel")

    # Or register directly when logging
    mlflow.sklearn.log_model(model,
                             artifact_path="sklearn_model",
                             registered_model_name="WeatherForecastModel")

    # Transition a model version between stages
    client = mlflow.tracking.MlflowClient()
    client.transition_model_version_stage(name="WeatherForecastModel",
                                          version=5,
                                          stage="Production")

    # Load and use the current production version
    model_uri = "models:/{model_name}/production".format(
        model_name="WeatherForecastModel")
    model_prod = mlflow.sklearn.load_model(model_uri)
    model_prod.predict(data)
45. Model Registry: Webhooks (just launched)
Databricks webhooks allow setting callbacks on registry events, such as stage transitions, to run CI/CD tools.
[Diagram: the Model Registry (v1, v2, v3 across Staging, Production, Archived stages) between data scientists and deployment engineers; webhook events notify human reviewers, CI/CD tools, batch scoring, and real-time serving]
Example events:
• VERSION_REGISTERED: MyModel, v2
• TRANSITION_REQUEST: MyModel, v2, Staging → Production
• TAG_ADDED: MyModel, v2, BacktestPassed
46. MLflow Model Registry Recap
• Central Repository: unique named registered models for discovery across data teams
• Model Registry Workflow: provides a UI and API for registry operations
• Model Versioning: allows multiple versions of a model in different stages
• Model Stages: allows stage transitions: none, staging, production, or archived
• CI/CD Integration: easily load a specific version for testing and inspection, with webhooks for event notification
• Model Lineage: provides model description, lineage and activities
[Diagram: the Model Registry with Staging, Production, and Archived stages between data scientists and deployment engineers]
47. Summary
§ Lakehouse systems combine the benefits of data warehouses & lakes while simplifying enterprise data architectures
§ This simplified architecture, with Delta Lake as the Lakehouse storage layer and MLflow for the ML life cycle, helps scale advanced analytics workloads
§ Other tools include Koalas (scalable EDA)
48. Learn More
§ Download and learn Delta Lake at delta.io
§ Download and learn MLflow at mlflow.org
§ Download and learn Koalas at Koalas GitHub
49. Resources & Fun to Read
§ Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
§ What is Lakehouse and why
§ What is Delta Lake and why
§ What is MLflow and why
§ We don’t need data scientists, we need data engineers
§ Data Science is different now
50. Thank you!
Q & A
jules@databricks.com
@2twitme
https://www.linkedin.com/in/dmatrix/