1. Hopsworks Feature Store 2.0, a new paradigm
Jim Dowling
Logical Clocks
2020-12-14
1st Global Feature Stores for ML Meetup
2. Growing Consensus on how to manage complexity of AI
Feature Store Online
Distributed Training
Model Serving
A/B Testing
Monitoring
Pipeline Management
HyperParameter Tuning
Feature Store Offline
Feature Engineering
Connectors to External Data Sources
Data Model Prediction
φ(x)
3. Growing Consensus on how to manage complexity of AI
Data validation
Distributed
ENGINEER
Model Serving
A/B Testing
Monitoring
Pipeline Management
HyperParameter Tuning
Feature Engineering
Data Collection
Hardware Management
Data Model Prediction
φ(x)
ML PLATFORM
TRAIN and SERVE
FEATURE STORE
4. End-to-End ML Pipelines and the Feature Store
Data Lake, Warehouse, Kafka
Feature Store
Model registry
Feature Engineering
Model Serving
Model Training
Model Deploy
Features
Validate
Retrieve Feature Values
5. End-to-End ML Pipelines and the Feature Store with CI/CD
Code and configuration
Data Lake, Warehouse, Kafka
Feature Store
Model registry
Feature Engineering
Model Serving
Model Training
Model Deploy
Model Monitoring
Experiments/Development
Features
Validate
Retrieve Feature Values
Log Predictions, Retrieve Feature Statistics for Data Drift Detection
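The drift-detection arrow on this slide can be illustrated with a small sketch. It is not the Hopsworks monitoring implementation: the function name `drift_score`, the sample values, and the threshold are all assumptions made for illustration. The idea is simply to compare statistics computed over the training data with values logged at serving time.

```python
import statistics

def drift_score(training_values, serving_values):
    """Return how many training standard deviations the serving mean has moved."""
    train_mean = statistics.mean(training_values)
    train_std = statistics.stdev(training_values)
    serving_mean = statistics.mean(serving_values)
    return abs(serving_mean - train_mean) / train_std

# Hypothetical feature values: the training set versus values logged while serving.
training_balance = [100.0, 120.0, 80.0, 110.0, 90.0]
serving_ok = [95.0, 105.0, 100.0]
serving_shifted = [300.0, 320.0, 310.0]

THRESHOLD = 2.0  # assumed alerting threshold, not taken from the slides

for name, values in [("ok", serving_ok), ("shifted", serving_shifted)]:
    score = drift_score(training_balance, values)
    print(name, "drift detected" if score > THRESHOLD else "no drift")
```

A production system would more likely compare full distributions per feature rather than a single mean, but the flow is the same: read stored training statistics, compute serving statistics, compare.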
6. End-to-End ML Pipelines and the Feature Store with CI/CD and Provenance
Code and configuration
Data Lake, Warehouse, Kafka
Feature Store
Model registry
Feature Engineering
Model Serving
Model Training
Model Deploy
Model Monitoring
Experiments/Development
Scaleout Metadata
Features
Validate
Retrieve Feature Values
Log Predictions, Retrieve Feature Statistics for Data Drift Detection
Elasticsearch
Sync
7. Hopsworks Feature Store Concepts: Features, Feature Groups, and Training Datasets
Feature Groups and their Features:
Titanic Passenger List: name, Pclass, Sex, Survive
Passenger Bank Account: Name, Balance
8. Hopsworks Feature Store Concepts: Features, Feature Groups, and Training Datasets
Feature Groups and their Features:
Titanic Passenger List: name, Pclass, Sex, Survive
Passenger Bank Account: Name, Balance
Training Datasets (Join of the Feature Groups): Survive, name, PClass, Sex, Balance
9. Hopsworks Feature Store Concepts: Features, Feature Groups, and Training Datasets
Feature Groups and their Features:
Titanic Passenger List: name, Pclass, Sex, Survive
Passenger Bank Account: Name, Balance
Training Datasets (Join of the Feature Groups): Survive, name, PClass, Sex, Balance
File format: .tfrecord, .npy, .csv, .hdf5, .petastorm, etc.
Storage: Azure, S3, HopsFS
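To make the relationship between feature groups and training datasets concrete, here is a minimal plain-Python sketch of the join shown on the slide. The rows are made up, the helper `join_feature_groups` is hypothetical, and the choice of the passenger name as the join key is an assumption read off the diagram. It does not use the Hopsworks API.

```python
# Hypothetical rows for the two feature groups shown on the slide.
titanic_passenger_list = [
    {"name": "Alice", "Pclass": 1, "Sex": "female", "Survive": 1},
    {"name": "Bob", "Pclass": 3, "Sex": "male", "Survive": 0},
]
passenger_bank_account = [
    {"Name": "Alice", "Balance": 2500.0},
    {"Name": "Bob", "Balance": 40.0},
]

def join_feature_groups(left, right, left_key, right_key):
    """Inner-join two lists of feature rows on the given key columns."""
    right_by_key = {row[right_key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[left_key])
        if match is not None:
            merged = dict(row)
            # Copy every right-hand feature except the duplicate key column.
            merged.update({k: v for k, v in match.items() if k != right_key})
            joined.append(merged)
    return joined

# Select the training dataset columns in the order shown on the slide.
columns = ("Survive", "name", "Pclass", "Sex", "Balance")
training_dataset = [
    {col: row[col] for col in columns}
    for row in join_feature_groups(
        titanic_passenger_list, passenger_bank_account, "name", "Name"
    )
]
print(training_dataset)
```

The same features can be reused in other training datasets by joining them with different feature groups, which is why the join is central to feature reuse.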
10. Features are created/updated at different cadences
Click features every 10 secs
CDC data every 30 secs
User profile updates every hour
Featurized weblogs data every day
Online Feature Store
Offline Feature Store
SQL DW
S3, HDFS
SQL
Event Data
Real-Time Data
User-Entered Features (<2 secs)
Online App
Low Latency Features
High Latency Features
Train, Batch App
Feature Store
<10ms
TBs/PBs
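The split between the online and offline stores can be sketched with a toy in-memory model. This is an illustration only: the class `TinyFeatureStore` and its methods are hypothetical and say nothing about how Hopsworks stores data. The point is that the online store keeps only the latest feature values per key for low-latency lookups, while the offline store keeps the full history for training and batch applications.

```python
class TinyFeatureStore:
    """Toy model: the offline store keeps every row, the online store only the latest."""

    def __init__(self):
        self.offline = []  # append-only history, read by training and batch apps
        self.online = {}   # primary key -> latest row, read by online apps

    def insert(self, primary_key, features, event_time):
        row = {"key": primary_key, "event_time": event_time, **features}
        self.offline.append(row)
        current = self.online.get(primary_key)
        # Only overwrite the online value if this row is at least as recent.
        if current is None or event_time >= current["event_time"]:
            self.online[primary_key] = row

    def get_online(self, primary_key):
        return self.online[primary_key]

    def read_offline(self, primary_key):
        return [r for r in self.offline if r["key"] == primary_key]


store = TinyFeatureStore()
# Click features arriving every 10 seconds, as on the slide.
store.insert("user_1", {"clicks": 3}, event_time=0)
store.insert("user_1", {"clicks": 5}, event_time=10)

latest = store.get_online("user_1")     # what an online app would see
history = store.read_offline("user_1")  # what a training job would see
print(latest["clicks"], [r["clicks"] for r in history])
```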
11. FeatureGroup Ingestion in Hopsworks
Feature Store
ClickFeatureGroup
TableFeatureGroup
UserFeatureGroup
LogsFeatureGroup
Event Data
SQL DW
S3, HDFS
SQL
DataFrameAPI
Kafka Input
RTFeatureGroup
Online App
Train, Batch App
User Clicks
DB Updates
User Profile Updates
Weblogs
Hof: Real-time feature Engineering
Kafka Output
12. Hopsworks Feature Store V1 API
First Feature Store with a General Purpose DataFrame API
Feature Store is a cache for materialized features, not a library.
Online and Offline Feature Stores to support low latency and scale, respectively
Reuse of Features means JOINS – Spark as a join engine
13. Hopsworks Feature Store V2 API
Enforce feature-group scope and schema+data versioning as best practice
Better support for multiple feature stores - join features from development and production feature stores
Better support for complex joins of features
First class API support for time-travel
Support any Python or Spark client with a single library
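Time-travel here means reading a feature group as it was at an earlier point in time. The sketch below shows the concept only. The class `VersionedFeatureGroup` and its `commit` and `read(as_of=...)` methods are hypothetical names for illustration and are not the Hopsworks V2 API. It also stores a full snapshot per commit for simplicity, which a real system would avoid.

```python
import bisect

class VersionedFeatureGroup:
    """Toy feature group that can be read as of any earlier commit time."""

    def __init__(self):
        self._commit_times = []  # strictly increasing commit timestamps
        self._snapshots = []     # full state of the feature group after each commit

    def commit(self, commit_time, rows):
        if self._commit_times and commit_time <= self._commit_times[-1]:
            raise ValueError("commits must have increasing timestamps")
        previous = self._snapshots[-1] if self._snapshots else {}
        # Upsert: new rows overwrite rows with the same primary key.
        self._commit_times.append(commit_time)
        self._snapshots.append({**previous, **rows})

    def read(self, as_of=None):
        if not self._snapshots:
            return {}
        if as_of is None:
            return dict(self._snapshots[-1])
        # Find the last commit made at or before `as_of`.
        idx = bisect.bisect_right(self._commit_times, as_of)
        return dict(self._snapshots[idx - 1]) if idx > 0 else {}


rain = VersionedFeatureGroup()
# Keys are (date, location_id) primary keys; values are precipitation features.
rain.commit(1, {("2020-12-01", 7): 0.2})
rain.commit(2, {("2020-12-01", 7): 0.4, ("2020-12-02", 7): 0.9})

print(rain.read(as_of=1))  # state after the first commit only
print(rain.read())         # latest state
```

Being able to re-read the exact data a model was trained on is what makes schema and data versioning useful for reproducibility.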
14. Example Ingestion of data into a FeatureGroup
https://docs.hopsworks.ai/
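# Assumes an existing SparkSession `spark` and a feature store handle `fs`.
from pyspark.sql import functions as F

df = spark.read.json("s3://dataset/rain.json")
# do feature engineering on your dataframe: min-max scale `val` into [0, 1]
stats = df.agg(F.min("val").alias("lo"), F.max("val").alias("hi")).first()
df = df.withColumn("precipitation",
    (F.col("val") - stats["lo"]) / (stats["hi"] - stats["lo"]))
fg = fs.create_feature_group("rain",
    version=1,
    description="Rain features",
    primary_key=['date', 'location_id'],
    online_enabled=True)
fg.save(df)
fg.add_tag(name="ingestion", value="Databricks:jim; Pii;notebook.ipynb")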
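The feature engineering step on this slide is a min-max scaling, (val - min) / (max - min). A minimal plain-Python sketch of that formula follows. The function name `min_max_scale` and the sample readings are illustrative assumptions, and returning zeros for a constant column is one possible convention, not something the slide specifies.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid division by zero (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical rainfall readings.
rain_readings = [0.0, 5.0, 10.0, 20.0]
scaled = min_max_scale(rain_readings)
print(scaled)
```

Note that the minimum and maximum are properties of the whole dataset, so they have to be computed before the per-row transformation can be applied.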
15. Example Creation of Train/Test Data from a Feature Store
https://docs.hopsworks.ai/
# Join features across FeatureGroups. Use "on=[..]" to explicitly enter the JOIN key.
feature_join = (rain_fg.select_all()
    .join(temperature_fg.select_all(), on=["date", "location_id"])
    .join(location_fg.select_all()))
sc = fs.get_storage_connector("myBucket", "S3")
td = fs.create_training_dataset("training_dataset", version=1,
    storage_connector=sc,
    data_format="tfrecords",
    description="Training dataset, TfRecords format",
    splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})
td.save(feature_join)
# When training a model, read the training data (use "test" to read test data):
ds = td.read(split="train")
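The `splits` argument assigns each row of the training dataset to a named split according to the given fractions. How the feature store does this internally is not described on the slide. The sketch below shows one straightforward way to realize a 0.7/0.2/0.1 split in plain Python. The helper `split_rows` and its `seed` parameter are hypothetical.

```python
import random

def split_rows(rows, splits, seed=0):
    """Shuffle rows deterministically and cut them according to the split fractions."""
    if abs(sum(splits.values()) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    result = {}
    names = list(splits)
    start = 0
    cumulative = 0.0
    for i, name in enumerate(names):
        cumulative += splits[name]
        # The last split takes whatever remains, so no row is lost to rounding.
        is_last = i == len(names) - 1
        end = len(shuffled) if is_last else round(cumulative * len(shuffled))
        result[name] = shuffled[start:end]
        start = end
    return result

parts = split_rows(range(100), {"train": 0.7, "test": 0.2, "validate": 0.1})
print({name: len(rows) for name, rows in parts.items()})
```

Fixing the seed keeps the split reproducible, which matters when a model is retrained and compared against an earlier run on the same test split.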