Real-time Feature Engineering with Apache Spark Streaming and Hof
Fabio Buso
Software Engineer @ Logical Clocks AB
Feature stores
▪ Repository for curated ML features ready to be used
▪ Ensure consistency between features used for training and features used for serving
▪ Centralized place to collect:
  ▪ Metadata
  ▪ Statistics
  ▪ Labels/Tags
▪ Spark Summit 2020 talk: https://databricks.com/session_na20/building-a-feature-store-around-dataframes-and-apache-spark
Real-Time Feature Engineering
▪ Data arrives at the clients making inference requests
  ▪ Features cannot be pre-computed and cached in the online feature store
▪ Data needs to be featurized before being sent to the model for prediction
  ▪ One-hot encoding
  ▪ Normalization and scaling of numerical features
  ▪ Window aggregates
▪ Real-time features need to be augmented using the Feature Store
  ▪ Not all features are provided by the client
  ▪ Construct the feature vector using features retrieved from the online feature store
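The featurization steps listed above can be illustrated with a minimal sketch in plain Python. The category vocabulary, the scaling bounds, and the `payment_type` field are hypothetical; the point is only that the same training-time statistics are reused when a request arrives.

```python
# Hypothetical statistics computed once on the training data.
CATEGORIES = ["card", "cash", "transfer"]  # vocabulary for one-hot encoding
AMOUNT_MIN, AMOUNT_MAX = 0.0, 1000.0       # bounds for min-max scaling


def one_hot(value, categories=CATEGORIES):
    """Return a 0/1 list with a 1 at the position of `value` (all zeros if unknown)."""
    return [1 if value == c else 0 for c in categories]


def min_max_scale(x, lo=AMOUNT_MIN, hi=AMOUNT_MAX):
    """Scale x into [0, 1] using the training-time bounds, clipping outliers."""
    scaled = (x - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))


def featurize(request):
    """Turn a raw request dict into a flat feature vector."""
    return one_hot(request["payment_type"]) + [
        min_max_scale(request["transaction_amount"])
    ]


vector = featurize({"payment_type": "cash", "transaction_amount": 145})
print(vector)
```

If each client carried its own copy of `featurize`, every change to the vocabulary or bounds would have to be rolled out to all of them, which is the consistency problem the requirements slide describes.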
Real-Time Requirements
▪ Hide complexity from clients
▪ Strict response time SLA
  ▪ Use cases are usually user-facing
▪ Avoid feature engineering in the client
  ▪ Otherwise feature engineering needs to be implemented for each client using the model
  ▪ Hard to maintain consistency between training and inference
Approach 1: Preprocessing with tf.Transform
▪ Write feature engineering in preprocessing_fn
▪ Transformation is specific to a model
  ▪ Hard to reuse / keep track of transformations at scale
▪ No support for window aggregations
▪ Doesn’t scale with number of features/requests
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Raw input columns
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    # Full-pass analyzers: mean, min/max and vocabulary are computed over the dataset
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = x_centered * y_normalized
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
        's_integerized': s_integerized
    }
Approach 2: KFServing Transformer
▪ Deployed as a separate service
▪ Duplicated feature engineering code
▪ No support for window aggregations
▪ No support for feature enrichment from the online feature store
▪ Not easily extended to save featurized data
import kfserving

class ImageTransformer(kfserving.KFModel):
    def __init__(self, name, predictor_host):
        super().__init__(name)
        self.predictor_host = predictor_host

    def preprocess(self, inputs):
        # image_transform is a user-defined per-instance function (not shown on the slide)
        return {'instances': [image_transform(instance)
                              for instance in inputs['instances']]}

    def postprocess(self, inputs):
        # Predictions are returned to the client unchanged
        return inputs
Approach 3*: Hof
▪ Independent from the model
▪ Pandas UDFs and Spark 3 to scale feature engineering
▪ First class support for online feature store integration
▪ Pluggable to save requests and inference vectors
*Third time’s a charm
*Third time lucky
*Great things come in threes
Hof
▪ gRPC/HTTP endpoint to submit feature engineering requests
▪ Mostly stateless
▪ Forward request to a message queue (Kafka)
▪ Messages are consumed/processed by Spark Streaming application(s)
▪ Messages are sent back on another queue
▪ Response is forwarded back to the user
Hof architecture
Message queue setup
▪ One input topic
▪ N output topics
  ▪ N is the number of Hof instances running
▪ Message:
  ▪ Key: topic to send the response back on
  ▪ Message: data to be feature engineered
▪ Topic lifecycle managed automatically
  ▪ Hof instances talk to Hopsworks REST APIs to create/destroy topics
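The routing scheme on this slide (one shared input topic, one output topic per Hof instance, reply topic carried as the message key) can be sketched with in-memory queues standing in for Kafka. The topic names and the `double_amount` transformation are hypothetical, and no real Kafka client is involved.

```python
import json
import queue
from collections import defaultdict

# In-memory stand-ins for Kafka topics: topic name -> FIFO queue.
topics = defaultdict(queue.Queue)

INPUT_TOPIC = "hof_input"  # hypothetical name of the single shared input topic


def submit(instance_topic, data):
    """Hof side: publish a request keyed by the topic the reply should go to."""
    topics[INPUT_TOPIC].put((instance_topic, json.dumps(data)))


def process_one(transform):
    """Streaming-job side: consume one request, reply on the topic named in its key."""
    reply_topic, payload = topics[INPUT_TOPIC].get_nowait()
    result = transform(json.loads(payload))
    topics[reply_topic].put(json.dumps(result))


def double_amount(data):
    """Placeholder transformation used only for this illustration."""
    return {"customer_id": data["customer_id"],
            "amount_x2": data["transaction_amount"] * 2}


# Two Hof instances share the input topic but each reads its own output topic.
submit("hof_output_0", {"customer_id": 1, "transaction_amount": 145})
submit("hof_output_1", {"customer_id": 2, "transaction_amount": 10})
process_one(double_amount)
process_one(double_amount)

reply_0 = json.loads(topics["hof_output_0"].get_nowait())
reply_1 = json.loads(topics["hof_output_1"].get_nowait())
print(reply_0, reply_1)
```

Because the reply topic travels with the message, the processing side needs no knowledge of how many Hof instances exist.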
Hof architecture
Spark Application setup
▪ Hof does not enforce the schema in the request:
  ▪ Avoid additional deserialization
▪ If requests are self-contained, multiple Spark applications can run in parallel
  ▪ Increase availability and throughput
Hof architecture
Addons
▪ Additional Spark applications can be plugged in
▪ Save incoming data on HopsFS/S3:
  ▪ Make it available for future feature engineering
▪ Save feature engineering output:
  ▪ Auditing
  ▪ Model training
▪ Detect skews in incoming data
  ▪ Trigger alerts and model re-training
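As one possible illustration of the skew-detection add-on, the sketch below compares the mean of an incoming batch against statistics recorded at training time. The threshold, the statistic, and the sample values are assumptions chosen for the example; the talk does not specify how skew is measured.

```python
from statistics import mean


def mean_shift(values, train_mean, train_std):
    """Distance of the batch mean from the training mean, in training std units."""
    return abs(mean(values) - train_mean) / train_std


def is_skewed(values, train_mean, train_std, threshold=2.0):
    """Flag a batch whose mean drifted more than `threshold` std units."""
    return mean_shift(values, train_mean, train_std) > threshold


# Hypothetical training-time statistics for transaction_amount.
TRAIN_MEAN, TRAIN_STD = 100.0, 25.0

normal_batch = [90, 110, 100, 105]
shifted_batch = [400, 420, 390, 410]

normal_flag = is_skewed(normal_batch, TRAIN_MEAN, TRAIN_STD)
shifted_flag = is_skewed(shifted_batch, TRAIN_MEAN, TRAIN_STD)
print(normal_flag, shifted_flag)
```

A flag like `shifted_flag` is the kind of signal that could trigger an alert or a re-training job.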
Client request
{
  "streaming": {
    "transformation": "fraud",
    "data": {
      "customer_id": 1,
      "transaction_amount": 145
    }
  }
}
Application code
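# Feature group definition
import hsfs

# Connection setup is not shown on the slide; hsfs.connection() is assumed here
connection = hsfs.connection()
fs = connection.get_feature_store()

def stream_function(df):
    # aggregations, e.g. implemented with a pandas_udf
    return df

fg = fs.create_streaming_feature_group("example_streaming", version=1)
fg.save(stream_function)

# Processing
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()
fg = fs.get_streaming_feature_group("example_streaming", version=1)
fg.apply()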
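The `# aggregations` placeholder above is where window aggregates would be computed. The sketch below shows one such aggregate, a per-customer sliding-window sum, in plain Python rather than as a pandas UDF so that it runs without Spark. The class name and the 60-second window are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class SlidingWindowSum:
    """Per-customer sum of transaction amounts over the last `window_s` seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        # customer_id -> deque of (timestamp, amount), oldest first
        self.events = defaultdict(deque)

    def add(self, customer_id, timestamp, amount):
        """Record an event and return the windowed sum including it."""
        window = self.events[customer_id]
        window.append((timestamp, amount))
        # Evict events that fell out of the window ending at `timestamp`.
        while window and window[0][0] <= timestamp - self.window_s:
            window.popleft()
        return sum(a for _, a in window)


agg = SlidingWindowSum(window_s=60)
s1 = agg.add(customer_id=1, timestamp=0, amount=100)
s2 = agg.add(customer_id=1, timestamp=30, amount=45)
s3 = agg.add(customer_id=1, timestamp=70, amount=10)  # the t=0 event has expired
print(s1, s2, s3)
```

Unlike the stateless transformations in the earlier approaches, this aggregate depends on previous requests, which is why it needs a streaming engine that keeps state.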
Hof architecture
Streaming + Online Feature Store
▪ Not all of the inference vector has to be computed in real time
▪ Features can be fetched from the online feature store
  ▪ Features are referenced using the training dataset
Client request
{
  "streaming": {
    "transformation": "fraud",
    "data": { … }
  },
  "online": {
    "training_dataset": {
      "name": "fraud model",
      "version": 1
    },
    "filter": {"customer_id": 3}
  }
}
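A request like the one above combines features computed on the fly with features looked up by key. The following sketch shows that merge with a dictionary standing in for the online feature store; the feature names, stored values, and the `build_feature_vector` helper are hypothetical and not part of any Hopsworks API.

```python
# Stand-in for the online feature store: primary key -> precomputed features.
ONLINE_STORE = {
    3: {"avg_amount_30d": 82.5, "num_chargebacks": 0},
}


def build_feature_vector(streaming_features, online_filter, feature_order):
    """Merge real-time features with stored ones, in the order the model expects."""
    stored = ONLINE_STORE[online_filter["customer_id"]]
    # Real-time values take precedence if a name appears in both sources.
    merged = {**stored, **streaming_features}
    return [merged[name] for name in feature_order]


vector = build_feature_vector(
    streaming_features={"amount_scaled": 0.145},
    online_filter={"customer_id": 3},
    # In the talk, the feature order is defined by the training dataset.
    feature_order=["amount_scaled", "avg_amount_30d", "num_chargebacks"],
)
print(vector)
```

Referencing the training dataset in the request gives the serving side the feature list and ordering that the model was trained with.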
DEMO
github.com/logicalclocks
hopsworks.ai
@logicalclocks
