Apache Liminal is an end-to-end platform for data engineers and data scientists that lets them build, train, and deploy machine learning models in a robust and agile way. The platform provides abstractions and declarative capabilities for data extraction and feature engineering, followed by model training and serving, using standard tools and libraries (e.g. Airflow, K8S, Spark, scikit-learn).
4. ML/AI tech & processes have rapidly evolved in recent years
5. ... and are still taking shape: Monitoring, AutoML, MLOps, etc.
6. Natural Intelligence
- A global leader in multi-vertical online comparison marketplaces
- Our matching technology enables consumers to make confident purchasing decisions while helping brands grow their business
7. NI started the journey to automate its core business using ML/AI 1.5 years ago, with a main focus on:
- Website personalization
- Adwords bidding
8. We decided to continue working with proven solutions that we already utilized in our data platform
9. The Orchestration Barrier
- Diversity in infra (e.g. GCP, AWS)
- Numerous platforms and libraries
- Diversified skill-set
- Complex workflows
10. Impact
- Data Scientists can't focus on algorithms & business logic
- The time-to-market (TTM) of ML features & solutions is often too long and unpredictable
15. The Problem
[Diagram: the ML lifecycle and the infrastructure layers beneath it]
- Stages: Build & Train, Deploy & Manage, Monitor
- Steps: Fetch, Clean, Prepare, Train, Evaluate, Validate, Deploy, Batch Inference, Realtime Inference
- Data Science infra: Algorithms, Frameworks, Auto Tuning, ...
- Data Lake & Infra: Data Stores, Meta Store, Workflow and Scheduling, Processing Engines, Feature Store, Model Store
22.
import json

import model_store
from model_store import ModelStore

# Serving reads only from the production model store.
_MODEL_STORE = ModelStore(model_store.PRODUCTION)
_PETAL_WIDTH = 'petal_width'


def predict(input_json):
    """
    Predicts the probability that the input flower is Iris Virginica.
    """
    print(f'input_json={input_json}')
    input_dict = json.loads(input_json)
    # The latest model is loaded on every call; the version is not used here.
    model, version = _MODEL_STORE.load_latest_model()
    # predict_proba returns one row per sample; column 1 is the positive class.
    result = str(model.predict_proba([[input_dict[_PETAL_WIDTH]]])[0][1])
    print(f'result={result}')
    return result
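The request and response shape of `predict` can be sketched without a trained model. The snippet below is a minimal illustration, assuming a hypothetical `StubModel` with made-up coefficients in place of the scikit-learn classifier and a hypothetical `predict_with` helper that takes the model as an argument instead of loading it from a model store.

```python
import json
import math


class StubModel:
    """Hypothetical stand-in for the trained classifier (not part of the slides)."""

    def predict_proba(self, rows):
        # Logistic curve over petal width; the coefficients are made up.
        probabilities = []
        for (width,) in rows:
            p = 1.0 / (1.0 + math.exp(-(4.0 * width - 6.5)))
            probabilities.append([1.0 - p, p])
        return probabilities


def predict_with(model, input_json):
    # Same contract as the slide's predict(): JSON string in, probability string out.
    input_dict = json.loads(input_json)
    return str(model.predict_proba([[input_dict['petal_width']]])[0][1])


narrow = predict_with(StubModel(), json.dumps({'petal_width': 0.2}))
wide = predict_with(StubModel(), json.dumps({'petal_width': 2.3}))
print(f'narrow={narrow} wide={wide}')
```

The input is a JSON object with a single `petal_width` field, and the output is the positive-class probability serialized as a string, so a caller has to convert it back with `float()`.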
23.
import sys
import time

import model_store
import numpy as np
from model_store import ModelStore
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

_CANDIDATE_MODEL_STORE = ModelStore(model_store.CANDIDATE)
_PRODUCTION_MODEL_STORE = ModelStore(model_store.PRODUCTION)


def train_model():
    iris = datasets.load_iris()
    X = iris["data"][:, 3:]  # petal width
    # 1 if Iris Virginica, else 0 (np.int was removed from NumPy; use int).
    y = (iris["target"] == 2).astype(int)

    model = LogisticRegression()
    model.fit(X, y)

    # Use the current Unix timestamp as a monotonically increasing version.
    version = round(time.time())
    print(f'Saving model with version {version} to candidate model store.')
    _CANDIDATE_MODEL_STORE.save_model(model, version)


def validate_model():
    model, version = _CANDIDATE_MODEL_STORE.load_latest_model()
    print(f'Validating model with version {version} from candidate model store.')
    # Smoke test: the model must return an array for a single sample.
    if not isinstance(model.predict([[1]]), np.ndarray):
        raise ValueError('Invalid model')

    # Promote the validated candidate to production.
    print(f'Deploying model with version {version} to production model store.')
    _PRODUCTION_MODEL_STORE.save_model(model, version)


if __name__ == '__main__':
    cmd = sys.argv[1]
    if cmd == 'train':
        train_model()
    elif cmd == 'validate':
        validate_model()
    else:
        raise ValueError(f"Unknown command {cmd}")
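The slides import a `model_store` module but do not show it. The following is a minimal file-backed sketch of what such a module could look like, inferred only from the calls used above (`ModelStore(name)`, `save_model(model, version)`, `load_latest_model()`). The constant values, the `root` parameter, and the pickle-per-version layout are assumptions for illustration, not the actual Liminal implementation.

```python
import pickle
import tempfile
from pathlib import Path

# Assumed values; the slides only show that these constants exist.
PRODUCTION = 'production'
CANDIDATE = 'candidate'


class ModelStore:
    """Hypothetical file-backed stand-in for model_store.ModelStore."""

    def __init__(self, name, root=None):
        # root is an assumed parameter: the directory that holds all stores.
        base = Path(root) if root else Path(tempfile.gettempdir()) / 'model_store'
        self._dir = base / name
        self._dir.mkdir(parents=True, exist_ok=True)

    def save_model(self, model, version):
        # One pickle file per version, named after the integer version.
        with open(self._dir / f'{version}.pkl', 'wb') as f:
            pickle.dump(model, f)

    def load_latest_model(self):
        # The highest version number is treated as the latest model.
        versions = sorted(int(p.stem) for p in self._dir.glob('*.pkl'))
        if not versions:
            raise FileNotFoundError(f'No models saved in {self._dir}')
        latest = versions[-1]
        with open(self._dir / f'{latest}.pkl', 'rb') as f:
            return pickle.load(f), latest
```

Keeping the candidate and production stores as separate directories mirrors the train, validate, deploy flow on the slide: training writes only to the candidate store, and a model reaches production only after `validate_model` copies it across.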