At Opendoor, we do a lot of big data processing and use Spark and Dask clusters for the computations. Our machine learning platform is written in Dask, and we are actively moving data ingestion pipelines and geo computations to PySpark. The biggest challenge is that jobs vary in memory and CPU needs, and the load is not evenly distributed over time, which causes our workers and clusters to be over-provisioned. In addition, we need to let data scientists and engineers run their code without upgrading the cluster for every request and without dealing with dependency hell.
To solve these problems, we introduce a lightweight integration of popular tools: Kubernetes, Docker, Airflow, and Spark. Using a combination of these tools, we are able to spin up on-demand Spark and Dask clusters for our computing jobs, bring down cost with autoscaling and spot pricing, and unify DAGs across many teams with different stacks on a single Airflow instance, all at minimal cost.
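As a rough illustration of the idea (not Opendoor's actual code), here is a minimal Airflow DAG that delegates a Spark job to a Docker container on Kubernetes via the KubernetesPodOperator. The DAG name, namespace, image, and script path are hypothetical, and the import path is the Airflow 1.x contrib location.

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(
    "spark_ingest_example",              # hypothetical DAG name
    schedule_interval="@daily",
    start_date=datetime(2018, 1, 1),
)

run_spark_job = KubernetesPodOperator(
    task_id="run_spark_job",
    name="run-spark-job",
    namespace="data-processing",         # assumed namespace
    image="mycompany/spark-job:latest",  # each team ships its own image
    cmds=["spark-submit"],
    arguments=["--master", "local[*]", "/app/ingest.py"],  # hypothetical job
    get_logs=True,                       # stream container logs back to Airflow
    dag=dag,
)

Because the job runs in its own container, each team can pin whatever library versions it needs without touching the Airflow workers or the shared cluster image.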
3. Intro
Bogdan & Gustavo
● Data & ML engineers at Opendoor
● Building ETL and ML infrastructure
● Prior experience building serving and data infrastructure at Google & Airbnb
Curious to learn more about us?
● https://medium.com/opendoor-labs
● https://blog.opendoor.com
6. Stack
● Cluster
○ Docker / Kubernetes
○ Datadog
○ Scalyr and Stackdriver for logs
● ETL
○ Parquet on S3
○ Spark and Dask as compute engines
○ Airflow / Luigi for ETL
● Databases
○ Postgres for serving data
○ BigQuery for analytics
7. Kubernetes
Kubernetes is an open-source platform designed to automate deploying, scaling, and operating application containers.
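For a concrete feel of "operating application containers", a few lines with the official kubernetes Python client (the namespace here is illustrative):

from kubernetes import client, config

# Load credentials from ~/.kube/config; inside a cluster, use
# config.load_incluster_config() instead.
config.load_kube_config()
v1 = client.CoreV1Api()

# List the application containers (pods) Kubernetes is operating for us.
for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.status.phase)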
8. Legacy architecture
● Multiple schedulers (1 per team)
○ Lack of visibility into dependencies
○ Consistency issues
● Configuration is tightly coupled with code
○ Dependencies are not clear
○ Configuration deploys restart all running pods
● Statically allocated cluster resources
○ Operationally expensive
○ Always overprovisioned
11. Single Airflow with monorepo for DAGs
Pros:
● Easy to set up
● Works well when compute is delegated to other services (Hive,
Presto, Spark, etc.)
Cons:
● Different teams need different library versions
● One team becomes a bottleneck for trying out new libraries / tools
● Every new dependency / library upgrade requires an Airflow restart
● Using Airflow workers for computationally intensive jobs is considered
an antipattern (see the sketch below)
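A hedged sketch of that last point: the first task below runs a heavy pandas computation on the Airflow worker itself (the antipattern), while the second only submits the job to Spark, so the worker stays small. Paths, column names, and the connection ID are hypothetical.

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG("delegation_example", schedule_interval="@daily",
          start_date=datetime(2018, 1, 1))

def transform():
    import pandas as pd
    # Antipattern: the whole dataset is loaded into the Airflow worker's
    # memory, so workers must be provisioned for the peak job, not the mean.
    df = pd.read_parquet("s3://bucket/listings/")   # hypothetical path
    df.groupby("zip_code").price.mean().to_csv("/tmp/out.csv")

on_worker = PythonOperator(task_id="transform_on_worker",
                           python_callable=transform, dag=dag)

# Preferred: delegate the compute; the worker only submits and polls.
on_spark = SparkSubmitOperator(task_id="transform_on_spark",
                               application="/jobs/transform.py",  # hypothetical
                               conn_id="spark_default", dag=dag)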
12. EMR or Dataproc
Pros:
● Easy to set up and run
● No cluster to maintain
Cons:
● Infrequent Spark upgrades
● Installing native dependencies in Python may take hours
● Python packaging is not an easy problem
● Does not work with our secrets management
● Is not integrated with our logging
14. Project Goal
Build a company-wide data processing system with low maintenance cost
● Support data science and data engineering requirements
○ Spark and Dask clusters
○ Flexibility and freedom for teams to use different tools and libraries
● Efficient cluster utilization
○ No idle resources
○ Spot pricing
● Universal scheduling system
○ Central store for ETL configuration
○ Easy-to-use abstractions to schedule jobs (sketched below)
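One possible shape for such an abstraction (schema and values invented for illustration): teams register jobs in a central configuration, and a small loop turns each entry into a DAG, so scheduling a new job never requires touching Airflow internals.

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Hypothetical central ETL configuration; in practice this could live in
# a YAML file or a database rather than in code.
JOBS = [
    {"name": "ingest_listings", "image": "team-a/ingest:1.2", "schedule": "@hourly"},
    {"name": "train_model", "image": "team-b/ml:0.9", "schedule": "@daily"},
]

for job in JOBS:
    dag = DAG(job["name"], schedule_interval=job["schedule"],
              start_date=datetime(2018, 1, 1))
    KubernetesPodOperator(
        task_id=job["name"],
        name=job["name"].replace("_", "-"),
        namespace="data-processing",   # assumed namespace
        image=job["image"],            # each team brings its own image
        dag=dag,
    )
    globals()[job["name"]] = dag       # Airflow discovers DAGs via module globals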
26. Autoscaling
[Architecture diagram: the Cluster Autoscaler monitors for unscheduled (Pending) Honeycomb pods on the AWS Honeycomb instance group (taint: honeycomb) and spins up a new spot instance to host them; a separate AWS master k8s instance group runs the control plane.]
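To make the diagram concrete, here is a sketch using the kubernetes Python client; the taint key, node label, image, and resource requests are all assumptions. A pod like this tolerates the taint on the Honeycomb spot node group, and when no node has room for its resource requests it goes Pending, which is exactly the signal the Cluster Autoscaler uses to provision a new spot instance.

from kubernetes import client

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(generate_name="honeycomb-job-"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(
            name="job",
            image="mycompany/job:latest",          # assumed image
            # Resource requests are what make the pod unschedulable on a
            # full cluster, triggering the autoscaler to scale up.
            resources=client.V1ResourceRequirements(
                requests={"cpu": "2", "memory": "8Gi"}),
        )],
        # Tolerate the taint on the Honeycomb spot instance group so the
        # pod is allowed to land there (taint key/value are assumptions).
        tolerations=[client.V1Toleration(
            key="dedicated", value="honeycomb", effect="NoSchedule")],
        node_selector={"group": "honeycomb"},      # assumed node label
        restart_policy="Never",
    ),
)

The taint keeps ordinary workloads off the spot nodes, so only jobs that explicitly tolerate interruption run on the cheaper, preemptible capacity.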
29. Pros & Cons
Pros:
+ Empowered users via full stack freedom
+ Low maintenance for infrastructure teams
+ Visibility into company data processing
+ Cloud independence
+ Low cost
30. Pros & Cons
Cons:
- Requires a Kubernetes cluster
- Stateless Spark
- Autoscaling brings some complexity to the Kubernetes configuration
31. Current big data open-source tools allow us to build scalable data processing infrastructure at fairly low cost