At Opendoor, we do a lot of big data processing and use Spark and Dask clusters for the computations. Our machine learning platform is written in Dask, and we are actively moving data ingestion pipelines and geo computations to PySpark. The biggest challenge is that jobs vary in memory and CPU needs, and the load is not evenly distributed over time, which causes our workers and clusters to be over-provisioned. In addition, we need to let data scientists and engineers run their code without upgrading the cluster for every request and without dealing with dependency hell.
To solve these problems, we introduce a lightweight integration of popular tools: Kubernetes, Docker, Airflow, and Spark. Using a combination of these tools, we are able to spin up on-demand Spark and Dask clusters for our computing jobs, bring down cost with autoscaling and spot pricing, and unify DAGs across many teams with different stacks on a single Airflow instance, all at minimal cost.
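As a rough illustration of the idea (not Opendoor's actual code), here is a minimal Airflow DAG that delegates a Spark job to a Docker container on Kubernetes via the KubernetesPodOperator. The DAG name, namespace, image, and script path are hypothetical, and the import path is the Airflow 1.x contrib location.

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(
    "spark_ingest_example",              # hypothetical DAG name
    schedule_interval="@daily",
    start_date=datetime(2018, 1, 1),
)

run_spark_job = KubernetesPodOperator(
    task_id="run_spark_job",
    name="run-spark-job",
    namespace="data-processing",         # assumed namespace
    image="mycompany/spark-job:latest",  # each team ships its own image
    cmds=["spark-submit"],
    arguments=["--master", "local[*]", "/app/ingest.py"],  # hypothetical job
    get_logs=True,                       # stream container logs back to Airflow
    dag=dag,
)

Because the job runs in its own container, each team can pin whatever library versions it needs without touching the Airflow workers or the shared cluster image.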
3. Intro
Bogdan & Gustavo
● Data & ML engineers at Opendoor
● Building ETL and ML infrastructure
● Prior experience building serving and data infrastructure at Google & Airbnb
Curious to learn more about us?
● https://medium.com/opendoor-labs
● https://blog.opendoor.com
6. Stack
● Cluster
○ Docker / Kubernetes
○ Datadog
○ Scalyr and Stackdriver for logs
● ETL
○ Parquet on S3
○ Spark and Dask as compute engines
○ Airflow / Luigi for ETL
● Databases
○ Postgres for serving data
○ BigQuery for analytics
7. Kubernetes
Kubernetes is an open-source platform designed to automate deploying, scaling, and operating application containers.
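For a concrete feel of "operating application containers", a few lines with the official kubernetes Python client (the namespace here is illustrative):

from kubernetes import client, config

# Load credentials from ~/.kube/config; inside a cluster, use
# config.load_incluster_config() instead.
config.load_kube_config()
v1 = client.CoreV1Api()

# List the application containers (pods) Kubernetes is operating for us.
for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.status.phase)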
8. Legacy architecture
● Multiple schedulers (1 per team)
○ Lack of visibility into dependencies
○ Consistency issues
● Configuration is tightly coupled with code
○ Dependencies are not clear
○ Configuration deploys restart all running pods
● Statically allocated cluster resources
○ Operationally expensive
○ Always overprovisioned
11. Single Airflow with monorepo for DAGs
Pros:
● Easy to set up
● Works well when compute is delegated to other services (Hive,
Presto, Spark, etc.)
Cons:
● Different teams need different library versions
● One team becomes a bottleneck for trying out new libraries / tools
● Every new dependency / library upgrade requires an Airflow restart
● Using Airflow workers for computationally intensive jobs is considered
an antipattern (see the sketch below)
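A hedged sketch of that last point: the first task below runs a heavy pandas computation on the Airflow worker itself (the antipattern), while the second only submits the job to Spark, so the worker stays small. Paths, column names, and the connection ID are hypothetical.

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG("delegation_example", schedule_interval="@daily",
          start_date=datetime(2018, 1, 1))

def transform():
    import pandas as pd
    # Antipattern: the whole dataset is loaded into the Airflow worker's
    # memory, so workers must be provisioned for the peak job, not the mean.
    df = pd.read_parquet("s3://bucket/listings/")   # hypothetical path
    df.groupby("zip_code").price.mean().to_csv("/tmp/out.csv")

on_worker = PythonOperator(task_id="transform_on_worker",
                           python_callable=transform, dag=dag)

# Preferred: delegate the compute; the worker only submits and polls.
on_spark = SparkSubmitOperator(task_id="transform_on_spark",
                               application="/jobs/transform.py",  # hypothetical
                               conn_id="spark_default", dag=dag)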
12. EMR or Dataproc
Pros:
● Easy to set up and run
● No cluster to maintain
Cons:
● Infrequent Spark upgrades
● Installing native dependencies in Python may take hours
● Python packaging is not an easy problem
● Does not work with our secrets management
● Is not integrated with our logging
14. Project Goal
Build a company-wide data processing system with low maintenance cost
● Support data science and data engineering requirements
○ Spark and Dask clusters
○ Flexibility and freedom for teams to use different tools and libraries
● Efficient cluster utilization
○ No idle resources
○ Spot pricing
● Universal scheduling system
○ Central store for ETL configuration
○ Easy-to-use abstractions to schedule jobs (sketched below)
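One possible shape for such an abstraction (schema and values invented for illustration): teams register jobs in a central configuration, and a small loop turns each entry into a DAG, so scheduling a new job never requires touching Airflow internals.

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Hypothetical central ETL configuration; in practice this could live in
# a YAML file or a database rather than in code.
JOBS = [
    {"name": "ingest_listings", "image": "team-a/ingest:1.2", "schedule": "@hourly"},
    {"name": "train_model", "image": "team-b/ml:0.9", "schedule": "@daily"},
]

for job in JOBS:
    dag = DAG(job["name"], schedule_interval=job["schedule"],
              start_date=datetime(2018, 1, 1))
    KubernetesPodOperator(
        task_id=job["name"],
        name=job["name"].replace("_", "-"),
        namespace="data-processing",   # assumed namespace
        image=job["image"],            # each team brings its own image
        dag=dag,
    )
    globals()[job["name"]] = dag       # Airflow discovers DAGs via module globals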
26. Autoscaling
[Architecture diagram: the Cluster Autoscaler monitors for unscheduled (Pending) Honeycomb pods on the AWS Honeycomb instance group (taint: honeycomb) and spins up a new spot instance to host them; a separate AWS master k8s instance group runs the control plane.]
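To make the diagram concrete, here is a sketch using the kubernetes Python client; the taint key, node label, image, and resource requests are all assumptions. A pod like this tolerates the taint on the Honeycomb spot node group, and when no node has room for its resource requests it goes Pending, which is exactly the signal the Cluster Autoscaler uses to provision a new spot instance.

from kubernetes import client

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(generate_name="honeycomb-job-"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(
            name="job",
            image="mycompany/job:latest",          # assumed image
            # Resource requests are what make the pod unschedulable on a
            # full cluster, triggering the autoscaler to scale up.
            resources=client.V1ResourceRequirements(
                requests={"cpu": "2", "memory": "8Gi"}),
        )],
        # Tolerate the taint on the Honeycomb spot instance group so the
        # pod is allowed to land there (taint key/value are assumptions).
        tolerations=[client.V1Toleration(
            key="dedicated", value="honeycomb", effect="NoSchedule")],
        node_selector={"group": "honeycomb"},      # assumed node label
        restart_policy="Never",
    ),
)

The taint keeps ordinary workloads off the spot nodes, so only jobs that explicitly tolerate interruption run on the cheaper, preemptible capacity.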
29. Pros & Cons
Pros:
+ Empowered users via full stack freedom
+ Low maintenance for infrastructure teams
+ Visibility into company data processing
+ Cloud independence
+ Low cost
30. Pros & Cons
Cons:
- Requires a Kubernetes cluster
- Stateless Spark
- Autoscaling brings some complexity to the Kubernetes configuration
31. Current big data open-source tools allow us to build scalable data processing infrastructure at fairly low cost