Wait! Exclusive 60 day trial to the world's largest digital library.
The SlideShare family just got bigger. You now have unlimited* access to books, audiobooks, magazines, and more from Scribd.Cancel anytime.
At Nielsen Identity, we use Apache Spark to process 10’s of TBs of data, running on AWS EMR. We started at a point where Spark was not even supported out-of-the-box by EMR, and today we’re spinning-up clusters with 1000’s of nodes on a daily basis, orchestrated by Airflow. A few months ago, we embarked on a journey to evaluate the option of using Kubernetes as our Spark infrastructure, mainly to reduce operational costs and improve stability (as we heavily rely on Spot Instances for our clusters). To allow us to achieve those goals, we combined the open-sourced GCP Spark-on-K8s operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) with a native Airflow integration we developed and recently contributed back to the Airflow project (https://issues.apache.org/jira/browse/AIRFLOW-6542). Finally, we were able to migrate our existing Airflow DAGs, with minimal changes, from AWS EMR to K8s.