This document discusses lessons learned from building and operating massive (300TB+) Apache Spark pipelines in production. It covers why Spark was chosen for its performance, testability, and modularity benefits, and how to manage pipelines at that scale: automating operations, keeping interfaces simple, planning for growth by persisting intermediate data to HDFS, using efficient serialization and data structures, and testing on sampled data while accounting for data skew.
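As an illustration of the last point, the sketch below (a hypothetical example; the function name, dataset shape, and thresholds are assumptions, not from the document) shows why testing on a uniform random sample can still surface skew: a hot key's share of the records in the sample approximates its share in the full dataset, so a cheap per-key count on the sample can flag skewed keys before a full-scale run.

```python
import random
from collections import Counter

def top_key_share(keys):
    """Fraction of records belonging to the single most frequent key.
    (Hypothetical helper for illustration, not from the document.)"""
    counts = Counter(keys)
    return max(counts.values()) / len(keys)

random.seed(0)
# Skewed synthetic dataset: key 0 holds 90% of the 100,000 records.
full = [0] * 90_000 + list(range(1, 10_001))
# A 1% uniform random sample, standing in for a sampled test dataset.
sample = random.sample(full, 1_000)

# The hot key dominates the full data...
print(round(top_key_share(full), 2))  # → 0.9
# ...and its share in the sample is close to the same value,
# so the skew is visible even at test scale.
print(round(top_key_share(sample), 2))
```

In a real pipeline the same idea applies to the join or group-by keys of the sampled input: if one key's share is far above the rest, that stage is a likely skew hotspot at full scale.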