Frameworks and technologies in the Hadoop ecosystem are undergoing rapid innovation, but the open source tooling around usability has lagged behind. We will present a suite of tools, deployable on top of the Hadoop ecosystem, that enables even non-technical users to develop, tune, and maintain efficient Pig workflows and easily interact with and visualize datasets. Netflix's big data teams have worked for the past year implementing this framework in the AWS cloud. During that time, we have seen a massive influx of data and a corresponding increase in new development on our platform. This toolset has been a critical enabler in minimizing development time and effort. Using the development of a recommendation algorithm as an example, we'll walk through use cases for this stack of tools, showing how they interact to facilitate development. The presentation will include demos, implementation details, and our roadmap to open source various key services in the framework, including RESTful services that: provide comprehensive metadata management across data sources; enable visualization and caching of results of Hadoop jobs; visualize the execution plans produced by languages such as Pig and Hive; and provide detailed analytics on the currently executing workload and trends in historical performance.
7. Data Platform as a Service
Franklin (Metadata API)
Sting (Ad hoc Visualization)
Forklift (Data Movement)
Looper (Backloading)
Ignite (A/B Test Analytics)
Spock (Data Auditing)
Genie (Hadoop PaaS)
Lipstick (Pig Workflow Visualization)
Event Service (Orchestration)
Hadoop
S3
Other Processing
20. Whether your dataset is large or small, being
able to visualize it makes it easier to explain.
21. Data Platform as a Service
Franklin (Metadata API)
Sting (Ad hoc Visualization)
22. Sting
• Allows users to cache the results of a Genie job in memory
• Sub-second response to OLAP-style operations (slicing, dicing, aggregations)
• Runs ad hoc or on a recurring schedule
• Easy to use!
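To make the OLAP-style operations named above concrete, here is a minimal Python sketch of slicing, dicing, and aggregating a small result set held in memory. It is not Sting's API or implementation: the row layout, the column names (`country`, `device`, `plays`), and the helper functions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical cached job result: one dict per row (assumed layout).
CACHED_ROWS = [
    {"country": "US", "device": "tv", "plays": 120},
    {"country": "US", "device": "mobile", "plays": 80},
    {"country": "BR", "device": "tv", "plays": 40},
    {"country": "BR", "device": "mobile", "plays": 60},
]


def slice_rows(rows, column, value):
    """Slice: keep the rows where one dimension has a fixed value."""
    return [r for r in rows if r[column] == value]


def dice_rows(rows, **allowed):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [r for r in rows if all(r[c] in vals for c, vals in allowed.items())]


def aggregate(rows, group_by, measure):
    """Aggregate: sum a measure for each value of one dimension."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[group_by]] += r[measure]
    return dict(totals)


tv_only = slice_rows(CACHED_ROWS, "device", "tv")
us_subset = dice_rows(CACHED_ROWS, country={"US"}, device={"tv", "mobile"})
plays_by_country = aggregate(CACHED_ROWS, "country", "plays")
```

Because the rows already sit in memory, each operation is a single pass over a list, which is the general reason a cached result can answer this kind of query quickly.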
38. Lipstick
• Allows users to visualize their data flow
• Allows users to see common errors
• Allows users to easily monitor their jobs
• Empowers users to support themselves
• Facilitates communication between
infrastructure team and users
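As a rough illustration of what visualizing a data flow can involve, the sketch below turns a list of workflow steps into Graphviz DOT text that a graph renderer could draw. This is not how Lipstick works internally; the `to_dot` helper and the operator names in `FLOW` are hypothetical.

```python
def to_dot(edges):
    """Render (source, target) edges of a data flow as Graphviz DOT text."""
    lines = ["digraph dataflow {"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical operators of a small workflow, listed as directed edges.
FLOW = [("LOAD plays", "FILTER"), ("FILTER", "GROUP"), ("GROUP", "STORE")]
dot_text = to_dot(FLOW)
```

The resulting text can be saved to a file and rendered with any tool that reads DOT.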
53. Wrapping up
• Demos at the Netflix booth in the exhibit hall
(see more Lipstick, Sting, and Genie).
• Lipstick is part of Netflix OSS.
• Clone it on GitHub at
http://github.com/Netflix/Lipstick
• We welcome feedback and contributions!
54. Charles Smith: charsmith@netflix.com
Jeff Magnusson: jmagnusson@netflix.com
Thank you!
Jobs: http://jobs.netflix.com
Netflix OSS: http://netflix.github.io
Tech Blog: http://techblog.netflix.com/
Editor's Notes
The scripts can be very complicated, compiling to many map/reduce steps and performing complex data transformations along the way. We've been happy with our choice of Pig: it provides an abstraction for expressing complicated map/reduce logic easily, along with some facilities for code reuse (UDFs, macros). When workflows get sufficiently complicated, however, Pig is so abstract that it becomes hard to follow the data flow logic and imagine how it will translate to map/reduce.
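To give a feel for that translation, here is a small Python sketch of the map, shuffle, and reduce phases behind a single grouped count, roughly what a Pig GROUP followed by COUNT expresses in two statements. It is a teaching sketch, not the plan Pig actually generates; the record layout and function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records):
    """Map: emit (title, 1) for every (user, title) record."""
    for _user, title in records:
        yield title, 1


def shuffle(pairs):
    """Shuffle: sort by key and collect the values emitted for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """Reduce: sum the collected values for each key."""
    for key, values in grouped:
        yield key, sum(values)


# Hypothetical input: (user, title) viewing records.
RECORDS = [("u1", "A"), ("u2", "A"), ("u1", "B")]
counts = dict(reduce_phase(shuffle(map_phase(RECORDS))))
```

A real workflow chains many such stages, which is why the overall data flow becomes hard to picture from the script alone.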