A talk about Infrastructure as a Service (its challenges and advantages), Apache Hadoop as an ecosystem of tools for distributed storage and data processing, and Apache Whirr for deployment and simple management.
4. From 0 to working
pipelines ... in minutes!
* log analysis, ETL, crawling, machine learning etc.
5. Manage a 10+ node Hadoop cluster
with custom software as a ... cron job.
6. The plan
Infrastructure as a Service (context)
Apache Hadoop (data processing)
Apache Whirr (deployment)
Resources (food for thought)
Q/A (or asavu@apache.org)
17. “All problems in computer science
can be solved by another level of
indirection” David Wheeler
... except for the problem of too many levels of
indirection
20. Overview
• Java, C/C++
• set of distributed systems (hdfs, mr etc.)
• platform for distributed data processing
• simple programming model (map / reduce)
• can scale to 1000s of machines
• designed to be highly available at the application level
• https://hadoop.apache.org/
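To make the map / reduce programming model concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the in-memory `shuffle` step only stands in for the grouping by key that a framework would perform across machines between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["to be or not to be", "To do is to be"]))
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, both phases can be spread over many machines, which is what lets the model scale.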
32. Overview
Apache Whirr provides a set of libraries for
running cloud services:
* cloud neutral & based on jclouds
* has a common service API
* smart defaults
* available as a command line tool
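As an illustration of the "smart defaults" and command line points, a cluster is typically described to Whirr in a small properties file. The Python sketch below only renders such a file as text; the helper name `whirr_properties` and the chosen values (cluster name, worker count, `aws-ec2` provider, credentials read from environment variables) are assumptions for the example, not something shown in the talk.

```python
def whirr_properties(cluster_name, workers, provider="aws-ec2"):
    """Render the text of a Whirr properties file for a small Hadoop cluster.

    Hypothetical helper: one master node plus `workers` worker nodes.
    """
    # One master (namenode + jobtracker) and N workers (datanode + tasktracker).
    templates = (
        "1 hadoop-namenode+hadoop-jobtracker,"
        f"{workers} hadoop-datanode+hadoop-tasktracker"
    )
    settings = {
        "whirr.cluster-name": cluster_name,
        "whirr.instance-templates": templates,
        "whirr.provider": provider,
        # Assumption: credentials are taken from environment variables.
        "whirr.identity": "${env:AWS_ACCESS_KEY_ID}",
        "whirr.credential": "${env:AWS_SECRET_ACCESS_KEY}",
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    print(whirr_properties("demo-hadoop", workers=10), end="")
```

Saved to a file such as `hadoop.properties` (a name chosen here), the output would usually be handed to the command line tool with `whirr launch-cluster --config hadoop.properties`; everything not listed in the file falls back to Whirr's defaults.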
42. Fundamental Papers
• Google File System (2003)
http://research.google.com/archive/gfs.html
• Google MapReduce (2004)
http://research.google.com/archive/mapreduce.html
• Google BigTable (2006)
http://research.google.com/archive/bigtable.html
• Amazon Dynamo (2007)
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
43. Articles
• Getting Real About Distributed System Reliability:
http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability
• Towards a Topology of Failure:
http://steveloughran.blogspot.com/2011/11/towards-topology-of-failure.html
• Hadoop in Cloud Infrastructures:
http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
44. jclouds
“jclouds is an open source library that helps
you get started in the cloud and reuse your
java and clojure development skills”
http://www.jclouds.org/
45. RHadoop
A way of running R scripts on Hadoop
http://blog.revolutionanalytics.com/2012/03/r-and-hadoop-step-by-step-tutorials.html