
The Hadoop Ecosystem



Published in: Technology

  1. The Hadoop Ecosystem. J Singh, DataThinks.org, March 12, 2012
  2. The Hadoop Ecosystem
     • Introduction
       – What Hadoop is, and what it’s not
       – Origins and History
       – Hello Hadoop
     • The Hadoop Bestiary
     • The Hadoop Providers
     • Hosted Hadoop Frameworks
     © J Singh, 2011
  3. What Hadoop is, and what it’s not
     • A Framework for Map Reduce
     • A Top-level Apache Project
     • Hadoop is
       – A Framework, not a “solution”
         • Think Linux or J2EE
       – Scalable
       – Great for pipelining massive amounts of data to achieve the end result
       – Sometimes the only option
     • Hadoop is not
       – A painless replacement for SQL
       – Uniformly fast or efficient
       – Great for ad hoc Analysis
  4. You are ready for Hadoop when…
     • You no longer get enthused by the prospect of more data
       – Rate of data accumulation is increasing
       – The idea of moving data from hither to yon is positively scary
       – A hit man threatens to delete your data in the middle of the night
         • And you want to pay him to do it
     • Seriously, you are ready for Hadoop when analysis is the bottleneck
       – Could be because of data size
       – Could be because of the complexity of the data
       – Could be because of the level of analysis required
       – Could be because the analysis requirements are fluid
  5. MapReduce Conceptual Underpinnings
     • Based on Functional Programming model
       – From Lisp
         • (map square (1 2 3 4)) → (1 4 9 16)
         • (reduce plus (1 4 9 16)) → 30
       – From APL
         • +/ N, where N ← 1 2 3 4
     • Easy to distribute (based on each element of the vector)
     • New for Map/Reduce: Nice failure/retry semantics
       – Hundreds or thousands of low-end servers are running at the same time
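
The Lisp expressions on this slide map directly onto Python's built-in `map` and `functools.reduce`. The following is a minimal sketch of that idea only; the helper names `square` and `sum_of_squares` are illustrative and do not come from the deck.

```python
from functools import reduce
from operator import add


def square(x):
    """Element-wise function applied independently to each input."""
    return x * x


def sum_of_squares(values):
    """Map `square` over the values, then fold the results with addition."""
    squared = list(map(square, values))   # each element can be processed on its own
    total = reduce(add, squared, 0)       # pairwise combination into a single result
    return squared, total


squared, total = sum_of_squares([1, 2, 3, 4])
print(squared, total)
```

Because `square` touches one element at a time, the map step can be split across machines without coordination, which is the property the slide calls "easy to distribute".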
  6. MapReduce Flow: Word Count Example
     Lines        MapOut       Result
     foo bar      foo 1        foo 3
                  bar 1        labs 1
     quux foo     quux 1       quux 2
                  foo 1        bar 1
     foo labs     foo 1
                  labs 1
     quux         quux 1
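
The word count flow can be imitated in memory to see the three stages separately. This is a sketch under stated assumptions, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the four input lines are chosen to match the counts shown on the slide.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the list of ones collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


print(word_count(["foo bar", "quux foo", "foo labs", "quux"]))
```

In a real cluster the shuffle step is performed by the framework, so only the map and reduce functions are written by the user.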
  7. Hello Hadoop
     • Word Count
       – Example with Unstructured Data
       – Load 5 books from Gutenberg.org into /tmp/gutenberg
       – Load them into HDFS
       – Run Hadoop
         • Results are put into HDFS
       – Copy results into file system
       – What could be simpler?
       – DIY instructions for Amazon EC2 available on DataThinks.org blog
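
The deck does not say which word count program is run in this step. One common option is Hadoop Streaming, where the mapper and reducer are ordinary programs that exchange tab-separated lines over standard input and output. The sketch below assumes that approach; `mapper`, `reducer`, and `run_locally` are hypothetical names, and `run_locally` only imitates the sort that Hadoop performs between the two stages.

```python
from itertools import groupby


def mapper(lines):
    """Turn each text line into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts for consecutive records that share the same word.

    Assumes the input is sorted by key, as it is when it reaches a reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


for record in run_locally(["foo bar", "quux foo", "foo labs", "quux"]):
    print(record)
```

In an actual streaming job each function would be wired to `sys.stdin` in its own script, and testing with `run_locally` first is a cheap way to check the logic before loading books into HDFS.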
  8. The Hadoop Ecosystem
     • Introduction
     • The Hadoop Bestiary
       – Core: Hadoop Map Reduce and Hadoop Distributed File System
       – Data Access: HBase, Pig, Hive
       – Algorithms: Mahout
       – Data Import: Flume, Sqoop and Nutch
     • The Hadoop Providers
     • Hosted Hadoop Frameworks
  9. The Core: Hadoop and HDFS
     • Hadoop
       – One master, n slaves
       – Master
         • Schedules mappers & reducers
         • Connects pipeline stages
         • Handles failure semantics
     • Hadoop Distributed File System
       – Robust Data Storage across machines, insulating against failure
       – Keeps n copies of each file
         • Configurable number of copies
         • Distributes copies across racks and locations
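
To make "distributes copies across racks" concrete, here is a deliberately simplified placement sketch. It is not the HDFS placement policy: it just takes nodes round-robin across racks so that copies land on more than one rack whenever possible. The function name `place_replicas` and the rack and node names are assumptions made for illustration.

```python
def place_replicas(nodes_by_rack, replication=3):
    """Pick `replication` distinct nodes, cycling through racks in sorted order.

    nodes_by_rack maps a rack name to the list of node names in that rack.
    Returns a list of (rack, node) pairs in placement order.
    """
    racks = sorted(nodes_by_rack)
    queues = {rack: list(nodes_by_rack[rack]) for rack in racks}
    if replication > sum(len(q) for q in queues.values()):
        raise ValueError("not enough nodes for the requested number of copies")

    chosen = []
    while len(chosen) < replication:
        for rack in racks:
            # Take at most one node per rack per pass so copies spread out.
            if queues[rack] and len(chosen) < replication:
                chosen.append((rack, queues[rack].pop(0)))
    return chosen


print(place_replicas({"rack1": ["a", "b"], "rack2": ["c"]}, replication=3))
```

With copies on at least two racks, losing a single rack still leaves a readable copy, which is the failure insulation the slide refers to.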
  10. Hadoop Bestiary (p1a): HBase, Pig
      • Database Primitives
        – HBase
          • Wide column data structure built on HDFS
      • Processing
        – Pig
          • A high(-ish) level data-flow language and execution framework for parallel computation
          • Accesses HDFS and HBase
          • Batch as well as Interactive
          • Integrates UDFs written in Java, Python, JavaScript
          • Compiles to map & reduce functions, though not 100% efficiently
  11. In Pig (Latin)
      Users = load 'users' as (name, age);
      Filtered = filter Users by age >= 18 and age <= 25;
      Pages = load 'pages' as (user, url);
      Joined = join Filtered by name, Pages by user;
      Grouped = group Joined by url;
      Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
      Sorted = order Summed by clicks desc;
      Top5 = limit Sorted 5;
      store Top5 into 'top5sites';
      Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
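
For readers who do not know Pig Latin, the same data flow can be followed step by step in plain Python on small in-memory lists. This is only a sketch of what the script computes: `top_sites` is a hypothetical name, user names are assumed to be unique, and ties in click count are broken alphabetically by URL, which the Pig script does not specify.

```python
from collections import Counter


def top_sites(users, pages, limit=5):
    """users: list of (name, age); pages: list of (user, url)."""
    # Filtered = filter Users by age >= 18 and age <= 25
    filtered = {name for name, age in users if 18 <= age <= 25}
    # Joined = join Filtered by name, Pages by user
    joined = [(user, url) for user, url in pages if user in filtered]
    # Grouped / Summed = group by url and count the joined rows
    clicks = Counter(url for _, url in joined)
    # Sorted = order by clicks desc; Top5 = limit
    ranked = sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


users = [("ann", 20), ("bob", 30), ("cat", 18)]
pages = [("ann", "a.com"), ("bob", "a.com"), ("cat", "a.com"),
         ("cat", "b.com"), ("dan", "b.com")]
print(top_sites(users, pages))
```

The Python version holds everything in memory on one machine; the point of Pig is that the same nine statements run as map and reduce jobs over data too large for that.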
  12. Pig Translation into Map Reduce
      Pig Latin statement        Step               Job
      Users = load …             Load Users         Job 1
      Fltrd = filter …           Filter by age      Job 1
      Pages = load …             Load Pages         Job 1
      Joined = join …            Join on name       Job 1
      Grouped = group …          Group on url       Job 2
      Summed = … count()…        Count clicks       Job 2
      Sorted = order …           Order by clicks    Job 3
      Top5 = limit …             Take top 5         Job 3
      Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  13. Hadoop Bestiary (p1b): HBase, Hive
      • Database Primitives
        – HBase
          • Wide column data structure built on HDFS
      • Processing
        – Hive
          • Data Warehouse Infrastructure
          • QL, a subset of SQL that supports primitives supportable by Map Reduce
          • Support for custom mappers and reducers for more sophisticated analysis
          • Compiles to map & reduce functions, though not 100% efficiently
      • Hive Example
        CREATE TABLE page_view(viewTime INT, userid BIGINT,
            page_url STRING, referrer_url STRING,
            ip STRING COMMENT 'IP Address of the User')
        :: ::
        STORED AS SEQUENCEFILE;
  14. Hadoop Bestiary (p2): Mahout
      • Algorithms
        – Mahout
          • Scalable machine learning and data mining
          • Runs on top of Hadoop
          • Written in Java
          • In active development
            – Algorithms being added
      • Examples
        – Clustering Algorithms
          • Canopy Clustering
          • K-Means Clustering
          • …
        – Recommenders / Collaborative Filtering Algorithms
        – Other
          • Regression Algorithms
          • Neural Networks
          • Hidden Markov Models
  15. Hadoop Bestiary (p3): Data Import
      • Data Import Mechanisms
        – Sqoop: Structured Data
        – Flume: Streams
      • Data Import
        – Sqoop
          • Import from RDBMS to HDFS
          • Export too
        – Flume
          • Import streams
            – Text Files
            – System Logs
        – Nutch
          • Import from Web
          • Note: Nutch + Hadoop = Lucene
  16. Hadoop Bestiary (p4): Complete Picture
  17. The Hadoop Ecosystem
      • Introduction
      • The Hadoop Bestiary
      • The Hadoop Providers
        – Apache
        – Cloudera
        – Options when your data lives in a Database
      • Hosted Hadoop Frameworks
  18. Apache Distribution
      • The Definitive Repository
        – The hub for Code, Documentation, Tutorials
        – Many contributors, for example
          • Pig was a Yahoo! Contribution
          • Hive came from Facebook
          • Sqoop came from Cloudera
      • Bare metal install option:
        – Download to your machine(s) from Apache
        – Install and Operate
          • Modify to fit your business better
  19. Cloudera
      • Cloudera : Hadoop :: Red Hat : Linux
      • Cloudera’s Distribution Including Apache Hadoop (CDH)
        – A packaged set of Hadoop modules that work together
        – Now at CDH3
        – Largest contributor of code to Apache Hadoop
      • $76M in Venture funding so far
  20. When the data lives in a Database…
      • Objective: keeping Analytics and Data as close as possible
      • Options for RDBMS:
        – Sqoop data to/from HDFS
          • Need to move the data
          • Can utilize all parts of Hadoop
        – In-database analytics
          • Available for Teradata, Greenplum, etc.
          • If you have the need
            – And the $$$
      • Options for NoSQL Databases
        – Sqoop-like connectors
          • Need to move the data
        – Built-in Map Reduce available for most NoSQL databases
          • Knows about and tuned to the storage mechanism
          • But typically only offers map and reduce
            – No Pig, Hive, …
  21. The Hadoop Ecosystem
      • Introduction
      • The Hadoop Bestiary
      • The Hadoop Providers
      • Hadoop Platforms as a Service
        – Amazon Elastic MapReduce
        – Hadoop in Windows Azure
        – Google App Engine
        – Other
          • Infochimps
          • IBM SmartCloud
  22. Amazon Elastic Map Reduce (EMR)
      • Hosted Map Reduce
        – CLI on your laptop
          • Control over size of cluster
          • Automatic spin-up/down instances
        – Map & Reduce programs on S3
          • Pig, Hive or
          • Custom in Java, Ruby, Python, Perl, PHP, R, C++, Cascading
        – Data In/Out on S3 or
        – Data In/Out on DynamoDB
      • Keep in mind:
        – Hadoop on EC2 is also an option
  23. Hadoop in Windows Azure
      • Basic Level
        – Hive Add-in for Excel
        – Hive ODBC Driver
      • Hadoop-based Distribution for Windows Server and Azure
        – Strategic Partnership with Hortonworks
        – Windows-based CLI on your laptop
      • Broadest Level
        – JavaScript framework for Hadoop
        – Hadoop connectors for SQL Server and Parallel Data Warehouse
  24. Google App Engine MapReduce
      • Map Reduce as a Service
        – Distinct from Google’s internal Map Reduce
        – Part of Google App Engine
      • Works with Google Datastore
        – A Wide Column Store
      • A “purely programmatic” environment
        – Write Map and Reduce functions in Python / Java
  25. Map Reduce Use at Google
  26. Take Aways
      • There are many flavors of Hadoop.
        – The important part is Functional Programming and Map Reduce
        – Don’t let the proliferation of choices stump you.
        – Experiment with it!
  27. Thank you
      • J Singh
        – President, Early Stage IT
          • Technology Services and Strategy for Startups
      • DataThinks.org is a new service of Early Stage IT
        – “Big Data” analytics solutions
