Object Partners Inc.

Introduction to Hadoop

Presented by: Joel Crabb
Demo by: Nick Adelman

Agenda

• Terminology
• Why does Hadoop exist?
• HDFS and HBase
• Examples
• Getting Started
• Demo

Terminology

• Hadoop
    – Core set of technologies hosted by the Apache Software Foundation for storing and searching data sets in the terabyte and petabyte range
• HDFS
    – Hadoop Distributed File System, used as the basis for all Hadoop technologies
• HBase
    – Distributed map-based database that uses HDFS as its underlying data store
• MapReduce
    – A framework for programming distributed parallel processing algorithms

Terminology

• Distributed Computing
    – A computing paradigm that parallelizes computations over multiple compute nodes to decrease overall processing time
• NoSQL
    – Data storage paradigm that does not use a relational database as the backend data store
• Big Data
    – Generic term used when working with large data sets
• Name Node
    – Server that knows the location of all files in the cluster

Enterprise Architecture 101

[Diagram: MapReduce in the center, with Data, HDFS, HBase, and RDBMS on each side]

The New System Constraint

• Hard disk seek time is the new constraint when working with a petabyte data set
    – Spread the seek time among multiple servers
    – Isolate the data to a single read per disk
    – Faster to read too much data sequentially on disk and discard the excess

• Working under this paradigm requires new tools
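A rough back-of-the-envelope sketch of the "read too much and discard" point. The disk figures (10 ms per seek, 100 MB/s sequential transfer) and the workload sizes are illustrative assumptions, not numbers from the slides.

```python
# Illustrative disk figures (assumptions, not from the slides).
SEEK_TIME_S = 0.010          # average seek plus rotational latency per random read
TRANSFER_MB_PER_S = 100.0    # sustained sequential throughput


def random_read_time(num_records: int, record_kb: float) -> float:
    """Seconds to fetch records one by one, paying a seek for each."""
    transfer = num_records * record_kb / 1024 / TRANSFER_MB_PER_S
    return num_records * SEEK_TIME_S + transfer


def sequential_scan_time(total_mb: float) -> float:
    """Seconds to stream a whole region after a single seek."""
    return SEEK_TIME_S + total_mb / TRANSFER_MB_PER_S


# Fetch 100,000 records of 1 KB each, scattered across a 1 GB file ...
random_s = random_read_time(100_000, 1.0)
# ... versus reading the full 1 GB sequentially and discarding the rest.
scan_s = sequential_scan_time(1024.0)
print(f"random reads: {random_s:.0f} s, sequential scan: {scan_s:.0f} s")
```

Under these assumed figures the per-record seeks dominate: the random reads spend about 1,000 seconds seeking, while streaming the full gigabyte takes on the order of 10 seconds.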

New Tools: Why Does Hadoop Exist?

• In the early 2000s Google had problems:

• Problem 1: Store terabytes and petabytes of data
    – Inexpensive, reliable, accessible
• Answer: distributed file system

• Problem 2: Distributed computing is hard
• Answer: make distributed computing easier

• Problem 3: Data sets too large for an RDBMS
• Answer: make a new way to store application data

Google’s Solution: Tool 1

• Google File System (GFS)
    – A file system built specifically to manage large files and support distributed computing
• Inexpensive:
    – Store files distributed across a cluster of cheap servers
• Reliable:
    – Plan for server failure: if you have 1000 servers, expect one to fail every day
    – Always maintain three copies of each file (configurable)
• Accessible:
    – File chunk size is 64 MB = fewer file handles to manage
    – Master table keeps track of the locations of each file copy

Problem 1: Store terabytes and petabytes of data
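To make the chunk and replica figures concrete, here is a small arithmetic sketch. The 64 MB chunk size and three copies come from the slide; the 1 TB file and the 1,000-day mean time between failures per server are assumptions chosen for illustration (the latter is simply what "one failure per day across 1000 servers" implies).

```python
import math

CHUNK_MB = 64    # chunk size from the slide
REPLICAS = 3     # default number of copies from the slide


def chunk_count(file_mb: float) -> int:
    """Number of 64 MB chunks needed to hold a file."""
    return math.ceil(file_mb / CHUNK_MB)


def raw_storage_mb(file_mb: float) -> float:
    """Raw disk consumed across the cluster once every chunk is replicated."""
    return file_mb * REPLICAS


def expected_failures_per_day(servers: int, mtbf_days: float) -> float:
    """Average server failures per day, given a per-server mean time between failures."""
    return servers / mtbf_days


one_tb_mb = 1024 * 1024                          # assumed example file size
print(chunk_count(one_tb_mb))                    # 16384 chunks for the master to track
print(raw_storage_mb(one_tb_mb) / one_tb_mb)     # 3.0 TB of raw disk per TB stored
print(expected_failures_per_day(1000, 1000.0))   # 1.0 failure per day (assumed MTBF)
```

The large chunk size keeps the master's bookkeeping small: a 1 TB file is tracked as roughly sixteen thousand entries rather than millions of small blocks.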

Google’s Solution: Tool 2

• MapReduce – abstracts away the hard parts of distributed computing

• Programmers no longer need to manage:
    – Where is the data?
    – What piece of data am I working on?
    – How do I move data and result sets?
    – How do I combine results?
• Leverages GFS
    – Send processing to the data
    – Multiple file copies mean a higher chance of using more nodes for each process

Problem 2: Distributed computing is hard

Tool 2: MapReduce

• Distributed parallel processing framework
• Map – done N times on N servers
    – Perform an operation (search) on a chunk (GBs) of data
• Search 100 GB
    – Process Map on 25 servers with 4 GB of memory each
    – 100 GB processed in parallel, in memory
    – Create maps storing results (key-value pairs)
• Reduce
    – Take maps from N nodes
    – Merge (reduce) the maps into a single sorted map (result set)

Problem 2: Distributed computing is hard
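A single-process sketch of the two phases, using word counting as a stand-in for the slide's search operation. The function names are hypothetical, and this is not the Hadoop API: in a real cluster each map call would run on a different server next to its chunk, which the sketch only simulates with a list comprehension.

```python
from __future__ import annotations

from collections import Counter
from functools import reduce


def map_chunk(chunk: list[str]) -> Counter:
    """Map step: build a key-value map (word -> count) for one chunk of lines."""
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial maps by summing the counts per key."""
    return left + right


def map_reduce(chunks: list[list[str]]) -> dict[str, int]:
    # One partial map per chunk; on a cluster these would be built in parallel.
    partials = [map_chunk(chunk) for chunk in chunks]
    merged = reduce(merge, partials, Counter())
    # Return a single map sorted by key, as on the slide.
    return dict(sorted(merged.items()))


chunks = [["hadoop stores data", "data data"], ["Hadoop maps data"]]
print(map_reduce(chunks))
```

The programmer writes only `map_chunk` and `merge`; deciding where chunks live, moving partial results, and scheduling the work is what the framework takes over.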

Google’s Solution: Tool 3

• Bigtable: new paradigm in storing large data sets
    – “a sparse, distributed, persistent multi-dimensional sorted map”*
      *Bigtable: A Distributed Storage System for Structured Data
• Sparse: few entries in the map are populated
• Distributed: data spread across multiple logical machines in multiple copies
• Multi-dimensional: maps within maps organize and store data
• Sorted: sorted by lexicographic keys
    – Lexicographic = alphabetical, including numbers

Problem 3: Data sets too large for an RDBMS
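Lexicographic order compares keys character by character, which can surprise when keys embed numbers. A minimal sketch; the `row` prefix and the `padded_key` helper are hypothetical names used only for illustration.

```python
def padded_key(n: int, width: int = 4) -> str:
    """Zero-pad the numeric part so lexicographic order matches numeric order."""
    return f"row{n:0{width}d}"


plain = sorted(f"row{n}" for n in (10, 9, 2, 1))
padded = sorted(padded_key(n) for n in (10, 9, 2, 1))

print(plain)   # ['row1', 'row10', 'row2', 'row9']
print(padded)  # ['row0001', 'row0002', 'row0009', 'row0010']
```

Because the store keeps rows in this order, key design determines which rows end up next to each other on disk.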

Google’s Architecture

[Diagram: layered stack with GFS at the bottom, Bigtable above it, and MapReduce and direct access on top]

Hadoop – If Something Works…

• Hadoop was started to recreate these technologies in the open source community

    Google          Hadoop
    GFS             HDFS
    Bigtable        HBase
    MapReduce       MapReduce

A Little More on HDFS

• Plan for Failure
    – In a thousand-node cluster, machines will fail often
    – HDFS is built to detect failure and redistribute files
• Fast Data Access
    – Generally a batch processing system
• Large Files – typically GB to TB files
• Simple Coherency
    – Once a file is closed, it cannot be updated or appended
• Cloud Ready
    – Set up on Amazon EC2 / S3

Summarized from: http://hadoop.apache.org/common/docs/current/hdfs_design.html

A Little More on HBase

• Multi-dimensional map

    Map<byte[],
        Map<byte[],
            Map<byte[],
                Map<Long, byte[]>>>>

• First map: row key to column family
• Second map: column family to column label
• Third map: column label to timestamp
• Fourth map: timestamp to value

A column family is a grouping of columns, usually of the same data type.
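The four levels can be mimicked with nested dictionaries. This is only a sketch of the data model, not the HBase client API; `new_table`, `put`, and `get_latest` are hypothetical helper names, and the row and column values are made up.

```python
from collections import defaultdict


def new_table():
    """row key -> column family -> column label -> timestamp -> value."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(table, row, family, label, timestamp, value):
    """Store one version of a cell; older versions are kept alongside it."""
    table[row][family][label][timestamp] = value


def get_latest(table, row, family, label):
    """Return the value with the highest timestamp, or None if the cell is empty."""
    # .get() avoids creating empty rows as a side effect of reading.
    versions = table.get(row, {}).get(family, {}).get(label, {})
    if not versions:
        return None
    return versions[max(versions)]


table = new_table()
put(table, b"com.example/index", b"content", b"html", 1, b"<html>v1</html>")
put(table, b"com.example/index", b"content", b"html", 2, b"<html>v2</html>")
print(get_latest(table, b"com.example/index", b"content", b"html"))
```

Keys and values are `bytes` here to mirror the `byte[]` types in the map signature; the innermost key is an integer timestamp, matching `Long`.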

HBase Storage Model

HBase Access

• REST interface
    – http://wiki.apache.org/hadoop/Hbase/Stargate
• Groovy
    – http://wiki.apache.org/hadoop/Hbase/Groovy
• Scala
    – http://wiki.apache.org/hadoop/Hbase/Scala

Industry Examples

• Web/File Search (Yahoo!)
    – Yahoo! is the main sponsor of and contributor to Hadoop
    – Has over 25,000 servers running Hadoop
• Log aggregation (Amazon, Facebook, Baidu)
• RDBMS replacement (Google Analytics)
• Image store (Google Earth)
• Email store (Gmail)
• Natural language search (Microsoft)
• Many more…

* Information from http://wiki.apache.org/hadoop/PoweredBy

Use Case #1: Yahoo! Search

• Problem circa 2006
• Yahoo! search is seen as inferior to Google’s
• Google is better at:
    – Storing terabytes and petabytes of unstructured data
    – Searching the data set efficiently
    – Applying custom analytics to the data set
    – Presenting a more relevant result set

Use Case #1: Yahoo! Search

• Solution – emulate Google with Hadoop’s HDFS, Pig, and MapReduce
    – HDFS
        • Stores petabytes of web page data distributed over a cluster of compute nodes (1000s)
        • Runs on commodity hardware
        • Average server: 2 × 4 cores, 4–32 GB RAM *
    – Pig (Hadoop sub-project)
        • Analytics processing platform
    – MapReduce
        • Builds indexes from raw web data

* http://wiki.apache.org/hadoop/PoweredBy

Use Case #2: RDBMS Replacement

• Google Analytics circa 2006
• Problem
    – Store terabytes of analytics data about website usage
    – GBs of data added per hour
    – Data added in small increments
    – Access and display data in < 3 seconds per request

Use Case #2: RDBMS Replacement

• Solution – Bigtable and MapReduce on GFS
• Bigtable sits on top of GFS and ingests small increments of data
• In 2006, the GA cluster supported ~220 TB*
• Raw Click Table (200 TB)
    – Rows keyed by WebsiteName + Session Time
    – All data for a website stored consecutively on disk
• Summary Table (20 TB)
    – MapReduce of the Raw Click Table for customer web views

Pattern: Collect data in one Bigtable instance, then MapReduce it into a view Bigtable instance

*Bigtable: A Distributed Storage System for Structured Data
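The row-key design above is what makes a site's data contiguous. A sketch of the idea over a plain sorted list; the `#` separator, the zero-padded epoch timestamp, and the function names are assumptions for illustration, not the actual Google Analytics schema.

```python
import bisect


def row_key(website: str, session_time: int) -> str:
    """Composite key: site name, then zero-padded session start (epoch seconds)."""
    return f"{website}#{session_time:010d}"


def scan_site(sorted_keys: list, website: str) -> list:
    """Return every key for one site with a prefix range scan over sorted keys."""
    start = bisect.bisect_left(sorted_keys, f"{website}#")
    # '$' is the character immediately after '#', so it bounds the prefix range.
    end = bisect.bisect_left(sorted_keys, f"{website}$")
    return sorted_keys[start:end]


keys = sorted(
    row_key(site, ts)
    for site, ts in [
        ("b.example", 200),
        ("a.example", 300),
        ("a.example", 100),
        ("c.example", 50),
    ]
)
print(scan_site(keys, "a.example"))
# ['a.example#0000000100', 'a.example#0000000300']
```

Because the site name leads the key, all sessions for one site sort together, and a report for that site becomes one sequential range read instead of many scattered lookups.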

Can You Use Hadoop?

• IF…
    – You have a large amount of data (terabytes+)
    – You can separate your data-collection store from your online or analytics data store
    – You can order your data lexicographically
    – You can run analytics as batches
    – You cannot afford a large enough RDBMS
    – You need dynamic column additions
    – You need near-linear performance as the data set grows

Other Hadoop Technologies

• Hive – SQL-like query language for using Hadoop like a data warehouse
• Pig – parallel data analysis framework
• ZooKeeper – distributed application coordination framework
• Chukwa – data collection system for distributed computing
• Avro – data serialization framework

New Skills for IT

• Learning to restructure data
• Learning to write MapReduce programs
• Learning to maintain a Hadoop cluster
• Forgetting RDBMS/SQL-dominated design principles

It takes a new style of creativity to both structure data in Hadoop and write useful MapReduce programs.

Getting Started

• You can install a test system on a single Unix box
• A full system needs a minimum of 3 servers
    – 10 to 20 servers is a small cluster
• Expect to spend a day to a week getting a multi-node cluster configured
• A book like Pro Hadoop by Jason Venner may save you time, but it is based on the 0.19 Hadoop release (currently at 0.20)

Optional Quickstart

• Cloudera has a preconfigured single-node Hadoop instance available for download at:
  http://www.cloudera.com/hadoop-training-virtual-mach
• Yahoo! has a Hadoop distribution as well at:
  http://developer.yahoo.com/hadoop/distribution/

Alternatives to HBase

• Project Voldemort
    – http://project-voldemort.com/
    – Used by LinkedIn
• Hypertable
    – http://www.hypertable.org/
    – Used by Baidu (search leader in China)
• Cassandra
    – http://cassandra.apache.org/
    – Apache-sponsored distributed database
    – Used by Facebook

Helpful Information

• http://hadoop.apache.org
• http://hbase.apache.org
• http://wiki.apache.org/hadoop/HadoopPresentations
• http://labs.google.com/papers/bigtable.html
• http://labs.google.com/papers/gfs.html
• http://labs.google.com/papers/mapreduce.html
• Twitter: @hbase
• Two articles on MapReduce in the 01/2010 Communications of the ACM

Demo
