hadoopsphere.com

Components that constitute the open source Apache Hadoop ecosystem

Summary and categorization of components available as Apache (ASF) projects/sub-projects and serving the Hadoop ecosystem. The document does not include other open source or commercial projects/products.


Contributed by: Sachin Ghai | @sachinghai

‘Atmospheric’ Layers
  Application Domains         Distribution, Financial, Government, Heavy Industry, Internet, Oil & Energy, Research, Telecom
  Discovery & Visualization   Lucene, Blur, Giraph
  Analytics & Intelligence    Mahout, Drill
  Data Interactions           Pig, Hive, HCatalog, Tez, Gora
  Hardware (& Appliances)     Commodity H/w
  Distribution                Apache

‘Core’ Layers
  Secure                      Knox
  Manage                      Oozie, ZooKeeper, Crunch, MRUnit, HDT, Ambari, Vaidya, Bigtop, Whirr
  Run                         MapReduce, YARN, Hama
  Persist                     HDFS, HBase, Cassandra, Accumulo, Avro, Trevni, Thrift
  Transfer                    Flume, Sqoop, Chukwa, Kafka
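
The layer categorization above can also be read as a simple lookup table. The following is a minimal sketch in Python, assuming only the layer and project names listed on this slide; the names `LAYERS` and `layer_of` are hypothetical and exist only for illustration.

```python
# Layer -> Apache projects, copied from the overview slide.
# Only layers that list named Apache projects are included.
LAYERS = {
    "Transfer": ["Flume", "Sqoop", "Chukwa", "Kafka"],
    "Persist": ["HDFS", "HBase", "Cassandra", "Accumulo", "Avro", "Trevni", "Thrift"],
    "Run": ["MapReduce", "YARN", "Hama"],
    "Manage": ["Oozie", "ZooKeeper", "Crunch", "MRUnit", "HDT",
               "Ambari", "Vaidya", "Bigtop", "Whirr"],
    "Secure": ["Knox"],
    "Data Interactions": ["Pig", "Hive", "HCatalog", "Tez", "Gora"],
    "Analytics & Intelligence": ["Mahout", "Drill"],
    "Discovery & Visualization": ["Lucene", "Blur", "Giraph"],
}


def layer_of(project):
    """Return the layer a project is listed under, or None if it is not listed.

    The comparison ignores case, so "zookeeper" and "ZooKeeper" both match.
    """
    wanted = project.lower()
    for layer, projects in LAYERS.items():
        if wanted in (p.lower() for p in projects):
            return layer
    return None


if __name__ == "__main__":
    for name in ("HBase", "Giraph", "Spark"):
        print(name, "->", layer_of(name))
```

A project that this document does not categorize, such as the "Spark" lookup in the usage lines, simply returns None.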

CORE LAYERS which constitute the Apache Hadoop ecosystem




PERSIST: File System & Data Store
• HDFS - Distributed file system that provides high-throughput access to application data. Comprises NameNode, Secondary NameNode and DataNodes
• HBase - Distributed, scalable big data store
• Cassandra - Highly scalable, eventually consistent, distributed, structured key-value store
• Accumulo - Sorted, distributed key/value data storage and retrieval system




PERSIST: Serialization
• Avro - Data serialization system
• Trevni - Column file format that permits compatible, independent implementations that read and/or write files in this format
• Thrift - Framework for scalable cross-language services development




RUN: Job Execution
• MapReduce - Framework for performing distributed data processing. Comprises JobTracker, TaskTracker and JobHistoryServer
• YARN - Framework that facilitates writing arbitrary distributed processing frameworks and applications
• Hama - Pure BSP (Bulk Synchronous Parallel) computing framework for massive scientific computations such as matrix, graph and network algorithms
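
To make the MapReduce entry concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, written in plain Python. It does not use the Hadoop API and runs no JobTracker or TaskTracker; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["hadoop stores data", "hadoop processes data"]
    print(word_count(docs))
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network; the sketch only shows how the three phases fit together.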




MANAGE: Work
• Oozie - Workflow/coordination system to manage Hadoop jobs
• ZooKeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services




MANAGE: Dev
• Crunch - Framework for writing, testing, and running MapReduce pipelines
• MRUnit - Java library that helps developers unit test Apache Hadoop MapReduce jobs
• HDT - Hadoop Development Tools (HDT) comprise Eclipse-based tools for developing applications on the Hadoop platform




MANAGE: Ops
• Ambari - Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
• Vaidya - Performance diagnostic tool for MapReduce jobs
• Bigtop - Project for the development of packaging and tests, and for ensuring interoperability among Apache Hadoop related projects
• Whirr - Set of libraries for running cloud services, such as Hadoop clusters on EC2



SECURE:
• Knox - System that provides a single point of secure access for Apache Hadoop clusters




TRANSFER:
• Flume - Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
• Sqoop - Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
• Chukwa - Open source data collection system for monitoring large distributed systems
• Kafka - Distributed publish-subscribe messaging system
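
The publish-subscribe pattern mentioned in the Kafka entry can be sketched with a toy in-memory broker. This is an illustration of the pattern only, under the assumption that each subscriber tracks its own read position per topic; it is not Kafka's API, it is not distributed, and the class and method names (`Broker`, `publish`, `poll`) are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: one append-only message list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> list of messages
        self._offsets = defaultdict(int)   # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to a topic; the publisher does not know the subscribers."""
        self._topics[topic].append(message)

    def poll(self, subscriber, topic):
        """Return the messages this subscriber has not yet read on the topic."""
        start = self._offsets[(subscriber, topic)]
        messages = self._topics[topic][start:]
        self._offsets[(subscriber, topic)] = start + len(messages)
        return messages


if __name__ == "__main__":
    broker = Broker()
    broker.publish("logs", "disk full on node-1")
    print(broker.poll("alerts", "logs"))    # first subscriber reads the message
    broker.publish("logs", "node-1 recovered")
    print(broker.poll("alerts", "logs"))    # only the new message
    print(broker.poll("archive", "logs"))   # a second subscriber reads both
```

Because publishers and subscribers share only the topic name, either side can be added or removed without changing the other.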



ATMOSPHERIC LAYERS which build up the capabilities beyond the core of Apache Hadoop ecosystem

HARDWARE:
• Commodity Hardware - Low-cost, easily available hardware working in parallel




Note: no appliances are known to run on a pure Apache Hadoop distribution; SSD and other cheap hardware options are not mentioned separately here

DATA INTERACTIONS:
• Pig - Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
• Hive - Data warehouse system that facilitates easy data summarization, ad-hoc queries and analysis of large datasets stored in Hadoop compatible file systems



DATA INTERACTIONS:
• HCatalog - Table and storage management service for data created using Apache Hadoop
• Tez - Generic application framework which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN
• Gora - Framework for in-memory data model and persistence with MapReduce support




ANALYTICS & INTELLIGENCE:
• Mahout - Scalable machine learning and data mining algorithm library. Supports recommendation mining, clustering, classification and frequent itemset mining
• Drill - Distributed system for interactive analysis of large-scale datasets. Comprises a user interface (CLI, REST), a pluggable query language and pluggable data sources


DISCOVERY & VISUALIZATION:
• Lucene - Open-source search software, including the Java-based indexing and search component Lucene Core and the high-performance search server component Solr
• Blur - Search engine capable of querying massive amounts of structured data at incredible speeds in a cloud computing environment




DISCOVERY & VISUALIZATION:
• Giraph - Graph-processing framework leveraging existing Hadoop infrastructure. Follows bulk synchronous parallel model to run large scale algorithms. Supports directed, undirected, weighted, unweighted and multigraphs
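
The bulk synchronous parallel model named above proceeds in supersteps: each vertex processes the messages sent to it in the previous superstep, optionally updates its value, and sends new messages that become visible only after a barrier. The following is a minimal single-process sketch of that idea, using the common textbook example of propagating the largest value through a graph. It is not Giraph's or Hama's API; the function name `bsp_max_value` and its arguments are hypothetical, and it assumes every vertex appears as a key in `neighbors`.

```python
def bsp_max_value(neighbors, values):
    """Propagate the largest vertex value through the graph in supersteps.

    neighbors: dict mapping each vertex to a list of adjacent vertices
    values:    dict mapping each vertex to its initial number
    Returns a dict with the final value of every vertex.
    """
    values = dict(values)  # work on a copy, leave the caller's dict untouched

    # First superstep: every vertex sends its value to all its neighbors.
    inbox = {v: [] for v in neighbors}
    for vertex, adjacent in neighbors.items():
        for n in adjacent:
            inbox[n].append(values[vertex])

    # Later supersteps: a vertex sends messages only if its value changed.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for n in neighbors[vertex]:
                    outbox[n].append(values[vertex])
        inbox = outbox  # barrier: new messages are read in the next superstep

    return values


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(bsp_max_value(graph, {"a": 3, "b": 6, "c": 2}))
```

The loop stops once a superstep produces no messages, which is the point at which every vertex in a connected component holds that component's maximum.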




Note: no pure visualization projects are currently part of the ASF

APPLICATION DOMAINS:
• Distribution - Includes applications in travel, transport, FMCG and supply chain, e.g. Expedia
• Financial - Includes applications in financial services, banking and insurance, e.g. Visa
• Government - Includes applications in government and public sector, e.g. Aadhaar (India ID card)
• Heavy Industry - Includes applications in heavy industrial business including electronics, auto and aircraft, e.g. Hitachi

APPLICATION DOMAINS:
• Internet - Includes new-age internet applications such as social media and content distribution, e.g. Facebook
• Oil & Energy - Includes applications in the upstream/downstream oil and gas business along with those in the energy sector, e.g. Chevron
• Research - Includes applications in new research, e.g. network analysis & security
• Telecom - Includes applications in the telecom business, e.g. Korea Telecom



Reference :
• www.apache.org
• http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop

Image courtesy:
• Slide 1 : Getty Images #84480368 Dorling Kindersley
  (free thumbnail copy)
• Other images: Original source could not be established







About the document :
• Voluntarily contributed by: Sachin Ghai (@sachinghai)
• Publisher : hadoopsphere.com
• Version : 1.0
• Date : 11 March 2013
• Copyright: 2013, All Rights Reserved
• Note: The document does not use official lingo in part
• Contact : Use ‘Contact’ menu option on
  www.hadoopsphere.com
• Disclaimer: The project names mentioned in this document
  are either registered trademarks or trademarks of the Apache
  Software Foundation in the United States. The Apache
  Software Foundation has no affiliation with and does not
  endorse or review the materials provided in this document.





Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Featured (20)

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 

Apache Hadoop ecosystem - March 2013

  • 1. hadoopsphere.com: Components that constitute the open source Apache Hadoop ecosystem. Summary and categorization of components available as Apache (ASF) projects/sub-projects and serving the Hadoop ecosystem. The document does not include other open source or commercial projects/products. Contributed by: Sachin Ghai | @sachinghai
  • 2. ‘Atmospheric’ Layers:
      • Application Domains: Distribution, Financial, Government, Heavy Industry, Internet, Oil & Energy, Research, Telecom
      • Discovery & Visualization: Lucene, Blur, Giraph
      • Analytics & Intelligence: Mahout, Drill
      • Data Interactions: Pig, Hive, HCatalog, Tez, Gora
      • Hardware (& Appliances): Commodity H/w
    Distribution: Apache
    ‘Core’ Layers:
      • Secure: Knox
      • Manage: Oozie, Zookeeper, Crunch, MRUnit, HDT, Ambari, Vaidya, BigTop, Whirr
      • Run: MapReduce, YARN, Hama
      • Persist: HDFS, HBase, Cassandra, Accumulo, Avro, Trevni, Thrift
      • Transfer: Flume, Sqoop, Chukwa, Kafka
    Contributed by: Sachin Ghai | @sachinghai
  • 3. CORE LAYERS which constitute the Apache Hadoop ecosystem
  • 4. PERSIST: File System & Data Store
      • HDFS - Distributed file system that provides high-throughput access. Comprises NameNode, Secondary NameNode and DataNodes
      • HBase - Distributed, scalable, big data store
      • Cassandra - Highly scalable, eventually consistent, distributed, structured key-value store
      • Accumulo - Sorted, distributed key/value data storage and retrieval system
  • 5. PERSIST: Serialization
      • Avro - Data serialization system
      • Trevni - A column file format to permit compatible, independent implementations that read and/or write files in this format
      • Thrift - Framework for scalable cross-language services development
  • 6. RUN: Job Execution
      • MapReduce - Framework for performing distributed data processing. Comprises JobTracker, TaskTracker and JobHistoryServer
      • YARN - Framework that facilitates writing arbitrary distributed processing frameworks and applications
      • Hama - Pure BSP (Bulk Synchronous Parallel) computing framework for massive scientific computations such as matrix, graph and network algorithms
  • 7. MANAGE: Work
      • Oozie - Workflow/coordination system to manage Hadoop jobs
      • Zookeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • 8. MANAGE: Dev
      • Crunch - Framework for writing, testing, and running MapReduce pipelines
      • MRUnit - Java library that helps developers unit test Apache Hadoop MapReduce jobs
      • HDT - Hadoop Development Tools (HDT) comprise Eclipse-based tools for developing applications on the Hadoop platform
  • 9. MANAGE: Ops
      • Ambari - Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
      • Vaidya - Performance diagnostic tool for MapReduce jobs
      • BigTop - Project for developing packaging and tests, and for ensuring interoperability among Apache Hadoop-related projects
      • Whirr - Set of libraries for running cloud services, such as Hadoop clusters on EC2
  • 10. SECURE:
      • Knox - System that provides a single point of secure access for Apache Hadoop clusters
  • 11. TRANSFER:
      • Flume - Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
      • Sqoop - Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
      • Chukwa - Open source data collection system for monitoring large distributed systems
      • Kafka - Distributed publish-subscribe messaging system
  • 12. ATMOSPHERIC LAYERS which build up the capabilities beyond the core of the Apache Hadoop ecosystem
  • 13. HARDWARE:
      • Commodity Hardware - Low-cost, easily available hardware working in parallel
    Note: no appliances are known to run on the pure Apache Hadoop distribution; SSD and other cheap hardware options are not mentioned separately here
  • 14. DATA INTERACTIONS:
      • Pig - Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
      • Hive - Data warehouse system that facilitates easy data summarization, ad-hoc queries and analysis of large datasets stored in Hadoop-compatible file systems
  • 15. DATA INTERACTIONS:
      • HCatalog - Table and storage management service for data created using Apache Hadoop
      • Tez - Generic application framework which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN
      • Gora - Framework for in-memory data model and persistence with MapReduce support
  • 16. ANALYTICS & INTELLIGENCE:
      • Mahout - Scalable machine learning and data mining algorithm library. Supports recommendation mining, clustering, classification and frequent itemset mining
      • Drill - Distributed system for interactive analysis of large-scale datasets. Comprises a user interface (CLI, REST), a pluggable query language and pluggable data sources
  • 17. DISCOVERY & VISUALIZATION:
      • Lucene - Open-source search software, including the Java-based indexing and search component Lucene Core and the high-performance search server component Solr
      • Blur - Search engine capable of querying massive amounts of structured data at high speed in a cloud computing environment
  • 18. DISCOVERY & VISUALIZATION:
      • Giraph - Graph-processing framework leveraging existing Hadoop infrastructure. Follows the bulk synchronous parallel model to run large-scale algorithms. Supports directed, undirected, weighted, unweighted and multigraphs
    Note: no pure visualization projects are currently part of ASF
  • 19. APPLICATION DOMAINS:
      • Distribution - Includes applications in travel, transport, FMCG, supply chain, e.g. Expedia
      • Financial - Includes applications in financial, banking, insurance, e.g. Visa
      • Government - Includes applications in government and public sector, e.g. Aadhaar (India ID card)
      • Heavy Industry - Includes applications in heavy industrial business including electronics, auto, aircraft, e.g. Hitachi
  • 20. APPLICATION DOMAINS:
      • Internet - Includes new age internet applications including social media, content distribution, e.g. Facebook
      • Oil & Energy - Includes applications in upstream/downstream oil, gas business along with those in the Energy sector, e.g. Chevron
      • Research - Includes applications in new research, e.g. network analysis & security
      • Telecom - Includes applications in Telecom business, e.g. Korea Telecom
  • 21. Reference:
      • www.apache.org
      • http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop
    Image courtesy:
      • Slide 1: Getty Images #84480368 Dorling Kindersley (free thumbnail copy)
      • Other images: Original source could not be established
  • 22. About the document:
      • Voluntarily contributed by: Sachin Ghai (@sachinghai)
      • Publisher: hadoopsphere.com
      • Version: 1.0
      • Date: 11 March 2013
      • Copyright: 2013, All Rights Reserved
      • Note: The document does not use official terminology in places
      • Contact: Use the ‘Contact’ menu option on www.hadoopsphere.com
      • Disclaimer: The project names mentioned in this document are either registered trademarks or trademarks of the Apache Software Foundation in the United States. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided in this document.
  • 23. Subscribe to hadoopsphere.com:
      • Newsletter on e-mail subscription
      • RSS Feed for posts
      • Follow on Twitter
      • Like on Facebook
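
The layer categorization in this deck can also be read as a small lookup table. The following is only a minimal sketch, assuming the component lists from the per-layer slides (4 to 11 and 14 to 18); the names `ECOSYSTEM`, `find_layer` and `count_components` are hypothetical and do not come from the original document, and the Hardware, Distribution and Application Domains layers are left out because they list no Apache components.

```python
# Hypothetical lookup table built from the per-layer slides of the deck.
# Variable and function names are illustrative, not part of the source.
ECOSYSTEM = {
    "core": {
        "Persist": ["HDFS", "HBase", "Cassandra", "Accumulo",
                    "Avro", "Trevni", "Thrift"],
        "Run": ["MapReduce", "YARN", "Hama"],
        "Manage": ["Oozie", "Zookeeper", "Crunch", "MRUnit", "HDT",
                   "Ambari", "Vaidya", "BigTop", "Whirr"],
        "Secure": ["Knox"],
        "Transfer": ["Flume", "Sqoop", "Chukwa", "Kafka"],
    },
    "atmospheric": {
        "Data Interactions": ["Pig", "Hive", "HCatalog", "Tez", "Gora"],
        "Analytics & Intelligence": ["Mahout", "Drill"],
        "Discovery & Visualization": ["Lucene", "Blur", "Giraph"],
    },
}


def find_layer(component):
    """Return (group, layer) for a component name, or None if not listed."""
    wanted = component.lower()
    for group, layers in ECOSYSTEM.items():
        for layer, components in layers.items():
            if wanted in (name.lower() for name in components):
                return group, layer
    return None


def count_components():
    """Count every component listed across all layers in the table."""
    return sum(len(names)
               for layers in ECOSYSTEM.values()
               for names in layers.values())


if __name__ == "__main__":
    print(find_layer("Kafka"))
    print(find_layer("hive"))
    print(count_components())
```

The lookup is case-insensitive so that `find_layer("hive")` and `find_layer("Hive")` behave the same; a project that the deck does not list simply returns `None`.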