Introduction to Hadoop
Agenda
1. Introduction to Hadoop
• What is Big Data and why Hadoop?
• Big Data characteristics and challenges
• Comparison between Hadoop and RDBMS
• Hadoop history and origin
• Hadoop ecosystem overview
• Anatomy of a Hadoop cluster
• Hands-on exercise: installing the Cloudera Hadoop VM
Big Data
Think at scale
Data is measured in terabytes (TB) and even petabytes (PB)
• Facebook has 400 TB of stored data and ingests 20 TB of new data
per day. It hosts approx. 10 billion photos, 5 PB (2011), and is
growing by 4 TB per day
• NYSE generates 1 TB of data per day
• The Internet Archive stores around 2 PB of data and is growing at a rate
of 20 TB per month
• A flood of data is coming from many sources
  – Social network profiles, activity, logging and tracking
  – Public web information
  – Data warehouse appliances
  – Internet Archive store, etc. [Q: How big is big?]
Big Data: what is it and what does it mean?
Data value
• Velocity
  Data flows continuously and is time sensitive (streaming flow)
  Batch, real time, streams, historic
• Variety
  Big Data extends beyond structured data to include semi-structured
  and unstructured data of all varieties: text, logs, XML, audio,
  video, streams, flat files, etc.
  Structured, semi-structured, unstructured
• Veracity
  Quality, consistency, reliability and provenance of data
  Good, bad, undefined, inconsistent, incomplete
• Volume
  Big Data comes in at large scale: terabytes and even petabytes
  Records, transactions, tables, files
Big Data: what is it and what does it mean? (continued)
Social media and websites
IT: services, software and hardware services and support
Finance: better and deeper understanding of risk to avoid credit
crises
Telecommunication: more reliable networks, where faults can be
predicted and prevented
Media: more content that is lined up with your personal
preferences
Life science: better-targeted medicine with fewer complications
and side effects
Retail: a personal experience, with products and offers that are
just what you need
Google, Yahoo and others need to index the entire internet and
return search results in milliseconds
Business drivers and scenarios for large data
Challenges in Big Data Storage and Analysis
Slow to process, can't scale
• Disk seek for every access
• Buffered reads and locality help, but each disk page still requires a seek
• The bottleneck is access speed, not storage capacity
Challenges in both storing and analyzing datasets
Scaling is expensive
• Hard drive throughput limits processing
  IDE drive: 75 MB/s, 10 ms seek
  SATA drive: 300 MB/s, 8.5 ms seek
  SSD: 800 MB/s, 2 ms "seek"
  On top of this come analysis, computation, aggregation and processing
  delays, etc.
• Unreliable machines: risk
  1 machine: mean time between failures of 3 years
  1000 machines: mean time between failures of about 1 day
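The failure figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes machines fail independently and at a constant rate, so a cluster of N machines sees a failure N times as often as a single machine; the function name is illustrative.

```python
def cluster_mtbf_days(machine_mtbf_days: float, machines: int) -> float:
    """Mean time between failures for the whole cluster, in days.

    Assumes independent machines with a constant failure rate, so the
    cluster-wide failure rate is the sum of the per-machine rates.
    """
    return machine_mtbf_days / machines


three_years = 3 * 365  # per-machine MTBF from the slide, in days

print(cluster_mtbf_days(three_years, 1))     # 1095.0
print(cluster_mtbf_days(three_years, 1000))  # 1.095
```

Under these assumptions a 1000-machine cluster should expect roughly one machine failure per day, which is why the design treats failure as a normal event.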
Reliability
• Partial failure: graceful degradation rather than a full halt
• Data recoverability: if a node fails, another picks up its workload
• Node recoverability: a repaired node can rejoin the group without a full
group restart
• Scalability: adding resources adds load capacity
Backup
• Not affordable; expensive (faster and more reliable means more cost)
Easy to use and secure
Process data in parallel
Challenges in Big Data Storage and Analysis (continued)
• An idea: parallelism
  – Transfer speed improves at a greater rate than seek speed
  – Read and write in parallel rather than sequentially
    1 drive: 75 MB/s, 16 days for 100 TB
    1000 drives: 75 GB/s, 22 minutes for 100 TB
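The scan times above follow directly from dividing the data size by the aggregate transfer rate. The sketch below assumes decimal units (1 TB = 10^12 bytes), data spread evenly over the drives, and no seek or coordination overhead; the function name is illustrative. With these assumptions a single drive needs about 15.4 days, close to the 16 days quoted above.

```python
def read_time_seconds(data_bytes: float, drives: int, mb_per_sec_per_drive: float) -> float:
    """Time to scan data_bytes spread evenly over `drives` disks read in parallel."""
    aggregate_bytes_per_sec = drives * mb_per_sec_per_drive * 1e6
    return data_bytes / aggregate_bytes_per_sec


TB = 1e12  # decimal terabyte (an assumption; binary units give slightly larger times)

one_drive = read_time_seconds(100 * TB, drives=1, mb_per_sec_per_drive=75)
thousand_drives = read_time_seconds(100 * TB, drives=1000, mb_per_sec_per_drive=75)

print(f"1 drive:     {one_drive / 86400:.1f} days")         # 15.4 days
print(f"1000 drives: {thousand_drives / 60:.1f} minutes")   # 22.2 minutes
```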
• A problem: parallelism is hard
  – Synchronization
  – Deadlock
  – Limited bandwidth
  – Timing issues and coordination
  – Split and aggregation
• Computers are complicated
  – Drive failure
  – Data availability
  – Coordination
Hey! We have distributed computing!
Process data in parallel? Not so simple.
Yes, we have distributed computing, and it comes with its own
challenges:
Resource sharing: access any data and use CPU resources across the
system
Portability and reliability
Concurrency: allow concurrent access to and update of shared resources,
with high availability and throughput
Scalability: with data and with load
Fault tolerance: through provisions for redundancy and recovery
Heterogeneity: different operating systems, different hardware
Transparency: the system should appear as a whole instead of a
collection of computers
Hide these details and complexities from the user, and provide a common,
unified interface to interact with the system.
Hadoop comes in to address most (but not all) of these challenges.
Common Challenges in Distributed Computing
Apache Hadoop is a framework that allows for the distributed
processing of large data sets across clusters of commodity
computers using a simple programming model. It is designed
to scale up from single servers to thousands of machines,
each providing computation and storage.
Hadoop is an open-source implementation of Google's
MapReduce and GFS (a distributed file system).
Hadoop was created by Doug Cutting, the creator of Apache
Lucene, the widely used text search library.
Hadoop fulfills the need for a common infrastructure
– Efficient, reliable, easy to use
– Open source, Apache License
Hadoop origins
The name 'Hadoop'?
Store and process large amounts of data (petabytes)
Performance, storage and processing scale linearly
Compute should move to data
Simple core, modular and extensible
Failure is normal and expected
Manageable and self-healing
Designed to run on commodity hardware (cost effective)
Hadoop Design Axioms
For storage and distributed computing (MapReduce)
Split up the data
Process data in parallel
Sort and combine to get the answer
Schedule, process and aggregate independently
Failures are independent; handle failures
Handle fault tolerance
Solution: Hadoop achieves complete parallelism
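The split, process, sort and combine steps listed above can be sketched in plain Python. This is an in-memory illustration of the pattern using word counting, not Hadoop's API; all function names are hypothetical, and the map tasks run one after another here although nothing prevents them from running in parallel.

```python
from collections import defaultdict
from typing import Iterable


def split_input(lines: list[str], num_splits: int) -> list[list[str]]:
    """Split the input into roughly equal chunks, one per map task."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_task(chunk: Iterable[str]) -> list[tuple[str, int]]:
    """Emit a (word, 1) key/value pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped: Iterable[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Group all values by key and order the keys (the sort step)."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key: str, values: list[int]) -> tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: list[str], num_splits: int = 2) -> dict[str, int]:
    chunks = split_input(lines, num_splits)
    mapped = [map_task(chunk) for chunk in chunks]  # tasks are independent of each other
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


print(word_count(["big data big cluster", "compute moves to data"]))
# {'big': 2, 'cluster': 1, 'compute': 1, 'data': 2, 'moves': 1, 'to': 1}
```

Because each map task only sees its own chunk and each reduce call only sees one key, a failed task can be rerun on its own without touching the others.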
2002-2004: Doug Cutting and Mike Cafarella start
working on Nutch
2003-2004: Google publishes the GFS and MapReduce papers
2004: Doug Cutting adds DFS and MapReduce support to
Nutch
Yahoo! hires Cutting and builds a team to develop Hadoop
2007: The New York Times converts 4 TB of archives using
Hadoop on 100 EC2 machines
Web-scale deployments at Yahoo!, Facebook and Twitter
May 2009: Yahoo! does the fastest sort of 1 TB, in 62 seconds over
1460 nodes
Yahoo! sorts 1 PB in 16.25 hours over 3658 nodes
Hadoop History
An elephant can't jump, but it can carry a heavy load!
• A fundamental tenet of relational databases is that structure is defined
by a schema. Large data sets, however, are often unstructured or
semi-structured, and for these Hadoop is a better fit. The Hadoop
MapReduce framework uses key/value pairs as its basic data unit, which
is flexible enough to work with less-structured data types.
• Scaling commercial relational databases is expensive and limited.
• SQL is a high-level declarative language backed by a black-box query
engine: you query data by stating the result you want and let the
database engine figure out how to derive it. With MapReduce you specify
the actual processing steps, so you can build complex statistical
models from your data, produce analytical reports or reformat your
image data, tasks that SQL is not well designed for. MapReduce also
tries to collocate the data with the compute node, so data access is
fast because it is local.
• Coordinating the processes in a large-scale distributed computation
is a challenge. HDFS and MapReduce make it easy to split, store,
process and aggregate.
Hadoop vs. RDBMS
• To run a bigger database you need to buy a bigger machine, and high-end
machines are not cost effective for many applications. For example, a
machine with four times the power of a standard PC costs a lot more than
putting four such PCs in a cluster. Hadoop is designed as a scale-out
architecture operating on a cluster of commodity hardware: adding more
resources means adding more machines to the Hadoop cluster.
  Effective cost per user TB: $250/TB
  Other solutions (RDBMS) cost in the range of $100 to $100K per user TB
• The hardest aspect is gracefully handling partial failure (when you don't
know whether a remote process has failed or not) while still making
progress with the overall computation. MapReduce spares the programmer
from having to think about failure, since the implementation detects
failed map or reduce tasks and reschedules replacements. MapReduce is
able to do this because it is a shared-nothing architecture, meaning that
tasks have no dependence on one another.
Hadoop vs. RDBMS (continued)
Hardware failure: as soon as you start using many pieces of hardware, the
chance that one will fail is fairly high. A common way of avoiding data loss is
through replication: redundant copies of the data are kept by the system so
that in the event of failure, there is another copy available. The Hadoop
framework handles node failure and disk failure efficiently.
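To make the replication idea concrete, the sketch below places each block on several distinct nodes and checks what survives a node failure. It assumes a replication factor of 3 and a simple round-robin placement; real HDFS placement also takes racks and load into account, and the function names are hypothetical.

```python
def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Illustrative placement only, not the policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_replicas(placement: dict[int, list[str]], failed: set[str]) -> dict[int, list[str]]:
    """Return, per block, the replicas that still sit on live nodes."""
    return {
        block: [node for node in holders if node not in failed]
        for block, holders in placement.items()
    }


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(num_blocks=6, nodes=nodes)
after_failure = surviving_replicas(placement, failed={"node2"})

# Every block keeps at least two copies after one node is lost.
print(all(len(replicas) >= 2 for replicas in after_failure.values()))  # True
```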
Hadoop vs. RDBMS (continued)
Is Hadoop an alternative to an RDBMS?
Hadoop is not replacing the traditional data systems used for building
analytic applications (RDBMS, EDW and MPP systems) but rather
complements them.
• Interoperates with existing systems and tools; at the moment Apache
Hadoop is not a substitute for a database
• No relations; key/value pairs
• Big Data: unstructured (text) and semi-structured (sequence / binary files)
• Structured (HBase, modeled on Google Bigtable)
• Works well together with an RDBMS: Hadoop is being used to distill
large quantities of data into something more manageable
Hadoop vs. RDBMS (continued)
Hadoop is designed and built on two independent
frameworks: Hadoop = HDFS + MapReduce
HDFS (storage and file system): HDFS is a reliable
distributed file system that provides high-throughput
access to data
MapReduce (processing): MapReduce is a
framework for performing high-performance distributed
data processing using the divide-and-aggregate
programming paradigm
Hadoop has a master/slave architecture for both
storage and processing.
Hadoop Architecture
Hadoop Master and Slave Architecture
Hadoop storage and file system: Hadoop Distributed File System (HDFS)
The components (daemons) of HDFS are:
• NameNode is the master of the system. It maintains the namespace
(directories and files) and manages the blocks that are present on the DataNodes.
• DataNodes are the slaves, which are deployed on each machine and provide the
actual storage. They are responsible for serving read and write requests from
clients.
• Secondary NameNode is responsible for performing periodic checkpoints, so that in
the event of NameNode failure you can restart the NameNode using the checkpoint.
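The division of labour between the NameNode and the DataNodes can be illustrated with a toy model. This is a sketch under strong simplifying assumptions, not HDFS code: the classes, the tiny block size and the single replica per block are all made up for illustration, and here the NameNode object moves the bytes itself, whereas in HDFS clients exchange block data with DataNodes directly.

```python
class DataNode:
    """Slave: stores block contents and serves reads and writes."""

    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[str, bytes] = {}


class NameNode:
    """Master: keeps the metadata (file -> block IDs -> DataNode)."""

    def __init__(self, datanodes: list[DataNode], block_size: int = 4):
        self.datanodes = datanodes
        self.block_size = block_size  # tiny on purpose so the example splits visibly
        self.files: dict[str, list[str]] = {}
        self.locations: dict[str, DataNode] = {}

    def write(self, path: str, data: bytes) -> None:
        """Split data into blocks and spread them over the DataNodes."""
        block_ids = []
        for offset in range(0, len(data), self.block_size):
            block_id = f"{path}#{offset // self.block_size}"
            node = self.datanodes[len(self.locations) % len(self.datanodes)]
            node.blocks[block_id] = data[offset:offset + self.block_size]
            self.locations[block_id] = node
            block_ids.append(block_id)
        self.files[path] = block_ids

    def read(self, path: str) -> bytes:
        """Look up the block list, then fetch each block from its DataNode."""
        return b"".join(self.locations[b].blocks[b] for b in self.files[path])


cluster = NameNode([DataNode("dn1"), DataNode("dn2")])
cluster.write("/logs/a.txt", b"hello hadoop")
print(cluster.files["/logs/a.txt"])
print(cluster.read("/logs/a.txt"))  # b'hello hadoop'
```

The point of the split is that the master holds only small metadata, so storage capacity grows by adding DataNodes.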
[Figure: HDFS master/slave layout. Master: NameNode, with a Secondary NameNode taking periodic checkpoints. Slaves: DataNodes.]
Hadoop Master and Slave Architecture (continued)
Parallel and distributed computation: the MapReduce paradigm
The components (daemons) of MapReduce are:
• JobTracker is the master of the system. It manages the jobs and
resources in the cluster (the TaskTrackers). The JobTracker tries to schedule
each map task as close as possible to the data being processed, i.e. on the
TaskTracker running on the same DataNode as the underlying block.
• TaskTrackers are the slaves, which are deployed on each machine. They are
responsible for running the map and reduce tasks as instructed by the
JobTracker.
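The data-locality preference described above can be sketched as a small scheduling function. This is a hypothetical greedy policy for illustration, not the JobTracker's actual algorithm; the block IDs, node names and slot counts are invented.

```python
def schedule_map_tasks(block_locations: dict[str, list[str]],
                       free_slots: dict[str, int]) -> dict[str, str]:
    """Assign each input block to a TaskTracker, preferring one that holds the block.

    block_locations: block ID -> nodes holding a replica of that block.
    free_slots: node -> number of map slots still available.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, holders in block_locations.items():
        # First choice: a node that stores the block and has a free slot.
        local = [node for node in holders if slots.get(node, 0) > 0]
        # Fallback: any node with a free slot (the data must then cross the network).
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free map slots")
        node = max(candidates, key=lambda n: slots[n])
        slots[node] -= 1
        assignment[block] = node
    return assignment


blocks = {"b1": ["tt1"], "b2": ["tt1", "tt2"], "b3": ["tt3"]}
print(schedule_map_tasks(blocks, {"tt1": 1, "tt2": 1, "tt3": 0, "tt4": 1}))
# {'b1': 'tt1', 'b2': 'tt2', 'b3': 'tt4'}
```

In this run b1 and b2 are processed where their data lives, while b3 falls back to a remote node because its only holder has no free slot.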
[Figure: MapReduce master/slave layout. Master: JobTracker. Slaves: TaskTrackers.]
Core Hadoop Ecosystem
Hadoop Ecosystem Development
Hadoop Ecosystem Now
The Hadoop ecosystem has grown over the last few years, and it comes with a lot of
jargon for its tools and frameworks. Hadoop has become the kernel.
Hadoop can be configured in three modes:
Standalone: no daemons run; everything runs inside a single Java
process and uses the local file system for storage. Standalone mode is
helpful for debugging Hadoop applications.
Pseudo-distributed: each Hadoop daemon runs in a different
JVM, as a separate process, but all processes run on a
single machine.
Fully distributed: Hadoop's real strengths (parallel processing,
scalability, independent task execution, replication
management, workflow management, fault tolerance and data
consistency) lie in fully distributed mode. This mode is highly
effective: it lets multiple machines contribute processing
power and storage to the cluster.
Hadoop Environment
Distributed framework for processing and storing data, generally on commodity hardware.
Completely open source and written in Java
Store anything
• Unstructured or semi-structured data
Storage capacity
• Scales linearly; cost is not exponential
Data locality and processing your way
• Code moves to data
• In MapReduce you specify the actual steps for processing the data and deriving the output
• Streaming access: process data in any language
Failure and fault tolerance
• Detects failures and heals itself
• Reliable: data is replicated and failed tasks are rerun, with no need to maintain backups of the data
Cost effective: Hadoop is designed as a scale-out architecture operating on a cluster of
commodity PC machines
The Hadoop framework transparently provides applications with reliability, adaptation
and data motion, and allows customization
Primarily used for batch processing, not real-time/transactional user applications
And many more...
Hadoop Features and Summary
Hadoop: Open Source at Apache
No strategic agenda
• Quality is emergent; continuously evolving
Community based and strong
• Diverse organizations collaborating voluntarily
• Decisions by consensus
• Transparent
Allows competing projects
• Survival of the fittest
A loose federation of projects
• Permits evolution
Insures against vendor lock-in
• Can't buy Apache
Who wrote Hadoop? It's the community
Contributors and Development
Lifetime patches contributed for all Hadoop-related projects: community members by
current employer
* Source: JIRA tickets
What is Hadoop used for?
• Search
– Yahoo, Amazon, Zvents
• Log processing
– Facebook, Yahoo, ContextWeb, Joost, Last.fm
• Recommendation systems
– Facebook
• Data warehouse
– Facebook, AOL
• Video and image analysis
– New York Times, Eyealike
... in almost every domain!
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• Powerset
• Veoh
• Yahoo!
• Twitter
• LinkedIn
... the list is too long to finish now
References
Hadoop: The Definitive Guide, Third Edition by Tom White.
http://hadoop.apache.org/
http://www.cloudera.com/
[Figure: Core Hadoop ecosystem. Components: ZooKeeper (coordination), MapReduce (job scheduling and execution), HBase (key-value store), Hive (warehouse), Pig (data flow), Avro (serialization), Sqoop, Flume, MR streaming. Connected systems: ETL tools, application DB, transactional DB, REST API.]

Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Hadoop Introduction: Why and What is Hadoop?

  • 2. Agenda: 1. Introduction to Hadoop: What is Big Data and why Hadoop?; Big Data characteristics and challenges; Comparison between Hadoop and RDBMS; Hadoop history and origin; Hadoop ecosystem overview; Anatomy of a Hadoop cluster; Hands-on exercise: installing the Cloudera Hadoop VM
  • 3. Big Data: Think at Scale. Data is measured in TB and even PB. • Facebook has 400 terabytes of stored data and ingests 20 terabytes of new data per day; it hosts approx. 10 billion photos, 5 PB (2011), growing by 4 TB per day. • The NYSE generates 1 TB of data per day. • The Internet Archive stores around 2 PB of data and is growing at a rate of 20 TB per month. The flood of data is coming from many sources: social network profiles, activity, logging and tracking; public web information; data warehouse appliances; Internet Archive stores, etc. [Q: How big is big?]
  • 4. Big Data: what is it, and what does it mean? Data value. Velocity: data flows continuously and is time-sensitive (batch, real time, streams, historic). Variety: Big Data extends beyond structured data to semi-structured and unstructured data of all kinds: text, logs, XML, audio, video, streams, flat files, etc. (structured, semi-structured, unstructured). Veracity: quality, consistency, reliability and provenance of data (good, bad, undefined, inconsistent, incomplete). Volume: Big Data arrives at large scale, in TB and even PB (records, transactions, tables, files).
  • 5. Big Data: what is it, and what does it mean? (continued)
  • 6. Business drivers and scenarios for large data. Social media and websites. IT: services, software and hardware services and support. Finance: better and deeper understanding of risk to avoid credit crises. Telecommunications: more reliable networks where we can predict and prevent failures. Media: more content that is lined up with your personal preferences. Life sciences: better targeted medicine with fewer complications and side effects. Retail: a personal experience with products and offers that are just what you need. Google, Yahoo and others need to index the entire internet and return search results in milliseconds.
  • 7. Challenges in Big Data storage and analysis. Slow to process, can't scale: a disk seek for every access; buffered reads and locality still mean seeking for every disk page. It is not storage capacity but access speed that is the bottleneck. It is a challenge both to store and to analyze datasets. Scaling is expensive. Hard drive throughput: IDE drive, 75 MB/s, 10 ms seek; SATA drive, 300 MB/s, 8.5 ms seek; SSD, 800 MB/s, 2 ms "seek". On top of this come analysis, computation, aggregation, processing delays, etc. Unreliable machines are a risk: 1 machine has a mean time between failures of about 3 years; 1,000 machines have a mean time between failures of about 1 day.
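The failure figures on slide 7 follow from a simple rule of thumb: if machines fail independently, the mean time between failures (MTBF) of a cluster is roughly one machine's MTBF divided by the number of machines. The sketch below is only an illustration under that independence assumption; the function name is hypothetical and not part of any Hadoop API.

```python
def cluster_mtbf_days(machine_mtbf_days: float, machines: int) -> float:
    """Approximate cluster MTBF, assuming independent and identical machines."""
    return machine_mtbf_days / machines


# One machine failing about once in 3 years, as stated on the slide.
single_machine_days = 3 * 365

one = cluster_mtbf_days(single_machine_days, 1)
thousand = cluster_mtbf_days(single_machine_days, 1000)
print(f"1 machine: a failure about every {one:.0f} days")
print(f"1000 machines: a failure about every {thousand:.2f} days")
```

With 1,000 machines the estimate comes out at roughly one failure per day, which is why the design has to treat failure as a normal event.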
  • 8. Challenges in Big Data storage and analysis (continued). Reliability: partial failure should cause graceful degradation rather than a full halt. Data recoverability: if a node fails, another picks up its workload. Node recoverability: a fixed node can rejoin the group without a full group restart. Scalability: adding resources adds load capacity. Backup: not affordable; faster and more reliable means more cost. Easy to use and secure. Process data in parallel.
  • 9. Process data in parallel? Not simple. An idea: parallelism. • Transfer speed improves at a greater rate than seek speed. • Read and write in parallel rather than sequentially: 1 drive at 75 MB/s takes about 16 days for 100 TB; 1,000 drives at 75 GB/s combined take about 22 minutes for 100 TB. A problem: parallelism is hard. • Synchronization • Deadlock • Limited bandwidth • Timing issues and coordination • Split and aggregation. Computers are complicated. • Drive failure • Data availability • Coordination. Hey! We have distributed computing!
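The transfer times quoted on slide 9 can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly even striping across drives; the names are illustrative only.

```python
TB = 10**12  # bytes, decimal units (assumption)
MB = 10**6


def transfer_seconds(total_bytes, drives, bytes_per_sec_per_drive=75 * MB):
    """Time to read total_bytes when the data is spread evenly over all drives."""
    return total_bytes / (drives * bytes_per_sec_per_drive)


one_drive = transfer_seconds(100 * TB, drives=1)
thousand_drives = transfer_seconds(100 * TB, drives=1000)

print(f"1 drive: {one_drive / 86400:.1f} days")               # 15.4 days
print(f"1000 drives: {thousand_drives / 60:.1f} minutes")     # 22.2 minutes
```

The single-drive figure comes out at about 15.4 days, which the slide rounds up to 16; the 1,000-drive figure matches the 22 minutes on the slide.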
  • 10. Common challenges in distributed computing. Yes, we have distributed computing, and it comes with its own challenges. Resource sharing: access any data and use CPU resources across the system. Portability and reliability. Concurrency: allow concurrent access to and update of shared resources, with availability and high throughput. Scalability: with data and with load. Fault tolerance: provisions for redundancy and recovery. Heterogeneity: different operating systems, different hardware. Transparency: the system should appear as a whole instead of as a collection of computers. Hide the details and complexity of meeting the above challenges from the user, and provide a common, unified interface to interact with the system. Hadoop came in to address most (but not all) of these challenges.
  • 11. Hadoop origins. Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage. Hadoop is an open-source implementation of Google's MapReduce and GFS (a distributed file system). Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop fulfills the need for a common infrastructure: efficient, reliable, easy to use; open source under the Apache License.
  • 13. Hadoop design axioms. Store and process large amounts of data (petabytes). Performance, storage and processing scale linearly. Compute should move to the data. Simple core, modular and extensible. Failure is normal and expected. Manageable and self-healing. Designed to run on commodity hardware, hence cost-effective.
  • 14. Solution: Hadoop achieves complete parallelism, for storage and for distributed computing (MapReduce). Split up the data. Process the data in parallel. Sort and combine to get the answer. Schedule, process and aggregate independently. Failures are independent; handle failures and provide fault tolerance.
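The steps on slide 14 (split the data, process the pieces in parallel, sort and combine) can be illustrated with a toy word count. This is plain Python, not the Hadoop API: every function name is hypothetical, and a thread pool merely stands in for the cluster to show that the map step works on each chunk independently.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split(lines, parts):
    """Split the input into `parts` chunks that can be processed independently."""
    return [lines[i::parts] for i in range(parts)]


def map_chunk(chunk):
    """Map step: emit a (word, 1) key/value pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped):
    """Sort/shuffle step: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_groups(groups):
    """Reduce step: combine the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, parts=3):
    chunks = split(lines, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        mapped = list(pool.map(map_chunk, chunks))  # chunks share no state
    return reduce_groups(shuffle(mapped))


counts = word_count(["big data", "Big elephant", "data data"])
print(counts)  # {'big': 2, 'data': 3, 'elephant': 1}
```

Because each chunk is mapped without reference to any other chunk, the same structure would still work if the chunks lived on different machines.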
  • 15. 2002-2004 Doug cutting and Mike Cafarella started working on Nutch 2003-2004: Google publishes GFS and MapReduce paper 2004 : Doug cutting adds DFS and Mapreduce support to Nutch Yahoo ! Hires Cutting , bulid team to develop Hadoop 2007: NY time converts 4TB of archive over 100 EC2 cluster of Hadoop. Web scale deployement at Y!,Facebook,twitter. May 2009: Yahoo does fastest sort of a TB, 62secs over 1460nodes Yahoo sort a PB in 16.25hrs over 3658 nodes Hadoop History
  • 16. Hadoop vs. RDBMS. An elephant can't jump, but it can carry a heavy load! A fundamental tenet of relational databases is that structure is defined by a schema. Large data sets, however, are often unstructured or semi-structured, and for these Hadoop is the better choice: the Hadoop MapReduce framework uses key/value pairs as its basic data unit, which is flexible enough to work with less-structured data types. Scaling commercial relational databases is expensive and limited. SQL is a high-level declarative language with a black-box query engine: you query data by stating the result you want and let the database engine figure out how to derive it. With MapReduce you can build complex statistical models from your data, produce analytical reports, or reformat your image data; SQL is not well designed for such tasks. MapReduce tries to collocate the data with the compute node, so data access is fast because it is local. Coordinating the processes in a large-scale distributed computation is a challenge; HDFS and MapReduce make it easy to split, store, process and aggregate.
  • 17. Hadoop vs. RDBMS (continued). To run a bigger database you need to buy a bigger machine, and high-end machines are not cost-effective for many applications. For example, a machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster. Hadoop is designed as a scale-out architecture operating on a cluster of commodity hardware; adding more resources means adding more machines to the Hadoop cluster. Effective cost per user TB: $250/TB. Other solutions (RDBMS) cost in the range of $100 to $100K per user TB. The hardest aspect is gracefully handling partial failure (when you don't know whether a remote process has failed or not) while still making progress with the overall computation. MapReduce spares the programmer from having to think about failure, since the implementation detects failed map or reduce tasks and reschedules replacements. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another.
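The rescheduling idea on slide 17 can be sketched in a few lines. The code below is a simplified illustration, not Hadoop's scheduler: `run_job`, the node names and the simulated failure are all hypothetical. It relies on the shared-nothing property described above, since a task can be rerun on another node only because it depends on nothing but its own input.

```python
def run_job(tasks, nodes, run_task, max_attempts=3):
    """Run every task, moving a failed task to the next node and retrying it."""
    results = {}
    for index, (task_id, payload) in enumerate(tasks.items()):
        for attempt in range(max_attempts):
            node = nodes[(index + attempt) % len(nodes)]
            try:
                results[task_id] = run_task(node, payload)
                break  # task succeeded, move on to the next one
            except RuntimeError:
                continue  # treat as a failed task and reschedule elsewhere
        else:
            raise RuntimeError(f"task {task_id} failed on {max_attempts} nodes")
    return results


def square_unless_node_down(node, payload):
    """Stand-in for a map task; 'node-b' simulates a failed machine."""
    if node == "node-b":
        raise RuntimeError("node-b is down")
    return payload * payload


results = run_job(
    {"t1": 2, "t2": 3, "t3": 4},
    ["node-a", "node-b", "node-c"],
    square_unless_node_down,
)
print(results)  # {'t1': 4, 't2': 9, 't3': 16}
```

Task `t2` is first placed on the failed node and then completes on `node-c`, and the caller never has to handle the failure itself.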
  • 18. Hadoop vs. RDBMS (continued). Hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is replication: the system keeps redundant copies of the data so that in the event of failure another copy is available. The Hadoop framework handles node failure and disk failure efficiently.
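A small sketch shows why replication, as described on slide 18, protects against the loss of a node. It assumes three copies per block and a simple round-robin placement; the round-robin rule and all names are illustrative assumptions, and the real HDFS placement policy is more involved.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so that the copies land on distinct nodes.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def surviving_copies(placement, failed_nodes):
    """Return, for each block, the nodes that still hold a copy."""
    return {
        block: [node for node in holders if node not in failed_nodes]
        for block, holders in placement.items()
    }


nodes = ["n1", "n2", "n3", "n4"]
placement = place_replicas(["blk_0", "blk_1", "blk_2", "blk_3", "blk_4"], nodes)
after_failure = surviving_copies(placement, failed_nodes={"n2"})

# Every block is still readable after losing one node.
print(all(len(holders) >= 1 for holders in after_failure.values()))
```

With three distinct copies per block, a single node failure leaves at least two copies of every block available.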
  • 19. Hadoop vs. RDBMS (continued). Is Hadoop an alternative to an RDBMS? Hadoop is not replacing the traditional data systems used for building analytic applications (RDBMS, EDW and MPP systems) but rather complements them. It interoperates with existing systems and tools; at the moment Apache Hadoop is not a substitute for a database. No relations, only key/value pairs. Big Data: unstructured (text) and semi-structured (sequence/binary files). Structured: HBase (modeled on Google Bigtable). Hadoop works fine together with an RDBMS: it is used to distill large quantities of data into something more manageable.
  • 20. Hadoop architecture. Hadoop is designed and built on two independent frameworks: Hadoop = HDFS + MapReduce. HDFS (storage and file system): a reliable distributed file system that provides high-throughput access to data. MapReduce (processing): a framework for high-performance distributed data processing using the divide-and-aggregate programming paradigm. Hadoop has a master/slave architecture for both storage and processing.
  • 21. Hadoop master and slave architecture. Hadoop storage and file system: the Hadoop Distributed File System (HDFS). The components (daemons) of HDFS are: • NameNode is the master of the system. It maintains the namespace (directories and files) and manages the blocks that are present on the DataNodes. • DataNodes are the slaves, which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from clients. • Secondary NameNode is responsible for performing periodic checkpoints, so that in the event of NameNode failure you can restart the NameNode using the checkpoint. [Diagram: master NameNode with a Secondary NameNode taking periodic checkpoints; slave DataNodes.]
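The division of labor on slide 21 (the NameNode holds metadata, the DataNodes hold the bytes) can be modeled with two toy classes. This is a deliberately simplified sketch and not the HDFS client API: the class names, method names, block IDs and file path are all hypothetical.

```python
class ToyDataNode:
    """Slave: stores the actual block contents."""

    def __init__(self):
        self.blocks = {}  # block id -> bytes


class ToyNameNode:
    """Master: stores only metadata, never the file contents."""

    def __init__(self):
        self.files = {}          # path -> ordered list of block ids
        self.block_holders = {}  # block id -> names of DataNodes holding a copy

    def register_file(self, path, block_holders):
        self.files[path] = list(block_holders)
        self.block_holders.update(block_holders)

    def locate(self, path):
        return [(block, self.block_holders[block]) for block in self.files[path]]


def read_file(namenode, datanodes, path):
    """Client read: ask the master where the blocks are, then fetch from slaves."""
    parts = []
    for block_id, holders in namenode.locate(path):
        parts.append(datanodes[holders[0]].blocks[block_id])
    return b"".join(parts)


datanodes = {"dn1": ToyDataNode(), "dn2": ToyDataNode()}
datanodes["dn1"].blocks["blk_0"] = b"hello "
datanodes["dn2"].blocks["blk_0"] = b"hello "
datanodes["dn2"].blocks["blk_1"] = b"hadoop"

namenode = ToyNameNode()
namenode.register_file("/logs/a.txt", {"blk_0": ["dn1", "dn2"], "blk_1": ["dn2"]})

content = read_file(namenode, datanodes, "/logs/a.txt")
print(content)  # b'hello hadoop'
```

The master only answers the question of where each block lives; the data itself flows between the client and the DataNodes.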
  • 22. Hadoop master and slave architecture (continued). Parallel and distributed computation: the MapReduce paradigm. The components (daemons) of MapReduce are: • JobTracker is the master of the system, which manages the jobs and resources in the cluster (the TaskTrackers). The JobTracker tries to schedule each map task as close as possible to the data being processed, i.e. on the TaskTracker running on the same DataNode as the underlying block. • TaskTrackers are the slaves, which are deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the JobTracker. [Diagram: master JobTracker; slave TaskTrackers.]
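The locality preference described on slide 22 can be sketched as a greedy assignment: prefer a node that already holds the block, and fall back to any node with a free slot. This is an illustrative toy, not the JobTracker's actual algorithm; the function name, the slot model and the node names are assumptions.

```python
def schedule_maps(block_locations, free_slots):
    """Assign one map task per block, preferring nodes that hold the block.

    Returns {block: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node = local[0]  # data-local: the task runs where the block lives
        else:
            node = max(slots, key=slots.get)  # fall back to the freest node
            if slots[node] <= 0:
                raise RuntimeError("no free task slots in the cluster")
        slots[node] -= 1
        assignment[block] = (node, node in holders)
    return assignment


assignment = schedule_maps(
    block_locations={"blk_0": ["n1", "n2"], "blk_1": ["n1"], "blk_2": ["n3"]},
    free_slots={"n1": 2, "n2": 1, "n3": 0, "n4": 1},
)
print(assignment)
```

In this run the first two tasks land on `n1`, which holds their blocks, while `blk_2` has to run remotely because its only holder has no free slot.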
  • 25. The Hadoop ecosystem now. The Hadoop ecosystem has grown over the last few years, and there is a lot of jargon around its tools and frameworks. Hadoop has become the kernel.
  • 26. Hadoop environment. Hadoop can be configured in three modes. Standalone: all Hadoop daemons run inside a single Java process and use the local file system for storage; standalone mode is helpful for debugging Hadoop applications. Pseudo-distributed: each Hadoop daemon runs in a different JVM, as a separate process, but all processes run on a single machine. Fully distributed: Hadoop's real power (parallel processing, scalability, independent task execution, replication management, workflow management, fault tolerance and data consistency) lies in fully distributed mode, in which multiple machines contribute processing power and storage to the cluster.
  • 27. Hadoop features and summary. A distributed framework for processing and storing data, generally on commodity hardware. Completely open source and written in Java. Store anything: unstructured or semi-structured data; storage capacity. Scales linearly; cost is not exponential. Data locality and processing your way: code moves to the data, and in MapReduce you specify the actual steps for processing the data and deriving the output. Streaming access: process data in any language. Failure and fault tolerance: Hadoop detects failures and heals itself. Reliable: data is replicated and failed tasks are rerun, so there is no need to maintain a backup of the data. Cost-effective: Hadoop is designed as a scale-out architecture operating on a cluster of commodity PCs. The Hadoop framework transparently provides applications with reliability, adaptation and data motion. Primarily used for batch processing, not for real-time or transactional user applications. And many more.
  • 28. Who wrote Hadoop? It's the community. Hadoop is open source at Apache. No strategic agenda: quality is emergent and continuously evolving. Community-based and strong: diverse organizations collaborating voluntarily, decisions by consensus, transparent. Allows competing projects: survival of the fittest. A loose federation of projects: permits evolution. Insures against vendor lock-in: you can't buy Apache.
  • 29. Contributors and development. Lifetime patches contributed for all Hadoop-related projects, by community members' current employer. (Source: JIRA tickets.)
  • 30. What is Hadoop used for? • Search: Yahoo, Amazon, Zvents • Log processing: Facebook, Yahoo, ContextWeb, Joost, Last.fm • Recommendation systems: Facebook • Data warehouse: Facebook, AOL • Video and image analysis: New York Times, Eyealike. Almost every domain!
  • 31. Who uses Hadoop? Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet, Veoh, Yahoo!, Twitter, LinkedIn. The list is now too long to enumerate.
  • 32. References Hadoop: The Definitive Guide, Third Edition by Tom White. http://hadoop.apache.org/ http://www.cloudera.com/
  • 33.
  • 34. [Ecosystem diagram] ZooKeeper (coordination); MapReduce (job scheduling and execution); HBase (key-value store); Hive (warehouse); Pig (data flow); Avro (serialization); Sqoop, Flume; MR streaming; ETL tools; application DB; transactional DB; REST API.

Editor's Notes

  1. Look around at the technology we have today, and it's easy to come to the conclusion that it's all about data. We live in the data age. It's not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006, forecasting tenfold growth to 1.8 zettabytes by 2011 and to 5 zettabytes by 2015. This flood of data is coming from many sources. Consider the following: • The New York Stock Exchange generates about one terabyte of new trade data per day. • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage. • Ancestry.com, the genealogy site, stores around 2.5 petabytes of data. • The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month. • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.
  2. Instead of growing a system onto larger and larger hardware, the scale-out approach spreads the processing onto more and more machines. These traditional approaches to scale-up and scale-out are often not feasible: the purchase costs are often high, as is the effort to develop and manage the systems.