36. Hadoop 1 – Job & Task Trackers
Master Node - The majority of Hadoop deployments consist of several master node
instances. Having more than one master node helps eliminate the risk of a single
point of failure.
NameNode - These processes are charged with storing a directory tree of all files
in the Hadoop Distributed File System (HDFS). They also keep track of where the
file data is kept within the cluster. Client applications contact NameNodes when
they need to locate a file, or to add, copy, or delete a file.
DataNode - The DataNode stores data in HDFS and is responsible for
replicating data across the cluster. DataNodes interact with client applications once
the NameNode has supplied the DataNode's address.
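The division of labor described above (the NameNode holds metadata, DataNodes hold the bytes) can be sketched with a toy in-memory model. The classes ToyNameNode and ToyDataNode, the block ids, and the addresses are all hypothetical names made up for illustration; this is not the HDFS API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""
    files: dict = field(default_factory=dict)      # path -> ordered list of block ids
    locations: dict = field(default_factory=dict)  # block id -> DataNode addresses

    def add_file(self, path, blocks):
        # blocks: list of (block_id, [datanode addresses]) pairs
        self.files[path] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.locations[block_id] = list(nodes)

    def locate(self, path):
        # Returns (block_id, replica addresses) pairs; no file data passes through here.
        return [(b, self.locations[b]) for b in self.files[path]]


@dataclass
class ToyDataNode:
    address: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes

    def read(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Client-side read: ask the NameNode where blocks are, then fetch from DataNodes."""
    data = b""
    for block_id, addresses in namenode.locate(path):
        # This sketch simply uses the first listed replica.
        data += datanodes[addresses[0]].read(block_id)
    return data


# Two DataNodes, each holding a replica of both blocks (hypothetical data).
dn1 = ToyDataNode("dn1", {"blk_1": b"hello ", "blk_2": b"world"})
dn2 = ToyDataNode("dn2", {"blk_1": b"hello ", "blk_2": b"world"})
nn = ToyNameNode()
nn.add_file("/logs/a.txt", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn1"])])
content = read_file(nn, {"dn1": dn1, "dn2": dn2}, "/logs/a.txt")
print(content)
```

The point of the sketch is that the client only contacts a DataNode after the NameNode has returned that DataNode's address.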
Worker Node - Unlike master nodes, whose numbers we can count on one hand, a
representative Hadoop deployment consists of dozens or hundreds of worker
nodes, which provide enough processing power to analyze anywhere from a
few hundred terabytes up to one petabyte. Each worker node includes
a DataNode as well as a Task Tracker.
37. Map Reduce
Job Tracker / MapReduce Workload Management Layer - This
process is assigned to interact with client applications. It is
responsible for distributing MapReduce tasks to particular nodes
within a cluster. This engine coordinates all aspects of Hadoop,
such as scheduling and launching jobs.
Task Tracker - This is a process in the cluster that is capable of
receiving tasks (including Map, Reduce, and Shuffle) from a Job
Tracker.
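As a rough illustration of the Job Tracker handing Map and Reduce tasks to Task Trackers, here is a minimal single-process word-count sketch. ToyJobTracker and ToyTaskTracker are hypothetical names, and the round-robin assignment is an assumption standing in for real scheduling; nothing here is Hadoop's actual API.

```python
from collections import defaultdict
from itertools import cycle


def map_task(line):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in line.split()]


def reduce_task(word, counts):
    # Reduce: combine all values that share a key.
    return word, sum(counts)


class ToyTaskTracker:
    """Worker-side process: runs whatever task it is handed."""

    def __init__(self, name):
        self.name = name
        self.completed = 0

    def run(self, task, *args):
        self.completed += 1
        return task(*args)


class ToyJobTracker:
    """Master-side process: splits a job into tasks and hands them to trackers."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, splits):
        workers = cycle(self.trackers)  # round-robin, an assumed scheduling policy
        # Map phase: one map task per input split.
        mapped = [next(workers).run(map_task, split) for split in splits]
        # Shuffle: group intermediate values by key.
        grouped = defaultdict(list)
        for pairs in mapped:
            for word, one in pairs:
                grouped[word].append(one)
        # Reduce phase: one reduce task per key.
        return dict(
            next(workers).run(reduce_task, word, counts)
            for word, counts in sorted(grouped.items())
        )


trackers = [ToyTaskTracker("tt1"), ToyTaskTracker("tt2")]
result = ToyJobTracker(trackers).run_job(["big data big", "data node", "big node"])
print(result)
```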
48. Coordination in a distributed system
• Coordination: An act that multiple nodes must perform together.
• Examples:
– Group membership
– Locking
– Publisher/Subscriber
– Leader Election
– Synchronization
• Getting node coordination correct is very hard!
50. Introducing ZooKeeper
"ZooKeeper allows distributed processes to
coordinate with each other through a shared
hierarchical name space of data registers."
- ZooKeeper Wiki
ZooKeeper is much more than a
distributed lock server!
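To make "a shared hierarchical name space of data registers" concrete, the following toy sketch models a tree of small named registers addressed by slash-separated paths. ToyNamespace and its methods are hypothetical and live in a single process; they are not the ZooKeeper client API, and the paths and values are invented for illustration.

```python
class ToyNamespace:
    """In-memory tree of named data registers, addressed by slash-separated paths."""

    def __init__(self):
        self.nodes = {"/": b""}  # path -> data stored at that path

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

    def set(self, path, data):
        if path not in self.nodes:
            raise KeyError(path)
        self.nodes[path] = data

    def children(self, path):
        # Direct children only: strip the prefix and keep names without a further "/".
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self.nodes
            if p != path and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


ns = ToyNamespace()
ns.create("/app")
ns.create("/app/config", b"replicas=3")
ns.create("/app/workers")
ns.create("/app/workers/w1", b"host-a")
print(ns.children("/app"), ns.get("/app/config"))
```

Every node in the tree can both hold data and have children, which is what distinguishes this shape from a plain key-value map.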
51. What is ZooKeeper?
• An open source, high-performance coordination service for
distributed applications.
• Exposes common services in a simple interface:
– naming
– configuration management
– locks & synchronization
– group services
… developers don't have to write them from scratch
• Build your own on it for specific needs.
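As one example of building a coordination primitive on top of simple services, here is a toy sketch of leader election by sequence number: each participant registers and receives an increasing number, the lowest live number is the leader, and when that participant leaves the next lowest takes over. ToyElection is a hypothetical single-process stand-in and not the ZooKeeper API; the participant names are made up.

```python
import itertools


class ToyElection:
    """Leader election by lowest sequence number among live participants."""

    def __init__(self):
        self._seq = itertools.count()  # monotonically increasing sequence numbers
        self.members = {}              # sequence number -> participant name

    def join(self, name):
        seq = next(self._seq)
        self.members[seq] = name
        return seq

    def leave(self, seq):
        # Models a participant crashing or disconnecting.
        self.members.pop(seq, None)

    def leader(self):
        if not self.members:
            return None
        return self.members[min(self.members)]


election = ToyElection()
a = election.join("node-a")
b = election.join("node-b")
c = election.join("node-c")
first_leader = election.leader()
election.leave(a)  # the current leader goes away
second_leader = election.leader()
print(first_leader, second_leader)
```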
69. BigTable Data Model
Name  Site             Counter
Dick  Ebay             507,018
Dick  Google           690,414
Jane  Google           716,426
Dick  Facebook         723,649
Jane  Facebook         643,261
Jane  ILoveLarry.com   856,767
Dick  MadBillFans.com  675,230

NameId  Name
1       Dick
2       Jane

SiteId  SiteName
1       Ebay
2       Google
3       Facebook
4       ILoveLarry.com
5       MadBillFans.com

NameId  SiteId  Counter
1       1       507,018
1       2       690,414
2       2       716,426
1       3       723,649
2       3       643,261
2       4       856,767
1       5       675,230

Id  Name  Ebay     Google   Facebook  (other columns)  MadBillFans.com
1   Dick  507,018  690,414  723,649   . . .            675,230

Id  Name  Google   Facebook  (other columns)  ILoveLarry.com
2   Jane  716,426  643,261   . . .            856,767
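The step from the relational tables to the wide rows can be sketched as a pivot: one row per name, with a column only for the sites that name actually has a counter for. The values below come from the tables on this slide, with site ids following the SiteId table; the function name to_wide_rows and the use of nested dicts are illustrative assumptions, not BigTable's API.

```python
# Lookup tables and counter rows, taken from the slide's data.
names = {1: "Dick", 2: "Jane"}
sites = {1: "Ebay", 2: "Google", 3: "Facebook", 4: "ILoveLarry.com", 5: "MadBillFans.com"}
counters = [
    (1, 1, 507018),
    (1, 2, 690414),
    (2, 2, 716426),
    (1, 3, 723649),
    (2, 3, 643261),
    (2, 4, 856767),
    (1, 5, 675230),
]


def to_wide_rows(names, sites, counters):
    """Pivot (NameId, SiteId, Counter) rows into one sparse row per name."""
    rows = {name_id: {"Name": name} for name_id, name in names.items()}
    for name_id, site_id, count in counters:
        # Each site becomes a column; columns that are absent are simply not stored.
        rows[name_id][sites[site_id]] = count
    return rows


wide = to_wide_rows(names, sites, counters)
for row_id, row in wide.items():
    print(row_id, row)
```

Note that the two rows end up with different sets of columns, which a fixed relational schema would not allow without NULL-filled cells.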
70. Document databases
• Structured documents – XML and JSON
(JavaScript Object Notation) become more
prevalent within applications
• Web programmers start storing these in BLOBs in
MySQL
• Emergence of XML and JSON databases
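The "JSON in a BLOB" pattern can be sketched as follows. SQLite (from the Python standard library) is swapped in for MySQL so the example runs without a server; the table name docs, the helper find_by_name, and the sample documents are all made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body BLOB)")

# Documents need not share a schema: the second one has an extra "tags" field.
documents = [
    {"name": "Dick", "sites": {"Ebay": 507018, "Google": 690414}},
    {"name": "Jane", "sites": {"Google": 716426}, "tags": ["new"]},
]
for doc in documents:
    conn.execute(
        "INSERT INTO docs (body) VALUES (?)",
        (json.dumps(doc).encode("utf-8"),),
    )


def find_by_name(conn, name):
    """Filter in application code: every blob is fetched and parsed."""
    matches = []
    for (body,) in conn.execute("SELECT body FROM docs ORDER BY id"):
        doc = json.loads(body)
        if doc.get("name") == name:
            matches.append(doc)
    return matches


jane = find_by_name(conn, "Jane")
print(jane)
```

In this sketch the database treats each document as opaque bytes, so filtering on a field means fetching and parsing every row in the application. That limitation is one plausible motivation for databases that understand the document structure natively.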