Copyright © 2008 The Apache Software Foundation. All rights reserved.
ZooKeeper Recipes and Solutions
Table of contents
1 A Guide to Creating Higher-level Constructs with ZooKeeper
1.1 Out of the Box Applications: Name Service, Configuration, Group Membership
1.2 Barriers
1.3 Queues
1.4 Locks
1.5 Two-phased Commit
1.6 Leader Election
1 A Guide to Creating Higher-level Constructs with ZooKeeper
In this article, you'll find guidelines for using ZooKeeper to implement higher-order
functions. All of them are conventions implemented at the client and do not require special
support from ZooKeeper. Hopefully the community will capture these conventions in client-
side libraries to ease their use and to encourage standardization.
One of the most interesting things about ZooKeeper is that even though ZooKeeper uses
asynchronous notifications, you can use it to build synchronous consistency primitives, such
as queues and locks. As you will see, this is possible because ZooKeeper imposes an overall
order on updates, and has mechanisms to expose this ordering.
Note that the recipes below attempt to employ best practices. In particular, they avoid
polling, timers or anything else that would result in a "herd effect", causing bursts of traffic
and limiting scalability.
There are many useful functions that can be imagined that aren't included here - revocable
read-write priority locks, as just one example. And some of the constructs mentioned here
- locks, in particular - illustrate certain points, even though you may find other constructs,
such as event handles or queues, a more practical means of performing the same function. In
general, the examples in this section are designed to stimulate thought.
1.1 Out of the Box Applications: Name Service, Configuration, Group Membership
Name service and configuration are two of the primary applications of ZooKeeper. These
two functions are provided directly by the ZooKeeper API.
Another function directly provided by ZooKeeper is group membership. The group is
represented by a node. Members of the group create ephemeral nodes under the group node.
Nodes of the members that fail abnormally will be removed automatically when ZooKeeper
detects the failure.
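For illustration, here is a minimal group-membership sketch using ZooKeeper's Java client. The group path "/mygroup" and the member names are placeholders, the group node is assumed to exist already, and error handling is omitted.

```java
import java.util.List;
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;

public class GroupMember {
    // zk is an already-connected ZooKeeper handle; "/mygroup" is a placeholder group node
    // that must exist before members join.
    static void join(ZooKeeper zk, String memberName) throws Exception {
        // An ephemeral node disappears automatically if this member's session dies.
        zk.create("/mygroup/" + memberName, new byte[0],
                  Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    static List<String> listMembers(ZooKeeper zk) throws Exception {
        // Current membership is simply the list of children of the group node.
        return zk.getChildren("/mygroup", false);
    }
}
```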
1.2 Barriers
Distributed systems use barriers to block processing of a set of nodes until a condition is met,
at which time all the nodes are allowed to proceed. Barriers are implemented in ZooKeeper
by designating a barrier node. The barrier is in place if the barrier node exists. Here's the
pseudo code:
1. Client calls the ZooKeeper API's exists() function on the barrier node, with watch set to
true.
2. If exists() returns false, the barrier is gone and the client proceeds.
3. Else, if exists() returns true, the client waits for a watch event from ZooKeeper for the
barrier node.
4. When the watch event is triggered, the client reissues the exists( ) call, again waiting until
the barrier node is removed.
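As a rough illustration, the following Java sketch follows the pseudo code above. The barrier path is a parameter, and the use of a latch per watch event is an implementation choice, not part of the recipe itself.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class SimpleBarrier {
    // Block until the barrier node (e.g. "/barrier") no longer exists.
    static void waitForBarrier(ZooKeeper zk, String barrierPath) throws Exception {
        while (true) {
            CountDownLatch latch = new CountDownLatch(1);
            // exists() with a watcher: we are notified when the node changes or is deleted.
            Stat stat = zk.exists(barrierPath, event -> latch.countDown());
            if (stat == null) {
                return;                 // barrier is gone, the client proceeds
            }
            latch.await();              // wait for the watch event, then re-check
        }
    }
}
```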
1.2.1 Double Barriers
Double barriers enable clients to synchronize the beginning and the end of a computation.
When enough processes have joined the barrier, processes start their computation and leave
the barrier once they have finished. This recipe shows how to use a ZooKeeper node as a
barrier.
The pseudo code in this recipe represents the barrier node as b. Every client process p
registers with the barrier node on entry and unregisters when it is ready to leave. A node
registers with the barrier node via the Enter procedure below and waits until x client processes
have registered before proceeding with the computation. (The x here is up to you to determine for
your system.)
Enter:
1. Create a name n = b + "/" + p
2. Set watch: exists(b + "/ready", true)
3. Create child: create(n, EPHEMERAL)
4. L = getChildren(b, false)
5. If there are fewer children in L than x, wait for the watch event
6. Else create(b + "/ready", REGULAR)

Leave:
1. L = getChildren(b, false)
2. If there are no children, exit
3. If p is the only process node in L, delete(n) and exit
4. If p is the lowest process node in L, wait on the highest process node in L
5. Else delete(n) if it still exists and wait on the lowest process node in L
6. goto 1
On entering, all processes watch on a ready node and create an ephemeral node as a child of
the barrier node. Each process but the last enters the barrier and waits at step 5 of the Enter
procedure for the ready node to appear. The process that creates the xth node, the last process,
will see x nodes in the list of children and create the ready node, waking up the other processes.
Note that waiting processes wake up only when it is time to exit, so waiting is efficient.
On exit, you can't use a flag such as ready because you are watching for process nodes to go
away. By using ephemeral nodes, processes that fail after the barrier has been entered do not
prevent correct processes from finishing. When processes are ready to leave, they need to
delete their process nodes and wait for all other processes to do the same.
Processes exit when there are no process nodes left as children of b. However, as an
optimization, you can use the lowest process node as the ready flag. All other processes that
are ready to exit watch for the lowest existing process node to go away, and the owner of the
lowest process node watches for any other process node (picking the highest for simplicity) to go
away. This means that only a single process wakes up on each node deletion, except for the
last node, which wakes up everyone when it is removed.
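Here is a minimal sketch of the Enter procedure in Java, following the pseudo code above. The barrier path b, the process name p, and the required count x are parameters; exception handling and the Leave procedure are omitted.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;

public class DoubleBarrier {
    // Enter procedure: b is the barrier node, p the process name, x the required count.
    static void enter(ZooKeeper zk, String b, String p, int x) throws Exception {
        String n = b + "/" + p;
        CountDownLatch ready = new CountDownLatch(1);
        // Set the watch on the "ready" node before creating our own child,
        // so we cannot miss its creation.
        zk.exists(b + "/ready", event -> ready.countDown());
        zk.create(n, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        List<String> children = zk.getChildren(b, false);
        if (children.size() < x) {
            ready.await();                  // wait until the last process creates "ready"
        } else {
            // We are the last process to register: create the ready flag and wake everyone.
            zk.create(b + "/ready", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
    }
}
```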
1.3 Queues
Distributed queues are a common data structure. To implement a distributed queue in
ZooKeeper, first designate a znode to hold the queue, the queue node. The distributed clients
put something into the queue by calling create() with a pathname ending in "queue-", with
the sequence and ephemeral flags in the create() call set to true. Because the sequence
flag is set, the new pathnames will have the form _path-to-queue-node_/queue-X, where
X is a monotonically increasing number. A client that wants to remove an item from the queue
calls ZooKeeper's getChildren( ) function, with watch set to true on the queue node, and
begins processing nodes, lowest number first. The client does not need to issue another
getChildren( ) until it exhausts the list obtained from the first getChildren( ) call. If there
are no children in the queue node, the reader waits for a watch notification to check the
queue again.
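The following Java sketch shows the producer and consumer sides of this recipe. Note one deliberate deviation: the recipe text sets both the sequence and ephemeral flags, but this sketch creates persistent sequential nodes so that an item survives its producer's session; the queue path is a placeholder and error handling is omitted.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;

public class DistributedQueue {
    // Producer: append an element under the queue node (e.g. "/queue").
    static String enqueue(ZooKeeper zk, String queuePath, byte[] data) throws Exception {
        // The server appends a monotonically increasing, zero-padded counter to "queue-".
        return zk.create(queuePath + "/queue-", data,
                         Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    }

    // Consumer: take the element with the lowest sequence number, if any.
    static byte[] dequeue(ZooKeeper zk, String queuePath) throws Exception {
        List<String> children = zk.getChildren(queuePath, false);
        if (children.isEmpty()) {
            return null;                    // caller would set a watch and retry later
        }
        Collections.sort(children);         // lexicographic order matches sequence order
        String lowest = queuePath + "/" + children.get(0);
        byte[] data = zk.getData(lowest, false, null);
        // With multiple consumers, this delete can fail with NoNodeException if another
        // consumer took the item first; production code would catch that and retry.
        zk.delete(lowest, -1);              // -1 means "any version"
        return data;
    }
}
```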
Note:
There is now a Queue implementation in the ZooKeeper recipes directory. It is distributed with
the release, in the src/recipes/queue directory of the release artifact.
1.3.1 Priority Queues
To implement a priority queue, you need only make two simple changes to the generic
queue recipe. First, to add to a queue, the pathname ends with "queue-YY", where YY is the
priority of the element, with lower numbers representing higher priority (just like UNIX).
Second, when removing from the queue, a client uses an up-to-date children list, meaning that
the client will invalidate previously obtained children lists if a watch notification triggers for
the queue node.
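A minimal sketch of the producer-side change follows; it could be added to the DistributedQueue sketch above. The two-digit zero-padding of the priority is an assumption based on the "queue-YY" convention, chosen so that lexicographic order equals priority order.

```java
// Encode the priority into the name so that lexicographic sorting yields priority order;
// the server still appends its own zero-padded sequence counter after the priority digits.
static String enqueueWithPriority(ZooKeeper zk, String queuePath,
                                  int priority, byte[] data) throws Exception {
    String prefix = String.format("%s/queue-%02d-", queuePath, priority);  // e.g. "queue-05-"
    return zk.create(prefix, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                     CreateMode.PERSISTENT_SEQUENTIAL);
}
```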
1.4 Locks
ZooKeeper can be used to implement fully distributed locks that are globally synchronous,
meaning that at any snapshot in time no two clients think they hold the same lock. As with
priority queues, first define a lock node.
Note:
There is now a Lock implementation in the ZooKeeper recipes directory. It is distributed with the
release, in the src/recipes/lock directory of the release artifact.
Clients wishing to obtain a lock do the following:
1. Call create( ) with a pathname of "_locknode_/lock-" and the sequence and ephemeral
flags set.
2. Call getChildren( ) on the lock node without setting the watch flag (this is important to
avoid the herd effect).
3. If the pathname created in step 1 has the lowest sequence number suffix, the client has
the lock and the client exits the protocol.
4. The client calls exists( ) with the watch flag set on the path in the lock directory with the
next lowest sequence number.
5. If exists( ) returns false, go to step 2. Otherwise, wait for a notification for the pathname
from the previous step before going to step 2.
The unlock protocol is very simple: clients wishing to release a lock simply delete the node
they created in step 1.
Here are a few things to notice:
• The removal of a node will only cause one client to wake up since each node is watched
by exactly one client. In this way, you avoid the herd effect.
• There is no polling or timeouts.
• Because of the way you implement locking, it is easy to see the amount of lock
contention, break locks, debug locking problems, etc.
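The following Java sketch follows the lock protocol above step by step. The lock directory path is a parameter, the latch-per-watch structure is an implementation choice, and error handling is omitted.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.data.Stat;

public class WriteLock {
    // Acquire the lock under lockDir (e.g. "/_locknode_"); returns the path of our lock node.
    static String lock(ZooKeeper zk, String lockDir) throws Exception {
        // Step 1: ephemeral sequential node; it vanishes if our session dies while waiting or holding.
        String myPath = zk.create(lockDir + "/lock-", new byte[0],
                                  Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myNode = myPath.substring(myPath.lastIndexOf('/') + 1);
        while (true) {
            // Step 2: list children without a watch (no herd effect).
            List<String> children = zk.getChildren(lockDir, false);
            Collections.sort(children);
            int myIndex = children.indexOf(myNode);
            if (myIndex == 0) {
                return myPath;                     // Step 3: lowest sequence number, lock held
            }
            // Step 4: watch only the node immediately before ours.
            String prev = lockDir + "/" + children.get(myIndex - 1);
            CountDownLatch latch = new CountDownLatch(1);
            Stat stat = zk.exists(prev, event -> latch.countDown());
            if (stat != null) {
                latch.await();                     // Step 5: wait for that node to go away
            }
            // Either the predecessor is already gone or we were notified; re-check from step 2.
        }
    }

    static void unlock(ZooKeeper zk, String myPath) throws Exception {
        zk.delete(myPath, -1);                     // releasing is just deleting our node
    }
}
```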
1.4.1 Shared Locks
You can implement shared locks with a few changes to the lock protocol:
Obtaining a read lock:
1. Call create( ) to create a node with pathname "_locknode_/read-". This is the lock node
used later in the protocol. Make sure to set both the sequence and ephemeral flags.
2. Call getChildren( ) on the lock node without setting the watch flag - this is important, as it
avoids the herd effect.
3. If there are no children with a pathname starting with "write-" and having a lower sequence
number than the node created in step 1, the client has the lock and can exit the protocol.
4. Otherwise, call exists( ), with the watch flag set, on the node in the lock directory with a
pathname starting with "write-" that has the next lowest sequence number.
5. If exists( ) returns false, go to step 2.
6. Otherwise, wait for a notification for the pathname from the previous step before going to
step 2.

Obtaining a write lock:
1. Call create( ) to create a node with pathname "_locknode_/write-". This is the lock node
spoken of later in the protocol. Make sure to set both the sequence and ephemeral flags.
2. Call getChildren( ) on the lock node without setting the watch flag - this is important, as it
avoids the herd effect.
3. If there are no children with a lower sequence number than the node created in step 1, the
client has the lock and the client exits the protocol.
4. Call exists( ), with the watch flag set, on the node with the pathname that has the next
lowest sequence number.
5. If exists( ) returns false, go to step 2. Otherwise, wait for a notification for the pathname
from the previous step before going to step 2.
Note:
It might appear that this recipe creates a herd effect: when there is a large group of clients waiting for
a read lock, they all get notified more or less simultaneously when the "write-" node with the
lowest sequence number is deleted. In fact, that's valid behavior: all those waiting reader clients
should be released, since they now hold the lock. The herd effect refers to releasing a "herd" when in fact
only a single machine or a small number of machines can proceed.
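As a sketch of the read-lock side only, the following methods could sit alongside the WriteLock sketch above; the write-lock side differs only in the filter applied in step 3. The sequenceOf helper and the lockDir parameter are assumptions made for illustration.

```java
// Read-lock acquisition: only nodes named "write-..." with a lower sequence number block us.
static String readLock(ZooKeeper zk, String lockDir) throws Exception {
    String myPath = zk.create(lockDir + "/read-", new byte[0],
                              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    long mySeq = sequenceOf(myPath);
    while (true) {
        String blockingWrite = null;
        for (String child : zk.getChildren(lockDir, false)) {
            if (child.startsWith("write-") && sequenceOf(child) < mySeq
                    && (blockingWrite == null || sequenceOf(child) > sequenceOf(blockingWrite))) {
                blockingWrite = child;      // the "write-" node with the next lowest sequence
            }
        }
        if (blockingWrite == null) {
            return myPath;                  // no earlier writer, read lock held
        }
        java.util.concurrent.CountDownLatch latch = new java.util.concurrent.CountDownLatch(1);
        if (zk.exists(lockDir + "/" + blockingWrite, event -> latch.countDown()) != null) {
            latch.await();                  // wait for that writer to go away, then re-check
        }
    }
}

// Parse the zero-padded sequence suffix that the server appended (hypothetical helper).
static long sequenceOf(String node) {
    return Long.parseLong(node.substring(node.lastIndexOf('-') + 1));
}
```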
1.4.2 Recoverable Shared Locks
With minor modifications to the shared lock protocol, you can make shared locks revocable:
In step 1 of both the read-lock and write-lock protocols, call getData( ) with watch set
immediately after the call to create( ). If the client subsequently receives a notification for the
node it created in step 1, it does another getData( ) on that node, with watch set, and looks for
the string "unlock", which signals to the client that it must release the lock. This works because,
under this shared lock protocol, you can request that the client holding the lock give it up by
calling setData( ) on the lock node, writing "unlock" to that node.
Note that this protocol requires the lock holder to consent to releasing the lock. Such consent
is important, especially if the lock holder needs to do some processing before releasing the
lock. Of course you can always implement Revocable Shared Locks with Freaking Laser
Beams by stipulating in your protocol that the revoker is allowed to delete the lock node if
after some length of time the lock isn't deleted by the lock holder.
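As a small illustration of the revocation signal, assuming the lock-node path is known to both sides, these two fragments could be added to the lock sketches above:

```java
// Revoker side: ask the current holder to give up the lock.
static void requestUnlock(ZooKeeper zk, String lockNodePath) throws Exception {
    zk.setData(lockNodePath, "unlock".getBytes(), -1);    // -1 accepts any version
}

// Holder side: after creating its lock node, the holder keeps a data watch on it;
// when notified, it re-reads the data and releases the lock if the content says "unlock".
static boolean shouldRelease(ZooKeeper zk, String lockNodePath, Watcher rearm) throws Exception {
    byte[] data = zk.getData(lockNodePath, rearm, null);  // re-arm the watch for the next change
    return data != null && new String(data).equals("unlock");
}
```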
1.5 Two-phased Commit
A two-phase commit protocol is an algorithm that lets all clients in a distributed system agree
either to commit a transaction or abort.
In ZooKeeper, you can implement a two-phased commit by having a coordinator create a
transaction node, say "/app/Tx", and one child node per participating site, say "/app/Tx/
s_i". When coordinator creates the child node, it leaves the content undefined. Once each
site involved in the transaction receives the transaction from the coordinator, the site reads
each child node and sets a watch. Each site then processes the query and votes "commit"
or "abort" by writing to its respective node. Once the write completes, the other sites are
notified, and as soon as all sites have all votes, they can decide either "abort" or "commit".
Note that a node can decide "abort" earlier if some site votes for "abort".
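A minimal sketch of the site side follows, assuming the coordinator has already created the transaction node and one child per site as described; the node paths are parameters and error handling is omitted.

```java
// Each site writes its vote into its own child node; version -1 means "any version".
static void vote(ZooKeeper zk, String siteNode, boolean commit) throws Exception {
    zk.setData(siteNode, (commit ? "commit" : "abort").getBytes(), -1);
}

// Any participant can tally the votes by reading every site node under the transaction node.
static String decide(ZooKeeper zk, String txNode) throws Exception {
    for (String child : zk.getChildren(txNode, false)) {
        byte[] data = zk.getData(txNode + "/" + child, true, null);  // watch for later changes
        String v = (data == null) ? "" : new String(data);
        if (v.equals("abort")) return "abort";        // any abort vote aborts the transaction
        if (!v.equals("commit")) return "undecided";  // some site has not voted yet
    }
    return "commit";                                  // all sites voted commit
}
```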
An interesting aspect of this implementation is that the only role of the coordinator is
to decide upon the group of sites, to create the ZooKeeper nodes, and to propagate the
transaction to the corresponding sites. In fact, even propagating the transaction can be done
through ZooKeeper by writing it in the transaction node.
There are two important drawbacks of the approach described above. One is the message
complexity, which is O(n²). The second is the impossibility of detecting failures of sites
through ephemeral nodes. To detect the failure of a site using ephemeral nodes, it is
necessary that the site create the node.
To solve the first problem, you can have only the coordinator notified of changes to the
transaction nodes, and then have it notify the sites once it reaches a decision. Note that this
approach is scalable, but it is also slower, as it requires all communication to go through the
coordinator.
To address the second problem, you can have the coordinator propagate the transaction to the
sites, and have each site create its own ephemeral node.
1.6 Leader Election
A simple way of doing leader election with ZooKeeper is to use the SEQUENCE|
EPHEMERAL flags when creating znodes that represent "proposals" of clients. The idea
is to have a znode, say "/election", such that each client creates a child znode "/election/
n_" with both flags SEQUENCE|EPHEMERAL. With the sequence flag, ZooKeeper
automatically appends a sequence number that is greater than any previously appended
to a child of "/election". The process that created the znode with the smallest appended
sequence number is the leader.
That's not all, though. It is important to watch for failures of the leader, so that a new client
arises as the new leader in case the current leader fails. A trivial solution is to have all
application processes watch the current smallest znode, and check whether they are the
new leader when the smallest znode goes away (note that the smallest znode will go away if
the leader fails, because the node is ephemeral). But this causes a herd effect: upon failure
of the current leader, all other processes receive a notification and execute getChildren on
"/election" to obtain the current list of children of "/election". If the number of clients is
large, it causes a spike in the number of operations that the ZooKeeper servers have to process.
To avoid the herd effect, it is sufficient to watch for the next znode down in the sequence
of znodes. If a client receives a notification that the znode it is watching is gone, then it
becomes the new leader if there is no smaller znode. Note that this avoids the
herd effect by not having all clients watching the same znode.
Here's the pseudo code:
Let ELECTION be a path of choice of the application. To volunteer to be a leader:
1. Create znode z with path "ELECTION/n_" with both SEQUENCE and EPHEMERAL
flags;
2. Let C be the children of "ELECTION", and i be the sequence number of z;
3. Watch for changes on "ELECTION/n_j", where j is the largest sequence number such
that j < i and n_j is a znode in C;
Upon receiving a notification of znode deletion:
1. Let C be the new set of children of ELECTION;
2. If z is the smallest node in C, then execute leader procedure;
3. Otherwise, watch for changes on "ELECTION/n_j", where j is the largest sequence
number such that j < i and n_j is a znode in C;
Note that the znode having no preceding znode on the list of children does not imply that the
creator of this znode is aware that it is the current leader. Applications may consider creating
a separate znode to acknowledge that the leader has executed the leader procedure.
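The following Java sketch follows the pseudo code above. The ELECTION path is a parameter, the blocking latch-based wait is an implementation choice, and error handling is omitted.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;

public class LeaderElection {
    // Volunteer under the ELECTION node (e.g. "/election"); returns once we are the leader.
    static void volunteer(ZooKeeper zk, String election) throws Exception {
        String z = zk.create(election + "/n_", new byte[0],
                             Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        watchPredecessorOrLead(zk, election, z);
    }

    // Watch only the znode immediately preceding ours; re-check when it disappears.
    static void watchPredecessorOrLead(ZooKeeper zk, String election, String z) throws Exception {
        String mine = z.substring(z.lastIndexOf('/') + 1);
        while (true) {
            List<String> children = zk.getChildren(election, false);
            Collections.sort(children);
            int i = children.indexOf(mine);
            if (i == 0) {
                return;                            // smallest znode: execute the leader procedure
            }
            java.util.concurrent.CountDownLatch gone = new java.util.concurrent.CountDownLatch(1);
            String predecessor = election + "/" + children.get(i - 1);
            if (zk.exists(predecessor, event -> gone.countDown()) != null) {
                gone.await();                      // notified when the predecessor goes away
            }
        }
    }
}
```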
