Paxos
Building Reliable System
2015-07-02 @drdrxp
Background
Several processes do one thing.
The only problem in distributed systems is
achieving consensus.
Paxos: the core of distributed systems.
Agenda
1. Problem
2. Replication is not enough
3. Paxos Algorithm
4. Paxos Optimization
Problem
Required:
Durability: 99.99999999%
Availability: 99.99%
What we have:
Hard drive: ~4% annual failure rate
Server downtime: 0.1% or more
Packet loss between IDCs: 5% ~ 30%
Solution (Maybe)
Multiple Replicas
No data loss if x (x < n) replicas are lost
Probability of data loss (1 - durability):
1 replica: ~ 0.63%
2 replicas: ~ 0.00395%
3 replicas: ~ 0.000025%
n replicas: durability = 1 - x^n /* x = failure rate of a single replica */
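The replica arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that assumes replicas fail independently and takes x = 0.0063 from the single-replica figure; the function names are illustrative and not from the slides.

```python
def loss_probability(x: float, n: int) -> float:
    """Probability that all n replicas are lost, assuming independent failures."""
    return x ** n


def durability(x: float, n: int) -> float:
    """Durability with n replicas: 1 - x^n."""
    return 1.0 - loss_probability(x, n)


def loss_table(x: float, max_replicas: int) -> dict:
    """Map replica count -> loss probability in percent."""
    return {n: loss_probability(x, n) * 100 for n in range(1, max_replicas + 1)}
```

Calling `loss_table(0.0063, 3)` gives the per-replica-count loss percentages under that independence assumption.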
Solution.
How to replicate data?
Besides number of replicas:
Availability
Atomicity
Consistency
...
Fundamental Replication Algorithms
Master-Slave Async
Master-Slave Sync
Master-Slave Semi-Sync
Quorum Write and Read
Master-Slave Async
The MySQL way.
1. Master receives a write op.
2. Master writes it to disk.
3. Master responds ‘OK’.
4. Master replicates to slaves.
If the master's disk fails before replication
→ Data loss.
[Diagram: timeline across Client, Master, Slave.1 and Slave.2, with a disk failure on the Master before replication.]
Master-Slave Sync
1. Master receives a write op.
2. Master replicates the log to slaves.
3. A slave may block...
4. Client won’t receive ‘OK’ until
all slaves respond.
One unreachable node
halts the entire system.
Pro: No data loss.
Con: Low availability.
[Diagram: timeline across Client, Master, Slave.1 and Slave.2.]
Master-Slave Semi-Sync
1. Master receives a write op.
2. Master replicates the log to slaves.
3. A slave may block...
4. Client receives ‘OK’ if [1,n)
slaves respond.
Pro: High durability.
Pro: High availability.
Con: No slave is guaranteed to have all data
→ We need Quorum Write
[Diagram: timeline across Client, Master, Slave.1 and Slave.2.]
Quorum Write and Read
Dynamo / Cassandra
Write to W >= N/2+1 nodes.
No master required.
Read from R >= N/2+1 nodes.
W + R > N
Tolerates up to (N-1)/2 failed
nodes.
[Diagram: timeline across Client, Node.1, Node.2 and Node.3.]
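Why W + R > N matters can be checked by brute force: every write quorum must share at least one node with every read quorum. The sketch below enumerates all quorums for small N; the helper names are hypothetical.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest majority of n nodes: n/2 + 1 with integer division."""
    return n // 2 + 1


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w nodes intersects every set of r nodes."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )
```

With N=3 and W=R=2 every read sees at least one node that took the latest write; with N=4 and W=R=2 a read can miss it entirely.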
Quorum Write and Read. Last-Win
The last write wins.
Writes are totally ordered by
timestamp.
[Diagram: timeline across Client, Node.1, Node.2 and Node.3.]
Pro: High durability.
Pro: High availability.
Pro: Data completeness is guaranteed.
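A last-write-wins read can be sketched as picking the reply with the greatest timestamp among the nodes that answered. The `(timestamp, value)` tuple layout and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Versioned = Tuple[int, str]  # (timestamp, value); layout assumed for this sketch


def resolve_last_write_wins(replies: List[Optional[Versioned]]) -> Optional[Versioned]:
    """Return the reply with the greatest timestamp, or None if no node holds a value."""
    seen = [r for r in replies if r is not None]
    if not seen:
        return None
    return max(seen, key=lambda tv: tv[0])
```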
Is it enough?
Quorum Write and Read..
Quorum Write and Read... W + R > N
Consistency:
Eventual
Transactionality:
Non-Atomic-Update
Dirty-Read
Lost-Update
http://en.wikipedia.org/wiki/Concurrency_control
An Imaginary Storage Service
● A storage system with 3 nodes(processes).
● Policy: Quorum RW.
● It stores only one variable “i”.
● “i” has multiple versions: i1, i2, i3…
● Commands:
get /* read latest “i” */
set <n> /* assign <n> to “i” */
inc <n> /* increment “i” by <n> */
It shows the deficiencies of Quorum RW
and how Paxos solves them.
An Imaginary Storage Service.
"set" → Quorum Write.
"inc" → the simplest transactional operation:
1. Read latest “i” with Quorum Read: i1
2. Let i2 = i1 + n
3. set i2
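[Diagram: the three nodes initially hold (i1=2, i1=2, i0=0). X runs "get i" and reads i1=2, computes i2 = i1 + 1, and "set i2=3" returns OK, leaving the nodes at (i2=3, i1=2, i2=3). A second write to the same version, "set i2=4", is also shown.]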
An Imaginary Storage Service..
[Diagram: X and Y both run "get i" and read i1=2. X computes i2 = i1 + 1; Y computes i2 = i1 + 2. Node states shown: (i1=2, i1=2, i0=0), then (i2=3, i1=2, i2=3), then (i3=5, i1=2, i3=5).]
We expect X to be able to get i3=5.
This requires Y's write of i2 to “fail” after X wrote i2. How do we do that?
Y's write must fail, or the existing value will be overwritten.
Y should then run Quorum Read and Quorum Write again...
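The lost update can be reproduced in a few lines. This is a simplified sketch: all three nodes start at i1=2 (the slide has one stale node), a node accepts any write whose version is at least its own, and the class and function names are hypothetical.

```python
from typing import List, Tuple


class Node:
    def __init__(self) -> None:
        self.ver, self.val = 1, 2  # i1 = 2

    def get(self) -> Tuple[int, int]:
        return (self.ver, self.val)

    def set(self, ver: int, val: int) -> None:
        # Plain quorum write: nothing stops a second writer of the same version.
        if ver >= self.ver:
            self.ver, self.val = ver, val


def quorum_get(nodes: List[Node]) -> Tuple[int, int]:
    return max(n.get() for n in nodes)  # newest version wins


def quorum_set(nodes: List[Node], ver: int, val: int) -> None:
    for n in nodes:
        n.set(ver, val)


def inc_race() -> int:
    nodes = [Node(), Node(), Node()]
    # X and Y both read i1 = 2 before either of them writes.
    x_ver, x_val = quorum_get(nodes[:2])
    y_ver, y_val = quorum_get(nodes[1:])
    quorum_set(nodes[:2], x_ver + 1, x_val + 1)  # X: inc 1, writes i2 = 3
    quorum_set(nodes[1:], y_ver + 1, y_val + 2)  # Y: inc 2, writes i2 = 4
    return quorum_get(nodes)[1]
```

`inc_race()` returns 4 rather than the expected 5: Y's write to i2 silently replaced X's.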
An Imaginary Storage Service...
In order to correctly get i3 after 2 “inc” operations:
There can be only ONE successful “write” operation
to a given version of “i” (in our case: i2).
Generalization:
One value (one version of a variable) must not be
modified any more once it is determined (the client received
“OK” and believes it is stored).
How to define “determined”?
How to avoid changing a “determined” value?
Determine a Value
[Diagram: X asks a quorum “Any value set?”, is told “No”, and writes X to two of the three nodes. Y then asks “Any value set?”, is told “Yes”, and gives up.]
Solution: Before writing a value, run a Quorum Read
round to check whether a value already exists (or may exist).
Determine a Value.
[Diagram: X and Y both ask “Any value set?” before either has written, and both are told “No”.]
But then both X and Y believe that no value is set,
and both start to write at the same time.
→ Lost Update
Determine a Value..
[Diagram: X asks nodes 1 and 2 “Any value set?” and is told “No”; this Quorum Read+Write also records X as the last reader. Y then asks nodes 2 and 3, is told “No”, and is recorded as the last reader there. In the final state X is stored on node 1 only.]
Improved solution: remember who did the last read, and
deny writes from previous readers.
After X's read, nodes 1 and 2 will only accept
requests from X.
After Y's read, nodes 2 and 3 will only accept
requests from Y.
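The "remember the last reader" policy can be sketched as follows. The `Node` class, its method names and the rule that a node refuses to overwrite an existing value are illustrative assumptions, not the author's code.

```python
from typing import Optional, Tuple


class Node:
    def __init__(self) -> None:
        self.value: Optional[str] = None
        self.last_reader: Optional[str] = None

    def read(self, who: str) -> Optional[str]:
        # A read also writes: it records who read last.
        self.last_reader = who
        return self.value

    def write(self, who: str, value: str) -> bool:
        # Deny writes from anyone but the last reader; never overwrite a value.
        if who != self.last_reader or self.value is not None:
            return False
        self.value = value
        return True


def race() -> Tuple[int, int]:
    n1, n2, n3 = Node(), Node(), Node()
    n1.read("X"); n2.read("X")   # X reads nodes 1 and 2
    n2.read("Y"); n3.read("Y")   # Y reads nodes 2 and 3, displacing X on node 2
    x_acks = sum([n1.write("X", "x"), n2.write("X", "x")])
    y_acks = sum([n2.write("Y", "y"), n3.write("Y", "y")])
    return x_acks, y_acks
```

Here `race()` returns `(1, 2)`: X reaches only one node and fails, while Y reaches a quorum of two.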
Determine a Value...
By applying this policy, a value (each version of “i” in our
case) can be stored safely and consistently.
Leslie Lamport wrote a paper on this policy:
Paxos
What is Paxos
● A reliable storage, based on Quorum RW.
● Each Paxos instance stores only 1 value.
● 2 rounds are required to determine 1 value.
● A value can’t be modified once it is determined.
● “Determined” means accepted by a
quorum (> n/2).
● Immediate consistency.
Paxos
Classic Paxos
2 rounds per instance.
Multi Paxos
~1 round per instance.
Fast Paxos
1 round per instance ( without conflict ).
2 rounds per instance ( with conflict ).
Paxos: Precondition
Storage must be reliable:
No data loss
/* Otherwise you need Byzantine Paxos */
Tolerates:
Message loss
Messages arriving in arbitrary order
Paxos: Concepts
Proposer: the process that starts a Paxos round to write something.
Acceptor: the process that receives and stores messages.
Quorum (of Acceptors): n/2+1 Acceptors.
Round: consists of 2 phases, Phase-1 and Phase-2.
Round Number (rnd):
ID of a round.
Monotonically increasing; last wins; universally unique.
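One common way to obtain round numbers that are both increasing and unique across Proposers is to pair a counter with a Proposer ID and compare the pairs lexicographically. The slides do not say how rnd is generated, so the scheme and the names below are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

Rnd = Tuple[int, int]  # (counter, proposer_id), compared lexicographically


@dataclass
class RoundGenerator:
    proposer_id: int  # assumed to be distinct for every Proposer
    counter: int = 0

    def next(self, seen: Rnd = (0, 0)) -> Rnd:
        """Return a rnd greater than our previous ones and than any rnd we have seen."""
        self.counter = max(self.counter, seen[0]) + 1
        return (self.counter, self.proposer_id)
```

Two Proposers can never produce the same rnd because the second component differs, and passing the largest rnd seen so far lets a Proposer jump ahead of a competitor.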
Paxos: Concepts.
Last Round Number (last_rnd):
The greatest rnd an Acceptor has ever seen.
It identifies the Proposer from which the Acceptor will
accept a write request.
Value (v): the value an Acceptor has accepted.
Value Round Number (vrnd):
The round in which the Acceptor accepted v.
Value determined:
A value is determined once it is accepted by a quorum of Acceptors.
Illustration of Acceptor
In the following slides, an Acceptor has 3 attributes
saved on it: last_rnd, v and vrnd.
[Diagram: an Acceptor labelled “5,x3”, read as last_rnd=5, v=x, vrnd=3.]
Paxos: Classic - phase 1
[Diagram: Proposer X sends a phase-1 request with rnd=1 to Acceptors 1, 2 and 3; the Acceptors that receive it reply with last_rnd=0, v=nil, vrnd=0 and record last_rnd=1.]
When an Acceptor receives a phase-1 request from a Proposer:
● Refuse the request if its rnd < last_rnd.
● Otherwise save the rnd from the phase-1 request as its last_rnd.
● From now on it only accepts phase-2 requests carrying this
last_rnd.
● Respond with the last_rnd, v and vrnd it held before this request.
Paxos: Classic - phase 1.
[Diagram: Proposer X has sent phase-1 with rnd=1; the Acceptors reply with last_rnd=0, v=nil, vrnd=0.]
When the Proposer receives replies from Acceptors:
● If any reply carries a last_rnd > rnd: abandon this round.
● If any reply has a non-nil v: choose the v with the greatest vrnd.
● Otherwise: choose the v that the Proposer itself wants to write.
● If fewer than a quorum (n/2+1) of replies arrive: fail this round.
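The Proposer's decision after phase 1 fits in one function. This sketch assumes replies are `(last_rnd, v, vrnd)` records and returns `None` when the round has to be abandoned; `Reply` and `choose_value` are illustrative names.

```python
from typing import List, NamedTuple, Optional


class Reply(NamedTuple):
    last_rnd: int
    v: Optional[str]
    vrnd: int


def choose_value(rnd: int, replies: List[Reply], n: int, my_v: str) -> Optional[str]:
    """Return the value to send in phase 2, or None if this round must fail."""
    quorum = n // 2 + 1
    if len(replies) < quorum:
        return None  # not enough Acceptors answered
    if any(r.last_rnd > rnd for r in replies):
        return None  # someone started a newer round
    accepted = [r for r in replies if r.v is not None]
    if accepted:
        # An existing value may already be determined: keep the newest one.
        return max(accepted, key=lambda r: r.vrnd).v
    return my_v
```

With replies `Reply(2, "y", 2)` and `Reply(2, "x", 1)` in round 3, the function picks "y" even though the Proposer wanted to write something else.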
Paxos: Classic - phase 2
[Diagram: Proposer X sends phase-2 with v="x", rnd=1 to Acceptors 1, 2 and 3; two Acceptors reply “Accepted” and now hold “1,x1” (last_rnd=1, v=x, vrnd=1).]
Proposer:
Send a phase-2 request carrying the v chosen in the previous step to the
Acceptors.
Paxos: Classic - phase 2.
[Diagram: Proposer X's phase-2 request (v="x", rnd=1) is accepted by two of the three Acceptors.]
Acceptor:
● Accept a request only if its rnd equals the Acceptor's last_rnd.
last_rnd == rnd guarantees that no other Proposer
has touched this Acceptor since phase-1.
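Putting both phases together, a single Paxos instance can be sketched in memory as below. It is a toy model under stated assumptions: no networking or persistence, integer rnd values chosen by the caller, and a `propose` helper that talks only to the Acceptors it can reach. All names are hypothetical.

```python
from typing import List, Optional, Tuple


class Acceptor:
    def __init__(self) -> None:
        self.last_rnd = 0
        self.v: Optional[str] = None
        self.vrnd = 0

    def phase1(self, rnd: int) -> Optional[Tuple[int, Optional[str], int]]:
        if rnd < self.last_rnd:
            return None  # refuse stale rounds
        previous = (self.last_rnd, self.v, self.vrnd)
        self.last_rnd = rnd
        return previous

    def phase2(self, rnd: int, v: str) -> bool:
        if rnd != self.last_rnd:
            return False  # another Proposer touched this Acceptor
        self.v, self.vrnd = v, rnd
        return True


def propose(reachable: List[Acceptor], n: int, rnd: int, my_v: str) -> Optional[str]:
    """Run one round; return the value written, or None if the round failed."""
    quorum = n // 2 + 1
    replies = [r for r in (a.phase1(rnd) for a in reachable) if r is not None]
    if len(replies) < quorum:
        return None
    accepted = [r for r in replies if r[1] is not None]
    v = max(accepted, key=lambda r: r[2])[1] if accepted else my_v
    acks = sum(a.phase2(rnd, v) for a in reachable)
    return v if acks >= quorum else None
```

For example, after `propose([a1, a2], 3, 1, "x")` succeeds, a later `propose([a2, a3], 3, 2, "y")` finds "x" on `a2` and writes "x" again instead of "y".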
Paxos: Case 1: Classic, no Conflict
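[Diagram: Proposer X and Acceptors 1, 2, 3, all initially empty. Phase 1: X sends rnd=1; two Acceptors reply last_rnd=0, v=nil, vrnd=0 and record last_rnd=1. Phase 2: X sends v="x", rnd=1; both reply “Accepted” and hold v=x, vrnd=1, so “x” is accepted by a quorum.]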
Paxos: Case 2.1: Resolve Conflict
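[Diagram, over time: Phase 1 for X (rnd=1) reaches two Acceptors, which record last_rnd=1. Phase 1 for Y (rnd=2) reaches two Acceptors, which reply “OK, forget X” and record last_rnd=2. Phase 2: X sends v="x", rnd=1 and fails; Y sends v="y", rnd=2 and gets OK. Final Acceptor states: (2,y2), (1,x1), (2,y2).]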
Paxos: Case 2.2: Respect Existed v
[Diagram: starting from Acceptor states (2,y2), (1,x1), (2,y2), Proposer X runs round 3. Phase 1: X sends rnd=3 and receives v="y", vrnd=2 and v="x", vrnd=1; it chooses 'y'. Phase 2: X sends v="y" with rnd=3 and gets OK; the Acceptors that accept it then hold (3,y3).]
Proposer X must choose v=“y” because “y” may
be a determined value and must not be overwritten.
However, without checking the 3rd Acceptor we cannot
know whether “y” is actually determined (accepted by a
quorum).
Paxos........
Learner:
● Acceptors send a phase-3 message to Learners to inform them
that a value has been determined.
● Most of the time a Proposer can also act as a Learner.
Livelock:
Proposers keep raising their rnd and overwriting each other's
last_rnd on the Acceptors, so no phase-2 can ever
succeed.
Multi Paxos
Combine multiple phase-1 requests into one
message.
Send each phase-2 request separately.
Applications:
Chubby, ZooKeeper, Megastore, Spanner
Fast Paxos
● Proposers send phase-2 without sending phase-1.
● rnd in a Fast Paxos phase-2 is 0.
rnd=0 because it must be lower than any Classic rnd,
so the instance can safely fall back to Classic Paxos.
● An Acceptor accepts a fast phase-2 request only when its v=nil.
● If a conflict happens, the Proposer should fall back to Classic
Paxos with a rnd > 0.
Is Fast Paxos as cheap as Classic Paxos?
Fast Paxos Quorum
[Diagram: 5 Acceptors, all initially empty. X sends a fast phase-2 (rnd=0) and gets OK; Y sends a fast phase-2 (rnd=0), reaches only 2/5 and fails. Some Acceptor states are unknown to Y (shown as “?”).]
If the Fast Paxos quorum were n/2+1 = 3:
When Y finds the conflict and falls back to Classic Paxos,
there is no way for Y to know whether x0 or y0 is a determined value.
Solution: an undetermined value must not be able to occupy half of a classic quorum (n/2+1 Acceptors):
→ Fast quorum > n*¾;
→ A value is determined in a fast round if it is accepted by floor(n*¾)+1 Acceptors.
Fast Paxos Quorum.
Fast Paxos quorum > n*¾
Availability becomes lower because Fast Paxos requires
more Acceptors to be working.
Fast Paxos requires at least 5 Acceptors in order to tolerate
one failed Acceptor.
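The quorum sizes and the "at least 5 Acceptors" claim can be checked numerically. The sketch takes the fast quorum to be the smallest integer greater than n*¾, which is how the rule above reads; the function names are illustrative.

```python
def classic_quorum(n: int) -> int:
    """Smallest majority: n/2 + 1 with integer division."""
    return n // 2 + 1


def fast_quorum(n: int) -> int:
    """Smallest integer strictly greater than 3n/4."""
    return (3 * n) // 4 + 1


def tolerated_failures(n: int, quorum: int) -> int:
    """How many Acceptors may be down while a quorum is still reachable."""
    return n - quorum


def min_acceptors_for_fast(failures: int) -> int:
    """Smallest n whose fast quorum survives the given number of failed Acceptors."""
    n = 1
    while tolerated_failures(n, fast_quorum(n)) < failures:
        n += 1
    return n
```

With 5 Acceptors the classic quorum is 3 (two failures tolerated) while the fast quorum is 4 (one failure tolerated).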
Fast Paxos ⅘: Y has a Conflict
[Diagram: 5 Acceptors, fast quorum 4. X's fast phase-2 (rnd=0) is accepted by 4 Acceptors (OK); Y's fast phase-2 (rnd=0) reaches 1/5 and fails. Y falls back to Classic Paxos with rnd=2: phase 1 returns OK with "x", and in phase 2 Y writes "x".]
Y saw two x0 on 3 Acceptors.
Y must choose x0 because x0 might be a determined value.
y0 cannot be determined:
even if the other two untouched Acceptors both hold y0,
there are not enough y0 (more than 5*¾ are needed) to form a quorum.
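Y's reasoning is a counting argument: a value might be determined only if its observed copies plus all unseen Acceptors could still add up to a fast quorum. The sketch below is a simplification of Fast Paxos recovery built on that argument alone; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def fast_quorum(n: int) -> int:
    """Smallest integer strictly greater than 3n/4."""
    return (3 * n) // 4 + 1


def possibly_determined(n: int, seen: List[Optional[str]]) -> List[str]:
    """Values that may have reached a fast quorum, given the Acceptors observed so far."""
    unseen = n - len(seen)
    counts = Counter(v for v in seen if v is not None)
    return sorted(v for v, c in counts.items() if c + unseen >= fast_quorum(n))
```

With n=5 and observations `["y", "x", "x"]`, only "x" can have reached 4 Acceptors, so "x" must be kept.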
Fast Paxos ⅘: X Y conflicts
[Diagram: 5 Acceptors, fast quorum 4. X's fast phase-2 (rnd=0) stores x0 on 3 Acceptors and Y's stores y0 on 2, so both see a conflict. X starts Classic phase 1 with rnd=1 and gets OK, seeing only "x"; Y starts Classic phase 1 with rnd=2, gets OK and chooses "y". Y's phase 2 succeeds and the Acceptors hold (2,y2); X fails in phase 2.]
Note
In phase-2, it is also correct for an Acceptor to accept a
request with rnd >= last_rnd.
Q&A
Thanks
drdr.xp@gmail.com
http://drmingdrmer.github.io
weibo.com: @drdrxp


Paxos building-reliable-system

  • 2. Background Several processes do one thing. The only problem in a distributed system is achieving consensus. Paxos: the core of distributed systems.
  • 3. Agenda 1. Problem 2. Replication is not enough 3. Paxos Algorithm 4. Paxos Optimization
  • 4. Problem Required: Durability: 99.99999999%; Availability: 99.99%. What we have: Hard drive: 4% annual failure rate; Server downtime: 0.1% or longer; Packet loss between IDCs: 5% ~ 30%.
  • 5. Solution(Maybe) Multiple Replicas. No data loss if x (x < n) replicas are lost. Data loss rate: 1 replica: ~ 0.63%; 2 replicas: ~ 0.00395%; 3 replicas: < 0.000001%. Durability with n replicas = 1 - x^n /* x = failure rate of single replica */
  • 6. Solution. How to replicate data? Besides number of replicas: Availability Atomicity Consistency ...
  • 7. Fundamental Replication Algorithms Master-Slave Async Master-Slave Sync Master-Slave Semi-Sync Quorum Write and Read
  • 8. Master-Slave Async The MySQL way. 1. Master received write op. 2. Master wrote to disk. 3. Master responded ‘OK’. 4. Master replicated to slaves. If the disk fails before replication → data loss. [Figure: timeline across Client, Master, Slave.1 and Slave.2, with a disk failure marked.]
  • 9. Master-Slave Sync 1. Master received write op. 2. Master replicated log to slaves. 3. Slave may block... 4. Client won’t receive ‘OK’ until all slaves respond. One unreachable node halts the entire system. Pro: no data loss. Con: low availability. [Figure: timeline across Client, Master, Slave.1 and Slave.2.]
  • 10. Master-Slave Semi-Sync 1. Master received write op. 2. Master replicated log to slaves. 3. Slave may block... 4. Client receives ‘OK’ once [1,n) slaves have responded. Pro: high durability. Pro: high availability. Con: no slave has all data → we need Quorum Write. [Figure: timeline across Client, Master, Slave.1 and Slave.2.]
  • 11. Quorum Write and Read Dynamo / Cassandra. Write to W >= N/2+1 nodes. No master required. Read from R >= N/2+1 nodes. W + R > N. Tolerates up to (N-1)/2 failed nodes. [Figure: timeline across Client, Node.1, Node.2 and Node.3.]
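The W + R > N condition can be checked by brute force for a small cluster. The following is a minimal sketch, not the deck's own code; the helper names `majority` and `quorums_always_overlap` are made up for illustration.

```python
from itertools import combinations

def majority(n: int) -> int:
    """Smallest node count that is strictly more than half of n (N/2+1)."""
    return n // 2 + 1

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every write set of size w shares a node with every read set of size r."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )

n = 3
w = r = majority(n)                       # W = R = 2 for N = 3
print(quorums_always_overlap(n, w, r))    # True:  W + R > N
print(quorums_always_overlap(n, 1, 2))    # False: W + R = N, a read can miss the write
print((n - 1) // 2)                       # 1 failed node tolerated
```

Because every read quorum meets every write quorum, a reader always contacts at least one node that holds the latest write.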
  • 12. Quorum Write and Read. Last-Win The last write wins. Writes are totally ordered by timestamp. [Figure: timeline across Client, Node.1, Node.2 and Node.3.]
  • 13. Quorum Write and Read.. Pro: high durability. Pro: high availability. Pro: data completeness is guaranteed. Is it enough?
  • 14. Quorum Write and Read... W + R > N. Consistency: Eventual. Transactionality: Non-Atomic-Update, Dirty-Read, Lost-Update. http://en.wikipedia.org/wiki/Concurrency_control
  • 15. An Imaginary Storage Service ● A storage system with 3 nodes (processes). ● Policy: Quorum RW. ● It stores only one variable “i”. ● “i” has multiple versions: i1, i2, i3… ● Commands: get /* read latest “i” */ set <n> /* assign <n> to “i” */ inc <n> /* increment “i” by <n> */ It shows the deficiency of Quorum RW and how Paxos solves these problems.
  • 16. An Imaginary Storage Service. "set" → Quorum Write. "inc" → the simplest transactional operation: 1. Read latest “i” with Quorum Read: i1. 2. Let i2 = i1 + n. 3. set i2. [Figure: X runs get i and reads i1=2, computes i2 = i1 + 1, then runs set i2=3.]
  • 17. An Imaginary Storage Service.. X: get i reads i1=2; i2 = i1 + 1; set i2=3 → OK. Y: get i reads i1=2; i2 = i1 + 2; set i2=4 → must fail, or the existing value will be overwritten. We expect X to be able to get i3=5. This requires Y to “fail” after X wrote i2. How do we do that? Y should then run Quorum Read and Quorum Write again...
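The lost update on this slide can be reproduced with a toy model of the 3-node store. This is a sketch under assumptions: the in-memory `nodes` list, the helper names, and which nodes X and Y happen to contact are all illustrative, not taken from the deck.

```python
N = 3
QUORUM = N // 2 + 1

# Each node maps version number -> value of the single variable "i".
nodes = [{1: 2} for _ in range(N)]           # i1 = 2 on every node

def quorum_read(node_ids):
    """Return (latest version, its value) as seen on the contacted nodes."""
    assert len(node_ids) >= QUORUM
    ver = max(max(nodes[k]) for k in node_ids)
    val = next(nodes[k][ver] for k in node_ids if ver in nodes[k])
    return ver, val

def quorum_write(node_ids, ver, val):
    """Blindly store val as version ver on the contacted nodes."""
    assert len(node_ids) >= QUORUM
    for k in node_ids:
        nodes[k][ver] = val

# X and Y both read i1=2 before either of them writes.
x_ver, x_val = quorum_read([0, 1])
y_ver, y_val = quorum_read([1, 2])

quorum_write([0, 1], x_ver + 1, x_val + 1)   # X: inc 1 -> set i2=3
quorum_write([1, 2], y_ver + 1, y_val + 2)   # Y: inc 2 -> set i2=4, not rejected

i2_per_node = [node[2] for node in nodes]
print(i2_per_node)                            # [3, 4, 4]: nodes disagree on i2
```

Neither increment builds on the other, so no node ever reaches the expected total of 5. Plain Quorum RW has no way to make Y's second write to i2 fail.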
  • 18. An Imaginary Storage Service... In order to correctly get i3 after 2 “inc” operations, there can be only ONE successful “write” operation to a given version of “i” (in our case: i2). Generalization: a value (one version of a variable) must not be modified any more after it is determined (the client received “OK” and believes it is stored). How to define “determined”? How to avoid changing a “determined” value?
  • 19. Determine a Value Solution: before writing a value, run a Quorum Read round to check whether such a value exists (or may exist). [Figure: X asks "Any value set?", gets "No", and writes X to two nodes. Y then asks "Any value set?", gets "Yes", and gives up.]
  • 20. Determine a Value. But both X and Y may believe there is no value set. X and Y will then both start to write at the same time: Lost Update. [Figure: X asks "Any value set?" and gets "No"; Y asks "Any value set?" and also gets "No"; both then write, and the nodes end up holding a mix of X and Y.]
  • 21. Determine a Value.. Solution improved: remember who did the last read, and deny writes from previous readers. X asks "Any value set?" with a Quorum Read+Write (remember X is the last reader) and gets "No"; now nodes 1 and 2 will only accept requests from X. Y asks "Any value set?" with a Quorum Read+Write (remember Y is the last reader) and gets "No"; now nodes 2 and 3 will only accept requests from Y.
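A minimal sketch of the "remember the last reader" policy follows. The `Node` class and the client names are assumptions made for illustration; each node stores one value plus the identity of the most recent reader.

```python
class Node:
    """One storage node: a single value plus the identity of the last reader."""

    def __init__(self):
        self.value = None
        self.last_reader = None

    def read(self, client):
        """Quorum Read+Write: report the stored value and remember the reader."""
        self.last_reader = client
        return self.value

    def write(self, client, value):
        """Accept a write only from the most recent reader."""
        if client != self.last_reader:
            return False
        self.value = value
        return True

QUORUM = 2
nodes = [Node(), Node(), Node()]

x_seen = [nodes[k].read("X") for k in (0, 1)]    # X reads nodes 1 and 2
y_seen = [nodes[k].read("Y") for k in (1, 2)]    # Y reads nodes 2 and 3

x_acks = sum(nodes[k].write("X", "x") for k in (0, 1))
y_acks = sum(nodes[k].write("Y", "y") for k in (1, 2))

print(x_acks >= QUORUM, y_acks >= QUORUM)        # False True
```

Node 2 was read by Y after X, so it denies X's write. X gets only one acknowledgement and fails, while Y reaches a quorum.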
  • 22. Determine a Value... By applying this policy, a value (each version of “i” in our case) can be stored safely and consistently. Leslie Lamport wrote a paper about this policy.
  • 23. Paxos
  • 24. What is Paxos ● A reliable storage: based on Quorum RW. ● Each Paxos instance stores only 1 value. ● 2 rounds are required to determine 1 value. ● A value can’t be modified after it is determined. ● “Determined” means accepted by a quorum (> n/2). ● Immediate Consistency.
  • 25. Paxos Classic Paxos: 2 rounds per instance. Multi Paxos: ~1 round per instance. Fast Paxos: 1 round per instance (without conflict); 2 rounds per instance (with conflict).
  • 26. Paxos: Precondition Storage must be reliable: no data loss /* Or it falls back to Byzantine Paxos */. Tolerates: message loss; messages arriving in random order.
  • 27. Paxos: Concepts Proposer: process that starts a Paxos round to write something. Acceptor: process that receives and stores messages. Quorum (of Acceptors): n/2+1 Acceptors. Round: consists of 2 phases, Phase-1 and Phase-2. Round Number (rnd): ID of a round; monotonically increasing; Last-Win; universally unique.
  • 28. Paxos: Concepts. Last Round Number (last_rnd): greatest rnd an Acceptor has ever seen; identifies the Proposer from which the Acceptor will accept a write request. Value (v): the value an Acceptor accepted. Value round number (vrnd): the round at which an Acceptor accepted v. Value determined: the value has been accepted by a quorum of Acceptors.
  • 29. Illustration of Acceptor In the following slides, an Acceptor has 3 attributes saved on it: last_rnd, v and vrnd. [Figure: an Acceptor drawn as "5,x3", labeled last_rnd=5, v=x, vrnd=3.]
  • 30. Paxos: Classic - phase 1 When an Acceptor receives a request from a Proposer: ● Refuse requests whose rnd < last_rnd. ● Save the rnd from the phase-1 request into its last_rnd. ● From now on, accept only phase-2 requests with this last_rnd. ● Respond with the last_rnd, v and vrnd it has previously accepted. [Figure: Phase 1. Proposer X sends rnd=1 to Acceptors 1, 2, 3; the replies carry last_rnd=0, v=nil, vrnd=0.]
  • 31. Paxos: Classic - phase 1. When the Proposer receives replies from Acceptors: ● If a last_rnd > rnd is found: discard this round. ● If there is a non-nil v: choose the v with the greatest vrnd. ● Otherwise: choose the v that the Proposer wants to write. ● If fewer than (n+1)/2 responses are received: fail this round. [Figure: Phase 1. Proposer X sends rnd=1 to Acceptors 1, 2, 3; the replies carry last_rnd=0, v=nil, vrnd=0.]
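The Proposer's handling of phase-1 replies can be sketched as a single function. The `Reply` tuple and the name `choose_value` are assumptions for illustration; returning `None` is this sketch's way of signalling that the round must be abandoned.

```python
from typing import NamedTuple, Optional, Sequence

class Reply(NamedTuple):
    """What one Acceptor reports in phase 1."""
    last_rnd: int
    v: Optional[str]
    vrnd: int

def choose_value(rnd: int, n: int, replies: Sequence[Reply],
                 my_value: str) -> Optional[str]:
    """Return the value to send in phase 2, or None if the round fails."""
    if 2 * len(replies) < n + 1:                # fewer than (n+1)/2 responses
        return None
    if any(r.last_rnd > rnd for r in replies):  # a newer round was seen
        return None
    accepted = [r for r in replies if r.v is not None]
    if accepted:                                # respect an existing value
        return max(accepted, key=lambda r: r.vrnd).v
    return my_value                             # free to write our own value

# No value accepted anywhere: the Proposer writes its own value.
print(choose_value(1, 3, [Reply(0, None, 0), Reply(0, None, 0)], "x"))  # x
# Values already accepted: the one with the greatest vrnd is adopted.
print(choose_value(3, 3, [Reply(2, "y", 2), Reply(1, "x", 1)], "z"))    # y
```

Adopting the value with the greatest vrnd is what keeps a possibly determined value from being overwritten.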
  • 32. Paxos: Classic - phase 2 Proposer: send phase-2 with the v chosen in the previous step to Acceptors. [Figure: Phase 2. Proposer X sends v="x", rnd=1 to Acceptors 1, 2, 3; the replies are "Accepted" and the accepting Acceptors now hold v=x, vrnd=1.]
  • 33. Paxos: Classic - phase 2. Acceptor: ● Accept requests whose rnd equals its last_rnd. last_rnd==rnd guarantees that no other Proposer has touched this Acceptor. [Figure: Phase 2. Proposer X sends v="x", rnd=1 to Acceptors 1, 2, 3; the replies are "Accepted" and the accepting Acceptors now hold v=x, vrnd=1.]
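The Acceptor rules for both phases fit in a small class. This is a sketch under one stated assumption: a stale phase-1 request is "refused" by replying with the unchanged, larger last_rnd, so that the Proposer detects last_rnd > rnd and discards its round. The class and method names are illustrative.

```python
class Acceptor:
    """Holds the three attributes from the slides: last_rnd, v and vrnd."""

    def __init__(self):
        self.last_rnd = 0
        self.v = None
        self.vrnd = 0

    def phase1(self, rnd):
        """Reply with the previous (last_rnd, v, vrnd); adopt rnd unless it is stale."""
        reply = (self.last_rnd, self.v, self.vrnd)
        if rnd >= self.last_rnd:      # assumption: stale requests change nothing
            self.last_rnd = rnd
        return reply

    def phase2(self, rnd, v):
        """Accept v only if rnd is exactly the round remembered in phase 1."""
        if rnd != self.last_rnd:
            return False
        self.v, self.vrnd = v, rnd
        return True

a = Acceptor()
print(a.phase1(1))        # (0, None, 0)
print(a.phase1(2))        # (1, None, 0): a newer Proposer takes over
print(a.phase2(1, "x"))   # False: round 1 is no longer the last reader
print(a.phase2(2, "y"))   # True
```

After these calls the Acceptor holds last_rnd=2, v="y", vrnd=2, which the slides write as "2,y2".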
  • 34. Paxos: Case 1: Classic, no Conflict [Figure: Proposer X, Acceptors 1, 2, 3. Phase 1: X sends rnd=1; Acceptors reply last_rnd=0, v=nil, vrnd=0. Phase 2: X sends v="x", rnd=1; Acceptors reply "Accepted" and hold v=x, vrnd=1.]
  • 35. Paxos: Case 2.1: Resolve Conflict [Figure: timeline with Proposers X and Y. Phase 1 for X (round=1): X sends rnd=1. Phase 1 for Y (round=2): Y sends rnd=2; Acceptors reply OK and forget X. Phase 2: X sends v="x", rnd=1 → Fail; Y sends v="y", rnd=2 → OK. Final Acceptor states: 2,y2 / 1,x1 / 2,y2.]
  • 36. Paxos: Case 2.2: Respect Existing v v=“y” must be chosen by Proposer X because “y” may be a determined value and should not be overwritten, although without checking the 3rd Acceptor we do not know whether “y” is actually determined (accepted by a quorum). [Figure: round=3, starting from Acceptor states 2,y2 / 1,x1 / 2,y2. Phase 1: X sends rnd=3 and sees v="y",vrnd=2 and v="x",vrnd=1; it chooses "y". Phase 2: X sends v="y" with rnd=3 → OK; the accepting Acceptors hold 3,y3.]
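Cases 2.1 and 2.2 can be replayed end to end in a few lines. This is a sketch under assumptions: a compact dict-based Acceptor is declared inline so the block runs on its own, the choice of which Acceptors each Proposer contacts is inferred from the final states shown in the figures, and "z" is a made-up value that X would like to write in round 3.

```python
def new_acceptor():
    return {"last_rnd": 0, "v": None, "vrnd": 0}

def phase1(acc, rnd):
    """Reply with the previous state; remember rnd unless it is stale."""
    reply = (acc["last_rnd"], acc["v"], acc["vrnd"])
    if rnd >= acc["last_rnd"]:
        acc["last_rnd"] = rnd
    return reply

def phase2(acc, rnd, v):
    """Accept only when rnd equals the remembered last_rnd."""
    if rnd != acc["last_rnd"]:
        return False
    acc["v"], acc["vrnd"] = v, rnd
    return True

def state(acc):
    return (acc["last_rnd"], acc["v"], acc["vrnd"])

accs = [new_acceptor() for _ in range(3)]
QUORUM = 2

# Case 2.1: X (rnd=1) and Y (rnd=2) interleave.
for k in (0, 1):
    phase1(accs[k], 1)                                   # phase 1 for X
for k in (0, 2):
    phase1(accs[k], 2)                                   # phase 1 for Y
x_acks = sum(phase2(accs[k], 1, "x") for k in (0, 1))    # 1 ack: X fails
y_acks = sum(phase2(accs[k], 2, "y") for k in (0, 2))    # 2 acks: Y succeeds
after_case_2_1 = [state(a) for a in accs]

# Case 2.2: X retries with rnd=3 and must respect the existing value.
replies = [phase1(accs[k], 3) for k in (0, 1)]
seen = [(vrnd, v) for _, v, vrnd in replies if v is not None]
chosen = max(seen, key=lambda s: s[0])[1] if seen else "z"
x_acks_rnd3 = sum(phase2(accs[k], 3, chosen) for k in (0, 1))

print(after_case_2_1)      # [(2, 'y', 2), (1, 'x', 1), (2, 'y', 2)]
print(chosen, x_acks_rnd3) # y 2
```

X ends up writing "y" rather than its own value, so the value Y determined in round 2 is never overwritten.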
  • 37. Paxos........ Learner: ● Acceptors send a phase-3 message to Learners to inform them that a value has been determined. ● Most of the time a Proposer can also be a Learner. Livelock: Proposers continually raise their rnd and overwrite each other’s last_rnd on Acceptors, so no phase-2 can complete successfully.
  • 38. Multi Paxos Combine multiple phase-1 requests into one message. Send each phase-2 request separately. Applications: Chubby, ZooKeeper, Megastore, Spanner.
  • 39. Fast Paxos ● Proposers send phase-2 without sending phase-1. ● rnd in a Fast Paxos phase-2 is 0, because it must be lower than any Classic rnd so that it can fall back to Classic Paxos safely. ● An Acceptor accepts a Fast phase-2 only when v=nil. ● If a conflict happens, the Proposer should fall back to Classic Paxos with a rnd > 0. Is Fast Paxos as cheap as Classic Paxos?
  • 40. Fast Paxos Quorum If the quorum of Fast Paxos is n/2+1 = 3: when Y finds a conflict and falls back to Classic Paxos, there is no way for Y to know whether x0 or y0 is a determined value. Solution: an undetermined value must not occupy half of the n/2+1 Acceptors: → Fast quorum > n*¾; → a value is determined in a Fast Round if it is accepted by n*¾+1 Acceptors. [Figure: 5 Acceptors. X fast rnd=0 phase 2: OK. Y fast rnd=0 phase 2: 2/5, fails. Y cannot tell whether x0 or y0 is determined.]
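The quorum sizes implied by this slide can be tabulated with integer arithmetic. This is a sketch that follows the slide's rule (fast quorum is the smallest integer greater than n*¾); the function names are made up for illustration.

```python
def classic_quorum(n: int) -> int:
    """Classic Paxos quorum: n/2+1 Acceptors."""
    return n // 2 + 1

def fast_quorum(n: int) -> int:
    """Smallest Acceptor count strictly greater than n*3/4."""
    return n * 3 // 4 + 1

def fast_failures_tolerated(n: int) -> int:
    """How many Acceptors may be down while a fast round can still succeed."""
    return n - fast_quorum(n)

for n in (3, 4, 5, 7):
    print(n, classic_quorum(n), fast_quorum(n), fast_failures_tolerated(n))
# 3 2 3 0
# 4 3 4 0
# 5 3 4 1
# 7 4 6 1
```

With this rule, 5 is the smallest cluster size whose fast round survives one failed Acceptor.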
  • 41. Fast Paxos Quorum. Fast Paxos Quorum = n*¾ Availability becomes lower because Fast Paxos requires more Acceptors to work. Fast Paxos requires at least 5 Acceptors in order to tolerate one failed Acceptor.
  • 42. Fast Paxos ⅘: Y has a Conflict Y saw two x0 on 3 Acceptors. Y must choose x0 because x0 might be a determined value. y0 cannot be determined because even if the other two untouched Acceptors both have y0, there are not enough (5*¾) y0 to form a quorum. [Figure: 5 Acceptors. X fast rnd=0 phase 2: OK. Y fast rnd=0 phase 2: 1/5, fail. Y classic rnd=2 phase 1: OK, sees "x". Y phase 2: OK, writes "x".]
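The counting argument on this slide can be written as a one-line check: a value might be determined only if the copies Y saw, plus every Acceptor Y did not read, could still add up to a fast quorum. This is a sketch; the function name and the assumption that Y saw one y0 among the 3 Acceptors it read are illustrative.

```python
def could_be_determined(seen: int, unread: int, fast_quorum: int) -> bool:
    """True if `seen` copies plus all `unread` Acceptors could reach the fast quorum."""
    return seen + unread >= fast_quorum

n = 5
read = 3                 # Y's classic phase 1 reached 3 Acceptors
unread = n - read        # 2 Acceptors were not contacted

# Fast quorum of 4 (the 4/5 case): only x0 can be a determined value.
print(could_be_determined(2, unread, 4))   # True:  x0 seen twice
print(could_be_determined(1, unread, 4))   # False: y0 seen once

# Fast quorum of 3 (n/2+1): both values remain possible, so Y cannot decide.
print(could_be_determined(2, unread, 3))   # True
print(could_be_determined(1, unread, 3))   # True
```

With a fast quorum of 4 exactly one candidate survives, which is why Y can safely adopt "x".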
  • 43. Fast Paxos ⅘: X Y conflicts [Figure: 5 Acceptors hold 0,x0 0,x0 0,x0 0,y0 0,y0 after the fast round. X fast rnd=0 phase 2: conflict. Y fast rnd=0 phase 2: conflict. X classic rnd=1 phase 1: OK, sees only "x". Y classic rnd=2 phase 1: OK, chooses "y". Y phase 2: Acceptors reach 2,y2. X fails in phase 2.]
  • 44. Note In phase-2, it is also correct if an Acceptor accepts a request with rnd >= last_rnd.