Wednesday, 4 June 2014

Consensus Protocols : Overview

Consensus Problem :  Mutually Agreeing on a particular Value


In this I have Considered both Txn and Normal operations consensus . 


Synchronous System :
Bounded Message Delay
No timeouts

Asynchronous : 
Message delay finite but unbounded
Consensus not always possible

Correctness of Consensus Protocol :
Agreement : All Nodes agree on a value
Validity : Value proposed is Valid ie.. It is proposed by one of participant node.

Non Triviality* : There exist a accessible configuration in which the decision is to commit
Integrity*: Each (RM decides atmost once)

Termination: All nodes Eventually Terminates
*For Txn commit

Agreement can be between majority of nodes.
Database Commit protocol that always says Abbort is not Valid
Termination of messages exchanged so that the protocol is useful.

2PC : 2 Phase Commit: 

1)Coordinator : Contact all nodes, suggest value , gather response(Form of Vote)
2)If everyone agrees contact for committing else contact for abort.




IMP: Consensus here is not on value proposed but weather to accept or reject the value nor the value.
Not on what value should but rather weather to accept or reject value.

Real Time Example
Transfer money from one account to another
Move file between servers

TO check weather commit happened see log.

Phase 1: Coordinator Request
Coordinator request (REQUEST Message to all nodes)
Eg : Delete file from directory

Nodes receive requests execute tranxn locally
Write to local log Request ,Result,VOTE_COMMIT or VOTE_ABBORT
Send VOTE_COMMIT to cordinator

Coordinator receive vote VOTE_ABORT
Write GLOBAL_ABORT/COMMIT to Global Log and sends GLOBAL_ABORT/COMMIT to participant nodes.
No of messages : 3n where n is number of nodes. ( N - Value Propose, N-Vote , N -Commit/Abbort)

Failures: 
1)Fail Stop : Nodes Crash and Never recover
2)Fail Recover: Crash and recover after sometime
3)Byzantine: Divert from protocol specification .
Example an army of Soldiers, coordinator is a general . Agree on one point or fail. Some soldiers traitor :
All loyal traitors should agree
Small Number of traitors should not be able to trick loyal soldiers.

Case 1: Before messages exchanged server crashes. (No worries) bcz protocol never started.
Case 2: Some proposal messages sent but not all.

Some nodes received proposal starting 2PC ,and some havent recieved any proposal. If coordinator do not recover in time , nodes that have sent proposal keeps on waiting for things that never finished.
Protocol Blocked.

Have another node to become co-ordinator. When timeout occurs that node can finish task that co-ordinator started.

Contacts all nodes to find out to whom the nodes voted.
So all nodes has to keep their descision on persistent storage.

If one another participating node fails before recovery node has committed, cant be recovered.
Recovery node cant distinguish between all nodes having voted to commit or abort.
Coordinator is participant too and then it fails.
 
Cordinator logs the result in persistent storage.
Advantage : Low  message Complexity.

Link: 
http://the-paper-trail.org/blog/consensus-protocols-two-phase-commit/


2PC transaction Commit: 

TM has following states : init,Preparing, Commit and aborted.
Let Trxn cordinator TMi

Phase 1: Obtaining a decision 
Initiate TM
TM ask all participant to prepare to commit transaction
TM adds record<Prepare>to its recovery long and saves it in stable storage
Send Prepare to all sites at which t is executed.

After receiving all messgae RM(Resource Manager ) can decide if it can commit on transection.
If transaction can be commited adds the record <ready> to log.
Flush all record to storage
Send Ready msg to TM
Else add <no> to log and sends a msg abort to TM

Phase 2: Aborting a decision
T commited if only TM receives ready msg else abort
TM add decision <commit> or <abort> to log 
TM sends msg to each particpant informing of descison.
RM does the same on recieving msg (Stores in log)

 
2PC suffers from Single Point of Failure.


Failure 2PC TXN RM:
RM participating in TXn fails and then recovers and then examins log:
if log says <commit> no action
log says<abort>no action
log says <ready> RM must consult TM for the fate. IF commite, redo log and write commit
else write abort.

If site contains no record => RM failed before responding to prepare. TM abort txn and RM should undo T

Failure TXN TM:
If TM fails before completion RM decides fate
IF one of RM contains commit then T is commited else aborted.
If RM of any site doesnt contain ready, no decision can be taken <abort>
IF RM has ready then sited must wait for TM to recover. Blocking problem

TOTAL MSG: 1 to initiate TM, 2n in phase 1 n in phase 2 total : 3n+1 
4 msg delay
n+1 writes to stable storage n by RM and 1 by TM


_____________________________________________________

3 PC
Problem of 2 PC removed by an extra Phase.
Idea break 2PC second phase into 2 phases. 
1st Prepare to commit phase. 
Co-ordinated sends to all if it has received yes from all .Nodes store them and then sends msg to cordinator that prepare to commit was received.
Purpose is to communicate result of vote to every node so that the state can be recovered. 

Last phase : If coordinator receives result of prepare of commit from all replica it commits, if delivery is not confirmed and our system can handle f failure co-ordinator can go ahead once it recieves f+1 msg.
If co-ordinator crash at any time , recovery node can take over transaction and query state from any remaining node.
If anyother node  crashes , it can read form other nodes.
Problem :
Network was divided into 2 partitions and both had different results.
Works for fail-stop not for fail-recover.


Cordinator fails before it has received prepared to commit replies from nodes,new co-ordinator takes over.Recover Co-or will interrogates nodes to know of the decision and completes them. At same time main co-or recovers and relies it has not received replies from all times them out,sending abort to all.messages gets interleaved with commit msg of recovery co-or resulting in inconsistent state.
_____________________________________________________________________

PAXOS 
One Node act as proposer and is responsible for initiating protocol.
Acceptors other nodes.
Acceptors accept or reject .
Once majority have  accepted protocol can terminate.
Proposer sends propose req to acceptor, acceptor responds once acceptor agrees, proposer sends commit request to acceptor.

Paxos adds 2 main improvements to 2PC.
1) Order to determine what proposal can be accepted.
2) Consider accepted if majority has done that.

Every proposer tagged with unique sequence number . Number required to decide on ordering of which proposer came first,
Once proposal arrives acceptors sees whats the highest number of proposal they have received.If new proposer number > previous acceptor reply saying it wont accept any proposal less then the recieved seq num.
All proposers unique seq number.
if two proposers are agreed on by majority atleast one will be common to both.
When new proposal is made new majority will guarantee acceptor that saw both previous proposals or two acceptors that saw one each.

Once it receives message from majority, proposer can go ahead and ask to commit to a value.

Case if 2 proposers propose first proposal accepted by just majority and if before reply of prepare to commit the acceptor fails . no majority. 
Now new proposer propose and its accepted by majority, it commits
Failed acceptor wakes up and sends final accept msg to orignal proposal first proposer commits too . Correctness voilated.  
Solution both propose same value. If any proposer receives higher value then it accepts it.


Failure :
If f failure allowed, no of acceptor 2f+1
If proposer fail, another can take over. 
If orignal recovers and see no > his own . 
It can agree on that.
Issue if both proposal send proposal one more then previous in every request, will never come to consensus.

Acceptor need to record highest proposal in stable storage.
if that storage crash cannot take part(Byzantine)

Efficiency:
1st phase : f+1 msg and recieve f+1 replies
repeat till 4 phases : 4f+4 msg
Delay till protocol completed 4 msgs.
It can be seen that 2F+1 needed for consensus with despite failure of f.
 

Paxos Commit Protocol:
Txn initiator calls TM
TM sends prepare to RM 
RM sends Phase 2a to acceptor 
acceptor sends phase 2b to TM
TM sends commit to RM
TOTAL MSG : 2F *(n+1) + 3n + 1 msgs
5 msg delays
n+2F+1 writes to stable storage.

Analysis:
1 msg initiator to TM
N prepare msg from TM to RM
N commit from TM to RM
N(2F+1) phase 2a commit msg from RM to acceptor as each RM sends to acceptor 

2F+1 phase 2b msgs.

total 2f(n+1)  + 3n+2

Transection initiated by one of RM so 1 prepare msg can be avoided bwn RM and TM
if TM in same node as acceptor phase 2b msg can be avoided. 

If each acceptor same node as RM and TM also on RM
F+1 of phase 2a and f+1 of phase 2b can be avoided.

Links:
PaperTrail
Paxos Commit 
Paxos Simple