On Two-Phase Commit
Two phase commit does not guarantee that a distributed transaction can't fail, but it does guarantee that it can't fail silently without the TM being aware of it.
In order for B to report the transaction as being ready to commit, B must have the transaction in persistent storage (i.e. B must be able to guarantee that the transaction can commit in all circumstances). In this situation, B has persisted the transaction but the transaction manager has not yet received a message from B confirming that B has completed the commit.
The transaction manager will poll B again when B comes back online and ask it to commit the transaction. If B has already committed the transaction it will report the transaction as committed. If B has not yet committed the transaction it will then commit as it has already persisted it and is thus still in a position to commit the transaction.
In order for B to fail in this situation, it would have to undergo a catastrophic failure that lost data or log entries. The transaction manager would still be aware that B had not reported a successful commit.1
In practice, if B can no longer commit the transaction, it would imply that the disaster that took B out had caused data loss, and B would report an error when the TM asked it to commit a TxID that it wasn't aware of or didn't think was in a commitable state.
Thus, two phase commit does not prevent a catastrophic failure from occuring, but it does prevent the failure from going unnoticed. In this scenario the transaction manager will report an error back to the application if B cannot commit.
The application still has to be able to recover from the error, but the transaction cannot fail silently without the application being made aware of the inconsistent state.
Semantics
If a resource manager or network goes down in phase 1, the
transaction manager will detect a fatal error (can't connect to
resource manager) and mark the sub-transaction as failed. When the
network comes back up it will abort the transaction on all of the
participating resource managers.
If a resource manager or network goes down in phase 2, the
transaction manager will continue to poll the resource manager until
it comes back up. When it re-connects back to the resource manager
it will tell the RM to commit the transaction. If the RM returns an
error along the lines of 'Unknown TxID' the TM will be aware that
there is a data loss issue in the RM.
If the TM goes down in phase 1 then the client will block until the
TM comes back up, unless it times out or receives an error due to the
broken network connection. In this case the client is made aware of
the error and can either re-try or initiate the abort itself.
If the TM goes down in phase 2 then it will block the client until
the TM comes back up. It has already reported the transaction as
committable and no fatal error should be presented to the client,
although it may block until the TM comes back up. The TM will still
have the transaction in an uncommitted state and will poll the RMs
to commit when it comes back up.
Post-commit data loss events in the resource managers are not handled by the transaction manager and are a function of the resilience of the RMs.
Two-phase commit does not guarantee fault tolerance - see Paxos for an example of a protocol that does address fault tolerance - but it does guarantee that partial failure of a distributed transaction cannot go un-noticed.
- Note that this sort of failure could also lose data from previously committed transactions. Two phase commit does not guarantee that the resource managers can't lose or corrupt data or that DR procedures don't screw up.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…