-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Hi @killme2008,
Thank you for the quick review on PR #1243 — we really appreciate it!
We have been doing a systematic correctness check of sofa-jraft using TLA+ model checking. It looks like we found several more issues beyond #1241. I wanted to give you a heads-up before filing individual issues, so you can plan on your side.
Here's a summary of the issues we found, roughly ordered by severity:
1. electSelf() sends RequestVote RPCs before persisting term+votedFor (NodeImpl.java:1215-1218)
Same class of bug as #1241. requestVote() is called in a loop, then setTermAndVotedFor() happens after. A crash between the two means the vote is lost but remote nodes already saw the higher term. On restart the node doesn't know it voted, so it can vote for a different candidate in the same term — violating Raft's single-vote-per-term guarantee.
2. truncateSuffix() deletes from two column families without WriteBatch (RocksDBLogStorage.java:627-630)
Two separate deleteRange() calls on defaultHandle and confHandle. A crash between them leaves the CFs inconsistent. Should use a WriteBatch for atomicity.
3. Uncaught exception in fsm.onApply() permanently stalls the apply pipeline (FSMCallerImpl.java:562-608)
If the user's state machine throws from onApply(), lastAppliedIndex is never advanced. The next doCommitted() retries the same entry, hits the same exception, and loops forever. The node is alive but the FSM is dead. The intended setErrorAndRollback() path is never reached.
4. checkConsistency() uses %d format with a String argument (LogManagerImpl.java:1208-1210)
lastSnapshotId.toString() is passed to a %d placeholder. Throws IllegalFormatConversionException at runtime when there's a gap between snapshot and log.
5. passByStatus() returns true when FSMCaller is in ERROR state (FSMCallerImpl.java:773-782)
When done == null and the FSMCaller is in error state, the method falls through to return true instead of return false. The return false is nested inside if (done != null) when it shouldn't be.
6. handlePreVoteRequest missing ABA check after unlock-relock (NodeImpl.java:1748-1756)
handlePreVoteRequest releases the writeLock to call getLastLogId(), then re-locks — but unlike handleRequestVoteRequest, it doesn't check whether currTerm changed during the unlock window. A concurrent term change would make the pre-vote response based on stale state.
We also found a few lower-severity issues: onInstallSnapshotReturned missing response.getTerm() check (Replicator.java:711), meta file loss silently resetting term to 0 (LocalRaftMetaStorage.java:89), and incorrect StampedLock usage in BallotBox.describe(). Happy to share details on these too if you are interested.
We can file these as individual issues with reproduction details whenever it would be convenient for you. Also happy to work on and submit PRs for them.
Let us know how you'd like to proceed!