Distributing Key Values
Consistency is a virtue of mules?
Small islands of order exist amidst the sea of chaos. And our brave seafarer has but a raft.
About a month ago, Ashu, Pooja and I started on a project to build a distributed, strongly consistent key-value store. It's often said that where a hundred solutions exist, no great one does. And we discovered exactly this as we iterated through the plethora of papers dealing with this problem. Now I won't explain our decision process for arriving at our final design (the quote alludes to a few things!) but rather just dive headlong into it.
The Design
Design is not just what it looks and feels like. Design is how it works.
Steve Jobs
I won't go into the gory details of the implementation or some of the finer details of the algorithms used, simply because it makes for a drab blog. But the high-level overview below should clear some of this up.
PS: We use Raft as our consensus algorithm. It's one of the two most popular approaches, the other being Paxos.
The Server
To live is to serve
| Problem | Technique |
| --- | --- |
| Consistency | Strong leadership in RAFT |
| Availability | Independent RAFT groups |
| Failure Detection within a group | Heartbeat in RAFT |
| Fault Tolerance | Leader Election in RAFT |
| Replication | Majority voting in Put |
| Membership Changes | Joint Consensus in RAFT |
| Key Redistribution on new RAFT group addition | Full key scan by RAFT leader |
| Durability and Persistence | RocksDB as Key-Value Store |
| Log Compaction | Snapshotting in RAFT |
| Node-level performance optimization | Bloom Filters, Compression and Caching |
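To make a few of those rows concrete, here's a minimal sketch of how we picture the leader-side write path: a Put is proposed to the RAFT group, acknowledged only once a majority commits it, and then applied to RocksDB. The interfaces and names below (`RaftGroup`, `Store`, `Server.Put`) are illustrative stand-ins, not our actual code.

```go
package kvserver

import (
	"context"
	"errors"
)

// ErrNotLeader tells the client to retry against the current RAFT leader.
var ErrNotLeader = errors.New("not the leader of this RAFT group")

// RaftGroup is a stand-in for our real consensus interface.
type RaftGroup interface {
	IsLeader() bool
	// Propose blocks until the entry is committed on a majority of the group.
	Propose(ctx context.Context, entry []byte) error
}

// Store is the node-local key-value store (RocksDB in our case).
type Store interface {
	Set(key, value []byte) error
}

type Server struct {
	group RaftGroup
	store Store
}

// Put is acknowledged only after the RAFT group commits the entry on a
// majority of nodes; the committed entry is then applied to the local store.
func (s *Server) Put(ctx context.Context, key, value []byte) error {
	if !s.group.IsLeader() {
		return ErrNotLeader // the client retries on this redirection
	}
	entry := append(append([]byte(nil), key...), value...) // placeholder encoding
	if err := s.group.Propose(ctx, entry); err != nil {
		return err
	}
	return s.store.Set(key, value)
}
```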
The Client
Help will always be given to those that ask for it.
Albus Dumbledore
| Problem | Technique |
| --- | --- |
| Partitioning | Consistent Hashing |
| Leader Search | Retry on redirection |
| View of Server | Globally shared config file |
| Server Addition | Addition to existing RAFT groups or new RAFT group formation |
| Key Redistribution | Keyspace determination and key remapping |
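For a concrete flavor of "retry on redirection", here's a rough sketch of the client-side leader search we have in mind: consistent hashing picks the RAFT group, and the client walks its members until one stops redirecting. All names here are hypothetical placeholders rather than our real API.

```go
package kvclient

import (
	"context"
	"errors"
)

// ErrNotLeader is what a follower returns when it receives a write.
var ErrNotLeader = errors.New("not leader")

// Node is one server in a RAFT group, as seen from the client.
type Node interface {
	Put(ctx context.Context, key, value []byte) error
}

type Client struct {
	// groupFor maps a key to the members of its RAFT group
	// (consistent hashing over the shared configuration).
	groupFor func(key []byte) []Node
}

// Put retries on redirection: followers reject the write with ErrNotLeader,
// so the client simply moves on to the next member until the leader accepts.
func (c *Client) Put(ctx context.Context, key, value []byte) error {
	var lastErr error
	for _, n := range c.groupFor(key) {
		err := n.Put(ctx, key, value)
		if err == nil || !errors.Is(err, ErrNotLeader) {
			return err
		}
		lastErr = err
	}
	return lastErr
}
```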
Salient Design Features
Never ask me to alter a weapon merely in order to improve its appearance. A weapon is a tool, and if it is beautiful, then it is beautiful because it is useful.
Rhunön
We think it's best to draw attention to a couple of design decisions we made to support membership changes to our cluster. To simulate servers entering and leaving the pool, we added `die`, `leave` and `start` calls to the client library. These are treated as system admin calls, and the standing assumption is that once invoked, they cannot fail.
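For flavor, here's a sketch of what that admin surface of the client library might look like; the type and method names are illustrative and the bodies are elided. Since these calls cannot fail by assumption, they return nothing.

```go
package kvclient

// AdminClient exposes the system-administration side of the client library.
type AdminClient struct{ /* connection state elided */ }

// Die simulates a hard crash of the server at addr.
func (a *AdminClient) Die(addr string) { /* kill the process, no cleanup */ }

// Leave gracefully removes the server at addr from its RAFT group,
// provided the group still retains quorum after it is gone.
func (a *AdminClient) Leave(addr string) { /* joint consensus removal */ }

// Start brings up a server at addr, either adding it to an existing
// RAFT group or forming a new one.
func (a *AdminClient) Start(addr string) { /* join or seed a group */ }
```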
Addition of Servers
You want it. You get it
When a new RAFT group gets added, keys need to be redistributed from existing groups to the new group. While we could introduce a gossip protocol between the RAFT groups, in our current iteration we impose this responsibility on the system administrator (client). This allows us to address a few critical concerns (a sketch of the remapping follows the list):
- If multiple clients are using the service during reconfiguration, the administrator can block progress by taking a global file lock while the reconfiguration is underway. The service becomes briefly unavailable during this transition, which is precisely what lets us guarantee strong consistency throughout the use of the service.
- It makes the system design simple and doesn't skew the intra-group load balance too much. Otherwise, the leader would have the multi-dimensional responsibility of coordinating between and within groups.
- We can also claim partition tolerance, since the individual RAFT groups (whose nodes are co-located) can remain oblivious to each other.
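Since the administrator owns both the hash ring and the reconfiguration, the remapping itself is mechanical: add the new group to the ring, then walk each existing group's keyspace (the leader's full key scan) and move every key the ring now assigns to the newcomer. Here's a toy sketch of that idea; the `Ring`, `keysOf` and `move` helpers are hypothetical.

```go
package kvclient

import (
	"hash/fnv"
	"sort"
)

// Ring is a toy consistent-hash ring mapping keys to RAFT group IDs
// (one point per group; a real ring would use virtual nodes).
type Ring struct {
	points []uint32
	owner  map[uint32]string
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// AddGroup places a new group on the ring.
func (r *Ring) AddGroup(id string) {
	if r.owner == nil {
		r.owner = map[uint32]string{}
	}
	p := hashOf(id)
	r.points = append(r.points, p)
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	r.owner[p] = id
}

// GroupFor returns the group owning key: the first ring point at or
// after the key's hash, wrapping around the ring.
func (r *Ring) GroupFor(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

// Rebalance is what the administrator runs after a new group joins.
// keysOf corresponds to the full key scan performed by each group's
// RAFT leader; move issues the cross-group transfer of a single key.
func Rebalance(ring *Ring, oldGroups []string, newGroup string,
	keysOf func(group string) []string, move func(key, from, to string)) {
	ring.AddGroup(newGroup)
	for _, g := range oldGroups {
		for _, k := range keysOf(g) {
			if ring.GroupFor(k) == newGroup {
				move(k, g, newGroup)
			}
		}
	}
}
```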
Removal of Servers
You come for the king, you best not miss
We don't allow clients to fail during `die`/`leave` operations. This is a reasonable assumption, given that our administrator would have an infallible connection to our "datacenter".
We also impose another condition specifically for `leave` calls: a server cannot leave a group if the group would lose quorum once it's gone. Otherwise, we would no longer be able to recover that group when a different server tries to replace the outgoing one, since without a quorum the group has no leader to admit the replacement.
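The condition itself boils down to a majority check on the post-leave group. A minimal sketch, assuming quorum is measured against the group's configured size:

```go
package kvclient

// canLeave encodes the rule for "leave": the departing server is allowed
// out only if the members left behind still form a majority of the group's
// configured size, so a leader can be elected to admit a replacement.
//
//   canLeave(3, 3) == true  : two of three remain, still a majority
//   canLeave(3, 2) == false : only one would remain, quorum lost
func canLeave(groupSize, aliveMembers int) bool {
	remaining := aliveMembers - 1
	majority := groupSize/2 + 1
	return remaining >= majority
}
```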
Global shared configuration file
Be wary then; best safety lies in fear.
We currently view the configuration as a file on remote storage, akin to NFS. The file gets locked when a client edits it. Every time a client wants to perform an operation, it checks whether its current view of the server configuration is the latest by validating its knowledge of the file's last-modified timestamp.
This way of enforcing a consistent view of the system is inefficient, but it allows all concurrent clients to move safely from one configuration to the next. Also, if there's too much churn in the system, one should be talking to the hardware folks to see what they've set us up with (sound familiar😂?).
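As a sketch of what that staleness check might look like on the client (the file format and `parseConfig` helper are assumptions, not our actual code):

```go
package kvclient

import (
	"os"
	"time"
)

// Config is the client's cached view of the cluster layout.
type Config struct {
	Groups   map[string][]string // RAFT group ID -> member addresses
	loadedAt time.Time           // mtime of the shared file when we last read it
}

// refreshIfStale re-reads the shared config file (e.g. on an NFS mount)
// whenever its last-modified timestamp is newer than the cached view.
// Every client operation calls this before routing a request.
func (c *Config) refreshIfStale(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	if !info.ModTime().After(c.loadedAt) {
		return nil // our view is still the latest
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	c.Groups = parseConfig(data)
	c.loadedAt = info.ModTime()
	return nil
}

// parseConfig stands in for however the file is actually encoded.
func parseConfig(data []byte) map[string][]string {
	_ = data
	return map[string][]string{}
}
```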
What's Next?
Turn every bug into a feature.
So we've made it sound like some of our design decisions were actually "smart". But clearly we can do more. The global shared config file is a bottleneck, the reconfiguration via client is clearly something that can be re-engineered and several of the constants in the system could be tuned to improve performance (even the RAFT timeouts!). Can you think of other ways to enhance the system? Would love to hear your thoughts!