Distributing Key Values

Consistency is a virtue of mules?

[Figure: Hash ring in true garba style]

Small islands of order exist amidst the sea of chaos. And our brave seafarer has but a raft.

About a month ago, Ashu, Pooja and I started on a project to build a distributed, strongly consistent key-value store. It's often said that where a hundred solutions exist, no great one does. And we discovered exactly this as we iterated through the plethora of papers tackling this problem. I won't explain our decision process for arriving at our final design (the quote alludes to a few things!) but rather just dive headlong into it.

The Design

Design is not just what it looks and feels like. Design is how it works.

Steve Jobs

[Figure: Overall design of our system]

I won't go into the gory details of the implementation or the finer points of the algorithms used, simply because that makes for a drab blog. The picture above should clear most of this up at a high level.
PS: We use Raft as our consensus algorithm; it's one of the two most popular approaches, the other being Paxos.

The Server

To live is to serve

Problem                                        | Technique
Consistency                                    | Strong leadership in Raft
Availability                                   | Independent Raft groups
Failure Detection within a group               | Heartbeat in Raft
Fault Tolerance                                | Leader Election in Raft
Replication                                    | Majority voting in Put
Membership Changes                             | Joint Consensus in Raft
Key Redistribution on new Raft group addition  | Full key scan by Raft leader
Durability and Persistence                     | RocksDB as Key-Value Store
Log Compaction                                 | Snapshotting in Raft
Node-level performance optimization            | Bloom Filters, Compression and Caching
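
To make the Put row concrete, here's a rough Go sketch of the write path: the group leader proposes the entry through Raft, waits until a majority has committed it, and only then applies it durably to RocksDB. The interfaces and names below are illustrative stand-ins for this post rather than our actual code (in a real Raft library, the apply step usually happens through its commit callback).

```go
package server

import (
	"encoding/json"
	"errors"
	"time"
)

// raftGroup abstracts a single Raft group; Propose blocks until the
// entry has been committed (acknowledged by a majority) or times out.
type raftGroup interface {
	IsLeader() bool
	Propose(cmd []byte, timeout time.Duration) error
}

// kvStore abstracts the RocksDB-backed state machine on each node.
type kvStore interface {
	Put(key, value []byte) error
}

// ErrNotLeader tells the client library to retry against the leader.
var ErrNotLeader = errors.New("not the leader of this Raft group")

type putCmd struct {
	Key   []byte `json:"key"`
	Value []byte `json:"value"`
}

// Put is served only by the group leader: the write is first committed
// through Raft (i.e. replicated to a majority), and only then applied
// durably to the local RocksDB instance.
func Put(rg raftGroup, db kvStore, key, value []byte) error {
	if !rg.IsLeader() {
		return ErrNotLeader
	}
	cmd, err := json.Marshal(putCmd{Key: key, Value: value})
	if err != nil {
		return err
	}
	if err := rg.Propose(cmd, 2*time.Second); err != nil {
		return err // the entry never reached a majority
	}
	return db.Put(key, value)
}
```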

The Client

Help will always be given to those that ask for it.

Albus Dumbledore

Problem            | Technique
Partitioning       | Consistent Hashing
Leader Search      | Retry on redirection
View of Server     | Globally shared config file
Server Addition    | Addition to existing Raft groups or new Raft group formation
Key Redistribution | Keyspace determination and key remapping
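
For the partitioning row, here's a minimal Go sketch of a consistent-hash ring on the client side. The SHA-1 hash and the virtual-node idea are illustrative choices for the sketch, not a claim about our exact implementation. Once the ring names a group, the client simply contacts a member of it and retries on redirection until it reaches the leader.

```go
package client

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"sort"
)

// Ring maps keys to Raft groups via consistent hashing. Each group is
// placed at several "virtual" positions to even out the key ranges.
type Ring struct {
	points []uint32          // sorted positions on the ring
	owner  map[uint32]string // position -> Raft group ID
}

func hashOf(s string) uint32 {
	sum := sha1.Sum([]byte(s))
	return binary.BigEndian.Uint32(sum[:4])
}

// NewRing places each Raft group at vnodes positions on the ring.
func NewRing(groups []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, g := range groups {
		for i := 0; i < vnodes; i++ {
			h := hashOf(fmt.Sprintf("%s#%d", g, i))
			r.points = append(r.points, h)
			r.owner[h] = g
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// GroupFor returns the group owning the key: the first ring position
// clockwise from the key's hash, wrapping around at the end.
func (r *Ring) GroupFor(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}
```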

Salient Design Features

Never ask me to alter a weapon merely in order to improve its appearance. A weapon is a tool, and if it is beautiful, then it is beautiful because it is useful.

Rhunön

We think it's best to draw attention to a couple of design decisions we made to support membership changes in our cluster. To simulate servers entering and leaving the pool, we added die, leave and start calls to the client library. These are treated as system-admin calls, and the standing assumption is that once invoked, they cannot fail.
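
As a rough illustration (the signatures below are made up for this post, not our real client API), the admin surface of the library looks something like this:

```go
package client

// Admin sketches the system-admin surface of the client library.
// Since these calls are assumed never to fail once invoked, they
// return nothing to the caller.
type Admin interface {
	Start(addr string) // bring a new (or restarted) server up
	Leave(addr string) // gracefully remove a server from its group
	Die(addr string)   // simulate an abrupt crash of a server
}
```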

Addition of Servers

You want it. You get it

When a new Raft group gets added, keys need to be redistributed from existing groups to the new group. While we could introduce a gossip protocol between the Raft groups, in our current iteration we place this responsibility on the system administrator (client); a rough sketch of the resulting redistribution follows the list below. This lets us address a few critical concerns:

  • If multiple clients are using the service during reconfiguration, the administrator can block their progress by holding a global file lock while the reconfiguration is underway. This lets us guarantee strong consistency throughout the use of the service, since the service is simply (and briefly) unavailable during the transition period.
  • It keeps the system design simple and doesn't skew the intra-group load balance too much. Otherwise, the leader would carry the multi-dimensional responsibility of coordinating both between and within groups.
  • We can also claim partition tolerance, since the individual Raft groups (with their co-located nodes) remain oblivious to each other.
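
Here's the promised sketch of that administrator-driven redistribution, reusing the Ring from the client sketch above. The GroupClient helpers are hypothetical wrappers over our RPCs, not the real API:

```go
package client

// GroupClient is an illustrative handle to one Raft group's leader.
type GroupClient interface {
	ScanKeys() ([]string, error) // full key scan served by the group leader
	Get(key string) (string, error)
	Put(key, value string) error
	Delete(key string) error
}

// Rebalance scans every pre-existing group's keyspace and moves the
// keys whose ring owner is now the new group. The administrator holds
// the global config-file lock for the duration, so no other client
// makes progress while keys are in flight.
func Rebalance(ring *Ring, groups map[string]GroupClient, newGroup string) error {
	for id, g := range groups {
		if id == newGroup {
			continue // nothing to move out of the new group
		}
		keys, err := g.ScanKeys()
		if err != nil {
			return err
		}
		for _, k := range keys {
			if ring.GroupFor(k) != newGroup {
				continue // key stays where it is
			}
			v, err := g.Get(k)
			if err != nil {
				return err
			}
			if err := groups[newGroup].Put(k, v); err != nil {
				return err
			}
			if err := g.Delete(k); err != nil {
				return err
			}
		}
	}
	return nil
}
```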

Removal of Servers

You come for the king, you best not miss

We don't allow clients to fail during die/leave operations. This is a reasonable assumption, given that our administrator would have an infallible connection to our "datacenter".

We also impose another condition specifically on "leave" calls: a server cannot leave a group if the group would lose quorum once it does. Otherwise, we would no longer be able to recover that group when a different server tries to replace the outgoing one, since without a leader the change can never be admitted.
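
One way to express that guard (illustrative only, and just our reading of the condition):

```go
package server

// canLeave: the servers left behind must still form a majority of the
// group's current size; otherwise the leaderless remainder could never
// elect a leader and admit a replacement for the departed server.
func canLeave(groupSize int) bool {
	quorum := groupSize/2 + 1 // majority of the current configuration
	return groupSize-1 >= quorum
}
```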

Global shared configuration file

Be wary then; best safety lies in fear.

We currently view the configuration as a file on remote storage, akin to NFS. It gets locked whenever a client edits it. Every time a client wants to perform an operation, it checks whether its current view of the server configuration is the latest by comparing the file's last-modified timestamp against the one it recorded when it last read the file.
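
In Go-flavoured pseudocode (the path and names are made up for illustration), the freshness check a client runs before each operation amounts to:

```go
package client

import (
	"os"
	"time"
)

// configPath is an illustrative location for the shared config file.
const configPath = "/mnt/shared/cluster.config"

// freshView reports whether our cached view of the configuration is
// still current, by comparing the file's last-modified time against
// the timestamp recorded when we last read it.
func freshView(lastSeen time.Time) (bool, error) {
	info, err := os.Stat(configPath)
	if err != nil {
		return false, err
	}
	return !info.ModTime().After(lastSeen), nil
}
```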

This way of enforcing a consistent view of the system is inefficient, but it allows all concurrent clients to move safely from one configuration to the next. Also, if there's too much churn in the system, one should probably be talking to the hardware folks to see what they've set us up with (sound familiar 😂?).

What's Next?

Turn every bug into a feature.

So we've made it sound like some of our design decisions were actually "smart". But clearly we can do more: the global shared config file is a bottleneck, reconfiguration via the client is something that can be re-engineered, and several of the constants in the system could be tuned to improve performance (even the Raft timeouts!). Can you think of other ways to enhance the system? We'd love to hear your thoughts!