If you read the last post, Understanding the Cassandra Data Model from a SQL Perspective, then you already have a decent handle on what a column-oriented key/value store is. The next thing that tends to confuse new users of Cassandra is the notion of Consistency Level and Replication Factor. This post doesn’t really have a lot to do with SQL in particular; it’s more of a discussion of how tables get replicated across multiple servers and how you read from and write to those tables.
1. Replication Factor
In the previous post, we described how every single Cassandra Row would be a separate table in SQL with an arbitrary number of rows. The Replication Factor (aka ‘RF’) determines how many nodes have a copy of that table. If you have 10 nodes and your RF=3, then every single table will exist on 3 nodes. Table users-91b56770-bf40-11df-969c-b2b36bca998e might exist on nodes 2 through 4 and users-28b56770-b410-15ef-968c-b2c36d511e78 might exist on nodes 6 through 8.
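To make that concrete, here is a minimal sketch of declaring an RF using the DataStax Python driver and CQL (a newer interface than the one this post was written against; the node address and keyspace name are placeholders):

```python
from cassandra.cluster import Cluster

# Connect to any node in the cluster (placeholder address).
cluster = Cluster(['10.0.0.1'])
session = cluster.connect()

# RF is declared per keyspace. With replication_factor=3, every row
# (every per-key "table" in the SQL analogy) is stored on 3 nodes.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_app
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
```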
- Can the RF be any number? Anything one or higher works, but make sure it’s not higher than the actual number of nodes you have. This may be fixed by now, but it used to be handled ungracefully. In general it should also be an odd number >= 3 so that quorum is a useful consistency level (for more details on this, keep reading).
- How do I know which node to query to get my data if it could be anywhere? Do I have to know where to write it? No. Each node acts as a storage proxy for every other node. If you query node 5 for data that lives on nodes 8-10, it knows to ask one or more of those nodes for the answer and passes it back to you. The same is true of writes. (There’s a small sketch of this after this list.)
- What if a node goes down? How do I synchronize them? What if they have different values? This will be covered in the next section, Consistency Level.
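Here is that proxying behavior as a sketch (again with the Python driver; the node address, keyspace, and table are all made up):

```python
import uuid
from cassandra.cluster import Cluster

# Connect to node 5 (placeholder address). It doesn't matter which node
# we pick: every node can coordinate requests for data it doesn't hold.
cluster = Cluster(['10.0.0.5'])
session = cluster.connect('my_app')

# Even if this row's replicas live on nodes 6-8, node 5 forwards the
# query to one of them and hands the answer back to us.
user_id = uuid.UUID('91b56770-bf40-11df-969c-b2b36bca998e')
row = session.execute(
    "SELECT * FROM users WHERE user_id = %s", (user_id,)
).one()
```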
2. Consistency Level
Consistency in Cassandra is Eventual, which is to say that the N nodes responsible for a particular table could have a slightly different opinion of what its contents are. That’s ok! When reading from or writing to a Cassandra cluster, you can tune the tolerance for that sort of tomfoolery by reading or writing at a varying Consistency Level (aka ‘CL’). Remember that each node acts as a proxy for every other node, so you don’t have to worry about which node you are interacting with (unless you are really trying to optimize network activity, which is generally unnecessary).
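Concretely, the CL is chosen per request. A sketch with the Python driver (the table and columns are invented; the CL:: names discussed below map onto the driver’s ConsistencyLevel constants):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['10.0.0.1'])
session = cluster.connect('my_app')

# A cheap read: any single replica's answer is good enough.
read_stmt = SimpleStatement(
    "SELECT score FROM games WHERE game_id = %s",
    consistency_level=ConsistencyLevel.ONE)

# A careful write: a majority of replicas must acknowledge it.
write_stmt = SimpleStatement(
    "UPDATE games SET score = %s WHERE game_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
```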
CL::ALL will not return success unless it has successfully written the same value to every node responsible for holding whatever you are writing. Using CL::ALL, you can guarantee that whatever you are writing is the same on all nodes when success is returned. There is of course a performance penalty for this, but in return you get very strong Consistency. Reading at CL::ALL functions in a similar way .. if one of the nodes does not respond, it will return an error since it cannot guarantee that all of the nodes are consistent for the value in question.
CL::QUORUM only requires a majority of the nodes responsible for a table to respond instead of all of them. This allows for one or more nodes to be down or otherwise unavailable and still have your read/write return. If your RF=3, then only 2 nodes responsible for a table need to be online in order to reliably manipulate the data in that table. Note that if RF=1 or RF=2 then there is no meaningful difference between CL::ALL and CL::QUORUM so if you want the benefits of CL::QUORUM, make sure your RF>=3.
CL::ONE requires that only one of the nodes responsible for a table respond. This makes reads and writes fast, but it also means that depending on what else is reading and writing, different clients could briefly see conflicting answers. This is a fine tradeoff for speed in many applications, but not all. For example, recording votes or website hits, where the outside possibility of losing a few on machine failure is acceptable, is a good fit. Recording a financial transaction of some sort, probably not so much.
CL::ANY is only used for writing, not reading. CL::ANY means that as soon as a write is received by any node, the call returns success. Suppose your client is connected to node 5 but the nodes responsible for the data are 6-8. With CL::ANY, as soon as node 5 receives the write, it returns success (even if nodes 6-8 are all down). With CL::ONE, at least one of 6, 7, or 8 has to return success before node 5 reports success back to you.
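Summarizing the four levels as driver constants, with what a successful write at each one actually buys you (the hits table is invented):

```python
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

INSERT = "INSERT INTO hits (page, hit_time) VALUES (%s, %s)"

# ALL:    every replica has the value; strongest guarantee, least available.
stmt_all = SimpleStatement(INSERT, consistency_level=ConsistencyLevel.ALL)

# QUORUM: a majority of replicas have it; tolerates a down node when RF >= 3.
stmt_quorum = SimpleStatement(INSERT, consistency_level=ConsistencyLevel.QUORUM)

# ONE:    at least one replica has it; fast, but readers may briefly disagree.
stmt_one = SimpleStatement(INSERT, consistency_level=ConsistencyLevel.ONE)

# ANY:    some node (even a non-replica coordinator) accepted it; write-only,
#         and lost if that node dies before handing the write off.
stmt_any = SimpleStatement(INSERT, consistency_level=ConsistencyLevel.ANY)
```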
- OK .. so if I use CL::ALL and it fails, does that mean that my write failed? Not necessarily! It may have succeeded on two nodes and failed on the third, which means it will eventually be propagated to the third; the required guarantee simply was not met.
- Uh .. ok. So if that happens, how do I know that any of my writes succeeded? You don’t! At least not yet. The error doesn’t indicate percentage of success, just failure. Not great, I know.
- I have multiple datacenters. How do I tell Cassandra to put some data in each? There are various replication strategies and snitches that tell Cassandra where to put things, and they are constantly in flux. You can also write your own if you have specific requirements. For more information, see the Cassandra Wiki.
- What happens if I write at CL::ANY and that node explodes before it can send data to where it belongs? Congratulations, you just lost data. If you can’t tolerate that, don’t use CL::ANY.
- What happens if all 3 nodes are up, but have different values for something for some reason, and I read at CL::ALL? Does it give me the finger or what? Cassandra performs ‘read repair’. That is, upon reading, if you are using a CL that requires more than one node and the nodes disagree on the value, Cassandra will compare the values and use whichever one has the latest timestamp. It will also write that value back to any node that had an outdated value, hence ‘repairing’ it. (There’s a toy sketch of this rule after this list.)
- What if a node goes offline for a whole day, then comes back and has all sorts of wonky outdated data? Is there a way to fix it all at once instead of on read? That sounds expensive and slow to do all ad-hoc. Indeed it is! You can initiate a complete ‘repair’ of a node after a failure like that with the nodetool repair command.
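To make the read-repair rule above concrete, here is a toy model in plain Python (not driver code; the real repair happens inside Cassandra, per column): each replica’s answer carries a timestamp, and the latest one wins.

```python
from dataclasses import dataclass

@dataclass
class ReplicaAnswer:
    value: str
    timestamp: int  # write time; Cassandra timestamps every column value

def resolve(answers):
    """Latest timestamp wins, mimicking how read repair picks a value."""
    winner = max(answers, key=lambda a: a.timestamp)
    # Cassandra would now write 'winner' back to any replica that returned
    # a stale answer, bringing all replicas back into agreement.
    return winner

# Three replicas disagree; the most recently written value is returned.
print(resolve([ReplicaAnswer("blue", 100),
               ReplicaAnswer("green", 250),
               ReplicaAnswer("blue", 100)]).value)  # -> green
```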