Cassandra For Shared Media Libraries

by simbo1905

A good friend of mine is working on a project that hosts media libraries in his cloud service. At the end of 2015 I integrated Cassandra into a big financial services platform, and it is a great fit for my friend's service too. In this post I will outline an appropriate Cassandra data model and, along the way, point out some of the killer features of Cassandra.

If you would like to follow along you can install Docker and then run this Cassandra Docker image. I modified some commands and installed the Python cqlsh tool, which I have documented here.

First, here is a rough sketch of some features that approximate my friend's project:

  • Users can upload media files they own
  • Users can license media files from the service
  • Media files have metadata (e.g. tags)
  • Users can modify the metadata. User modifications are visible only to that user (e.g. they can use their own tags)

We can see that we would probably use at least two tables: one to store the media files and one (or more) to store the metadata. Let's discuss the media files first.

NoSQL in general, and Cassandra specifically, is perfect for storing immutable data which is accessed via a primary key. In this case, media files are immutable and will only rarely (or possibly never) be deleted. We can create a keyspace with a replication factor of 3 so that three copies are kept on three separate nodes for safety. The create table command of Cassandra in CQL ("Cassandra Query Language") is close enough to SQL for any programmer to get it right the first time:

CREATE KEYSPACE medialibrary WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
CREATE TABLE IF NOT EXISTS medialibrary.media (
  mediaid uuid PRIMARY KEY,
  media blob
);
INSERT INTO medialibrary.media ( mediaid, media ) values ( c37d661d-7e61-49ea-96a5-68c34e83db3a, textAsBlob('adb14fbe076f6b94444c660e36a400151f26fc6f') ) ;
SELECT * FROM medialibrary.media WHERE mediaid = c37d661d-7e61-49ea-96a5-68c34e83db3a ;

Although that looks like a SQL solution, you get a lot more scale and cloud-oriented resilience with Cassandra than with a traditional SQL database. Those commands used the SimpleStrategy for replication, but for a real deployment use the NetworkTopologyStrategy so that copies are kept on different racks or across two or more cloud availability zones. It's very simple to use: you just edit the config of each node to identify its location.
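
As a sketch of what that looks like (assuming the nodes have been configured to report themselves as belonging to a single data centre named 'dc1'), the CREATE KEYSPACE above would instead be:

CREATE KEYSPACE medialibrary WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'dc1' : 3
};

With more than one data centre you simply list each one with its own replication factor.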

The killer feature is automatic sharding of the data. Client connections are balanced across the cluster nodes, and the actual nodes used for reads and writes are distributed across the cluster using consistent hashing techniques. To expand the cluster you just start up a new node. It joins the cluster, which automatically rebalances data onto the new node as a bandwidth-limited background task. You get more disk, CPU and memory (cache) evenly distributed across your query traffic with almost no effort.

Inserts on the media table are a no-brainer, but what about deletes? A user of the system might delete their account and so wish to delete the media which they uploaded. This is likely to be rare but should be handled correctly, as we don't want to pay for storage of files that can no longer be accessed once a user deletes their account. Deletes are a notorious problem for NoSQL, where we don't have the bottleneck of a single master. If we have replicated the data across three nodes in a big cluster and a node is down for a day, it might miss the delete. We don't want the offline node to come back online and sync up with its peers in a way that resurrects the data. Is that a problem with Cassandra? Nope.

Cassandra solves deletes transparently by storing tombstones, which are delete markers. After a long period of time, the tombstones are themselves garbage collected. With my buddy's system there would be no need to adjust any of the default settings. It would all just work, with a negligibly low probability of a resurrection ever happening as long as he doesn't leave nodes down for weeks at a time.
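
A delete itself is just ordinary CQL. Under the hood it writes a tombstone, and the table's gc_grace_seconds option (which defaults to ten days) controls how long tombstones are retained before compaction is allowed to drop them. A sketch, with the option only shown to make the knob visible since the default is fine here:

DELETE FROM medialibrary.media WHERE mediaid = c37d661d-7e61-49ea-96a5-68c34e83db3a ;
-- tombstones older than gc_grace_seconds can be dropped during compaction (864000s = 10 days is the default)
ALTER TABLE medialibrary.media WITH gc_grace_seconds = 864000 ;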

In the case of the system I worked on we had exactly the same scenario of immutable and very rarely deleted data. We set a replication factor of three. Upon insert, we required the write to be acknowledged at QUORUM consistency. That ensured the data was on two out of three nodes when the write returned without an error. Cassandra then asynchronously replicates the data to the third node. To read the data we used a "fast query with slow fallback" approach: we would query with a consistency level of ONE, and if that did not find the row we would query again at QUORUM. Setting the read or write consistency level is specific to the language driver that you are using, but cqlsh also lets you set it with a command:

cqlsh> CONSISTENCY
Current consistency level is ONE.
cqlsh> CONSISTENCY QUORUM
Consistency level set to QUORUM.
cqlsh> SELECT * FROM medialibrary.media WHERE mediaid = c37d661d-7e61-49ea-96a5-68c34e83db3a ;

 mediaid                              | media
--------------------------------------+------------------------------------------------------------------------------------
 c37d661d-7e61-49ea-96a5-68c34e83db3a | 0x61646231346662653037366636623934343434633636306533366134303031353166323666633666

(1 rows)

It is only an extra line of code (or two) in the language of your choice to check the result of the ONE query and, if it got back zero rows, make a second QUORUM query. The slower fallback query automatically recovers from any consistency problems. Happy days, high-fives and beers all round.
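
Replayed by hand in cqlsh the pattern looks like this (just a sketch: in real client code the second query is only issued when the first one returns zero rows):

CONSISTENCY ONE
SELECT * FROM medialibrary.media WHERE mediaid = c37d661d-7e61-49ea-96a5-68c34e83db3a ;
-- only needed if the ONE read came back empty:
CONSISTENCY QUORUM
SELECT * FROM medialibrary.media WHERE mediaid = c37d661d-7e61-49ea-96a5-68c34e83db3a ;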

Cassandra has loads of "anti-entropy" features to correct consistency problems when the data didn't make it onto all nodes due to crashes or network issues. I wrote a weekly cron job to cycle through all nodes on Sunday and run the administration command (nodetool repair) to scan for and correct any replication problems. Check the latest docs, as that might not be necessary on the current version. Even in 2015 Cassandra had a form of built-in background error correction that is probably good enough for my friend's project. (Edit: see this very scary presentation about data corruption at CERN, 6 per day (!!!), which indicates that there is a lot of bit rot in the wild.)

The metadata is a little bit more sophisticated. Recall that users may want to see their own custom metadata (if it exists) rather than the service's official metadata (if it exists). To achieve that we can write the service's own metadata using a 'system' userid. We can then use a compound primary key for the metadata table as follows:

CREATE TABLE IF NOT EXISTS medialibrary.metadata (
  mediaid uuid,
  userid text,
  metadata text,
  PRIMARY KEY (mediaid, userid)
);
INSERT INTO medialibrary.metadata ( mediaid, userid, metadata ) VALUES ( d37d661d-7e61-49ea-96a5-68c34e83db3a, 'system', 'whatever2' );
INSERT INTO medialibrary.metadata ( mediaid, userid, metadata ) VALUES ( d37d661d-7e61-49ea-96a5-68c34e83db3a, 'simbo1905', 'whatever' );
SELECT * FROM medialibrary.metadata WHERE mediaid = d37d661d-7e61-49ea-96a5-68c34e83db3a AND userid IN ('system', 'simbo1905');

The query will pull back either one or two rows and the client code can show the most specific one to the user. Here are a couple of example queries which find either one or two rows:

cqlsh:medialibrary> SELECT * FROM medialibrary.metadata WHERE mediaid=d37d661d-7e61-49ea-96a5-68c34e83db3a AND userid IN ('system', 'simbo1905');

mediaid                              | userid    | metadata
--------------------------------------+-----------+-----------
d37d661d-7e61-49ea-96a5-68c34e83db3a | simbo1905 |  whatever
d37d661d-7e61-49ea-96a5-68c34e83db3a |    system | whatever2

(2 rows)

cqlsh:medialibrary> SELECT * FROM medialibrary.metadata WHERE mediaid=c37d661d-7e61-49ea-96a5-68c34e83db3a AND userid IN ('system', 'simbo1905');

mediaid                              | userid    | metadata
--------------------------------------+-----------+----------
c37d661d-7e61-49ea-96a5-68c34e83db3a | simbo1905 | whatever

(1 rows)

By using a compound primary key we have avoided needing to set up a secondary index on the table to search for user-specific data. Happy days.

Back in 2015 you couldn't use "IN" clause queries on compound primary keys with the Java driver. The workaround we used was to create a tuple primary key to mimic the compound key we wanted. That's more or less exactly the same query without the syntactic sugar; it was simply a case of writing a few lines of boilerplate code to pack and unpack pairs of fields into tuple wrappers, roughly as sketched below. Other than that minor faffing around, Cassandra was a joy to work with.
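
To give a flavour of the workaround, here is a minimal sketch (the metadata_by_key table is hypothetical, not the schema from the real system): the compound key collapses into a single frozen tuple column, and the IN query then lists whole tuples.

CREATE TABLE IF NOT EXISTS medialibrary.metadata_by_key (
  mediakey frozen<tuple<uuid, text>>,
  metadata text,
  PRIMARY KEY (mediakey)
);
INSERT INTO medialibrary.metadata_by_key ( mediakey, metadata ) VALUES ( (d37d661d-7e61-49ea-96a5-68c34e83db3a, 'system'), 'whatever2' );
SELECT * FROM medialibrary.metadata_by_key WHERE mediakey IN ( (d37d661d-7e61-49ea-96a5-68c34e83db3a, 'system'), (d37d661d-7e61-49ea-96a5-68c34e83db3a, 'simbo1905') );

That is the whole trick: the pair of mediaid and userid travels as one value, so the driver-side boilerplate is just packing and unpacking those tuples.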
