One of the largest challenges for developers starting with Cassandra is the paradigm shift when you move from a normalized relational data model to a typical Cassandra data model that is strongly denormalized. What helped for me is to forget about how SQL-like the Cassandra Query Language is and just see Cassandra as a Hash Map: you have to retrieve everything by key. This automatically leads to a lot of data duplication and thus a lot of additional code. Fortunately with version 3 Cassandra can help you with duplicating data mutations by allowing you to construct views on existing tables.
Introduction
So let’s take a step back first and discuss the 'why' first: why do we want to duplicate data? This goes against pretty much everything we were taught back in school right? Let’s start with a short explanation on how keys work in Cassandra. The most important key is the 'partition key'. If you define a primary key on a table on just one column this primary key is also the partition key. The partition key is the key Cassandra (via a smart hashing algorithm) uses to decide which Cassandra nodes in your cluster (should) have the data. So using this key Cassandra knows which node your data is in and where in that node your data is. This is why updating and retrieving by key scales so well in Cassandra.
Our example: User table
Note
|
By far the easiest way to get a running Cassandra installation is to use docker. “docker run --name cass3 -d -p 9042:9042 cassandra:3.7” will download a version 3.7 cassandra image, run it and expose the CQL port. |
Let’s create a keyspace first:
CREATE KEYSPACE example WITH REPLICATION =
{ 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE example;
So as a simple example we’ll create a user table:
CREATE TABLE users (
id uuid,
name text,
email text,
password text,
roles set<text>,
PRIMARY KEY(id)
);
And insert some data:
INSERT INTO users(id, name, email, password, roles) VALUES (
00000000-0000-0000-0000-000000000000,
‘John’,
‘john@example.com’,
‘secret’,
{‘admin’}
);
So we have a unique ID we can use internally, the name, an e-mail the user can use to log in, a password hash and a set of user roles that determine if the user is an ‘admin’ for example. Let’s do a select by ID:
SELECT * FROM users WHERE id = 00000000-0000-0000-0000-000000000000;
id | email | name | password | roles
--------------------------------------+-----------------+------+----------+-----------
00000000-0000-0000-0000-000000000000 | john@example.com | John | secret | {'admin'}
Great! So now we have created a nice simple model of our user. But how would we look up the user by email on login to check the password? Cassandra will not allow you to use the email in the where without explicitly enabling filtering.
In a relational database we would probably add an index to the email column so we’d be tempted to do the same in Cassandra:
CREATE INDEX users_email ON users( email );
If we do a select now:
SELECT * FROM users WHERE email = 'john@exampl.com';
id | email | name | password | roles
--------------------------------------+-----------------+------+----------+-----------
00000000-0000-0000-0000-000000000000 | john@example.com | John | secret | {'admin'}
Great! We’re done now right? Afraid not! Since we’re only querying on an index and not using a partition key (the id) this query gets slower the more machines are added to your cluster. Cassandra can’t predict which machines contain your data so it has to ask all of them and collate the responses into one result.
Again: think of Cassandra as a hash-map. If we want to search by anything other than the key we need to traverse the entire map.
Data denormalization
The go-to solution when creating a Cassandra model is simply duplicating the data into another table with the column you want to search on as the partition key. So we’ll do exactly that:
CREATE TABLE users_email (
id uuid,
name text,
email text,
password text,
roles set<text>,
PRIMARY KEY(email)
);
This seems strange at first but it’s perfectly normal when you’re using Cassandra. There is a downside however in that you have to insert, update and delete everything twice in your code:
INSERT INTO users_email(id, name, email, password, roles) VALUES (
00000000-0000-0000-0000-000000000000,
‘John’,
‘john@example.com’,
‘secret’,
{‘admin’}
);
I’m a lazy developer so I would personally much prefer it if someone else would handle this for me. Enter Cassandra 3!
Materialized views
Cassandra 3 (released Nov 2015) has support for materialised views. Basically you can now have one ‘user’ table and a ‘user_email’ view that contains the same data with a different partition key we can then query. This view will always reflect the state of the underlying table.
So let’s drop the user_email table first since we won’t be using it anymore:
DROP TABLE users_email;
And in it’s place create a materialized view:
CREATE MATERIALIZED VIEW users_email AS
SELECT id, email, name, password, roles FROM users WHERE email IS NOT NULL PRIMARY KEY(email, id);
And when we now insert a user into our users table:
INSERT INTO users(id, name, email, password, roles) VALUES(
00000000-0000-0000-0000-000000000001,
‘Jill’,
‘jill@example.com’,
‘secret’,
{‘user’}
);
We can immediately find it in our users_email view:
SELECT * FROM users_email WHERE email = 'jill@example.com';
email | id | name | password | roles
------------------+--------------------------------------+------+----------+----------
jill@example.com | 00000000-0000-0000-0000-000000000001 | Jill | secret | {'user'}
And if we delete the user, it will also be gone from the users_mail view:
DELETE FROM users WHERE id = 00000000-0000-0000-0000-000000000001;
SELECT * FROM users_email WHERE email = 'jill@example.com';
email | id | name | password | roles
-------+----+------+----------+-------
(0 rows)
Note
|
You can’t delete data from a view. |
This makes it a lot easier to keep the different tables consistent.
Like with most good things there is a downside: materialised views have some performance implications. It’s design favours correctness over speed. So if speed is your main concern you might want to consider to keep doing it manually but in most cases (the speed trade off isn’t that large) I would favour the view approach.
Conclusion
In Cassandra duplicating data is a must but this also leads to code duplication in the system that accesses the data. Materialized views are an excellent way to have less duplicate code, make ensuring consistency easier with only a small tradeoff in speed.
I hope you enjoyed the read. Let me know if anything is unclear!
Further reading: