Wednesday, October 17, 2012

Why I am ready to move to CQL for Cassandra application development

Earlier this year, I started learning about Cassandra because it seemed like it might be a good fit as a replacement data store for metrics and other time series data in RHQ. I developed a prototype for RHQ, using the Hector client library to access Cassandra from within RHQ and a Cassandra CLI script to define my schema. I recall when I first read about CQL. I spent some time deliberating over whether to define the schema using a CLI script or a CQL script. I was intrigued but ultimately decided against using CQL. As the CLI and the Thrift interface were more mature, they seemed like the safer bet. While I decided not to invest any time in CQL, I did make a mental note to revisit it at a later point, since there was clearly a big emphasis within the Cassandra community on improving CQL. That later point is now, and I have decided to start making extensive use of CQL.

After a thorough comparative analysis, the RHQ team decided to move forward with using Cassandra for metric data storage. We are making heavy use of dynamic column families and wide rows. Consider for example the raw_metrics column family in figure 1,

Figure 1. raw_metrics column family

The metric schedule id is the row key. Each data point is stored in a separate column where the metric timestamp is the column name and the metric value is the column value. This design supports fast writes as well as fast reads and works particularly well for the various date range queries in RHQ. This is considered a dynamic column family because the number of columns per row will vary and because column names are not defined up front. I was quick to rule out using CQL due to a couple of misconceptions about CQL's support for dynamic column families and wide rows. First, I did not think it was possible to define a dynamic table with wide rows using CQL. Second, I did not think it was possible to execute range queries on wide rows.

A couple weeks ago I came across this thread on the cassandra-users mailing list which points out that you can in fact create dynamic tables/column families with wide rows. And conveniently after coming across this thread, I happened to stumble on the same information in the docs. Specifically the DataStax docs state that wide rows are supported using composite column names. The primary key can have multiple components, but there must be at least one column that is not part of the primary key. Using CQL I would then define the raw_metrics column family as follows,
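A minimal sketch of such a definition is shown here. The column names schedule_id, time, and value, along with their types, are assumptions chosen to match the description of figure 1 and may differ from the exact schema used in RHQ.

```cql
-- Sketch only: column names and types are assumptions based on figure 1.
CREATE TABLE raw_metrics (
    schedule_id int,    -- metric schedule id; first key component, so it becomes the row key
    time timestamp,     -- metric timestamp; second key component, stored in the column name
    value double,       -- metric value; the one column that is not part of the primary key
    PRIMARY KEY (schedule_id, time)
);
```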

This CREATE TABLE statement is straightforward, and it does allow for wide rows with dynamic columns. The underlying column family representation of the data is slightly different from the one in figure 1 though.

Figure 2. CQL version of raw_metrics column family

Each column name is now a composite that consists of the metric timestamp along with the string literal "value". There is additional overhead on reads and writes as the column comparator now has to compare the string in addition to the timestamp. Although I have yet to do any of my own benchmarking, I am not overly concerned by the additional string comparison. I was, however, concerned about the additional overhead in terms of disk space. I have done some preliminary analysis and concluded that the difference compared with storing just the timestamp in the column name is negligible, thanks to SSTable compression, which is enabled by default.
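To make the composite layout a bit more concrete, here is a rough sketch of a single write and where it would end up. It assumes a raw_metrics table with columns schedule_id (int), time (timestamp), and value (double) and a primary key of (schedule_id, time); the schedule id and data point are made-up values.

```cql
-- Sketch only: assumes columns schedule_id, time, value
-- with PRIMARY KEY (schedule_id, time).
INSERT INTO raw_metrics (schedule_id, time, value)
VALUES (123, '2012-10-17 09:00:00', 3.14);

-- Per the layout in figure 2, this lands in the storage row with key 123
-- as a single column:
--   name:  composite of (2012-10-17 09:00:00, "value")
--   value: 3.14
```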

My second misconception about executing range queries is really predicated on the first misconception. It is true that you can only query named columns in CQL; consequently, it is not possible to perform a date range query against the column family in figure 1. It is possible though to execute a date range query against the column family in figure 2.
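A date range query of that kind might look like the following sketch. It assumes a raw_metrics table with columns schedule_id, time, and value and a primary key of (schedule_id, time); the schedule id and the dates are made-up values.

```cql
-- Sketch only: assumed column names; 123 and the dates are illustrative.
SELECT time, value
FROM raw_metrics
WHERE schedule_id = 123
  AND time >= '2012-10-17 00:00:00'
  AND time <  '2012-10-18 00:00:00';
```

Because time is part of the primary key, the restriction on it maps onto a contiguous slice of columns within the row for schedule 123.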

RHQ supports multiple upgrade paths. This means that in order to upgrade to the latest release (which happens to be 4.5.0 at the time of this writing), I do not have to first upgrade to the previous release (which would be 4.4.0). I can upgrade from 4.2.0, for instance. Supporting multiple upgrade paths requires a tool for managing schema changes. There are plenty of such tools for relational databases, but I am not aware of any for Cassandra. But because we can leverage CQL and because there is a JDBC driver, we can look at using an existing tool instead of writing something from scratch. I have done just that and am working on adding support for Cassandra to Liquibase. I will have more on that in a future post. Using CQL allows us to reuse existing solutions, which in turn is going to save a lot of development and testing effort.

The most compelling reason to use CQL is the familiar, easy to use syntax. I have been nothing short of pleased with Hector. It is well designed, the online documentation is solid, and the community is great. Whenever I post a question on the mailing list, I get responses very quickly. With all that said, contrast the following two, equivalent queries against the raw_metrics column family.

RHQ developers can look at the CQL version and immediately understand it. Using CQL will result in less code that is easier to maintain. We can also leverage ad hoc queries with cqlsh during development and testing. The JDBC driver also lends itself nicely to applications that run in an application server, as RHQ does.

Things are still evolving both with CQL and with the JDBC driver. Collections support is coming in Cassandra 1.2. The JDBC driver does not yet support batch statements. This is due to the lack of support for them on the server side. The functionality is there in the Cassandra trunk/master branch, and I expect to see it in the 1.2 release. The driver also currently lacks support for connection pooling. These and other critical features will surely make their way into the driver. With the enhancements and improvements to CQL and to the JDBC driver, adding Cassandra support to Hibernate OGM becomes that much more feasible.

The flexibility, tooling, and ease of use make CQL a very attractive option for working with Cassandra. I doubt the Thrift API is going away any time soon, and we will continue to leverage the Thrift API through Hector in RHQ in various places. But I am ready to make CQL a first-class citizen in RHQ and look forward to watching it continue to mature into a great technology.

1 comment:

  1. Cassandra represents one of the most exciting uses of P2P technology to date.
