Column Families in CockroachDB

In this article, we will take a closer look at using column families in CockroachDB to reduce contention on concurrent updates.

Column families are a group of columns in a table that are stored as single key-value pairs in the storage layer. By default, there is a single implicit column family created for all columns.

Adding column families is a method to impose a different structure on the table where each row is represented by multiple key-value pairs. This can improve performance for write operations (INSERT, UPDATE and DELETE) and given certain constraints also reduce contention from concurrent transactions, which is the topic of this article.

The test project we're going to use is a basic Spring Boot application that uses plain JPA and Hibernate for data access.

Source Code

The source code for examples of this article can be found on GitHub.

Use Case

The demo application uses a single table to store purchase orders. It comes in two flavors: one with a single-column family and one with two families. Other than that, the tables are identical.

The operations we will perform will also be the same for both tables, but we are going to witness different transaction behavior due to the nature of the operations and the use of column families.

The table schema is as follows:

create table purchase_order1
(
    id                  integer        not null default unordered_unique_rowid(),
    bill_address1       varchar(255),
    bill_address2       varchar(255),
    bill_city           varchar(255),
    bill_country        varchar(16),
    bill_postcode       varchar(16),
    bill_to_first_name  varchar(255),
    bill_to_last_name   varchar(255),
    date_placed         date           not null default current_date(),
    deliv_to_first_name varchar(255),
    deliv_to_last_name  varchar(255),
    deliv_address1      varchar(255),
    deliv_address2      varchar(255),
    deliv_city          varchar(255),
    deliv_country       varchar(16),
    deliv_postcode      varchar(16),
    order_status        varchar(64),
    total_price         decimal(18, 2) not null,

    primary key (id)
);

create table purchase_order2
(
    id                  integer        not null default unordered_unique_rowid(),
    bill_address1       varchar(255),
    bill_address2       varchar(255),
    bill_city           varchar(255),
    bill_country        varchar(16),
    bill_postcode       varchar(16),
    bill_to_first_name  varchar(255),
    bill_to_last_name   varchar(255),
    date_placed         date           not null default current_date(),
    deliv_to_first_name varchar(255),
    deliv_to_last_name  varchar(255),
    deliv_address1      varchar(255),
    deliv_address2      varchar(255),
    deliv_city          varchar(255),
    deliv_country       varchar(16),
    deliv_postcode      varchar(16),
    order_status        varchar(64),
    total_price         decimal(18, 2) not null,

    primary key (id),

    FAMILY f1 (id, bill_address1, bill_address2, bill_city, bill_country, bill_postcode, bill_to_first_name, bill_to_last_name, date_placed, deliv_to_first_name, deliv_to_last_name, deliv_address1, deliv_address2, deliv_city, deliv_country, deliv_postcode, order_status),
    family f2 (total_price)
);

The sequence of SQL operations is to first read a single row and then update the same row but use different columns. This will be done concurrently in separate sessions to visualize the effect.

The sequence of operations will be something like the following, for transaction T1 and T2 (time flows vertically):

Time	T1	T2
1	`begin;`	`begin;`
2	`select id,status from purchase_order1 where id=1;`
3	`(reads 1,'PLACED')`	`select id,total_price from purchase_order1 where id=1;`
4	`update purchase_order1 set status = 'CONFIRMED' where id = 1;`	`(reads 1,100.00)`
5		`update purchase_order1 set total_price = total_price + 5 where id = 1;`
6		`(blocked on T1)`
7	`commit;`
8		`commit; --- "ERROR: restart transaction.."`
9		(rollback)

The rollback for T2 is expected (yet undesired) behavior since we have contending read and write operations on the same row, even if the transactions write back to different columns.

Next, let's see how the same sequence works on the other table (purchase_order2) with a separate column family for the total_price column.

Time	T1	T2
1	`begin;`	`begin;`
2	`select id,status from purchase_order2 where id=1;`
3	`(reads 1,'PLACED')`	`select id,total_price from purchase_order2 where id=1;`
4	`update purchase_order2 set status = 'CONFIRMED' where id = 1;`	`(reads 1,100.00)`
5		`update purchase_order2 set total_price = total_price + 5 where id = 1;`
6	`commit;`	`commit;`

As we can see, both transactions can commit which is allowed in a serializable history since we are not writing to the same key, but to disjoint columns in different key-value pairs in the underlying key-value storage.

JPA and Hibernate Considerations

There are a few considerations in terms of Hibernate entity mappings and query operations to make the best use of CockroachDB column families to reduce contention.

One is to always use projection in the reading part of a transaction. This is the default, but the problem is that all columns are read like using a star projection. This is the case for example when using the entityManager#find method. If all columns would be read in the last example above, then it would fail with a transient SQL error.

For the demo, we are using named queries for the different order entities, as follows:

@Entity
@Table(name = "purchase_order1")
@NamedQuery(name = "Order1.findByIdForUpdateStatus",
        query = "SELECT o.id,o.orderStatus,o.totalPrice FROM Order1 o WHERE o.id = ?1")
@NamedQuery(name = "Order1.findByIdForUpdatePrice",
        query = "SELECT o.id,o.totalPrice,o.orderStatus FROM Order1 o WHERE o.id = ?1")
public class Order1 extends AbstractOrder {
}

.. and ..

@Entity
@Table(name = "purchase_order2")
@NamedQuery(name = "Order2.findByIdForUpdateStatus",
        query = "SELECT o.id,o.orderStatus FROM Order2 o WHERE o.id = ?1")
@NamedQuery(name = "Order2.findByIdForUpdatePrice",
        query = "SELECT o.id,o.totalPrice FROM Order2 o WHERE o.id = ?1")
public class Order2 extends AbstractOrder {
}

The downside with projection is that we can't use a typed query to read the persistent domain object (PDO) into its attached state, but only use scalar values. It's fine for this example though, where we just write the data again being selective of which columns to update:

entityManager
        .createQuery("update " + orderType.getSimpleName()
                + " o set o.orderStatus = :status"
                + " where o.id = :id")
        .setParameter("status", status)
        .setParameter("id", orderId)
        .executeUpdate();

// ..and..

entityManager
        .createQuery("update " + orderType.getSimpleName()
                + " o set o.totalPrice = o.totalPrice + :price"
                + " where o.id = :id")
        .setParameter("price", price)
        .setParameter("id", orderId)
        .executeUpdate();

Another method of being selective on which columns to update is to use the @DynamicInsert and @DynamicUpdate annotations in Hibernate:

@Entity
@Table(name = "purchase_order1")
@DynamicInsert
@DynamicUpdate
public class Order1 extends AbstractOrder {
}

This works for the update part, but not for projection.

Running the Demo

For demo cloning and building instructions, see the Github project.

Single Column Family

Get a form to create a new order:

shell curl "http://localhost:8090/order/v1/template" > o1.json

Submit the order form:

curl "http://localhost:8090/order/v1/" -H "Content-Type:application/json" -X POST -d "@o1.json"

Take note of generated id value (which is 1 by default) and be prepared to rotate the order status with a commit delay of 5 sec to allow for another update to come in between:

curl "http://localhost:8090/order/v1/1/status?delay=5" -i -X PUT

Now, within 5 sec after the previous PUT, increment the price on the same order ID, which will cause a serialization conflict:

curl "http://localhost:8090/order/v1/1/price?price=5&delay=0" -i -X PUT

This request will fail with an expected 40001 error. That's fine, but let's try to avoid this contention effect.

Multiple Column Families

Submit the order form again but using a different URL:

curl "http://localhost:8090/order/v2/" -H "Content-Type:application/json" -X POST -d "@o1.json"

Let's fire the status update and delay the commit with 5 sec:

curl "http://localhost:8090/order/v2/1/status?delay=5" -i -X PUT

Now, within 5 sec, increment the price on the same order ID which will succeed:

curl "http://localhost:8090/order/v2/1/price?price=5&delay=0" -i -X PUT

That's it, no retry errors.

Conclusion

Column families are a simple technique that can help to reduce contention and improve write performance in CockroachDB. The key consideration to make this work in JPA and Hibernate is to use both projections and selective updates when writing entity state back to the database.