Kai Niemi's Blog

Multi-active systems

Kai Niemi — Mon, 31 Jul 2023 07:15:36 GMT

Some business drivers that could justify adopting a multi-active and multi-region deployment strategy:

Securing business continuity in the event of regional data centre disruptions
Deliver a good customer experience worldwide
Deliver business adaptability and scalability to different market needs/volumes
Compliance with data locality/placement regulations
Sustainable operational costs as the business grows

Multi-active systems are a prerequisite to effectively adopting a multi-region strategy. Let's find out how.

Multi-Active Systems

A multi-active system is capable of operating and serving online traffic simultaneously from multiple active data centres and regions. With that come characteristics that satisfy the goals stated above without adding too much infrastructure/app complexity and cost overhead.

Multi-active systems run live in multiple data centres in different regions all the time, and workloads are dynamically shared across these data centres. A multi-active system processes requests simultaneously for either domestic or global markets without any assumptions on locality, traffic affinity or replication delays.

There are no actual concepts of traffic failover or failback. Instead, failures and disruptions are handled transparently through regional and/or global load balancing and traffic rerouting. If one failure domain (data centre or region) begins to fall behind due to disruptions, parts of its workload can move to other domains transparently. If one domain is completely offline, all its work is rebalanced to the remaining domains.

Using three failure domains allows one to go completely dark, while still allowing systems to make forward progress through consensus decisions. Using five failure domains allows two domains to go dark, and so on. This allows for systems to be both highly available and always consistent and correct.
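The arithmetic behind these numbers is plain majority quorum: with n failure domains, a majority is floor(n/2) + 1, and the remainder may be offline. A minimal sketch of that calculation follows; the class and method names are illustrative and not part of any CockroachDB API.

```java
// Sketch: majority-quorum arithmetic for n failure domains (illustrative names).
final class Quorum {
    private Quorum() {
    }

    /** Smallest number of domains that forms a majority of the given total. */
    static int majority(int domains) {
        if (domains < 1) {
            throw new IllegalArgumentException("domains must be >= 1");
        }
        return domains / 2 + 1;
    }

    /** Number of domains that can go dark while a majority is still reachable. */
    static int toleratedFailures(int domains) {
        return domains - majority(domains);
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5, 7}) {
            System.out.printf("%d domains: majority=%d, tolerated failures=%d%n",
                    n, majority(n), toleratedFailures(n));
        }
    }
}
```

Note that an even number of domains adds cost without adding tolerance: four domains tolerate one failure, the same as three.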

A multi-active system is not without certain challenges:

Needs a fit-for-purpose solution to manage state and replication (data distribution at a global scale is a difficult problem) such as:
- Protecting rule invariants, even during contention and failures
- Ensuring effectively-once outcomes of event processing
Higher service latency due to mandatory cross-datacenter coordination
Existing design assumptions on single DC/region deployment
Constraints around auxiliary system integrations
Breaking current assumptions on system design

To contain complexity, most of these challenges are preferably pushed down to the resource tier - the database - instead of being managed in the app tier. CockroachDB is one such system with a wide range of multi-region deployment options and first-class support for crafting multi-active systems.

Failover-based Systems

To contrast multi-active systems, let's quickly look at the main predecessor: singly-homed, failover-based systems. A singly-homed system is crafted to operate and serve online traffic from a single data centre or at most a single region. "Crafted" here refers to design choices, technology selections, network assumptions and supporting infrastructure components.

In the event of a primary domain disruption or disaster, a singly-homed system may fail over traffic to an alternative secondary data centre. After the disruption is cleared, it may "fail back" traffic to the original primary domain.

This type of setup has many limitations:

Unable to scale horizontally beyond a single data centre/region
Unable to load-balance traffic freely across multiple active datacenters
Dependent on standby, underutilized resources, increasing TCO (300% capacity for steady state)
Must use asynchronous replication for availability and performance, with the risk of data loss
Long recovery times after failures
Complex and error-prone failover protocols with manual checkpoints/sign-offs
Unclear when and if a standby system can resume traffic from a safe point
Difficult and risky to test and verify that the protocol works

Disaster Recovery Spectrum

The main objective of a disaster recovery plan is to minimize the time it takes to recover from a severe disruption event and reduce the amount of data loss and other business impacts.

The spectrum of disaster recovery solutions typically ranges from offline backups to full-blown multi-region deployments, also with backups.

Backups - Data is frequently backed up and sent off-site or to cloud storage.
- The recovery time objective (RTO) is governed by the time it takes to restore the database to a new setup
- The recovery point objective (RPO) is governed by the frequency of incremental and full backups
Cold Standby - A minimally provisioned environment with the ability to take over core services from a failed primary data centre.
- Higher TCO due to under-utilized standby capacity
- RTO is governed by how fast a switchover can be made to the secondary
- RPO is governed by the async replication delay from the primary to the secondary
Warm Standby - A fully provisioned environment with the ability to take over a failed primary data centre.
- Higher TCO due to excessive amounts of under-utilized standby capacity
- Could serve certain read-only traffic at the same time as the primary
- RTO and RPO are quite similar to a cold standby, only a bit lower due to higher readiness
Multi-Active - Each deployment site serves production traffic simultaneously.
- All data centres provide traffic at the same time for the entire keyspace
- There is no actual notion of fail-over or fail-back; failures and recovery are handled transparently towards the app tier
- RTO is governed by how quickly an isolated or crashed node can drop its authority over reads and writes to local data (typically a few seconds)
- RPO is zero due to consensus-based replication

Multi-active systems stand out from most fail-over-based models in terms of cost and complexity reduction. They are far more resilient against different categories of disruptions, but not immune to disasters. If a multi-active system loses a majority of its failure domains (like 2 zones in a 3-zone region) or if some operator error corrupts a database, then the music stops. Therefore, backups are still commonly used alongside multi-active systems, which adds a safety harness for recovery.

Combined with multiple regions, the blast radius is extended to cover most conditions and you can also improve customer experience for a global market.

Multi-region deployments

One data centre is a single point of failure, similar to a single region. If that data centre/region goes offline for a longer period without any recovery option, it may have a severe impact on the business and the company's reputation.

Adding two or more data centres to a single region will increase the blast radius and decrease the likelihood of severe, long-lasting service disruptions due to a single DC outage.

Deploying a system (as in many services/components working in concert) across multiple, geo-separated regions extends the blast radius even further. Single-region assumptions cannot however be transferred to this new ecosystem due to how we traditionally manage state and consistency. Leveraging multi-region effectively requires a multi-active system architecture. Not exclusively, but it's very much a state/database undertaking that needs a fit-for-purpose solution like CockroachDB.

Summary

This article discusses the advantages of adopting a region-level deployment strategy for businesses, focusing on multi-active systems. These systems operate simultaneously across multiple data centres, providing increased resiliency and adaptability to market needs. The article also contrasts multi-active systems with traditional failover-based systems and examines the disaster recovery spectrum, including backups, cold standby, warm standby, and multi-active solutions. Ultimately, multi-active systems offer significant benefits in terms of cost and complexity reduction, while still requiring backups to ensure data safety.

User defined composite types

Kai Niemi — Fri, 30 Jun 2023 12:13:50 GMT

In a previous article, we looked at creating a simple distributed user-defined function (UDF) in CockroachDB. In this article, we'll revisit UDFs in the form of user-defined composite types, introduced in CockroachDB v23.1.

Introduction

A composite type is simply a type composed of other types. In the following example, we are creating a composite money type. The money type is the combination of an amount, currency code and monetary type:

The amount is a decimal with fractions matching the currency.
The currency is a 3-letter ISO 4217 code.
The monetary type is an arbitrary tag for denoting the type of money. For example:
- RM for real money
- FM for funny money

On top of the type, we'll also add a few UDFs for money arithmetics. Ideally, these functions should only be allowed when operands use the same currency and monetary type. For example, you want to prevent adding 10 USD with 15 SEK or real money with funny money. There's however no way to enforce these rules in the DB itself.

Let's begin with the money type:

CREATE TYPE money_type AS (amount decimal, currency_code char (3), monetary_type char (2));

Next, create a few UDFs for money operations:

-- Accessor for the amount component
CREATE FUNCTION money_amount(x money_type) RETURNS decimal IMMUTABLE LEAKPROOF LANGUAGE SQL AS $$
   select ((x).amount)::decimal
$$;

-- Accessor for the 3-letter currency code
CREATE FUNCTION money_currency(x money_type) RETURNS char(3) IMMUTABLE LEAKPROOF LANGUAGE SQL AS $$
   select ((x).currency_code)::char(3)
$$;

-- Accessor for the 2-letter monetary type (char(2), matching the type definition)
CREATE FUNCTION money_monetary_type(x money_type) RETURNS char(2) IMMUTABLE LEAKPROOF LANGUAGE SQL AS $$
   select ((x).monetary_type)::char(2)
$$;

-- Parses a string of the form '<amount> <currency> <type>', e.g. '100.00 SEK RM'
CREATE FUNCTION to_money(x string) RETURNS money_type IMMUTABLE LEAKPROOF LANGUAGE SQL AS $$
   select (
    split_part($1,' ',1)::decimal,
    split_part($1,' ',2),
    split_part($1,' ',3)
    )::money_type
$$;

Next, let's add a few money arithmetics UDFs:

CREATE FUNCTION money_add(IN m money_type, IN addend decimal) RETURNS money_type IMMUTABLE LEAKPROOF LANGUAGE SQL AS $$
select (
    (m).amount + addend,
    (m).currency_code,
    (m).monetary_type
    )
$$;

CREATE FUNCTION money_mult(IN m money_type, IN multiplier decimal) RETURNS money_type IMMUTABLE LEAKPROOF LANGUAGE SQL AS $$
select (
    (m).amount * multiplier,
    (m).currency_code,
    (m).monetary_type
    )
$$;

-- The second operand of a division is the divisor
CREATE FUNCTION money_div(IN m money_type, IN divisor decimal) RETURNS money_type IMMUTABLE LEAKPROOF LANGUAGE SQL AS $$
select (
    (m).amount / divisor,
    (m).currency_code,
    (m).monetary_type
    )
$$;

That's about it. Now let's put the money type to use and see how things work. Create an account table holding a cached balance using the money type:

create table account
(
    id             uuid        not null default gen_random_uuid(),
    city           string      not null,
    balance        money_type  not null,
    name           string(128) not null,
    description    string(256) null,
    closed         boolean     not null default false,
    allow_negative integer     not null default 0,
    updated_at     timestamptz not null default clock_timestamp(),
    primary key (id)
);

Let's strengthen the integrity a bit with these CHECK constraints (totally optional):

alter table account
    add constraint check_account_allow_negative check (allow_negative between 0 and 1);
alter table account
    add constraint check_account_positive_balance check ((balance).amount * abs(allow_negative - 1) >= 0);
alter table account
    add constraint check_account_currency check ((balance).currency_code in ('SEK', 'USD', 'GBP', 'EUR'));

As you can see, an account may or may not accept a negative balance depending on its allow_negative flag. We could also have used an enum type for the currency.

Add some data:

INSERT INTO account (id, city, balance, name, allow_negative)
VALUES ('10000000-0000-0000-0000-000000000000', 'stockholm', to_money('100.00 SEK RM'), 'test:1', 0),
       ('20000000-0000-0000-0000-000000000000', 'stockholm', to_money('200.00 SEK RM'), 'test:2', 1),
       ('30000000-0000-0000-0000-000000000000', 'new york', to_money('300.00 USD PM'), 'test:3', 0),
       ('40000000-0000-0000-0000-000000000000', 'new york', to_money('400.00 USD PM'), 'test:4', 1);

Let's see how this looks:

select id,city,balance,
       money_amount(balance),
       money_currency(balance),
       money_monetary_type(balance)
from account;

                   id                  |   city    |     balance     | money_amount | money_currency | money_monetary_type
---------------------------------------+-----------+-----------------+--------------+----------------+----------------------
  10000000-0000-0000-0000-000000000000 | stockholm | (100.00,SEK,RM) |       100.00 | SEK            | RM
  20000000-0000-0000-0000-000000000000 | stockholm | (200.00,SEK,RM) |       200.00 | SEK            | RM
  30000000-0000-0000-0000-000000000000 | new york  | (300.00,USD,PM) |       300.00 | USD            | PM
  40000000-0000-0000-0000-000000000000 | new york  | (400.00,USD,PM) |       400.00 | USD            | PM
(4 rows)

Time: 14ms total (execution 14ms / network 0ms)

Let's execute an aggregation query:

select sum(money_amount(balance)) balance,
       money_currency(balance) currency
from account
group by balance,currency;

  balance | currency
----------+-----------
   100.00 | SEK
   200.00 | SEK
   300.00 | USD
   400.00 | USD
(4 rows)

Time: 2ms total (execution 2ms / network 0ms)

Updating the money type can be done using one of the arithmetic functions:

UPDATE account set balance=money_add(balance,-90.00) where id='10000000-0000-0000-0000-000000000000';
UPDATE account set balance=money_add(balance,-100.00) where id='20000000-0000-0000-0000-000000000000';
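Since the rule that operands must share currency and monetary type can't be enforced in the DB itself, one option is to guard it in the app tier. The following is a minimal sketch of that idea, assuming a hypothetical Money record that mirrors money_type and parses the same string format as to_money; it is not part of the article's source code.

```java
import java.math.BigDecimal;

// Hypothetical app-tier counterpart of money_type (illustration only).
record Money(BigDecimal amount, String currencyCode, String monetaryType) {

    /** Parses '<amount> <currency> <type>', the same format to_money() accepts. */
    static Money parse(String text) {
        String[] parts = text.trim().split(" ");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected '<amount> <currency> <type>': " + text);
        }
        return new Money(new BigDecimal(parts[0]), parts[1], parts[2]);
    }

    /** Adds two amounts, rejecting mixed currencies or monetary types. */
    Money add(Money other) {
        if (!currencyCode.equals(other.currencyCode)
                || !monetaryType.equals(other.monetaryType)) {
            throw new IllegalArgumentException("Mismatch: " + this + " vs " + other);
        }
        return new Money(amount.add(other.amount), currencyCode, monetaryType);
    }

    public static void main(String[] args) {
        System.out.println(parse("100.00 SEK RM").add(parse("15.50 SEK RM")));
        try {
            parse("10.00 USD RM").add(parse("15.00 SEK RM"));
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```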

Conclusion

This article provides an example of how to create a user-defined composite type in CockroachDB v23.1, specifically a money type composed of an amount, currency code, and monetary type. It also explains how to use UDFs for money operations, create a table to hold a cached balance, and add CHECK constraints to strengthen the integrity.

Transaction timeouts in CockroachDB

Kai Niemi — Fri, 30 Jun 2023 12:09:12 GMT

In a previous article series on Spring Data JPA and CockroachDB, we looked into different methods to avoid lengthy transaction execution times. Until recently, however, there was no way to specify a timeout for an entire transaction in CockroachDB, only at the statement level.

This has changed since CockroachDB v23.1 where a new session variable for transaction timeouts was introduced, unsurprisingly called transaction_timeout:

New in v23.1: Aborts an explicit transaction when it runs longer than the configured duration. Stored in milliseconds; can be expressed in milliseconds or as an INTERVAL.

Overview

Transaction timeouts are helpful if you need to set a fixed upper limit for how long to wait for an explicit transaction to complete. If a transaction is not completed within that timeframe, it's aborted and you then know that any provisional writes did not complete.

In contrast, if you just wait for an arbitrary amount of time and then interrupt the calling thread, then you have an ambiguous result where you can't tell if an operation took place or not since the commit could have been completed or rolled back just before the cancellation. Ambiguous results for non-idempotent operations are typically not a good thing for safety.
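The ambiguity of a purely client-side timeout can be illustrated without a database. In the sketch below, a worker thread stands in for the transaction and an AtomicBoolean stands in for the commit; all names are hypothetical. The caller gives up after 50 ms and learns nothing about the outcome, yet the simulated commit still happens afterwards.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustration only: a client-side wait that times out tells nothing about the outcome.
final class AmbiguousTimeoutDemo {
    private AmbiguousTimeoutDemo() {
    }

    /** Returns true if the simulated commit happened although the caller timed out. */
    static boolean committedDespiteClientTimeout() throws Exception {
        AtomicBoolean committed = new AtomicBoolean(false);
        CountDownLatch callerGaveUp = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> tx = executor.submit(() -> {
                try {
                    callerGaveUp.await(); // simulated slow work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                committed.set(true); // simulated commit
            });
            try {
                tx.get(50, TimeUnit.MILLISECONDS);
            } catch (TimeoutException expected) {
                // The caller stops waiting here, without knowing the outcome.
            }
            callerGaveUp.countDown(); // the work finishes after the caller gave up
            tx.get();                 // demo only: wait so the result can be inspected
            return committed.get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Committed despite client timeout: "
                + committedDespiteClientTimeout());
    }
}
```

A server-side transaction_timeout avoids this situation because the database itself aborts the transaction, so the abort is a definite outcome rather than a guess made by the client.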

Now let's see how to hook up transaction timeouts in a fully transparent way using Spring's @Transactional annotation and AspectJ, similar to how we can deal with transaction retries.

Source Code

The code for this article is available on GitHub.

AOP Timeout Solution

We are going to set the session variables using aspect-oriented programming (AOP) with AspectJ. AOP is a core concept in Spring.

A small recap on basic AOP terminology:

Aspect - An orthogonal cross-cutting concern that you wrap in a contained module or aspect. Like retries, logging, security or in our case setting session variables.
Joinpoint - Points in the application code where the aspect can be plugged in, such as method execution or the handling of an exception.
Pointcut - One or more join points where advice should be executed, often using pointcut expressions.
Advice - The action to be performed either before or after method execution, akin to an interceptor.

To set the session variables, we create a TransactionAttributesAspect with an around-advice:

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Pointcut;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.Ordered;
import org.springframework.core.annotation.Order;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.TransactionDefinition;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.transaction.support.TransactionSynchronizationManager;
import org.springframework.util.Assert;

@Aspect
@Order(Ordered.LOWEST_PRECEDENCE - 2)
public class TransactionAttributesAspect {
    @Autowired
    private JdbcTemplate jdbcTemplate;

    // Matches any public method annotated with @Transactional
    @Pointcut("execution(public * *(..)) "
            + "&& @annotation(transactional)")
    public void anyTransactionalOperation(Transactional transactional) {
    }

    @Around(value = "anyTransactionalOperation(transactional)", argNames = "pjp,transactional")
    public Object doAroundTransactionalMethod(ProceedingJoinPoint pjp, Transactional transactional) throws Throwable {
        Assert.isTrue(TransactionSynchronizationManager.isActualTransactionActive(), "Explicit transaction required");
        applyVariables(transactional);
        return pjp.proceed();
    }

    private void applyVariables(Transactional transactional) {
        if (transactional.timeout() != TransactionDefinition.TIMEOUT_DEFAULT) {
            // @Transactional timeouts are in seconds; transaction_timeout is in milliseconds
            jdbcTemplate.update("SET transaction_timeout=?", transactional.timeout() * 1000);
        }
        if (transactional.readOnly()) {
            jdbcTemplate.execute("SET transaction_read_only=true");
        }
    }
}

This weaves in the doAroundTransactionalMethod advice at runtime on all public methods annotated with Spring's @Transactional annotation. This is pretty much what the pointcut expression says:

@Pointcut("execution(public * *(..)) && @annotation(transactional)")

Lastly, we look at the annotation properties and use a JDBC template to set the appropriate variables while assuming there's an open transaction in scope.

if (transactional.timeout() != TransactionDefinition.TIMEOUT_DEFAULT) {
    jdbcTemplate.update("SET transaction_timeout=?", transactional.timeout() * 1000);
}

Testing Timeouts

To test this in action, let's create a simple service and a few repositories:

@Service
public class OrderService {
    @Autowired
    private OrderRepository orderRepository;

    @Autowired
    private ProductRepository productRepository;

    @Transactional(propagation = Propagation.REQUIRES_NEW, readOnly = true)
    public Product findProduct(String sku) {
        return productRepository.findBySku(sku)
                .orElseThrow(() -> new ObjectRetrievalFailureException(Product.class, sku));
    }

    @Transactional(propagation = Propagation.REQUIRES_NEW, timeout = 5)
    public void placeOrderWithTimeout(Order order, long delayMillis) {
        placeOrderAndUpdateInventory(order);
        try {
            logger.info("Entering sleep for " + delayMillis);
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            logger.info("Exited sleep for " + delayMillis);
        }
    }

    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void placeOrderWithoutTimeout(Order order) {
        placeOrderAndUpdateInventory(order);
    }

    private void placeOrderAndUpdateInventory(Order order) {
        Assert.isTrue(!TransactionSynchronizationManager.isCurrentTransactionReadOnly(), "Read-only");
        Assert.isTrue(TransactionSynchronizationManager.isActualTransactionActive(), "No tx");

        // Update product inventories
        order.getOrderItems().forEach(orderItem -> {
            Product product = orderItem.getProduct();
            product.addInventoryQuantity(-orderItem.getQuantity());
            productRepository.save(product); // product is in detached state
        });

        order.setStatus(ShipmentStatus.confirmed);
        orderRepository.save(order);
    }
}

In the placeOrderWithTimeout method, there's a fake delay that can last longer than the configured timeout to trigger an abort. Let's verify this in an integration test:

public class TimeoutsTest extends AbstractIntegrationTest {
    @Autowired
    private OrderService orderService;

    @Autowired
    private TestSetup testSetup;

    @BeforeAll
    public void setupTest() {
        testSetup.setupTestData();
    }

    @org.junit.jupiter.api.Order(1)
    @Test
    public void whenCreatingOrderWithTimeoutThatExpires_thenExpectRollback() {
        Product p1 = orderService.findProduct("p1");
        int inventory = p1.getInventory();

        // The 7s delay exceeds the 5s timeout, so the transaction is aborted
        JpaSystemException ex = Assertions.assertThrows(JpaSystemException.class, () -> {
            orderService.placeOrderWithTimeout(Order.builder()
                            .andOrderItem()
                            .withProduct(p1)
                            .withQuantity(1)
                            .withUnitPrice(p1.getPrice())
                            .then()
                            .build(),
                    7000);
        });
        Assertions.assertEquals("transaction timeout expired", ex.getMessage());
        Assertions.assertEquals(inventory, orderService.findProduct("p1").getInventory());
        logger.info("Exception thrown", ex);
    }

    @org.junit.jupiter.api.Order(2)
    @Test
    public void whenCreatingOrderWithTimeout_thenExpectCommit() {
        Product p1 = orderService.findProduct("p1");
        int inventory = p1.getInventory();

        // The 2s delay is within the 5s timeout, so the transaction commits
        orderService.placeOrderWithTimeout(Order.builder()
                        .andOrderItem()
                        .withProduct(p1)
                        .withQuantity(1)
                        .withUnitPrice(p1.getPrice())
                        .then()
                        .build(),
                2000);
        Assertions.assertEquals(inventory - 1, orderService.findProduct("p1").getInventory());
    }

    @org.junit.jupiter.api.Order(3)
    @Test
    public void whenCreatingOrderWithoutTimeout_thenExpectCommit() {
        Product p1 = orderService.findProduct("p1");
        int inventory = p1.getInventory();

        orderService.placeOrderWithoutTimeout(Order.builder()
                .andOrderItem()
                .withProduct(p1)
                .withQuantity(1)
                .withUnitPrice(p1.getPrice())
                .then()
                .build());
        Assertions.assertEquals(inventory - 1, orderService.findProduct("p1").getInventory());
    }
}

In this example, if the transaction times out, a JpaSystemException is thrown.

Conclusion

This article explains how to use the new transaction_timeout session variable in CockroachDB v23.1 to set a fixed upper limit for how long to wait for an explicit transaction to complete. It also provides an example of a service and repositories to test the timeout in an integration test, which verifies that when the timeout expires, the transaction is rolled back and the inventory remains unchanged.

One-Phase Commit Transaction Strategy

Kai Niemi — Sun, 30 Apr 2023 18:29:40 GMT

Introduction

A commonly adopted transaction strategy can be described as the best-efforts one-phase-commit (1PC) pattern. It's different from a global XA/2PC protocol where an external transaction manager ensures that all transaction properties are maintained across the involved transactional resources (database, queue, etc).

The basic idea behind 1PC is to delay the commit in a transaction as late as possible so that the only things that can go wrong are infrastructure failures, which are rare. All business processing failures are caught before the commit happens.

It is a relaxation of ACID properties spanning multiple transactional resources. That also means there's a certain risk for system inconsistency in a worst-case scenario, which can be mitigated if the processing is idempotent. For this strategy to work as safely as possible, idempotency is key.

This concept is also described in full detail in Dr David Syer's article Distributed transactions in Spring, with and without XA from 2009.

Scenarios

To illustrate 1PC let's review a couple of examples.

Database and Message Broker

Consider a typical service activator scenario where there's a database and a JMS message queue involved. It includes a write to the database and an acknowledgement to the broker of receiving a message. Both these operations are independent, as in there's no atomicity.

This scenario also maps to Kafka which uses consumer offsets and message retention rather than ephemeral message acks (removed once ack:ed for queue).

1. Start messaging transaction (broker delivers a message)
2. Receive message
3. Start database transaction
4. Update database
5. Commit database transaction
6. Commit messaging transaction (ack is sent to broker upon which the message is removed from destination)

The order of the first four steps is not important. What is important is that the message must be received before updating the database and each transaction must start before its corresponding resource is used.

Dual-write problem (below):

@Component
public class RegistrationConsumer {
    @JmsListener(destination = "${active-mq.topic}", containerFactory = "jmsListenerContainerFactory")
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void receiveMessage(RegistrationEvent event) {
        registrationRepository.save(toEntity(event));
    }
}

The following order is therefore just as valid:

1. Start messaging transaction
2. Start database transaction
3. Receive message
4. Update database
5. Commit database transaction
6. Commit messaging transaction

It's important that the last two steps (5 and 6) happen in that order and come last. It's better to surface business processing violations (bad input, rule violations, constraint violations etc) before sending things to be made permanent. When flushing database operations before acknowledging the message, there's less chance of both systems going out of sync.

An out-of-sync condition could be that the database transaction commits, but the message broker ack fails. Or the other way around, the commit fails and the ack succeeds. In either case, it will result in double-processing the same event and if the database writes are nonidempotent you end up with multiple side effects.
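One common way to make a consumer idempotent is to remember which message IDs have already been applied and skip redeliveries. The sketch below shows the idea with an in-memory set; in a real system, the set would be a table with a unique key written in the same database transaction as the business update. The class and the use of a message ID as the deduplication key are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of an idempotent consumer; the in-memory set stands in for a
// "processed messages" table with a unique key (an assumption, see lead-in).
final class IdempotentConsumer {
    private final Set<String> processedMessageIds = new HashSet<>();
    private int sideEffects;

    /** Applies the message once; returns false when it is a redelivery. */
    boolean onMessage(String messageId) {
        if (!processedMessageIds.add(messageId)) {
            return false; // already processed: acknowledge without side effects
        }
        sideEffects++; // the non-idempotent business write would go here
        return true;
    }

    int sideEffects() {
        return sideEffects;
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        consumer.onMessage("m1");
        consumer.onMessage("m1"); // broker redelivers after a failed ack
        consumer.onMessage("m2");
        System.out.println("Side effects: " + consumer.sideEffects());
    }
}
```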

In the case of object-relational-mappers (ORM), most database actions take place during the commit phase due to the first-level cache. It's at that point where the JPA provider (Hibernate) performs update optimizations (collapsing) and determines what SQL statements to send to the database. It's also the phase where the database may raise data model constraint violations to preserve integrity.

Things that can go wrong in the messaging transaction are network and process failures with the broker, which are less likely to occur.

Database and Remote API Call

Consider a typical service boundary scenario where there's a database and a foreign API service involved. It includes a write to the database and an API call to the remote endpoint, which in turn creates some side-effect. Both these operations are independent, as in there's no atomicity.

Start database transaction
Update database
Send a POST request to the remote API
Commit database transaction

First off, it's not advisable to invoke remote calls from a transaction context. Therefore, the minimum ask would be to order the steps accordingly:

Send a POST request to the remote API
Start database transaction
Update database
Commit database transaction

Better, but still there's a potential issue here if the database transaction fails. In that case, when you retry the entire operation there will be another POST request sent to the endpoint. If that endpoint is nonidempotent you may end up with multiple side effects. Essentially this is the same problem as in the first example, called non-atomic dual-writes.

@Service
public class TransferService {
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void createTransfer(TransferEntity entity) {
        ResponseEntity<String> response
                = new RestTemplate().postForEntity("https://api.bank.com", toRequestPayload(entity), String.class);
        if (!response.getStatusCode().is2xxSuccessful()) {
            throw new IllegalStateException("Disturbance!");
        }
    }
}

However, in this scenario, if the endpoint is idempotent (invoking many times is the same as invoking once) then it doesn't matter how many times you retry it. It will only have one single side effect.

CockroachDB and XA

Currently CockroachDB doesn't support the XA protocol but there's a tracking issue for it. The good thing is that XA-distributed transactions are not strictly needed to support the above scenarios. There are plenty of alternative options, such as:

Saga pattern:
- A decomposed version of 2PC where involved services implement participation and compensation methods as part of an agreement protocol. Either using an orchestrated or choreographed approach.
- The practical use is between disparate services (not between databases and/or brokers)
- Quite complicated to implement and test and reduces understandability
Outbox pattern:
- Domain events are written to the database as part of the local transaction.
- Domain events are published downstream after the commit point using CDC
- Avoids the non-atomic dual write problem.
- The practical use is between disparate services
Inbox pattern:
- Incoming messages are stored in the database and then CDC is used to publish or self-subscribe to the messages
- Offloads the message broker and adds retention
- The practical use is between disparate services
1PC with idempotency
- As described in this article
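To make the outbox idea a bit more concrete, here is a minimal in-memory sketch. The two lists stand in for a business table and an outbox table written in the same local transaction, and relay() stands in for the CDC step that publishes events after the commit point. All names are hypothetical, and the transaction is simulated rather than real.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// In-memory illustration of the outbox pattern (no real database or CDC involved).
final class OutboxSketch {
    private final List<String> orders = new ArrayList<>();    // business table
    private final Deque<String> outbox = new ArrayDeque<>();  // outbox table
    private final List<String> published = new ArrayList<>(); // downstream topic

    /** Simulated local transaction: the order and its event are stored together or not at all. */
    void placeOrder(String orderId, boolean failBeforeCommit) {
        if (failBeforeCommit) {
            // Simulated rollback: neither the order nor the event is written
            throw new IllegalStateException("Rolled back: " + orderId);
        }
        orders.add(orderId);
        outbox.add("OrderPlaced:" + orderId);
    }

    /** Relay step run after commit; drains the outbox and publishes downstream. */
    int relay() {
        int count = 0;
        String event;
        while ((event = outbox.poll()) != null) {
            published.add(event);
            count++;
        }
        return count;
    }

    List<String> published() {
        return List.copyOf(published);
    }

    public static void main(String[] args) {
        OutboxSketch sketch = new OutboxSketch();
        sketch.placeOrder("o1", false);
        try {
            sketch.placeOrder("o2", true);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        System.out.println("Relayed " + sketch.relay() + " event(s): " + sketch.published());
    }
}
```

Because the event only exists if the local transaction committed, there is no window where the database write succeeds but the event is lost, or the other way around.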

Conclusion

This article outlined a commonly adopted transaction strategy described as the best-efforts one-phase-commit (1PC) pattern. It offers a simple, low-effort alternative to XA, provided that operations are idempotent.

Testing Serializable Isolation in CockroachDB

Kai Niemi — Sun, 30 Apr 2023 10:06:47 GMT

As a follow-up to A Basic Guide to Transaction Isolation, this article will reproduce a handful of interesting tests described on the PostgreSQL SSI page. Only this time for CockroachDB.

Another great resource to illustrate the behaviour of serializable is Martin Kleppmann's Hermitage project and the CockroachDB contribution. It goes through a rich set of anomalies ranging from dirty writes to write skew on disjoint and predicate reads.

Examples

These tests were executed using the following CockroachDB version:

$ cockroach version
Build Tag:        v22.2.8
Build Time:       2023/04/17 13:22:08
Distribution:     CCL
Platform:         linux amd64 (x86_64-pc-linux-gnu)
Go Version:       go1.19.6
C Compiler:       gcc 6.5.0
Build Commit ID:  9a7c644e565b21d29db26a0a82524a00809d0a8c
Build Type:       release

First, create a test database:

cockroach sql --insecure --host=localhost -e "CREATE database test"

Next, start three separate shell windows representing transactions T1, T2 and T3. Whenever there's a "-- T1" comment for a SQL statement, run that statement in the designated console session.

cockroach sql --insecure --host=localhost --database test # for T1
cockroach sql --insecure --host=localhost --database test # for T2
cockroach sql --insecure --host=localhost --database test # for T3

Black and White Marbles

This is a test for Write Skew (A5B), prevented by serializable. Write skew occurs when two concurrent transactions each read data that the other one writes, and both then commit without seeing each other's changes.

Schema setup:

create table if not exists marbles (
  id    bigint      not null primary key,
  color varchar(25) not null
);

delete from marbles where 1=1;

insert into marbles values
  (1,'black'),
  (2,'black'),
  (3,'black'),
  (4,'black'),
  (5,'black'),
  (6,'white'),
  (7,'white'),
  (8,'white'),
  (9,'white'),
  (10,'white');

Note: The set transaction isolation level serializable part is redundant for CockroachDB since it's the default (and only level supported).

Observe that CockroachDB serializable prevents this anomaly:

begin; set transaction isolation level serializable; -- T1
begin; set transaction isolation level serializable; -- T2
update marbles set color = 'black' where color = 'white'; -- T1
update marbles set color = 'white' where color = 'black'; -- T2
commit; -- T1. First commit wins.
commit; -- T2. ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key /Table/137/1/6/0): "sql txn" meta={id=f8ee6d8c key=/Table/137/1/1/0 pri=0.00066739 epo=0 ts=1682691951.358336984,2 min=1682691934.561490359,0 seq=5} lock=true stat=PENDING rts=1682691934.561490359,0 wto=false gul=1682691935.061490359,0
SQLSTATE: 40001
HINT: See: https://www.cockroachlabs.com/docs/v22.2/transaction-retry-error-reference.html#retry_serializable

All the colours must match:

SELECT * from marbles order by id;
  id | color
-----+--------
   1 | black
   2 | black
   3 | black
   4 | black
   5 | black
   6 | black
   7 | black
   8 | black
   9 | black
  10 | black
(10 rows)

If you ran this under SNAPSHOT isolation (which CockroachDB doesn't provide), write skew would not be prevented and the result would look like this instead:

+----+-------+
| id | color |
+----+-------+
|  1 | white |
|  2 | white |
|  3 | white |
|  4 | white |
|  5 | white |
|  6 | black |
|  7 | black |
|  8 | black |
|  9 | black |
| 10 | black |
+----+-------+
-- (10 rows)

The colours have been flipped due to write skew, which is expected under concurrent execution with snapshot isolation (or read committed).
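To make the difference concrete, here is a minimal in-memory sketch in plain Java. It is not how any database implements isolation; the class `MarbleWriteSkew` and its methods are hypothetical names introduced only for illustration. It assumes that under snapshot isolation both transactions read the same starting snapshot and that both commit because their write sets are disjoint, and it compares that outcome with the two possible serial orders.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.TreeMap;

// Illustrative simulation only: models rows as a map of id -> colour.
final class MarbleWriteSkew {

    // Rows 1-5 are black and rows 6-10 are white, as in the schema setup.
    static Map<Integer, String> initial() {
        Map<Integer, String> db = new TreeMap<>();
        for (int id = 1; id <= 10; id++) {
            db.put(id, id <= 5 ? "black" : "white");
        }
        return db;
    }

    // Computes the writes of "update marbles set color = to where color = from"
    // based solely on the rows visible in the given read view.
    static Map<Integer, String> flip(Map<Integer, String> readView, String from, String to) {
        Map<Integer, String> writes = new TreeMap<>();
        readView.forEach((id, colour) -> {
            if (colour.equals(from)) {
                writes.put(id, to);
            }
        });
        return writes;
    }

    // Assumed snapshot behaviour: T1 and T2 both read the initial snapshot,
    // write disjoint rows, and both commit.
    static Map<Integer, String> snapshotOutcome() {
        Map<Integer, String> db = initial();
        Map<Integer, String> snapshot = new TreeMap<>(db);
        Map<Integer, String> t1Writes = flip(snapshot, "white", "black");
        Map<Integer, String> t2Writes = flip(snapshot, "black", "white");
        db.putAll(t1Writes);
        db.putAll(t2Writes);
        return db;
    }

    // Serial execution: the second transaction sees everything the first one wrote.
    static Map<Integer, String> serialOutcome(boolean t1First) {
        Map<Integer, String> db = initial();
        String[][] order = t1First
                ? new String[][] {{"white", "black"}, {"black", "white"}}
                : new String[][] {{"black", "white"}, {"white", "black"}};
        for (String[] tx : order) {
            db.putAll(flip(db, tx[0], tx[1]));
        }
        return db;
    }

    static int distinctColours(Map<Integer, String> db) {
        return new HashSet<>(db.values()).size();
    }

    public static void main(String[] args) {
        System.out.println("snapshot:     " + snapshotOutcome());
        System.out.println("serial T1,T2: " + serialOutcome(true));
        System.out.println("serial T2,T1: " + serialOutcome(false));
    }
}
```

Either serial order leaves all marbles in a single colour, while the simulated snapshot run leaves two colours. That is why the flipped result cannot be produced by any serial execution, and why a serializable database has to abort one of the two transactions.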

Red, Green and Blue Marbles

This example is similar to the previous one, only this time involving three transactions.

Setup schema:

create table if not exists marbles (
  id    bigint      not null primary key,
  color varchar(25) not null
);
delete from marbles where 1=1;
insert into marbles (id,color) values (1,'red');
insert into marbles (id,color) values (2,'red');
insert into marbles (id,color) values (3,'red');
insert into marbles (id,color) values (4,'yellow');
insert into marbles (id,color) values (5,'yellow');
insert into marbles (id,color) values (6,'yellow');
insert into marbles (id,color) values (7,'blue');
insert into marbles (id,color) values (8,'blue');
insert into marbles (id,color) values (9,'blue');

Again, CockroachDB SERIALIZABLE isolation prevents Write Skew (A5B):

begin; set transaction isolation level SERIALIZABLE ; -- T1
update marbles set color = 'yellow' where color = 'red'; -- T1
begin; set transaction isolation level SERIALIZABLE ; -- T2
update marbles set color = 'blue' where color = 'yellow'; -- T2. Blocks on T1 intents.
begin; set transaction isolation level SERIALIZABLE ; -- T3
update marbles set color = 'red' where color = 'blue'; -- T3. Blocks.
commit; --T1
-- (T2 unblocks - rows affected 6)
-- (T3 unblocks - rows affected 3)
commit; -- T3
commit; -- T2. ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict

The correct outcome (only yellow and red):

select * from marbles;
+----+--------+
| id | color  |
+----+--------+
|  1 | yellow |
|  2 | yellow |
|  3 | yellow |
|  4 | yellow |
|  5 | yellow |
|  6 | yellow |
|  7 | red    |
|  8 | red    |
|  9 | red    |
+----+--------+
(9 rows)

Intersecting Data

Two concurrent transactions read data, and each uses it to update the range read by the other.

Setup schema:

create table if not exists tab (
  id bigint not null,
  value bigint not null
);
delete from tab where 1=1;
INSERT INTO tab VALUES (1, 10), (1, 20), (2, 100), (2, 200);

Observe that CockroachDB guarantees an outcome equivalent to a serial execution:

begin; set transaction isolation level serializable; -- T1
SELECT SUM(value) FROM tab WHERE id = 1; -- T1
INSERT INTO tab VALUES (2, 30); -- T1
begin; set transaction isolation level serializable; -- T2
SELECT SUM(value) FROM tab WHERE id = 2; -- T2 (blocks)
commit; -- T1 (unblocks T2)
INSERT INTO tab VALUES (1, 330); -- T2
commit; -- T2
SELECT * from tab;

Yields:

SELECT * from tab;
  id | value
-----+--------
   1 |    10
   1 |    20
   2 |   100
   2 |   200
   2 |    30
   1 |   330
(6 rows)

Overdraft Protection

Here we will protect the invariant that the total of all accounts must exceed the amount requested.

Schema setup:

create table if not exists account
(
    name VARCHAR(25) not null,
    type VARCHAR(25) not null,
    balance NUMERIC(19, 2) not null,
    primary key (name, type)
);
delete from account where 1=1;
insert into account values
  ('alice','saving', 500),
  ('alice','checking', 500);

Let's try to play the bank under serializable isolation:

begin; set transaction isolation level serializable ; -- T1
select type, balance from account where name = 'alice'; -- T1
begin; set transaction isolation level serializable ; -- T2
select type, balance from account where name = 'alice'; -- T2
update account set balance = balance - 900.00 where name = 'alice' and type = 'saving'; -- T1
commit; -- T1
update account set balance = balance - 900.00 where name = 'alice' and type = 'checking'; -- T2
commit; -- T2 ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict:

Yields the following where the invariant holds:

select * from account;
  name  |   type   | balance
--------+----------+----------
  alice | checking |  500.00
  alice | saving   | -400.00
(2 rows)
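The SQL above leaves the invariant check to the application. As a rough sketch of what that check could look like and why the read view matters, the plain Java below models a withdrawal that is only allowed when the combined balance it has read covers the requested amount. The class `OverdraftInvariant` and its methods are hypothetical and purely in-memory; the "snapshot" run assumes both transactions decide based on the same stale read, which is the behaviour serializable isolation rules out.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

// Illustrative simulation only: models alice's accounts as a map of type -> balance.
final class OverdraftInvariant {

    static Map<String, BigDecimal> initial() {
        Map<String, BigDecimal> db = new HashMap<>();
        db.put("saving", new BigDecimal("500"));
        db.put("checking", new BigDecimal("500"));
        return db;
    }

    static BigDecimal total(Map<String, BigDecimal> balances) {
        return balances.values().stream().reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    // Debits one account if the total seen in readView covers the amount.
    // Returns false when the withdrawal is rejected.
    static boolean withdraw(Map<String, BigDecimal> readView, Map<String, BigDecimal> db,
                            String type, BigDecimal amount) {
        if (total(readView).compareTo(amount) < 0) {
            return false;
        }
        db.merge(type, amount.negate(), BigDecimal::add);
        return true;
    }

    // Both withdrawals decide based on the same initial snapshot (write skew).
    static BigDecimal totalAfterSnapshotRun() {
        Map<String, BigDecimal> db = initial();
        Map<String, BigDecimal> snapshot = new HashMap<>(db);
        withdraw(snapshot, db, "saving", new BigDecimal("900"));
        withdraw(snapshot, db, "checking", new BigDecimal("900"));
        return total(db);
    }

    // The second withdrawal sees the result of the first one.
    static BigDecimal totalAfterSerialRun() {
        Map<String, BigDecimal> db = initial();
        withdraw(db, db, "saving", new BigDecimal("900"));
        withdraw(db, db, "checking", new BigDecimal("900"));
        return total(db);
    }

    public static void main(String[] args) {
        System.out.println("snapshot total: " + totalAfterSnapshotRun());
        System.out.println("serial total:   " + totalAfterSerialRun());
    }
}
```

In the serial run the second withdrawal is rejected and the combined balance stays at 100, which matches the table above. In the simulated snapshot run both withdrawals pass the check and the combined balance drops to -800, violating the invariant.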

Deposit Report

Setup schema (before every test run):

create table if not exists control
(
    deposit_no int not null
);
create table if not exists receipt
(
    receipt_no bigint NOT NULL PRIMARY KEY DEFAULT unique_rowid(),
    deposit_no int not null,
    payee text not null,
    amount numeric(19,2) not null
);
DELETE from control where 1=1;
DELETE from receipt where 1=1;
insert into control values (1);
insert into receipt
  (deposit_no, payee, amount)
  values ((select deposit_no from control), 'Crosby', 100.00);
insert into receipt
  (deposit_no, payee, amount)
  values ((select deposit_no from control), 'Stills', 200.00);
insert into receipt
  (deposit_no, payee, amount)
  values ((select deposit_no from control), 'Nash', 300.00);

Test sequence:

begin; set transaction isolation level serializable ; -- T1
insert into receipt (deposit_no, payee, amount) values ( (select deposit_no from control), 'Young', 100.00 ); -- T1
select * from receipt; -- T1
begin; set transaction isolation level serializable ; -- T2
select deposit_no from control; -- T2
update control set deposit_no = 2 where 1=1; -- T2
commit; -- T2
begin; set transaction isolation level serializable ; -- T3
select * from receipt where deposit_no = 1; -- T3. Blocks on T1
commit; -- T1
-- (T3 unblocks)
commit; -- T3

Yields:

select * from receipt;
      receipt_no     | deposit_no | payee  | amount
---------------------+------------+--------+---------
  860561294810873858 |          1 | Crosby | 100.00
  860561295115354114 |          1 | Stills | 200.00
  860561295382970370 |          1 | Nash   | 300.00
  860561358736326657 |          1 | Young  | 100.00
(4 rows)

Rollover

This example was created to show that PostgreSQL can roll back read-only transactions to prevent serialization conflicts. It won't happen in CockroachDB.

Schema setup:

create table if not exists rollover (
  id int primary key,
  n int not null
);
delete from rollover where 1=1;
insert into rollover values (1,100), (2,10);

Financial transaction under serializable isolation:

begin; set transaction isolation level serializable ; -- T1
update rollover
  set n = n + (select n from rollover where id = 2)
  where id = 1; -- T1
begin; set transaction isolation level serializable ; -- T2
update rollover set n = n + 1 where id = 2; -- T2 - blocks on T1
commit; --T2
begin; set transaction isolation level snapshot ; -- T3
select count(*) from rollover; -- T3 - blocks on T1
commit; -- T1
select n from rollover where id in (1,2); -- T3

Conclusion

This article showcased a few examples outlined in the PostgreSQL SSI description page. It highlights some runtime differences between PostgreSQL SSI and CockroachDB serializable.

A Basic Guide to Transaction Isolation

Kai Niemi — Sun, 30 Apr 2023 09:54:43 GMT

ACID transactions are implemented differently in databases and provide different runtime characteristics towards applications. It's mainly manifested in terms of when different operations are blocked from proceeding or a transaction is forced to retry. That is, if the isolation level is indeed serializable, which is not always the case. Not all databases provide true ACID guarantees and that presents a problem if you are dependent on it.

The "I" in ACID stands for isolation, and in its formal sense that means serializable isolation: a database that formally claims to support ACID needs to provide the highest isolation level in the SQL standard, serializable. Serializable isolation guarantees that even though transactions may execute in parallel, the result is the same as if they had executed one at a time, without any concurrency.

Executing transactions serially would lead to the same result, but it would also destroy any performance aspirations. Concurrent execution is a must-have. One way to look at it is that it's a magic show hosted by the database, giving the illusion to clients they are the exclusive users of the database, completely free from interference from others.

(image from: https://blog.acolyer.org/2016/02/24/a-critique-of-ansi-sql-isolation-levels/)

Isolation levels are, however, confusing and ambiguous, in particular for distributed databases where you don't have a single time source. Not only are isolation levels difficult to understand, but the same name can also mean different things in different databases. Serializable in Oracle, for example, actually means Snapshot (which is weaker) and Repeatable Read in PostgreSQL means snapshot (which is stronger). Snapshot also permits write skew (A5B), which Repeatable Read does not. Then we have Oracle Read Consistency, which is like Read Committed, only stronger by advancing the transaction timestamp for each SQL statement.

This ambiguity presents a real challenge for application developers and architects. They are tasked to figure out when a given isolation level is sufficient for correct execution. It also makes it more difficult to think in terms of portability between databases when the behaviour is different. One piece of advice is that unless you are 100% sure of what anomalies business rule invariants are exposed to, then go for a higher level of isolation.

The goal of transaction isolation is to find a good balance between safety and performance for concurrent transactions. A database should allow concurrent access to data while still being safe, meaning that concurrent operations that happen to interleave, should not observe intermediate state, overwrite other transaction writes or violate invariants guarded by constraints. It's the database being liberal and conservative at the same time.

A higher isolation level reduces and even eliminates most known read/write conflict anomalies, at the expense of performance and rollbacks on contended operations. Performance is increased and transient errors are reduced by lowering the isolation level, effectively requiring less coordination and planning effort by the database to guarantee safe, concurrent execution. It depends on the database implementation though, and in some cases, the difference in performance is small for non-contending operations.

The main downside of lowering isolation is that applications become more exposed to read-write phenomena (anomalies) that may cause data loss or corruption in the worst case. These types of errors are quite difficult to track down and test for.

Read and Write Anomalies

The lowest isolation level is Read Uncommitted (RU), meaning basically that all (most) bets are off. It allows dirty reads (P1), where transaction T1 is allowed to read transaction T2's writes that haven't been committed yet. Read Uncommitted must prohibit dirty writes (P0) though, where T1 would modify T2's write before T2 has committed.

The highest ACID isolation level is serializability which means transactions are not exposed to any read/write anomalies. A client can safely read and write without having to worry about other transactions possibly performing the same operations. The database will guarantee that no client will ever observe any inconsistent state and that all invariants will be preserved at commit.

In between you have all the rest. Anomalies can either be permitted or prevented by using ANSI SQL isolation levels, or something even higher like strict serializability or linearizability (external consistency).

Common anomalies include:

Dirty write (P0)
Dirty read (P1)
Fuzzy read (P2)
Phantom (P3)
Strict Phantom (A3)
Lost update (P4)
Cursor lost update (P4C)
Read Skew (A5A)
Write Skew (A5B)

Surprisingly enough, the default isolation level in most modern databases is read committed (RC). It is a fundamentally unsafe isolation level exposed to lost updates (P4) and more. Still, many applications are using it and seem to work fine most of the time.

But how can you be sure you will not be the next Bitcoin exchange or e-commerce site that gets exploited by weak isolation? Trying to navigate through these things is not far from trying to beat classic Minesweeper.

Isolation Levels in Databases

Modern lock-free MVCC databases (and others) like Oracle and PostgreSQL default to Read Committed (RC). As MVCC databases, they also support snapshot isolation (SI) which is a slightly weaker model than serializable.

SI does not use locking, which is sort of the point with MVCC. Instead, every transaction operates on an isolated snapshot of committed data, and its own writes are not visible to other transactions until it commits.

SI sits somewhere between read committed and serializable (Berenson and Adya). It prevents P4 (lost update) by applying a first-committer-wins policy and, like Repeatable Read (RR), it prohibits P0, P1 and P2. It prevents a special version of P3 called A3 (Phantom) that RR allows, but allows A5B (write skew) that RR prevents. Write skew is when two concurrent transactions are writing based on reading a data set which overlaps what the other is writing.
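The first-committer-wins rule can be sketched as a simple write-set check. The plain Java below is a conceptual illustration under that assumption, not the implementation of any particular database; the class `FirstCommitterWins` and the row keys are hypothetical. It shows why the rule catches a lost update but lets write skew through.

```java
import java.util.Collections;
import java.util.Set;

// Illustrative model of the snapshot isolation commit rule.
final class FirstCommitterWins {

    // A transaction may commit only if none of the rows it wrote were also
    // written by a transaction that committed after its snapshot was taken.
    static boolean canCommit(Set<String> writeSet, Set<String> committedSinceSnapshot) {
        return Collections.disjoint(writeSet, committedSinceSnapshot);
    }

    public static void main(String[] args) {
        // Lost update (P4): both transactions write the same row, so the second committer aborts.
        System.out.println(canCommit(Set.of("account:1"), Set.of("account:1"))); // prints false

        // Write skew (A5B): each transaction reads both rows but writes a different one,
        // so the write sets are disjoint and both commit.
        System.out.println(canCommit(Set.of("alice:checking"), Set.of("alice:saving"))); // prints true
    }
}
```

Because the check only looks at write sets, anything a transaction merely read is invisible to it. Closing that gap requires tracking reads as well, which is what serializable implementations add on top.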

PostgreSQL (since 9.1) implements serializable isolation on top of SI, called serializable snapshot isolation or SSI. It prevents A5B (write skew) by forcing conflicts through either promoting reads to writes or by analyzing dependency cycles in transactions.

If you are not already confused at this point, then congratulations. These conditions and more are outlined in far more detail in the A Critique of ANSI SQL Isolation Levels paper.

CockroachDB only implements serializable isolation, which narrows down the options. It gives peace of mind if you are concerned about read/write anomalies.

Conclusion

ACID transaction isolation levels are ambiguous and tricky to grok. Most modern databases implement transaction isolation differently and often have weak defaults where applications need to opt-in for higher isolation. In CockroachDB, the only choice is serializable which is the highest level in the SQL standard.

Entity Control Boundary in Spring Boot Apps

Kai Niemi — Sun, 30 Apr 2023 09:18:04 GMT

Introduction

In a previous article, we looked at an architectural pattern named entity-control-boundary (ECB) mapped to Spring meta-annotations for transaction management and retries. That post didn't go very deep into this architectural pattern, which is the purpose of this article.

ECB is an architecture pattern originally coined in Ivar Jacobson's use-case-driven object-oriented software engineering (OOSE) method published in 1992. In other words, it dates way back in time yet it's not super well-known.

This pattern fits really well into organising transaction boundaries in application code and you don't need to go all-in on all the fun stuff like UML, waterfall or unified process to use it. It's really straightforward and mainly serves a documentative and declarative purpose.

The ECB pattern is centred around defining clear responsibilities and interactions between different categories of classes. It can be broken down into four elements of a robustness diagram: Actor, Boundary, Control and Entity.

The following robustness constraints apply:
Actors may only know and communicate with boundaries.
Boundaries may communicate with actors and controls only.
Controls may know and communicate with boundaries and entities, and if needed other controls.
Entities may only know about other entities but could communicate also with controls.

Source: https://en.wikipedia.org/wiki/Entity-control-boundary

In other words, there could be dependencies and interactions like this in a single service:

ECB offers structure and low-effort consistency to transaction boundary demarcation, simply by using transaction attributes or preferably dedicated meta- or stereotype annotations. Without this level of structure, chances are that the boundaries become unclear and blurry, which may result in hard-to-find errors and system inconsistencies.

Definitions

Let's map the ECB concept to concrete architectural elements (namespaces and annotations) that you typically see in a Spring Boot application.

Boundary

A boundary is coarse-grained and exposes functionality towards users or other systems (actors). It is typically implemented as a web controller or business service facade. It should be thin and delegate business processing to more fine-grained control services, if applicable. It acts both as a remoting and transaction boundary.

A boundary should never be invoked from within a transaction context. It means that only a boundary is allowed to create new transactions. To that end, boundaries must have the REQUIRES_NEW transaction attribute on their public, transactional methods. This propagation attribute will always create a new transaction and suspend any existing one.

Transaction suspension via REQUIRES_NEW and nested transactions via NESTED are different things. Nested transactions allow for a rollback to the beginning of the sub-transaction while keeping the transactional state of the outer transaction. Nested transactions are expressed using savepoints. Unfortunately, savepoints are not fully supported in all Java application frameworks. They are available in Spring Boot, although not with JPA. If you are using something other than JPA, then savepoints open up a few more opportunities. For the transaction boundary type discussed here, however, we don't use nested transactions (via savepoints) but just regular unnested local transactions.

Characteristics

Key characteristics for a transaction and remoting boundary:

Independent of other service facades or web controllers.
Granularity is more coarse-grained than a service.
The layer that exposes functionality outside of the business tier.
The only layer that is accessible from an external client (typically via a web API).
Methods are preferably idempotent for client convenience.
Never invoked within a transaction context.

Solution

Typical implementation elements in Spring Boot:

Can be a Business Facade, Web Controller or Service Activator (Message Listener) where:
- A business facade uses @Service
- A controller uses @RestController
- A service activator uses @Service
Implements simple business logic or delegates to services (Control) or even repositories (Entity).
Always uses transaction demarcation REQUIRES_NEW since it's a boundary.
- @Transactional(propagation = Propagation.REQUIRES_NEW)

Conventions

Represents the remoting entry point (when it's a web @Controller).
An interface or class with thin, coarse-grained methods.
Should be located in a dedicated boundary or service namespace.
Should use the documentative meta-annotation to emphasise its architectural role.
The business interface should be named after business concepts.

Example of a boundary meta-annotation (annotation describing or grouping other annotations). Notice that it incorporates the Spring @Transactional annotation with propagation REQUIRES_NEW.

@Inherited
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@Transactional(propagation = Propagation.REQUIRES_NEW)
public @interface Boundary {
}

Boundary service facade example:

@Service
public class TransactionServiceImpl implements TransactionService {
    @Autowired
    private AccountRepository accountRepository;

    @Autowired
    private TransactionRepository transactionRepository;

    @Override
    @Boundary
    public Transaction submitTransferRequest(TransferRequest request) {
        if (!TransactionSynchronizationManager.isActualTransactionActive()) {
            throw new IllegalStateException("No transaction context");
        }
    }
}

Boundary web controller example:

@RestController
@RequestMapping(value = "/api/transaction")
public class TransactionController {
    @GetMapping
    @Boundary
    public PagedModel listTransactions(@PageableDefault(size = 5) Pageable page) {
        return pagedTransactionResourceAssembler
                .toModel(bankService.find(page), transactionResourceAssembler);
    }
}

Boundary service activator example:

@Service
public class KafkaChangeFeedConsumer {
    @KafkaListener(topics = TOPIC_ACCOUNTS, containerFactory = "accountListenerContainerFactory")
    @Boundary
    public void accountChanged(@Payload AccountPayload event,
                               @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
                               @Header(KafkaHeaders.OFFSET) int offset) {
    }
}

Control

A control service is a fine-grained realization of activities or sub-processes. It's where business functionality is implemented. It must always be invoked within the context of a transaction and is not allowed to create new transactions. To that end, it must have the MANDATORY transaction attribute. The same policy applies to repository interfaces or classes that perform persistence logic. A repository is not allowed to create a new transaction.

A control service that is just a thin delegation layer between a boundary and repository contract (like Spring Data repository) adds no real value. In that case, to reduce boilerplate code, consider accessing repository resources directly from boundaries, effectively collapsing the boundary and service into one artefact.

Characteristics

Services should be independent of other services.
The granularity is finer than a boundary.
Services are not available or visible outside of the business tier.
Methods should be idempotent and always be invoked from a transactional context.

Solution

Can be a business service that implements business logic.
Not allowed to start new transactions.
Use MANDATORY transaction propagation attribute.
- @Transactional(propagation = Propagation.MANDATORY)

Conventions

Interface or class with fine-grained methods and PDOs.
Should be located in a dedicated ..service package.
Should use a documentative meta-annotation to emphasise its architectural role.
The business interface should be named after business concepts.

Example of a control service meta-annotation. Notice that it incorporates the Spring @Transactional annotation with propagation MANDATORY.

@Inherited
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@Transactional(propagation = Propagation.MANDATORY)
public @interface Control {
}

A control service example:

@Service
public class DefaultTransactionService implements TransactionService {
    @Autowired
    private AccountRepository accountRepository;

    @Override
    @Control
    public Transaction createTransaction(UUID id, TransactionForm transactionForm) {
        Assert.isTrue(TransactionSynchronizationManager.isActualTransactionActive(), "Expected transaction");
    }
}

Entity

Entities are a static model representation of the application state mapped against a database. Usually through some ORM technology such as JPA and Hibernate. Entities must never be visible outside the system boundaries or JVM, but can optionally have DTO or value object/model representations. In most cases, DTOs add little value to hide implementation detail and protect internal entities. One exception could be representation models in Spring HATEOAS that add hypermedia controls on top of domain entities (you can use EntityModel also).

In terms of ECB, the entity element is simply represented by JPA entities and Spring Data repositories with the @Repository annotation. There's not much more to it than emphasising the architectural role.

Transaction Retries

Now that we are familiar with ECB, let's wrap things up by also adding the capability to retry transient SQL errors. When a SQL error with the state code 40001 is encountered, it's typically safe to retry the local transaction from a database point of view. If the retried business facade method and its descendants are non-idempotent, then some precautions may be needed to avoid multiple side effects (again, strive for idempotency).

The simplest approach is to use Spring Retry with a custom exception classifier and exponential backoff. How this is done is outlined in more detail in this post.

A brief example of a retriable boundary for completeness (notice the @Retryable):

@Service
public class OrderService {
    @Boundary
    @Retryable
    public Order updateOrderStatus(Long orderId, ShipmentStatus status, BigDecimal amount) {
        // Call DB and maybe do other idempotent stuff
        return order;
    }
}

You can also push the @Retryable annotation to @Boundary which then automatically adds the retry capability to all annotated methods.

@Inherited
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@Transactional(propagation = Propagation.REQUIRES_NEW)
@Retryable(exceptionExpression = "@cockroachExceptionClassifier.shouldRetry(#root)",
        maxAttempts = 5,
        backoff = @Backoff(maxDelay = 15_000, multiplier = 1.5))
public @interface Boundary {
}

The exception classifier for completeness:

@Component
public class CockroachExceptionClassifier {
    private final Logger logger = LoggerFactory.getLogger(getClass());

    private static final String SERIALIZATION_FAILURE = "40001";

    public boolean shouldRetry(Throwable ex) {
        if (ex == null) {
            return false;
        }
        Throwable throwable = NestedExceptionUtils.getMostSpecificCause(ex);
        if (throwable instanceof SQLException) {
            return shouldRetry((SQLException) throwable);
        }
        logger.warn("Non-transient exception {}", ex.getClass());
        return false;
    }

    public boolean shouldRetry(SQLException ex) {
        if (SERIALIZATION_FAILURE.equals(ex.getSQLState())) {
            logger.warn("Transient SQL exception detected : sql state [{}], message [{}]",
                    ex.getSQLState(), ex.toString());
            return true;
        }
        return false;
    }
}
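For readers who want to see the mechanics without Spring Retry, the following is a minimal sketch of the same idea in plain Java: classify the exception by SQL state 40001 and retry the whole unit of work with exponential backoff. The class `RetryOnSerializationFailure`, its method names and the parameter values are assumptions made for illustration. The unit of work is expected to start and commit its own transaction on every attempt, which is what a boundary does.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Illustrative retry helper; not part of Spring Retry or any CockroachDB driver.
final class RetryOnSerializationFailure {
    static final String SERIALIZATION_FAILURE = "40001";

    // Walks the cause chain looking for a SQLException with SQL state 40001.
    static boolean isRetryable(Throwable ex) {
        for (Throwable t = ex; t != null; t = t.getCause()) {
            if (t instanceof SQLException
                    && SERIALIZATION_FAILURE.equals(((SQLException) t).getSQLState())) {
                return true;
            }
        }
        return false;
    }

    // Runs the unit of work, retrying transient failures up to maxAttempts times
    // and multiplying the delay between attempts by the given factor.
    static <T> T execute(Callable<T> unitOfWork, int maxAttempts,
                         long initialDelayMillis, double multiplier) throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return unitOfWork.call();
            } catch (Exception ex) {
                if (!isRetryable(ex) || attempt >= maxAttempts) {
                    throw ex;
                }
                Thread.sleep(delay);
                delay = (long) (delay * multiplier);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated unit of work that fails twice with 40001 before succeeding.
        String result = execute(() -> {
            if (++calls[0] < 3) {
                throw new SQLException("restart transaction", SERIALIZATION_FAILURE);
            }
            return "committed";
        }, 5, 10, 1.5);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "committed after 3 attempts"
    }
}
```

Non-transient exceptions and exhausted attempts are rethrown unchanged, so callers still see the original failure.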

General Guidelines

A few general guidelines for transaction management.

Avoid Remote Calls

Avoid remote calls to external resources from within a database transaction context. You may end up locking up resources for a long time in case of network communication problems or issues with the target endpoint. You are also exposed to the challenge of dual writes, where one part succeeds and the other part fails leaving the system in an inconsistent state (typically addressed with the outbox pattern).

Read-Only Implicit Transactions

If you are not performing any writes, then consider using read-only, implicit transactions. The readOnly attribute in @Transactional gives a clue to the transaction management that it's a read-only operation. The JPA provider may then perform certain optimizations.

Non-transactional read-only (implicit transactions) methods can use SUPPORTS propagation. This works as long as the autoCommit flag is left enabled in the data source, which is the default.

HikariDataSource ds = properties
        .initializeDataSourceBuilder()
        .type(HikariDataSource.class)
        .build();
ds.setAutoCommit(true); // true is the default, setting it to false makes all transactions explicit

Conclusion

This article describes the ECB architecture pattern to enhance transaction robustness in Spring Boot apps. Database transactions must always be started by boundaries and nowhere else. A boundary is typically a web controller or business service facade.

Boundaries use REQUIRES_NEW propagation.
Control services and repositories use MANDATORY propagation.
Non-transactional read-only methods can use SUPPORTS propagation.

Working with BLOBs in CockroachDB

Kai Niemi — Thu, 27 Apr 2023 11:51:15 GMT

Introduction

Databases aren't great for storing binary large objects, aka BLOBs. Large here means several megabytes in size. If that is needed, then it's likely much more performant to just use the filesystem or cloud storage and only store references in the database for structure.

Smaller objects are typically fine to store in the database. To that end, this article will demonstrate how to manage BLOBs using JPA and Hibernate along with CockroachDB. In CockroachDB, the BLOB type is an alias for the BYTES data type. As mentioned in the docs, it's recommended to keep values under 1 MB to ensure adequate performance. Above that threshold, write amplification and other considerations may cause significant performance degradation.

Mapping BLOBs in JPA

When using Hibernate, you typically use the @Lob annotation and java.sql.Blob which maps to the SQL BLOB data type.

@Entity
@Table(name = "attachment")
public class Attachment {
    @Column(name = "content")
    @Basic(fetch = FetchType.LAZY)
    @Lob
    private Blob content;
    ...
}

You can also use a byte[] array or a String, but it's generally more performant to use a streaming approach using the Blob type.

Using BLOB Mappings

You need to use Hibernate's BlobProxy class to create a Blob. As you can see in the example below, it's pretty straightforward:

// Stores a BLOB represented by the inputStream
Blob content = BlobProxy.generateProxy(inputStream, contentLength);
Attachment attachment = new Attachment();
attachment.setContent(content);
attachment.setName(name);
attachment.setDescription(description);
attachmentRepository.save(attachment);

That's about it, very simple.

To read the BLOB back again, it's recommended to use the streaming approach:

// Lookup attachment by ID and stream the blob to the outputStream
Attachment attachment = attachmentRepository.getReferenceById(id);
try (InputStream in = new BufferedInputStream(
        attachment.getContent().getBinaryStream())) {
    FileCopyUtils.copy(in, outputStream);
} catch (SQLException | IOException e) {
    throw new DataRetrievalFailureException("Error reading attachment data", e);
}

If you were to provide a REST endpoint for downloading BLOB attachments, it could look like this when implemented using Spring Boot:

@GetMapping("/download/{id}")
public ResponseEntity downloadAttachment(@PathVariable("id") Long id) {
    Attachment attachment = attachmentService.findById(id);
    StreamingResponseBody responseBody =
            outputStream -> attachmentService.streamAttachment(attachment, outputStream);
    return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_TYPE, attachment.getContentType())
            .header(HttpHeaders.CONTENT_DISPOSITION, "inline")
            .header("Cache-Control", "no-cache, no-store, must-revalidate")
            .header("Pragma", "no-cache")
            .header("Expires", "0")
            .body(responseBody);
}

StreamingResponseBody is used for asynchronous request processing where the application can write directly to the response OutputStream.

Demo Project

This demo project is a runnable Spring Boot application that provides a REST API for querying and uploading attachments with BLOB content.

To build:

git clone git@github.com:kai-niemi/roach-spring-boot-v3
cd roach-spring-boot-v3
chmod +x mvnw
./mvnw clean install

To run:

cockroach sql --insecure --host=localhost -e "CREATE database spring_boot_demo"
java -jar spring-boot-blob/target/spring-boot-blob.jar

Check that the service is up at http://localhost:8090.

Upload an image file using cURL:

curl http://localhost:8090/attachment/form \
  -H "Content-Type: multipart/form-data" \
  -v \
  -F "content=@spring-boot-blob/src/test/resources/test.jpg" \
  -F "fileName=test.jpg" \
  -F "description=test.jpg"

Source Code

The code for this article is available on GitHub.

Conclusion

This article provides a guide on how to manage binary large objects (BLOBs) using JPA and Hibernate with CockroachDB. It demonstrates the @Lob annotation, java.sql.Blob type, and a runnable Spring Boot example.

Defining Quality Attributes

Kai Niemi — Thu, 27 Apr 2023 07:37:44 GMT

Introduction

Quality attributes are synonymous with non-functional requirements: the properties and characteristics of a software system that are not directly related to a functional aspect. Quality attributes are what ultimately define a system's runtime and evolutionary characteristics. These attributes indicate the software architecture style of the system and how different implementation mechanisms, including the database, support these qualities.

In this article, we'll take a look at some of these attributes and more specifically how they can be defined, quantified and addressed when using the unique capabilities of CockroachDB.

Commonly used quality attributes for software systems (there are many more):

Scalability - A system's ability to scale with increasing load and business complexity.
Reliability - A measure of a system's ability to detect and recover from failures and deliver correct, consistent and reliable results.
Performance - The time it takes to execute an operation, usually measured in response time, transaction time and throughput (work per time unit).
Availability - A measure of the ability of a system to function in a state of serious service or infrastructure degradation.
Evolvability - A system's ability to accommodate changes at low cost and with small client/user impact.
Maintainability - A system's ability to be diagnosed and repaired after an error occurs.
Interoperability - How the system interacts with other subsystems or foreign services.
Visibility - A system's support for debugging and real-time monitoring.
Security - A system's ability to support security controls including access controls, encryption, data isolation, secure information processing and auditing for compliance.

Functional attributes in a software system are useless without quality, and non-functional attributes are useless without relevant purpose and meaning for a business. It's a task for architects and developers to map the problem domain against a solution domain and deliver solutions that meet both functional and non-functional requirements. The problem domain describes the what & why and the solution describes the how.

The re-usable artifacts are typically software design guidelines and principles that both guide the development of new software components and help slow the deterioration of software over time due to change. Change is a given for any software component, and that is where the real cost sits in the lifecycle of software systems, not so much in the initial development effort. The architectural decisions made early on in a product's life cycle to support things like evolvability, maintainability and low cost of change are very important for the total cost of ownership.

There's always a balance between the amount of time invested in quality attributes against getting a new product or feature out on the market. Things need to be prioritized like anything else and the best way is to ask the business stakeholders what is most important to deliver.

Qualifying Quality Attributes

Defining clear, measurable and quantifiable requirements is an art form, both business-oriented and technical in nature. To get good at it you need to ask questions, and a lot of them.

How can we define, measure and communicate the importance of abstract things like evolvability? Let's give it a try by listing a few questions for each listed quality attribute. Because the database is a critical infrastructure component for any software system, let's also see how each quality attribute can be addressed from a database point of view. Not just any database but CockroachDB.

Scalability

A system's ability to scale with increasing load and business complexity.

Scalability describes the ability of a system to cope with increased load, most often measured in latency percentiles and throughput. When the load on a system increases, it's relevant to observe how many more resources are needed to maintain the same level of performance. When a business grows in terms of increased customer demands, new markets or acquisitions, scalability also describes the ability to adapt systems to that new reality without having to undergo major refactoring or redesign efforts.

Questions:

What are the data volumes and what's the expected growth over time?
What is the impact of traffic, data or customer base growing by 10x?
What level of scalability is relevant, local, regional or global?
How important is it to deliver a consistent customer experience globally?
What does the traffic load look like for steady state, peak, extended peak and stress?
Does the system need to auto-adjust to increased/decreased spike demands?
Is data archiving to offline systems needed?

Solutions:

CockroachDB is a geo-distributed SQL database designed to scale horizontally by adding more nodes to a cluster, increasing compute and IO capacity. Given the transactional properties and consistency guarantees, it also enables crafting multi-active systems, where it's relevant to distinguish between response times and service times.

Service time is the time it takes to process a synchronous request entering a service's boundary and prepare a response. Response time is service time plus the time it takes to transport traffic over the network, including queuing delays. For a cluster spanning for example SA-EU-AP, we can drastically reduce response times by servicing requests close to where the data is stored.
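The relationship between the two can be sketched as simple arithmetic. The latency figures below are hypothetical and only chosen to show the effect of proximity; the class and method names are likewise made up for this example, and queuing delay is folded into the network time.

```java
import java.time.Duration;

// Illustrative only: numbers and names are assumptions, not measurements.
class ResponseTimeSketch {
    // Response time = service time + network transport time (including queuing delays).
    static Duration responseTime(Duration serviceTime, Duration networkTime) {
        return serviceTime.plus(networkTime);
    }

    public static void main(String[] args) {
        Duration service = Duration.ofMillis(10);
        // Same service time, served from a nearby region vs. a distant one.
        Duration nearby = responseTime(service, Duration.ofMillis(2));
        Duration distant = responseTime(service, Duration.ofMillis(150));
        System.out.println("Nearby region:  " + nearby.toMillis() + " ms");
        System.out.println("Distant region: " + distant.toMillis() + " ms");
    }
}
```

With identical service time, the network component dominates the distant case, which is why data placement close to the caller has such a large effect on response time.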

Multi-active systems add the ability to control both service and response time through physical network topologies, data locality and replication policies. Depending on how nodes and replicas are arranged, both latency and fault tolerance or survival goals can also be controlled.

It provides the equivalence of a content-delivery network (CDN) for transactional data, logically spanning the entire globe. No matter which (edge) node you interact with, it will provide a consistent and accurate result.

Reliability

A measure of the system's ability to detect and recover from failures and deliver correct, consistent and reliable results.

Reliability describes the ability of a system to operate correctly (do the right thing) at the desired level of performance, both under heavy concurrent workloads and in the event of infrastructure failures like partial and full zone or region failures. It's more difficult to measure than scalability but can be defined in different ways.

For instance:

Not lose or corrupt data because of infrastructure failures (no partial commits)
Not lose or corrupt data because of concurrency anomalies (ex: lost updates, phantom reads, read/write skew)
Not provide stale data where authoritative data is expected
Prevent diverging histories of state and throwing away committed writes when healing from network partitions
Not breach correctness rules or invariants (ex: negative balance in accounting)
Tolerate human mistakes and errors

Questions:

What is the anatomy of a typical business transaction?
- How does it get triggered?
- What is the average duration?
- How much information needs to be scanned vs returned?
- What are the expected success and failure outcomes?
How are key business rule invariants to be protected?
Must reads always be authoritative, or can they be potentially stale?
How are business transactions spanning multiple services handled?
How are transient transaction errors handled?
How are timeouts/indeterminate outcomes handled?
How is service idempotency implemented?

Solutions:

One feature of multi-active systems is the ability to operate simultaneously from multiple geographies with sustained throughput and transactional integrity, both during steady state and during disruptions or even disasters in other regions. In a 3-region CockroachDB deployment, one entire region can have an outage without affecting forward progress in the other two.
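The survivability claim follows from majority-quorum arithmetic. The sketch below assumes one voting replica per region (or failure domain), which is a simplification; the class and method names are made up for this example.

```java
// Illustrative quorum arithmetic, assuming one voting replica per failure domain.
class QuorumSketch {
    // Smallest number of replicas that forms a strict majority.
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    // Number of replicas that can be lost while a majority still remains.
    static int tolerableFailures(int replicas) {
        return replicas - quorum(replicas);
    }

    public static void main(String[] args) {
        for (int n : new int[]{3, 5, 7}) {
            System.out.println(n + " domains: quorum " + quorum(n)
                    + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

With three domains the quorum is two, so one domain can be lost; with five the quorum is three, so two can be lost. An even count adds no extra tolerance: four domains still tolerate only one failure.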

Performance

The time it takes to execute an operation, usually measured in response time or transaction time and throughput (work per time unit).

Questions:

How is customer experience affected by high response times and low throughput?
How many concurrent active sessions/users will connect to the system?
Are there specific performance and throughput targets (service level indicators) on a per-use case basis?
What is the ratio between reads and writes?
Can reads be served as potentially stale or always authoritative?
Will there be a caching tier to scale out reads, in that case, how is the cache invalidated and kept in sync?

Defining performance requirements should ideally be context or journey specific. Most larger systems are composed of different customer journeys such as registration, login, deposit, withdrawal, pay and so on. Not all journeys have the same NFRs and may touch different services, hence it makes sense to define the performance goals based on that rather than at individual service level.

For example, for journey X:

Raw data size: 4TB
Active connections: 400 to 500
Actual users: 7M
Active users: 500k
Reads must be authoritative
Sustained throughput under 60min, qualified with:
- 5,000 business transactions per sec, equivalent to 15K QPS, at P99 < 120ms, read ratio 75%
Peak throughput under 30min, qualified with:
- 7,000 business transactions per sec, equivalent to 21K QPS, at P95 < 150ms, read ratio 65%
Extended peak throughput under 15min, qualified with:
- 10,000 business transactions per sec, equivalent to 30K QPS, at P95 < 300ms, read ratio 65%
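The targets above can be cross-checked with a little arithmetic. This sketch assumes that the QPS figures imply three SQL statements per business transaction (15K / 5,000) and that the read ratio applies to statements rather than to business transactions; both are interpretations, and the type names are made up for this example.

```java
import java.util.List;

// Illustrative cross-check of the journey X targets; interpretation is an assumption.
class WorkloadTargetSketch {
    record Target(String name, int businessTps, int statementsPerTx, int readPercent) {
        int qps() {
            return businessTps * statementsPerTx;
        }

        int readQps() {
            return qps() * readPercent / 100;
        }

        int writeQps() {
            return qps() - readQps();
        }
    }

    public static void main(String[] args) {
        List<Target> targets = List.of(
                new Target("sustained", 5_000, 3, 75),
                new Target("peak", 7_000, 3, 65),
                new Target("extended peak", 10_000, 3, 65));
        for (Target t : targets) {
            System.out.println(t.name() + ": " + t.qps() + " QPS ("
                    + t.readQps() + " reads/s, " + t.writeQps() + " writes/s)");
        }
    }
}
```

Breaking the targets down this way makes it visible that the write load nearly triples between the sustained and extended peak levels, while the read load grows by less than double.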

Solutions:

CockroachDB delivers predictable response times and throughput for different workload scales. When optimizing performance characteristics, it is typically a matter of finding opportunities in:

Application workload patterns
Schema design
Cluster hardware capacity and utilization
Replica and leaseholder placement

Most opportunities are outlined in SQL performance best practices. A workload should be evenly distributed across all machines of a cluster (no hotspots) which happens automatically given a few schema design and load balancing considerations. By using different multi-region capabilities, the coordination between nodes over the network can also be minimized, drastically improving both read and write performance.

Let's wrap up with the other quality attributes by just highlighting a few relevant questions.

Availability

A measure of the ability of a system to function in a state of serious service or infrastructure degradation.

Questions:

What level of redundancy will the system have (any accepted SPOFs)?
What type of infrastructure failures must the system survive (cloud, region, zone, rack or server)?
What's the business impact of degraded or denied forward progress?
Will the system continue to function on partial infrastructure failure?
Are there any specific RTO and RPO metrics?
What are the requirements for backup and restore (MTTR)?

Evolvability

A system's ability to accommodate changes at low cost and with small client impact.

Questions:

What's the structure and process around the development and deployment pipeline?
Are downtime windows allowed for production deployments?
Is a pre-production deployment environment needed?
What are the key factors that impact time-to-value in new business initiatives/improvements?
How does changing/adding functionality impact existing functionality?

Maintainability

A system's ability to be repaired after an error occurs.

Questions:

What design principles are applied to reduce maintenance efforts?
How much QA/OPS effort is needed to verify and deploy the system?

Interoperability

How the system interacts with other subsystems or foreign services.

Questions:

How does data flow into the system and what is the output?
Is the system classified as online, nearline or offline?
What are the major infrastructure components involved (database, message broker)?
Is the system classified as a system of record or a system of access?
Is the interaction model a typical request/response based model or an event-driven, async model?

Visibility

A system's support for debugging and real-time monitoring.

Questions:

How is the system monitored and how are alerts acted on?
How can problems quickly be identified and corrected?

Security

A system's ability to support security controls including access controls, encryption, data isolation, secure information processing and auditing for compliance.

Questions:

Will the system run in a PCI or equivalent regulated environment?
Will the system handle PII data?
What security mechanisms does the system require on ingress and egress channels, or data protection?

Conclusion

This article discusses non-functional requirements, also known as quality attributes, which are properties and characteristics of a software system that are not directly related to its functional aspects. It looks at how these requirements can be defined, quantified, and addressed when using the capabilities of CockroachDB.

Parallel Query Execution in CockroachDB

Kai Niemi — Wed, 26 Apr 2023 14:11:28 GMT

This article provides an example of increasing large query performance by using client-side parallel query execution.

Introduction

CockroachDB uses parallelism in many parts of its architecture to deliver high-scale distributed SQL execution. For example, to improve write performance, it uses a parallel atomic commit protocol designed to cut the commit latency of a transaction from two roundtrips of consensus to one. When combined with transaction pipelining, where write intents are replicated from leaseholders in parallel rather than sequentially, all waiting happens in the end at commit time, thereby drastically reducing latencies for multi-statement write transactions.

To improve read performance in multi-region high-latency deployments, the cost-based optimizer performs what's referred to as a locality-optimized search. The optimizer may begin to scan for rows in the gateway node's local region and fan out to remote regions in parallel, but only if the local region did not satisfy the query. The result of a remote lookup is returned to the gateway as soon as it is received, so rows can be returned from wherever they are first found without waiting for all lookups to complete.

Last but not least, CockroachDB also uses vectorized SQL query execution, designed to process batches of columnar data instead of just a single row at a time. In the longer term, this will also make use of vectorized CPU (SIMD) instructions.

Parallelism is well exploited in the algorithms and mechanisms that CockroachDB uses. This works well for both larger and smaller statements that don't scan large volumes of data, which is typically something you'd want to avoid doing anyway in an OLTP database.

Now to the purpose of this article: what can the client do to take this even further?

Client-side Parallelism

The CockroachDB database (and SQL for that matter) does a decent job of hiding the implementation details from clients through its abstraction layers. One of the primary tasks of a SQL database is to provide the illusion to clients that they are the sole users, free to read and write any piece of information without interference from others. In reality, the environment is highly concurrent and parallelized, which in practical terms means that the database is allowed to reorder concurrent transactions as long as the result is the same as if they had executed one at a time (serially), without any concurrency. This is the definition of SERIALIZABLE transaction isolation.

A SQL database is designed to be highly capable of accepting queries from multiple application instances and threads in parallel. In a typical request-response, thread-bound execution model you get a connection from the pool, send a single or multi-statement transaction, await its completion and close the connection (recycled to the pool). While this gives a level of parallelism in terms of multiple application-level threads, it doesn't help that much for larger scans beyond what the database offers.

What if you want to take things a step further in terms of parallel execution and involve the client? For example, by first running a very large query that scans hundreds of thousands of rows to compute an aggregated sum in the database and then do the equivalent client side by decomposing the query into smaller blocks run in parallel. Let's find out if it makes any difference.

Example Use Case

Assume we have a simple product table holding an inventory column.

create table product(
    id        uuid           not null default gen_random_uuid(),
    version   int            not null default 0,
    inventory int            not null,
    name      varchar(128)   not null,
    price     numeric(19, 2) not null,
    sku       varchar(128)   not null unique,
    country   varchar(128)   not null,
    primary key (id)
);

Next, we add a covering index on the country and insert a huge bunch of products:

CREATE INDEX ON product (country) STORING (inventory,name,price);

insert into product (inventory,name,price,sku,country)
select 10 + random() * 50,
       md5(random()::text),
       500.00 + random() * 500.00,
       gen_random_uuid()::text,
       'US'
from generate_series(1, 500000) as i;

-- Repeat insert for 9 more countries, in total 5M rows

Composed Query

Let's run a single composed query to get the total inventory sum grouped by country:

select sum(p.inventory), p.country from product p group by p.country;

Gives:

select sum(p.inventory), p.country from product p group by p.country;
    sum    | country
-----------+----------
  17251976 | BE
  17253042 | DE
  17234287 | DK
  17253539 | ES
  17229425 | FI
  17250751 | FR
  17247093 | NO
  17257296 | SE
  17237964 | UK
  17261461 | US
(10 rows)

Time: 4.083s total (execution 4.083s / network 0.000s)

This query still runs fairly fast for a total row count of 5M. Let's look at the explain plan to see that we are scanning the entire table:

explain analyze select sum(p.inventory), p.country from product p group by p.country;
                                    info
-----------------------------------------------------------------------------
  planning time: 421µs
  execution time: 4.1s
  distribution: full
  vectorized: true
  rows read from KV: 5,000,000 (466 MiB, 47 gRPC calls)
  cumulative time spent in KV: 3.8s
  maximum memory usage: 10 MiB
  network usage: 0 B (0 messages)
  regions: europe-west1

  group (streaming)
    nodes: n1
    regions: europe-west1
    actual row count: 10
    estimated row count: 10
    group by: country
    ordered: +country

    scan
      nodes: n1
      regions: europe-west1
      actual row count: 5,000,000
      KV time: 3.8s
      KV contention time: 0s
      KV rows read: 5,000,000
      KV bytes read: 466 MiB
      KV gRPC calls: 47
      estimated max memory allocated: 10 MiB
      estimated row count: 7,036,818 (100% of the table; stats collected 5 days ago; using stats forecast for 5 days ago)
      table: product@product_country_idx
      spans: FULL SCAN
(31 rows)

Time: 4.091s total (execution 4.090s / network 0.000s)

Decomposed Parallel Queries

Let's decompose the single query into multiple ones and run them in parallel, then combine the results in the end. We refactor the query by removing the GROUP BY and filtering on the indexed country column instead. Effectively the GROUP BY operator is moved client side.

Example of a single country query:

select sum(p1_0.inventory) from product p1_0 where p1_0.country='US';
    sum
------------
  17261461
(1 row)

Time: 231ms total (execution 231ms / network 0ms)

Let's also check the execution plan:

explain analyze select sum(p1_0.inventory) from product p1_0 where p1_0.country='US';
                                    info
-----------------------------------------------------------------------------
  planning time: 535µs
  execution time: 248ms
  distribution: full
  vectorized: true
  rows read from KV: 500,000 (47 MiB, 5 gRPC calls)
  cumulative time spent in KV: 225ms
  maximum memory usage: 10 MiB
  network usage: 0 B (0 messages)
  regions: europe-west1

  group (scalar)
    nodes: n1
    regions: europe-west1
    actual row count: 1
    estimated row count: 1

    scan
      nodes: n1
      regions: europe-west1
      actual row count: 500,000
      KV time: 225ms
      KV contention time: 0s
      KV rows read: 500,000
      KV bytes read: 47 MiB
      KV gRPC calls: 5
      estimated max memory allocated: 10 MiB
      estimated row count: 696,645 (9.9% of the table; stats collected 5 days ago; using stats forecast for 5 days ago)
      table: product@product_country_idx
      spans: [/'US' - /'US']
(29 rows)

Time: 249ms total (execution 249ms / network 0ms)

The estimated row count is about 10% of the table which sounds about right since we inserted 500K rows per country.

Now we apply a parallel fork and join operation at the client side. This means we fire ten concurrent threads with the individual queries and then await completion before proceeding. After that, the results are joined together.

For this example, we'll use Spring Data and a JPA query:

@Query("select sum(p.inventory) from Product p where p.country = :country")
Integer sumInventory(@Param("country") String country);

First queue up the workers, one for each country:

List<Callable<Pair<String, Integer>>> tasks = new ArrayList<>();
StringUtils.commaDelimitedListToSet("SE,UK,DK,NO,ES,US,FI,FR,BE,DE").forEach(country ->
        tasks.add(() -> Pair.of(country, productRepository.sumInventory(country))));

Next, we unleash the workers to run in parallel while blocking until completion (or cancellation by timeout):

ConcurrencyUtils.runConcurrentlyAndWait(tasks, 10, TimeUnit.MINUTES, sums::add);

This utility method makes use of Java's CompletableFuture, which was introduced way back in Java 8. It's like a Swiss army knife for asynchronous computations using parallel decomposition constructs. Tasks are decomposed into steps that can be forked and joined in different stages to a final result. It's a very elegantly designed API.

In this example, we are just using a small subset of it to run our query tasks in parallel and join the results. It also adds cancellation, in case queries would go rogue and run for too long. Cancellation is not a natural part of CompletableFuture so there's a small trick in there to add that.

It's also using a bounded thread pool which means that no matter how many tasks are queued it will only run a limited number of concurrent threads by adding backpressure on the client code queuing up tasks. This is more lenient on thread scheduling since the client will be blocking anyway.

ScheduledExecutorService cancellationService
        = Executors.newSingleThreadScheduledExecutor();
ExecutorService executor = boundedThreadPool();
List<CompletableFuture<Void>> allFutures = new ArrayList<>();
final Instant expiryTime = Instant.now().plus(timeout, timeUnit.toChronoUnit());

tasks.forEach(callable -> {
    allFutures.add(CompletableFuture.supplyAsync(() -> {
                // Skip tasks that only get scheduled after the deadline has passed
                if (Instant.now().isAfter(expiryTime)) {
                    logger.warn("Task scheduled after expiration time: " + expiryTime);
                    return null;
                }
                Future future = executor.submit(callable);
                // Schedule a cancellation of the task at the deadline
                long cancellationTime = Duration.between(Instant.now(), expiryTime).toMillis();
                cancellationService.schedule(() -> future.cancel(true), cancellationTime, TimeUnit.MILLISECONDS);
                try {
                    return future.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }, executor)
            .thenAccept(completionFunction)
            .exceptionally(throwableFunction)
    );
});

// Block until all tasks have completed, failed or been cancelled
CompletableFuture.allOf(
        allFutures.toArray(new CompletableFuture[]{})).join();

executor.shutdownNow();
cancellationService.shutdownNow();
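The boundedThreadPool() helper is not shown in the article. One possible way to get a bounded pool with backpressure, using only the JDK, is sketched below. This is an assumption about how such a helper could look, not the actual implementation from the demo project; the class name and the maxObservedConcurrency method exist only to demonstrate the bound.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a bounded thread pool helper.
class BoundedPoolSketch {
    static ExecutorService boundedThreadPool(int maxThreads) {
        return new ThreadPoolExecutor(
                maxThreads, maxThreads,
                30, TimeUnit.SECONDS,
                // No internal queue: a task is handed to a worker or rejected
                new SynchronousQueue<>(),
                // Rejected tasks run in the submitting thread, which slows the submitter down
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Submits short tasks and returns the highest number observed running at once.
    static int maxObservedConcurrency(int maxThreads, int taskCount) {
        ExecutorService pool = boundedThreadPool(maxThreads);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(() -> {
                peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                try {
                    Thread.sleep(20); // simulate a query
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                running.decrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("Peak concurrency: " + maxObservedConcurrency(4, 20));
    }
}
```

With the caller-runs policy the peak can reach the pool size plus one, since the submitting thread also executes work when all workers are busy. That is the backpressure: the submitter cannot queue further tasks while it is running one itself.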

Once all the query sums are gathered we simply add them up client side using a stream API aggregator:

sums.stream().mapToInt(Pair::getSecond).sum()
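The whole fork and join pattern can also be shown in a compact, self-contained form without a database. In this sketch the per-country query is replaced by a lookup in a map holding the sums from the composed query result earlier in this article; the class name, the parallelSum method and the use of a plain fixed thread pool are assumptions made for illustration, and cancellation is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Simplified fork/join sketch; the query function is a stub, not a real repository call.
class ForkJoinSumSketch {
    // Forks one task per key, waits for all of them, then adds up the results client side.
    static long parallelSum(List<String> keys, Function<String, Integer> query, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Integer>> futures = keys.stream()
                    .map(key -> CompletableFuture.supplyAsync(() -> query.apply(key), executor))
                    .toList();
            // Join: block until every forked task has completed
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return futures.stream().mapToLong(f -> f.join()).sum();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // Per-country sums taken from the composed query result above
        Map<String, Integer> inventoryByCountry = Map.of(
                "BE", 17251976, "DE", 17253042, "DK", 17234287, "ES", 17253539, "FI", 17229425,
                "FR", 17250751, "NO", 17247093, "SE", 17257296, "UK", 17237964, "US", 17261461);
        long total = parallelSum(List.copyOf(inventoryByCountry.keySet()), inventoryByCountry::get, 10);
        System.out.println("Total inventory sum is " + total);
    }
}
```

Replacing the stub with a real per-country query gives the same structure as the article's example: the GROUP BY becomes the list of keys, and the final reduction happens in the client.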

OK, the result then? Here's the log output:

09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for UK is 17237964
09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for US is 17261461
09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for DK is 17234287
09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for SE is 17257296
09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for ES is 17253539
09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for NO is 17247093
09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for FR is 17250751
09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for DE is 17253042
09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for FI is 17229425
09:21:53.253  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Inventory sum for BE is 17251976
09:21:53.254  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Total inventory sum is 172476834
09:21:53.254  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Verified inventory sum is 172476834
09:21:53.254  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Parallel execution time: PT1.1578745S
09:21:53.254  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Serial execution time: PT2.9943538S
09:21:53.254  INFO [i.r.s.p.ParallelApplication$$SpringCGLIB$$0] Execution time diff: 259%

In this simple example, decomposing the query and running the parts in parallel finishes about 2.6 times faster: roughly 1.16s versus 2.99s for the serial run, which is the 259% ratio reported in the log.

Conclusion

This article explains how CockroachDB uses parallelism to improve read and write performance, and how client-side parallel query execution can be used to further increase large query performance. An example use case is provided to illustrate how this works, using Spring Data and a JPA query to run a parallel fork and join operation at the client side with a bounded thread pool and a cancellation service.

The source code for the article is available on GitHub.

Enhancing Global Read Performance in CockroachDB

Kai Niemi — Wed, 26 Apr 2023 14:08:19 GMT

Introduction

CockroachDB is a geo-distributed SQL database purpose-built from the ground up for high scalability, fault tolerance, cloud neutrality and usability for developers and operators. It also offers the highest SQL standard for transactional integrity - serializable isolation.

The term geo-distribution emphasises its capability to break out of the low-latency, stable networking assumptions of a single data center or single region deployment. CockroachDB clusters can span the globe and still offer one logical database towards applications with intact semantics and guarantees. No more need for manual sharding.

One major influence on performance, when nodes need to perform some level of coordination, is network latency. There are numerous mechanisms at work in CockroachDB to mitigate the effects of high cross-link latencies, both for performance and to ensure safety and liveness in volatile and ephemeral hosting environments.

One key ingredient for high performance and linear scalability in CockroachDB is the ability to distribute workload both vertically and horizontally and thereby leverage the aggregate compute/IO capacity of a cluster. This is mainly achieved by the database itself, but application designers can help by following best practices around schema design and query patterns, and by load balancing traffic across the cluster. In general terms, it's about preventing hotspots from forming, avoiding contention where possible and reducing large table scans.

Techniques

CockroachDB uses a distributed SQL execution engine at its core, which means many things:

The SQL layer is parallelized and pushes processors close to the proximity of data.
The SQL optimizer is purpose-built with latency as a cost factor and locality awareness.
The transaction layer uses a sophisticated pipelining and parallel commit protocol to reduce round trips to the theoretical minimum for consensus.
Backup and restore are locality-aware.
The vectorized execution engine provides good performance for a wide range of queries.
Load-based splitting and rebalancing heuristics help to balance load across a cluster of machines.

Scaling reads is generally considered to be slightly easier than scaling writes. In a global or multi-region deployment topology, there are a couple of useful patterns including global tables, regional-by-row table localities and follower reads.

Global Tables

A global table means that all voting range replicas reside on nodes in the primary region, and non-voting replicas in remote regions to service consistent reads. The database automatically adjusts the replication factor (RF) to ensure there are range replicas for these tables in each configured region. It also uses something called non-blocking transactions in combination with non-voting replicas to provide low-latency global reads, also during workload contention. This concept is useful if you have a table which has a low volume of writes but high volumes of reads from different regions and the reads must be authoritative.

alter database test primary region "eu-north-1";
alter database test add region "eu-west-3";
alter database test add region "us-east-1";

create table postal_codes(
    id   int primary key,
    code string
);

ALTER TABLE postal_codes SET LOCALITY GLOBAL;

Regional by Row

Regional by row is a table locality in which the home region is defined at the row level, in contrast to regional tables, where all rows in a table have the same home region.

alter database test primary region "eu-north-1";
alter database test add region "eu-west-3";
alter database test add region "us-east-1";

CREATE TABLE users(
    user_id     INT    NOT NULL,
    name        STRING NULL,
    postal_code int    NULL,
    PRIMARY KEY (user_id ASC)
);

ALTER TABLE users SET LOCALITY regional by row;

insert into users (user_id, crdb_region)
select no, 'eu-north-1' from generate_series(1, 100) no;

insert into users (user_id, crdb_region)
select no, 'eu-west-3' from generate_series(101, 200) no;

insert into users (user_id, crdb_region)
select no, 'us-east-1' from generate_series(201, 300) no;

Follower Reads

Follower reads are akin to Content Delivery Networks (CDN) by not having to chase the leaseholder for a given range that can potentially be located in another part of the world. Instead, the closest replica to a gateway node (receiving the request) can service the read with some staleness. There are two variants of follower reads called exact staleness and bounded staleness reads.

On the surface, follower reads may appear similar to global tables but the latter works quite differently through non-voting replicas and non-blocking transactions.

Follower reads are useful for more ad-hoc SQL queries for both partitioned and unpartitioned tables, where reads are allowed to be non-authoritative (potentially stale). Global tables are always authoritative (no staleness bounds) but pay for that in higher write latency.

The choice between follower reads and global tables should be driven by staleness requirements, read vs write volumes and survival goals. In other words, if the decision to write something is based on a read, and the value read must be authoritative, then a global table is a better choice. On the other hand, if write performance is a priority and a staleness window is acceptable, then follower-reads are better.

-- per statement:
SELECT ID FROM USERS AS OF SYSTEM TIME follower_read_timestamp() WHERE id=1;

-- alt via session var:
BEGIN;
SET TRANSACTION AS OF SYSTEM TIME follower_read_timestamp();
SELECT ..
COMMIT;

Summary

CockroachDB is a geo-distributed SQL database designed for scalability, fault tolerance, cloud neutrality, and usability. It offers distributed SQL execution and concepts like global tables, and follower reads to help balance read-heavy load across a cluster of machines. When deciding between follower reads and global tables, factors such as staleness requirements, read vs write volumes, and survival goals should be taken into consideration. Global tables are better for authoritative reads, while follower reads are better for write performance with an acceptable staleness window.

Spring Retry with CockroachDB

Kai Niemi — Thu, 13 Apr 2023 13:20:20 GMT

Spring Retry is a small library for retrying failed method invocations of a transient nature, typically when interacting with another service over the network, such as a message broker or a database.

In this tutorial, we'll look at using Spring Retry for serialization conflict errors denoted by the SQL state code 40001.

Maven Setup

To use Spring Retry, you need to add the Spring Retry and Spring AOP dependencies to your pom.xml.

<dependency>
    <groupId>org.springframework.retry</groupId>
    <artifactId>spring-retry</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-aspects</artifactId>
    <version>5.3.10</version>
</dependency>

Configuration

To enable Spring Retry in an application, add the @EnableRetry annotation to any of the @Configuration classes:

@EnableRetry
@Configuration
@SpringBootApplication
public class MyApplication {
    public static void main(String[] args) {
        new SpringApplicationBuilder(MyApplication.class)
                .run(args);
    }
}

Example Service

Using Spring Retry is as simple as adding the @Retryable annotation to the methods to-be-retried:

@Service
public class OrderService {
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    @Retryable
    public Order updateOrderStatus(Long orderId, ShipmentStatus status, BigDecimal amount) {
        Order order = ...;
        return order;
    }
}

In our case, however, we want to be more specific on what type of exceptions qualify for a retry and also tailor the backoff policy to use an exponentially increasing delay with jitter.

@Service
public class OrderService {
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    @Retryable(exceptionExpression = "@exceptionClassifier.shouldRetry(#root)",
            maxAttempts = 5,
            // random = true enables jitter on top of the exponential delay
            backoff = @Backoff(maxDelay = 15_000, multiplier = 1.5, random = true))
    public Order updateOrderStatus(Long orderId, ShipmentStatus status, BigDecimal amount) {
        Order order = ...;
        return order;
    }
}

The backoff annotation parameters define an exponential backoff policy capped at a maximum delay. Setting random = true adds jitter, which results in the ExponentialRandomBackOffPolicy being used at runtime. Without that flag, the plain ExponentialBackOffPolicy is used.
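To get a feel for the delays such a policy produces, here is a minimal plain-Java sketch of exponential backoff with jitter. It is not Spring Retry's implementation. The class and method names are hypothetical, the initial delay of 1000 ms is an assumption, and the jitter formula (scaling each base delay by a random factor between 1 and the multiplier) is chosen for illustration only.

```java
import java.util.Random;

class BackoffSketch {

    // Base (un-jittered) delay for a 1-based retry attempt, capped at maxDelayMs.
    static long baseDelay(int attempt, long initialMs, double multiplier, long maxDelayMs) {
        double delay = initialMs * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(delay, maxDelayMs);
    }

    // Jittered delay: the base delay scaled by a random factor in [1, multiplier),
    // still capped at maxDelayMs. The formula is an assumption for illustration.
    static long jitteredDelay(int attempt, long initialMs, double multiplier,
                              long maxDelayMs, Random random) {
        long base = baseDelay(attempt, initialMs, multiplier, maxDelayMs);
        double factor = 1.0 + random.nextDouble() * (multiplier - 1.0);
        return Math.min((long) (base * factor), maxDelayMs);
    }

    public static void main(String[] args) {
        Random random = new Random();
        // maxAttempts = 5 leaves room for at most four waits between attempts.
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.printf("wait %d: base=%d ms, jittered=%d ms%n",
                    attempt,
                    baseDelay(attempt, 1000, 1.5, 15_000),
                    jitteredDelay(attempt, 1000, 1.5, 15_000, random));
        }
    }
}
```

With a multiplier of 1.5, the base delays grow as 1000, 1500, 2250 and 3375 ms, so the 15-second cap only matters for much longer retry chains. The jitter spreads concurrent retries apart so that conflicting transactions don't collide again in lockstep.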

Next, let's look at the exception classifier:

import java.sql.SQLException;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.core.NestedExceptionUtils;

public class CockroachExceptionClassifier {
    private final Logger logger = LoggerFactory.getLogger(getClass());

    // SQL state code for serialization failures (transient, safe to retry)
    private static final String SERIALIZATION_FAILURE = "40001";

    public boolean shouldRetry(Throwable ex) {
        if (ex == null) {
            return false;
        }
        // Unwrap Spring's exception translation to get to the driver exception
        Throwable throwable = NestedExceptionUtils.getMostSpecificCause(ex);
        if (throwable instanceof SQLException) {
            return shouldRetry((SQLException) throwable);
        }
        logger.warn("Non-transient exception {}", ex.getClass());
        return false;
    }

    public boolean shouldRetry(SQLException ex) {
        if (SERIALIZATION_FAILURE.equals(ex.getSQLState())) {
            logger.warn("Transient SQL exception detected : sql state [{}], message [{}]",
                    ex.getSQLState(), ex.toString());
            return true;
        }
        return false;
    }
}

We also add the classifier bean to the configuration:

    @Bean
    public CockroachExceptionClassifier exceptionClassifier() {
        return new CockroachExceptionClassifier();
    }

The shouldRetry method unwraps the most specific cause and, if it is a SQLException, checks that it carries the state code 40001.

We could qualify exceptions with other state codes, but then there are no guarantees against multiple side effects when retried. For example, a transaction may involve multiple INSERTs where the COMMIT is successful but the reply is lost in transit back to the client. That case wouldn't surface as state code 40001 but more likely as a broken connection error code, and a blind retry would apply the INSERTs a second time.

To be safe, only retry on the state code 40001 and nothing else, unless you are sure about the side effects of your SQL transactions and it's considered safe (or the operations are idempotent).

Demo Project

Roach Retry is a project that provides runnable examples of different transaction retry strategies for Spring Boot and the JavaEE stack. It includes Spring Retry along with a simpler AOP-driven approach and JavaEE interceptors for old-style stateless session beans.

Step 1: Startup

Create the database:

cockroach sql --insecure --host=localhost -e "CREATE database roach_retry"

Build the app:

cd spring-retry
../mvnw clean install

Run the app:

java -jar target/roach-retry.jar

Then open another shell window so you have at least two windows. In any of the shells, check that the service is up and connected to the database:

curl --verbose http://localhost:8090/api

Step 2: Get Order Request Form

Print an order form template that we will use to create orders:

curl http://localhost:8090/api/order/template > form.json

Step 3: Submit Order Form

Create at least one purchase order:

curl http://localhost:8090/api/order -H "Content-Type:application/json" -X POST -d "@form.json"

Step 4: Produce a Read/Write Conflict

Assume that there is now an existing order with ID 1 and status PLACED. We will read that order and change the status to something else, concurrently from two sessions. This is known as a read-write or unrepeatable-read conflict, which is prevented by serializable isolation. As a result, there will be a SQL exception and a rollback.

When this happens, the retry mechanism will kick in and retry the failed transaction. The retry will succeed because the two transactions no longer conflict once one of them has committed.

To observe this predictably we'll use two separate sessions with a controllable delay between the read and write operations.

Overview of the SQL operations executed (what the service will execute):

BEGIN; -- T1
SELECT * FROM purchase_order WHERE id=1; -- T1
-- T1: Assert that status is `PLACED`
-- T1: Suspend for 15s
    BEGIN; -- T2
    SELECT * FROM purchase_order WHERE id=1; -- T2
    -- Assert that status is still `PLACED`
    UPDATE purchase_order SET order_status='PAID' WHERE id=1; -- T2
    COMMIT; -- T2 (OK)
UPDATE purchase_order SET order_status='CONFIRMED' WHERE id=1; -- T1 (ERROR!)
ROLLBACK; -- T1

Now prepare the two separate shell windows so you can run the commands concurrently.

First, check that the order with ID 1 exists and has the status PLACED (or anything else other than CONFIRMED)

curl http://localhost:8090/api/order/1

Now let's run the first transaction (T1) where there is a simulated 15-sec delay before the commit (you can increase/decrease the time):

curl http://localhost:8090/api/order/1?status=CONFIRMED\&delay=15000 -i -X PUT

In less than 15 sec and before T1 commits, run the second transaction (T2) from another session which doesn't wait and succeeds with a commit:

curl http://localhost:8090/api/order/1?status=PAID -i -X PUT

At this point, T1 has no other choice than to rollback and that will trigger a retry:

ERROR: restart transaction: TransactionRetryWithProtoRefreshError: WriteTooOldError: write for key /Table/109/1/12/0 at timestamp 1669990868.355588000,0 too old; wrote at 1669990868.778375000,3: "sql txn" meta={id=92409d02 key=/Table/109/1/12/0 pri=0.03022202 epo=0 ts=1669990868.778375000,3 min=1669990868.355588000,0 seq=0} lock=true stat=PENDING rts=1669990868.355588000,0 wto=false gul=1669990868.855588000,0

The retry mechanism will catch that SQL exception, back off for a few hundred millis and then retry until it eventually succeeds (1 attempt).

The expected outcome is a 200 OK returned to both client sessions. The final order status must be CONFIRMED since client 1 request (T1) was retried and eventually committed, thereby overwriting T2.

curl http://localhost:8090/api/order/1

Conclusion

In this tutorial, we explore using Spring Retry, a library for retrying failed method invocations of a transient nature, to handle serialization conflict errors denoted by the SQL state code 40001. We cover how to set up Maven, configure Spring Retry, create a sample service, and demonstrate a retry scenario using a demo project.

Create a Ledger Utilizing CockroachDB - Part III - Architecture

Kai Niemi — Wed, 12 Apr 2023 16:00:39 GMT

In the third part of a series about RoachBank, a full-stack, financial accounting ledger running on CockroachDB, we will look into the design features and architectural mechanisms used.

Problem Statement

Let's begin by describing what the service does by using a problem statement. A problem statement specifies the system requirements at a high level. Input from business or product owners is critical in composing this statement.

The main characteristics include:

Uses business domain language.
Has clear sentences without jargon.
Describes the project scope.
Specifies the context of the business capability.
Specifies the users/actors of the system.
Specifies known business and technical constraints that are important to consider.
Could serve as the foundation for identifying candidate domain objects (picking nouns).

Example:

The business requires an accounting system to keep track of monetary transactions. Users of the system are internal components that need to manage financial transactions between accounts.
The system must keep track of monetary accounts, transaction history on those accounts and account balances. Each account is associated with an account owner and a base currency. A transaction is the outcome of moving funds between different accounts. A transaction contains several account legs that may involve accounts with different currencies.
The safety mechanism used is the double-entry bookkeeping principle where each transaction must have a zero balance sum of all legs with the same currency. The system must also be able to produce reports of account activities and transactions for external auditors.

Problem statements are very useful when creating new components and for understanding the purpose and meaning of existing ones, as with this hypothetical accounting ledger.

Architectural Mechanisms

Let's move on from the problem domain to the solution domain. Software architectures can be visualized using UML diagrams or the C4 model (https://c4model.com/), which I find quite useful. But what if you don't have diagrams or don't fancy drawing them?

One approach is to analyze what key architectural mechanisms are needed to implement all the features. Mechanisms are abstractions so when refining these into usable components or tools, you effectively do technology selection appropriate for the business domain (and organisation).

An architectural mechanism represents a common solution to a frequently encountered architectural problem that is not specific to a project or business domain. Quite similar to design patterns.

Implementation mechanisms are typically selected from a technology baseline within a tech organisation. For instance, when persistence is needed (analysis mechanism) and ACID properties are required, then an RDBMS (design mechanism) should be used where CockroachDB (implementation mechanism) is a great choice.

Key architectural mechanisms and realizations in this accounting ledger:

Analysis	Design & Implementation	Characteristics & Constraints
Persistence	RDBMS / CockroachDB & PostgreSQL	CockroachDB is used for all persistence needs. PostgreSQL support is also available for reference.
Data Access	ORM / JPA and Hibernate	Hibernate via Spring Data JPA and Spring Data JDBC for reference. JDBC as default. These modes can be switched for performance comparison.
Transaction Management	Local Transactions (no XA)	The system uses a local transaction manager for accessing transactional resources such as the database. For database access, a local JPA transaction manager is used. Serializable transaction isolation is required. For eventing, the system uses a mix of Kafka listeners and CDC webhook endpoints for visualization, both with at least once guarantee.
Interoperability	Hypermedia API, Websockets, Kafka consumer and webhook endpoint for CockroachDB CDC sinks	Spring Hateoas using HAL+json. Streaming text oriented messaging protocol (STOMP). Kafka consumer/publisher of CDC events.
Frontend	Web UI / Thymeleaf template framework, CSS and JQuery, Bootstrap	For visualisation of account activities and system liveness during partial infrastructure disruptions.
Observability	Logging / Pull-based HTTP queries	SLF4J + Logback, Spring Actuators, Prometheus endpoint and TTDDYY proxy for JDBC logging
Caching	HTTP-level / Client side use of cache headers	HTTP cache headers in REST API. Local Spring cache for heavy reporting queries.
Resource Management	Connection Pooling	HikariCP data source.
Scheduling	Application Level	Spring built-in cron task scheduling (non-clustered)
Versioning	Database Versioning	Flyway
Inversion of Control	Application IOC	Spring Boot and AOP aspects for retryable transactions. JDBC driver retries (default)
Deployment	Container / Spring Boot	Spring Boot self-contained executable JAR
Load Balancing	L4 against the database, L7 against service API / HAProxy	Client-to-service HTTP load balancing is optional. Service to DB load balancing via HAProxy
Build	Convention over configuration	Maven 3+, JDK 17 (LTS) at source and target level

Design Overview

The ledger is based on a common Spring Boot microservice stack using Spring Boot, Spring Data JDBC/JPA, Spring Hateoas, HikariCP, Flyway and more. Kafka as CDC sink is optional for driving account balance push events to the web front-end. The default is just to send the push events after each successful transaction (with an AOP after-advice).

There are two distinct data access implementations; JPA via Hibernate and plain JDBC. Both are included in a single self-contained executable JAR artifact with an embedded Jetty servlet container. It's possible to configure the retry strategy, data access strategy and more through Spring profiles.

It connects to either a CockroachDB cluster or PostgreSQL. When using PostgreSQL, some features are disabled such as follower reads and geo-partitioning.

The bank client issues concurrent requests towards the service API endpoint which in turn reads and writes to the database. When a transfer request is processed, the outcome is a permanent record in history (the ledger) and balance updates on the affected accounts. These balance updates are also pushed to the frontends via websocket STOMP events for visualization.

Architecture Diagram

Entity Model

The system uses the following entity model for double-entry bookkeeping of monetary transaction history.

account - Accounts with a derived balance from the sum of transactions
transaction - Owning entity for balanced multi-legged monetary transactions
transaction_item - Association table between transaction and account representing a single leg with a running account balance
region - Static information about deployment regions
outbox - Optional table for showcasing outbox pattern

Main SQL files

Flyway is used to set up the DB schema and account plan during startup time. The schema is not geo-partitioned by default.

Transaction Workflow

Each monetary transaction creates a transaction record (1) and one leg (2) for each account update and also updates the cached balance on each account (3). A CHECK constraint ensures that balances don't end up negative unless allowed for that account (using allow_negative column).

The UPDATE .. FROM (below) with array unnesting is a workaround for the lack of batch updates over the wire. The pgJDBC driver doesn't batch UPDATE statements, only INSERTs up to a given limit using SQL rewrites (aka multi-value inserts).

In the default workflow, the initial balance check on the accounts is redundant since the invariant check is done by looking at the rows affected on the final UPDATE. An UPDATE also takes an implicit lock in the reading part in CockroachDB (configurable) which will reduce retries. In summary, there are no reads involved in the default workflow (not counting the internal read that is part of each UPDATE).

-- (1) header
INSERT INTO transaction (id,city,balance,currency,name,..);

-- (2) for each leg (batch)
INSERT INTO transaction_item (city,transaction_id,..);

-- (3) for each account (batch)
UPDATE account SET balance = account.balance + data_table.balance, updated=clock_timestamp()
FROM (select unnest(?) as id, unnest(?) as balance) as data_table
WHERE account.id=data_table.id
  AND account.closed=false
  AND (account.balance + data_table.balance) * abs(account.allow_negative-1) >= 0

The actual CHECK constraints:

alter table account
    add constraint check_account_allow_negative check (allow_negative between 0 and 1);

alter table account
    add constraint check_account_positive_balance check (balance * abs(allow_negative - 1) >= 0);
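The expression balance * abs(allow_negative - 1) >= 0 can look cryptic at first. When allow_negative is 1 the factor becomes 0 and the check always passes. When it is 0 the factor becomes 1 and the check reduces to balance >= 0. The following Java sketch mirrors that arithmetic for illustration. The class and method names are hypothetical and not part of the ledger code.

```java
import java.math.BigDecimal;

class BalanceInvariantSketch {

    // Mirrors the CHECK expression: balance * abs(allow_negative - 1) >= 0
    static boolean satisfiesCheck(BigDecimal balance, int allowNegative) {
        // Mirrors check_account_allow_negative: the flag must be 0 or 1
        if (allowNegative < 0 || allowNegative > 1) {
            throw new IllegalArgumentException("allow_negative must be 0 or 1");
        }
        // Factor is 1 when negative balances are disallowed, 0 when they are allowed
        BigDecimal factor = BigDecimal.valueOf(Math.abs(allowNegative - 1));
        return balance.multiply(factor).signum() >= 0;
    }

    public static void main(String[] args) {
        BigDecimal overdrawn = new BigDecimal("-10.00");
        System.out.println("overdrawn, negative disallowed: " + satisfiesCheck(overdrawn, 0));
        System.out.println("overdrawn, negative allowed:    " + satisfiesCheck(overdrawn, 1));
    }
}
```

The same multiplication trick appears in the WHERE clause of the UPDATE statement, which is why an overdraft shows up as zero rows affected instead of a constraint violation.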

Transaction Workflow (alternative)

The default workflow yields low contention. There's an alternative workflow designed to provoke more contention to visualize the effects of transient rollback errors and retries. The alternative workflow will perform an initial balance check and optionally use select-for-update (SFU) locks. Without the SFUs, the chance of contention is high. The main benefit of this workflow is that the running balance of the accounts can be stored on the transaction legs.

It's enabled by starting the server with the --roachbank.updateRunningBalance=true option.

-- (1) initial query for all involved accounts (lock is optional)
SELECT .. FROM account WHERE id IN (..) AND city IN (..) /* FOR UPDATE */;

-- (2) header
INSERT INTO transaction (id,city,balance,currency,name,..);

-- (3) for each leg (notice running_balance)
INSERT INTO transaction_item (city,transaction_id,running_balance,..);

-- (4) for each account (batch)
UPDATE account SET balance = account.balance + data_table.balance, updated=clock_timestamp()
FROM (select unnest(?) as id, unnest(?) as balance) as data_table
WHERE account.id=data_table.id
  AND account.closed=false
  AND (account.balance + data_table.balance) * abs(account.allow_negative-1) >= 0

Transaction Retry Strategy

Any database running with serializable isolation (such as CockroachDB) is exposed to transient SQL errors on contended workloads. These errors are tagged with SQL state 40001 and can be safely retried.

The ledger has three main strategies for performing retries:

Client-side retries using a Spring/AspectJ AOP "around advice" with exponential backoff.
JDBC driver level retries using the CockroachDB JDBC Driver
No retries where transient SQL errors propagate to the client.

The default is client-side retries.
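As a rough illustration of what a client-side retry does, here is a minimal plain-Java sketch of a retry loop with exponential backoff and jitter. It is not the ledger's AOP advice. The RetrySketch class, the SqlWork interface and the withRetry method are hypothetical names, and the unit of work is assumed to begin and commit (or roll back) its own transaction on every invocation.

```java
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;

class RetrySketch {

    // A unit of transactional work (hypothetical); each call runs a fresh transaction.
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    static final String SERIALIZATION_FAILURE = "40001";

    static <T> T withRetry(SqlWork<T> work, int maxAttempts, long initialDelayMs)
            throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                boolean transientError = SERIALIZATION_FAILURE.equals(e.getSQLState());
                // Give up on non-transient errors or when attempts are exhausted
                if (!transientError || attempt >= maxAttempts) {
                    throw e;
                }
                // Exponential backoff (doubling) plus random jitter up to the same amount
                long backoff = initialDelayMs * (1L << (attempt - 1));
                long jitter = ThreadLocalRandom.current().nextLong(backoff + 1);
                try {
                    Thread.sleep(backoff + jitter);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        int[] calls = {0};
        // Simulated work that fails twice with a serialization error, then succeeds
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SQLException("restart transaction", SERIALIZATION_FAILURE);
            }
            return "committed";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Errors with any other SQL state propagate immediately, in line with the advice to retry only on 40001 unless the operations are known to be idempotent.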

For more details on pros/cons with these different retry strategies, see:

APIs

The ledger provides two main interfaces:

A hypermedia/REST API for request/response-based interactions based on Spring HATEOAS. The shell client (based on Spring Shell) interacts with the ledger through this API.
A WebSocket Streaming API for reactive front-ends, driven via CockroachDB CDC (optional) or synthetic events.

The Hypermedia API is used to view data, create accounts and generate monetary transactions. A typical HTTP client follows the hyperlinks provided by the API to guide through different workflows, such as placing a monetary transaction or browsing through pages of account details. As with any REST API, following hyperlinks is optional. A client can also bind directly to the resource URI:s with tight coupling as a result. The semantics of the endpoints are tied to the link relations rather than the opaque URI:s.

https://en.wikipedia.org/wiki/Hypertext_Application_Language
https://en.wikipedia.org/wiki/HATEOAS

Conclusion

Roach Bank is a financial accounting ledger demo running on CockroachDB and PostgreSQL. It uses an entity model for double-entry bookkeeping and provides two distinct data access implementations. This article discusses an alternative transaction workflow with a balance check and retry strategy for databases running in serializable mode. The retry strategy is handled via Spring/CGLIB proxies with exponential backoff.

Create a Ledger Utilizing CockroachDB - Part II - Deployment

Kai Niemi — Tue, 11 Apr 2023 16:00:39 GMT

In this second part of a series about RoachBank, a full-stack financial accounting ledger running on CockroachDB, we will look at how to deploy the bank against a global, multi-regional CockroachDB cluster.

Cloud Deployment

The ledger provides a few convenience scripts for deploying to AWS, GCE and Azure using an internal tool called roachprod. This tool is free to use but at your own risk.

The provided scripts will do the following:

Provision a single-region or multi-region CockroachDB cluster
Deploy HAProxy on all client nodes
Deploy bank server and client JAR on all client nodes
Start the bank server on all client nodes
Enable regional-by-row and global tables for multi-region (if needed)

Prerequisites

Roachprod - a Cockroach Labs internal tool for ramping AWS/GCE/Azure VM clusters
You will need the AWS/GCE/AZ client SDK and an account.

Deployment Scripts

AWS Deployment

The $basedir/deploy/aws folder contains a few scripts for provisioning different cluster sizes in different regions. Let's look at the aws-multiregion-eu.sh which is a multi-region configuration spanning eu-west-1, eu-west-2 and eu-central-1.

#!/bin/bash
# Script for setting up a multi-region Roach Bank cluster using roachprod in either AWS or GCE.

# Configuration
########################

title="CockroachDB 3-region EU deployment"
# CRDB release version
releaseversion="v22.2.5"
# Number of node instances in total including clients
nodes="12"
# Nodes hosting CRDB
crdbnodes="1-9"
# Array of client nodes (must match size of regions)
clients=(10 11 12)
# Array of regions localities (must match zone names)
regions=('eu-west-1' 'eu-west-2' 'eu-central-1')
# AWS/GCE cloud (aws|gce)
cloud="aws"
# AWS/GCE region zones (must align with nodes count)
zones="\
eu-west-1a,\
eu-west-1b,\
eu-west-1c,\
eu-west-2a,\
eu-west-2b,\
eu-west-2c,\
eu-central-1a,\
eu-central-1b,\
eu-central-1c,\
eu-west-1a,\
eu-west-2a,\
eu-central-1a"
# AWS/GCE machine types
machinetypes="c5d.4xlarge"

# DO NOT EDIT BELOW THIS LINE
#############################

functionsdir="../common"

source "${functionsdir}/core_functions.sh"

main.sh

By the end of running this script, you would have an AWS provisioned 12-node (instances) cluster, out of which the nodes 10, 11 and 12 are hosting the bank application stack including HAProxy.

Something like in this diagram:

The setup script is interactive and each step will ask for confirmation. It's launched by this simple command:

./aws-multiregion-eu.sh

After the steps are completed, you should have a page automatically opened in your default browser along with the service landing page showing account boxes.

GCE Deployment

For GCE, the process is quite similar, just using different regions and instance types.

For example:

cd deploy/gce
chmod +x *.sh
./gce-multiregion-eu.sh

Azure Deployment

The same goes for Azure, however, it only contains a single region provisioning script.

cd deploy/azure
chmod +x *.sh
./azure-singleregion.sh

Operating in Multi-Region

When the ledger is deployed in a multi-regional topology (like US-EU-APAC), the accounts and transactions need to be pinned/domiciled to each region for best performance.

This is done by using the regional-by-row table locality in CockroachDB. There's an explicit step in the setup script that executes the SQL statements below. This will provide for low read and write latencies in each region.

For the AWS multi-region example:

ALTER DATABASE roach_bank PRIMARY REGION "eu-central-1";
ALTER DATABASE roach_bank ADD REGION "eu-west-1";
ALTER DATABASE roach_bank ADD REGION "eu-west-2";

ALTER TABLE region SET locality GLOBAL;

ALTER TABLE account ADD COLUMN region crdb_internal_region AS (
    CASE
        WHEN city IN ('dublin','belfast','liverpool','manchester','glasgow') THEN 'eu-west-1'
        WHEN city IN ('london','birmingham','leeds','amsterdam','rotterdam','antwerp','hague','ghent','brussels') THEN 'eu-west-2'
        WHEN city IN ('berlin','hamburg','munich','frankfurt','dusseldorf','leipzig','dortmund','essen','stuttgart','stockholm','copenhagen','helsinki','oslo','riga','tallinn') THEN 'eu-central-1'
        ELSE 'eu-central-1'
        END
    ) STORED NOT NULL;

ALTER TABLE account SET LOCALITY REGIONAL BY ROW AS region;

ALTER TABLE transaction ADD COLUMN region crdb_internal_region AS (
    CASE
        WHEN city IN ('dublin','belfast','liverpool','manchester','glasgow') THEN 'eu-west-1'
        WHEN city IN ('london','birmingham','leeds','amsterdam','rotterdam','antwerp','hague','ghent','brussels') THEN 'eu-west-2'
        WHEN city IN ('berlin','hamburg','munich','frankfurt','dusseldorf','leipzig','dortmund','essen','stuttgart','stockholm','copenhagen','helsinki','oslo','riga','tallinn') THEN 'eu-central-1'
        ELSE 'eu-central-1'
        END
    ) STORED NOT NULL;

ALTER TABLE transaction SET LOCALITY REGIONAL BY ROW AS region;

ALTER TABLE transaction_item ADD COLUMN region crdb_internal_region AS (
    CASE
        WHEN city IN ('dublin','belfast','liverpool','manchester','glasgow') THEN 'eu-west-1'
        WHEN city IN ('london','birmingham','leeds','amsterdam','rotterdam','antwerp','hague','ghent','brussels') THEN 'eu-west-2'
        WHEN city IN ('berlin','hamburg','munich','frankfurt','dusseldorf','leipzig','dortmund','essen','stuttgart','stockholm','copenhagen','helsinki','oslo','riga','tallinn') THEN 'eu-central-1'
        ELSE 'eu-central-1'
        END
    ) STORED NOT NULL;

ALTER TABLE transaction_item SET LOCALITY REGIONAL BY ROW AS region;

When transactions are issued against accounts in these different cities, the read-and-write operations will be constrained to the home regions. For example, creating monetary transactions involving accounts in "stockholm" and "helsinki" will be serviced only by the 3 nodes in the region eu-central-1. Read operations will have local latency and write operations will have only one single roundtrip to the next closest region.
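To make the city-to-region mapping concrete, here is a small Java sketch that mirrors the CASE expression of the computed region column. It is only an illustration: the HomeRegionSketch class and its methods are hypothetical, and in the real deployment the mapping is evaluated by the database, not by the client.

```java
import java.util.Arrays;
import java.util.Set;

class HomeRegionSketch {

    static final Set<String> EU_WEST_1 = Set.of(
            "dublin", "belfast", "liverpool", "manchester", "glasgow");

    static final Set<String> EU_WEST_2 = Set.of(
            "london", "birmingham", "leeds", "amsterdam", "rotterdam",
            "antwerp", "hague", "ghent", "brussels");

    // Same branching as the computed column: two explicit regions,
    // everything else (including unknown cities) falls back to eu-central-1.
    static String homeRegion(String city) {
        if (EU_WEST_1.contains(city)) {
            return "eu-west-1";
        }
        if (EU_WEST_2.contains(city)) {
            return "eu-west-2";
        }
        return "eu-central-1";
    }

    // True when all given cities resolve to one and the same home region.
    static boolean sameHomeRegion(String... cities) {
        return Arrays.stream(cities)
                .map(HomeRegionSketch::homeRegion)
                .distinct()
                .count() <= 1;
    }

    public static void main(String[] args) {
        System.out.println(homeRegion("stockholm"));
        System.out.println(sameHomeRegion("stockholm", "helsinki"));
        System.out.println(sameHomeRegion("london", "dublin"));
    }
}
```

A transfer between "stockholm" and "helsinki" stays within a single home region, while one between "london" and "dublin" touches rows homed in two different regions.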

As an option to limit the amount of data and the overhead of replicating across regions, you could disable the non-voting replicas with placement restrictions. This would result in no replicas being placed outside of the home regions, with the consequence of higher latency for follower reads in the other regions.

The tradeoff is regional survival. To combine both regional survival and data domiciling with regional-by-row, you can use super-regions which is covered more in this post.

SET enable_multiregion_placement_policy=on;
ALTER DATABASE roach_bank PLACEMENT RESTRICTED;

Running a Global Workload

First, SSH to the first client machine which in the AWS example is node (aka instance) 10.

roachprod run:$CLUSTER:10

Next, start the bank client and type connect. It should print something like this:

              C O C K R O A C H D B
   ___                __     ___            __
  / _ \___  ___ _____/ /    / _ )___ ____  / /__
 / , _/ _ \/ _ `/ __/ _ \  / _  / _ `/ _ \/  '_/
/_/|_|\___/\_,_/\__/_//_/ /____/\_,_/_//_/_/\_\
bank-client (v2.0.1.BUILD-SNAPSHOT) powered by Spring Boot (v3.0.4)
Active profiles: ${spring.profiles.active}
15:30:37.219  INFO [main] [io.roach.bank.client.ClientApplication] Starting ClientApplication v2.0.1.BUILD-SNAPSHOT using Java 17.0.6 with PID 10267 (/home/ubuntu/bank-client.jar started by ubuntu in /home/ubuntu)
15:30:37.220  INFO [main] [io.roach.bank.client.ClientApplication] No active profile set, falling back to 1 default profile: "default"
15:30:38.414  INFO [main] [io.roach.bank.client.ClientApplication] Started ClientApplication in 1.564 seconds (process running for 2.095)
disconnected:$ connect
15:30:42.949  INFO [main] [io.roach.bank.client.command.Connect] Connecting to http://localhost:8090/api..
15:30:43.084  INFO [main] [io.roach.bank.client.command.Connect] Welcome to text-only Roach Bank. You are in a dark, cold lobby.
15:30:43.084  INFO [main] [io.roach.bank.client.command.Connect] Type help for commands.
localhost:$

Next, let's run some account transfers across the cities in the local region. First, we need to verify that the local gateway region is eu-west-1:

localhost:$ gateway-region
eu-west-1

Then start the transfers:

localhost:$ transfer --regions eu-west-1

If you look in the browser tab pointing at the regional bank service, you should see some effects on the accounts in that region. If you are not sure about the URL, use roachprod ip:

roachprod ip $CLUSTER:10-12 --external

Pick the first IP and append port 8090 and you should see:

Note: For simplicity, the push events that update the balances are not broadcasted across regions, so you can only see effects at a regional level.

The transfer command runs with a very low volume by default, but it can be ramped up with more concurrent threads and a larger selection of accounts to avoid contention. The low amount range reduces the risk of ending up with a negative balance, which would cause aborts.

transfer --regions eu-west-1 --concurrency 10 --limit 1000 --amount 0.01-0.15

To run a 100% read-based workload we can use the balance command. This will start ten concurrent workers per city in the given region and run point lookups.

balance --regions eu-west-1 --followerReads --concurrency 10 --limit 1000

Lastly, repeat the steps above for client nodes 11 and 12 so you end up with 3 concurrent clients and servers, one pair per region.

roachprod run:$CLUSTER:11..

Once the workloads run at full speed, you should see metrics picking up in the DB Console.

As we can see in the hardware dashboard, the vCPU utilization starts reaching the 50% threshold. Using these 16vCPU VMs, we get around 40K QPS at less than 2ms on P99. Keep in mind the cluster stretches across 3 regions in EU.

Summary

This article provides instructions on how to deploy the RoachBank accounting ledger demo on a multi-regional CockroachDB cluster, including instructions for deploying on AWS, GCE and Azure, setting up regional-by-row and global tables, and using the roachprod tool. Prerequisites include the AWS/GCE/AZ client SDK and an account.

Create a Ledger Utilizing CockroachDB - Part I - Introduction

Kai Niemi — Mon, 10 Apr 2023 16:00:39 GMT

This is the first part of a series about RoachBank, a full-stack financial accounting ledger demo running on both CockroachDB and PostgreSQL. It's designed to demonstrate the safety and liveness properties of a globally deployed, system-of-record type of financial workload.

Introduction

The concept behind the ledger is to move funds between monetary accounts using balanced, multi-legged transactions, at a high frequency. As a financial system, correctness is defined as conserving money at all times and providing an audit trail of monetary transactions performed towards the accounts. Put simply, when externally observing the system, the total account balance must be constant at all times. Funds are simply moved between different accounts using balanced transactions.

This is visualized by the service using a single page to display accounts as rectangles with their current balance.

Key Invariants

There are a few business rule invariants that must hold at all times, regardless of observers and of activities such as infrastructure failures (nodes crashing) or conflicting operations that concurrently update the same accounts.

The total balance of all accounts must be constant.
User accounts must have a positive balance (account types that disallow negative balance).
An audit trail of all transactions must be stored from which the account balances can be derived.

The system must refuse forward progress if an operation would result in any of these invariants being compromised. For example, if a variation of the total balances is observed at any given time, then money has either been "invented" or "destroyed".
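To make the invariants concrete, here is a minimal in-memory sketch of a transfer that refuses forward progress when an account would go negative, while the total balance stays constant. The class and method names (InvariantSketch, open, transfer, total) are hypothetical and not taken from RoachBank, and the sketch ignores concurrency, which is what the database transactions handle in the real service.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory ledger, only meant to illustrate the invariants.
final class InvariantSketch {
    private final Map<String, BigDecimal> balances = new HashMap<>();

    void open(String account, BigDecimal initialBalance) {
        balances.put(account, initialBalance);
    }

    // Sum of all account balances; must be the same before and after any transfer.
    BigDecimal total() {
        return balances.values().stream().reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    BigDecimal balanceOf(String account) {
        return balances.get(account);
    }

    // Moves funds between two accounts, or throws without changing anything.
    void transfer(String from, String to, BigDecimal amount) {
        if (amount.signum() <= 0) {
            throw new IllegalArgumentException("Amount must be positive");
        }
        if (from.equals(to)) {
            throw new IllegalArgumentException("Source and target must differ");
        }
        BigDecimal source = balances.get(from);
        BigDecimal target = balances.get(to);
        if (source == null || target == null) {
            throw new IllegalArgumentException("Unknown account");
        }
        BigDecimal remaining = source.subtract(amount);
        if (remaining.signum() < 0) {
            // Refuse forward progress rather than break the invariant
            throw new IllegalStateException("Insufficient funds in " + from);
        }
        balances.put(from, remaining);
        balances.put(to, target.add(amount));
    }
}
```

Because every check happens before the first write, a refused transfer leaves both balances untouched.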

These invariants are safeguarded by ACID guarantees and true serializable transactions. CockroachDB defaults to serializable isolation, while PostgreSQL defaults to read committed but can be elevated to serializable, which it implements as serializable snapshot isolation (SSI).

Double-entry Bookkeeping

To satisfy the audit trail requirement, the ledger follows the double-entry bookkeeping principle. This principle was originally formalized and published by the Italian mathematician Luca Pacioli during the 15th century.

It involves making at least two account entries for every transaction. A debit in one account and a corresponding credit in another account. The sum of all debits must equal the sum of all credits, providing a simple method for error detection. Real accounting doesn't use negative numbers, but for simplicity, this ledger does (it's not about modelling the true complexity of accounting).

A positive value means increasing the value (credit), and a negative value means decreasing the value (debit). A transaction is considered balanced when the sum of the legs with the same currency equals zero.

In the following example, there are four different accounts involved with zero-sum in the end.

Account | Credit(+) | Debit(-) |
A         100
B                     -50
C          25
D*                    -25 \
                           -75 (coalesced)
D*                    -50 /
------------------------------------------
         125    +   -125 = 0
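The balance rule can be sketched in a few lines of Java. This is an illustration only: the Leg record and the isBalanced method are hypothetical names, and the currency code "USD" is an assumption since the example above does not name one.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that checks the zero-sum rule for a multi-legged transaction.
final class BalanceCheck {
    // One leg of a transaction: positive amount = credit, negative amount = debit.
    record Leg(String account, String currency, BigDecimal amount) {}

    // A transaction is balanced when the legs sum to zero for every currency.
    static boolean isBalanced(List<Leg> legs) {
        Map<String, BigDecimal> sums = new HashMap<>();
        for (Leg leg : legs) {
            sums.merge(leg.currency(), leg.amount(), BigDecimal::add);
        }
        return sums.values().stream().allMatch(sum -> sum.signum() == 0);
    }

    public static void main(String[] args) {
        // The four-account example: 100 + 25 in credits, 50 + 25 + 50 in debits
        List<Leg> legs = List.of(
                new Leg("A", "USD", new BigDecimal("100")),
                new Leg("B", "USD", new BigDecimal("-50")),
                new Leg("C", "USD", new BigDecimal("25")),
                new Leg("D", "USD", new BigDecimal("-25")),
                new Leg("D", "USD", new BigDecimal("-50")));
        System.out.println("balanced: " + isBalanced(legs));
    }
}
```

Summing per currency matters because legs in different currencies must never cancel each other out.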

Building

Prerequisites

Java 17
- https://openjdk.org/projects/jdk/17/
- https://www.oracle.com/java/technologies/downloads/#java17
Maven 3+
- https://maven.apache.org/

The service is built with Maven 3.1+. Takari's Maven wrapper is included (mvnw), so installing Maven is optional. All third-party dependencies are available in public Maven repositories except for the CockroachDB JDBC driver, which is available in GitHub Packages (you only need a GitHub account).

These dependencies are available in GitHub packages:

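<dependency>
    <groupId>io.cockroachdb.jdbc</groupId>
    <artifactId>cockroachdb-jdbc-driver</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>io.cockroachdb</groupId>
    <artifactId>spring-data-cockroachdb</artifactId>
    <version>1.0.0</version>
</dependency>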

To allow Maven to use this repository, add a github profile (or similar) to your Maven settings.xml file. Edit ${user.home}/.m2/settings.xml.

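An example is provided below:

<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">
    <servers>
        <server>
            <id>github</id>
            <username>your-github-id</username>
            <password>your-personal-access-token</password>
        </server>
    </servers>
    <mirrors>
        <mirror>
            <id>maven-default-http-blocker</id>
            <mirrorOf>external:http:*</mirrorOf>
            <name>Pseudo repository to mirror external repositories initially using HTTP.</name>
            <url>http://0.0.0.0/</url>
            <blocked>true</blocked>
        </mirror>
    </mirrors>
    <profiles>
        <profile>
            <id>github</id>
            <repositories>
                <repository>
                    <id>central</id>
                    <url>https://repo1.maven.org/maven2</url>
                </repository>
                <repository>
                    <id>github</id>
                    <url>https://maven.pkg.github.com/cockroachlabs-field/*</url>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                </repository>
            </repositories>
        </profile>
    </profiles>
    <activeProfiles>
        <activeProfile>github</activeProfile>
    </activeProfiles>
</settings>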

Clone the project

git clone git@github.com:kai-niemi/roach-bank.git

Build the executable jars

Using installed Maven:

cd roach-bank
chmod +x mvnw
mvn clean install

Using the Maven wrapper (where you need to specify settings.xml):

cd roach-bank
chmod +x mvnw
./mvnw clean install -s /settings.xml

Local Deployment

This assumes you already have a local CockroachDB cluster running.

First, create the database:

cockroach sql --insecure --host=localhost -e "CREATE database roach_bank"

Then start the server:

java -jar bank-server/target/bank-server.jar

Then start the client:

java -jar bank-client/target/bank-client.jar

The client is used to issue business transactions to the server's REST API. The client and server could be on separate hosts with an L7 load balancer in-between, but for convenience, the client connects to localhost by default.

Next Steps

In the second part of this series, we'll cover how to run the bank against a multi-regional cloud deployment. The third part goes into design details and the technology stack.

Conclusion

RoachBank is a full-stack financial accounting ledger demo running on both CockroachDB and PostgreSQL. It follows the double-entry bookkeeping principle and is designed to demonstrate the safety and liveness properties of a globally deployed, system-of-record type of financial workload. This article provides instructions on how to set up and run the demo, as well as details on the technology stack and design.

CockroachDB JDBC Driver: Part III - Bulk Rewrites

Kai Niemi — Sat, 08 Apr 2023 16:03:39 GMT

The CockroachDB JDBC driver wraps the PostgreSQL driver and offers performance optimizations that are transparent towards applications.

Article series on the JDBC driver:

https://blog.cloudneutral.se/series/cockroachdb-jdbc-driver

Introduction

This article will highlight one specific optimization feature, namely, batch DML rewrites for bulk operations using FROM clause with array unnesting. It's a mouthful, but conceptually it's a transparent rewrite at the driver level of batch DML updates like:

UPDATE product SET inventory=?, price=? WHERE id=1
UPDATE product SET inventory=?, price=? WHERE id=2
UPDATE product SET inventory=?, price=? WHERE id=3
...

Which can be collapsed to a single update statement using FROM:

UPDATE product SET inventory=data_table.new_inventory, price=data_table.new_price FROM (select unnest(?) as id, unnest(?) as new_inventory, unnest(?) as new_price) as data_table WHERE product.id=data_table.id

This dramatically improves the performance of bulk operations that send a series of non-aggregated UPDATEs to independent rows. It's not limited to UPDATEs either: INSERT and UPSERT statements can be rewritten the same way. For those, the rewrite may also improve performance because it allows for batch sizes higher than 128, which is the limit in the pgJDBC driver for rewriting INSERTs.
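To give a feel for the shape of such a rewrite, here is a small string-building sketch that produces the collapsed statement from a table name, the columns being set and the key column. It is an illustration under simplifying assumptions, not the driver's actual implementation: the class and method names are hypothetical, and a real rewrite has to parse the original SQL rather than receive the pieces pre-split.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of building the collapsed UPDATE .. FROM statement.
final class UnnestUpdateSketch {
    static String rewrite(String table, List<String> setColumns, String keyColumn) {
        // SET col=data_table.new_col for every updated column
        String assignments = setColumns.stream()
                .map(column -> column + "=data_table.new_" + column)
                .collect(Collectors.joining(", "));
        // One array parameter per updated column, unnested into a derived table
        String projections = setColumns.stream()
                .map(column -> "unnest(?) as new_" + column)
                .collect(Collectors.joining(", "));
        return "UPDATE " + table + " SET " + assignments
                + " FROM (select unnest(?) as " + keyColumn + ", " + projections + ") as data_table"
                + " WHERE " + table + "." + keyColumn + "=data_table." + keyColumn;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("product", List.of("inventory", "price"), "id"));
    }
}
```

Note that the parameter order changes: the batched form binds inventory, price and id once per row, while the collapsed form binds one array each for id, inventory and price.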

Consider the following example schema:

create table if not exists product
(
    id        uuid           not null default gen_random_uuid(),
    inventory int            not null,
    name      varchar(128)   not null,
    price     numeric(19, 2) not null,
    sku       varchar(128)   not null unique,
    primary key (id)
);

Next, let's add a few rows:

insert into product (inventory,name,price,sku)
values (10, 'A', 12.50, gen_random_uuid()),
       (10, 'B', 13.50, gen_random_uuid()),
       (10, 'C', 14.50, gen_random_uuid()),
       (10, 'D', 15.50, gen_random_uuid());

select inventory,name,price,sku from product order by name;

The result is equivalent to that of the next statement, the only difference being that its rows come from temporary array data rather than from the table:

select unnest(ARRAY[10, 10, 10, 10]) as inventory,
       unnest(ARRAY['A', 'B', 'C', 'D']) as name,
       unnest(ARRAY[12.50, 13.50, 14.50, 15.50]) as price,
       unnest(ARRAY[gen_random_uuid(),
                    gen_random_uuid(),
                    gen_random_uuid(),
                    gen_random_uuid()]) as sku
order by name;
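As a plain Java analogy of what the statement above does, the following sketch zips equally sized column lists into rows. It only mimics the row-building behaviour for illustration: the names are hypothetical, and unlike the database function it simply rejects columns of different lengths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory analogy of unnesting several arrays side by side.
final class UnnestSketch {
    // Each inner list is one column; the result has one list per row.
    static List<List<Object>> unnest(List<List<?>> columns) {
        int rowCount = columns.isEmpty() ? 0 : columns.get(0).size();
        for (List<?> column : columns) {
            if (column.size() != rowCount) {
                throw new IllegalArgumentException("All columns must have the same length");
            }
        }
        List<List<Object>> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            List<Object> row = new ArrayList<>();
            for (List<?> column : columns) {
                row.add(column.get(r)); // the r:th element of every array forms row r
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<?>> columns = List.of(
                List.of(10, 10, 10, 10),
                List.of("A", "B", "C", "D"));
        unnest(columns).forEach(System.out::println);
    }
}
```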

The unnest function uses arrays to generate a temporary table, with each array representing a column. Now, let's flip this over to an INSERT INTO .. SELECT:

insert into product (inventory,name,price,sku)
(select unnest(ARRAY[10, 10, 10, 10]) as inventory,
        unnest(ARRAY['A', 'B', 'C', 'D']) as name,
        unnest(ARRAY[12.50, 13.50, 14.50, 15.50]) as price,
        unnest(ARRAY[gen_random_uuid(),
                     gen_random_uuid(),
                     gen_random_uuid(),
                     gen_random_uuid()]) as sku);

That will create four new products out of the contents of the arrays. When put into a JDBC-prepared statement context:

try (PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO product(id,inventory,price,name,sku)"
                + " select"
                + "  unnest(?) as id,"
                + "  unnest(?) as inventory, unnest(?) as price,"
                + "  unnest(?) as name, unnest(?) as sku")) {
    // chunks is a segmented stream of products
    chunks.forEach(chunk -> {
        List<Integer> qty = new ArrayList<>();
        List<BigDecimal> price = new ArrayList<>();
        List<UUID> ids = new ArrayList<>();
        List<String> name = new ArrayList<>();
        List<String> sku = new ArrayList<>();

        // Pivot the rows of this chunk into one list per column
        chunk.forEach(product -> {
            ids.add(product.getId());
            qty.add(product.getInventory());
            price.add(product.getPrice());
            name.add(product.getName());
            sku.add(product.getSku());
        });

        try {
            // Bind one SQL array per column and execute a single statement per chunk
            ps.setArray(1, ps.getConnection().createArrayOf("UUID", ids.toArray()));
            ps.setArray(2, ps.getConnection().createArrayOf("BIGINT", qty.toArray()));
            ps.setArray(3, ps.getConnection().createArrayOf("DECIMAL", price.toArray()));
            ps.setArray(4, ps.getConnection().createArrayOf("VARCHAR", name.toArray()));
            ps.setArray(5, ps.getConnection().createArrayOf("VARCHAR", sku.toArray()));
            ps.executeLargeUpdate();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    });
}

This is functionally equivalent to using addBatch() and executeLargeBatch(), with the important difference that the batch API only collapses the statements when the pgJDBC driver's reWriteBatchedInserts property is set to true.

From a performance standpoint, both approaches are equivalent up to a batch size of 128, which is the hardcoded limit in the pgJDBC driver. Rewriting larger batches than that is not possible, so the array unnesting approach is more performant beyond this limit. Depending on the workload, you can go up to a batch size of around 16-32K before performance starts to level out again.

The hardcoded limit of 128 may be appropriate for PostgreSQL but CockroachDB is not PostgreSQL (only at the wire protocol level) and can leverage much higher bulk statement sizes. Until this limit is removed or made configurable in pgJDBC, there are only two options:

Modify the pgJDBC driver and build a custom library. This is quite straightforward but requires a separate forked version to be maintained.
Rewrite INSERTs, UPSERTs and UPDATEs with unnesting of arrays

Bulk Inserts

To recap, here's an UPSERT example that hits the pgJDBC driver 128 size limit on INSERT rewrites.

List<Product> products = Arrays.asList(
        Product.builder().withName("A").withInventory(1).withPrice(new BigDecimal("10.15")).build(),
        Product.builder().withName("B").withInventory(2).withPrice(new BigDecimal("11.15")).build(),
        Product.builder().withName("C").withInventory(3).withPrice(new BigDecimal("12.15")).build()
        // .. etc to several 1000s
);

Stream<List<Product>> chunks = chunkedStream(products.stream(), 128);

dataSource.addDataSourceProperty("reWriteBatchedInserts", true);

try (Connection connection = dataSource.getConnection()) {
    connection.setAutoCommit(true);

    chunks.forEach(chunk -> {
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO product (id,inventory,price,name,sku) values (?,?,?,?,?) ON CONFLICT (id) DO NOTHING")) {
            for (Product product : chunk) {
                ps.setObject(1, product.getId());
                ps.setObject(2, product.getInventory());
                ps.setObject(3, product.getPrice());
                ps.setObject(4, product.getName());
                ps.setObject(5, product.getSku());
                ps.addBatch();
            }
            ps.executeBatch();
        } catch (SQLException ex) {
            throw new RuntimeException(ex);
        }
    });
}

For completeness, the chunkedStream method which just slices up a stream into even chunks:

    public static <T> Stream<List<T>> chunkedStream(Stream<T> stream, int chunkSize) {
        AtomicInteger idx = new AtomicInteger();
        // Group elements by running index divided by chunk size, then stream the groups
        return stream.collect(Collectors.groupingBy(x -> idx.getAndIncrement() / chunkSize))
                .values().stream();
    }

This UPSERT statement is executed using implicit transactions and it's fairly fast with the reWriteBatchedInserts property. The pgJDBC rewrite feature works both for regular INSERTs and INSERT .. ON CONFLICT, aka UPSERTs.

Bulk Updates

Given the previous example, it's fair to assume UPDATEs work in the same way:

try (PreparedStatement ps = connection.prepareStatement(
        "UPDATE product SET inventory=?, price=? WHERE id=?")) {
    chunk.forEach(product -> {
        try {
            ps.setInt(1, product.getInventory());
            ps.setBigDecimal(2, product.getPrice());
            ps.setObject(3, product.getId());
            ps.addBatch();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    });
    ps.executeLargeBatch();
} catch (SQLException ex) {
    throw new RuntimeException(ex);
}

However, there's no batching done here whatsoever at the JDBC driver level. To apply individual row UPDATEs in bulk format, you can however use arrays again:

try (PreparedStatement ps = connection.prepareStatement(
        "UPDATE product SET inventory=data_table.new_inventory, price=data_table.new_price "
                + "FROM (select "
                + "unnest(?) as id, "
                + "unnest(?) as new_inventory, "
                + "unnest(?) as new_price) as data_table "
                + "WHERE product.id=data_table.id")) {
    List<Integer> qty = new ArrayList<>();
    List<BigDecimal> price = new ArrayList<>();
    List<UUID> ids = new ArrayList<>();

    // Collect the new values per column for all products in the chunk
    chunk.forEach(product -> {
        qty.add(product.addInventoryQuantity(1));
        price.add(product.getPrice().add(new BigDecimal("1.00")));
        ids.add(product.getId());
    });

    ps.setArray(1, ps.getConnection()
            .createArrayOf("UUID", ids.toArray()));
    ps.setArray(2, ps.getConnection()
            .createArrayOf("BIGINT", qty.toArray()));
    ps.setArray(3, ps.getConnection()
            .createArrayOf("DECIMAL", price.toArray()));

    ps.executeLargeUpdate();
} catch (SQLException e) {
    throw new RuntimeException(e);
}

The performance improvement with this approach is monumental. The only problem is that it's rather clunky and requires code refactoring.

The CockroachDB JDBC driver can however rewrite bulk DML operations on behalf of applications, which makes it transparent. See the github repo https://github.com/cloudneutral/cockroachdb-jdbc for more details.

Conclusion

This article discusses a feature of the CockroachDB JDBC driver to optimize batch updates with array unnesting, allowing for much larger batch sizes than the pgJDBC driver. It also considers the performance improvement of this approach and the code refactoring required. This approach can be used for both INSERTs and UPSERTs, as well as individual row UPDATEs.

Introduction to Spring Data CockroachDB

Kai Niemi — Fri, 07 Apr 2023 16:10:39 GMT

The Spring Data CockroachDB project aims to provide a familiar and consistent Spring-based programming model for CockroachDB as a SQL database.

The primary goal of the Spring Data project is to make it easier to build Spring-powered applications that use new data access technologies such as relational databases, non-relational databases, map-reduce frameworks, and cloud-based data services.

CockroachDB is a distributed SQL database built on a transactional and strongly-consistent key-value store. It scales horizontally; survives disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention; supports strongly-consistent ACID transactions; and provides a familiar SQL API for structuring, manipulating, and querying data.

Features

Bundles the CockroachDB JDBC driver
Meta-annotations for declaring:
- Retryable transactions
- Read-only transactions
- Strong and stale follower-reads
- Custom session variables including timeouts
AOP aspects for:
- Retrying transactions on serialization conflicts
- Configuring session variables, like follower-reads
Connection pool factory settings for HikariCP
Datasource proxy logging via TTDDYY
Simple JDBC shell client for ad-hoc queries and testing

Getting Started

Here is a quick teaser of an application using Spring Data JPA Repositories in Java:

@Repository
public interface AccountRepository extends JpaRepository<Account, UUID> {
    Optional<Account> findByName(String name);

    @Query(value = "select a.balance "
            + "from Account a "
            + "where a.id = ?1")
    BigDecimal findBalanceById(UUID id);

    @Query(value = "select a.balance "
            + "from account a AS OF SYSTEM TIME follower_read_timestamp() "
            + "where a.id = ?1", nativeQuery = true)
    BigDecimal findBalanceSnapshotById(UUID id);

    @Query(value = "select a "
            + "from Account a "
            + "where a.id in (?1)")
    @Lock(LockModeType.PESSIMISTIC_READ)
    List<Account> findAllForUpdate(Set<UUID> ids);
}

@Service
public class AccountService {
    @Autowired
    private AccountRepository accountRepository;

    @NotTransactional
    public Account create(Account account) {
        return accountRepository.save(account);
    }

    @NotTransactional
    public Account findByName(String name) {
        return accountRepository.findByName(name)
                .orElseThrow(() -> new ObjectRetrievalFailureException(Account.class, name));
    }

    @NotTransactional
    public Account findById(UUID id) {
        return accountRepository.findById(id)
                .orElseThrow(() -> new ObjectRetrievalFailureException(Account.class, id));
    }

    @TransactionBoundary
    @Retryable
    public Account update(Account account) {
        Account accountProxy = accountRepository.getReferenceById(account.getId());
        accountProxy.setName(account.getName());
        accountProxy.setDescription(account.getDescription());
        accountProxy.setBalance(account.getBalance());
        accountProxy.setClosed(account.isClosed());
        return accountRepository.save(accountProxy);
    }

    @NotTransactional
    public BigDecimal getBalance(UUID id) {
        return accountRepository.findBalanceById(id);
    }

    @TransactionBoundary(timeTravel = @TimeTravel(mode = TimeTravelMode.FOLLOWER_READ), readOnly = true)
    public BigDecimal getBalanceSnapshot_Explicit(UUID id) {
        return accountRepository.findBalanceById(id);
    }

    @NotTransactional
    public BigDecimal getBalanceSnapshot_Implicit(UUID id) {
        return accountRepository.findBalanceSnapshotById(id);
    }

    @TransactionBoundary
    public void delete(UUID id) {
        accountRepository.deleteById(id);
    }

    @TransactionBoundary
    public void deleteAll() {
        accountRepository.deleteAll();
    }
}

@Configuration
@EnableTransactionManagement(order = AdvisorOrder.TRANSACTION_ADVISOR)
@EnableJpaRepositories(basePackages = {"org.acme.bank"})
public class BankApplication {
    @Bean
    public TransactionRetryAspect retryAspect() {
        return new TransactionRetryAspect();
    }

    @Bean
    public TransactionBoundaryAspect transactionBoundaryAspect(JdbcTemplate jdbcTemplate) {
        return new TransactionBoundaryAspect(jdbcTemplate);
    }
}

Maven configuration

Add this dependency to your pom.xml file:

<dependency>
    <groupId>io.cockroachdb</groupId>
    <artifactId>spring-data-cockroachdb</artifactId>
    <version>{version}</version>
</dependency>

Then add the Maven repository to your pom.xml file (alternatively in Maven's settings.xml):

<repository>
    <id>cockroachdb-jdbc</id>
    <name>Maven Packages</name>
    <url>https://maven.pkg.github.com/cloudneutral/cockroachdb-jdbc</url>
    <snapshots>
        <enabled>true</enabled>
    </snapshots>
</repository>

Finally, you need to authenticate to GitHub Packages by creating a personal access token (classic) that includes the read:packages scope. For more information, see Authenticating to GitHub Packages.

Add your personal access token to the servers section in your settings.xml:

<server>
    <id>github</id>
    <username>your-github-name</username>
    <password>your-access-token</password>
</server>

Now you should be able to build your project with the JDBC driver as a dependency:

mvn clean install

Alternatively, you can just clone the repository and build it locally using mvn install.

Modules

There are several modules in this project:

spring-data-cockroachdb

Provides a Spring Data module for CockroachDB, bundling the CockroachDB JDBC driver, connection pooling support via Hikari and meta-annotations and AOP aspects for client-side retry logic, as an alternative to JDBC driver level retries.

spring-data-cockroachdb-shell

An interactive spring shell client for ad-hoc SQL queries and CockroachDB settings and metadata introspection using the CockroachDB JDBC driver.

spring-data-cockroachdb-distribution

Distribution packaging of runnable artifacts including the shell client and JDBC driver, in tar.gz format. Activated via Maven profile, see build section further down in this page.

spring-data-cockroachdb-it

Integration and functional test harness. Activated via Maven profile, see build section further down in this page.

Building from Source

Spring Data CockroachDB requires Java 17 (or later) LTS.

Prerequisites

JDK17+ LTS for building (OpenJDK compatible)
Maven 3+ (optional, embedded)

If you want to build with the regular mvn command, you will need Maven v3.x or above.

Install the JDK (Linux):

sudo apt-get -qq install -y openjdk-17-jdk

Install the JDK (macOS):

brew install openjdk@17

Dependencies

This project depends on the CockroachDB JDBC driver whose artifacts are available in GitHub Packages.

Clone the project

git clone git@github.com:cloudneutral/spring-data-cockroachdb.git
cd spring-data-cockroachdb

Build the project

chmod +x mvnw
./mvnw clean install

If you want to build with the regular mvn command, you will need Maven v3.5.0 or above.

Build the distribution

./mvnw -P distribution clean install

The distribution tar.gz is now found in spring-data-cockroachdb-distribution/target.

Run Integration Tests

The integration tests will run through a series of contended workloads to exercise the retry mechanism and other JDBC driver features.

First, start a local CockroachDB node or cluster and create the database:

cockroach sql --insecure --host=localhost -e "CREATE database spring_data_test"

Then activate the integration test Maven profile:

./mvnw -P it clean install

See the pom.xml file for changing the database URL and other settings (under the it profile).

Conclusion

The Spring Data CockroachDB project offers a Spring-based programming model for CockroachDB, a distributed SQL database. It simplifies building Spring-powered applications with new data access technologies and includes features like bundling the CockroachDB JDBC driver, meta-annotations for transactions, connection pooling support, and a shell client for ad-hoc queries. The project requires Java 17 or later and can be built using Maven.

CockroachDB JDBC Driver: Part II - Design and implementation Details

Kai Niemi — Thu, 06 Apr 2023 16:06:39 GMT

In this second article, we'll take a closer look at the design and implementation of the custom-made CockroachDB JDBC driver. See Part I for an introduction to the driver.

Article series on the JDBC driver:

https://blog.cloudneutral.se/series/cockroachdb-jdbc-driver

Overview

The CockroachDB JDBC driver wraps the PostgreSQL JDBC driver (pgjdbc) which must also be on the app's classpath. There are no other dependencies besides SLF4J for which any supported logging framework can be used.

The driver accepts a unique URL prefix, jdbc:cockroachdb, to distinguish itself from jdbc:postgresql. When the driver is asked to open a connection (typically by the connection pool), it passes the call forward to pgJDBC and then wraps the connection in a CockroachDB connection proxy with a custom interceptor (invocation handler). The driver delegates all calls to the underlying pgJDBC driver and does not interact directly with the database itself at any point.
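The wrapping technique can be sketched with the JDK's own dynamic proxy support. This is a database-free illustration and not the driver's code: SimpleConnection is a hypothetical stand-in for java.sql.Connection, the logging interceptor only records method names, and translating the URL prefix into a pgJDBC URL is an assumption about how the call could be forwarded.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.List;

// Hypothetical sketch: URL prefix check plus a delegating dynamic proxy.
final class ProxySketch {
    static final String PREFIX = "jdbc:cockroachdb:";

    // Only URLs with the custom prefix are claimed by the wrapping driver.
    static boolean acceptsURL(String url) {
        return url != null && url.startsWith(PREFIX);
    }

    // Assumption: the wrapped driver is reached by swapping the URL prefix.
    static String toDelegateURL(String url) {
        if (!acceptsURL(url)) {
            throw new IllegalArgumentException("Unsupported URL: " + url);
        }
        return "jdbc:postgresql:" + url.substring(PREFIX.length());
    }

    // Stand-in for java.sql.Connection to keep the sketch self-contained.
    interface SimpleConnection {
        int executeUpdate(String sql);
    }

    // Wraps the delegate in a proxy whose handler sees every call before forwarding it.
    static SimpleConnection wrap(SimpleConnection delegate, List<String> callLog) {
        InvocationHandler handler = (proxy, method, args) -> {
            callLog.add(method.getName()); // an interceptor could retry or rewrite here
            try {
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the delegate's original exception
            }
        };
        return (SimpleConnection) Proxy.newProxyInstance(
                SimpleConnection.class.getClassLoader(),
                new Class<?>[] {SimpleConnection.class},
                handler);
    }
}
```

The application only ever sees the proxy, which is what lets an interceptor add behaviour without the application code changing.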

Internal Retries

One of the driver's features is internal retries, in contrast to application-side or client-side retries, which are the common option. The driver wraps each JDBC connection, statement and result set in a dynamic proxy with interceptors capable of detecting and retrying aborted transactions, provided that the SQL exceptions are of a qualifying type. The qualifying exception types fall into two main categories: Serialization Conflicts and Connection Errors.

Serialization Conflicts

The JDBC driver can optionally perform internal retries of failed transactions due to serialization conflicts denoted by the 40001 state code. Serialization conflict errors are safe to retry by the client, or in this case by the driver. Safe, in terms of not producing duplicate side effects since the transaction was rolled back.

This type of error is more likely to manifest in databases running with serializable transaction isolation (1SR), in particular for contended workloads subject to read/write and write/read conflicts.

There are however more limitations with driver-level retries than application-level. See the implementation section further below for more details.

Connection Errors

The JDBC driver can also perform internal retries on connection errors denoted by any of the 08001, 08003, 08004, 08006, 08007, 08S01 or 57P01 state codes. Connection errors during in-flight transactions are generally safe to retry, but there is a potential for duplicate side effects if the SQL operations performed are non-idempotent like INSERTs or UPDATEs with increment operations (UPDATE x set y=y-1).

This could happen for example if a transaction commit was successful but the response back to the client was lost due to a connection failure. In that case, the result is ambiguous and the driver can't tell if the transaction was successfully committed or rolled back.
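The two categories map directly onto SQL state codes, which can be sketched as a small classifier. The class and enum names below are hypothetical; only the state codes themselves come from the text above.

```java
import java.sql.SQLException;
import java.util.Set;

// Hypothetical classifier for the two retryable error categories.
final class RetryClassifier {
    enum Category { SERIALIZATION_CONFLICT, CONNECTION_ERROR, NOT_RETRYABLE }

    private static final String SERIALIZATION_FAILURE = "40001";

    private static final Set<String> CONNECTION_STATES =
            Set.of("08001", "08003", "08004", "08006", "08007", "08S01", "57P01");

    static Category classify(SQLException e) {
        String state = e.getSQLState();
        if (SERIALIZATION_FAILURE.equals(state)) {
            return Category.SERIALIZATION_CONFLICT; // safe to retry, nothing was committed
        }
        if (state != null && CONNECTION_STATES.contains(state)) {
            return Category.CONNECTION_ERROR; // retryable, but the outcome may be ambiguous
        }
        return Category.NOT_RETRYABLE;
    }
}
```

Keeping the categories separate is useful because connection errors carry the ambiguity described above, so a caller may want to retry them only for idempotent operations.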

Retry Implementation

Transaction conflicts and connection errors can surface at read, write and commit time which means there are retry interceptors wrapped around the following JDBC API artefacts:

java.sql.Connection
- implemented by CockroachConnection
- proxied by ConnectionRetryInterceptor, retries on commit()
java.sql.Statement
- implemented by CockroachStatement
- proxied by StatementRetryInterceptor, retries on write operations
java.sql.PreparedStatement
- implemented by CockroachPreparedStatement
- proxied by PreparedStatementRetryInterceptor, retries on write operations
java.sql.ResultSet
- implemented by CockroachResultSet
- proxied by ResultSetRetryInterceptor, retries on read operations

Retries are possible by recording most JDBC operations during an explicit transaction (autoCommit set to false). If a transaction is aborted due to a transient error it will be rolled back and the connection is closed. The recorded operations are then repeated on a new connection delegate while comparing the results against the initial transaction attempt.

If the results observed by the application client differ in any way (determined by SHA-256 checksums), the driver is forced to give up the retry attempt to preserve a serializable outcome towards the application, which is still blocked waiting for the operation to complete.

To illustrate:

try (Connection connection
         = DriverManager.getConnection("jdbc:cockroachdb://localhost:26257/jdbc_test?sslmode=disable")) {
  try (PreparedStatement ps = connection.prepareStatement("update table set x = ? where id = ?")) {
        ps.setObject(1, x);
        ps.setObject(2, y);
        ps.executeUpdate();
  }
}

In this example, assume the executeUpdate() method throws a SQLException with state code 40001. This exception is caught by the retry interceptor which will roll back and close the current connection, then repeat the recorded operations on a new connection delegate and hope for a different interleaving of other concurrent operations that allow for the transaction to complete.

From the perspective of the application, the executeUpdate() operation will block until this process is either successful or considered futile, in which case a separate SQLException is thrown with the same state code.
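The record-and-compare idea can be sketched without a database. In this simplified model each recorded operation is a supplier that returns its result as a string, and a replay succeeds only if every result hashes to the same SHA-256 checksum as in the first attempt. The class and method names are hypothetical and the real driver records JDBC calls rather than suppliers.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of recording operations and verifying a replay by checksum.
final class ReplaySketch {
    private final List<Supplier<String>> operations = new ArrayList<>();
    private final List<byte[]> checksums = new ArrayList<>();

    // Executes an operation and remembers both the operation and a checksum of its result.
    String record(Supplier<String> operation) {
        String result = operation.get();
        operations.add(operation);
        checksums.add(sha256(result));
        return result;
    }

    // Re-runs all recorded operations; returns false as soon as a result differs.
    boolean replay() {
        for (int i = 0; i < operations.size(); i++) {
            byte[] repeated = sha256(operations.get(i).get());
            if (!MessageDigest.isEqual(checksums.get(i), repeated)) {
                return false; // the application observed something else, so give up
            }
        }
        return true;
    }

    static byte[] sha256(String value) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every Java platform
        }
    }
}
```

Storing checksums instead of full results keeps the memory cost of recording low, at the price of only being able to say that something differed, not what.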

Limitations of driver-level retries

By contrast, when using application-level retries you typically need to write the retry logic yourself, something like the following:

int numCalls = 1;
do {
  try {
      // Must begin and commit/rollback transactions
      return businessService.someTransactionBoundaryOperation();
  } catch (SQLException sqlException) { // Catch r/w and commit time exceptions
      // 40001 is the only state code we are looking for in terms of safe retries
      if (PSQLState.SERIALIZATION_FAILURE.getState().equals(sqlException.getSQLState())) {
          // handle by logging and waiting with an exponentially increasing delay
      } else {
          throw sqlException; // Some other error, re-throw instantly
      }
  }
} while (numCalls++ < MAX_RETRY_ATTEMPTS); // count the attempt, otherwise the loop never ends
throw new IllegalStateException("Too many serialization conflicts, giving up");

This type of logic fits well into an AOP aspect with an around advice (or interceptor in JavaEE), weaving in between the caller and transaction boundary (typically a service facade, service activator, or web/API controller).
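The "exponentially increasing delay" part is often the piece left out of such loops, so here is a minimal self-contained sketch of a retry helper with capped exponential backoff and random jitter. All names and the delay parameters are hypothetical choices for illustration, not the Spring Data CockroachDB implementation.

```java
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical retry helper with capped exponential backoff and full jitter.
final class BackoffRetry {
    @FunctionalInterface
    interface SqlCallable<T> {
        T call() throws SQLException;
    }

    // Delay ceiling for a given attempt (1-based): base, 2*base, 4*base, ... capped at max.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long exponential = baseMillis << Math.min(attempt - 1, 20);
        return Math.min(exponential, maxMillis);
    }

    static <T> T retry(SqlCallable<T> operation, int maxAttempts, long baseMillis, long maxMillis)
            throws SQLException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                // The operation must begin and commit/rollback its own transaction
                return operation.call();
            } catch (SQLException e) {
                boolean retryable = "40001".equals(e.getSQLState());
                if (!retryable || attempt >= maxAttempts) {
                    throw e; // other errors and exhausted retries go back to the caller
                }
                long ceiling = delayMillis(attempt, baseMillis, maxMillis);
                // Sleep a random time up to the ceiling so competing clients spread out
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
            }
        }
    }
}
```

The jitter is there so that transactions which collided once do not all wake up and collide again at the same instant.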

Application-level retries always have a higher chance of success over driver-level because the application logic is applied in each repeat cycle. For example, if you are checking for a negative account balance in the app code, then it may cancel out additional writes based on the value read when the operation is repeated. Neither the JDBC driver nor the database has any visibility to the application logic, which means that a retry attempt can only succeed if all previously observed outcomes are identical to the new ones.

The practical use of driver-level retries is therefore more narrow for common read/write and write/read conflicts, in which case client-side retries are the preferred approach.

Implicit SELECT FOR UPDATE rewrites

The JDBC driver can optionally append a FOR UPDATE clause to qualified SELECT statements.

A SELECT query qualifies for a rewrite when:

It's not part of a read-only connection
There are no aggregate functions (max, min, avg, etc.)
There are no distinct or GROUP BY operators
There are no internal CockroachDB schema references
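The qualification rules above can be approximated with a naive, string-based check. This sketch is only meant to show the shape of the decision: the driver's actual parsing is not shown in this article, the class name is hypothetical, and the list of internal schema names in the pattern is an assumption.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical, regex-based approximation of the SELECT qualification rules.
final class SelectForUpdateSketch {
    private static final Pattern DISQUALIFIERS = Pattern.compile(
            "\\b(max|min|avg|sum|count)\\s*\\("          // aggregate functions
            + "|\\bdistinct\\b|\\bgroup\\s+by\\b"         // distinct or GROUP BY operators
            + "|\\b(crdb_internal|information_schema|pg_catalog|system)\\." // assumed internal schemas
            + "|\\bfor\\s+update\\b");                    // already has the clause

    // Returns the statement with FOR UPDATE appended when it qualifies, else unchanged.
    static String rewrite(String sql, boolean readOnlyConnection) {
        String normalized = sql.trim().toLowerCase(Locale.ROOT);
        boolean qualifies = !readOnlyConnection
                && normalized.startsWith("select")
                && !DISQUALIFIERS.matcher(normalized).find();
        return qualifies ? sql.trim() + " FOR UPDATE" : sql;
    }
}
```

A regex cannot tell keywords from string literals or handle trailing semicolons, so a production rewrite would need a real SQL tokenizer.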

A SELECT .. FOR UPDATE will lock the rows returned by a selection query such that other transactions trying to access those rows are forced to wait for the transaction that locked the rows to finish. These other transactions are effectively put into a queue based on when they tried to read the value of the locked rows.

Notice that this does not eliminate the chance of serialization conflicts (which can also be due to time uncertainty) but will greatly reduce it. Combined with driver-level retries, this can eliminate the need for app-level retry logic for some workloads.

The following example shows a write skew (G2-item) scenario which is prevented by CockroachDB serializable isolation:

T1	T2
begin;	begin;
select * from test where id in (1,2);
	select * from test where id in (1,2);
update test set value = 11 where id = 1;	(reads 10,20)
	update test set value = 21 where id = 2;
commit;
	commit; --- "ERROR: restart transaction.."

Running the same sequence with FOR UPDATE:

T1	T2
begin;	begin;
select * from test where id in (1,2) FOR UPDATE;
	select * from test where id in (1,2) FOR UPDATE;
	-- blocks on T1
update test set value = 11 where id = 1;
commit;	-- unblocked, reads 11,20
	update test set value = 21 where id = 2;
	commit;

The initial read in T1 locks the rows, and T2 is forced to wait for T1 to finish. Once T1 has committed, the read in T2 reflects the write of T1 rather than the values that existed at T2's initial read timestamp. The T2 read is effectively pushed into the future, with the desired effect that the operations result in a serializable transaction ordering, allowing both to commit.
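The queueing effect can be mimicked in plain Java with an in-process lock. This is only an analogy under simplifying assumptions (one lock standing in for the row locks, two threads standing in for T1 and T2); it does not talk to a database and says nothing about how CockroachDB implements locking.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

final class RowLockOrderingSketch {
    /** Returns {row 1 as read by T2, row 2 as read by T2, final row 1, final row 2}. */
    static int[] run() throws InterruptedException {
        final ReentrantLock rowLock = new ReentrantLock(); // stands in for the FOR UPDATE locks
        final int[] rows = {10, 20};                       // test.value for id 1 and id 2
        final int[] readByT2 = new int[2];
        final CountDownLatch t1HoldsLock = new CountDownLatch(1);

        Thread t1 = new Thread(() -> {
            rowLock.lock();              // select ... FOR UPDATE
            try {
                t1HoldsLock.countDown(); // let T2 proceed only after T1 owns the lock
                rows[0] = 11;            // update test set value = 11 where id = 1
            } finally {
                rowLock.unlock();        // commit
            }
        });
        Thread t2 = new Thread(() -> {
            try {
                t1HoldsLock.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            rowLock.lock();              // select ... FOR UPDATE, waits until T1 has committed
            try {
                readByT2[0] = rows[0];   // sees the write of T1
                readByT2[1] = rows[1];
                rows[1] = 21;            // update test set value = 21 where id = 2
            } finally {
                rowLock.unlock();        // commit
            }
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return new int[] {readByT2[0], readByT2[1], rows[0], rows[1]};
    }
}
```

Because T2 cannot read until T1 has released the lock, both "transactions" complete in a serial order without either one being aborted.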

Sequence Diagrams

The driver concepts are illustrated with sequence diagrams using https://www.websequencediagrams.com.

Happy Path

This diagram illustrates executing a single update with a happy outcome, equivalent to:

try (Connection connection
        = DriverManager.getConnection("jdbc:cockroachdb://localhost:26257/jdbc_test?sslmode=disable")) {
    try (PreparedStatement ps = connection.prepareStatement("update table set x = ? where id = ?")) {
        ps.setObject(1, x);
        ps.setObject(2, y);
        ps.executeUpdate();
    }
}

Unhappy Path

This diagram illustrates executing the same single block with an unhappy outcome, equivalent to:

Conclusion

This article discusses the design and implementation of a custom-made CockroachDB JDBC driver, which wraps the PostgreSQL JDBC driver and provides features such as internal retries for serialization conflicts and connection errors. The driver also uses driver-level retries and SELECT FOR UPDATE rewrites to reduce the chance of serialization conflicts in a transaction. Sequence diagrams are provided to illustrate the process.

CockroachDB JDBC Driver: Part I - A Beginner’s Guide

Kai Niemi — Wed, 05 Apr 2023 16:00:39 GMT

Introduction

This article describes the recently released open-source JDBC driver for CockroachDB. It wraps the PostgreSQL JDBC driver (pgjdbc), which in turn communicates with CockroachDB using the PostgreSQL wire protocol (v3.0).

Article series on the JDBC driver:

https://blog.cloudneutral.se/series/cockroachdb-jdbc-driver

Features

This JDBC driver adds certain features on top of pgJDBC that are relevant to CockroachDB.

Internal retries on serialization conflicts.
Internal retries on connection errors.
Rewriting qualified SQL queries to use SELECT FOR UPDATE to reduce serialization conflicts.
CockroachDB-specific database metadata and version info.

All these features are disabled by default, which means the driver is operating in a pass-through mode delegating all JDBC API invocations to the pgJDBC driver.

Enabling internal retries may reduce the need for application-level retry logic and thereby enhance compatibility with 3rd-party products that don't implement any transaction retries.

Enabling SELECT FOR UPDATE rewrites may reduce serialization conflicts from appearing in the first place and thereby reduce retries to a bare minimum or none at all, at the expense of imposing locks on every read operation.

SELECT FOR UPDATE rewrites can be scoped to connection level, where all qualified SELECT queries are rewritten, or to transaction level, where all qualified SELECT queries within a given transaction are rewritten.

For more information about client-side retry logic, see also:

Getting Started

Below is an example of creating a JDBC connection and executing a simple SELECT query in an implicit transaction (auto-commit):

try (Connection connection
        = DriverManager.getConnection("jdbc:cockroachdb://localhost:26257/jdbc_test?sslmode=disable")) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery("select version()")) {
            if (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

Next is an example of executing a SELECT and an UPDATE in an explicit transaction with FOR UPDATE rewrites:

try (Connection connection
        = DriverManager.getConnection("jdbc:cockroachdb://localhost:26257/jdbc_test?sslmode=disable")) {
    connection.setAutoCommit(false);
    try (Statement statement = connection.createStatement()) {
        statement.execute("SET implicitSelectForUpdate = true");
    }
    // Will be rewritten by the driver to include suffix "FOR UPDATE"
    try (PreparedStatement ps = connection.prepareStatement("select balance from account where id=?")) {
        ps.setLong(1, 100L);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                BigDecimal balance = rs.getBigDecimal(1); // check
                try (PreparedStatement ps2 = connection.prepareStatement("update account set balance = balance + ? where id=?")) {
                    ps2.setBigDecimal(1, new BigDecimal("10.50"));
                    ps2.setLong(2, 100L);
                    ps2.executeUpdate(); // check
                }
            }
        }
    }
    connection.commit();
}

Same as above where all qualified SELECTs are suffixed with FOR UPDATE:

try (Connection connection
        = DriverManager.getConnection("jdbc:cockroachdb://localhost:26257/jdbc_test?sslmode=disable&implicitSelectForUpdate=true")) {
    connection.setAutoCommit(false);
    ...
    connection.commit();
}

Maven configuration

Add this dependency to your pom.xml file:

<dependency>
    <groupId>io.cockroachdb.jdbc</groupId>
    <artifactId>cockroachdb-jdbc-driver</artifactId>
    <version>{version}</version>
</dependency>

Then add the Maven repository to your pom.xml file (alternatively in Maven's settings.xml):

<repository>
    <id>github</id>
    <name>Maven Packages</name>
    <url>https://maven.pkg.github.com/cloudneutral/cockroachdb-jdbc</url>
    <snapshots>
        <enabled>true</enabled>
    </snapshots>
</repository>

You need to authenticate to GitHub Packages by creating a personal access token (classic) that includes the read:packages scope. For more information, see Authenticating to GitHub Packages.

Add your personal access token to the servers section in your settings.xml:

<server>
    <id>github</id>
    <username>your-github-name</username>
    <password>your-access-token</password>
</server>

Take note that the server and repository ids must match (the id can be something other than github). Now you should be able to build your project with the JDBC driver as a dependency:

mvn clean install

Alternatively, you can just clone the repository and build it locally using mvn install.

Modules

The JDBC driver project is a multi-module Maven project with the following components:

cockroachdb-jdbc-driver

The main library for the CockroachDB JDBC driver. This is all you need, and it transitively pulls in pgJDBC and log4j as its only dependencies.

cockroachdb-jdbc-it

Integration tests and functional tests that are activated via Maven profiles.

cockroachdb-jdbc-demo

A standalone demo app to showcase the retry mechanism and other features.

Supported CockroachDB and JDK Versions

The driver is CockroachDB version agnostic and supports any version supported by the PostgreSQL JDBC driver v42.5+ (pgwire protocol v3.0). It's built for Java 8 at the language source and target level but requires Java 17 LTS for building.

URL Properties

The driver uses the jdbc:cockroachdb: JDBC URL prefix and supports all PostgreSQL URL properties on top of that. To configure a data source to use this driver, you typically configure it for PostgreSQL and only change the URL prefix and the driver class name.

The general format for a JDBC URL for connecting to a CockroachDB server:

jdbc:cockroachdb:[//host[:port]/][database][?property1=value1[&property2=value2]...]

See pgjdbc for all supported driver properties and the semantics.
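As a small illustration of the URL format, the sketch below assembles a JDBC URL from its parts. The JdbcUrlSketch class is a hypothetical helper written for this article, not part of the driver, and it assumes that property names and values are already URL-safe.

```java
import java.util.Map;
import java.util.StringJoiner;

final class JdbcUrlSketch {
    /** Builds jdbc:cockroachdb://host:port/database?name=value&... in map iteration order. */
    static String build(String host, int port, String database, Map<String, String> properties) {
        StringBuilder url = new StringBuilder("jdbc:cockroachdb://")
                .append(host).append(':').append(port)
                .append('/').append(database);
        if (!properties.isEmpty()) {
            // Properties are joined with '&' and introduced by a single '?'
            StringJoiner query = new StringJoiner("&", "?", "");
            properties.forEach((name, value) -> query.add(name + "=" + value));
            url.append(query);
        }
        return url.toString();
    }
}
```

Passing an ordered map with sslmode=disable and implicitSelectForUpdate=true for localhost, port 26257 and database jdbc_test yields the same URL as used in the Getting Started examples.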

In addition, this driver has the following CockroachDB-specific properties:

retryTransientErrors

(default: false)

The JDBC driver will automatically retry serialization failures (40001 state code) at read, write or commit time. This is done by keeping track of all statements and the results during a transaction, and if the transaction is aborted due to a transient 40001 error, it will roll back and retry the recorded operations on a new connection and compare the results with the initial commit attempt. If the results are different, the driver will be forced to give up the retry attempt to preserve a serializable outcome.

Enable this option if you want to handle aborted transactions internally in the driver, preferably combined with select-for-update locking. Leave this option disabled if you want to handle aborted transactions in your application.
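The record-and-compare idea can be sketched conceptually as follows. This is not the driver's implementation: operations are modelled as plain Supplier instances rather than JDBC statements, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Supplier;

final class ReplaySketch {
    /** First attempt: executes each operation once and records its result. */
    static List<Object> record(List<Supplier<Object>> operations) {
        List<Object> results = new ArrayList<>();
        for (Supplier<Object> operation : operations) {
            results.add(operation.get());
        }
        return results;
    }

    /**
     * Retry attempt: replays the operations in order and compares each result with
     * the recorded one. Returns false at the first difference, in which case the
     * retry has to be abandoned to preserve a serializable outcome.
     */
    static boolean replayMatches(List<Supplier<Object>> operations, List<Object> recorded) {
        for (int i = 0; i < operations.size(); i++) {
            Object replayed = operations.get(i).get();
            if (!Objects.equals(recorded.get(i), replayed)) {
                return false;
            }
        }
        return true;
    }
}
```

If a concurrent transaction changed a value between the first attempt and the replay, the comparison fails and the error has to surface to the application, which is why application-level retries remain the fallback.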

retryConnectionErrors

(default: false)

The CockroachDB JDBC driver will automatically retry transient connection errors with SQL state 08001, 08003, 08004, 08006, 08007, 08S01 or 57P01 at read, write or commit time.

Applicable only when retryTransientErrors is also true.

Disable this option if you want to handle connection errors in your own application or connection pool.

CAUTION! Retrying on errors other than serialization conflicts (i.e. anything but 40001) may produce duplicate outcomes if the SQL statements are non-idempotent. See the design notes for more details.
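How the two retry properties combine can be summarised in a few lines of Java. The sketch below only mirrors the rules stated above (40001 for serialization conflicts, the listed states for connection errors, and connection retries applying only when retryTransientErrors is also enabled); the class and method names are hypothetical.

```java
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class TransientErrorSketch {
    static final String SERIALIZATION_FAILURE = "40001";

    /** Connection error states listed for retryConnectionErrors. */
    static final Set<String> CONNECTION_ERROR_STATES = Collections.unmodifiableSet(new HashSet<>(
            Arrays.asList("08001", "08003", "08004", "08006", "08007", "08S01", "57P01")));

    static boolean isRetryable(SQLException e, boolean retryTransientErrors, boolean retryConnectionErrors) {
        if (!retryTransientErrors) {
            return false; // Nothing is retried unless retryTransientErrors is enabled
        }
        String state = e.getSQLState();
        if (SERIALIZATION_FAILURE.equals(state)) {
            return true;
        }
        return retryConnectionErrors && CONNECTION_ERROR_STATES.contains(state);
    }
}
```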

retryListenerClassName

(default: io.cockroachdb.jdbc.retry.LoggingRetryListener)

Name of the class that implements io.cockroachdb.jdbc.retry.RetryListener to be used to receive callback events when retries occur. One instance is created for each JDBC connection.

retryStrategyClassName

(default: io.cockroachdb.jdbc.retry.ExponentialBackoffRetryStrategy)

Name of the class that implements io.cockroachdb.jdbc.retry.RetryStrategy to be used when retryTransientErrors property is set to true. If this class also implements io.cockroachdb.jdbc.proxy.RetryListener it will receive callback events when retries happen. One instance of this class is created for each JDBC connection.

The default ExponentialBackoffRetryStrategy will use an exponentially increasing delay with jitter and a multiplier of 2 up to the limit set by retryMaxBackoffTime.
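To give a feel for what an exponentially increasing delay with jitter looks like, here is a standalone sketch. It does not implement the driver's RetryStrategy interface and is not the code of ExponentialBackoffRetryStrategy; the base delay, the "half fixed, half random" jitter and the per-delay cap are assumptions chosen for illustration, whereas the driver describes retryMaxBackoffTime as covering the total retry time.

```java
import java.util.Random;

final class BackoffSketch {
    /**
     * Returns a delay for the given attempt (1-based): the base delay is doubled for
     * every further attempt, capped at maxMillis, and then randomised so that the
     * result lies between half of the capped delay and the capped delay.
     */
    static long delayMillis(int attempt, long baseMillis, long maxMillis, Random random) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        // Multiplier of 2: base, 2*base, 4*base, ... (shift bounded to avoid overflow)
        long exponential = baseMillis << Math.min(attempt - 1, 30);
        long capped = Math.min(exponential, maxMillis);
        long half = capped / 2;
        // Jitter spreads out clients that would otherwise retry at the same instant
        return half + (long) (random.nextDouble() * (capped - half + 1));
    }
}
```

With a base of 100 ms and a cap of 30 s, the first delay falls between 50 and 100 ms and the tenth between 15 and 30 s.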

retryMaxAttempts

(default: 15)

The maximum number of retry attempts on transient failures (connection errors/serialization conflicts). If this limit is exceeded, the driver will throw a SQL exception with the same state code, signalling that it has given up on further retry attempts.

retryMaxBackoffTime

(default: 30s)

Maximum exponential backoff time in the format of a duration expression (like 12s). The duration applies to the total time for all retry attempts at transaction level.

Applicable only when retryTransientErrors is true.

implicitSelectForUpdate

(default: false)

The driver will automatically append a FOR UPDATE clause to all qualified SELECT statements within connection scope. This parameter can also be set in an explicit transaction as a session variable, in which case it is scoped to the transaction.

The qualifying requirements include:

Not used in a read-only connection
No time travel clause (as of system time)
No aggregate functions
No group by or distinct operators
Not referencing internal table schema
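The qualifying requirements can be approximated with a naive, string-based check. The sketch below is a rough illustration only: it is not the driver's parser, the regular expressions are simplistic, and the set of schemas treated as internal (crdb_internal, pg_catalog, information_schema) is an assumption.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class SelectForUpdateSketch {
    private static final Pattern TIME_TRAVEL = Pattern.compile("\\bas\\s+of\\s+system\\s+time\\b");
    private static final Pattern AGGREGATE = Pattern.compile("\\b(max|min|avg|sum|count)\\s*\\(");
    private static final Pattern GROUPING = Pattern.compile("\\b(distinct|group\\s+by)\\b");
    private static final Pattern INTERNAL_SCHEMA =
            Pattern.compile("\\b(crdb_internal|pg_catalog|information_schema)\\.");

    /** Trims whitespace and a single trailing semicolon. */
    private static String normalize(String sql) {
        String trimmed = sql.trim();
        return trimmed.endsWith(";") ? trimmed.substring(0, trimmed.length() - 1).trim() : trimmed;
    }

    /** Applies the qualifying requirements to a single SQL statement. */
    static boolean qualifies(String sql, boolean readOnlyConnection) {
        String s = normalize(sql).toLowerCase(Locale.ROOT);
        return !readOnlyConnection
                && s.startsWith("select")
                && !s.endsWith("for update")
                && !TIME_TRAVEL.matcher(s).find()
                && !AGGREGATE.matcher(s).find()
                && !GROUPING.matcher(s).find()
                && !INTERNAL_SCHEMA.matcher(s).find();
    }

    /** Appends FOR UPDATE to qualified statements and returns all others unchanged. */
    static String rewrite(String sql, boolean readOnlyConnection) {
        return qualifies(sql, readOnlyConnection) ? normalize(sql) + " FOR UPDATE" : sql;
    }
}
```

A real implementation would need to tokenise the SQL properly, since keyword matching like this can be fooled by string literals and identifiers.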

useCockroachMetadata

(default: false)

By default, the driver will use PostgreSQL JDBC driver metadata provided in java.sql.DatabaseMetaData rather than CockroachDB-specific metadata. While the latter is more correct, it causes incompatibilities with libraries that bind to PostgreSQL version details, such as Flyway and other tools.

Logging

This driver uses SLF4J for logging which means it's agnostic to the logging framework used by the application. The JDBC driver module does not include any logging framework dependency transitively.

Additional Examples

Plain Java Example

Class.forName(CockroachDriver.class.getName());
try (Connection connection
        = DriverManager.getConnection("jdbc:cockroachdb://localhost:26257/jdbc_test?sslmode=disable&implicitSelectForUpdate=true&retryTransientErrors=true")) {
    try (Statement statement = connection.createStatement()) {
        try (ResultSet rs = statement.executeQuery("select version()")) {
            if (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

Spring Boot Example

Configure the datasource in src/main/resources/application.yml:

spring:
  datasource:
    driver-class-name: io.cockroachdb.jdbc.CockroachDriver
    url: "jdbc:cockroachdb://localhost:26257/jdbc_test?sslmode=disable&application_name=MyTestAppe&implicitSelectForUpdate=true&retryTransientErrors=true"
    username: root
    password:

Optionally, configure the data source programmatically and use the ttddyy datasource-proxy logging proxy:

@Bean
@Primary
public DataSource dataSource() {
    return ProxyDataSourceBuilder
            .create(hikariDataSource())
            .traceMethods()
            .logQueryBySlf4j(SLF4JLogLevel.DEBUG, "io.cockroachdb.jdbc")
            .asJson()
            .multiline()
            .build();
}

@Bean
@ConfigurationProperties("spring.datasource.hikari")
public HikariDataSource hikariDataSource() {
    HikariDataSource ds = dataSourceProperties()
            .initializeDataSourceBuilder()
            .type(HikariDataSource.class)
            .build();
    ds.setAutoCommit(false);
    ds.addDataSourceProperty(PGProperty.REWRITE_BATCHED_INSERTS.getName(), "true");
    ds.addDataSourceProperty(CockroachProperty.IMPLICIT_SELECT_FOR_UPDATE.getName(), "true");
    ds.addDataSourceProperty(CockroachProperty.RETRY_TRANSIENT_ERRORS.getName(), "true");
    ds.addDataSourceProperty(CockroachProperty.RETRY_MAX_ATTEMPTS.getName(), "5");
    ds.addDataSourceProperty(CockroachProperty.RETRY_MAX_BACKOFF_TIME.getName(), "10000");
    return ds;
}

Configure src/main/resources/logback-spring.xml to capture all SQL statements and JDBC API calls:

<configuration>
    <include resource="org/springframework/boot/logging/logback/defaults.xml"/>
    <include resource="org/springframework/boot/logging/logback/console-appender.xml" />
    <logger name="org.springframework" level="INFO"/>
    <logger name="io.cockroachdb.jdbc" level="DEBUG"/>
    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
    </root>
</configuration>

Building

Building the CockroachDB JDBC driver requires a Java 17 (or later) LTS JDK, but the driver is cross-compiled to run on any platform for which there is a Java 8 runtime.

Prerequisites

JDK17+ LTS for building (OpenJDK compatible)
Maven 3+ (optional, embedded)

If you want to build with the regular mvn command, you will need Maven v3.x or above.

Install the JDK (Linux):

sudo apt-get -qq install -y openjdk-17-jdk

Install the JDK (macOS):

brew install openjdk@17

Clone the project

git clone git@github.com:cloudneutral/cockroachdb-jdbc.git
cd cockroachdb-jdbc

Build the project

chmod +x mvnw
./mvnw clean install

The JDBC driver jar is now found in cockroachdb-jdbc-driver/target.

Run Integration Tests

The integration tests will run through a series of contended workloads to exercise the retry mechanism and other driver features.

First, start a local CockroachDB node or cluster.

Create the database:

cockroach sql --insecure --host=localhost -e "CREATE database jdbc_test"

Then activate the integration test Maven profile:

./mvnw -P it -Dgroups=anomaly-test clean install

Test groups include:

anomaly-test - Runs through a series of RW/WR/WW anomaly tests.
connection-retry-test - Run a test with connection retries enabled.
batch-insert-test - Batch inserts load test.
batch-update-test - Batch updates load test.

See the pom.xml file for changing the database URL and other settings (under the it profile).

Summary

This article provides instructions on how to configure and build a new open-source JDBC driver for CockroachDB. It covers parameters such as retryTransientErrors and implicitSelectForUpdate, which reduce transient SQL exceptions on contended workloads. It also explains the configuration of the driver, including retry strategies, URL properties, and logging settings, as well as examples of how to configure the driver in plain Java and Spring Boot.

CockroachDB and JDBI: A Practical Example

Kai Niemi — Tue, 04 Apr 2023 16:00:39 GMT

In this article, we're taking a look at JDBI as an alternative to JDBC to access CockroachDB.

JDBI is not an ORM but a simple abstraction on top of JDBC that depends heavily on reflection and lambda expressions to provide a better developer experience. JDBC has been around for decades but, in contrast, is still a very verbose API to use.

Source Code

The source code for the examples of this article can be found on GitHub.

Introduction

The JDBI example is part of a project called Roach Data which showcases different data access frameworks and ORMs for the Java platform.

The purpose of this project is to showcase how CockroachDB can be used with a mainstream Java stack composed of Spring Boot and some of the available Spring Data modules, or data access frameworks.

It provides examples of the following:

JDBC - using Spring Data JDBC
JPA - using Spring Data JPA with Hibernate as ORM provider
jOOQ - using Spring Boot with jOOQ
MyBatis - using Spring Data MyBatis/JDBC
Reactive - using Spring Data r2dbc with the reactive PSQL driver
(new) JDBI - using JDBI with the PSQL driver

The demos are independent and use a similar schema and test workload.

JDBI Setup

To get started, add the Maven dependency:

<dependency>
    <groupId>org.jdbi</groupId>
    <artifactId>jdbi3-core</artifactId>
    <version>3.37.1</version>
</dependency>

Connecting to the database

Connecting is as simple as:

Jdbi jdbi = Jdbi.create("jdbc:postgresql://localhost:26257/roach_data?sslmode=disable", "root", "");

We will however use a HikariCP datasource, wrapped in a datasource logging proxy. This logging proxy will then (through interceptors) log all SQL operations, parameter binding values, batched or not batched and so on.

HikariDataSource hikariDS = new HikariDataSource();
hikariDS.setJdbcUrl("jdbc:postgresql://localhost:26257/roach_data?sslmode=disable");
hikariDS.setUsername("root");
hikariDS.setMaximumPoolSize(32);
hikariDS.setMinimumIdle(32);
hikariDS.setAutoCommit(true);

DataSource ds = ProxyDataSourceBuilder
        .create(hikariDS)
        .asJson()
        .logQueryBySlf4j(SLF4JLogLevel.TRACE, "io.roach.SQL_TRACE")
        .multiline()
        .build();

Jdbi jdbi = Jdbi.create(ds);

Using Handles

JDBI uses handles which represent JDBC connections. By using lambda expressions we don't need to care about closing the resources.

private static List<String> readAccountNames(Handle handle) {
    Query query = handle.createQuery("SELECT name FROM account");
    return query.mapTo(String.class).collect(Collectors.toList());
}

List<String> names = jdbi.withHandle(JdbiApplication::readAccountNames);

Querying

Querying for information and mapping the results to single values or value objects is simple. Here's another example of a point lookup with parameter binding:

private static BigDecimal readBalance(Handle handle, String name) {
    Query query = handle.createQuery("SELECT balance FROM account WHERE name = ?");
    query.bind(0, name);
    return query.mapTo(BigDecimal.class).findOne()
            .orElseThrow(() -> new BusinessException("Account not found: " + name));
}

It's worth mentioning that the parameter binding starts at index 0 and not 1 as in JDBC.

Updating

Updating data is equally straightforward by using handles:

private static void updateBalance(Handle handle, String name, BigDecimal balance) {
    Update update = handle.createUpdate("UPDATE account SET balance = ?, updated=clock_timestamp() where name = ?");
    update.bind(0, balance);
    update.bind(1, name);
    if (update.execute() != 1) {
        throw new DataAccessException("Rows affected != 1  for " + name);
    }
}

Transactions

For transactions, we are going to use the SerializableTransactionRunner that will retry on transient SQL exceptions with state code 40001. There's a special inTransaction method in the handler for this purpose.

Jdbi jdbi = Jdbi.create(ds);
jdbi.setTransactionHandler(new SerializableTransactionRunner());

In the next example, we are both reading and writing in an explicit transaction. If there is a serialization conflict, the transaction will be rolled back and retried. The SerializableTransactionRunner is fairly simple however and doesn't do any exponential backoffs.

private static BigDecimal transfer(DataSource ds, List<Account> legs) {
    Jdbi jdbi = Jdbi.create(ds);
    jdbi.setTransactionHandler(new SerializableTransactionRunner());
    return jdbi.inTransaction(TransactionIsolationLevel.SERIALIZABLE, transactionHandle -> {
        BigDecimal total = BigDecimal.ZERO;
        BigDecimal checksum = BigDecimal.ZERO;
        for (Account leg : legs) {
            BigDecimal balance = readBalance(transactionHandle, leg.name);
            updateBalance(transactionHandle, leg.name, balance.add(leg.amount));
            checksum = checksum.add(leg.amount);
            total = total.add(leg.amount.abs());
        }
        if (checksum.compareTo(BigDecimal.ZERO) != 0) {
            throw new BusinessException(
                    "Sum of account legs must equal 0 (got " + checksum.toPlainString() + ")"
            );
        }
        return total;
    });
}

That's it for this very brief tutorial. There's a lot more stuff you can do with JDBI, so check out their website. From a CockroachDB standpoint, however, it's not much different to use JDBI than JDBC directly.

Conclusion

This article looks at JDBI as an alternative to JDBC for accessing CockroachDB. The example is part of Roach Data which is a project that provides examples of JDBC, JPA, jOOQ, MyBatis, Reactive, and JDBI. This article demonstrates how to use JDBI to read and write data in an explicit transaction with SerializableTransactionRunner.

Introduction to Roach Data

Kai Niemi — Mon, 03 Apr 2023 17:00:39 GMT

Roach Data is a collection of small demos using different Java data access frameworks and ORMs with CockroachDB.

The demo projects include:

JDBC - using Spring Data JDBC which is just a simpler wrapper around JDBC
JDBC (plain) - using plain JDBC without Spring Data JDBC
JPA - using Spring Data JPA with Hibernate as ORM provider
JPA Orders - using Spring Data JPA to model a very simple purchase order system
jOOQ - using Spring Boot with jOOQ (which is not officially supported by spring-data)
MyBatis - using Spring Data MyBatis/JDBC
JSON - using Spring Data JPA and JSONB types with inverted indexes
Reactive - using Spring Data r2dbc with the reactive PSQL driver

The demos are independent but use a similar schema and test workload, except for the Orders and JSON demos.

The demos cover the following concepts/features:

Liquibase Schema versioning
Connection Pooling via HikariCP
Executable jar with embedded Jetty container
Pagination via Spring Data JPA
Transaction retries with exponential backoffs

Source Code

The source code for the project can be found on GitHub.

Project Setup

The project is packaged as a single executable JAR file and runs on any platform for which there is a Java 8+ runtime.

Prerequisites

CockroachDB Core (does not require a trial enterprise license)
Linux / macOS
JDK8+ with 1.8 language level (OpenJDK compatible)
Maven 3+ (optional, embedded wrapper available)
- https://maven.apache.org/

Setup CockroachDB

Create a local cluster of at least three nodes:

cockroach start --port=26257 --http-port=8080 --advertise-addr=localhost:26257 --join=localhost:26257 --insecure --store=datafiles/n1 --background
cockroach start --port=26258 --http-port=8081 --advertise-addr=localhost:26258 --join=localhost:26257 --insecure --store=datafiles/n2 --background
cockroach start --port=26259 --http-port=8082 --advertise-addr=localhost:26259 --join=localhost:26257 --insecure --store=datafiles/n3 --background
cockroach init --insecure --host=localhost:26257

Next, set up a database called roach_data:

cockroach sql --insecure --host=localhost:26257 -e "CREATE database roach_data"

Setup the Demos

Install the JDK

Install the JDK (Ubuntu example):

sudo apt-get install openjdk-8-jdk

Confirm the installation by running:

java -version

Clone the project

git clone git@github.com:cockroachlabs/roach-data.git
cd roach-data

Build the executable jars

chmod +x mvnw
./mvnw clean install

Running

Most demos will do the same thing which is to run through a series of concurrent account transfer requests. The requests are intentionally submitted in a way that will cause contention (by overlapping reads and writes on the same keys) in the database and trigger transaction aborts and retries.

By default, the contention level is zero (effectively serial execution) so you won't see any transient errors. To observe serialization conflict errors, pass a number (>1) to the command line representing the thread count. Then you should start seeing transaction conflicts and retries until the demo settles with an end message:

"All client workers finished but the server keeps running. Have a nice day!"

The service remains running after the test is complete and can be accessed via: http://localhost:9090.

JDBC demo

The JDBC demo uses Spring Data JDBC.

In this example, we are passing a custom JDBC URL to connect to a specific IP (default is localhost).

java -jar roach-data-jdbc/target/roach-data-jdbc.jar --spring.datasource.url=jdbc:postgresql://192.168.1.66:26257/roach_data?sslmode=disable

To run with higher contention/retries:

java -jar roach-data-jdbc/target/roach-data-jdbc.jar --concurrency=8

JPA demo

The JPA demo uses Spring Data JPA with Hibernate. It's run in the same way as the previous JDBC demo.

java -jar roach-data-jpa/target/roach-data-jpa.jar

JPA Orders Demo

The JPA Orders demo also uses Spring Data JPA with Hibernate but deviates a bit and models purchase orders.

java -jar roach-data-jpa-orders/target/roach-data-jpa-orders.jar

jOOQ demo

The jOOQ demo does not use Spring Data (unsupported) but the Spring Boot jOOQ starter.

java -jar roach-data-jooq/target/roach-data-jooq.jar

MyBatis demo

The MyBatis demo uses the MyBatis data access strategy on top of Spring Data JDBC.

java -jar roach-data-mybatis/target/roach-data-mybatis.jar

Reactive demo

The reactive demo uses the Spring Boot r2dbc starter.

java -jar roach-data-reactive/target/roach-data-reactive.jar

Future Work

The next addition to the framework collection will likely be JDBI, yet another JDBC wrapper as an alternative to more advanced ORMs.

These demos are based on Spring Boot 2.7 and Java 8. Spring Boot 3 and Spring Data 3 are gaining momentum, so it may be time to migrate to the latest Spring Boot 3 version and Java 17 LTS.

Conclusion

Roach Data is a collection of small Spring Boot demos that demonstrate how CockroachDB can be used with a mainstream Java framework stack. The demos run through a series of concurrent account transfer requests.

Effective Strategies for Planning and Executing Data Migration Projects

Kai Niemi — Sun, 02 Apr 2023 11:00:39 GMT

Introduction

In this article, we'll look at two common strategies for both data and service migrations: Lift and Shift vs Strangler. Which one works best depends on many factors often guided by business requirements.

Some typical migration requirements and objectives include:

Zero planned downtime
- The migration process must not require any planned downtime windows that would disrupt production traffic.
Option to pause and resume
- Migration processes can run for a long time and cannot interfere with other priority tasks.
- Team focus can shift and migration must be put on hold
Option to go back and abort up to a given point
- Things fail so having the option to revert and cancel the process until passing a confidence level may be needed.
Ability to test and verify up to a given point
- To reach a confidence level that things will go according to plan.

Migration projects are complicated and involve risk assessments and mitigation steps. What could go wrong, in which steps, what's the business impact and what are the actions to take if it happens?

Risks can include:

Data loss, data corruption or duplication
Performance degradation
Reduced feature velocity during migration

Risk mitigations include:

Well-defined goals and measurable success
Rollback option until the point of no return
Repeatability of migration steps
Testability of migration steps
Observability of the process

All these requirements and goals guide towards either a lift and shift approach or a strangler approach.

Lift and Shift
- Take a snapshot of the old system and load it into the new system
- Planned downtime is needed
Strangler
- Gradually migrate data and business components until the old system is drained and decommissioned
- Very limited downtime if any

Let's break these two approaches down into pros/cons.

Lift and Shift

This approach could be composed of the following rough phases:

Phases

Take a snapshot of the primary DB, bulk load to replacement DB
Write to both primary and replacement DB via the change stream
Switch all reads to the replacement DB but keep writing to both
Switch writes to the replacement DB, turn off the change stream
- At this point, only rolling forward is possible.
Decommission primary
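The phases can be written down as a small table in code, which makes it explicit where reads and writes go at each step and when rolling back stops being an option. The enum below is an illustrative model only; the names are made up, and treating the first three phases as reversible is an interpretation of the "only rolling forward" note above.

```java
import java.util.EnumSet;
import java.util.Set;

final class LiftAndShiftSketch {
    enum Database { PRIMARY, REPLACEMENT }

    enum Phase {
        SNAPSHOT_AND_LOAD(Database.PRIMARY, EnumSet.of(Database.PRIMARY), true),
        DUAL_WRITE(Database.PRIMARY, EnumSet.allOf(Database.class), true),
        READS_SWITCHED(Database.REPLACEMENT, EnumSet.allOf(Database.class), true),
        WRITES_SWITCHED(Database.REPLACEMENT, EnumSet.of(Database.REPLACEMENT), false),
        PRIMARY_DECOMMISSIONED(Database.REPLACEMENT, EnumSet.of(Database.REPLACEMENT), false);

        final Database readsFrom;       // Database serving reads in this phase
        final Set<Database> writesTo;   // Databases receiving writes (directly or via the change stream)
        final boolean rollbackPossible; // False once only rolling forward is possible

        Phase(Database readsFrom, Set<Database> writesTo, boolean rollbackPossible) {
            this.readsFrom = readsFrom;
            this.writesTo = writesTo;
            this.rollbackPossible = rollbackPossible;
        }

        /** Returns the following phase, or this phase if it is the last one. */
        Phase next() {
            return this == PRIMARY_DECOMMISSIONED ? this : values()[ordinal() + 1];
        }
    }
}
```

A runbook could consult such a model before each step to confirm, for example, that a rollback is still possible before writes are switched over.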

Pros

Reduced migration project timeline (a bit simpler)
Good tooling available (dump/export/import)
One-off, completed over a short period

Cons

Downtime required at snapshot/load time (step 1)
Less control of the pace (has to be completed)
Higher risk of things going wrong

Strangler

The strangler fig metaphor was originally coined by Martin Fowler. It refers to the strangler fig tree, whose seeds germinate in the branches of a host tree and send roots down to the ground. These eventually take root in the soil and grow into a new tree, while the old one is strangled to death and left to decay.

The parallel in software is to have the new system initially supported by and wrapping the existing system, gradually taking over.

The strangler approach is a typical architecture pattern for larger system rewrites as well as migrations. Sometimes these efforts include refactoring stored procedure logic out of the database and moving it to the application tier.

Example scenarios:

One monolithic system to another (refactoring/redesign)
One monolithic system decomposed into multiple microservices (rewrite)
Externalizing functionality to foreign systems
Migrating data and mechanisms, such as stored procedures

This strangler approach to data migration can be outlined in the following phases:

Phases

Route traffic for migrated data to a replacement DB through a proxy/gateway
Initiate a per-customer or market migration through a change feed trigger
Channel back to primary DB via change feed, signalling completion
Eventually decommission primary

Pros

No planned downtime windows needed
Reduced risk by more control of the pace

Cons

More complicated, more components
Takes a longer time to complete

There's much more in the details of course, but one important distinction from lift-and-shift is that there are two separate instances of the service running. The new one can also be implemented using a different, more modern tech stack while still preserving all external contracts. The gateway mechanism can be external or embedded into both components to reduce network traffic.
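The routing decision made by the proxy/gateway can be sketched in a few lines. This is a simplified, in-memory illustration with hypothetical names: it assumes migration happens per customer and that the gateway is told (for example by a change feed consumer) when a customer's data has been moved.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class StranglerGatewaySketch {
    enum Target { PRIMARY, REPLACEMENT }

    // Customers whose data has been fully migrated to the replacement DB
    private final Set<String> migratedCustomers = ConcurrentHashMap.newKeySet();

    /** Called when the change feed signals that a customer's migration has completed. */
    void markMigrated(String customerId) {
        migratedCustomers.add(customerId);
    }

    /** Routes traffic to the replacement system only for customers already migrated. */
    Target route(String customerId) {
        return migratedCustomers.contains(customerId) ? Target.REPLACEMENT : Target.PRIMARY;
    }

    /** The primary is drained once every known customer has been migrated. */
    boolean primaryDrained(Set<String> allCustomers) {
        return migratedCustomers.containsAll(allCustomers);
    }
}
```

Because the routing table changes one customer at a time, the migration can be paused at any point simply by not marking further customers as migrated.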

Application Migrations

Often both the application codebase and the database need to be migrated simultaneously. For the application tier, there are a few different approaches with different systems and business impacts.

Redesign
- Create a new project that includes all key features and alters external properties.
Rewrite
- Major refactoring and new features at the same time without altering existing external properties.
Refactor
- Improve a software system's internal structure without altering external properties (mainly quality attributes or non-functional requirements).

Summary

Method	Description	System Impact	Business Impact
Redesign	Complete redesign and implementation	Internal and external properties change. The system is not in an operational state.	New features are paused.
Rewrite	Reimplementation of existing functionality	Mainly internal properties change. The system is not in an operational state.	Larger features are paused.
Refactor	Larger and smaller incremental improvements	Mainly internal properties change. System in an operational state.	Allows for new features.

Strangler Principles

How do we go about strangling a system? The first step is to identify an isolated part of the system. The next is to implement that part in a new service while improving/evolving it. It's still not used or available for traffic, which allows off-to-the-side incremental development of this section without interfering with the primary system. The last step is to redirect the calls to the new service while leaving the old one in place, since it's not worth the effort of decommissioning. This works quite well if the functional areas are well-isolated.

Functionality is however often entangled, so when moving one piece of functionality it may bring these dependencies with it. To avoid that, the moved functionality can make use of downstream functionality in the old system through an API. That way, the yet-to-be-migrated functionality is partly used while maintaining a controlled and incremental approach to moving things over.
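As a rough illustration of that delegation, the sketch below shows migrated functionality in a new service calling back into the old system through an API. All names (`LegacyPricingApi`, `NewOrderService`) are hypothetical and only serve the example.

```java
// Hypothetical names throughout; illustrates the delegation described above.
class StranglerDelegation {
    // API exposed by the old system for functionality that has not been moved yet.
    interface LegacyPricingApi {
        long priceInCents(String sku);
    }

    // Migrated functionality living in the new service.
    static class NewOrderService {
        private final LegacyPricingApi legacyPricing;

        NewOrderService(LegacyPricingApi legacyPricing) {
            this.legacyPricing = legacyPricing;
        }

        // Order totals are computed in the new service; pricing still comes from the old system.
        long orderTotalInCents(String sku, int quantity) {
            return legacyPricing.priceInCents(sku) * quantity;
        }
    }

    public static void main(String[] args) {
        // A stub stands in for the remote call to the old system.
        NewOrderService service = new NewOrderService(sku -> 250L);
        System.out.println(service.orderTotalInCents("sku-1", 4)); // 1000
    }
}
```

When pricing is eventually migrated too, only the implementation behind the interface needs to change.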

Strangling Stored Procedures

Applying the strangler approach to stored procedures follows the same architectural pattern. The business logic is rewritten in a higher-level language in the application tier of the new service.

Diagram showing the combination of both application refactoring/rewrites:

Conclusion

In this article, we looked at two classical approaches for data and service migration projects. One is lift-and-shift, where things are more or less copied over with some planned downtime. The other is a strangler approach where systems run for a longer period in parallel while the old one is gradually strangled by moving data and functions to the new platform. Both approaches have pros/cons which need to be put in a business context to make sense.

Leveraging Spring State Machine with CockroachDB

Kai Niemi — Fri, 31 Mar 2023 14:49:16 GMT

This article focuses on a practical use case example for Spring State Machine together with CockroachDB for persistent state storage.

Introduction

A state machine, also called a finite-state machine (FSM) or finite-state automaton, is a mathematical model of computation used to build an abstract machine with its roots way back in the 1940s. This abstract machine can be in exactly one of a finite number of states at any given time. Each state represents the status of the system that can move to another state through events or signals, called transitions. You interact with the state machine by sending events, listening to actions, or requesting the current state. You can progress through a workflow by sending events, making it a good fit for reactive, event-driven architectures.

This model is elegant and powerful since the behaviour of a system becomes more precise, consistent and readable. It helps to model certain types of complex business logic around the notion of system states and actions.

Use Cases

Typical use cases for state machines are event-driven applications where behaviour changes based on known business events, such as order fulfilment, payment workflows, logistics and loyalty systems.

Gaming

They're also commonly used in game engines to tailor certain types of player vs AI behaviour. Modelling behaviour by using states with transition pre- and post-conditions avoids many of the if, then, else and switch flow control structures you would otherwise need, which can quickly form an unreadable ball of mud.

(Below) A visual example of a danger assessment FSM prioritizing an event queue:

(Below) Another example of an FSM for closer AI threat assessment:

BPMs

Business process management (BPM) engines are also a type of state machine engine with a fairly high level of sophistication and tooling. They can be a good fit for very complex and long-running business processes but come with a high cost of complexity and the need for specialists.

Spring Statemachine is a lightweight alternative with a much smaller footprint but it still has all the fundamentals you would need from a state machine engine.

Terminology

States - The specific states of the state machine that are finite and predetermined.
Events - Something that happens in the system that can cause a state change.
Actions - Side-effects in reaction to events fired, which can be calling a method, invoking a foreign API, writing to a database and so on.
Transitions - Type of action which changes state.
Guards - Pre-conditions as boolean predicates to control transitions.
Extended State - Application state that is separate from the state machine, like variables or computed values.
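To make these terms concrete, here is a small plain-Java sketch of a turnstile. It deliberately does not use the Spring Statemachine API, and the class name, the fare threshold and the variable keys are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of the terms above; not the Spring Statemachine API.
class Turnstile {
    enum State { LOCKED, UNLOCKED }   // States: finite and predetermined
    enum Event { COIN, PUSH }         // Events: things that may cause a state change

    private State state = State.LOCKED;
    // Extended state: variables kept alongside the machine, not part of the state itself.
    private final Map<String, Integer> extendedState = new HashMap<>();

    State getState() { return state; }

    int get(String key) { return extendedState.getOrDefault(key, 0); }

    // Sends an event; returns true if a transition was taken.
    boolean send(Event event, int coinValue) {
        if (state == State.LOCKED && event == Event.COIN) {
            // Guard: boolean pre-condition for the LOCKED -> UNLOCKED transition.
            if (coinValue < 2) {
                return false;
            }
            // Action: side effect performed as part of the transition.
            extendedState.merge("coins", coinValue, Integer::sum);
            state = State.UNLOCKED;   // Transition: the state change itself
            return true;
        }
        if (state == State.UNLOCKED && event == Event.PUSH) {
            extendedState.merge("entries", 1, Integer::sum);
            state = State.LOCKED;
            return true;
        }
        return false; // Event not accepted in the current state
    }

    public static void main(String[] args) {
        Turnstile t = new Turnstile();
        System.out.println(t.send(Event.PUSH, 0)); // false
        System.out.println(t.send(Event.COIN, 2)); // true
        System.out.println(t.getState());          // UNLOCKED
    }
}
```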

Creating a Payments State Machine

In this practical example, we are going to model a typical "two-phase" credit card payment workflow and use CockroachDB to store the current state of the payment as it progresses through this flow.

Spring Statemachine (SSM) is modelled around describing states and events using Java enumerations, so let's begin there. In our system, a credit card payment will have the following states:

public enum PaymentState {
    CREATED("Initial state"),
    AUTHORIZED("Charge approved by processor"),
    AUTH_ERROR("Charge declined by processor"),
    ABORTED("Payment aborted before auth"),
    CANCELLED("Payment cancelled before capture"),
    CAPTURED("Payment verified and settled by processor"),
    CAPTURE_ERROR("Authorized charge declined by processor"),
    REVERSED("Captured payment refunded"),
    REVERSE_ERROR("Captured reversal failed");

    String note;

    PaymentState(String note) {
        this.note = note;
    }

    public String getNote() {
        return note;
    }
}

Our payment SM has the following events:

public enum PaymentEvent {
    ABORT("Abort payment"),
    AUTHORIZE("Contact processor for charge authorization"),
    AUTH_APPROVED("Processor approved charge"),
    AUTH_DECLINED("Processor rejected charge"),
    CANCEL("Approved charge cancellation"),
    CAPTURE("Authorized amount settlement"),
    CAPTURE_SUCCESS("Capture approved"),
    CAPTURE_FAILED("Capture failed"),
    REVERSE("Captured amount reversal"),
    REVERSE_SUCCESS("Refund successful"),
    REVERSE_FAILED("Refund failure");

    String note;

    PaymentEvent(String note) {
        this.note = note;
    }

    public String getNote() {
        return note;
    }
}

In summary, to visualize all of this in a state diagram:

This workflow represents a typical two-phase payment which is a common payment type used for card payments, mobile payments and invoice payments. It's performed in two steps (hence the name) - an authorization that reserves the payer's funds and then a capture of those funds which is a form of settlement.

As the state diagram illustrates, you can abort a new payment before it's authorized and cancel it before it's captured. Capture is a term used to finally charge the payer's card or for the payment to be billed by invoice. After the funds are captured, a reversal can be done to return funds to the payer.

Other payment types omit the capture stage and directly settle funds at the point of authorization, called a one-phase payment.
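For readers without the diagram at hand, the two-phase flow can also be written down as a transition table. The sketch below is plain Java rather than Spring Statemachine; it mirrors the transitions configured later in this article, and the class and method names (`PaymentFlow`, `next`) are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Plain-Java transition table for the two-phase payment flow (illustrative only).
class PaymentFlow {
    enum S { CREATED, AUTHORIZED, AUTH_ERROR, ABORTED, CANCELLED,
             CAPTURED, CAPTURE_ERROR, REVERSED, REVERSE_ERROR }
    enum E { ABORT, AUTHORIZE, AUTH_APPROVED, AUTH_DECLINED, CANCEL, CAPTURE,
             CAPTURE_SUCCESS, CAPTURE_FAILED, REVERSE, REVERSE_SUCCESS, REVERSE_FAILED }

    private static final Map<S, Map<E, S>> TABLE = new EnumMap<>(S.class);

    private static void add(S from, E on, S to) {
        TABLE.computeIfAbsent(from, k -> new EnumMap<E, S>(E.class)).put(on, to);
    }

    static {
        // Phase 1: authorization
        add(S.CREATED, E.AUTHORIZE, S.CREATED);          // self-transition that runs the auth action
        add(S.CREATED, E.AUTH_APPROVED, S.AUTHORIZED);
        add(S.CREATED, E.AUTH_DECLINED, S.AUTH_ERROR);
        add(S.CREATED, E.ABORT, S.ABORTED);
        // Phase 2: capture
        add(S.AUTHORIZED, E.CAPTURE, S.AUTHORIZED);      // self-transition that runs the capture action
        add(S.AUTHORIZED, E.CAPTURE_SUCCESS, S.CAPTURED);
        add(S.AUTHORIZED, E.CAPTURE_FAILED, S.CAPTURE_ERROR);
        add(S.AUTHORIZED, E.CANCEL, S.CANCELLED);
        // Reversal after capture
        add(S.CAPTURED, E.REVERSE, S.CAPTURED);
        add(S.CAPTURED, E.REVERSE_SUCCESS, S.REVERSED);
        add(S.CAPTURED, E.REVERSE_FAILED, S.REVERSE_ERROR);
    }

    // Returns the target state, or empty if the event is not accepted in this state.
    static Optional<S> next(S from, E on) {
        return Optional.ofNullable(TABLE.getOrDefault(from, Map.of()).get(on));
    }

    public static void main(String[] args) {
        S s = S.CREATED;
        for (E e : new E[]{E.AUTHORIZE, E.AUTH_APPROVED, E.CAPTURE, E.CAPTURE_SUCCESS}) {
            s = next(s, e).orElseThrow();
        }
        System.out.println(s);                                        // CAPTURED
        System.out.println(next(S.ABORTED, E.AUTHORIZE).isPresent()); // false
    }
}
```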

Maven Dependencies

Add the maven dependency:

<dependency>
    <groupId>org.springframework.statemachine</groupId>
    <artifactId>spring-statemachine-starter</artifactId>
    <version>3.2.0.RELEASE</version>
</dependency>

Since we are going to use JPA, also add that:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
    <version>3.0.2</version>
</dependency>

State Machine Configuration

Using the specified states and events, let's go ahead and define the state machine:

@Configuration
@EnableStateMachineFactory
public class StateMachineConfiguration extends StateMachineConfigurerAdapter<PaymentState, PaymentEvent> {
    @Override
    public void configure(StateMachineStateConfigurer<PaymentState, PaymentEvent> states) throws Exception {
        states.withStates()
                .initial(PaymentState.CREATED)
                .states(EnumSet.allOf(PaymentState.class))
                .end(PaymentState.REVERSED)
                .end(PaymentState.REVERSE_ERROR)
                .end(PaymentState.AUTH_ERROR)
                .end(PaymentState.CAPTURE_ERROR)
                .end(PaymentState.ABORTED)
                .end(PaymentState.CANCELLED);
    }

    @Override
    public void configure(StateMachineTransitionConfigurer<PaymentState, PaymentEvent> transitions) throws Exception {
        transitions
                // Branches from state CREATED
                .withExternal().source(PaymentState.CREATED).target(PaymentState.CREATED)
                .event(PaymentEvent.AUTHORIZE)
                .action(Actions.errorCallingAction(authAction, errorAction)).guard(paymentIdGuard)
                .and()
                .withExternal().source(PaymentState.CREATED).target(PaymentState.AUTHORIZED)
                .event(PaymentEvent.AUTH_APPROVED)
                .and()
                .withExternal().source(PaymentState.CREATED).target(PaymentState.AUTH_ERROR) // end state
                .event(PaymentEvent.AUTH_DECLINED)
                .and()
                .withExternal().source(PaymentState.CREATED).target(PaymentState.ABORTED) // end state
                .event(PaymentEvent.ABORT)
                .action(Actions.errorCallingAction(abortAction, errorAction))
                // Branches from state AUTHORIZED
                .and()
                .withExternal().source(PaymentState.AUTHORIZED).target(PaymentState.AUTHORIZED)
                .event(PaymentEvent.CAPTURE)
                .action(Actions.errorCallingAction(captureAction, errorAction))
                .and()
                .withExternal().source(PaymentState.AUTHORIZED).target(PaymentState.CAPTURED)
                .event(PaymentEvent.CAPTURE_SUCCESS)
                .and()
                .withExternal().source(PaymentState.AUTHORIZED).target(PaymentState.CAPTURE_ERROR) // end state
                .event(PaymentEvent.CAPTURE_FAILED)
                .and()
                .withExternal().source(PaymentState.AUTHORIZED).target(PaymentState.CANCELLED) // end state
                .event(PaymentEvent.CANCEL)
                .action(Actions.errorCallingAction(cancelAction, errorAction))
                // Branches from state CAPTURED
                .and()
                .withExternal().source(PaymentState.CAPTURED).target(PaymentState.CAPTURED)
                .event(PaymentEvent.REVERSE)
                .action(Actions.errorCallingAction(reverseAction, errorAction))
                .and()
                .withExternal().source(PaymentState.CAPTURED).target(PaymentState.REVERSED) // end state
                .event(PaymentEvent.REVERSE_SUCCESS)
                .and()
                .withExternal().source(PaymentState.CAPTURED).target(PaymentState.REVERSE_ERROR)
                .event(PaymentEvent.REVERSE_FAILED);
    }
}

Now we can wire in the state machine factory for this configuration:

@Autowired
private StateMachineFactory<PaymentState, PaymentEvent> stateMachineFactory;

Next, we ask the factory for an instance of the state machine and start it.

StateMachine<PaymentState, PaymentEvent> sm = stateMachineFactory.getStateMachine(UUID.randomUUID());
sm.startReactively().subscribe();
logger.info("State initially: {}", sm.getState().toString());

sm.sendEvent(Mono.just(MessageBuilder.withPayload(PaymentEvent.AUTHORIZE).build())).subscribe();
logger.info("State after authorize: {}", sm.getState().toString());

sm.sendEvent(Mono.just(MessageBuilder.withPayload(PaymentEvent.AUTH_APPROVED).build())).subscribe();
logger.info("State after auth_approved: {}", sm.getState().toString());

sm.sendEvent(Mono.just(MessageBuilder.withPayload(PaymentEvent.AUTH_DECLINED).build())).subscribe();
logger.info("State after auth_declined: {}", sm.getState().toString());

Actions

Actions are executed around state transitions, where you can perform whatever business logic is needed. In this payment flow, let's focus on one of the actions: the authorize action, which in the real world would invoke the bank to authorize (or decline) a charge amount.

The authorize action is defined at the very beginning:

transitions.withExternal().source(PaymentState.CREATED).target(PaymentState.CREATED)
        .event(PaymentEvent.AUTHORIZE)
        .action(Actions.errorCallingAction(authAction, errorAction)).guard(paymentIdGuard)
        .and()
        ...

From state CREATED to state CREATED (self-invoke), triggered by the AUTHORIZE event, perform the authorization action and if it fails, call the errorAction. The pre-condition for this transition is satisfied by the paymentIdGuard that just checks for the header.

@Component
public class AuthorizeAction implements Action<PaymentState, PaymentEvent> {
    // getLogger() comes from a logging base class that is elided in this listing.
    @Override
    public void execute(StateContext<PaymentState, PaymentEvent> context) {
        Object paymentId = context.getMessageHeader(PaymentServiceImpl.PAYMENT_ID_HEADER);
        int randomErrorProbability = ..
        if (Randomizer.withProbability(() -> true, () -> false, randomErrorProbability)) {
            getLogger().info("Authorize approved! {}", paymentId);
            // Report the outcome back to the state machine as a new event
            context.getStateMachine().sendEvent(MessageBuilder.withPayload(PaymentEvent.AUTH_APPROVED)
                    .setHeader(PaymentServiceImpl.PAYMENT_ID_HEADER,
                            context.getMessageHeader(PaymentServiceImpl.PAYMENT_ID_HEADER))
                    .build());
        } else {
            getLogger().info("Authorize declined! {}", paymentId);
            context.getStateMachine().sendEvent(MessageBuilder.withPayload(PaymentEvent.AUTH_DECLINED)
                    .setHeader(PaymentServiceImpl.PAYMENT_ID_HEADER,
                            context.getMessageHeader(PaymentServiceImpl.PAYMENT_ID_HEADER))
                    .build());
        }
    }
}

In the authorize action, we're not calling any external API but using a probability factor to either approve or decline the authorization. The outcome is reported by sending another event to the state machine. We also pass along the unique payment ID as a message header. This payment ID in turn is validated by a guard, which won't allow the transition unless the header is set.

Entity Model

The entity model is simple: a single table and entity called Payment, which holds a few attributes like the state and charge amount.

@Entity
@Table(name = "payment")
@DynamicInsert
@DynamicUpdate
public class Payment extends AbstractEntity<Long> {
    @Id
    @Column(updatable = false, nullable = false)
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Enumerated(EnumType.STRING)
    private PaymentState state;

    @Column
    private String merchant;

    @Column
    private BigDecimal amount;

    @Override
    public Long getId() {
        return id;
    }
    ..
}

Repository

To persist and manage payments in the database, we use a simple Spring Data repository:

import org.springframework.data.jpa.repository.JpaRepository;

import io.roach.demo.statemachine.domain.Payment;

public interface PaymentRepository extends JpaRepository<Payment, Long> {
}

The database topic is covered in more detail below.

Service

Next, there's a payment service which acts as a facade against the state machine.

public interface PaymentService {
    Payment createPayment(Payment payment);

    Optional<Payment> findPayment(Long paymentId);

    StateMachine<PaymentState, PaymentEvent> authorizePayment(Long paymentId);

    StateMachine<PaymentState, PaymentEvent> capturePayment(Long paymentId);

    StateMachine<PaymentState, PaymentEvent> refundPayment(Long paymentId);

    StateMachine<PaymentState, PaymentEvent> cancelPayment(Long paymentId);

    StateMachine<PaymentState, PaymentEvent> abortPayment(Long paymentId);
}

The service implementation is pretty standard where the main thing happening is loading the payment entity by id and sending events to the state machine. Notice also the @Transactional annotation with REQUIRES_NEW propagation, signalling it's a boundary. It uses explicit transactions to load and update the payment entity.

    @Transactional(propagation = Propagation.REQUIRES_NEW)
    @Override
    public StateMachine<PaymentState, PaymentEvent> authorizePayment(Long paymentId) {
        StateMachine<PaymentState, PaymentEvent> sm = load(paymentId);
        sendEvent(paymentId, sm, PaymentEvent.AUTHORIZE);
        return sm;
    }
    ..

Let's look closer at the load method, which is a bit special. It has to do with the fact that we persist each state transition in the database using an interceptor. The interceptor (listener) needs to be woven into the state machine when loading it in each business method.

First, the payment is loaded by reference which means only the ID is read and other attributes are lazy-initialized, hence the payment object reference here is a lazy proxy.

Then we ask the state machine factory to instantiate a new state machine and we give it the payment ID. To add the interceptor, we need to first stop the state machine, then reset it and finally restart it.

private StateMachine<PaymentState, PaymentEvent> load(Long paymentId) {
    Payment payment = paymentRepository.getReferenceById(paymentId);

    StateMachine<PaymentState, PaymentEvent> sm = stateMachineFactory.getStateMachine(
            Long.toString(payment.getId()));

    sm.stopReactively().block();

    sm.getStateMachineAccessor()
            .doWithAllRegions(sma -> {
                sma.addStateMachineInterceptor(paymentStateChangeInterceptor);
                sma.resetStateMachineReactively(
                                new DefaultStateMachineContext<>(payment.getState(), null, null, null))
                        .block();
            });

    sm.startReactively().block();

    return sm;
}

State Change Interceptor

The state change interceptor is used to update the state of the payment before each state change. It overrides the preStateChange method and uses the payment repository to load and persist the change, all while expecting an active transaction context.

If this method throws an exception of any kind, it will be silently swallowed (but logged) by the state machine and the state transition is denied. This is sort of a pickle since the outer transaction boundary method (the payment service) is not aware of this and will proceed with the commit.

@Component
public class PaymentStateChangeInterceptor extends StateMachineInterceptorAdapter<PaymentState, PaymentEvent> {
    @Autowired
    private PaymentRepository paymentRepository;

    @Override
    public void preStateChange(State<PaymentState, PaymentEvent> state, Message<PaymentEvent> message,
                               Transition<PaymentState, PaymentEvent> transition,
                               StateMachine<PaymentState, PaymentEvent> stateMachine,
                               StateMachine<PaymentState, PaymentEvent> rootStateMachine) {
        super.preStateChange(state, message, transition, stateMachine, rootStateMachine);

        Optional.ofNullable(message).ifPresent(msg -> {
            Long paymentId = Long.class.cast(
                    msg.getHeaders().getOrDefault(PaymentServiceImpl.PAYMENT_ID_HEADER, -1L));
            if (paymentId != null) {
                // Persisting the new state requires the service's transaction to be active
                Assert.isTrue(TransactionSynchronizationManager.isActualTransactionActive(), "No transaction context!");
                Payment payment = paymentRepository.getReferenceById(paymentId);
                payment.setState(state.getId());
                paymentRepository.save(payment);
            }
        });
    }
}

Handling Retries

If you are familiar with CockroachDB or any other RDBMS where you are using serializable isolation, you are aware of the importance of adopting retry logic for transient SQL errors.

Assuming the state machine is initialized concurrently by different clients or even threads, there's a chance of state transition contention conflicts. Transaction T1 may attempt to transition from CREATED to AUTHORIZED while T2 transitions from CREATED to ABORTED. This must not be allowed but there's no inter-process or cross-thread coordination between state machines. Instead, we depend on the database for that and since CockroachDB only runs in serializable, we are guaranteed that there will be no anomalies. To prevent such anomalies, the database may raise transient SQL state 40001 errors that we should intercept and handle.

State machines are a bit tricky in the sense that there's not always an explicit point in the code that you can identify as the boundary. Retry logic must always surround the transaction boundary points. In our case, we do have explicit boundaries marked with @Transactional, but we can't just throw in an around advice that replays the method, because these errors are swallowed downstream in the state machine interceptor.

There are a few different ways to address this, including:

Capture transient exceptions in the interceptor and add these as extended state variables to be picked up by the service transaction boundary which rolls back the transaction. Then add a retry AOP around advice for the service methods.
Redesign the transaction semantics and model in infrastructure failures to the state machine with an option for retry and recovery. This isn't trivial and makes the transaction boundaries even more blurred out.
Use the CockroachDB JDBC driver with internal retries and select-for-update query rewrites. This means the SFU rewrites will likely reduce or eliminate the transient exceptions but there's also a retry mechanism in the driver itself to take care of stragglers.

Turns out that for this payment state machine, the JDBC alternative works just as well as the first option of using client-side retries.

JDBC Driver Retries

This is enabled by using the CockroachDB JDBC driver with these specific parameters enabled:

implicitSelectForUpdate = true
retryTransientErrors = true

Example application.yml:

spring:
  datasource:
    url: jdbc:cockroachdb://kai-odin-hnb.aws-eu-north-1.cockroachlabs.cloud:26257/spring_sm_demo?sslmode=require&implicitSelectForUpdate=true&retryTransientErrors=true
    username: guest
    password: UqhyOq3l_M8Yn_Uq0S4VvA
    driver-class-name: io.cockroachdb.jdbc.CockroachDriver

The Maven dependency required:

<dependency>
    <groupId>io.cockroachdb.jdbc</groupId>
    <artifactId>cockroachdb-jdbc-driver</artifactId>
    <version>1.0.0</version>
</dependency>

Client Retries

Using client-side retries is quite straightforward using AOP and an around-advice, intercepting all method joinpoints that are annotated with @Transactional.

You do however need to tag the state machine extended state with the transient errors, which are thrown somewhere in the scope of the transactional business service methods.

// transientException - derived from org.springframework.dao.TransientDataAccessException
// with underlying SQLException with state 40001.
stateMachine.getExtendedState().getVariables().put("error", transientException);

See full implementation here.

@Configuration
@EnableAspectJAutoProxy
public class AopConfig {
    @Bean
    @Profile("retry-client")
    public TransactionRetryAspect transactionRetryAspect() {
        return new TransactionRetryAspect();
    }
}
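The retry aspect itself is not listed here. As a rough idea of the loop such an around advice could wrap around a transactional method, here is a plain-Java sketch without Spring or AOP. The helper name `TransientRetry`, the backoff cap and the attempt limit are assumptions for illustration, not the article's actual implementation.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical helper: retries a unit of work when the database reports a
// serialization conflict (SQL state 40001), with capped exponential backoff.
class TransientRetry {
    static final String SERIALIZATION_FAILURE = "40001";

    static <T> T execute(Callable<T> work, int maxAttempts, long baseDelayMillis) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                // In a real service, this call would run one whole transaction.
                return work.call();
            } catch (SQLException e) {
                boolean retryable = SERIALIZATION_FAILURE.equals(e.getSQLState());
                if (!retryable || attempt >= maxAttempts) {
                    throw e; // give up: not transient, or out of attempts
                }
                // Double the delay on each attempt, capped at five seconds.
                long delay = Math.min(baseDelayMillis << (attempt - 1), 5_000L);
                Thread.sleep(delay);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = execute(() -> {
            if (++calls[0] < 3) {
                throw new SQLException("restart transaction", SERIALIZATION_FAILURE);
            }
            return "committed";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts"); // committed after 3 attempts
    }
}
```

The important property is that the whole transaction is replayed from the boundary, never just the statement that failed.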

Tests

The following test will summarize the typical payment flow:

    @RepeatedTest(10)
    @Order(1)
    public void whenAuthorizePayment_expectAuthorizedOrRejectedState() {
        Payment payment = paymentService.createPayment(this.payment);
        Assertions.assertEquals(PaymentState.CREATED, payment.getState());

        StateMachine<PaymentState, PaymentEvent> sm = paymentService.authorizePayment(payment.getId());
        Assertions.assertTrue(EnumSet.of(PaymentState.AUTH_ERROR, PaymentState.AUTHORIZED)
                .contains(sm.getState().getId()));

        Payment authedPayment = paymentService.findPayment(payment.getId())
                .orElseThrow(() -> new ObjectRetrievalFailureException(Payment.class, payment.getId()));
        Assertions.assertTrue(EnumSet.of(PaymentState.AUTH_ERROR, PaymentState.AUTHORIZED)
                .contains(authedPayment.getState()));
    }

Conclusion

This article explains the concept of a state machine, which is a mathematical model of computation used to build an abstract machine, and provides a practical example of payments using Spring Statemachine. It explains the terminology used in state machines, such as states, events, actions, transitions, guards and extended states. It also provides code snippets for setting up a state machine for a payment flow, with actions to be performed around state transitions.

Exploring the Secondary Regions Feature in CockroachDB

Kai Niemi — Fri, 31 Mar 2023 14:32:34 GMT

Introduction

In a previous article, we looked at using super regions for data domiciling in CockroachDB. To recap, data domiciling is the art of controlling the placement of subsets of data in specific regions or locations. This is often required by privacy regulations like GDPR and The Wire Act in the US.

In this article, we'll look at a new concept in CockroachDB (since v22.2) called Secondary Regions. Secondary regions allow you to define a database region that will be used for failover in the event your primary region goes down. Previously, when the primary region failed, the leaseholders would be transferred to another region at random. Secondary regions now add control over that "fail-over".

In the context of super regions, if the primary region is part of a super region, the secondary region must also be a region within the primary super region.

For a primer in data domiciling and database survival in CockroachDB, see:

Let's go through a demo to see how secondary regions work with super regions.

Demo

Similar to the super region demo, we will deploy a global cluster of 18 nodes (on different ports on a single machine) stretching from the EU to the east and west coasts of the US.

When properly configured, the DB Console should look like this:

Cluster Setup

The following script will start 18 nodes on a local machine.

#!/bin/bash

portbase=26258
httpportbase=8081
host=localhost

LOCALITY_ZONE=(
  'region=eu-north-1,zone=eu-north-1a'
  'region=eu-north-1,zone=eu-north-1b'
  'region=eu-north-1,zone=eu-north-1c'
  'region=eu-west-1,zone=eu-west-1a'
  'region=eu-west-1,zone=eu-west-1b'
  'region=eu-west-1,zone=eu-west-1c'
  'region=eu-west-2,zone=eu-west-2a'
  'region=eu-west-2,zone=eu-west-2b'
  'region=eu-west-2,zone=eu-west-2c'
  'region=us-east-1,zone=us-east-1a'
  'region=us-east-1,zone=us-east-1b'
  'region=us-east-1,zone=us-east-1c'
  'region=us-east-2,zone=us-east-2a'
  'region=us-east-2,zone=us-east-2b'
  'region=us-east-2,zone=us-east-2c'
  'region=us-west-1,zone=us-west-1a'
  'region=us-west-1,zone=us-west-1b'
  'region=us-west-1,zone=us-west-1c'
)

node=0

for zone in "${LOCALITY_ZONE[@]}"
do
    # Each node gets its own SQL and HTTP port, offset from the base ports
    node=$((node + 1))
    offset=$((node - 1))
    port=$((portbase + offset))
    httpport=$((httpportbase + offset))
    # The first three nodes act as the join list for every node
    port1=${portbase}
    port2=$((portbase + 1))
    port3=$((portbase + 2))
    join=${host}:${port1},${host}:${port2},${host}:${port3}
    mempool="128MiB"

    cockroach start \
    --locality=${zone} \
    --port=${port} \
    --http-port=${httpport} \
    --advertise-addr=${host}:${port} \
    --join=${join} \
    --insecure \
    --store=datafiles/n${node} \
    --cache=${mempool} \
    --max-sql-memory=${mempool} \
    --background
done

cockroach init --insecure --host=${host}:${portbase}

Check that the cluster is running by browsing to http://localhost:8081/.

Next, we'll add the regions and configure the database for region-level survival.

cockroach sql --insecure --host=localhost:26258

Note: For the next steps, you will need an enterprise trial license key.

For the node map to show all regions, add these localities:

DELETE FROM system.locations WHERE 1 = 1;

INSERT into system.locations
VALUES ('region', 'eu-north-1', 59.0, 18.0),
       ('region', 'us-east-1', 37.478397, -76.453077),
       ('region', 'us-east-2', 40.417287, -76.453077),
       ('region', 'us-east-3', 25.457287, -80.453077),
       ('region', 'us-west-1', 38.837522, -120.895824),
       ('region', 'us-west-2', 43.804133, -120.554201),
       ('region', 'ca-central-1', 56.130366, -106.346771),
       ('region', 'eu-central-1', 50.110922, 8.682127),
       ('region', 'eu-west-1', 53.142367, -7.692054),
       ('region', 'eu-west-2', 51.507351, -0.127758),
       ('region', 'eu-west-3', 48.856614, 2.352222),
       ('region', 'ap-northeast-1', 35.689487, 139.691706),
       ('region', 'ap-northeast-2', 37.566535, 126.977969),
       ('region', 'ap-northeast-3', 34.693738, 135.502165),
       ('region', 'ap-southeast-1', 1.352083, 103.819836),
       ('region', 'ap-southeast-2', -33.86882, 151.209296),
       ('region', 'ap-south-1', 19.075984, 72.877656),
       ('region', 'sa-east-1', -23.55052, -46.633309),
       ('region', 'eastasia', 22.267, 114.188),
       ('region', 'southeastasia', 1.283, 103.833),
       ('region', 'centralus', 41.5908, -93.6208),
       ('region', 'eastus', 37.3719, -79.8164),
       ('region', 'eastus2', 36.6681, -78.3889),
       ('region', 'westus', 37.783, -122.417),
       ('region', 'northcentralus', 41.8819, -87.6278),
       ('region', 'southcentralus', 29.4167, -98.5),
       ('region', 'northeurope', 53.3478, -6.2597),
       ('region', 'westeurope', 52.3667, 4.9),
       ('region', 'japanwest', 34.6939, 135.5022),
       ('region', 'japaneast', 35.68, 139.77),
       ('region', 'brazilsouth', -23.55, -46.633),
       ('region', 'australiaeast', -33.86, 151.2094),
       ('region', 'australiasoutheast', -37.8136, 144.9631),
       ('region', 'southindia', 12.9822, 80.1636),
       ('region', 'centralindia', 18.5822, 73.9197),
       ('region', 'westindia', 19.088, 72.868),
       ('region', 'canadacentral', 43.653, -79.383),
       ('region', 'canadaeast', 46.817, -71.217),
       ('region', 'uksouth', 50.941, -0.799),
       ('region', 'ukwest', 53.427, -3.084),
       ('region', 'westcentralus', 40.890, -110.234),
       ('region', 'westus2', 47.233, -119.852),
       ('region', 'koreacentral', 37.5665, 126.9780),
       ('region', 'koreasouth', 35.1796, 129.0756),
       ('region', 'francecentral', 46.3772, 2.3730),
       ('region', 'francesouth', 43.8345, 2.1972),
       ('region', 'us-east1', 33.836082, -81.163727),
       ('region', 'us-east4', 37.478397, -76.453077),
       ('region', 'us-central1', 42.032974, -93.581543),
       ('region', 'us-west1', 43.804133, -120.554201),
       ('region', 'northamerica-northeast1', 56.130366, -106.346771),
       ('region', 'europe-west1', 50.44816, 3.81886),
       ('region', 'europe-west2', 51.507351, -0.127758),
       ('region', 'europe-west3', 50.110922, 8.682127),
       ('region', 'europe-west4', 53.4386, 6.8355),
       ('region', 'europe-west6', 47.3769, 8.5417),
       ('region', 'asia-east1', 24.0717, 120.5624),
       ('region', 'asia-east2', 24.0717, 120.5624),
       ('region', 'asia-northeast1', 35.689487, 139.691706),
       ('region', 'asia-southeast1', 1.352083, 103.819836),
       ('region', 'australia-southeast1', -33.86882, 151.209296),
       ('region', 'asia-south1', 19.075984, 72.877656),
       ('region', 'southamerica-east1', -23.55052, -46.633309),
       ('region', 'gcp-europe-west4', 53.4386, 6.8355),
       ('region', 'gcp-us-west2', 43.804133, -120.554201),
       ('region', 'gcp-australia-southeast1', -33.86882, 151.209296),
       ('region', 'asia-southeast-1', 1.290270, 103.851959),
       ('region', 'asia-southeast-2', -6.173292, 106.841036),
       ('region', 'asia-southeast-3', 3.140853, 101.693207),
       ('region', 'au-nsw', -31.86882, 152.209296),
       ('region', 'au-vic', -37.5, 144.5),
       ('region', 'sa', -23.55052, -46.633309);

Now, let's configure the regions and enable region survival:

create database test;
use test;

-- Add the 6 regions
alter database test primary region "eu-north-1";
alter database test add region "eu-west-1";
alter database test add region "eu-west-2";
alter database test add region "us-east-2";
alter database test add region "us-east-1";
alter database test add region "us-west-1";
show regions;

-- Add the super regions
SET enable_super_regions = 'on';
ALTER DATABASE test ADD SUPER REGION eu VALUES "eu-north-1","eu-west-1","eu-west-2";
ALTER DATABASE test ADD SUPER REGION us VALUES "us-west-1","us-east-2","us-east-1";

-- Enable region survival
ALTER DATABASE test SURVIVE REGION FAILURE;

Next, let's verify that we have two super regions:

SHOW SUPER REGIONS FROM DATABASE test;

Add Test Data

The main schema parts are now done so let's add two tables and some sample data. The first table postal_codes is a global table and the second table users is using regional-by-row locality.

-- Add a GLOBAL table
create table postal_codes
(
    id   int primary key,
    code string
);
ALTER TABLE postal_codes SET LOCALITY GLOBAL;
-- Insert some data
insert into postal_codes (id, code)
select unique_rowid() :: int,
       md5(random()::text)
from generate_series(1, 100);
-- Add a regional-by-row table
CREATE TABLE users
(
    id          INT   NOT NULL,
    name        STRING NULL,
    postal_code STRING NULL,
    PRIMARY KEY (id ASC)
);
-- Make it RBR
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;
insert into users (id,name,postal_code,crdb_region)
select no,
       gen_random_uuid()::string,
       '123 45',
       'eu-north-1'
from generate_series(1, 10) no;
insert into users (id,name,postal_code,crdb_region)
select no,
       gen_random_uuid()::string,
       '123 45',
       'eu-west-1'
from generate_series(11, 20) no;
insert into users (id,name,postal_code,crdb_region)
select no,
       gen_random_uuid()::string,
       '123 45',
       'eu-west-2'
from generate_series(21, 30) no;
insert into users (id,name,postal_code,crdb_region)
select no,
       gen_random_uuid()::string,
       '123 45',
       'us-east-1'
from generate_series(31, 40) no;
insert into users (id,name,postal_code,crdb_region)
select no,
       gen_random_uuid()::string,
       '123 45',
       'us-east-2'
from generate_series(41, 50) no;
insert into users (id,name,postal_code,crdb_region)
select no,
       gen_random_uuid()::string,
       '123 45',
       'us-west-1'
from generate_series(51, 60) no;
select *,crdb_region from users;
select crdb_region,count(1) from users group by crdb_region;

Verify Zone Configuration

Let's look at the zone configurations before applying secondary regions. This will tell what the voter constraints and lease preferences are for each table.

select raw_config_sql from [show zone configuration for table postal_codes];

Output:

ALTER TABLE postal_codes CONFIGURE ZONE USING
      range_min_bytes = 134217728,
      range_max_bytes = 536870912,
      gc.ttlseconds = 90000,
      global_reads = true,
      num_replicas = 7,
      num_voters = 5,
      constraints = '{+region=eu-north-1: 1, +region=eu-west-1: 1, +region=eu-west-2: 1, +region=us-east-1: 1, +region=us-east-2: 1, +region=us-west-1: 1}',
      voter_constraints = '{+region=eu-north-1: 2}',
      lease_preferences = '[[+region=eu-north-1]]'
(1 row)
-- For users table:
-- select raw_config_sql from [show zone configuration for table users];
ALTER DATABASE test CONFIGURE ZONE USING
      range_min_bytes = 134217728,
      range_max_bytes = 536870912,
      gc.ttlseconds = 90000,
      num_replicas = 7,
      num_voters = 5,
      constraints = '{+region=eu-north-1: 1, +region=eu-west-1: 1, +region=eu-west-2: 1, +region=us-east-1: 1, +region=us-east-2: 1, +region=us-west-1: 1}',
      voter_constraints = '{+region=eu-north-1: 2}',
      lease_preferences = '[[+region=eu-north-1]]'
(1 row)

Add Secondary Region

Now we are ready to add the secondary region. We verified that eu-north-1 is the primary region and part of super region eu, so we add eu-west-1 as the secondary region:

ALTER DATABASE test SET SECONDARY REGION "eu-west-1";
show regions;

Let's see what changed in the zone configs:

select raw_config_sql from [show zone configuration for table postal_codes];
select raw_config_sql from [show zone configuration for table users];

Output:

ALTER TABLE postal_codes CONFIGURE ZONE USING
      range_min_bytes = 134217728,
      range_max_bytes = 536870912,
      gc.ttlseconds = 90000,
      global_reads = true,
      num_replicas = 8,
      num_voters = 5,
      constraints = '{+region=eu-north-1: 1, +region=eu-west-1: 1, +region=eu-west-2: 1, +region=us-east-1: 1, +region=us-east-2: 1, +region=us-west-1: 1}',
      voter_constraints = '{+region=eu-north-1: 2, +region=eu-west-1: 2}',
      lease_preferences = '[[+region=eu-north-1], [+region=eu-west-1]]'
(1 row)
-- And
ALTER TABLE users CONFIGURE ZONE USING
      range_min_bytes = 134217728,
      range_max_bytes = 536870912,
      gc.ttlseconds = 90000,
      num_replicas = 8,
      num_voters = 5,
      constraints = '{+region=eu-north-1: 1, +region=eu-west-1: 1, +region=eu-west-2: 1, +region=us-east-1: 1, +region=us-east-2: 1, +region=us-west-1: 1}',
      voter_constraints = '{+region=eu-north-1: 2, +region=eu-west-1: 2}',
      lease_preferences = '[[+region=eu-north-1], [+region=eu-west-1]]'
(1 row)

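We can tell that adding a secondary region changed two things: the voter constraints now place two voting replicas in the secondary region, and the lease preferences list the secondary region as the fallback after the primary region.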
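To see why this voter placement tolerates the loss of a whole region, here is a minimal Python sketch of the majority arithmetic. It is not CockroachDB code: the function name is made up for illustration, and the placement of two voters in eu-north-1, two in eu-west-1 and one in eu-west-2 is taken from the replica localities shown in the SHOW RANGE output further down.

```python
def survives_region_loss(voters_by_region):
    """For each region, report whether a majority of voters remains if it is lost."""
    total = sum(voters_by_region.values())
    quorum = total // 2 + 1  # a majority of all voting replicas
    return {
        region: total - count >= quorum
        for region, count in voters_by_region.items()
    }


# Five voters spread as 2 + 2 + 1: any single region can go dark.
spread = {"eu-north-1": 2, "eu-west-1": 2, "eu-west-2": 1}
print(survives_region_loss(spread))

# Hypothetical counter-example: three of five voters in one region.
skewed = {"eu-north-1": 3, "eu-west-1": 2}
print(survives_region_loss(skewed))
```

With the 2 + 2 + 1 spread, three of five voters always remain, so ranges stay available. In the skewed placement, losing eu-north-1 leaves only two voters and no majority.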

Simulate Region Failure

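Let's simulate a primary region failure by simply killing the region's nodes. Note that kill -TERM asks each node to shut down; for an abrupt, ungraceful stop use kill -KILL instead.

ps -ef | grep cockroach | grep region=eu-north-1
# note the PIDs, then run:
kill -TERM id1
kill -TERM id2
kill -TERM id3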

Check the range lease holder locality for a potential user row with id #1:

SHOW RANGE FROM TABLE users FOR ROW ('eu-north-1',1);

Notice region=eu-west-1,zone=eu-west-1b in the output:

  start_key |      end_key      | range_id | lease_holder |      lease_holder_locality       |   replicas    | replica_localities
------------+-------------------+----------+--------------+----------------------------------+---------------+--------------------
  /"\x80"   | /"\x80"/PrefixEnd |       55 |            3 | region=eu-west-1,zone=eu-west-1b | {1,3,8,12,17} | {"region=eu-north-1,zone=eu-north-1a","region=eu-west-1,zone=eu-west-1b","region=eu-west-1,zone=eu-west-1c","region=eu-north-1,zone=eu-north-1c","region=eu-west-2,zone=eu-west-2b"}
(1 row)

Lastly, if you restart the 3 nodes again (just re-run the original script) you will see the lease-holder reverting to the primary region (region=eu-north-1,zone=eu-north-1b):

SHOW RANGE FROM TABLE users FOR ROW ('eu-north-1',1);
  start_key |      end_key      | range_id | lease_holder |       lease_holder_locality        |   replicas   | replica_localities
------------+-------------------+----------+--------------+------------------------------------+--------------+--------------------
  /"\x80"   | /"\x80"/PrefixEnd |       55 |            7 | region=eu-north-1,zone=eu-north-1b | {1,3,7,8,17} | {"region=eu-north-1,zone=eu-north-1a","region=eu-west-1,zone=eu-west-1b","region=eu-north-1,zone=eu-north-1b","region=eu-west-1,zone=eu-west-1c","region=eu-west-2,zone=eu-west-2b"}
(1 row)

Conclusion

This article explains how to configure CockroachDB's Secondary Regions feature, which allows for control over failover in the event of a primary region going down. It provides a demo of how to set up a global cluster, how to configure the database for region-level survival, and how to add localities to the node map. It also provides instructions on how to add two tables and sample data, and how to verify the zone configuration before applying secondary regions.

Generate Workloads for CockroachDB

Kai Niemi — Fri, 31 Mar 2023 14:25:06 GMT

Introduction

CockroachDB Workload is an easy-to-use command line tool for generating load against CockroachDB.

It includes the following workloads:

ledger - Financial ledger use case using the double-entry principle
order - Purchase orders as part of an e-commerce use case
outbox - Simulates the transactional outbox and inbox patterns

This tool provides an interactive shell for running workloads and monitoring progress and performance. Each workload has an init command to set up the test fixture and one or more run commands to execute the actual workload for a given period of time.

Source Code

The source code for the workload tool can be found on GitHub.

Project Setup

The project is packaged as a single executable JAR file and runs on any platform for which there is a Java 17+ (LTS) runtime.

Prerequisites

CockroachDB, with a trial enterprise license.
Linux / macOS
Java 17
- https://openjdk.org/projects/jdk/17/
- https://www.oracle.com/java/technologies/downloads/#java17
Maven 3+ (optional, embedded wrapper available)
- https://maven.apache.org/

Setup CockroachDB

Create a local cluster of at least three nodes:

cockroach start --port=26257 --http-port=8080 --advertise-addr=localhost:26257 --join=localhost:26257 --insecure --store=datafiles/n1 --background
cockroach start --port=26258 --http-port=8081 --advertise-addr=localhost:26258 --join=localhost:26257 --insecure --store=datafiles/n2 --background
cockroach start --port=26259 --http-port=8082 --advertise-addr=localhost:26259 --join=localhost:26257 --insecure --store=datafiles/n3 --background
cockroach init --insecure --host=localhost:26257

Next, set up a database called workload:

cockroach sql --insecure --host=localhost:26257 -e "CREATE database workload"

Setup Workload

Install the JDK

Install the JDK (Ubuntu example):

sudo apt-get install openjdk-17-jdk

Confirm the installation by running:

java --version

Clone the project

git clone git@github.com:cockroachlabs-field/cockroachdb-workload.git
cd cockroachdb-workload

Build the executable jar

chmod +x mvnw
./mvnw clean install

Usage

Start the shell with:

java -jar target/workload.jar --help

Type help for additional CLI guidance.

Ledger Workload

The Ledger workload demonstrates a simple financial accounting system using the double-entry bookkeeping principle. The concept is to move funds between accounts using balanced, multi-legged transactions at a high frequency. As a financial system, it will conserve money at all times and provide an audit trail for all transactions performed towards the accounts.
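The double-entry rule can be illustrated with a minimal Python sketch. This is not the workload tool's implementation (which is Java); the function name, account names and amounts are made up for illustration. The idea is that the legs of a transfer must sum to zero, so the total amount of money never changes.

```python
from decimal import Decimal


def post_transfer(balances, legs):
    """Apply a multi-legged transfer; legs is a list of (account, amount) pairs."""
    if sum(amount for _, amount in legs) != 0:
        raise ValueError("unbalanced transaction: legs must sum to zero")
    updated = dict(balances)
    for account, amount in legs:
        updated[account] += amount
    return updated


balances = {
    "alice": Decimal("100.00"),
    "bob": Decimal("50.00"),
    "fees": Decimal("0.00"),
}
# One debit and two credits that balance out to zero.
legs = [
    ("alice", Decimal("-10.00")),
    ("bob", Decimal("9.50")),
    ("fees", Decimal("0.50")),
]
after = post_transfer(balances, legs)
print(sum(balances.values()), sum(after.values()))  # same total before and after
```

An unbalanced set of legs is rejected before any balance is touched, which is what lets the ledger conserve money at all times.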

Start the workload shell with:

java -jar target/workload.jar ledger

Then initialize the workload by creating an account plan:

init --help
# or just type 'help'

The commands can run concurrently. The "transfer" command will move funds between random accounts and the "balance" command will just read account balances.

transfer --help

Hint: The ledger workload is implemented using both plain JDBC and JPA with Hibernate as ORM provider. You can switch between these implementations using the --jpa option (where JDBC is default).

Order Workload

The Order workload is a basic purchase order creation and reading workload. Order creation and reading workloads can run concurrently.

Start the workload with:

java -jar target/workload.jar order
# Then in the shell:
init
run

Outbox Workload

The Outbox workload simulates the transactional outbox and inbox design patterns. It writes event records to an outbox table that has TTL and CDC enabled (optional). It can be used to evaluate creating a CDC-based data pipeline and/or the TTL mechanism.
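The core of the outbox pattern is that the business write and the event record are committed in the same transaction. The following Python sketch shows that idea using the standard library's SQLite module as a stand-in for CockroachDB; the table names, columns and event payload are assumptions for illustration and do not reflect the workload's actual schema.

```python
import json
import sqlite3


def open_db():
    """Create an in-memory database with a business table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "aggregate_id INTEGER NOT NULL, "
        "payload TEXT NOT NULL)"
    )
    return conn


def place_order(conn, order_id, total):
    """Write the order and its event atomically: both rows commit, or neither."""
    event = json.dumps({"type": "OrderPlaced", "id": order_id, "total": total})
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, payload) VALUES (?, ?)",
            (order_id, event),
        )


conn = open_db()
place_order(conn, 1, 99.5)
print(conn.execute("SELECT aggregate_id, payload FROM outbox").fetchall())
```

A separate consumer (a CDC feed in the workload described here) then reads the outbox rows and publishes them, and a TTL can expire the rows afterwards.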

Start the workload with:

java -jar target/workload.jar outbox
# In the shell:
init
run

Extending

It's fairly simple to extend this tool with additional workloads. There are skeleton beans available in the "io.cockroachdb.workload.template" namespace that can be copied to a new workload package.

The main steps include:

Implement an init method for setting up the test fixture
- Add drop and create SQL scripts if needed
Implement a run method for running the workload for a defined period
- Choose between JDBC or JPA or any other data access library

Conclusion

In this article, we looked at a simple workload tool for CockroachDB using the Java stack. It can be used to send semi-realistic SQL traffic for pure testing and load-testing purposes. It's also quite extendable.

Distributed System Tradeoffs

Kai Niemi — Mon, 30 Jan 2023 19:55:16 GMT

Introduction

This article takes a look at some fundamental tradeoffs in distributed systems captured by CAP, PACELC, Harvest & Yield. Let's begin with distributed systems 101 - the Two Generals Problem and FLP before jumping to the CAP Theorem. We'll finish off with an example of using CockroachDB to implement exactly-once processing semantics.

Two-Generals Problem

One thought experiment that captures the challenge of indeterminate outcomes and agreement in distributed systems is the widely known Two Generals Problem. It shows that it's impossible for two parties to reach an agreement if communication is asynchronous in the presence of link failures.

It goes roughly like this: Two allied armies (A and B) are on opposite sides of a fortified city in a valley, defended by army C. The two generals leading armies A and B must coordinate an attack on the city to end the siege. If either army goes in alone, they will be defeated, so a coordinated attack is required.

Each army general has an initial order; attack the city at some given point in time. The only way for the generals to communicate is by sending messages carried by runners through the valley, which are then prone to get captured or neutralized.

It was initially agreed that the general of army A would communicate the time of the attack. General A decides on a time and dispatches a runner through the valley to carry it to the general of army B. However, general A has no idea whether the message got through.

Army general B decides that when the message arrives, a new runner will be sent back to acknowledge the message before attacking. But again, there's no way of telling if the reply message got through the valley or not.

(Above) One of the generals and his adjutant are on the lookout for runners.

The punch line is that it doesn't matter how many runners are dispatched from either side to solve this problem. To approach this problem militarily, both generals must accept that messages could get lost. They can mitigate that uncertainty by sending many runners instead of a single one, but it still doesn't change the fact. Each side could send its whole army carrying the message, but the problem is still the same, which leads to the following result:

There's no deterministic algorithm for reaching a consensus if communication is asynchronous in the presence of link failures.

Back in the binary world, a node can't determine whether another critical node for decision-making is either dead or taking too long to answer. The outcome is undetermined.
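The uncertainty can be made concrete with a small Python sketch that enumerates the possible outcomes of one message and one acknowledgement. The function names are made up for illustration; the only assumption is that an acknowledgement can arrive only if the original message arrived first.

```python
from itertools import product


def possible_worlds():
    """All (message_delivered, ack_delivered) outcomes for one message and one ack."""
    return [
        (msg, ack)
        for msg, ack in product([True, False], repeat=2)
        if msg or not ack  # no ack without a delivered message
    ]


def worlds_consistent_with(a_saw_ack):
    """Outcomes that general A cannot tell apart, knowing only whether an ack arrived."""
    return [world for world in possible_worlds() if world[1] == a_saw_ack]


# No ack arrived: either the order was lost, or B has it and the ack was lost.
print(worlds_consistent_with(False))
# An ack arrived: A is now certain, but B cannot know that its ack got through.
print(worlds_consistent_with(True))
```

When no acknowledgement arrives, two different worlds look identical to general A. Adding another round of acknowledgements only moves the same uncertainty to whoever sent the last message.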

FLP Impossibility

In a paper by Fischer, Lynch, and Paterson, the authors describe what's known as the FLP Impossibility Problem. This paper (winner of the Dijkstra Prize) shows that, given the assumption that processing is asynchronous and there's no shared notion of time between processes, there exists no protocol that can guarantee consensus in bounded time in the presence of a single faulty process.

FLP is an impossibility proof that tells us that we cannot always reach a consensus in an asynchronous system in a bounded time. Specifically, FLP claims that we cannot achieve all three properties of Termination (Liveness), Agreement (Safety) and Fault Tolerance at the same time under the asynchronous network model.

That suggests that consensus is impossible, but in practice, by introducing the notion of synchrony and timeouts into the model (a relaxation), reaching consensus is indeed possible. Otherwise there wouldn't be protocols such as Paxos, Raft, ZAB and so forth.

Two-generals paradox and FLP are a good introduction to understanding the tradeoffs in distributed systems, leading to the infamous CAP theorem and its derivatives, or relatives.

CAP, PACELC, Harvest & Yield

The CAP conjecture, formulated by Dr. Eric Brewer, was intended to highlight tradeoffs in distributed systems. CAP is based on a simple abstraction: a single register that can hold some value that you either read (get) or write (set). When interacting with this register, the CAP theorem explicitly prohibits having both consistency (C) and availability (A) in the presence of partitions (P).

It is quite a rough definition of the tensions between C, A and P in distributed systems, but it provides a useful taxonomy for highlighting these opposing forces. CAP does not model a very practical or realistic system, but that's not the point. It's more about highlighting these opposing forces and which tradeoffs can be made.

Another, perhaps more intuitive way to put it, is that a system is either consistent or available when partitioned because partitions are not something you can choose not to have due to the network fallacies.

C, as in Consistency, stands for linearizability, which is a strong consistency property where every read on any node returns the latest write or an error (latest referring to real wall-clock time and not any causal order). This is different from, and not to be confused with, the C in ACID, which stands for moving from one valid state to another valid state, preserving constraints.

A, as in Availability, means total availability, where every non-failing node in the system must return a successful response, but not necessarily the latest written value. Note that "must respond" is not time-bound, so it's subjective how long that wait can be.

P, as in Partition tolerance means the system will continue to function by preserving either C or A, but never both, despite an arbitrary number of messages dropped by the network, like a network partition or split-brain scenario with multiple primaries (isolated nodes that think they are all leaders).

That is the original Brewer's conjecture which was later formally proven by Seth Gilbert and Nancy Lynch (MIT), turning it into a theorem.

Ideally, we would want both consistency and availability when partitioned, but it's just not possible when processes are unable to share information and coordinate. Availability requires that any non-failing node must deliver a response, while consistency requires results to be linearizable.

The bottom line is that we cannot implement a system that provides both CAP consistency and CAP availability in the presence of a network partition, and partitions are a constant we can't ignore because our network is not always reliable.

This narrows the CAP continuum of choices to systems that are either:

Consistent and partition tolerant (C+P)

A system that prefers refusing requests over serving inconsistent data. A C+P system MUST refuse requests on all or some nodes in the event of P to preserve C.

This is the category that Google Spanner as well as CockroachDB sorts into, including most traditional SQL databases like SQL Server, Oracle, PostgreSQL etc.

CAP is often illustrated in a triangle or a Venn diagram. Another way to look at it is that C and A are pivoting on top of P with no equilibrium. Either C or A will have a majority mass.

Available and partition tolerant (A+P)

A system that prefers serving potentially inconsistent data over refusing requests. An A+P system MUST respond to requests on all non-failing nodes to preserve A in the event of P, and then have to forfeit C. High availability is a separate concept (linked more to capacity) from CAP availability.

This is the category that dynamo-based systems sort into like Cassandra, DynamoDB, Riak and so on.

A C+A system would only mean that the system fails to provide either C or A in the presence of a partition (P) and often unpredictably. Theoretically, a C+A system would model a single node with a single wall clock and no networking (in other words, it doesn't exist).
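The difference between the two choices can be sketched with a toy replicated register in Python. This is a deliberately simplified model, not how any real database works: the class and method names are made up, writes are copied instantly to every reachable node, and there is no healing or conflict resolution after the partition.

```python
class Unavailable(Exception):
    """Raised by a C+P node that refuses a request rather than risk inconsistency."""


class Cluster:
    def __init__(self, nodes, mode):
        self.mode = mode                        # "CP" or "AP"
        self.values = {node: None for node in nodes}
        self.groups = [set(nodes)]              # one group means no partition

    def partition(self, *groups):
        self.groups = [set(group) for group in groups]

    def _reachable(self, node):
        side = next(group for group in self.groups if node in group)
        # A C+P node only serves requests when it can reach a majority.
        if self.mode == "CP" and len(side) <= len(self.values) // 2:
            raise Unavailable(f"{node} cannot reach a majority")
        return side

    def write(self, node, value):
        for peer in self._reachable(node):
            self.values[peer] = value

    def read(self, node):
        self._reachable(node)
        return self.values[node]


for mode in ("CP", "AP"):
    cluster = Cluster(["n1", "n2", "n3"], mode)
    cluster.write("n1", 1)                      # replicated to all three nodes
    cluster.partition({"n1", "n2"}, {"n3"})     # n3 is cut off
    cluster.write("n1", 2)                      # the majority side moves on
    try:
        print(mode, "read on n3 ->", cluster.read("n3"))
    except Unavailable as error:
        print(mode, "read on n3 -> refused:", error)
```

In C+P mode the isolated node refuses the read, preserving consistency at the cost of availability. In A+P mode it answers with the stale value 1 while the majority side already holds 2.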

The CAP conjecture is sometimes illustrated as a triangle as if we could turn the knobs and have as much as we want on each of the three parameters. Balancing on the top of partition tolerance, sort of. In the meaning of the conjecture, however, we can tune by choosing between C and A but partition tolerance is not something that can be tuned or traded in any practical sense.

PACELC

The PACELC conjecture is an extension to CAP, stating that in the presence of network partitions, there's a choice between consistency and availability (PAC). Else (E), even if the system is running in a steady state, there's still a choice between latency and consistency (LC). It brings latency into the picture during a steady state, since unavailability is ultimately a metric of infinitely long latency.

Harvest & Yield

Harvest and Yield depart from the CAP conjecture by using more relaxed assumptions than the strict binary choice between availability and consistency.

Harvest defines the completeness of a response. Let's say you get 9 of 10 items due to a partial failure, which is considered better than returning nothing.

Yield defines the ratio of completed requests to the total number of attempted requests. That way, we can trade between harvest and yield as a more relative approach.
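Both metrics are simple ratios, as the following Python sketch shows. The scatter-gather function and the shard callables are made up for illustration; harvest is approximated here as the fraction of shards that answered.

```python
def scatter_gather(shards):
    """Query every shard, tolerate failed ones, and report the harvest of the answer."""
    items, answered = [], 0
    for shard in shards:
        try:
            items.extend(shard())
            answered += 1
        except ConnectionError:
            pass  # partial failure: answer with what we have
    return items, answered / len(shards)


def request_yield(completed, attempted):
    """Fraction of attempted requests that completed."""
    return completed / attempted


def healthy():
    return ["item"]


def failing():
    raise ConnectionError("shard unreachable")


items, harvest = scatter_gather([healthy] * 9 + [failing])
print(len(items), harvest)        # 9 items, harvest 0.9
print(request_yield(990, 1000))   # 0.99
```

A system that returns the 9 items it has keeps its yield high at the price of a lower harvest, while one that fails the whole request does the opposite.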

Whether PACELC or H & Y helps to clarify the tradeoffs better than CAP, is totally up to you.

Exactly-Once Semantics

Distributed databases aren't the only systems with strong guarantees. Messaging and streaming systems can also provide strong consistency guarantees, in their case expressed as exactly-once processing semantics. That is, if the outcome of processing an event or message is to create a permanent record with side effects, those side effects cannot be lost or turn out differently due to double-processing caused by redelivery.

Exactly once delivery at the protocol or transport level is impossible under the models described by the two-generals paradox and FLP. There are however a few methods to ensure exactly-once processing semantics in a practical sense.

One approach is to update the global state transactionally (atomically) as part of the final output from the processing. It requires that the messaging system and global state system (database) can participate in a two-phase commit protocol, which can be difficult to achieve in practice (XA/two-phase commit) since it has its own set of tradeoffs.

Another, less reliable, but still useful method without a two-phase commit is to surface business rule and data integrity conflicts early and delay the commit and message acknowledgement phase as late as possible in the business transaction. A kind of "delayed" one-phase commit. The main tradeoff is that there are no atomicity and consistency guarantees in case of transient failures in the final commit stage, which can be unacceptable depending on the business domain.

Idempotence is a third option for exactly-once processing, often used with the inbox and outbox patterns. Idempotency is a key property and is widely adopted in distributed systems design because it relieves the burden on clients to perform cleanups in case of failures. A client that doesn't receive an acknowledgement of an operation in due time (indeterminate) or a failure can simply resend the same request any number of times without causing multiple side effects (or being liable for it).

Idempotency:
Calling a function multiple times is the same as calling it once.

Multiple machines in different data centers may try to process the same event concurrently and repeatedly. That is fine, as long as all outcomes of that processing are observable exactly once in the output, which is the equivalent of exactly-once processing semantics.

In CockroachDB, uniqueness constraints are enforced globally. It provides ACID transaction guarantees and strong consistency guarantees regardless of the deployment topology, meaning also when nodes are deployed in a global manner stretching different regions or even cloud providers. That way, the database can represent a global control plane for recording side effects.

Example: A message processor writes some output and commits the outcome in a global control plane. When another processor does the same writes, based on the same deterministic input, it will have no additional effect (cancelled out).
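A minimal Python sketch of this idea follows, using the standard library's SQLite module as a local stand-in for the global control plane. The table names, columns and message IDs are assumptions for illustration. The uniqueness constraint on the message ID is what makes redelivery harmless: the side effect is applied in the same transaction that records the message as processed.

```python
import sqlite3


def open_store():
    """Create an inbox with a unique message ID plus a table holding the side effect."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inbox (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
    conn.execute("INSERT INTO balances (account, amount) VALUES ('alice', 100)")
    conn.commit()
    return conn


def process_once(conn, message_id, account, amount):
    """Apply a message's side effect at most once; return True if this call applied it."""
    with conn:  # one transaction for the dedup record and the side effect
        cursor = conn.execute(
            "INSERT OR IGNORE INTO inbox (message_id) VALUES (?)", (message_id,)
        )
        if cursor.rowcount == 0:
            return False  # already processed: the redelivery has no effect
        conn.execute(
            "UPDATE balances SET amount = amount + ? WHERE account = ?",
            (amount, account),
        )
        return True


conn = open_store()
print(process_once(conn, "msg-1", "alice", 5))  # True: first delivery is applied
print(process_once(conn, "msg-1", "alice", 5))  # False: redelivery is cancelled out
print(conn.execute("SELECT amount FROM balances").fetchone()[0])  # 105
```

In CockroachDB the same shape would rely on a primary key or unique constraint and an INSERT ... ON CONFLICT DO NOTHING statement, with the difference that the constraint is enforced across all regions.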

Summary

In this article, we look at the tradeoffs in distributed systems captured by the CAP theorem and its derivatives.

Using Spring Batch to migrate to CockroachDB

Kai Niemi — Sun, 29 Jan 2023 10:25:16 GMT

Introduction

Two previous articles covered the creation of simple data pipelines to keep systems in sync by using change data capture (CDC).

In this article, we'll use the same technique to migrate both PostgreSQL schema and data to CockroachDB in a streamlined manner by using Spring Batch. It's a batch framework wrapped in the same pipeline tool used in the previous posts. The main difference is that it's not using CDC but just batches of SQL DDL and DML statements.

Setup

Prerequisites:

PostgreSQL
CockroachDB
Pipeline, an open-source Java tool built on top of Spring Batch
Java 17+ Runtime
Linux / macOS

PostgreSQL Setup

First, create a sample schema in PostgreSQL with FK constraints and load it with some test data:

DDL

create table account
(
    id       int,
    balance  numeric(19, 2) not null,
    currency varchar(64)    not null,
    name     varchar(128)   not null,
    primary key (id)
);
create table transaction
(
    id           int  not null,
    booking_date date null,
    primary key (id)
);
create table transaction_item
(
    transaction_id  int            not null,
    account_id      int            not null,
    amount          numeric(19, 2) not null,
    currency        varchar(64)    not null,
    running_balance numeric(19, 2) not null,
    note            varchar(255),
    primary key (transaction_id, account_id)
);
alter table transaction_item
    add constraint fk_txn_item_ref_transaction
        foreign key (transaction_id) references transaction (id);
alter table transaction_item
    add constraint fk_txn_item_ref_account
        foreign key (account_id) references account (id);

DML

insert into account (id,balance,currency,name)
select no,
       500.00 + random() * 500.00,
       'USD',
       md5(random()::text)
from generate_series(1, 10) no;
insert into transaction (id,booking_date)
select no,
       now()::date
from generate_series(1, 1000) no;
insert into transaction_item (transaction_id,account_id,amount,currency,running_balance,note)
select no,
       round(1 + random() * 9),
       500.00 + random() * 500.00,
       'USD',
       500.00 + random() * 500.00,
       'Cockroaches can eat anything'
from generate_series(1, 1000) no;

CockroachDB Setup

Create a local cluster of three nodes (or one, it doesn't matter for this article):

cockroach start --port=26257 --http-port=8080 --advertise-addr=localhost:26257 --join=localhost:26257 --insecure --store=datafiles/n1 --background
cockroach start --port=26258 --http-port=8081 --advertise-addr=localhost:26258 --join=localhost:26257 --insecure --store=datafiles/n2 --background
cockroach start --port=26259 --http-port=8082 --advertise-addr=localhost:26259 --join=localhost:26257 --insecure --store=datafiles/n3 --background
cockroach init --insecure --host=localhost:26257

Next, set up a database called crdb_test that will become populated with the PostgreSQL schema and data:

cockroach sql --insecure --host=localhost:26257 -e "CREATE database crdb_test"

Pipeline Setup

Pipeline is a tool that wraps Spring Batch and it will do all the work for us. Initially, clone the repo and build it locally:

git clone git@github.com:kai-niemi/roach-pipeline.git pipeline
cd pipeline
chmod +x mvnw
./mvnw clean install

The executable jar is now available under the target folder.

Start it up with (change URL and credentials to match your PSQL setup):

java -jar target/pipeline.jar \
--pipeline.template.source.url=jdbc:postgresql://localhost:5432/crdb_test \
--pipeline.template.source.username=postgres \
--pipeline.template.source.password=**** \
"--pipeline.template.target.url=jdbc:postgresql://localhost:26257/crdb_test?sslmode=disable"

This configures the source and target datasources which are used to pre-populate batch job forms.

Configure the Pipeline

Now we are ready to copy data from PostgreSQL to CockroachDB using the sql2sql REST endpoint.

It works by first requesting a form for each table to be copied. The form will be pre-filled with information (from pipeline.template.source.*) except for the CREATE TABLE statement that you need to enter manually. The reason is that there's no support for SHOW CREATE TABLE in PostgreSQL, which is what this tool uses for introspection.

Generate Form Templates

Let's get a form template for each table in three separate REST API calls.

curl -X GET "http://localhost:8090/sql2sql/form?table=account" > account.json
curl -X GET "http://localhost:8090/sql2sql/form?table=transaction" > transaction.json
curl -X GET "http://localhost:8090/sql2sql/form?table=transaction_item" > transaction_item.json

Add a CREATE TABLE statement to each form JSON file. It will be used to create the tables in CockroachDB. You can copy the statements from PSQL using pg_dump:

pg_dump --dbname=crdb_test --table=account --schema-only
pg_dump --dbname=crdb_test --table=transaction --schema-only
pg_dump --dbname=crdb_test --table=transaction_item --schema-only

Then add the DDL parts to the createQuery parameter in the JSON form files. If you also add IF NOT EXISTS to createQuery, then the batch jobs will be fully repeatable.

Example:

"CREATE TABLE IF NOT EXISTS public.account (\nid integer NOT NULL,\nbalance numeric(19,2) NOT NULL,\ncurrency character varying(64) NOT NULL,\nname character varying(128) NOT NULL\n); ALTER TABLE ONLY public.account\n    ADD CONSTRAINT IF NOT EXISTS account_pkey PRIMARY KEY (id);"

Here is just the account.json file:

{
  "_links" : {
    "self" : {
      "href" : "http://localhost:8090/sql2sql/form?table=account"
    }
  },
  "table" : "account",
  "restartExecutionId" : 0,
  "sourceUrl" : "jdbc:postgresql://localhost:5432/crdb_test",
  "sourceUsername" : "postgres",
  "sourcePassword" : "****",
  "targetUrl" : "jdbc:postgresql://localhost:26257/crdb_test?sslmode=disable",
  "targetUsername" : "root",
  "targetPassword" : "",
  "concurrency" : 8,
  "chunkSize" : 32,
  "linesToSkip" : 0,
  "pageSize" : 32,
  "sortKeys" : "id ASC",
  "selectClause" : "SELECT *",
  "fromClause" : "FROM account",
  "whereClause" : "WHERE 1=1",
  "insertQuery" : "UPSERT INTO account(id,balance,currency,name) VALUES (:id,:balance,:currency,:name)",
  "createQuery" : "CREATE TABLE IF NOT EXISTS public.account (\nid integer NOT NULL,\nbalance numeric(19,2) NOT NULL,\ncurrency character varying(64) NOT NULL,\nname character varying(128) NOT NULL\n); ALTER TABLE ONLY public.account\n    ADD CONSTRAINT IF NOT EXISTS account_pkey PRIMARY KEY (id);"
}
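Editing the createQuery field by hand is error-prone because the DDL must be escaped as a JSON string. As a convenience, a small Python helper could do it. This is only a sketch: the helper names are made up, and the only thing assumed about the form is the createQuery key shown above.

```python
import json


def with_create_query(form, ddl):
    """Return a copy of a form dictionary with its createQuery field set to the DDL."""
    updated = dict(form)
    updated["createQuery"] = ddl
    return updated


def update_form_file(path, ddl):
    """Rewrite a form JSON file in place; json.dump takes care of escaping newlines."""
    with open(path) as handle:
        form = json.load(handle)
    with open(path, "w") as handle:
        json.dump(with_create_query(form, ddl), handle, indent=2)


ddl = "CREATE TABLE IF NOT EXISTS public.account (\nid integer NOT NULL\n);"
print(json.dumps(with_create_query({"table": "account"}, ddl), indent=2))
```

Calling update_form_file("account.json", ddl) with the statements taken from pg_dump would then produce a file ready to be posted back.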

Submit Batch Jobs

The next step is to POST the forms back, which will register and start the jobs. Because we use foreign keys in PSQL, the jobs will need to be registered in a sorted topology order based on the foreign key constraints.

To find out the order, we can simply ask the tool:

curl -X GET http://localhost:8090/datasource/source-tables | grep topologyOrder

Which tells you something like:

  "topologyOrder" : "transaction,account,transaction_item"
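The ordering itself is a plain topological sort of the tables by their foreign key references. The following Python sketch shows the principle with the standard library; it is not how the Pipeline tool computes it, and the dictionary of references is written out by hand from the DDL above.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables its foreign keys reference (its parents).
FOREIGN_KEYS = {
    "account": set(),
    "transaction": set(),
    "transaction_item": {"transaction", "account"},
}


def load_order(foreign_keys):
    """Return the tables so that every referenced table comes before its dependents."""
    return list(TopologicalSorter(foreign_keys).static_order())


print(",".join(load_order(FOREIGN_KEYS)))
```

The two parent tables may come out in either order, but transaction_item always comes last, which is the only constraint that matters when submitting the jobs.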

All we need to do now is submit the forms in that given order:

curl -d "@transaction.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql
curl -d "@account.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql
curl -d "@transaction_item.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql

On each of these POST requests, you will get a 202 (Accepted) response with a pipeline:execution link relation and an HREF to a resource if the job was successfully submitted. A 202 only means that the request was accepted for async processing. If you follow that link URI, it takes you to the registered job with the current status.

$ curl -d "@account.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1342    0   336  100  1006   1648   4935 --:--:-- --:--:-- --:--:--  6610
{
  "_links" : {
    "pipeline:execution" : {
      "href" : "http://localhost:8090/jobs/execution/future/19353fac-8578-46b0-b417-7ba5bcdd03e6"
    },
    "curies" : [ {
      "href" : "http://localhost:8090/rels/{rel}",
      "name" : "pipeline",
      "templated" : true
    } ]
  },
  "message" : "SQL2SQL Job Accepted"
}

$ curl -X GET http://localhost:8090/jobs/execution/future/19353fac-8578-46b0-b417-7ba5bcdd03e6

Once complete, you should see a complete copy of the PostgreSQL database in CockroachDB.

Conclusion

In this article, we looked at creating a simple batch-oriented SQL to SQL pipeline at the table level between PostgreSQL and CockroachDB.

Application migration from PostgreSQL to CockroachDB

Kai Niemi — Sat, 28 Jan 2023 14:12:54 GMT

This article provides a hands-on guide for migrating a business service component using PostgreSQL to CockroachDB. Specifically, when the application stack is roughly based on the following:

Spring Boot + Spring Data
JPA and Hibernate
PostgreSQL 9+

Getting Started

First off, this is not a complete guide for data migrations. For schema and data migration techniques, see the official Migrate from PostgreSQL guide provided by Cockroach Labs.

The focus here is instead on the application tier and what type of refactoring efforts and other key considerations there are from an application architecture standpoint during a migration.

Let's begin by describing the process in the form of a gap analysis.

Current State

You have just migrated your PostgreSQL database to CockroachDB and the new cluster runs like clockwork. Now you turn your focus to the codebase, thinking: OK, what's next to make this thing work?

Future State

You have migrated both the database and the application codebase, so that the application that previously used PostgreSQL can also use CockroachDB. Not necessarily both at the same time, but configurable at startup time using Spring profiles.

The Gap

CockroachDB is compatible with the PostgreSQL v3.0 wire protocol (pg-wire) and works with the majority of tools that also work with PostgreSQL. This includes most PostgreSQL drivers, object-relational mapping frameworks, database schema version management tools, etc.

It is, however, not a clone or derivative of the PostgreSQL codebase, but a completely separate implementation written from the ground up in Go, for solving different business problems than those PostgreSQL and similar single-leader database architectures were crafted for. For more background see: Why is CockroachDB compatible with PostgreSQL?

Some features in PostgreSQL are not available in CockroachDB, most notably triggers and stored procedures. Depending on your development philosophy, this may not be a huge loss since adding business logic into the database itself could be a no-no.

For triggers, there's something conceptually similar but stronger called change-data-capture (CDC). For stored procedures, there's currently no other option than to move that logic to the app tier which is probably what you would want to do anyway.

CockroachDB also runs under the serializable isolation level, meaning that applications are more subject to transient retriable errors for explicit transactions under contended workloads. Hence a transaction retry strategy is recommended, even if contention can be largely avoided by design.

Bridging the Gap

The effort level for an application-level migration depends much on the characteristics of the workload, the scope of the project and what type of dependencies there are on PostgreSQL features not available in CockroachDB.

The first order of business is to assess which PostgreSQL features currently in use are not implemented in CockroachDB. For each unsupported item, you can check the product roadmap for upcoming support or look for an alternative approach. Rewriting or refactoring application code also presents an opportunity to re-design and work through the technical debt mountain.

For tracking issues and more details, see: https://www.cockroachlabs.com/docs/stable/sql-feature-support.html#miscellaneous

Feature	Alternative
Stored procedures	Migrate SP-level logic to app-tier
Triggers	CDC
Events	See SQL Feature Support
FULLTEXT functions and indexes	See FAQ
Drop primary key	See SQL Feature Support
XML functions	See SQL Feature Support
Column-level privileges	See SQL Feature Support
XA syntax	Inbox and outbox pattern
Creating a database from a template	See SQL Feature Support
Dropping a single partition from a table	See SQL Feature Support
Foreign data wrappers	See SQL Feature Support

Migration Steps

Now let's cover some fundamental migration steps. Some of these may already be in place in your application since it's all common good practice regardless of the RDBMS used. These steps are based on a specific tech stack, as outlined in the overview.

Step 1: Check Dependencies

CockroachDB works perfectly fine with the PostgreSQL JDBC driver (pg-jdbc), so it's just a matter of picking the appropriate version.

If you are using the spring-boot-starter-parent as your parent Maven POM (as you should), then the driver version is inherited via postgresql.version. For Spring Boot 2.7.6 it's 42.3.8 and for Spring Boot 3.0.2 it's 42.5.1. In most cases, the pg-jdbc driver is pulled in transitively, so it's often not necessary to define it in your project pom.xml.

<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>${postgresql.version}</version>
</dependency>

Flyway is a great schema management tool integrated into spring-boot, meaning there's auto-configuration for it when detected on the classpath.

<dependency>
    <groupId>org.flywaydb</groupId>
    <artifactId>flyway-core</artifactId>
</dependency>

Flyway is very simple to use with its naming convention on the DDL SQL files that are in plain SQL.

Yaml configuration example:

spring:
  flyway:
    locations: classpath:db/migration

Migration script example:

[src/main/resources/db/migration/V1_1__create.sql]
create table .. (..);
create table .. (..);

Another equally capable tool is Liquibase, which is more centered around XML configuration.

<dependency>
    <groupId>org.liquibase</groupId>
    <artifactId>liquibase-core</artifactId>
</dependency>

Step 2: Add Connection Details

CockroachDB uses the same JDBC driver prefix as PostgreSQL, with different connection parameters depending on whether the cluster is secure or insecure (the latter being for development purposes only). In this example, we are connecting to an insecure local cluster.

[src/main/resources/application.yml]
spring:
  datasource:
    url: jdbc:postgresql://localhost:26257/sleipner?sslmode=disable
    driver-class-name: org.postgresql.Driver
    username: root
    password:

In this example, we are connecting to a secure Cockroach Dedicated cluster with user credentials:

spring:
  datasource:
    url: jdbc:postgresql://odin-gc8.gcp-europe-west4.cockroachlabs.cloud:26257/hugin?sslmode=require
    driver-class-name: org.postgresql.Driver
    username: admin
    password: ...

In this example, we are connecting to a secure Cockroach Dedicated cluster with user credentials and the root certificate embedded in the executable JAR:

spring:
  datasource:
    url: jdbc:postgresql://odin-gc8.gcp-europe-west4.cockroachlabs.cloud:26257/munin?sslmode=verify-full&sslfactory=org.postgresql.ssl.SingleCertValidatingFactory&sslfactoryarg=classpath:certs/sleipner-secure.crt
    driver-class-name: org.postgresql.Driver
    username: root
    password: ...

Step 3: Setup Connection Pooling

Connection pooling is very important for both performance and graceful JDBC connection management regardless of RDBMS. Connection establishment is an expensive operation and running through that for each short-lived transaction would add significant overhead otherwise.

For the Java platform, HikariCP is becoming the de-facto standard for client-side connection pooling, and it's also the default pooling option in Spring Boot.

Example configuration:

spring:
  datasource:
    url: "jdbc:postgresql://localhost:26257/sleipner?sslmode=disable"
    username: root
    password:
    hikari:
      pool-name: my-service-pool
      maximum-pool-size: 48
      minimum-idle: 48

In this example, we configured the two most important parameters for the connection pool (max size and min idle). The maximum pool size should be set to 4 times the total vCPU count of the CockroachDB cluster, divided by the number of connection pool instances. The value of 48 therefore maps to a 12 vCPU cluster (3 nodes with 4 vCPUs each) and one application instance with a single datasource / connection pool.
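The sizing rule can be written down as a small calculation. This is a minimal sketch; the class and method names (PoolSizing, maxPoolSize) are made up for illustration and are not part of HikariCP or Spring Boot:

```java
// Hypothetical helper illustrating the rule of thumb from the text:
// (4 x total cluster vCPUs) / number of connection pool instances.
final class PoolSizing {

    private PoolSizing() {
    }

    static int maxPoolSize(int totalClusterVcpus, int poolInstances) {
        if (totalClusterVcpus <= 0 || poolInstances <= 0) {
            throw new IllegalArgumentException("vCPU count and pool instances must be positive");
        }
        // Integer division rounds down; never go below one connection per pool.
        return Math.max(1, (4 * totalClusterVcpus) / poolInstances);
    }

    public static void main(String[] args) {
        // 3 nodes x 4 vCPUs = 12 vCPUs, one application instance with one pool
        System.out.println(maxPoolSize(12, 1)); // prints 48
        // The same cluster shared by four application instances
        System.out.println(maxPoolSize(12, 4)); // prints 12
    }
}
```

Treat the result as a starting point to validate under load rather than a hard limit, since the rule itself is only a ballpark.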

Setting the minimum idle to the same value makes the pool act as a fixed-size pool, which is ideal for bursty workloads (this is the default Hikari behavior). The max pool size is not an exact figure since all workloads are different; it's more of a ballpark number beyond which it's better to start applying backpressure client-side. "Better" in the sense that there are diminishing returns in having more concurrent, active connections due to scheduling and context-switching overhead.

    @Bean
    @Primary
    public DataSource primaryDataSource() {
        return hikariDataSource();
    }

    @Bean
    @ConfigurationProperties("spring.datasource.hikari")
    public HikariDataSource hikariDataSource() {
        HikariDataSource ds = dataSourceProperties()
                .initializeDataSourceBuilder()
                .type(HikariDataSource.class)
                .build();
        ds.addDataSourceProperty("reWriteBatchedInserts", "true");
        ds.setPoolName("my-service-pool");
        ds.setAutoCommit(false);
        ds.setMaximumPoolSize(80);
        ds.setMinimumIdle(80);
        return ds;
    }

The reWriteBatchedInserts property (case sensitive) enables driver-level rewriting of INSERT statements to use multi-value inserts with JDBC batch statements.

Setting auto commit to false should be used in combination with Hibernate's Environment.CONNECTION_PROVIDER_DISABLES_AUTOCOMMIT set to true, which is a minor optimization that skips the auto-commit check. This is mostly helpful when you use a transaction strategy with explicit transaction boundaries, such as @Transactional(propagation=REQUIRES_NEW).

For more details, see Connection pooling with Spring Boot and CockroachDB.

Step 4: Implement Retry Logic

Any client-server remote call (RPC) over the network can potentially fail with a transient, retriable error. As a good design principle, all remote calls should therefore have a mechanism for doing retries, including calls to the database.

Transient errors are less likely to happen in PostgreSQL because it runs in read-committed (RC) isolation level by default, whereas CockroachDB runs in serializable (1SR). For a contended workload with overlapping concurrent reads and writes to the same keys, serialization conflicts are therefore more likely. CockroachDB is also distributed by nature and depends on semi-synchronized clocks for correctness, which under some circumstances may also force transactions to be retried.

The How to retry failed transactions post goes into detail on how retry logic can be easily implemented in a Spring Boot application. The Spring Annotations for CockroachDB post goes into more detail on using meta-annotations to declare transaction boundaries across the codebase.
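To give a feel for what such retry logic boils down to, here is a minimal sketch in plain Java. It is not the approach from the linked posts; the class and method names (TransactionRetrier, executeWithRetry) are hypothetical. It relies on the fact that retriable serialization errors are reported with SQLSTATE 40001, and it assumes the supplied callable starts a fresh transaction on every attempt:

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of Spring, Hibernate or the JDBC driver.
final class TransactionRetrier {

    // SQLSTATE 40001 (serialization_failure) marks a retriable transaction error.
    static final String SERIALIZATION_FAILURE = "40001";

    private TransactionRetrier() {
    }

    static <T> T executeWithRetry(Callable<T> transaction, int maxAttempts, long baseDelayMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                // The callable is assumed to begin and commit its own transaction.
                return transaction.call();
            } catch (SQLException e) {
                boolean retriable = SERIALIZATION_FAILURE.equals(e.getSQLState());
                if (!retriable || attempt >= maxAttempts) {
                    throw e;
                }
                // Exponential backoff with random jitter to spread out contending clients.
                long backoff = baseDelayMillis * (1L << (attempt - 1));
                long jitter = ThreadLocalRandom.current().nextLong(backoff + 1);
                Thread.sleep(backoff + jitter);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = executeWithRetry(() -> {
            // Simulate two serialization conflicts before succeeding.
            if (++calls[0] < 3) {
                throw new SQLException("restart transaction", SERIALIZATION_FAILURE);
            }
            return "committed";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a Spring application the SQLException is typically wrapped in a DataAccessException subclass, so a real implementation would need to inspect the cause chain, and the retry loop has to sit outside the transaction boundary.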

Step 5: Review JPA and Hibernate Configuration

CockroachDB Dialect

CockroachDB provides a custom Hibernate dialect that is derived from the PostgreSQL dialect.

spring:
  jpa:
    properties:
      hibernate:
        # Hibernate 5.x
        dialect: org.hibernate.dialect.CockroachDB201Dialect
        # Hibernate 6.x
        # dialect: org.hibernate.dialect.CockroachDialect
        jdbc:
          lob:
            non_contextual_creation: true

The non_contextual_creation is optional but it prevents the warning about createBlob if you find that annoying.

Primary Keys

The JPA specification offers four different primary key generation strategies:

AUTO - The persistence provider attempts to figure out the best strategy based on database dialect and key type (default).
IDENTITY - The persistence provider depends on a database-generated ID.
SEQUENCE - The persistence provider depends on a database sequence.
TABLE - Legacy method to simulate sequences (avoid).

The most performant option, and the ideal one for data distribution, is to use UUID as the primary key type together with AUTO. It also works well with batch INSERTs, which are automatically disabled when using IDENTITY. SEQUENCE is not ideal since sequential keys in primary and secondary indexes may cause range hot spots (for which hash sharding can help a bit). TABLE isn't used much these days and is considered legacy.

To consistently apply the same ID strategy for all entities, you could create an AbstractEntity implementing Persistable that also hosts the id column. Alternatively, move the id to the concrete subclass.

@Entity
@Table(name = "account")
class AccountEntity extends AbstractEntity<UUID> {
  @Id
  @Column(updatable = false, nullable = false)
  @GeneratedValue(strategy = GenerationType.AUTO)
  private UUID id;
  ...
}

// ---

import org.springframework.data.domain.Persistable;

@MappedSuperclass
abstract class AbstractEntity<ID> implements Persistable<ID> {
    @Transient
    private boolean isNew = true;

    @PostPersist
    @PostLoad
    void markNotNew() {
        this.isNew = false;
    }

    @Override
    public boolean isNew() {
        return isNew;
    }
}

There are more details in this article.

Batch Statements

Using batch INSERTs with multi-value rewrites in the PostgreSQL driver will have a big impact on performance. To enable INSERT rewrites, just set the following pg-jdbc driver property either in the connection URL or as a datasource property:

spring:
  datasource:
    url: jdbc:postgresql://localhost:26257/spring_boot?sslmode=disable&reWriteBatchedInserts=true

Alternatively, in the datasource bean factory method:

    @Bean
    @ConfigurationProperties("spring.datasource.hikari")
    public HikariDataSource hikariDataSource() {
        HikariDataSource ds = dataSourceProperties()
                .initializeDataSourceBuilder()
                .type(HikariDataSource.class)
                .build();
        ds.addDataSourceProperty("reWriteBatchedInserts", "true");
        return ds;
    }

Lastly, you also need to enable batching in Hibernate (batch_size > 0):

spring:
  jpa:
    show-sql: true
    hibernate:
      ddl-auto: none
    properties:
      hibernate:
        order_updates: true
        order_inserts: true
        jdbc:
          batch_versioned_data: true
          batch_size: 128
          fetch_size: 256
        cache:
          use_second_level_cache: false
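To make the multi-value rewrite less abstract, the sketch below builds the kind of statement a batch of single-row INSERTs is conceptually collapsed into. This is an illustration only: the actual rewriting happens inside the pg-jdbc driver and its exact output may differ, and the class and method names (MultiValueInsert, build) are hypothetical:

```java
import java.util.Collections;
import java.util.List;

// Conceptual illustration only; not how the driver is implemented.
final class MultiValueInsert {

    private MultiValueInsert() {
    }

    // Builds one INSERT with a placeholder group per row,
    // e.g. VALUES (?, ?), (?, ?) for two rows with two columns each.
    static String build(String table, List<String> columns, int rowCount) {
        if (columns.isEmpty() || rowCount < 1) {
            throw new IllegalArgumentException("Need at least one column and one row");
        }
        String rowPlaceholders =
                "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        return "INSERT INTO " + table
                + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rowCount, rowPlaceholders));
    }

    public static void main(String[] args) {
        // Three batched single-row inserts become one round trip with three value groups.
        System.out.println(build("account", List.of("id", "balance"), 3));
        // prints: INSERT INTO account (id, balance) VALUES (?, ?), (?, ?), (?, ?)
    }
}
```

Fewer, larger statements mean fewer network round trips, which is where most of the performance gain comes from.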

There are more details in this article.

Step 6: Profile Activation

One quite useful approach when switching to CockroachDB is to use Spring Profiles. That way you can keep the CockroachDB connection and Hibernate dialect specifics in a separate application.yml file which is activated at startup time.

For example:

src/main/resources/
  application.yml      # contains common spring boot config
  application-psql.yml # contains datasource config for PSQL
  application-crdb.yml # contains datasource config for CockroachDB

Then it's a matter of activating the proper profile at startup:

java -jar app.jar --spring.profiles.active=psql
# or
java -jar app.jar --spring.profiles.active=crdb

Next Steps

This guide only scratches the surface of data and service migration projects, which can have different types of challenges. One particular challenge often arises from zero downtime requirements, which effectively require the service to run both databases at the same time and gradually "strangle" the old database and logic. To implement this pattern, inspired by the strangler fig (below), you could for example set up separate instances and use CDC to stream writes from the current primary to the other.

Additional Resources

Summary

This article provides a hands-on guide for migrating a business service component from PostgreSQL to CockroachDB, focusing on application architecture and refactoring efforts. By following these steps, you can ensure a smooth transition from PostgreSQL to CockroachDB for your application.

Using the CDC Webhook Sink in CockroachDB

Kai Niemi — Sun, 22 Jan 2023 15:05:18 GMT

Introduction

In this article, we'll demonstrate creating a simple streaming data pipeline using a small micro-batching tool and CockroachDB's CDC Webhook sink.

Setup

Prerequisites:

CockroachDB, with a trial enterprise license.
Pipeline, an open-source Java tool built on top of Spring Batch
Java 17+ Runtime
Linux / macOS

CockroachDB Setup

Initially, create a local cluster of three nodes (or one node, not important):

cockroach start --port=26257 --http-port=8080 --advertise-addr=localhost:26257 --join=localhost:26257 --insecure --store=datafiles/n1 --background
cockroach start --port=26258 --http-port=8081 --advertise-addr=localhost:26258 --join=localhost:26257 --insecure --store=datafiles/n2 --background
cockroach start --port=26259 --http-port=8082 --advertise-addr=localhost:26259 --join=localhost:26257 --insecure --store=datafiles/n3 --background
cockroach init --insecure --host=localhost:26257

Next, set up a source database called tpcc and a target database called tpcc_copy:

cockroach sql --insecure --host=localhost:26257 -e "CREATE database tpcc"
cockroach sql --insecure --host=localhost:26257 -e "CREATE database tpcc_copy"

Lastly, load a small TPC-C workload fixture (schema and data) to the source database:

cockroach workload fixtures import tpcc --warehouses=10 'postgres://root@localhost:26257?sslmode=disable'

The objective is to have a few of these tables copied and mirrored in the target database.

Pipeline Setup

Initially, clone the repo and build it locally:

git clone git@github.com:kai-niemi/roach-pipeline.git pipeline
cd pipeline
chmod +x mvnw
./mvnw clean install

The executable jar is now available under the target folder.

Try it out with:

java -jar target/pipeline.jar --help

Configure the Pipeline

We are now ready to create cdc2sql jobs for each TPC-C table we want to be streamed from the source to the target database.

We'll use the REST API of Pipeline. If you remember, REST is really about following links and POSTing forms (rather than concatenating URIs). The tool provides a real hypermedia-driven API, but we're not going to build a smart client app for it; we'll just use cURL instead.

Generate Form Templates

First off, we get a form template for each table. The form is going to be pre-populated with SQL statements for the table. We are only using a subset of the TPC-C workload tables, but the process is the same for all tables *).

*) There's also an option to get a form bundle with all tables in one go, but it's WIP.

curl -X GET "http://localhost:8090/cdc2sql/form?table=warehouse" > warehouse-cdc2sql.json
curl -X GET "http://localhost:8090/cdc2sql/form?table=district" > district-cdc2sql.json
curl -X GET "http://localhost:8090/cdc2sql/form?table=customer" > customer-cdc2sql.json

Feel free to inspect the JSON files which should give you an idea of how the batch jobs are configured and run. At this point, we haven't started anything yet. The JSON files typically don't need much editing if the template settings are properly set (everything defaults to using localhost).

Submit Batch Jobs

The next step is to POST the forms back, which will register the jobs and start them up. The jobs need to be registered in the sorted topology order of the foreign key constraints (warehouse <- district <- customer) since we'll be creating tables on-the-fly.

curl -d "@warehouse-cdc2sql.json" -H "Content-Type:application/json" -X POST http://localhost:8090/cdc2sql
curl -d "@district-cdc2sql.json" -H "Content-Type:application/json" -X POST http://localhost:8090/cdc2sql
curl -d "@customer-cdc2sql.json" -H "Content-Type:application/json" -X POST http://localhost:8090/cdc2sql

Take note of the CREATE CHANGEFEED statements in the responses. We will need them for the next step, which is to configure the change feeds.

Connect to the source database and execute (after changing URIs):

CREATE CHANGEFEED FOR TABLE warehouse
    INTO 'webhook-https://localhost:8443/cdc2sql/5803c5a2-707a-4fb1-8faf-615d95896664?insecure_tls_skip_verify=true'
    WITH updated, resolved='15s';
CREATE CHANGEFEED FOR TABLE district
    INTO 'webhook-https://localhost:8443/cdc2sql/5803c5a2-707a-4fb1-8faf-615d95896664?insecure_tls_skip_verify=true'
    WITH updated, resolved='15s';
CREATE CHANGEFEED FOR TABLE customer
    INTO 'webhook-https://localhost:8443/cdc2sql/5803c5a2-707a-4fb1-8faf-615d95896664?insecure_tls_skip_verify=true'
    WITH updated, resolved='15s';

You should now see the target database starting to fill up and eventually reach the same state as the source database. If you also run the TPC-C workload, you will see its changes reflected in the target as well.

Conclusion

In this article, we looked at creating a simple streaming data pipeline at table level between two separate CockroachDB databases using the CDC webhook sink.

Using the CDC Kafka Sink in CockroachDB

Kai Niemi — Sun, 22 Jan 2023 15:04:37 GMT

Introduction

In this article, we'll demonstrate creating a simple streaming data pipeline using a small micro-batching tool and CockroachDB's CDC Kafka sink.

Setup

Prerequisites:

CockroachDB, with a trial enterprise license.
Kafka
Pipeline, an open-source Java tool built on top of Spring Batch
Java 17+ Runtime
Linux / macOS

CockroachDB Setup

Initially, create a local cluster of three nodes (or one node, not important):

cockroach start --port=26257 --http-port=8080 --advertise-addr=localhost:26257 --join=localhost:26257 --insecure --store=datafiles/n1 --background
cockroach start --port=26258 --http-port=8081 --advertise-addr=localhost:26258 --join=localhost:26257 --insecure --store=datafiles/n2 --background
cockroach start --port=26259 --http-port=8082 --advertise-addr=localhost:26259 --join=localhost:26257 --insecure --store=datafiles/n3 --background
cockroach init --insecure --host=localhost:26257

Then, set up the source database tpcc and the target database tpcc_copy:

cockroach sql --insecure --host=localhost:26257 -e "CREATE database tpcc"
cockroach sql --insecure --host=localhost:26257 -e "CREATE database tpcc_copy"

Finally, load the TPC-C fixture (schema and data) to the source database:

cockroach workload fixtures import tpcc --warehouses=10 'postgres://root@localhost:26257?sslmode=disable'

Kafka Setup

Ref: https://kafka.apache.org/quickstart

Initially, set up a local Kafka server that we'll use as the CDC sink (using KRaft rather than ZooKeeper):

tar -xzf kafka_2.13-3.3.1.tgz
cd kafka_2.13-3.3.1
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties

Optionally, start a console consumer to see the change events flashing by later. In this example for the warehouse table/topic:

bin/kafka-console-consumer.sh --topic warehouse --from-beginning --bootstrap-server localhost:9092

Pipeline Setup

Initially, clone the repo and build it locally:

git clone git@github.com:kai-niemi/roach-pipeline.git pipeline
cd pipeline
chmod +x mvnw
./mvnw clean install

The executable jar is now available under the target folder. Try it out with:

java -jar target/pipeline.jar --help

Configure the Pipeline

Now we are ready to create kafka2sql jobs for each TPC-C table we want to be streamed from the source to the target database.

Generate Form Templates

First off, we get form templates that are going to be pre-populated with SQL statements for each table in question. We are only using a subset of the TPC-C workload tables, but the process is the same for all tables.

curl -X GET "http://localhost:8090/kafka2sql/form?table=warehouse" > warehouse-kafka2sql.json
curl -X GET "http://localhost:8090/kafka2sql/form?table=district" > district-kafka2sql.json
curl -X GET "http://localhost:8090/kafka2sql/form?table=customer" > customer-kafka2sql.json

Feel free to inspect the JSON files which should give an idea of how the batch jobs are configured and run. At this point, we haven't started anything yet. The JSON files typically don't need any editing if the template settings are properly set (everything defaults to using localhost).

Submit Batch Jobs

The next step is to POST the forms back which will register the jobs and start them up. The jobs need to be registered in the sorted topology order of the foreign key constraints (warehouse <- district <- customer) since we'll be creating tables on-the-fly.

curl -d "@warehouse-kafka2sql.json" -H "Content-Type:application/json" -X POST http://localhost:8090/kafka2sql
curl -d "@district-kafka2sql.json" -H "Content-Type:application/json" -X POST http://localhost:8090/kafka2sql
curl -d "@customer-kafka2sql.json" -H "Content-Type:application/json" -X POST http://localhost:8090/kafka2sql
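For intuition on where the submission order comes from, the foreign key dependencies can be sorted topologically so that parent tables come before the tables that reference them. The following is a minimal sketch using Kahn's algorithm; it is not how Pipeline computes the order internally, and the class name TopologyOrder is hypothetical:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: orders tables so that referenced (parent) tables come first.
final class TopologyOrder {

    private TopologyOrder() {
    }

    // dependsOn maps each table to the tables it references through foreign keys.
    static List<String> sort(Map<String, List<String>> dependsOn) {
        Map<String, Integer> pending = new LinkedHashMap<>();       // unresolved parents per table
        Map<String, List<String>> children = new LinkedHashMap<>(); // parent -> referencing tables
        for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
            pending.put(entry.getKey(), entry.getValue().size());
            for (String parent : entry.getValue()) {
                pending.putIfAbsent(parent, 0);
                children.computeIfAbsent(parent, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        pending.forEach((table, count) -> {
            if (count == 0) {
                ready.add(table);
            }
        });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String table = ready.remove();
            order.add(table);
            // Each child has one parent less to wait for.
            for (String child : children.getOrDefault(table, List.of())) {
                if (pending.merge(child, -1, Integer::sum) == 0) {
                    ready.add(child);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("Foreign key cycle detected");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> foreignKeys = new LinkedHashMap<>();
        foreignKeys.put("customer", List.of("district"));
        foreignKeys.put("district", List.of("warehouse"));
        foreignKeys.put("warehouse", List.of());
        System.out.println(sort(foreignKeys)); // prints [warehouse, district, customer]
    }
}
```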

The final step is to configure the Kafka change feeds for these three tables.

Connect to the source database and execute:

CREATE CHANGEFEED FOR TABLE warehouse INTO 'kafka://localhost:9092' WITH updated,resolved = '15s';
CREATE CHANGEFEED FOR TABLE district INTO 'kafka://localhost:9092' WITH updated,resolved = '15s';
CREATE CHANGEFEED FOR TABLE customer INTO 'kafka://localhost:9092' WITH updated,resolved = '15s';

You should see the target database starting to fill up and eventually reach the same state as the source database. If you also run the TPC-C workload, you will see its changes reflected in the target as well.

Conclusion

In this article, we looked at creating a simple streaming data pipeline at table level between two separate CockroachDB databases using the CDC Kafka sink.

Data Domiciling using Super Regions in CockroachDB

Kai Niemi — Sun, 15 Jan 2023 14:58:41 GMT

Introduction

Data domiciling is the art of controlling the placement of subsets of data in specific regions or locations. This is often required by privacy regulations like GDPR and The Wire Act in the US, where bet placements must not leave the state line where the bet was placed.

This presents an interesting technical challenge as far as databases are concerned: How do you still meet the database survival goals? For example, zone-level survival or even region-level survival when the data is not allowed to "leave" these boundaries?

You would like to avoid having to segment or shard your system in such a way that you end up with isolated islands deployed all across these different jurisdictions. That only adds operational complexity, risk and cost and also doesn't solve the survival problem.

What if you instead could use the database as one logical entity that stretches across all these locations while you can read, write and manage all data from any location? Without compromising on survival, consistency, transactional integrity, developer experience or data locality regulations, or having to redesign your apps or schemas?

This is complexity containment, where the challenges involved in providing these guarantees are moved from the business and application tier to the database itself, thereby offloading app developers to focus on domain-specific problems rather than data management problems.

In CockroachDB, this is achieved in two different ways depending on the survival goal defined for the database.

Data Domiciling with Zone Survival

Data domiciling in CockroachDB allows users to keep certain subsets of data in specific localities. For example, US data is kept only on nodes in the US, EU data on nodes in the EU and so forth, for compliance and performance reasons. It's transparent to applications and implemented by controlling the placement of specific row or table data using regional tables with the REGIONAL BY ROW and REGIONAL BY TABLE clauses.

Replica placement can be constrained through either placement restrictions or super regions. For zone-level survival, placement restrictions are the approach to use. This tells the database to disable non-voting replicas and confine the placement of voting replicas to the specified home region (applied at the replication zone level).

For region survival, you will need to use super regions since it's not possible to combine region survival with placement restrictions. Let's look at that in the next section.

For illustration, below is an example of using regional-by-row, with zone survival and placement restrictions:

In this diagram, there is one regional-by-row table (yellow, blue and purple ranges) and one global table (green range). The RBR tables have domiciled ranges in each region, and the home region is defined at the row level. Non-voting replicas are disabled due to the use of placement restrictions, which is why you see three of them rather than five. Global tables are always excluded and unaffected by placement restrictions, so you can see them spread across all three regions.

Next, let's look at a failure scenario in EU-1 where two nodes fail.

In the above diagram, forward progress is denied for the domiciled ranges in region EU-1 since there's no majority available (2 of 3 offline). All other table ranges are available, including the ones for the global table.
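The availability rule behind this scenario is plain majority arithmetic. A minimal sketch, with hypothetical names (Quorum, majority, canMakeProgress) that are not CockroachDB APIs:

```java
// Hypothetical helper illustrating majority-based consensus availability.
final class Quorum {

    private Quorum() {
    }

    // Smallest number of replicas that forms a majority.
    static int majority(int replicas) {
        return replicas / 2 + 1;
    }

    // A range can accept writes only while a majority of its replicas is reachable.
    static boolean canMakeProgress(int replicas, int offline) {
        return replicas - offline >= majority(replicas);
    }

    public static void main(String[] args) {
        System.out.println(canMakeProgress(3, 1)); // prints true:  2 of 3 replicas remain
        System.out.println(canMakeProgress(3, 2)); // prints false: the EU-1 scenario above
        System.out.println(canMakeProgress(5, 2)); // prints true:  3 of 5 replicas remain
    }
}
```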

This highlights the data domiciling challenge: How to provide region-level survival and data domiciling at the same time?

Data domiciling with zone survival is achieved with placement restrictions. A database can use PLACEMENT RESTRICTED to opt out of non-voting replicas, which could otherwise be placed outside of the regions indicated in zone configuration constraints.

In addition to data domiciling, PLACEMENT RESTRICTED can be used for the following:

Reduce the total amount of data in the cluster
Reduce the overhead of replicating data across a large number of regions

Note that global tables are not affected by PLACEMENT RESTRICTED and will still be placed in all database regions.

Data Domiciling with Region Survival

To implement data domiciling with region survival, you will need to use something called Super Regions. It was developed and introduced in CockroachDB 22.1 primarily for data domiciling requirements.

Super regions allow a user to define a set of regions in the database such that regional and regional-by-row tables located within the super region will have all of their replicas located within the super region.

In contrast to PLACEMENT RESTRICTED, which disables non-voting replicas (with implications for remote region read performance), super regions ensure that all replicas (both voting and non-voting) are placed within the super region.

This means that with super regions you get both data domiciling and region survivability, along with region-local latencies on reads and writes. The likelihood of a super region having a full outage is significantly lower than that of a single region, but should that happen, only access to the domiciled data would be refused.

Notice however that super regions rely on the underlying replication zone system, which was historically built for performance, not for domiciling. The replication system's top priority is to prevent the loss of data and it may override the zone configurations if necessary to ensure data durability.

In practical terms, this means that if there are not enough nodes in each region and super-region to satisfy the survival goal (replication factor), then it will prioritize avoiding data loss and place replicas outside of the domiciling constraints, which would be a violation of the placement constraints.

The ideal pattern for performance, availability and compliance with super regions is therefore three nodes per region. This means you can lose a node in any region without needing to perform reads from a remote region, and writes can reach a consensus agreement without cross-region coordination.

These additional guarantees come with a cost in terms of the number of nodes required. For two super regions, you effectively need 18 nodes in total.

For example:

Super-region US contains:
- us-east-1 (a,b,c)
- us-east-2 (a,b,c)
- us-west-1 (a,b,c)

Super-region EU contains:
- eu-west-1 (a,b,c)
- eu-west-2 (a,b,c)
- eu-north-1 (a,b,c)

With 3 nodes in each region, that sums up to 18 nodes. If each node is sized to 2 vCPUs, then the total vCPU count is 36, which isn't much more than a typical single-region CockroachDB cluster.
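The node and vCPU totals follow directly from the layout. As a small sketch, with hypothetical names (SuperRegionSizing, totalNodes, totalVcpus) chosen for illustration:

```java
// Hypothetical helper for the cluster sizing arithmetic in the text.
final class SuperRegionSizing {

    private SuperRegionSizing() {
    }

    static int totalNodes(int superRegions, int regionsPerSuperRegion, int nodesPerRegion) {
        return superRegions * regionsPerSuperRegion * nodesPerRegion;
    }

    static int totalVcpus(int nodes, int vcpusPerNode) {
        return nodes * vcpusPerNode;
    }

    public static void main(String[] args) {
        int nodes = totalNodes(2, 3, 3);           // US and EU, 3 regions each, 3 nodes per region
        System.out.println(nodes);                 // prints 18
        System.out.println(totalVcpus(nodes, 2));  // prints 36
        System.out.println(totalNodes(3, 3, 3));   // prints 27: one more super region adds 9 nodes
    }
}
```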

For every additional super-region, the ideal is adding 9 additional nodes in 3 regions, like in this example:

In this diagram, there are 3 super regions with domiciled data and it's still one single logical database.

Example

In this example, we will deploy a global cluster stretching from the EU to the west and east coasts of the US. We'll use 18 nodes in total, but run them all locally, listening on different ports, since it's just a demo. The example will not be exposed to the same type of cross-link latencies, but you could always add in fake network delays using different tooling.

This demo will focus mainly on the configuration and usability aspects of a setup like this.

Cluster Setup

To set this up, we will use a simple script to start 18 nodes on a local machine.

#!/bin/bash
# Starts 18 local CockroachDB nodes (three per region) and initializes the cluster.

portbase=26258
httpportbase=8081
host=localhost

LOCALITY_ZONE=(
  'region=eu-north-1,zone=eu-north-1a'
  'region=eu-north-1,zone=eu-north-1b'
  'region=eu-north-1,zone=eu-north-1c'
  'region=eu-west-1,zone=eu-west-1a'
  'region=eu-west-1,zone=eu-west-1b'
  'region=eu-west-1,zone=eu-west-1c'
  'region=eu-west-2,zone=eu-west-2a'
  'region=eu-west-2,zone=eu-west-2b'
  'region=eu-west-2,zone=eu-west-2c'
  'region=us-east-1,zone=us-east-1a'
  'region=us-east-1,zone=us-east-1b'
  'region=us-east-1,zone=us-east-1c'
  'region=us-east-2,zone=us-east-2a'
  'region=us-east-2,zone=us-east-2b'
  'region=us-east-2,zone=us-east-2c'
  'region=us-west-1,zone=us-west-1a'
  'region=us-west-1,zone=us-west-1b'
  'region=us-west-1,zone=us-west-1c'
)

node=0
for zone in "${LOCALITY_ZONE[@]}"
do
    # Each node gets its own SQL and HTTP port, offset from the base ports
    node=$((node + 1))
    offset=$((node - 1))
    port=$((portbase + offset))
    httpport=$((httpportbase + offset))

    # Every node joins through the first three nodes
    port1=${portbase}
    port2=$((portbase + 1))
    port3=$((portbase + 2))
    join=${host}:${port1},${host}:${port2},${host}:${port3}
    mempool="128MiB"

    cockroach start \
    --locality="${zone}" \
    --port="${port}" \
    --http-port="${httpport}" \
    --advertise-addr="${host}:${port}" \
    --join="${join}" \
    --insecure \
    --store="datafiles/n${node}" \
    --cache="${mempool}" \
    --max-sql-memory="${mempool}" \
    --background
done

cockroach init --insecure --host=${host}:${portbase}

Next, we'll add the regions and configure the database for region-level survival.

create database test;
use test;

-- Add the 6 regions
alter database test primary region "eu-north-1";
alter database test add region "eu-west-1";
alter database test add region "eu-west-2";
alter database test add region "us-east-2";
alter database test add region "us-east-1";
alter database test add region "us-west-1";
show regions;

-- Add the super regions
SET enable_super_regions = 'on';
ALTER DATABASE test ADD SUPER REGION eu VALUES "eu-north-1","eu-west-1","eu-west-2";
ALTER DATABASE test ADD SUPER REGION us VALUES "us-west-1","us-east-2","us-east-1";
SHOW SUPER REGIONS FROM DATABASE test;

-- Enable region survival
ALTER DATABASE test SURVIVE REGION FAILURE;

Next, let's create two tables and add some sample data. The first postal_codes table is a global table and the second table users is using regional-by-row locality.

-- Add a GLOBAL table
create table postal_codes
(
    id   int primary key,
    code string
);
ALTER TABLE postal_codes SET LOCALITY GLOBAL;

-- Insert some data
insert into postal_codes (id, code)
select unique_rowid() :: int,
       md5(random()::text)
from generate_series(1, 100);

-- Add a regional-by-row table
CREATE TABLE users
(
    id          INT   NOT NULL,
    name        STRING NULL,
    postal_code STRING NULL,
    PRIMARY KEY (id ASC)
);

-- Make it RBR
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;

insert into users (id,name,postal_code,crdb_region)
select no,
    gen_random_uuid()::string,
    '123 45',
    'eu-north-1'
from generate_series(1, 10) no;

insert into users (id,name,postal_code,crdb_region)
select no,
    gen_random_uuid()::string,
    '123 45',
    'eu-west-1'
from generate_series(11, 20) no;

insert into users (id,name,postal_code,crdb_region)
select no,
    gen_random_uuid()::string,
    '123 45',
    'eu-west-2'
from generate_series(21, 30) no;

insert into users (id,name,postal_code,crdb_region)
select no,
    gen_random_uuid()::string,
    '123 45',
    'us-east-1'
from generate_series(31, 40) no;

insert into users (id,name,postal_code,crdb_region)
select no,
    gen_random_uuid()::string,
    '123 45',
    'us-east-2'
from generate_series(41, 50) no;

insert into users (id,name,postal_code,crdb_region)
select no,
    gen_random_uuid()::string,
    '123 45',
    'us-west-1'
from generate_series(51, 60) no;

select *,crdb_region from users;

You may notice above that we used explicit region values when inserting into the users table. This isn't required, but we do it here so we can see how the replica placement works.

Finally, let's run through a series of tests to observe the data domiciling and also run a compliance report query.

-- Observe
SHOW CREATE TABLE postal_codes;
SHOW PARTITIONS FROM TABLE postal_codes;
SHOW RANGES FROM TABLE postal_codes;

SHOW CREATE TABLE users;
SHOW PARTITIONS FROM TABLE users;
SHOW RANGES FROM TABLE users;
show zone configuration for table users;

-- Look for constraint compliance (notice this doesn't tell if the row exists, only where it would be stored)
SHOW RANGE FROM TABLE users FOR ROW ('eu-north-1',1);
SHOW RANGE FROM TABLE users FOR ROW ('eu-west-1',1);
SHOW RANGE FROM TABLE users FOR ROW ('us-east-1',1);
SHOW RANGE FROM TABLE users FOR ROW ('us-east-2',1);
SHOW RANGE FROM TABLE users FOR ROW ('us-west-1',1);

SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0;

WITH partition_violations AS (SELECT * FROM system.replication_constraint_stats WHERE violating_ranges > 0),
     report AS (SELECT crdb_internal.zones.zone_id,
                       crdb_internal.zones.subzone_id,
                       target,
                       database_name,
                       table_name,
                       index_name,
                       partition_violations.type,
                       partition_violations.config,
                       partition_violations.violation_start,
                       partition_violations.violating_ranges
                FROM crdb_internal.zones,
                     partition_violations
                WHERE crdb_internal.zones.zone_id = partition_violations.zone_id)
SELECT *
FROM report;

Conclusion

Multi-region CockroachDB clusters must contain at least 3 regions to ensure that data replicated across regions can survive the loss of one region.

Data domiciling in multi-region configurations allows users to keep certain subsets of data in specific localities for privacy regulations and performance reasons.

For zone-level survival, this is achieved by using replica placement restrictions. For region-level survival, this is achieved by using super regions.

Super regions must contain at least 3 subregions to ensure that data replicated across the subregions can survive the loss of one region. This leads to a higher total node count for a cluster (18 minimum) but in return, you can get both region-level survival and data domiciling requirements satisfied.

Using the Inbox Pattern with CockroachDB

Kai Niemi — Sun, 15 Jan 2023 14:57:54 GMT

Introduction

In a previous article, we looked at using CDC projections and transformations in CockroachDB to implement the Outbox Pattern for keeping multiple copies of state in sync in a microservices-style architecture.

In this article, we are going to look at the other end of the pipe, where events arrive into a system for processing rather than being delivered downstream. For that, we'll use the Inbox pattern.

Both these patterns help to solve the same challenge: how to provide exactly-once processing semantics when the best you have is at-least-once guarantees between heterogeneous systems.

Source Code

The source code for examples of this article can be found on GitHub.

Delivery vs Processing Semantics

The two concepts of delivery and processing are often mixed up or incorrectly treated as the same thing. The "at most once", "at least once" and notorious "exactly once" guarantees are often discussed in the context of message transport delivery between heterogeneous systems, of which "exactly once" is still very much impossible to achieve, unless you disprove certain impossibility theorems or blur the semantics of the terms.

Is this magical pixie dust I can sprinkle on my application?
No, not quite. Exactly-once processing is an end-to-end guarantee and the application has to be designed to not violate the property as well.

Message delivery refers to the passing of messages between components over an asynchronous network model full of traps, delays, re-ordered packets and all sorts of dangers (runners through the valley).

"Exactly-once delivery" is often defined as the message passing system filtering out duplicate events by maintaining state at the consumer side and adopting de-duplication at the producer side. There is still de-duplication of re-delivered messages involved, only it's encapsulated within a closed (homogeneous) system sharing the same protocol. Calling it exactly-once delivery is a misnomer; exactly-once processing is a more accurate term for what goes on.

Why is this important? The taxonomy itself is not, but it's important to understand the implications when depending on certain guarantees to protect application invariants. If you simply assume you get a message delivered exactly once from a message-passing system without reading the fine print of what's required at your end, you could be in a world of hurt.

With regard to the Inbox Pattern, we are not talking about a closed system but instead separate systems that don't share the same agreement protocol for message passing. What you can then rely on if, for example, you have systems A and B with a message passing system C in between, are the "at-most-once" and "at-least-once" message "delivery" guarantees.

"At-most-once" means a message can arrive one time only or not at all. It's not particularly useful except for fire-and-forget type of stuff. "At-least-once" means a message can potentially arrive multiple times due to re-delivery caused by failure or uncertainty (handled by timeouts).

The re-delivery part is a tricky one since you could end up with multiple side effects from processing a message when the intended purpose is just a single effect. Sending multiple e-mails is one classic example, double charging payments is another. Multiple side-effects, partial outcomes or copies ending up out-of-sync in the context of messaging is the equivalent of inconsistency in a database context.

Exactly What?

Exactly-once processing semantics as an end-to-end guarantee is, on the other hand, fully possible. Put simply, it means ensuring exactly one visible side effect even in the face of double processing. This can be achieved by using a sophisticated atomic commit protocol (XA/2PC) that all involved heterogeneous systems can participate in, by relying on natural idempotency (immutable/append-only data), or through idempotency by de-duplication, for example by transactionally storing a consumer offset or using UPSERTs.
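To make the consumer-offset variant of de-duplication concrete, here is a minimal in-memory sketch. It is not taken from the example repository: the class is hypothetical, the synchronized method stands in for a single database transaction, and it assumes messages carry a monotonically increasing offset and arrive in order.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory stand-in for a table where the offset and the side effect commit together. */
final class OffsetTrackingConsumer {
    private long lastAppliedOffset = -1; // hypothetical "consumer offset" column
    private final List<String> appliedEffects = new ArrayList<>();

    /**
     * Applies the message only if its offset is beyond the last one applied.
     * The synchronized method plays the role of one database transaction:
     * the effect and the new offset become visible together or not at all.
     */
    synchronized boolean handle(long offset, String payload) {
        if (offset <= lastAppliedOffset) {
            return false; // duplicate delivery, nothing to do
        }
        appliedEffects.add(payload);
        lastAppliedOffset = offset;
        return true;
    }

    /** Returns a copy of the effects applied so far. */
    synchronized List<String> effects() {
        return new ArrayList<>(appliedEffects);
    }
}
```

A re-delivered message with an offset that was already applied is simply skipped, so the visible effect happens once no matter how many times the message arrives.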

If we can atomically commit both the passing of a message and the side effects of its processing, we have achieved effectively-once semantics. We want the side effect(s) to become visible, if and only if, the processing is successful. When 2PC is not an option to ensure atomicity, one alternative is to make sure that none of the effects become visible and all involved heterogeneous systems remain in sync in the presence of faults.

Let's look at a practical example:

When using a classic pub-sub system (like ActiveMQ) to publish domain events and consume these events with the at-least-once guarantee, we need to ensure idempotency in the consuming process if the broker and consumer don't share the same atomic commit protocol.

If part of the processing is to write to the database like in the example below, and the transaction fails, the message delivery will not be acknowledged to the broker, which in turn means the broker will attempt to redeliver it until giving up and handing it over to its dead-letter-queue (DLQ).

If the database commit is indeed successful, but the acknowledgement to the broker gets lost due to the app server node crashing or quantum entanglement happening to delay the response, then the broker will again attempt a re-delivery.

When redelivery occurs, the event will be de-duplicated (as in a no-op) by the fact that we use UPSERTs when writing to the database. If the database commit is successful, but the commit "ack" to the consumer times out, the same outcome will unfold.

Timeouts are particularly nasty since the outcome is indeterminate. You can't tell if an operation took place or not. An ack means it did, a nack means it did not. When there's no answer, you just can't tell since the information is absent. You can always relax on the constraints and run with assumptions but it doesn't change the fact.
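The redelivery scenarios above can be simulated with a small sketch. Everything in it is hypothetical: the EventStore is an in-memory stand-in for a table keyed by event ID, and the delivery loop mimics a broker that keeps redelivering until it observes an acknowledgement, with a configurable number of acks lost after a successful commit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RedeliveryDemo {
    /** Stand-in for a table keyed by event ID; put() overwrites like an UPSERT. */
    static final class EventStore {
        private final Map<String, String> rows = new LinkedHashMap<>();
        private int writes = 0;

        void upsert(String eventId, String payload) {
            rows.put(eventId, payload); // same key and value again: no visible change
            writes++;
        }

        int rowCount() {
            return rows.size();
        }

        int writes() {
            return writes;
        }
    }

    /**
     * Hypothetical broker loop that redelivers until an ack gets through.
     * The first lostAcks acknowledgements are dropped to mimic a crash or
     * timeout that happens after the database commit succeeded.
     */
    static int deliverUntilAcked(EventStore store, String eventId, String payload, int lostAcks) {
        int attempts = 0;
        boolean acked = false;
        while (!acked) {
            attempts++;
            store.upsert(eventId, payload); // the commit succeeds on every attempt
            acked = attempts > lostAcks;    // the ack only arrives once the losses are used up
        }
        return attempts;
    }
}
```

With two lost acks the event is written three times, yet the store still holds a single row for that event ID, which is the single visible effect we are after.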

The Outbox Pattern

To recap, the transactional outbox pattern avoids the non-atomic, dual write problem where different systems risk ending up out-of-sync. It's straightforward to implement in CockroachDB for example by using a CDC webhook along with TTLs to clear out published events.

A variant of the outbox can be called "anti-outbox" for lack of a better name. Using CDC transformations on the source tables directly to produce the change events means you don't need an explicit outbox table storing the events. That way you can save storage costs and avoid the cleanup overhead that comes with managing outbox events.

The Inbox Pattern

The Inbox Pattern is the inverse of the Outbox Pattern and otherwise very similar to it. It refers to the concept of storing incoming messages or events from a message-passing system directly into persistent storage like a database and deferring the processing to a later time (or to a different system). This is a natural fit for pub-sub systems that don't retain messages after delivery, which helps to reduce queue/topic size and backpressure against the publishers.

Using the Inbox Pattern, we store the event in the database first (as-is) and then acknowledge the message to the broker. If the database commit is unsuccessful, the message will be re-delivered. If the acknowledgement gets lost or times out between the broker, app or database, the message will also be re-delivered.

Upon message delivery, we de-duplicate using the information in the event to check if it's been observed before, by doing an INSERT INTO .. ON CONFLICT DO NOTHING (an idempotent insert, similar in effect to an UPSERT). This requires the event to contain an ID that can be used for de-duplication (like a UUID).

There's no atomic commit protocol across the message passing system, application and database, but it's not needed either due to the message de-duplication. There are just local database transactions.

The actual intended business processing for the event, with other potentially business-relevant side effects, can be achieved by hooking up a change feed on the inbox table. After the local database transaction commits, a change event is either emitted to Kafka for downstream processing or to a self-subscribing webhook endpoint that can trigger another business process.

That closes the cycle where we have achieved effectively-once processing semantics, end-to-end.

If you are still awake, you may ask the question: Isn't this just pushing the problem around? When the message arrives at the CDC endpoint, we are again dealing with the same issue, aren't we? In terms of at-least-once semantics, yes, but at that stage we have eliminated the risk of the ingress pub-sub system and egress database disagreeing on the outcome of that message exchange, and the event is durably stored.

Use Case Example

The use case example is a customer registration workflow. We have a system that fires off customer registration domain events to a pub-sub system topic. These events are then received by an inbox subscriber that writes them to the database.

As far as this example goes, that's where things end. A continuation could be to hook a change feed to the inbox (journal) table and further curate the event as it's progressing through the registration journey.

Let's look at a few implementation details. First, the inbox event table schema, called journal:

CREATE SEQUENCE journal_seq START 1 INCREMENT 1;

CREATE TABLE journal
(
    id          uuid primary key as ((payload ->> 'id')::UUID) stored,
    event_type  varchar(15) not null,
    status      varchar(64) not null,
    sequence_no int         default nextval('journal_seq'),
    payload     json,
    tag         varchar(64),
    updated_at  timestamptz default clock_timestamp(),
    INVERTED INDEX event_payload (payload)
);

CREATE INDEX idx_journal_main ON journal (event_type) STORING (status, sequence_no, payload, tag, updated_at);

The event publisher is straightforward:

@Component
public class RegistrationEventProducer {
    protected final Logger logger = LoggerFactory.getLogger(getClass());

    @Autowired
    private JmsTemplate jmsTemplate;

    @Value("${active-mq.topic}")
    private String topic;

    public void sendMessage(RegistrationEvent event) {
        jmsTemplate.convertAndSend(topic, event);
    }
}

The consumer is also straightforward:

@Component
public class RegistrationConsumer {
   ...

    @JmsListener(destination = "${active-mq.topic}", containerFactory = "jmsListenerContainerFactory")
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void receiveMessage(RegistrationEvent event) {
        // Upsert to inbox table (journal) by de-duplicating on the event ID since best we get is at-least-once
        // delivery. With JDBC we could use INSERT INTO .. ON CONFLICT DO NOTHING.
        RegistrationJournal journal = registrationJournalRepository.findById(event.getId()).orElseGet(() -> {
            RegistrationJournal newRegistration = new RegistrationJournal();
            return newRegistration;
        });
        if (journal.isNew()) {
            journal.setEvent(event);
            registrationJournalRepository.save(journal);
        }
    }
}
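The consumer's code comment mentions that plain JDBC could use INSERT INTO .. ON CONFLICT DO NOTHING instead of the find-then-save approach. A minimal sketch of that alternative follows. It is not part of the example repository: the JournalInbox class is hypothetical, the column list assumes the journal schema shown earlier (where id is computed from the payload), and the payload is assumed to arrive as a JSON string.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class JournalInbox {
    // The id column is computed from payload ->> 'id', so it is not listed here.
    static final String INSERT_SQL =
            "INSERT INTO journal (event_type, status, payload) "
                    + "VALUES (?, ?, CAST(? AS JSONB)) "
                    + "ON CONFLICT DO NOTHING";

    private JournalInbox() {
    }

    /**
     * Stores the event unless a conflicting row (same id) already exists.
     * Returns true if a new row was inserted and false for a duplicate delivery.
     */
    static boolean store(Connection connection, String eventType, String status, String payloadJson)
            throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(INSERT_SQL)) {
            ps.setString(1, eventType);
            ps.setString(2, status);
            ps.setString(3, payloadJson);
            return ps.executeUpdate() == 1;
        }
    }
}
```

The design choice here is to let the primary key do the de-duplication in a single statement, so there is no window between checking for the row and writing it.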

Then it's just a matter of configuring a change feed on the journal table and doing the processing at the business end of it (not included in the demo).

Implementation Tutorial

This tutorial assumes you run everything on a local machine/laptop.

Prerequisites

ActiveMQ 5
CockroachDB v22.1+
JDK 19+ (OpenJDK compatible)
Maven 3.1+ (optional)

ActiveMQ Setup

Although ActiveMQ 5 claims to be JDK 1.8 compatible, it's compiled with a class version beyond that so JDK 19 or higher is needed.

Linux:

wget https://downloads.apache.org/activemq/5.17.3/apache-activemq-5.17.3-bin.tar.gz
tar zxvf apache-activemq-5.17.3-bin.tar.gz
cd apache-activemq-5.17.3/bin
./activemq console

OSX:

brew install apache-activemq

The admin UI is available on: http://127.0.0.1:8161/admin/.

The default login is admin/admin.

Running

To run the demo, first clone the GitHub repo and build the component:

git clone git@github.com:kai-niemi/roach-spring-boot.git
cd roach-spring-boot
./mvnw clean install
cd spring-boot-inbox

Create the database in CockroachDB using the SQL shell:

CREATE database spring_boot;

Then start the inbox service:

java -jar target/spring-boot-inbox.jar &

When the service comes up, you can use your browser to inspect the API:

http://localhost:8090/

Next, either get and submit a form or inline the payload:

Using the form method:

curl http://localhost:8090/registration/form > form.json
curl -v -d "@form.json" -H "Content-Type:application/json" -X POST http://localhost:8090/registration/

Using the inlined method:

curl -v -d '{"name":"User","email":"user@email.com","jurisdiction":"mga","createdAt":"2023-01-12T09:21:04.571+00:00"}' -H "Content-Type:application/json" -X POST http://localhost:8090/registration/

To observe that the events were received and stored in the journal:

curl http://localhost:8090/journal/registration-events?jurisdiction=mga

Conclusion

In this article, we looked at the Inbox Pattern in contrast to the more commonly discussed Outbox Pattern for providing exactly-once processing semantics without using 2PC.

The source code for examples of this article can be found on GitHub.

JPA Best Practices - JSONB Mapping

Kai Niemi — Sun, 15 Jan 2023 14:55:29 GMT

Introduction

This article is part five of a series of data access best practices when using JPA and CockroachDB. Although most of these guidelines are fairly framework and database agnostic, it's mainly targeting the following Java technology stack:

JPA 2 and Hibernate 5
Spring Boot 2.7
Spring Data 2.7
JDK 1.8
CockroachDB v22

Example Code

The code examples are available on GitHub.

Chapter 5: JSONB

CockroachDB supports storing, manipulating and indexing JSON documents through the JSONB data type. This is useful for storing and querying unstructured JSON documents alongside structured elements.

In this chapter, we will take a look at how to map and use the JSONB data type with Hibernate 5 and JPA. If you are using Hibernate 6, then there's a built-in type available via the @JdbcTypeCode annotation. For this article, we are going to use a custom UserType to achieve the same thing.

Let's first take a look at the table schema:

create table journal
(
    id         STRING PRIMARY KEY AS (payload ->> 'id') STORED,
    event_type varchar(15) not null,
    payload    json,
    tag        varchar(64),
    updated    timestamptz default clock_timestamp(),
    INVERTED INDEX event_payload (payload)
);

create index idx_journal_main on journal (event_type, tag) storing (payload);

This table stores journal entries in the form of JSON event payloads. Think of it like an event store as part of an event-driven system design. There are a few things to highlight. First off, we are using a computed primary index column (id) that projects into the JSONB document in the payload column. That way, we use a common string in the JSON documents as the table's primary key.

The JSON document for each row is stored in the payload column, on which we also apply a Generalized Inverted Index or GIN index. Generalized inverted indexes store mappings from values within a container column (such as the payload JSONB document) to the row that holds that value.

Finally, we use a Covering Index storing the payload to avoid index joins when filtering on event type and tag in queries.

All this indexing isn't needed for the Hibernate mapping example, but more to show a few indexing techniques for unstructured JSONB documents.

Next, let's look at the Journal entity mapped against this table, which is modeled as an abstract type:

@Entity
@Table(name = "journal")
@Inheritance(strategy = InheritanceType.SINGLE_TABLE)
@DiscriminatorColumn(
        name = "event_type",
        discriminatorType = DiscriminatorType.STRING,
        length = 15
)
@MappedSuperclass
public abstract class Journal<T> {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private String id;

    @Column(name = "tag")
    private String tag;

    @Basic
    @Column(name = "updated")
    private LocalDateTime updated;

    @Type(type = "jsonb")
    @Column(name = "payload")
    @Basic(fetch = FetchType.LAZY)
    private T event;
...
}

The reason it's abstract is that we are using the JPA single-table inheritance strategy with the event_type column as a discriminator. That means we can use a single table to host a hierarchy of entity types. In this case, there are only two: the first one is a journal of accounts and the second is a journal of monetary transactions towards accounts. We are storing the account and transaction models serialized into binary JSON format, rather than decomposing the model into a normalized table structure like account, transaction and transaction_item.

JSONB using Hibernate 4 or 5

As mentioned, since Hibernate 6 you can use the @JdbcTypeCode(SqlTypes.JSON) annotation to map arbitrary objects to JSONB columns. For earlier Hibernate versions (4 and 5), you will need to use a custom UserType, which is what we are going to do next.

First of all, you have to implement the methods sqlTypes() and returnedClass(), which tell Hibernate what SQL type and Java class to use for the mapping. In this case, we use different entity subclasses, so the returnedClass method is left abstract. Later we'll define a concrete subclass per entity type.

public abstract class AbstractJsonDataType<T> implements UserType {
    @Override
    public int[] sqlTypes() {
        return new int[] {Types.JAVA_OBJECT};
    }

    @Override
    public abstract Class returnedClass();
...
}

The next methods are nullSafeGet and nullSafeSet, used for reading and writing respectively.

    @Override
    public Object nullSafeGet(ResultSet rs, String[] names, SharedSessionContractImplementor session, Object owner)
            throws HibernateException, SQLException {
        final String cellContent = rs.getString(names[0]);
        if (cellContent == null) {
            return null;
        }
        TypeDef typeDef = AnnotationUtils.findAnnotation(owner.getClass(), TypeDef.class);
        Class clazz = typeDef != null ? typeDef.defaultForType() : returnedClass();
        try {
            if (isCollectionType()) {
                JavaType type = mapper.getTypeFactory().constructCollectionType(List.class, returnedClass());
                return mapper.readValue(cellContent.getBytes("UTF-8"), type);
            }
            return mapper.readValue(cellContent.getBytes("UTF-8"), clazz);
        } catch (Exception ex) {
            throw new HibernateException("Failed to deserialize json to " + clazz.getName(), ex);
        }
    }

    @Override
    public void nullSafeSet(PreparedStatement ps, Object value, int index, SharedSessionContractImplementor session)
            throws HibernateException, SQLException {
        if (value == null) {
            ps.setNull(index, Types.OTHER);
            return;
        }
        try {
            StringWriter w = new StringWriter();
            mapper.writeValue(w, value);
            w.flush();
            ps.setObject(index, w.toString(), Types.OTHER);
        } catch (Exception ex) {
            throw new HibernateException("Failed to serialize " + value.getClass().getName() + " to json", ex);
        }
    }

Finally, the deepCopy method is important to implement in such a way it creates a separate byte-level copy of supplied object references. In this case by serializing and deserializing to/from JSON.

    @Override
    public Object deepCopy(final Object value) throws HibernateException {
        try {
            if (isCollectionType()) {
                JavaType type = mapper.getTypeFactory().constructCollectionType(List.class, returnedClass());
                return mapper.readValue(mapper.writeValueAsString(value), type);
            }
            return mapper.readValue(mapper.writeValueAsString(value), returnedClass());
        } catch (IOException ex) {
            throw new HibernateException(ex);
        }
    }

There's some additional code to handle collection types, which enables the use of fields with a List type.

The complete user type used for this example is available on GitHub.

Using the User Type

The final step is to register the user type with the @TypeDef annotation. We also create one concrete subclass for each JSON type.

@Entity
@DiscriminatorValue("ACCOUNT")
@TypeDef(name = "jsonb", typeClass = AccountJournal.JsonType.class, defaultForType = Account.class)
public class AccountJournal extends Journal<Account> {
    public static class JsonType extends AbstractJsonDataType<Account> {
        @Override
        public Class returnedClass() {
            return Account.class;
        }
    }
}

@Entity
@DiscriminatorValue("TRANSACTION")
@TypeDef(name = "jsonb", typeClass = TransactionJournal.JsonType.class, defaultForType = Transaction.class)
public class TransactionJournal extends Journal<Transaction> {
    public static class JsonType extends AbstractJsonDataType<Transaction> {
        @Override
        public Class returnedClass() {
            return Transaction.class;
        }
    }
}

The event payload field is defined as a generic type in the superclass using the @Type annotation. The lazy fetch type is optional and the default is eager fetching:

..
    @Type(type = "jsonb")
    @Column(name = "payload")
    @Basic(fetch = FetchType.LAZY)
    private T event;
..

Application Code

Now, from an application standpoint, all we need to do to use the custom user type is to assign the object references. Here we have an account value object which will be stored in binary JSON format in the mapped payload column.

// Event value object to be stored as JSONB
Account account = Account.builder()
                .withGeneratedId()
                .withAccountType("asset")
                .withName("abc")
                .withBalance(BigDecimal.valueOf(250.00))
                .build();

// JPA entity in transient state
AccountJournal journal = new AccountJournal();
journal.setTag("asset");
journal.setEvent(account);
journal = accountJournalRepository.save(journal);

Projection and Aggregation Queries

To wrap up, let's take a quick look at how we use Spring Data JPA repositories for querying our journal entities with JSON documents.

In the following example, we are filtering journal entries for account events with a set balance between a lower and upper bound:

@Repository
public interface AccountJournalRepository extends JpaRepository<AccountJournal, UUID> {
    @Query(value = "SELECT * FROM journal WHERE event_type='ACCOUNT'"
            + " AND (payload ->> 'balance')::::decimal BETWEEN :lowerBound AND :upperBound",
            nativeQuery = true)
    List findAccountsWithBalanceBetween(
            @Param("lowerBound") BigDecimal lowerBound, @Param("upperBound") BigDecimal upperBound);
}

In the final example below, we are doing similar queries against the transaction journal event types:

@Repository
public interface TransactionJournalRepository extends JpaRepository<TransactionJournal, UUID> {
    @Query(value = "SELECT j FROM Journal j WHERE j.tag=:tag")
    List findByTag(@Param("tag") String tag);

    @Query(value = "SELECT * FROM journal WHERE event_type='TRANSACTION'"
            + " AND payload ->> 'transferDate' BETWEEN :startDate AND :endDate",
            nativeQuery = true)
    List findTransactionsInDateRange(@Param("startDate") String startDate,
                                     @Param("endDate") String endDate);

    @Query(value =
            "WITH x AS(SELECT payload from journal where event_type='TRANSACTION' AND tag=:tag),"
                    + "items AS(SELECT json_array_elements(payload->'items') as y FROM x) "
                    + "SELECT sum((y->>'amount')::::decimal) FROM items",
            nativeQuery = true)
    BigDecimal sumTransactionLegAmounts(@Param("tag") String tag);
}

The last sumTransactionLegAmounts method executes an aggregation query on the amount field of the stored documents. It uses the following CTE in a more readable format:

WITH x AS (SELECT payload from journal where event_type = 'TRANSACTION' AND tag = 'tag'),
     items AS (SELECT json_array_elements(payload -> 'items') as y FROM x)
SELECT sum((y ->> 'amount')::decimal)
FROM items;

If we run an explain on that query, you can see the index idx_journal_main being used without an index join (since the payload is stored in the secondary index):

distribution: local
vectorized: true

group (scalar)
  estimated row count: 1
  render
    project set
      estimated row count: 40,600
      render
        scan
          estimated row count: 4,060 (100% of the table; stats collected 3 hours ago; using stats forecast for 3 hours ago)
          table: journal@idx_journal_main
          spans: [/'TRANSACTION'/'cashout' - /'TRANSACTION'/'cashout']

Summary

CockroachDB offers the JSONB data type to store unstructured JSON documents in binary format in the database. Hibernate 6 provides a standard annotation @JdbcTypeCode that can be used for JSONB columns. For Hibernate 4 and 5 you can implement the UserType interface and register it with a @TypeDef annotation. Mapping SQL queries against JSONB columns in Spring Data is easily done through native SQL queries.

CDC transformations and the Outbox pattern with CockroachDB

Kai Niemi — Sat, 07 Jan 2023 15:44:42 GMT

In a previous article, we looked at a common challenge in microservice style architectures, namely how to keep multiple copies of state in sync. One service calls another service and then keeps a locally cached copy maintained to both reduce chattiness over the network and increase fault tolerance.

Using the transactional outbox pattern avoids the dual write problem and it's straightforward to implement in CockroachDB using a CDC webhook along with TTLs to clear out published domain events.

In this article, we are going to improve on this concept by not using any outbox table but instead projections and transformations, a new feature added to CockroachDB in v22.2.

Source Code

The source code for examples of this article can be found on GitHub.

Introducing CDC Transformations

Changefeeds offer a powerful push-based, stream-oriented mechanism to drive data integration pipelines and also provide a backbone for event-driven architectures.

Rather than polling tables and extracting the information, you simply hook up a subscription on the changes that will be streamed by the database itself to a sink of choice when state changes.

CDC transformations add the ability to both filter and transform the data into the domain events that you want to be produced in the change stream. This reduces complexity since you don't need any outbox table to store the to-be-published events, which in turn means less duplicate storage and fewer cleanup efforts afterward. It reduces cost as well, since storage requirements are lower and fewer events are sent over the wire only to be filtered out by the downstream systems.

Use Case Example

In the following example, we have two services. One is the Catalog Service representing the technical authority for a product catalog. It only contains one table - product.

Next, we have the Order Service, which is the technical authority for order management and workflow. It needs domain knowledge of what a product is but doesn't want ownership of it, just a shallow copy of the product entity to produce purchase orders. To avoid having the order service call the catalog service each time an order is added or changed, it hosts a copy, or materialized view, of the product catalog. This means that whenever the catalog creates, updates or deletes a product, the change should be reflected in the downstream system, in this case the order service.

Typically, you would use some sort of message broker (like Kafka or a pub/sub system) in between the services, which decouples the systems completely and allows for multiple durable subscribers. There is however no hard dependency between these components. The main dependency is the changefeed that pushes the products from the catalog to the order service, but that dependency sits in the database tier.

Catalog Service

Assume the product table in the catalog service looks something like this:

create table product
(
    id               uuid           not null default gen_random_uuid(),
    name             varchar(128)   not null,
    description      varchar(256),
    price            numeric(19, 2) not null,
    currency         varchar(3)     not null,
    sku              varchar(128)   not null,
    inventory        int            not null default 0,
    created_by       varchar(24),
    created_at       timestamptz    not null default clock_timestamp(),
    last_modified_by varchar(24),
    last_modified_at timestamptz,
    primary key (id)
);

The next step would be to create a webhook change feed using transformations to tailor the event structure:

CREATE CHANGEFEED INTO 'webhook-https://localhost:8443/order-service/cdc?insecure_tls_skip_verify=true'
WITH schema_change_policy='stop', key_in_value, updated, resolved='15s',
    webhook_sink_config='{"Flush": {"Messages": 10, "Frequency": "5s"}, "Retry": {"Max": "inf"}}'
AS SELECT
    cdc_updated_timestamp()::int AS event_timestamp,
    'v1' AS event_version,
    'product' AS event_table,
    IF(cdc_is_delete(),'delete',IF(cdc_prev()='null','create','update')) AS event_type,
    cdc_prev() as event_before,
    jsonb_build_object(
        'id', id,
        'name', name,
        'description', description,
        'price', concat(price::string, ' ', currency),
        'sku', sku,
        'inventory', inventory,
        'created_by', created_by,
        'created_at', created_at,
        'last_modified_by', last_modified_by,
        'last_modified_at', last_modified_at
    ) AS event_after
FROM product;

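There are a few new functions here that are helpful for both filtering and tailoring the events:

cdc_updated_timestamp: the timestamp of the change event
cdc_is_delete: true when the change event is a row deletion
cdc_prev: the previous state of the row, as JSON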
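
To make the nested IF expression behind event_type easier to follow, here is a minimal Java sketch of the same decision logic. The class and method names are hypothetical and not part of the demo project. It assumes, as the changefeed expression does, that a missing previous row state shows up as the JSON literal null.

```java
// Hypothetical helper mirroring IF(cdc_is_delete(),'delete',IF(cdc_prev()='null','create','update')).
final class EventTypes {

    private EventTypes() {
    }

    /**
     * @param isDelete     what cdc_is_delete() would return for the row change
     * @param previousJson what cdc_prev() would return, rendered as text
     *                     (assumed to be the literal "null" when there is no previous row)
     */
    static String classify(boolean isDelete, String previousJson) {
        if (isDelete) {
            return "delete";
        }
        boolean hasPrevious = previousJson != null && !"null".equals(previousJson.trim());
        return hasPrevious ? "update" : "create";
    }

    public static void main(String[] args) {
        System.out.println(classify(false, "null"));
        System.out.println(classify(false, "{\"sku\": \"p-4\"}"));
        System.out.println(classify(true, "{\"sku\": \"p-4\"}"));
    }
}
```

Deletes are checked first because a deleted row still has a previous state, so testing the previous state first would misreport a delete as an update.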

Order Service

When the catalog service creates or updates products, the receiving endpoint in the order service will see HTTP POST request bodies like this update event (note that TLS is required for the webhook sink):

{
  "payload" : [ {
    "__crdb__" : {
      "key" : [ "8c8fbc48-b670-490d-9710-08b25500c314" ],
      "topic" : "product",
      "updated" : "1673104217274160332.0000000000"
    },
    "event_after" : {
      "created_at" : "2023-01-07T15:06:15.122Z",
      "created_by" : "bobby_tables",
      "description" : null,
      "id" : "8c8fbc48-b670-490d-9710-08b25500c314",
      "inventory" : 412,
      "last_modified_at" : "2023-01-07T15:06:45.13Z",
      "last_modified_by" : "bobby_tables",
      "name" : "q3hix3OUor0gqf-7sDP_rA",
      "price" : "101.11 USD",
      "sku" : "p-4"
    },
    "event_before" : {
      "created_at" : "2023-01-07T15:06:15.122Z",
      "created_by" : "bobby_tables",
      "currency" : "USD",
      "description" : null,
      "id" : "8c8fbc48-b670-490d-9710-08b25500c314",
      "inventory" : 408,
      "last_modified_at" : "2023-01-07T15:06:15.274Z",
      "last_modified_by" : "bobby_tables",
      "name" : "q3hix3OUor0gqf-7sDP_rA",
      "price" : 42.28,
      "sku" : "p-4"
    },
    "event_table" : "product",
    "event_timestamp" : 1673104187178673238,
    "event_type" : "update",
    "event_version" : "v1"
  } ],
  "length" : 1
}

The endpoint receiving the change feed events (a bit shortened for illustration):

@RestController
@RequestMapping(value = "/order-service/cdc")
public class ChangeFeedController {
..
    @PostMapping(consumes = {MediaType.ALL_VALUE})
    public ResponseEntity onChangeFeedEvent(@RequestBody String body) {
        try {
            String prettyJson = prettyObjectMapper
                    .writerWithDefaultPrettyPrinter()
                    .writeValueAsString(prettyObjectMapper.readTree(body));
            logger.debug("onChangeFeedEvent ({}) body:\n{}", counter.incrementAndGet(), prettyJson);

            // We could use the 'event_table' field to map against change event types, here we only have one type
            ProductEnvelope envelope = objectMapper.readerFor(ProductEnvelope.class).readValue(body);

            AbstractEnvelope.Metadata metadata = envelope.getMetadata();
            if (metadata != null) {
                metadata.getResolvedTimestamp().ifPresent(logicalTimestamp -> {
                    logger.debug("Resolved timestamp: {}", logicalTimestamp);
                });
            }

            envelope.getPayloads().forEach(e -> domainEventListener.onProductChangeEvent(e));
        } catch (JsonProcessingException e) {
            logger.error("", e);
            return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                    .body(e.toString());
        }
        return ResponseEntity.ok().build();
    }
}

It simply maps the request body to a ProductEnvelope object that is modeled around the CDC message structure:

public abstract class AbstractEnvelope<T extends Event<ID>, ID> {
    public static class Metadata {
        @JsonProperty("resolved")
        private String resolved;

        public String getResolved() {
            return resolved;
        }

        public Optional<LogicalTimestamp> getResolvedTimestamp() {
            return resolved != null
                    ? Optional.ofNullable(LogicalTimestamp.parse(resolved))
                    : Optional.empty();
        }
    }

    @JsonProperty("__crdb__")
    private Metadata metadata;

    @JsonProperty("payload")
    private List<Payload<T, ID>> payloads = new ArrayList<>();

    @JsonProperty("length")
    private int length;

    public Metadata getMetadata() {
        return metadata;
    }

    public List<Payload<T, ID>> getPayloads() {
        return payloads;
    }

    public int getLength() {
        return length;
    }
}

public class Payload<T extends Event<ID>, ID> {
    public static class Metadata {
        @JsonProperty("topic")
        private String topic;

        @JsonProperty("updated")
        private String updated;

        @JsonProperty("key")
        private List<String> key = new ArrayList<>();

        public String getTopic() {
            return topic;
        }

        public String getUpdated() {
            return updated;
        }

        public List<String> getKey() {
            return key;
        }
    }

    @JsonProperty("__crdb__")
    private Metadata metadata;

    @JsonProperty("event_table")
    private String table;

    @JsonProperty("event_timestamp")
    private String timestamp;

    @JsonProperty("event_type")
    private String type;

    @JsonProperty("event_before")
    private T before;

    @JsonProperty("event_after")
    private T after;

    public Metadata getMetadata() {
        return metadata;
    }

    public T getBefore() {
        return before;
    }

    public T getAfter() {
        return after;
    }

    public String getTable() {
        return table;
    }

    public String getTimestamp() {
        return timestamp;
    }

    public Optional<LogicalTimestamp> getLogicalTimestamp() {
        return timestamp != null ? Optional.of(LogicalTimestamp.parse(timestamp)) : Optional.empty();
    }

    public Operation getOperation() {
        if ("create".equals(type)) {
            return Operation.insert;
        } else if ("delete".equals(type)) {
            return Operation.delete;
        } else if ("update".equals(type)) {
            return Operation.update;
        } else {
            return Operation.unknown;
        }
    }
}
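
The envelope classes rely on a LogicalTimestamp.parse helper that is not shown here. As a rough sketch of what such a parser could look like (the HlcTimestamp class below is an assumption for illustration, not the demo project's implementation), the decimal value found in the updated and resolved fields can be split into wall-clock nanoseconds before the dot and a logical counter after it:

```java
import java.time.Instant;

// Hypothetical stand-in for the LogicalTimestamp class referenced in the article.
final class HlcTimestamp {
    private final Instant wallTime;
    private final long logical;

    private HlcTimestamp(Instant wallTime, long logical) {
        this.wallTime = wallTime;
        this.logical = logical;
    }

    /** Parses values such as "1673104217274160332.0000000000" or "1673104187178673238". */
    static HlcTimestamp parse(String text) {
        String[] parts = text.trim().split("\\.", 2);
        long nanosSinceEpoch = Long.parseLong(parts[0]);
        // The part after the dot is read as an integer counter; it defaults to 0 when absent.
        long logical = parts.length > 1 && !parts[1].isEmpty() ? Long.parseLong(parts[1]) : 0L;
        Instant wallTime = Instant.ofEpochSecond(
                Math.floorDiv(nanosSinceEpoch, 1_000_000_000L),
                Math.floorMod(nanosSinceEpoch, 1_000_000_000L));
        return new HlcTimestamp(wallTime, logical);
    }

    Instant getWallTime() {
        return wallTime;
    }

    long getLogical() {
        return logical;
    }

    public static void main(String[] args) {
        HlcTimestamp ts = parse("1673104217274160332.0000000000");
        System.out.println(ts.getWallTime() + " logical=" + ts.getLogical());
    }
}
```

The same parser also accepts the integer-only event_timestamp value produced by the ::int cast in the changefeed query, since a missing fractional part is treated as a logical counter of zero.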

Finally, we have the product service that represents the transaction boundary. It maps the events to JPA entity operations, effectively using UPSERTs and DELETEs.

    @Override
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void onProductChangeEvent(Payload<ProductEvent, UUID> payload) {
        ProductEvent beforeEvent = payload.getBefore();
        ProductEvent afterEvent = payload.getAfter();

        // Deletes are a bit special
        if (beforeEvent == null && afterEvent == null) {
            Payload.Metadata metadata = payload.getMetadata();
            metadata.getKey().forEach(key -> {
                UUID id = UUID.fromString(key);
                logger.debug("Delete product with ID [{}]", id);
                productRepository.deleteById(id);
            });
            return;
        }

        switch (payload.getOperation()) {
            case insert:
            case update:
                Product proxy = productRepository.findById(afterEvent.getId()).orElseGet(Product::new);
                if (proxy.isNew()) {
                    logger.debug("Create product with ID [{}]: {}", afterEvent.getId(), proxy);
                    proxy.setId(afterEvent.getId());
                } else {
                    logger.debug("Update product with ID [{}]: {}", afterEvent.getId(), proxy);
                }
                Money m = Money.of(afterEvent.getPrice());
                proxy.setPrice(m.getAmount());
                proxy.setCurrency(m.getCurrency().getCurrencyCode());
                proxy.setSku(afterEvent.getSku());
                proxy.setName(afterEvent.getName());
                proxy.setDescription(afterEvent.getDescription());
                proxy.setInventory(afterEvent.getInventory());
                productRepository.save(proxy);
                break;
            default:
                throw new IllegalStateException("Unknown operation: " + payload.getOperation());
        }
    }

Demo Project

This tutorial assumes you run everything on a local machine/laptop.

To run the demo, first clone the GitHub repo and build the two components:

git clone git@github.com:kai-niemi/roach-spring-boot.git
cd roach-spring-boot
./mvnw clean install
cd spring-boot-cdc-parent

Create the databases in CockroachDB using the DB console:

CREATE database spring_boot_catalog;
CREATE database spring_boot_order;

Then start the catalog and order services:

java -jar catalog-service/target/catalog-service.jar &
java -jar order-service/target/order-service.jar &

When the services come up, you can use your browser to inspect the catalog and the order service.

The catalog service has a custom scheduling resource that will enable and disable periodic updates to product change events.

Conclusion

In this article, we used CDC transformations with a webhook sink in CockroachDB to drive a data integration workflow between two independent services. This eliminates the need for modeling an outbox table in the source service, along with all lifecycle management of the domain events.

User defined functions in CockroachDB

Kai Niemi — Wed, 04 Jan 2023 21:02:25 GMT

User-defined functions, or UDFs, were introduced in CockroachDB 22.2 as part of a feature family called Distributed Functions.

UDFs are a simpler edition of stored procedures, yet share the same "controversy" around whether or not it's appropriate to push business logic closer to the data, all the way into the database itself. Nevertheless, it's been a highly requested feature for CockroachDB and is now available in preview (meaning it's just a start).

When to use UDFs

A purist would perhaps argue that a service that doesn't manage any state is a function and a service that doesn't process any business logic is a database. Combine both business processing and state/data management and what you get is a service.

These boundaries were strictly defined in the classic service-oriented architecture (SOA) manifesto, which later evolved into what we commonly know as the microservice architecture style, basically by removing all formalism and just describing systems by characteristics such as components, organization around business capabilities, smart endpoints and dumb pipes, etc.

Before deviating too far, there are certainly performance and efficiency benefits of processing close to the data since you save network roundtrips.

UDFs are ideal for replacing complicated expressions in queries that would otherwise make application code hard to read, reducing code duplication and promoting consistency. The question is rather where you draw the line, and what counts as data processing versus business processing logic.

It's easy to forget why these logical and physical tiers exist in the first place: separation of concerns. Having most if not all business logic implemented in a stateless middle tier using a high-level language with version control, a testing harness, security controls, etc. does have its benefits. Leave the distributed state problem to the other guy that's much better at it - the database.

How to use UDFs

Once you have overcome the philosophical dilemma of what defines a database, function or service, and are ready to jump in with both feet and adopt UDFs, then using them is just a matter of calling functions. Before calling a function, we of course need to have one.

Let's create a simple UDF example in CockroachDB that generates random phone numbers. This could be useful to pre-populate test databases for example.

Here's the syntax:

CREATE OR REPLACE FUNCTION rand_phone_number() RETURNS STRING IMMUTABLE LEAKPROOF LANGUAGE SQL AS
$$
select concat('(',ceil(random()*9+1)::string, ceil(random()*10)::string, ceil(random()*10)::string, ')',
              ceil(random()*9+1)::string, ceil(random()*10)::string, ceil(random()*10)::string, '-',
              ceil(random()*10)::string,ceil(random()*10)::string,ceil(random()*10)::string,ceil(random()*10)::string)
$$;

Let's break these elements down for completeness.

CREATE OR REPLACE FUNCTION rand_phone_number() specifies that (surprise) we're creating or replacing a function, and the name of the function will be rand_phone_number. The function does not accept any arguments, but could.
RETURNS STRING specifies that the output will be the STRING datatype.
IMMUTABLE LEAKPROOF specifies the volatility of the function, which in this case, means the function will not mutate/change any data or have other side effects in the database post-execution.
LANGUAGE SQL specifies that the function body will be written in SQL.
AS $$select ...$$ specifies that the function body will execute the SQL code within the double dollar signs. The semicolon at the end signals the end of the full CREATE FUNCTION statement.

Running it is just like invoking any other built-in function:

select rand_phone_number();

Time: 17ms total (execution 17ms / network 0ms)

  rand_phone_number
---------------------
  (8910)582-2649
(1 row)

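Using Java, a rough equivalent of this UDF would be the following (the SQL version uses ceil, so a single position can yield 10, as in the output above, while the Java version always emits single digits):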

public static String randomPhoneNumber() {
    StringBuilder sb = new StringBuilder()
            .append("(")
            .append(random.nextInt(9) + 1);
    IntStream.range(0, 2).map(i -> random.nextInt(10)).forEach(sb::append);
    sb.append(") ").append(random.nextInt(9) + 1);
    IntStream.range(0, 2).map(i -> random.nextInt(10)).forEach(sb::append);
    sb.append("-");
    IntStream.range(0, 4).map(i -> random.nextInt(10)).forEach(sb::append);
    return sb.toString();
}

Let's take another example:

create table city ( name string not null, primary key (name) );

INSERT into city VALUES ('new york'), ('boston'), ('washington dc'), ('miami'), ('charlotte'), ('atlanta'), ('chicago'), ('st louis'), ('indianapolis'), ('nashville'), ('dallas'), ('houston'), ('san francisco'), ('los angeles'), ('san diego'), ('portland'), ('las vegas'), ('salt lake city');

CREATE OR REPLACE FUNCTION num_cities() RETURNS INT LANGUAGE SQL AS $$ select count (1) from city $$;

CREATE OR REPLACE FUNCTION rand_city() RETURNS STRING LANGUAGE SQL AS
$$
select name from city offset ((public.num_cities()::float)*random())::int limit 1
$$;

select rand_city();

Note that calling one UDF from another, as rand_city does with num_cities, is currently not supported in the preview release of CockroachDB. This was just to add another example.

Calling UDFs using JPA

Calling UDFs is just the same as calling any other function like version():

@GetMapping(produces = {"text/plain", "application/json"})
public String ping() {
    return "Hello, your number is " +
            entityManager.createNativeQuery("select rand_phone_number()").getSingleResult();
}

Conclusion

UDFs are a useful mechanism to execute simple data processing functions close to the data, in the database itself. They eliminate the network roundtrip latency between the application/service tier and the database and, when used appropriately, can add performance benefits at the expense of visibility, since business logic ends up scattered outside of the bounded context of a business component.

Column Families in CockroachDB

Kai Niemi — Wed, 04 Jan 2023 16:25:45 GMT

In this article, we will take a closer look at using column families in CockroachDB to reduce contention on concurrent updates.

A column family is a group of columns in a table that is stored as a single key-value pair in the storage layer. By default, there is a single implicit column family created for all columns.

Adding column families is a method to impose a different structure on the table where each row is represented by multiple key-value pairs. This can improve performance for write operations (INSERT, UPDATE and DELETE) and given certain constraints also reduce contention from concurrent transactions, which is the topic of this article.

The test project we're going to use is a basic Spring Boot application that uses plain JPA and Hibernate for data access.

Source Code

The source code for examples of this article can be found on GitHub.

Use Case

The demo application uses a single table to store purchase orders. It comes in two flavors: one with a single column family and one with two families. Other than that, the tables are identical.

The operations we will perform will also be the same for both tables, but we are going to witness different transaction behavior due to the nature of the operations and the use of column families.

The table schema is as follows:

create table purchase_order1
(
    id                  integer        not null default unordered_unique_rowid(),
    bill_address1       varchar(255),
    bill_address2       varchar(255),
    bill_city           varchar(255),
    bill_country        varchar(16),
    bill_postcode       varchar(16),
    bill_to_first_name  varchar(255),
    bill_to_last_name   varchar(255),
    date_placed         date           not null default current_date(),
    deliv_to_first_name varchar(255),
    deliv_to_last_name  varchar(255),
    deliv_address1      varchar(255),
    deliv_address2      varchar(255),
    deliv_city          varchar(255),
    deliv_country       varchar(16),
    deliv_postcode      varchar(16),
    order_status        varchar(64),
    total_price         decimal(18, 2) not null,
    primary key (id)
);

create table purchase_order2
(
    id                  integer        not null default unordered_unique_rowid(),
    bill_address1       varchar(255),
    bill_address2       varchar(255),
    bill_city           varchar(255),
    bill_country        varchar(16),
    bill_postcode       varchar(16),
    bill_to_first_name  varchar(255),
    bill_to_last_name   varchar(255),
    date_placed         date           not null default current_date(),
    deliv_to_first_name varchar(255),
    deliv_to_last_name  varchar(255),
    deliv_address1      varchar(255),
    deliv_address2      varchar(255),
    deliv_city          varchar(255),
    deliv_country       varchar(16),
    deliv_postcode      varchar(16),
    order_status        varchar(64),
    total_price         decimal(18, 2) not null,
    primary key (id),
    family f1 (id, bill_address1, bill_address2, bill_city, bill_country, bill_postcode, bill_to_first_name, bill_to_last_name, date_placed, deliv_to_first_name, deliv_to_last_name, deliv_address1, deliv_address2, deliv_city, deliv_country, deliv_postcode, order_status),
    family f2 (total_price)
);

The sequence of SQL operations is to first read a single row and then update the same row, but writing to different columns. This will be done concurrently in separate sessions to visualize the effect.

The sequence of operations will be something like the following, for transactions T1 and T2 (time flows vertically):

Time	T1	T2
1	`begin;`	`begin;`
2	`select id,order_status from purchase_order1 where id=1;`
3	`(reads 1,'PLACED')`	`select id,total_price from purchase_order1 where id=1;`
4	`update purchase_order1 set order_status = 'CONFIRMED' where id = 1;`	`(reads 1,100.00)`
5		`update purchase_order1 set total_price = total_price + 5 where id = 1;`
6		`(blocked on T1)`
7	`commit;`
8		`commit; --- "ERROR: restart transaction.."`
9		(rollback)

The rollback for T2 is expected (yet undesired) behavior since we have contending read and write operations on the same row, even if the transactions write back to different columns.

Next, let's see how the same sequence works on the other table (purchase_order2) with a separate column family for the total_price column.

Time	T1	T2
1	`begin;`	`begin;`
2	`select id,order_status from purchase_order2 where id=1;`
3	`(reads 1,'PLACED')`	`select id,total_price from purchase_order2 where id=1;`
4	`update purchase_order2 set order_status = 'CONFIRMED' where id = 1;`	`(reads 1,100.00)`
5		`update purchase_order2 set total_price = total_price + 5 where id = 1;`
6	`commit;`	`commit;`

As we can see, both transactions can commit which is allowed in a serializable history since we are not writing to the same key, but to disjoint columns in different key-value pairs in the underlying key-value storage.
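
The difference between the two schedules can be pictured with a deliberately simplified model. The sketch below is not how CockroachDB is implemented. It only assumes that each column family of a row maps to its own key, and that two transactions contend when one writes a key the other reads or writes. All class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model: one key-value pair per (row, column family).
final class ColumnFamilyModel {
    private final Map<String, String> familyByColumn = new HashMap<>();

    /** Assigns the given columns to a named family. */
    ColumnFamilyModel family(String name, String... columns) {
        for (String column : columns) {
            familyByColumn.put(column, name);
        }
        return this;
    }

    /** Keys touched when reading or writing the given columns of one row. */
    Set<String> keysFor(long rowId, String... columns) {
        Set<String> keys = new HashSet<>();
        for (String column : columns) {
            keys.add(rowId + "/" + familyByColumn.get(column));
        }
        return keys;
    }

    /** Two transactions contend if either one writes a key the other reads or writes. */
    static boolean contend(Set<String> reads1, Set<String> writes1,
                           Set<String> reads2, Set<String> writes2) {
        return overlaps(writes1, reads2) || overlaps(writes1, writes2) || overlaps(writes2, reads1);
    }

    private static boolean overlaps(Set<String> a, Set<String> b) {
        for (String key : a) {
            if (b.contains(key)) {
                return true;
            }
        }
        return false;
    }

    /** T1 reads and writes order_status, T2 reads and writes total_price, both on row 1. */
    static boolean scheduleContends(ColumnFamilyModel table) {
        Set<String> t1 = table.keysFor(1, "order_status");
        Set<String> t2 = table.keysFor(1, "total_price");
        return contend(t1, t1, t2, t2);
    }

    public static void main(String[] args) {
        ColumnFamilyModel single = new ColumnFamilyModel()
                .family("primary", "order_status", "total_price");
        ColumnFamilyModel split = new ColumnFamilyModel()
                .family("f1", "order_status")
                .family("f2", "total_price");
        System.out.println("purchase_order1 contends: " + scheduleContends(single));
        System.out.println("purchase_order2 contends: " + scheduleContends(split));
    }
}
```

In this model the single-family table makes both transactions touch the same key, while the two-family table gives each transaction its own key. This is also why the read side matters: a transaction that reads all columns touches every family key of the row.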

JPA and Hibernate Considerations

There are a few considerations in terms of Hibernate entity mappings and query operations to make the best use of CockroachDB column families to reduce contention.

One is to always use an explicit projection in the reading part of a transaction. By default, all mapped columns of an entity are read, much like using a star projection. This is the case, for example, when using the entityManager#find method. If all columns were read in the last example above, the transaction would fail with a transient SQL error.

For the demo, we are using named queries for the different order entities, as follows:

@Entity
@Table(name = "purchase_order1")
@NamedQuery(name = "Order1.findByIdForUpdateStatus",
        query = "SELECT o.id,o.orderStatus,o.totalPrice FROM Order1 o WHERE o.id = ?1")
@NamedQuery(name = "Order1.findByIdForUpdatePrice",
        query = "SELECT o.id,o.totalPrice,o.orderStatus FROM Order1 o WHERE o.id = ?1")
public class Order1 extends AbstractOrder {
}

.. and ..

@Entity
@Table(name = "purchase_order2")
@NamedQuery(name = "Order2.findByIdForUpdateStatus",
        query = "SELECT o.id,o.orderStatus FROM Order2 o WHERE o.id = ?1")
@NamedQuery(name = "Order2.findByIdForUpdatePrice",
        query = "SELECT o.id,o.totalPrice FROM Order2 o WHERE o.id = ?1")
public class Order2 extends AbstractOrder {
}

The downside with projection is that we can't use a typed query to read the persistent domain object (PDO) into its attached state, but only use scalar values. It's fine for this example though, where we just write the data again, being selective about which columns to update:

entityManager
        .createQuery("update " + orderType.getSimpleName()
                + " o set o.orderStatus = :status"
                + " where o.id = :id")
        .setParameter("status", status)
        .setParameter("id", orderId)
        .executeUpdate();

// ..and..

entityManager
        .createQuery("update " + orderType.getSimpleName()
                + " o set o.totalPrice = o.totalPrice + :price"
                + " where o.id = :id")
        .setParameter("price", price)
        .setParameter("id", orderId)
        .executeUpdate();

Another method of being selective on which columns to update is to use the @DynamicInsert and @DynamicUpdate annotations in Hibernate:

@Entity
@Table(name = "purchase_order1")
@DynamicInsert
@DynamicUpdate
public class Order1 extends AbstractOrder {
}

This works for the update part, but not for projection.

Running the Demo

For demo cloning and building instructions, see the GitHub project.

Single Column Family

Get a form to create a new order:

curl "http://localhost:8090/order/v1/template" > o1.json

Submit the order form:

curl "http://localhost:8090/order/v1/" -H "Content-Type:application/json" -X POST -d "@o1.json"

Take note of the generated id value (which is 1 by default) and be prepared to rotate the order status with a commit delay of 5 sec to allow another update to come in between:

curl "http://localhost:8090/order/v1/1/status?delay=5" -i -X PUT

Now, within 5 sec after the previous PUT, increment the price on the same order ID, which will cause a serialization conflict:

curl "http://localhost:8090/order/v1/1/price?price=5&delay=0" -i -X PUT

This request will fail with an expected 40001 error. That's fine, but let's try to avoid this contention effect.

Multiple Column Families

Submit the order form again but using a different URL:

curl "http://localhost:8090/order/v2/" -H "Content-Type:application/json" -X POST -d "@o1.json"

Let's fire the status update and delay the commit by 5 sec:

curl "http://localhost:8090/order/v2/1/status?delay=5" -i -X PUT

Now, within 5 sec, increment the price on the same order ID which will succeed:

curl "http://localhost:8090/order/v2/1/price?price=5&delay=0" -i -X PUT

That's it, no retry errors.

Conclusion

Column families are a simple technique that can help to reduce contention and improve write performance in CockroachDB. The key consideration to make this work in JPA and Hibernate is to use both projections and selective updates when writing entity state back to the database.

Transaction Retries using JavaEE and CDI with CMTs

Kai Niemi — Thu, 08 Dec 2022 07:34:00 GMT

In this post, we'll use the same concept as in Transaction Retries using JavaEE and CDI with BMTs, only this time with container-managed transactions (CMTs), which is the default mode of operation with JTA.

Transaction Retries in JavaEE

This article demonstrates an AOP-driven retry strategy for JavaEE apps using Stateless Session Beans with container-managed transactions (CMT), using the same stack and demo use case as in Transaction Retries using JavaEE and CDI with BMTs.

Source Code

The source code for examples of this article can be found on GitHub.

What's the difference

To use bean-managed transactions, you would just add the @TransactionManagement annotation and set the transaction attributes accordingly:

@Stateless
@TransactionManagement(TransactionManagementType.BEAN)
@TransactionAttribute(REQUIRES_NEW)
public class OrderService {
    @TransactionBoundary
    public Order placeOrder(Order order) {
        Assert.isTrue(entityManager.isJoinedToTransaction(), "Expected transaction!");
        entityManager.persist(order);
        return order;
    }
}

With container-managed transactions (the default), you can either be explicit or leave out the @TransactionManagement annotation. Then add @TransactionAttribute(NOT_SUPPORTED) alongside @TransactionBoundary on the boundary methods. This tells the container not to start a new transaction when invoking the method:

@Stateless
@TransactionManagement(TransactionManagementType.CONTAINER)
public class OrderService {
    @TransactionBoundary
    @TransactionAttribute(NOT_SUPPORTED)
    public Order placeOrder(Order order) {
        Assert.isTrue(entityManager.isJoinedToTransaction(), "Expected transaction!");
        entityManager.persist(order);
        return order;
    }
}

So the interesting question now is: How can that assertion still be true, given that NOT_SUPPORTED propagation is used?

The answer sits in the @TransactionBoundary annotation, which uses an @InterceptorBinding to wire in the TransactionRetryInterceptor, which in turn invokes a transaction service with REQUIRES_NEW propagation. The effect is that although the service boundary method appears to be non-transactional, it actually is transactional, but its invocation is now deferred to a retry loop in the interceptor.

https://gist.github.com/kai-niemi/cd6b4182a596e8afe02fb6173221a739
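
The linked gist holds the interceptor used in the demo. For readers who want the general idea without the container, here is a rough, framework-free sketch of such a retry loop. It is not the author's interceptor: the class name, the TransactionalAction interface, the attempt limit and the backoff values are all assumptions, and the action stands in for the call to the transaction service that runs with REQUIRES_NEW.

```java
import java.sql.SQLException;

// Hypothetical, container-free sketch of the retry loop an interceptor could run.
final class RetryLoop {
    static final String SERIALIZATION_FAILURE = "40001";

    /** Stand-in for the call into a transaction service using REQUIRES_NEW. */
    @FunctionalInterface
    interface TransactionalAction<T> {
        T run() throws SQLException;
    }

    /** Runs the action, retrying on SQL state 40001 with exponential backoff. */
    static <T> T execute(TransactionalAction<T> action, int maxAttempts) {
        long backoffMillis = 50;
        for (int attempt = 1; ; attempt++) {
            try {
                // Each call is expected to begin and commit its own transaction.
                return action.run();
            } catch (SQLException e) {
                boolean retryable = SERIALIZATION_FAILURE.equals(e.getSQLState());
                if (!retryable || attempt >= maxAttempts) {
                    throw new IllegalStateException("Giving up after " + attempt + " attempt(s)", e);
                }
                sleep(backoffMillis);
                backoffMillis = Math.min(backoffMillis * 2, 1_000);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = execute(() -> {
            if (++calls[0] < 3) {
                throw new SQLException("restart transaction", SERIALIZATION_FAILURE);
            }
            return "committed after " + calls[0] + " attempts";
        }, 5);
        System.out.println(result);
    }
}
```

The important design point is that every attempt runs in a fresh transaction, which is why the boundary method itself must not already be inside one when the loop starts.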

This is a less intrusive approach to add retry logic to session beans and service activators (message listeners) when already invested in BMTs. No major refactoring efforts are needed.

Demo

To try this out, we'll use the same order system designed to produce unrepeatable read (aka read/write) conflicts to activate the retry mechanism.

Building

Prerequisites

JDK8+ with 1.8 language level (OpenJDK compatible)
Maven 3+ (optional, embedded)
CockroachDB v22.1+ database

Install the JDK (Linux):

sudo apt-get -qq install -y openjdk-8-jdk

Clone the project

git clone git@github.com:kai-niemi/retry-demo.git
cd retry-demo

Build the project

chmod +x mvnw
./mvnw clean install

Setup

Create the database:

cockroach sql --insecure --host=localhost -e "CREATE database orders"

Create the schema:

cockroach sql --insecure --host=localhost --database orders < src/resources/conf/create.sql

Start the app:

../mvnw clean install tomee:run

The default listen port is 8090 (can be changed in pom.xml):

Usage

Open another shell and check that the service is up and connected to the DB:

curl http://localhost:8090/api

Get Order Request Form

This prints out an order form template that we will use to create new orders:

curl http://localhost:8090/api/order/template | jq

Alternatively, pipe it to a file:

curl http://localhost:8090/api/order/template > form.json

Submit Order Form

Create a new purchase order:

curl http://localhost:8090/api/order -i -X POST \
-H 'Content-Type: application/json' \
-d '{
    "billAddress": {
        "address1": "Street 1.1",
        "address2": "Street 1.2",
        "city": "City 1",
        "country": "Country 1",
        "postcode": "Code 1"
    },
    "customerId": -1,
    "deliveryAddress": {
        "address1": "Street 2.1",
        "address2": "Street 2.2",
        "city": "City 2",
        "country": "Country 2",
        "postcode": "Code 2"
    },
    "requestId": "bc3cba97-dee9-41b2-9110-2f5dfc2c5dae"
}'

Or using the file:

curl http://localhost:8090/api/order -H "Content-Type:application/json" -X POST \
-d "@form.json"

Produce a Read/Write Conflict

Assume we have an order with ID 1 in status PLACED. We will now read that order and change its status to something else using two concurrent transactions. This is known as an unrepeatable read conflict, which 1SR (serializable) isolation prevents from happening.

To have a predictable outcome, we'll use two sessions with a controllable delay between the read and write operations.

Overview of SQL operations:

select * from purchase_order where id=1; -- T1, status is `PLACED`
-- wait 5s (T1)
select * from purchase_order where id=1; -- T2
-- wait 5s (T2)
update purchase_order set status='CONFIRMED' where id=1; -- T1
update purchase_order set status='PAID' where id=1; -- T2
commit; -- T1
commit; -- T2 ERROR!

Prepare to run the first command:

curl http://localhost:8090/api/order/1?status=CONFIRMED\&delay=5000 -i -X PUT

Open another session and run a similar command within 5 seconds of the first one:

curl http://localhost:8090/api/order/1?status=PAID\&delay=5000 -i -X PUT

When the two commands overlap in time like this, the second transaction runs into a serialization conflict like this:

ERROR: restart transaction: TransactionRetryWithProtoRefreshError: WriteTooOldError: write for key /Table/109/1/12/0 at timestamp 1669990868.355588000,0 too old; wrote at 1669990868.778375000,3: "sql txn" meta={id=92409d02 key=/Table/109/1/12/0 pri=0.03022202 epo=0 ts=1669990868.778375000,3 min=1669990868.355588000,0 seq=0} lock=true stat=PENDING rts=1669990868.355588000,0 wto=false gul=1669990868.855588000,0

The interceptor will, however, catch this error since it carries SQL state code 40001, retry the business method, and eventually succeed and deliver a 200 OK to the client.
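The check for SQL state 40001 can be looked at in isolation. The following is a minimal sketch, not code from the demo project: the class name RetryableErrors and its method are hypothetical, and it only assumes that the database error surfaces as a java.sql.SQLException somewhere in the cause chain.

```java
import java.sql.SQLException;

// Hypothetical helper (not part of the demo project): decides whether an
// exception chain contains a retryable serialization failure.
final class RetryableErrors {
    // SQL state 40001 is the standard code for a serialization failure.
    static final String SERIALIZATION_FAILURE = "40001";

    private RetryableErrors() {
    }

    static boolean isRetryable(Throwable error) {
        // Walk the cause chain, since containers tend to wrap the original error.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof SQLException
                    && SERIALIZATION_FAILURE.equals(((SQLException) t).getSQLState())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SQLException conflict = new SQLException("restart transaction", SERIALIZATION_FAILURE);
        // A wrapped serialization failure is still considered retryable.
        System.out.println(isRetryable(new RuntimeException("wrapped", conflict)));
        // Any other SQL state is treated as non-transient.
        System.out.println(isRetryable(new SQLException("duplicate key", "23505")));
    }
}
```

Anything that is not a 40001 error, such as a constraint violation, should propagate to the caller rather than be retried.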

Conclusion

In this article, we implemented a transaction retry strategy for JavaEE stateless session beans using container-managed transactions and a custom interceptor with interceptor bindings.

This reduces the retry logic in ordinary service beans to simply adding a @TransactionBoundary meta-annotation and changing the transaction attribute to @TransactionAttribute(NOT_SUPPORTED).

Transaction Retries using JavaEE and CDI with BMTs

Kai Niemi — Sun, 04 Dec 2022 19:34:58 GMT

In a previous post, we demonstrated client-side transaction retries by using meta-annotations and Aspect Oriented Programming (AOP) in Spring Boot.

In this post, we'll use a similar concept for a different Java stack: JavaEE (or JakartaEE as it's known today) and interceptors using the Contexts and Dependency Injection (CDI) framework.

Transaction Retries in JavaEE

This article demonstrates an AOP-driven retry strategy for JavaEE apps using the following stack:

Stateless session beans with bean-managed transactions
@AroundInvoke interceptor for retries
@TransactionBoundary meta-annotation with the interceptor binding
JAX-RS REST endpoint for testing
TomEE 8 as an embedded container with the web profile
JPA and Hibernate for data access

Source Code

The source code for examples of this article can be found on GitHub.

Solution

The Interceptor or proxy pattern is generally used to add cross-cutting functionality or logic in an application without code duplication. Transaction management is a typical cross-cutting concern which is provided out of the box in JavaEE. Public bean methods in stateless/stateful session beans are automatically transactional, given that container-managed transactions (CMTs) are used.

In this example, the service method acting as a boundary should always use REQUIRES_NEW propagation.

@Stateless
public class OrderService {
    @PersistenceContext(unitName = "orderSystemPU")
    private EntityManager entityManager;

    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public List<Order> findAllOrders() {
        CriteriaQuery<Order> cq = entityManager.getCriteriaBuilder().createQuery(Order.class);
        cq.select(cq.from(Order.class));
        return entityManager.createQuery(cq).getResultList();
    }
}

A natural extension to transparent transaction management is transaction retries. If a transaction fails with a transient serialization error, it should be automatically retried several times. This behaviour fits well into an interceptor that we'll use through an interceptor binding (annotation).

In the same example, like this:

@Stateless
public class OrderService {
    @PersistenceContext(unitName = "orderSystemPU")
    private EntityManager entityManager;

    @TransactionBoundary
    public Order updateOrder(Order order) {
        entityManager.merge(order);
        return order;
    }
}

To mark classes and methods as repeatable transaction boundaries, let's first create the interceptor binding:

@Inherited
@InterceptorBinding
@Target({TYPE, METHOD})
@Retention(RUNTIME)
public @interface TransactionBoundary {
}

After creating the interceptor binding we need to create the actual interceptor implementation:

@TransactionBoundary
@Interceptor
@Priority(Interceptor.Priority.APPLICATION)
public class TransactionRetryInterceptor {
    public static final int MAX_RETRY_ATTEMPTS = 10;

    public static final int MAX_BACKOFF_TIME_MILLIS = 15000;

    @PersistenceContext(unitName = "orderSystemPU")
    private EntityManager entityManager;

    @Inject
    private TransactionService transactionService;

    @Inject
    private Logger logger;

    @AroundInvoke
    public Object aroundTransactionBoundary(InvocationContext ctx) throws Exception {
        // The retry loop must start outside of any transaction
        Assert.isFalse(entityManager.isJoinedToTransaction(), "Expected no transaction!");

        logger.info("Intercepting transactional method in retry loop: {}", ctx.getMethod().toGenericString());

        // Attempts are 1-based, so <= allows up to MAX_RETRY_ATTEMPTS attempts
        for (int attempt = 1; attempt <= MAX_RETRY_ATTEMPTS; attempt++) {
            try {
                // Run the business method inside a new transaction
                Object rv = transactionService.executeWithinTransaction(ctx::proceed);
                if (attempt > 1) {
                    logger.info("Recovered from transient error (attempt {}): {}",
                            attempt, ctx.getMethod().toGenericString());
                } else {
                    logger.info("Transactional method completed (attempt {}): {}",
                            attempt, ctx.getMethod().toGenericString());
                }
                return rv;
            } catch (Exception ex) {
                Throwable t = ExceptionUtils.getMostSpecificCause(ex);
                if (t instanceof SQLException) {
                    SQLException sqlException = (SQLException) t;
                    // SQL state 40001 signals a transient serialization failure
                    if (PSQLState.SERIALIZATION_FAILURE.getState().equals(sqlException.getSQLState())) {
                        // Exponential backoff with random jitter, capped at the maximum.
                        // ThreadLocalRandom.current() is called on the executing thread
                        // rather than cached in a static field.
                        long backoffMillis = Math.min(
                                (long) (Math.pow(2, attempt) + ThreadLocalRandom.current().nextInt(0, 1000)),
                                MAX_BACKOFF_TIME_MILLIS);
                        logger.warn("Detected transient error (attempt {}) backoff for {}ms: {}",
                                attempt, backoffMillis, sqlException);
                        try {
                            Thread.sleep(backoffMillis);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    } else {
                        logger.info("Detected non-transient error (propagating): {}", t.getMessage());
                        throw ex;
                    }
                } else {
                    logger.info("Detected non-transient error (propagating): {}", t.getMessage());
                    throw ex;
                }
            }
        }
        throw new SQLTransactionRollbackException("Too many serialization conflicts - giving up retries!");
    }
}
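To get a feel for the backoff schedule, the delay formula can be pulled out and run on its own. This is a sketch that mirrors the interceptor's formula and constant; the class BackoffSketch and the explicit jitter parameter are assumptions made to keep the calculation deterministic and testable.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical standalone version of the interceptor's backoff calculation.
final class BackoffSketch {
    static final long MAX_BACKOFF_TIME_MILLIS = 15000;

    private BackoffSketch() {
    }

    // 2^attempt milliseconds plus the supplied jitter, capped at the maximum.
    static long backoffMillis(int attempt, int jitterMillis) {
        return Math.min((long) Math.pow(2, attempt) + jitterMillis, MAX_BACKOFF_TIME_MILLIS);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 10; attempt++) {
            // Jitter in the range [0, 1000) as in the interceptor.
            int jitter = ThreadLocalRandom.current().nextInt(0, 1000);
            System.out.printf("attempt %d: sleep %d ms%n", attempt, backoffMillis(attempt, jitter));
        }
    }
}
```

With at most ten attempts the exponential term tops out at 1024 ms, so the longest possible sleep is 2023 ms and the 15 second cap only comes into play if the attempt limit is raised.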

There are a few things to highlight here. The first is that the beans need to use bean-managed transactions (BMTs), which defers transaction demarcation to a dedicated service invoked from the interceptor. Otherwise, with container-managed transactions, the interceptor is called within the scope of the very transaction that may fail with a serialization error. A retry at that point would do no good, since the retry logic must be applied before the transaction is created.

Hence the purpose of `TransactionService` which makes a callback to the original business method within a transaction scope.

@Stateless
public class TransactionService {
    @TransactionAttribute(REQUIRES_NEW)
    public <T> T executeWithinTransaction(final Callable<T> task) throws Exception {
        return task.call();
    }
}

The actual stateless service bean declares its intent to use bean-managed transactions:

@Stateless
@TransactionManagement(TransactionManagementType.BEAN)
public class OrderService { ...

To try this out, we'll use a very unsophisticated order system designed to produce unrepeatable read (aka read/write) conflicts to activate the retry mechanism.

We will use multiple concurrent connections and read and write to the same key but with different values. Without client-side retries, one of the transactions would be rolled back with a serialization conflict. With retries, the failed transaction will be retried and eventually run to completion, transparently towards the application logic.
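The effect of transparent retries can be simulated without a container or a database. In the sketch below, everything is a stand-in: RetryLoopSketch, TransientConflictException and the failing task are hypothetical, and the conflict is simulated by failing the first two calls.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, container-free illustration of a client-side retry loop.
final class RetryLoopSketch {
    // Stand-in for a transient serialization conflict.
    static final class TransientConflictException extends RuntimeException {
        TransientConflictException(String message) {
            super(message);
        }
    }

    private RetryLoopSketch() {
    }

    static <T> T callWithRetries(Callable<T> task, int maxAttempts) throws Exception {
        TransientConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Each call represents one complete transaction attempt.
                return task.call();
            } catch (TransientConflictException ex) {
                // A real implementation would back off here before retrying.
                last = ex;
            }
        }
        throw new IllegalStateException("Giving up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        String result = callWithRetries(() -> {
            // Simulate two conflicts followed by a successful commit.
            if (calls.incrementAndGet() < 3) {
                throw new TransientConflictException("simulated conflict");
            }
            return "CONFIRMED";
        }, 10);
        System.out.println(result + " after " + calls.get() + " calls");
    }
}
```

The caller only sees the final result, which is the same property the interceptor gives the REST endpoint: the conflict is absorbed in the retry loop instead of surfacing as an error.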

Building

Prerequisites

JDK8+ with 1.8 language level (OpenJDK compatible)
Maven 3+ (optional, embedded)
CockroachDB v22.1+ database

Install the JDK (Linux):

sudo apt-get -qq install -y openjdk-8-jdk

Clone the project

git clone git@github.com:kai-niemi/retry-demo.git
cd retry-demo

Build the project

chmod +x mvnw
./mvnw clean install

Setup

Create the database:

cockroach sql --insecure --host=localhost -e "CREATE database orders"

Create the schema:

cockroach sql --insecure --host=localhost --database orders < src/resources/conf/create.sql

Start the app:

../mvnw clean install tomee:run

The default listen port is 8090 (can be changed in pom.xml).

Usage

Open another shell and check that the service is up and connected to the DB:

curl http://localhost:8090/api

Get Order Request Form

This prints out an order form template that we will use to create new orders:

curl http://localhost:8090/api/order/template | jq

Alternatively, pipe it to a file:

curl http://localhost:8090/api/order/template > form.json

Submit Order Form

Create a new purchase order:

curl http://localhost:8090/api/order -i -X POST \
-H 'Content-Type: application/json' \
-d '{
    "billAddress": {
        "address1": "Street 1.1",
        "address2": "Street 1.2",
        "city": "City 1",
        "country": "Country 1",
        "postcode": "Code 1"
    },
    "customerId": -1,
    "deliveryAddress": {
        "address1": "Street 2.1",
        "address2": "Street 2.2",
        "city": "City 2",
        "country": "Country 2",
        "postcode": "Code 2"
    },
    "requestId": "bc3cba97-dee9-41b2-9110-2f5dfc2c5dae"
}'

Or using the file:

curl http://localhost:8090/api/order -H "Content-Type:application/json" -X POST \
-d "@form.json"

Produce a Read/Write Conflict

To have a predictable outcome, we'll use two sessions with a controllable delay between the read and write operations.

Overview of SQL operations:

select * from purchase_order where id=1; -- T1, status is `PLACED`
-- wait 5s (T1)
select * from purchase_order where id=1; -- T2
-- wait 5s (T2)
update purchase_order set status='CONFIRMED' where id=1; -- T1
update purchase_order set status='PAID' where id=1; -- T2
commit; -- T1
commit; -- T2 ERROR!

Prepare to run the first command:

curl http://localhost:8090/api/order/1?status=CONFIRMED\&delay=5000 -i -X PUT

Open another session and run a similar command within 5 seconds of the first one:

curl http://localhost:8090/api/order/1?status=PAID\&delay=5000 -i -X PUT

When the two commands overlap in time like this, the second transaction runs into a serialization conflict like this:

ERROR: restart transaction: TransactionRetryWithProtoRefreshError: WriteTooOldError: write for key /Table/109/1/12/0 at timestamp 1669990868.355588000,0 too old; wrote at 1669990868.778375000,3: "sql txn" meta={id=92409d02 key=/Table/109/1/12/0 pri=0.03022202 epo=0 ts=1669990868.778375000,3 min=1669990868.355588000,0 seq=0} lock=true stat=PENDING rts=1669990868.355588000,0 wto=false gul=1669990868.855588000,0

The interceptor will, however, catch this error since it carries SQL state code 40001, retry the business method, and eventually succeed and deliver a 200 OK to the client.

Conclusion

In this article, we implemented a transaction retry strategy for JavaEE stateless session beans using bean-managed transactions and a custom interceptor with interceptor bindings. This reduces the retry logic in ordinary service beans to simply adding a @TransactionBoundary meta-annotation.

Archival Partitioning with CockroachDB

Kai Niemi — Mon, 31 Oct 2022 18:04:35 GMT

Overview

This article is an introduction to CockroachDB archival partitioning, which is a form of table partitioning that allows storing infrequently-accessed data on specific nodes in a cluster with slower and cheaper storage.

Archival Use Cases

CockroachDB is crafted to provide strong consistency with high scalability, performance and fault-tolerance for OLTP workloads across all parts of the keyspace. For optimal performance it's recommended to use local SSDs or NVMe storage.

For high data volumes, it's quite common to access only a small fraction of the entire keyspace while the rest just "sits around" for compliance reasons, much like the Pareto distribution principle. This is typically where data archiving solutions come into play. The purpose of an archival strategy is to move infrequently accessed, long-tail data to separate storage in order to offload the primary database. In CockroachDB, however, the long-tail data is not necessarily moved off the cluster, but rather relocated to nodes with a different hardware profile to reduce cost over time.

Archival in general is the process of moving long-tail data to potentially separate and offline storage, but in CockroachDB it can still be online. This can be useful to support certain types of business operations with a long retention period, for example payment or deposit reversals.

Introduction to Partitioning

Manual table partitioning in CockroachDB can be applied in two different ways:

List / Geo-partitioning allows storing user data close to the proximity of access, which reduces the distance that the data needs to travel, thereby reducing latency.
Range / Archival-partitioning allows storing infrequently-accessed data on slower and cheaper storage, thereby reducing costs.

List partitioning is a good fit for geographic distribution where the partition keys are fairly few in number (like country codes, regions or jurisdictions). List partitioning gives control of both leaseholder and replica placement, which in turn gives predictable read and write performance and the ability to pin data at row level.

Range or archival partitioning, on the other hand, is a good fit for moving long-tail, cold data ranges to slower/cheaper hardware for archival, based on some timestamp, date range or other interval criteria. For example, automatically moving financial payment transactions older than 90 days to archival-type database nodes using slower but bigger disks, or even shared storage.

The most common method is list partitioning, which is semantically equivalent to the automated mechanisms used to provide CockroachDB's multi-region capabilities. These are declarative in nature and automatically handle geo-partitioning and other low-level details. It's more or less just about defining the survival goal for a database:

ALTER DATABASE <database> SURVIVE REGION FAILURE;

This post is however referring to the "manual" approach of defining table partitions, which is most relevant for archival use cases.

Archival Partitioning Example

Let's take a simple example: a payments table with a booking_date column that is part of the composite primary key:

CREATE TABLE payments(
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    booking_date DATE,
    reference STRING,
    amount DECIMAL NOT NULL,
    currency STRING NOT NULL,
    archived BOOL NOT NULL DEFAULT false,
    PRIMARY KEY (booking_date, id)
);

Now, let's use the range partitioning syntax to qualify payment transactions for archival:

ALTER TABLE payments PARTITION BY RANGE (booking_date) (
    PARTITION archived VALUES FROM (MINVALUE) TO ('2022-06-01'),
    PARTITION recent VALUES FROM ('2022-06-01') TO (MAXVALUE)
);
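The boundary in the statement above is a fixed date. For a rolling window, such as the 90-day example mentioned earlier, the statement would need to be regenerated with a new cutoff from time to time. The following is a rough sketch of how that statement could be built; the class ArchivalPartitionDdl, its method and the idea of re-running it on a schedule are assumptions, not something the original setup does.

```java
import java.time.LocalDate;

// Hypothetical builder for the range partitioning statement with a rolling cutoff.
final class ArchivalPartitionDdl {
    private ArchivalPartitionDdl() {
    }

    // The table and column names are concatenated as-is, so they must come
    // from trusted configuration and never from user input.
    static String partitionByCutoff(String table, String column, LocalDate today, int retentionDays) {
        // LocalDate#toString yields the ISO format (yyyy-MM-dd) used in the SQL literal.
        LocalDate cutoff = today.minusDays(retentionDays);
        return "ALTER TABLE " + table + " PARTITION BY RANGE (" + column + ") ("
                + "PARTITION archived VALUES FROM (MINVALUE) TO ('" + cutoff + "'), "
                + "PARTITION recent VALUES FROM ('" + cutoff + "') TO (MAXVALUE))";
    }

    public static void main(String[] args) {
        System.out.println(partitionByCutoff("payments", "booking_date", LocalDate.now(), 90));
    }
}
```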

Assume we label the type of storage on each node through the attrs field of the --store flag when starting the nodes:

--store=path=/mnt/ssd01,attrs=ssd
--store=path=/mnt/hda1,attrs=hdd

Start command examples, for illustration:

n1: --insecure --background --locality=region=europe-west1,datacenter=europe-west1a --store=path=datafiles/n1,size=15%,attrs=ssd --listen-addr=192.168.2.1:26257 --http-addr=192.168.2.1:7071 --join=192.168.2.1:26257
n2: --insecure --background --locality=region=europe-west1,datacenter=europe-west1b --store=path=datafiles/n2,size=15%,attrs=ssd --listen-addr=192.168.2.1:26258 --http-addr=192.168.2.1:7072 --join=192.168.2.1:26257
n3: --insecure --background --locality=region=europe-west1,datacenter=europe-west1c --store=path=datafiles/n3,size=15%,attrs=ssd --listen-addr=192.168.2.1:26259 --http-addr=192.168.2.1:7073 --join=192.168.2.1:26257
n4: --insecure --background --locality=region=europe-west2,datacenter=europe-west2b --store=path=datafiles/n4,size=15%,attrs=hdd --listen-addr=192.168.2.1:26260 --http-addr=192.168.2.1:7074 --join=192.168.2.1:26257
n5: --insecure --background --locality=region=europe-west3,datacenter=europe-west3a --store=path=datafiles/n5,size=15%,attrs=hdd --listen-addr=192.168.2.1:26261 --http-addr=192.168.2.1:7075 --join=192.168.2.1:26257
n6: --insecure --background --locality=region=europe-west3,datacenter=europe-west3a --store=path=datafiles/n6,size=15%,attrs=hdd --listen-addr=192.168.2.1:26262 --http-addr=192.168.2.1:7076 --join=192.168.2.1:26257

With the stores labeled, each partition of the table can be pinned to matching stores through zone configuration constraints, so that range replicas holding payments older than 2022-06-01 are stored on nodes with HDDs and all other replicas on nodes with SSDs:

ALTER PARTITION recent OF TABLE payments CONFIGURE ZONE USING constraints='[+ssd]';
ALTER PARTITION archived OF TABLE payments CONFIGURE ZONE USING constraints='[+hdd]';

Notice that these attributes are just arbitrary tags or labels, so you could use any name.

Before applying the zone configurations, let's check the current range distribution:

SHOW RANGES FROM TABLE payments;

You should see something to the effect of:

  start_key | end_key | range_id | range_size_mb | lease_holder |            lease_holder_locality             | replicas |                                                               replica_localities
------------+---------+----------+---------------+--------------+----------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------
  NULL      | NULL    |       85 |             0 |            3 | region=europe-west1,datacenter=europe-west1b | {3,4,6}  | {"region=europe-west1,datacenter=europe-west1b","region=europe-west2,datacenter=europe-west2b","region=europe-west3,datacenter=europe-west3a"}
(1 row)

Now let's configure the zones and apply them to the corresponding partitions. This is when the actual rebalancing takes effect:

ALTER PARTITION recent OF TABLE payments CONFIGURE ZONE USING constraints='[+ssd]';
ALTER PARTITION archived OF TABLE payments CONFIGURE ZONE USING constraints='[+hdd]';

Let's verify the range distribution again:

SHOW RANGES FROM TABLE payments;

You should now see something like this:

  start_key | end_key | range_id | range_size_mb | lease_holder |            lease_holder_locality             | replicas |                                                               replica_localities
------------+---------+----------+---------------+--------------+----------------------------------------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------
  NULL      | /19144  |       70 |             0 |            5 | region=europe-west3,datacenter=europe-west3a | {4,5,6}  | {"region=europe-west2,datacenter=europe-west2b","region=europe-west3,datacenter=europe-west3a","region=europe-west3,datacenter=europe-west3a"}
  /19144    | NULL    |       71 |             0 |            3 | region=europe-west1,datacenter=europe-west1b | {1,2,3}  | {"region=europe-west1,datacenter=europe-west1a","region=europe-west1,datacenter=europe-west1c","region=europe-west1,datacenter=europe-west1b"}
(2 rows)

Let's insert some data into the table to see how things get placed:

INSERT INTO payments(booking_date,reference,amount,currency,archived)
SELECT
    current_date()-170,
    md5(random()::text),
    (no::FLOAT * random())::decimal,
    'SEK',
    false
FROM generate_series(1,100) no;

So here we added payments which are old enough to sort into the archived partition. Let's add a similar volume that is more recent also:

INSERT INTO payments(booking_date,reference,amount,currency,archived)
SELECT
    current_date(),
    md5(random()::text),
    (no::FLOAT * random())::decimal,
    'SEK',
    false
FROM generate_series(1,150) no;

There should be 250 payments in total:

select count(1) from payments where booking_date < '2022-06-01'; -- 100
select count(1) from payments where booking_date >= '2022-06-01'; -- 150
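These counts depend on when the inserts were run, since the booking dates are relative to current_date(). The date arithmetic can be checked with a small sketch. It assumes the inserts ran on the publication date of this post (2022-10-31) and treats range partitions as half-open intervals, where the lower bound is inclusive and the upper bound exclusive; the class PartitionLookup is hypothetical.

```java
import java.time.LocalDate;

// Hypothetical model of the two range partitions on booking_date.
final class PartitionLookup {
    static final LocalDate BOUNDARY = LocalDate.of(2022, 6, 1);

    private PartitionLookup() {
    }

    // archived covers [MINVALUE, boundary) and recent covers [boundary, MAXVALUE).
    static String partitionFor(LocalDate bookingDate) {
        return bookingDate.isBefore(BOUNDARY) ? "archived" : "recent";
    }

    public static void main(String[] args) {
        // Assumed execution date of the INSERT statements.
        LocalDate runDate = LocalDate.of(2022, 10, 31);
        LocalDate oldPayment = runDate.minusDays(170);
        System.out.println(oldPayment + " -> " + partitionFor(oldPayment));
        System.out.println(runDate + " -> " + partitionFor(runDate));
    }
}
```

Run at a later date, current_date()-170 may no longer fall before the 2022-06-01 boundary, in which case both batches land in the recent partition.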

Last but not least, how do we tell where each payment row is stored? We can use SHOW RANGE FROM TABLE ... FOR ROW, which asks the database where it would store a given key; it doesn't tell us whether the row actually exists. First, let's find a row we know exists:

select booking_date, id from payments where booking_date < '2022-06-01' limit 1;

Now let's find the range for that composite primary key:

select replicas,replica_localities from [
    SHOW RANGE FROM TABLE payments FOR ROW ('2022-05-14'::date,'078678a4-2448-4189-8e78-0968544bf3da'::uuid)
];
  replicas |                                                               replica_localities
-----------+-------------------------------------------------------------------------------------------------------------------------------------------------
  {4,5,6}  | {"region=europe-west2,datacenter=europe-west2b","region=europe-west3,datacenter=europe-west3a","region=europe-west3,datacenter=europe-west3a"}
(1 row)

Now we can confirm that the row key is stored in range replicas on nodes 4, 5 and 6 which, if you recall, are the nodes with the hdd store attribute matched by the [+hdd] constraint.

Let's run the same thing for the more recent payments:

select booking_date, id from payments where booking_date >= '2022-06-01' limit 1;

and:

select replicas,replica_localities from [
    SHOW RANGE FROM TABLE payments FOR ROW ('2022-10-31'::date,'00071e6b-cc7d-4217-ac92-2a7f5a3cb923'::uuid)
];
  replicas |                                                               replica_localities
-----------+-------------------------------------------------------------------------------------------------------------------------------------------------
  {1,2,3}  | {"region=europe-west1,datacenter=europe-west1a","region=europe-west1,datacenter=europe-west1c","region=europe-west1,datacenter=europe-west1b"}
(1 row)

Now we have verified that this row key both exists and is stored in range replicas on nodes 1, 2 and 3, which have the ssd store attribute matched by the [+ssd] constraint.

Archival Solution

Eventually you may also want to move long-tail, range-partitioned data to separate, offline cloud storage or similar, and then purge or delete the source from the online transactional database. CockroachDB doesn't currently support dropping data associated with partitions (only the configs, in which case data gets rebalanced again). There are still methods to achieve a similar outcome.

One approach is to create a changefeed to a cloud storage sink (or use Kafka or a webhook to bridge between) and then issue an UPDATE to tables with rows to be permanently archived.

UPDATE payments SET archived=true
WHERE booking_date < '2022-06-01' AND NOT archived
RETURNING id;

This will return a list of payments marked for archival. In addition, the update will emit a group of change events (CDC) describing each to-be-deleted row in either JSON or Avro format. There's an option to include the values in the change stream, which means no information is lost in the archival store.

After that, the next step is to issue DELETE statements in batches for the archived rows, using a predicate to select the rows that have been streamed to downstream cloud storage.

DELETE FROM payments WHERE archived=true;

You would want to do some checkpointing first to ensure that all rows have been sent to the downstream sink before deleting.
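The batching itself can be sketched as a simple loop. This is an illustration under stated assumptions: the class BatchedDelete is hypothetical, batches are bounded with a LIMIT clause on the DELETE statement, and the executeUpdate function stands in for something like java.sql.Statement#executeUpdate so that the loop can be shown without a database connection.

```java
import java.util.function.ToIntFunction;

// Hypothetical batch-delete loop for rows already marked as archived.
final class BatchedDelete {
    private BatchedDelete() {
    }

    // Repeats the statement until a batch removes fewer rows than the limit.
    static int deleteArchived(ToIntFunction<String> executeUpdate, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        String sql = "DELETE FROM payments WHERE archived=true LIMIT " + batchSize;
        int total = 0;
        int deleted;
        do {
            // Each call is expected to run as its own short transaction.
            deleted = executeUpdate.applyAsInt(sql);
            total += deleted;
        } while (deleted >= batchSize);
        return total;
    }

    public static void main(String[] args) {
        // Simulated table holding 250 archived rows.
        int[] remaining = {250};
        int total = deleteArchived(sql -> {
            int n = Math.min(100, remaining[0]);
            remaining[0] -= n;
            return n;
        }, 100);
        System.out.println("Deleted " + total + " rows");
    }
}
```

Keeping each batch small limits the size of every delete transaction, which also reduces the chance of running into the serialization conflicts discussed in the retry posts.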

This type of archival cycle could be built into a shared, domain-agnostic service, or become a per-service responsibility. Data could also be deleted using Row-Level TTL.

Conclusion

We looked at CockroachDB's archival partitioning mechanism for transparently moving cold data to nodes with different store attributes. This can be part of an archival strategy to ultimately remove data after staging it on nodes provisioned with less expensive storage.

Multi-region Deployments with CockroachDB - Part 2

Kai Niemi — Mon, 31 Oct 2022 18:02:49 GMT

Overview

This article is an introduction to the high-availability and multi-region capabilities of CockroachDB, with a focus on region-level survival. In the first part, we looked into the default zone-level survival goal.

Surviving region level failures

Region-level survival means that the database will remain fully available for reads and writes, even if an entire region (or a majority of its AZs) goes down. In this mode, we get:

Region survival is guaranteed (at most 1 of 3 regions failing, for example)
Low-latency reads from all regions.
Higher-latency writes from all regions (at least as much as the round-trip time to the nearest region)
A choice of low-latency stale reads or high-latency fresh reads from other regions (and high-latency fresh reads is the default).

Region Level Survival Scenario

Let's use the following scenario to illustrate region level survival.

Cluster Configuration

3 nodes times 3 regions, nodes diversified across separate AZs (3x3 in total)

Multi-Region Configuration

In this configuration, we move from default zone survival to get:

Region level survival (for example, 1 of 3 regions failing at most).
Low-latency reads from all regions.
Higher-latency writes from all regions (due to region survival).

The diagram above illustrates a global, 3-region deployment with REGION survival. The colored rectangles represent range replicas.

In this configuration, we changed both survival goals and table localities. Changing the survival goal to REGION increases the replication factor from three to five, which means three out of five replicas must achieve quorum for writes. Reads bypass raft quorums through the concept of leaseholders.
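The relationship between replication factor, quorum size and the number of replicas that can be lost is plain majority arithmetic. A minimal sketch, with a hypothetical class name:

```java
// Hypothetical helper for majority-quorum arithmetic.
final class QuorumMath {
    private QuorumMath() {
    }

    // Smallest number of replicas that forms a strict majority.
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    // Number of replicas that can be unavailable while a majority remains.
    static int toleratedFailures(int replicas) {
        return replicas - quorum(replicas);
    }

    public static void main(String[] args) {
        for (int replicas : new int[] {3, 5}) {
            System.out.printf("replicas=%d quorum=%d tolerated=%d%n",
                    replicas, quorum(replicas), toleratedFailures(replicas));
        }
    }
}
```

With three replicas the quorum is two and one replica can be lost; with five replicas the quorum is three and two replicas can be lost, which is what makes it possible to lose a region that holds up to two of the five replicas.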

This write overhead is balanced out somewhat by CockroachDB using a parallel, atomic commit protocol that reduces the latency of transactions down to only a single round-trip of distributed consensus.

By placing two replicas in the write-optimized region (in the diagram above, that's the US region), and spreading the remaining replicas out over the other regions, the end result is that the need for consulting three out of the five replicas to achieve quorum can be handled by consulting only a single region aside from the write-optimized region.
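A simplified way to reason about this is that a write is acknowledged once the quorum-th fastest replica has responded. The sketch below models only that; the class QuorumLatency and the round-trip times are made-up illustrations, and it ignores everything else that contributes to real write latency.

```java
import java.util.Arrays;

// Hypothetical model: time until a majority of replicas has acknowledged a write.
final class QuorumLatency {
    private QuorumLatency() {
    }

    // Round-trip times are measured from the replica that coordinates the write.
    static int quorumAckMillis(int[] replicaRttMillis) {
        int[] sorted = replicaRttMillis.clone();
        Arrays.sort(sorted);
        int quorum = sorted.length / 2 + 1;
        // The quorum-th smallest round-trip time completes the majority.
        return sorted[quorum - 1];
    }

    public static void main(String[] args) {
        // Assumed RTTs: two replicas in the home region (0 and 2 ms),
        // two in the nearest region (70 ms) and one in the farthest (150 ms).
        int[] twoTwoOne = {0, 2, 70, 70, 150};
        System.out.println(quorumAckMillis(twoTwoOne) + " ms");
    }
}
```

Under these assumed numbers, the two home-region replicas plus a single acknowledgement from the nearest region complete the quorum, so the farthest region never sits on the critical path of a write.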

Notable also is that CockroachDB replicates data at the level of ranges (leader + followers forming a raft group) rather than at node or database level. This allows for fine grained data placement controls using both built-in heuristics and operator constraints or restrictions (covered later).

Changing table localities is an optional but very powerful and flexible way to achieve predictable read and write latencies in global, multi-regional deployments. For example, tables with low read and write latency requirements can use the REGIONAL BY ROW locality.

For tables with locality REGIONAL BY TABLE, all data in a given table will be optimized for reads and writes in a single region (primary region unless specified) by placing voting replicas and leaseholders there. This is useful if your data is usually read/written in only one region, and the round trip across regions is acceptable in cases where data is accessed from a different region.

In a multi-region database, REGIONAL BY TABLE IN PRIMARY REGION is the default locality for tables if the LOCALITY is not specified during CREATE TABLE. You can choose which region you want this data to live in by specifying REGIONAL BY TABLE IN <region>.

For tables with locality REGIONAL BY ROW, individual rows can be homed to a region of your choosing. This is useful for tables where data should be localized for the given application. Think of REGIONAL BY ROW as the "home" region being defined at row level rather than at table level using a hidden or computed column.

A GLOBAL table is optimized for low-latency reads from every region in the database. The tradeoff is that writes will incur higher latencies from any given region, since writes have to be replicated across every region to make the global low-latency reads possible. Use global tables when your application has a "read-mostly" table of reference data that is rarely updated, and needs to be available to all regions.

Failure Scenarios

Below are a few high-level visualizations of different failure scenarios in terms of forward progress from a local application/service standpoint. These scenarios assume that some form of GSLB is also in place that can seamlessly redirect/balance traffic across the globe.

In this scenario, two zones are down in the primary region. Since there's still one node available, requests can still be processed in that region, although with compute/IO capacity reduced by roughly two thirds (66%). Eventually, within 10 seconds, the leases for any unavailable ranges (purple color) are transferred to the other two regions.

In this scenario, all three zones are down in the primary region. Requests from clients can quickly be routed to the other regions through some GSLB solution. Eventually (<10s) the leases for any unavailable ranges in the downed region are transferred through a raft operation to any of the other two regions, and under-replicated ranges are up-replicated to meet the survival goal (five replicas).

This scenario is similar to the previous one, only in a different region.

Data Domiciling Option

Data domiciling in CockroachDB allows users to keep certain subsets of data in specific localities. For example: US data only across nodes in the US, EU data across nodes in the EU and so forth, for compliance and performance reasons. It's implemented by controlling the placement of specific row or table data using regional tables with the REGIONAL BY ROW and REGIONAL BY TABLE clauses. These placement constraints can be applied either through placement restrictions or super regions.

Placement Restrictions

A database can use PLACEMENT RESTRICTED to opt out of non-voting replicas, which may otherwise be placed outside of the regions indicated in zone configuration constraints.

In addition to data domiciling, PLACEMENT RESTRICTED can be used to reduce the total amount of data in the cluster and reduce the overhead of replicating data across a large number of regions. Global tables are not affected by PLACEMENT RESTRICTED and will still be placed in all database regions.

Super Regions

Super regions were introduced in 22.1 primarily to enhance the data domiciling capability. Super regions allow a user to define a set of regions on the database, such that regional and regional-by-row tables located within the super region will have all of their replicas located within the super region.

Super regions take a different approach to data domiciling than PLACEMENT RESTRICTED. Specifically, super regions make it so that all replicas (both voting and non-voting) are placed within the super region, whereas PLACEMENT RESTRICTED makes it so that there are no non-voting replicas.

This means that with super regions, you get data domiciling, region survivability and region-local latencies on writes. The likelihood of a super-region having a full outage is significantly lower than that of a single region, but should that happen, only access to the domiciled data would be refused. With an outage of that magnitude, you'd probably have bigger fish to fry.

In this ^^ diagram, there are three super-regions with domiciled data in each. In the event of a single region outage in any of these super-regions, it would still allow forward progress for domiciled data. In addition, domiciled data will have all its voting and non-voting replicas pinned to each super-region, allowing writes to reach a quorum without any cross-link coordination.
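A super region is declared on the database itself. The following is only a sketch: the database name test and the region names are taken from the configuration example below, while the super region name "europe" is made up for illustration.

```sql
-- Assumption: super regions have to be enabled per session in your version.
SET enable_super_regions = 'on';

-- Group two existing database regions into one super region. Regional
-- tables and rows homed in either region keep all replicas inside it.
ALTER DATABASE test ADD SUPER REGION "europe" VALUES "eu-central-1", "eu-west-3";

-- List the super regions defined on the database.
SHOW SUPER REGIONS FROM DATABASE test;
```

Keep in mind that to survive a region outage inside a super region, as in the diagram, the super region is expected to span at least three regions, so a two-region grouping like this one only fits zone survival.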

Configuration Example

To wrap things up, here is a basic configuration example using regional-by-row. It uses a computed column of type crdb_internal_region to map city names to region names, which is then used for replica placement.

create table users
(
    id         uuid   not null default gen_random_uuid(),
    city       string not null,
    first_name string not null,
    last_name  string not null,
    address    string not null,
    primary key (id asc)
);

create index users_last_name_index on users (city, last_name);

insert into users (city, first_name, last_name, address)
select 'stockholm',
       md5(random()::text),
       md5(random()::text),
       md5(random()::text)
from generate_series(1, 100);

insert into users (city, first_name, last_name, address)
select 'dublin',
       md5(random()::text),
       md5(random()::text),
       md5(random()::text)
from generate_series(1, 100);

insert into users (city, first_name, last_name, address)
select 'boston',
       md5(random()::text),
       md5(random()::text),
       md5(random()::text)
from generate_series(1, 100);

alter database test primary region "eu-central-1";
alter database test add region "eu-west-3";
alter database test add region "us-east-2";

alter table users
    add column region crdb_internal_region as (
        case
            when city in
                 ('stockholm', 'copenhagen', 'helsinki', 'oslo', 'riga', 'tallinn') then 'eu-central-1'
            when city in
                 ('dublin', 'belfast', 'london', 'liverpool', 'manchester', 'glasgow', 'birmingham', 'leeds', 'madrid',
                  'barcelona', 'sintra', 'rome', 'milan', 'lyon', 'lisbon', 'toulouse', 'paris', 'cologne', 'seville',
                  'marseille', 'naples', 'turin', 'valencia', 'palermo') then 'eu-west-3'
            when city in
                 ('new york', 'boston', 'washington dc', 'miami', 'charlotte', 'atlanta', 'chicago', 'st louis',
                  'indianapolis', 'nashville', 'dallas', 'houston', 'san francisco', 'los angeles', 'san diego',
                  'portland', 'las vegas', 'salt lake city') then 'us-east-2'
            else 'eu-central-1'
            end
        ) stored not null;

alter table users set locality regional by row as region;

select id,city,region
from users;

show ranges from table users;

Conclusion

In this article, we looked at CockroachDB's multi-active availability model to survive region level failures. In addition, using "super-regions" enables data domiciling to benefit from both low latency reads and writes and region level survival. All using a declarative approach with a few SQL statements.

Multi-region Deployments with CockroachDB - Part 1

Kai Niemi — Mon, 31 Oct 2022 18:01:40 GMT

Overview

This article is an introduction to the high-availability and multi-region capabilities of CockroachDB, with a focus on zone level survival, which is also the default mode.

First, let's take a closer look at how high-availability works in CockroachDB.

Multi-Active Availability

The high-availability (HA) model that CockroachDB uses is referred to as multi-active.

At a high level, multi-active can be seen as an evolution of the more traditional active/passive and active/active paired models for disaster recovery. It's based on the principle that it's better from an HA standpoint to design zone and regional fault-tolerance into the system (end-to-end) and expect failures as the norm rather than an exception.

It's made possible by the consensus-based replication model and strong consistency properties of the database. This effectively enables transparent load-balancing and re-routing across region boundaries without having to consider traffic affinity rules, replication delays or risks of asynchronous replication.

When you build a service with business rules enforced by the stateful tier, you can rest assured that these rules are not violated because of the deployment topology used. This takes a lot of burden away from app developers to worry about transactional integrity and consistency and instead pushes these concerns down to the stateful tier.

One common understanding is that any larger distributed system is in a constant state of failure, so why not embrace the fact and design for it. In this regard, keeping the business processing layer stateless and the stateful tier multi-active will take you a long way:

By contrast, failover-based HA models are typically based on the opposite: assuming things will not fail, and if they do, treating it as an "exceptional" event where you need stand-by capacity ready to kick in at short notice.

In the meantime, that standby capacity is either sitting idle or is heavily under-utilized. This model also suffers from the drawbacks of asynchronous replication and the complexities of later failing back (to use an odd term) and restoring normal operation afterwards. Not to mention the difficulties of testing and verifying that it all works in a production environment.

When instead adopting system design principles that allow business services to operate from different sites/locations (cross-region) simultaneously, large blast-radius outages can be handled seamlessly by load balancing and traffic re-routing. Another effect is that resource utilization becomes more cost efficient since you don't need the same hardware footprint in terms of stand-by capacity blowing hot air and waiting for a disaster to happen.

One prerequisite for adopting the multi-active model end-to-end is to have a stateful tier (database) able to act as a control plane with the ability to span different failure domains (regions) while also supporting transactional and strong consistency guarantees towards services/apps.

Cluster Survival Goals

For a CockroachDB cluster stretching across region boundaries, for example us-east-1, europe-west1 and asia-northeast1, it's important to consider the survival goals and read/write latency expectations in each region.

To satisfy different survival, performance and data locality compliance goals for the database, there are different multi-region capabilities such as regional tables, regional by row and global tables.

Regional tables provide low-latency reads and writes for an entire table from a single region.
Regional by row tables provide low-latency reads and writes for one or more rows of a table from a single region. Different rows in the table can be optimized for access from different regions.
Global tables are optimized for low-latency reads from all regions.
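As a rough sketch of how these localities are declared, assuming a database that already has a primary region. The table names orders, users and currencies are placeholders, not tables from this article:

```sql
-- Regional table: the whole table is homed in the primary region.
ALTER TABLE orders SET LOCALITY REGIONAL BY TABLE IN PRIMARY REGION;

-- Regional by row: each row is homed in the region stored in its
-- hidden crdb_region column.
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;

-- Global table: suits reference data that is read everywhere and
-- rarely written.
ALTER TABLE currencies SET LOCALITY GLOBAL;
```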

Surviving zone level failures

Zone-level survival is the default configuration for a multi-region deployment, where:

Zone survival is guaranteed (for example, at most 1 of 3 AZs failing).
Low-latency reads and writes from a single region (primary).
A choice of low-latency stale reads or high-latency fresh reads from other regions (and high-latency fresh reads is the default).

Surviving region level failures

Region-level survival means that the database will remain fully available for reads and writes, even if an entire region (or a majority of its AZs) goes down. In this mode, we get:

Region survival is guaranteed (at most 1 of 3 regions failing, for example)
Low-latency reads from all regions.
Higher-latency writes from all regions (at least as much as the round-trip time to the nearest region)
A choice of low-latency stale reads or high-latency fresh reads from other regions (and high-latency fresh reads is the default).
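Switching between the two survival goals takes one statement each way. A minimal sketch, assuming a database named test (a placeholder) that already has at least three regions, which region survival requires:

```sql
-- Upgrade from the default zone survival to region survival.
ALTER DATABASE test SURVIVE REGION FAILURE;

-- Check which survival goal is currently in effect.
SHOW SURVIVAL GOAL FROM DATABASE test;

-- Go back to the default.
ALTER DATABASE test SURVIVE ZONE FAILURE;
```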

Zone Level Survival Scenario

Let's look at the following scenario to illustrate zone level survival.

Cluster Configuration

3 nodes times 3 regions, nodes diversified across separate AZs (3x3 in total)

Multi-Region Configuration

In this configuration, we use the default zone survival to get:

Zone level survival, assuming at least 3 zones in one region.
Low-latency reads and writes from a single region.
A choice of low-latency stale reads or high-latency fresh reads from other regions (and high-latency fresh reads is the default)

By default, a database is created with ZONE survivability. This means that the system will create a zone configuration with at least three replicas, and will spread these replicas out amongst the available regions defined in the database.

In the figure above (^^) we have a database configured with ZONE survivability, which has placed three replicas in three separate availability zones in the primary region A and two non-voting replicas in the other regions B and C. The colored rectangles represent range replicas (just a massive number of two ranges here for simplicity). In a zone level survival configuration, voting replicas and lease preferences are in the primary region, with non-voting replicas in the others for local stale reads.

This means that writes in the A region will be fast as they won't need to replicate out of the region, but writes from the B and C regions will need to consult the A region. Writes must go through the leaseholder for each range in a transaction and these are only located in the primary region. The system will also place additional non-voting replicas to guarantee that stale reads from all regions can be served locally.

Non-voting replicas follow the Raft log and are thus able to serve follower reads, but do not participate in quorum, with almost no impact on write latencies. A follower read is a historical read at a given timestamp in the past with either exact or bounded staleness guarantees. This allows reads to scale efficiently without having to chase down the leaseholder.
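Both flavors of follower read are requested per statement with AS OF SYSTEM TIME. The sketch below assumes a hypothetical users table with an integer primary key id. The 10 second bound is just an example value.

```sql
-- Exact staleness: read at a timestamp far enough in the past that the
-- nearest replica can serve it.
SELECT id, city
FROM users AS OF SYSTEM TIME follower_read_timestamp()
WHERE id = 1;

-- Bounded staleness: read the freshest data available on a local
-- replica, as long as it is no more than 10 seconds old.
SELECT id, city
FROM users AS OF SYSTEM TIME with_max_staleness('10s')
WHERE id = 1;
```

Bounded staleness reads are more restricted than exact staleness reads, so they fit best with simple single-row lookups like the one above.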

Failure Scenario

Let's look at a few failure scenarios.

In this (^^) scenario, two nodes are down but in different regions. Only one zone holding voting replicas is down in the primary region, which does not affect forward progress.

In this (^^) unhappy scenario however, two zones holding voting replicas are down in the primary region. Because the voting replicas and leases are pinned to the primary region in ZONE survival mode, forward progress (on writes) is stopped even though the remaining nodes are reachable.

Using follower reads with bounded staleness however, still provides the ability to serve reads from local replicas even in the presence of a network partition or other failure event that prevents the SQL gateway from communicating with the leaseholder.

Summary

To increase availability beyond zone level, the next step is to define a region level survival goal for a database. Let's contrast zone survival against region survival in a follow-up article. Just to give an overview of how these survival goals differ from a quality attribute standpoint:

Non-Functional Property	Region Survival	Zone Survival (default)
Failures a database can survive	Region (1 of 3 etc)	Zone (1 of 3 etc)
Consistency	ACID/1SR, consistent stale reads	ACID/1SR, consistent stale reads
Multi-region Configuration	REGION survival, regional by row and global table localities as option	ZONE survival, global table localities as option
Performance	Low latency reads in all regions, high latency writes for regional tables	Low latency reads and writes in primary region, low latency stale reads in others

Regional by row table localities offer both low read and write latency. Global tables offer low latency reads in all regions at the expense of higher write latency.

Conclusion

We looked at CockroachDB's multi-active availability model to survive zone level failures in a multi-regional deployment. By default, the database optimizes for reads and writes in a primary region. This comes at the expense of survival. If a majority of zones/nodes in the primary region are offline, then it affects all other regions as well. The next level is region-level survival, which in combination with "super-regions" enables data domiciling to benefit from both low latency reads and writes and region survival.

Exploring Fault Tolerance in CockroachDB

Kai Niemi — Sat, 29 Oct 2022 14:45:02 GMT

CockroachDB Fault Tolerance

As the name reveals, CockroachDB is designed to survive all sorts of failure scenarios. From individual nodes crashing to asymmetric network partitions, zone and region level failures. Caused perhaps by non-mundane events, like a melted power supply, flooding, fire, a wider power outage or perhaps sharks taking a bite off the fiber link:

Make no mistake, these incidents do happen, as they did for this rather business-critical rock island in the Mediterranean a few years back:

Database survival can be defined as the ability to make forward progress on both reads and writes during a service disruption. If a client can reach a node with a request, it should either allow or refuse it in respect to liveness and safety. A safety property asserts that nothing bad happens during execution. A liveness property asserts that something good eventually happens. Another way of putting it: liveness means that a specific event must occur, while safety means that an unintended event must not occur. This is essential for things like failure detection and consensus algorithms.

To use CAP terminology, CockroachDB is CAP-consistent, which means it can be partially unavailable in the event of a network partition in order to not compromise consistency. Partially unavailable means that some requests may succeed while others may not, depending on factors like which side of a network partition the client and node sit on, leaseholder status for the range that a key sorts into, etc. Keep in mind that the client (application) is also an actor in any distributed system.

^^ CAP simplified: A system is either (C)onsistent or (A)vailable when (P)artitioned

But how can you actually demonstrate that this works? That a node does not return a non-authoritative read when it's not allowed to, or worse, return a dirty or phantom read, or allow a write to succeed which later disappears?

It's like Lenin said: "Trust is good, control is better", amongst a few other things..

There are many ways to achieve this, as always. Either by running a simulation of some kind, or using an actual system under load while monitoring invariants, or a chaos type testing scenario like the Jepsen framework. The latter option requires more upfront planning and tooling, but it's a good way to find edge cases and gain confidence that there are no gaps between expected and actual outcomes under chaos.

The rest of this article will narrow things down to a much simpler approach (yet more limited) by focusing on verifying expected vs actual outcome at the level of individual read and write operations when nodes are taken down violently. No special tooling is needed besides the CLI and ssh.

Introduction to CockroachDB Fault Tolerance

For further insights into CockroachDB's HA properties and a glossary of the terms used in this article, see:

https://www.cockroachlabs.com/docs/stable/multi-active-availability.html
https://www.cockroachlabs.com/docs/stable/architecture/glossary.html

CockroachDB stores user data in 512 MiB sized ranges that are replicated three times (by default) across failure domains for maximum diversification. A failure domain can be a machine, rack, datacenter, an availability zone, region or even a cloud provider. For the database to function properly, a majority of replicas must be available at all times.

The number of failures that can be tolerated in CockroachDB is equal to (replication factor - 1)/2. For example, with 3x replication (the default), one failure can be tolerated; with 5x replication, two failures, and so on. The replication factor can be controlled at cluster, database, table and index level using replication zones. The replication factor is automatically adjusted based on defined database survival goals in a multi-region deployment (zone or region survival).
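To make the arithmetic concrete: with five replicas, (5 - 1) / 2 = 2 failures can be tolerated, since three of five replicas still form a majority. The sketch below raises the replication factor through replication zones. It assumes a cluster with at least five nodes and no multi-region survival goal managing the zone configurations, and the users table is a placeholder.

```sql
-- Raise the cluster-wide default from 3 to 5 replicas per range.
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;

-- Or override the replication factor for a single table only.
ALTER TABLE users CONFIGURE ZONE USING num_replicas = 5;

-- Inspect the resulting default zone configuration.
SHOW ZONE CONFIGURATION FROM RANGE default;
```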

It is recommended to run CockroachDB across a minimum of three failure domains for optimal resilience. To guarantee a zero RPO (no data loss) and near-zero RTO (<10s), it is imperative that these failure domains be sufficiently isolated from each other such that the risk of simultaneously losing a majority of them is extremely rare.

Running a CockroachDB cluster across two failure domains is therefore considered a CockroachDB anti-pattern because the loss of a single failure domain could lead to unavailability. Avoid the number 2 and you are in a better spot, pretty much.

Test Setup

Prerequisites

The test scenarios expect a pre-provisioned CockroachDB cluster with the following outline:

6 node cluster, single region, 2 nodes per zone
Using default replication factor of 3
One database with 2 tables, populated with some data
Self-hosted for the ability to kill nodes
v22.1 or later

Example localities, for illustration:

--locality=region=eu-west-1,datacenter=eu-west-1a
--locality=region=eu-west-1,datacenter=eu-west-1b
--locality=region=eu-west-1,datacenter=eu-west-1c
--locality=region=eu-west-1,datacenter=eu-west-1a
--locality=region=eu-west-1,datacenter=eu-west-1b
--locality=region=eu-west-1,datacenter=eu-west-1c

Schema

Table schema with two tables and some data used in the scenarios:

create table users
(
    id         int    not null default unique_rowid(),
    city       string not null,
    first_name string not null,
    last_name  string not null,
    address    string not null,
    primary key (id asc)
);

create index users_last_name_index on users (city, last_name);

insert into users (id,city,first_name,last_name,address)
select n,
       'london',
       md5(random()::text),
       md5(random()::text),
       md5(random()::text)
from generate_series(1,10000) n;

Test Scenarios

The test scenarios have different assertions, principles in use and steps for validation.

Scenario A - Reads during Failure

Assertion

A read for a key must not complete when the leaseholder for that range is unavailable.

Principles

Only the leaseholder replica is allowed to service reads and writes for a range unless it's a follower read (opt-in feature), in which case follower replicas can serve potentially stale reads.
The leaseholder can service reads until its epoch expires due to failure to update its liveness record stored in a system range (these are replicated 5 times by default).
A request on a follower node or gateway node that cannot reach the leaseholder for a range will not complete.

Resources

https://www.cockroachlabs.com/docs/stable/architecture/reads-and-writes-overview
https://www.cockroachlabs.com/docs/stable/architecture/replication-layer#epoch-based-leases-table-data

Steps to verify

1) Take note of the range distribution and leaseholder for the first key in users table, designated K1

   SHOW RANGE FROM TABLE users FOR ROW (1);

     start_key | end_key | range_id | lease_holder |      lease_holder_locality       | replicas |                                             replica_localities
   ------------+---------+----------+--------------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------------
     NULL      | NULL    |       45 |            1 | region=eu-west-1,zone=eu-west-1a | {1,5,6}  | {"region=eu-west-1,zone=eu-west-1a","region=eu-west-1,zone=eu-west-1c","region=eu-west-1,zone=eu-west-1b"}
   (1 row)

The SHOW RANGE FROM TABLE statement asks the database where it would store the row key. In this case across replicas 1, 5 and 6, with 1 being the leaseholder. Keep in mind this does not tell you if the row actually exists. For that, you can combine it with a point lookup.

2) Kill two nodes holding K1 replicas, one being the current leaseholder. In the above output, let's pick 1 and 5:

   (ssh to node 1)
   killall -9 cockroach
   (ssh to node 5)
   killall -9 cockroach

3) Connect with a SQL client to any other node that holds the last reachable replica for K1 and execute a point lookup. In the above example, it's node 6:

   (ssh to node 6)
   SELECT * from users where id=1;
   (blocks until timeout)

If you wait too long you may also get:

ERROR: replica unavailable: (n6,s6):2 unable to serve request to r45:/Table/106{-/2} [(n1,s1):1, (n6,s6):2, (n5,s5):5, next=6, gen=17]: lost quorum (down: (n1,s1):1,(n5,s5):5); closed timestamp ...[truncated]

4) Restart the 2 failed nodes

5) Observe that the blocked or failing read in step 3 now completes

Scenario B - Writes during Failure

Assertion

A write to a key must not complete if the leaseholder for that range is unavailable.

Principles

A transaction can only commit if a majority of range replicas are in agreement (raft consensus).
The range leaseholder, which is normally also the Raft group leader, ensures this.

Resources

https://www.cockroachlabs.com/docs/stable/architecture/life-of-a-distributed-transaction.html
https://www.cockroachlabs.com/blog/parallel-commits/

Steps to verify

1) Take note of the range distribution and leaseholder for the first key in users table, designated K1 (same as in previous scenario)

   SHOW RANGE FROM TABLE users FOR ROW (1);

     start_key | end_key | range_id | lease_holder |      lease_holder_locality       | replicas |                                             replica_localities
   ------------+---------+----------+--------------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------------
     NULL      | NULL    |       45 |            1 | region=eu-west-1,zone=eu-west-1a | {1,5,6}  | {"region=eu-west-1,zone=eu-west-1a","region=eu-west-1,zone=eu-west-1c","region=eu-west-1,zone=eu-west-1b"}
   (1 row)

2) Kill two nodes holding K1 replicas, one being the current leaseholder. In the above output, let's pick 1 and 6:

   (ssh to node 1)
   killall -9 cockroach
   (ssh to node 6)
   killall -9 cockroach

3) Connect to any node that does not hold a replica for K1, let's pick 2:

    (ssh to node 2)
    SELECT * from users where id=1;
    (blocks until timeout)

4) Connect to any node in yet another session and update K1:

    (ssh to node 3)
    UPDATE users SET last_name = 'xxx' where id=1;
    (blocks until timeout)

5) Restart the two nodes

6) Observe that both blocked/failing operations (3 and 4) complete

Note: If the UPDATE in step 4 is done in an explicit transaction and the client then terminates while waiting, the update will not become visible (aborted transaction).
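To try the explicit transaction variant without waiting indefinitely, a session-level statement timeout can be combined with the update. This is only a sketch of one way to run step 4, and the 10 second timeout is an arbitrary choice:

```sql
-- Give up on any statement that takes longer than 10 seconds instead
-- of blocking on the unavailable range.
SET statement_timeout = '10s';

BEGIN;
UPDATE users SET last_name = 'xxx' WHERE id = 1;
-- If the client goes away before this point, the transaction is
-- aborted and the update never becomes visible.
COMMIT;
```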

Conclusion

This article scratched the surface of CockroachDB's survival properties. We used a simple yet limited approach to verify read and write behavior under failure. As a next step and follow-up, we'll look at the HA characteristics in a global, multi-region deployment. That's typically when things get really interesting.

Model Tree Structures with CockroachDB

Kai Niemi — Sat, 29 Oct 2022 10:02:46 GMT

Overview

Modeling hierarchical tree structures in SQL databases has traditionally been difficult. At least until the introduction of Recursive Common Table Expressions (Recursive CTEs) that made it much simpler to craft queries to traverse a recursive path within a hierarchy, aka a tree structure.

A tree is an undirected, connected graph with no cycle, so when modeling product catalogs with categories for example, it's formally a special form of graph:

You don't need to use a graph database for that though. There are two main approaches to modeling tree structures in relational SQL databases:

Adjacency List Model
Nested Set Model

The Adjacency List Model with recursive CTEs provides the easiest approach. It's much less taxing on mutations of the tree structure, more performant and supports arbitrarily deep nestings / levels. The SQL is also much simpler and client code cleaner as well.

The Nested Set Model requires tracking two references per item instead of a single one in the adjacency model, making things less efficient and more difficult to implement in client code and SQL.

The Nested Set Model is not commonly used in SQL databases anymore since the introduction of recursive CTEs decades ago. That doesn't make it less interesting though, and it's likely still in use in many systems, in particular MySQL. Although completely unrelated to this, it's like exploring shortest path algorithms like Dijkstra's. Just one of those things that's interesting to know a bit more about.

Let's look at an implementation of the nested set model using Spring Boot and CockroachDB to follow the historical order. We could have used the adjacency list model with recursive CTEs instead, but that's for a follow up post.

Source Code

The source code for this example is available in GitHub.

The Adjacency List Model

The adjacency list model is pretty straightforward. It's a parent-child relationship where each item in the table contains a pointer to its parent. The benefits are that it's simple, provides referential integrity and it's also easy and efficient to update the structure using SQL.

The main drawback is that it performs poorly with deeply nested trees (many levels), unless using recursive CTEs. Combined with CTEs, this is the preferred and portable approach to use in modern databases over the nested set model.

create table category
(
    id            int          not null default unordered_unique_rowid(),
    name          varchar(64)  not null,
    description   varchar(256),
    parent_id     int,
    primary key (id)
);

Listing all the categories:

select id,
       name,
       parent_id
from category
order by name;

id	name	parent_id
109012213791195137	0-60	5980017278022057985
1854157069397262337	101-250	5980017278022057985
3771564610750251009	251-500	5980017278022057985
1342998511690711041	61-80	5980017278022057985
8686117704118304769	81-100	5980017278022057985
9138447991692328961	Anis	1692308957788635137
7371629562879541249	Argentina	9184328412896165889
4597412192419315713	Australia	9184328412896165889
7882788120586092545	Austria	3785638359585783809
1066449347072491521	Available in store	1942399474596052993
3447586912556285953	Box	1468043770094419969
5176406219513135105	Bread	1692308957788635137
888979374256422913	Butter	1692308957788635137
3175119135100370945	Champagne	1468043770094419969
5261939428061085697	Chardonnay	2664488342975152129
9124655717833506817	Country	null
939081920110919681	Dessert	1468043770094419969
3785638359585783809	Europe	9124655717833506817
7803834389618753537	Findings under 150	1942399474596052993
2379389375939346433	France	3785638359585783809
6464154237964386305	Germany	3785638359585783809
2664488342975152129	Grape	null
7704051510374825985	Honey	1692308957788635137
412442238685282305	Italy	3785638359585783809
4060217199367028737	Kanonkop	3346396658428805121
1692308957788635137	Label	null
9197698474289922049	Masi	3346396658428805121
4113521523081609217	Merlot	2664488342975152129
5410171187671334913	Miguel Torres	3346396658428805121
9184328412896165889	New World	9124655717833506817
5980017278022057985	Price	null
3346396658428805121	Producer	null
6128495328236929025	Recent News	1942399474596052993
4539709822193631233	Red	1468043770094419969
3694686757736153089	Riesling	2664488342975152129
2098758824158822401	Rose	1468043770094419969
2381641175753031681	South Africa	9184328412896165889
2032612204631818241	Spain	3785638359585783809
6512286458981908481	Sparkling	1468043770094419969
1942399474596052993	Special	null
1468043770094419969	Types	null
3501067158131310593	Vanilla	1692308957788635137
7836344749428834305	White	1468043770094419969
4807111050068754433	Young	1692308957788635137
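The schema above does not declare a foreign key, so the referential integrity mentioned earlier has to be added explicitly. As a small sketch (the constraint name fk_category_parent is made up), followed by a lookup of the direct children of one category with a single self-join:

```sql
-- Make the database reject parent_id values that match no category.
alter table category
    add constraint fk_category_parent
        foreign key (parent_id) references category (id);

-- Direct children of 'Europe': one self-join, no recursion needed.
select c.id, c.name
from category c
         join category p on c.parent_id = p.id
where p.name = 'Europe'
order by c.name;
```

With the data listed above, the second query returns Austria, France, Germany, Italy and Spain. Going more than one level deep is where the recursive CTE comes in.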

Tree listing using recursive CTEs

Just to show the usefulness of recursive CTEs, this query will list the entire tree structure in one single query:

with recursive category_tree (id, name, parent_id, level, path) as
                   (select id,
                           name,
                           parent_id,
                           0,
                           concat('/', name, '/')
                    from category
                    where parent_id is null -- anchor
                    union all
                    select c.id,
                           c.name,
                           c.parent_id,
                           ct.level + 1,
                           concat(ct.path, c.name, '/')
                    from category c
                             join category_tree ct
                                  on c.parent_id = ct.id)
select *
from category_tree
order by path;

To filter down on a particular sub-tree name, just change the CTE anchor query:

with recursive category_tree (id, name, parent_id, level, path) as
                   (select id,
                           name,
                           parent_id,
                           0,
                           concat('/', name, '/')
                    from category
                    where parent_id is null
                      and name = 'Label' -- anchor
                    union all
                    select c.id,
                           c.name,
                           c.parent_id,
                           ct.level + 1,
                           concat(ct.path, c.name, '/')
                    from category c
                             join category_tree ct
                                  on c.parent_id = ct.id)
select *
from category_tree
order by path;
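The same technique works in the other direction. As a sketch (the CTE name ancestors and the distance column are made up for illustration), anchoring on a leaf and joining upwards through parent_id walks the path back to the root:

```sql
with recursive ancestors (id, name, parent_id, distance) as
                   (select id,
                           name,
                           parent_id,
                           0
                    from category
                    where name = 'Italy' -- anchor on the leaf
                    union all
                    select c.id,
                           c.name,
                           c.parent_id,
                           a.distance + 1
                    from category c
                             join ancestors a
                                  on c.id = a.parent_id)
select name, distance
from ancestors
order by distance;
```

With the category data above, this returns Italy (0), Europe (1) and Country (2).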

Now, forget about the CTE approach and let's do it the hard way.

The Nested Set Model

In the Nested Set Model, each item in the table contains two pointers to track the containment of sublevels. This allows for a query to extract sub-trees in one single query rather than use recursion or one self-join per level, which is the drawback of the adjacency model without a recursive CTE.

As such, a tree hierarchy is not expressed as vertices and edges in a parent-child relationship, but as nested containers. Picture the wine catalog like this:

Let's update the schema:

create table category
(
    id            int          not null default unordered_unique_rowid(),
    name          varchar(64)  not null,
    description   varchar(256),
    lft           int          not null,
    rgt           int          not null,
    parent_id     int,
    primary key (id)
);

select id,
       name,
       lft,
       rgt
from category
order by name;

id	name	lft	rgt
109012213791195137	0-60	2	3
1854157069397262337	101-250	4	5
3771564610750251009	251-500	6	7
1342998511690711041	61-80	8	9
8686117704118304769	81-100	10	11
9138447991692328961	Anis	14	15
7371629562879541249	Argentina	65	66
4597412192419315713	Australia	67	68
7882788120586092545	Austria	53	54
1066449347072491521	Available in store	28	29
3447586912556285953	Box	36	37
5176406219513135105	Bread	16	17
888979374256422913	Butter	18	19
3175119135100370945	Champagne	38	39
5261939428061085697	Chardonnay	82	83
9124655717833506817	Country	51	72
939081920110919681	Dessert	40	41
3785638359585783809	Europe	52	63
7803834389618753537	Findings under 150	30	31
2379389375939346433	France	55	56
6464154237964386305	Germany	57	58
2664488342975152129	Grape	81	88
7704051510374825985	Honey	20	21
412442238685282305	Italy	59	60
4060217199367028737	Kanonkop	74	75
1692308957788635137	Label	13	26
9197698474289922049	Masi	76	77
4113521523081609217	Merlot	86	87
5410171187671334913	Miguel Torres	78	79
9184328412896165889	New World	64	71
5980017278022057985	Price	1	12
3346396658428805121	Producer	73	80
6128495328236929025	Recent News	32	33
4539709822193631233	Red	42	43
3694686757736153089	Riesling	84	85
2098758824158822401	Rose	44	45
2381641175753031681	South Africa	69	70
2032612204631818241	Spain	61	62
6512286458981908481	Sparkling	46	47
1942399474596052993	Special	27	34
1468043770094419969	Types	35	50
3501067158131310593	Vanilla	22	23
7836344749428834305	White	48	49
4807111050068754433	Young	24	25

The left and right values for each item are determined by traversing the graph from the root (there are multiple roots here) and descending to each leaf node. A node gets its left number on the way down and its right number once all of its children have been numbered, gradually progressing from left to right. This is known as the modified preorder tree traversal (MPTT) algorithm.
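The numbering can be sketched in a few lines of Java. This is only an illustration, assuming a hypothetical in-memory CategoryNode class (not part of the demo project) and one counter shared across all roots:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory tree node, used only to illustrate the numbering.
class CategoryNode {
    final String name;
    final List<CategoryNode> children = new ArrayList<>();
    int lft;
    int rgt;

    CategoryNode(String name) {
        this.name = name;
    }

    CategoryNode add(CategoryNode child) {
        children.add(child);
        return this;
    }
}

class NestedSetNumbering {
    // Assigns lft on the way down and rgt on the way back up.
    // Returns the next free counter value.
    static int number(CategoryNode node, int counter) {
        node.lft = counter++;
        for (CategoryNode child : node.children) {
            counter = number(child, counter);
        }
        node.rgt = counter++;
        return counter;
    }

    // Numbers a forest of roots with one shared counter, starting at 1.
    static void numberAll(List<CategoryNode> roots) {
        int counter = 1;
        for (CategoryNode root : roots) {
            counter = number(root, counter);
        }
    }
}
```

Numbering "Price" with its five children followed by "Label" with its six children this way gives Price (1, 12) and Label (13, 26), the same values as in the table above.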

Category structure example with the left and right numbers laid out:

Tree listing using Nested Set Model

To show the nested set model in action, this is the query that will list the entire tree structure using a cross join:

select category0_.name as col_0_0_, count(category1_.name) - 1 as col_1_0_
from category category0_
         cross join category category1_
where category0_.lft between category1_.lft and category1_.rgt
group by category0_.name, category0_.lft
order by category0_.lft;

col_0_0_	col_1_0_
Price	0
0-60	1
101-250	1
251-500	1
61-80	1
81-100	1
Label	0
Anis	1
Bread	1
Butter	1
Honey	1
Vanilla	1
Young	1
Special	0
Available in store	1
Findings under 150	1
Recent News	1
Types	0
Box	1
Champagne	1
Dessert	1
Red	1
Rose	1
Sparkling	1
White	1
Country	0
Europe	1
Austria	2
France	2
Germany	2
Italy	2
Spain	2
New World	1
Argentina	2
Australia	2
South Africa	2
Producer	0
Kanonkop	1
Masi	1
Miguel Torres	1
Grape	0
Chardonnay	1
Riesling	1
Merlot	1
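The depth column falls out of counting how many intervals enclose each node. The same logic as a small Java sketch over in-memory rows, where CategoryRow is a hypothetical value class mirroring the name, lft and rgt columns:

```java
import java.util.List;

// Hypothetical value class mirroring the name, lft and rgt columns.
class CategoryRow {
    final String name;
    final int lft;
    final int rgt;

    CategoryRow(String name, int lft, int rgt) {
        this.name = name;
        this.lft = lft;
        this.rgt = rgt;
    }
}

class NestedSetDepth {
    // Counts the rows whose interval encloses the node (itself included)
    // and subtracts one, like count(category1_.name) - 1 in the query.
    static int depth(CategoryRow node, List<CategoryRow> all) {
        int enclosing = 0;
        for (CategoryRow candidate : all) {
            if (node.lft >= candidate.lft && node.lft <= candidate.rgt) {
                enclosing++;
            }
        }
        return enclosing - 1;
    }
}
```

With Country (51, 72), Europe (52, 63) and France (55, 56) this yields depths 0, 1 and 2, matching the listing above.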

To filter down on a particular sub-tree name:

SELECT node.name AS name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS depth
FROM category AS node,
     category AS parent,
     category AS sub_parent,
     (SELECT node.name, (COUNT(parent.name) - 1) AS depth
      FROM category AS node,
           category AS parent
      WHERE node.lft BETWEEN parent.lft AND parent.rgt
        AND node.name = 'Label'
      GROUP BY node.name, node.lft
      ORDER BY node.lft) AS sub_tree
WHERE node.lft BETWEEN parent.lft AND parent.rgt
  AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt
  AND sub_parent.name = sub_tree.name
GROUP BY node.name, node.lft, depth
ORDER BY node.lft;

name	depth
Label	0
Anis	1
Bread	1
Butter	1
Honey	1
Vanilla	1
Young	1

The subquery could also be extracted to a CTE to improve readability a bit by rewriting the query:

with sub_tree as (SELECT node.name, (COUNT(parent.name) - 1) AS depth
                  FROM category AS node,
                       category AS parent
                  WHERE node.lft BETWEEN parent.lft AND parent.rgt
                    AND node.name = 'Label'
                  GROUP BY node.name, node.lft
                  ORDER BY node.lft)
SELECT node.name AS name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS depth
FROM category AS node,
     category AS parent,
     category AS sub_parent,
     sub_tree
WHERE node.lft BETWEEN parent.lft AND parent.rgt
  AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt
  AND sub_parent.name = sub_tree.name
GROUP BY node.name, node.lft, depth
ORDER BY node.lft;

Demo Project

To explore the nested set model in a semi-realistic context, let us use a service with the following product catalog schema:

The entity model consists of products, product variations with attributes, and categories organized into a tree structure. This is a fairly common layout for an online webshop and similar applications.

The main artifacts of interest are:

Category - Domain entity that models a hierarchy of categories.
JpaCategoryRepository - The JPA implementation using Criteria API to craft the nested set model operations.
ProductCatalogTest - Functional tests of the product catalog.

This implementation is using JPA with Hibernate and partly also the dreaded but type safe Criteria API which happens to be fairly suitable for this problem domain.

To take one query example, finding all products within a given non-leaf category such as "Europe" that has several country categories below it:

Client code:

    @Test
    @Order(11)
    @Transactional
    @Commit
    public void whenFindingBeveragesByInheritedCategory_expectResults() {
        // Expects beverages under 'Europe'
        Category category = categoryRepository.getByTypeAndName(DistrictCategory.class, "Europe");
        List products = productRepository.findByCategory(category);
        // JUnit 5 takes the expected value first, then the actual value
        Assertions.assertEquals(1, products.size());
    }

SQL:

First for the category:

select distinct districtca0_.id          as id2_1_,
                districtca0_.description as descript3_1_,
                districtca0_.lft         as lft4_1_,
                districtca0_.name        as name5_1_,
                districtca0_.parent_id   as parent_i8_1_,
                districtca0_.rgt         as rgt6_1_,
                districtca0_.country     as country7_1_,
                categorize1_.category_id as category1_0_0__,
                categorize1_.expires_at  as expires_2_0_0__,
                categorize1_.product_id  as product_3_0_0__
from category districtca0_
         left outer join categorized_product categorize1_ on districtca0_.id = categorize1_.category_id
where districtca0_.category_type = 'DISTRICT'
  and districtca0_.name = 'Europe';

Then for the products in that category:

select product2_.id          as id1_2_,
       product2_.description as descript2_2_,
       product2_.name        as name3_2_,
       product2_.sku_code    as sku_code4_2_
from category category0_
         inner join categorized_product categorize1_ on category0_.id = categorize1_.category_id
         inner join product product2_ on categorize1_.product_id = product2_.id
where category0_.lft between 52 and 63;

Conclusion

Modeling tree structures in SQL databases used to be a pain until recursive CTEs were introduced. Before that, the nested set model offered a more efficient alternative to the adjacency list model, which without CTEs needs recursion or one self-join per level.

Running CockroachDB TPC-C benchmark on GKE

Kai Niemi — Sat, 29 Oct 2022 09:59:18 GMT

Overview

This article will demonstrate how to run a TPC-C 2.5K benchmark on a self-hosted, 3-node, single-region CockroachDB cluster on Google Kubernetes Engine (GKE).

The TPC-C workload is modeled around the concept of a warehouse which is used as the throughput "knob" to measure scalability. The number of warehouses used maps to a given data volume baseline where a warehouse count of 2,500 translates to approximately a 200 GiB dataset.

The demo cluster is using 3x c2-standard-16 machines to match that volume, along with provisioned IOPS to follow CockroachDB recommended production guidelines on hardware ratios. The workload is run through an internal client pod, with an option to run through either an internal or an external load balancer as well.

About the TPC-C Benchmark

CockroachDB's built-in TPC-C workload is based on the official TPC-C, the industry standard benchmark for On-line Transaction Processing (OLTP) performance. It simulates an industry-agnostic business with an OLTP database that manages, sells, or distributes a product.

The TPC-C workload measures databases across two different metrics:

Throughput: Measured as throughput-per-minute (tpm), which in practical terms measures the number of orders processed per minute.
Scale: Measured as the total number of warehouses supported. Each warehouse is of a fixed data size and has a max amount of tpm that it is allowed to support, so the total data size of the benchmark is scaled proportionally to throughput.

The CockroachDB TPC-C implementation can be found here and the schema can be found here.

The TPC-C workload is constructed to validate that the efficiency rate can be sustained when aiming for an increasingly higher tpmC (max throughput). Efficiency is measured in an explicit way. There's a limit to the number of tpmC allowed per warehouse, which is 12.86. The data amount per warehouse is about 200MiB, so for 2,500 warehouses, the maximum throughput is 2500 x 12.86 tpmC, which is 32,150.

Because TPC-C is constrained to a maximum amount of throughput per warehouse (12.86 tpmC), we often discuss TPC-C performance as the maximum number of warehouses for which a database can maintain the maximum throughput per minute (tpmC). In TPC-C, the required minimum to qualify is a P95 latency below 10s and an efficiency rate over 85%.

To take a few examples, assume:

100 warehouses at 200MiB x 100 gives 1240 tpmC (max is 1286), that's an efficiency rate of 96.4% (1240/(100 x 12.86)).
1000 warehouses at 200MiB x 1000 gives 12,500 tpmC, that's an efficiency rate of 97.2% (12500/(1000 x 12.86)).
2500 warehouses at 200MiB x 2500 gives 30,837 tpmC, that's an efficiency rate of 95.9% (30837/(2500 x 12.86)).
100,000 warehouses at ~20TiB gives 1,200,000 tpmC, that's an efficiency rate of 93.3%.
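The arithmetic behind these examples is simple enough to capture in a helper. A minimal sketch, where the class and method names are made up for illustration and only the 12.86 constant comes from the text above:

```java
class TpccEfficiency {
    // Maximum tpmC allowed per warehouse, as stated above.
    static final double MAX_TPMC_PER_WAREHOUSE = 12.86;

    // Theoretical maximum throughput for a given warehouse count.
    static double maxTpmC(int warehouses) {
        return warehouses * MAX_TPMC_PER_WAREHOUSE;
    }

    // Measured throughput as a percentage of the theoretical maximum.
    static double efficiencyPercent(double measuredTpmC, int warehouses) {
        return 100.0 * measuredTpmC / maxTpmC(warehouses);
    }
}
```

For example, efficiencyPercent(30837, 2500) comes out at roughly 95.9, which is the third example in the list.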

The largest published result (https://www.cockroachlabs.com/docs/v22.1/performance#benchmarks-used) for CockroachDB is 1.7M tpmC with 140,000 warehouses on 81 nodes, resulting in an efficiency score of 95%.

TPC-C Test Setup

Overview of the cluster setup for a small TPC-C workload size (2.5K warehouses) on a 3-node cluster in a single region. For more details, see Performance Benchmarking with TPC-C Small and also Deploy CockroachDB with Kubernetes.

Layout:

Single region: europe-west1
Machines: c2-standard-16
3 CockroachDB nodes + 1 client node
Secure cluster
Manual StatefulSet configuration

Optional:

External load balancer service
1x client outside of k8s for controlling the tpcc workload

Setup Steps

Step 1 - Start GKE cluster

gcloud container clusters create cockroachdb --machine-type c2-standard-16 --region europe-west1 --num-nodes 1

Step 2 - Create RBAC roles

kubectl create clusterrolebinding $USER-cluster-admin-binding --clusterrole=cluster-admin --user=<email>

Step 3 - Configure cluster

curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/bring-your-own-certs/cockroachdb-statefulset.yaml

Edit cockroachdb-statefulset.yaml and update:

Resource requests / limits to reflect c2-standard-16 machines:

    resources:
      requests:
        cpu: "15"
        memory: "55Gi"
      limits:
        cpu: "15"
        memory: "55Gi"

Add a custom pd-ssd storage class:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
    name: gocrazy
provisioner: kubernetes.io/gce-pd
parameters:
    type: pd-ssd

Add storageClassName and change storage size to 2TiB:

volumeClaimTemplates:
- metadata:
      name: datadir
  spec:
      accessModes:
      - "ReadWriteOnce"
      storageClassName: gocrazy
      resources:
          requests:
              storage: 2Ti

pd-ssd is recommended for pods < 32 vCPU and a minimum of 500 IOPS per vCPU is needed for optimal performance.

Step 4 - Create certificates

mkdir certs my-safe-directory
cockroach cert create-ca --certs-dir=certs --ca-key=my-safe-directory/ca.key
cockroach cert create-client root --certs-dir=certs --ca-key=my-safe-directory/ca.key
kubectl create secret generic cockroachdb.client.root --from-file=certs
cockroach cert create-node localhost 127.0.0.1 cockroachdb-public cockroachdb-public.default cockroachdb-public.default.svc.cluster.local "*.cockroachdb" "*.cockroachdb.default" "*.cockroachdb.default.svc.cluster.local" --certs-dir=certs --ca-key=my-safe-directory/ca.key
kubectl create secret generic cockroachdb.node --from-file=certs
kubectl get secrets

Step 5 - Initialize cluster

kubectl create -f cockroachdb-statefulset.yaml
kubectl get pods
kubectl get pv
kubectl exec -it cockroachdb-0 -- /cockroach/cockroach init --certs-dir=/cockroach/cockroach-certs
kubectl get pods

Step 6 - Create secure pod for SQL cli

kubectl create -f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/bring-your-own-certs/client.yaml
kubectl exec -it cockroachdb-client-secure -- ./cockroach sql --certs-dir=/cockroach-certs --host=cockroachdb-public

Create user in CLI:

CREATE USER roach WITH PASSWORD '123456'; -- maintains the rank..
GRANT admin to roach;

Step 7 - Access DB console (optional)

Setup port forwarding:

kubectl port-forward service/cockroachdb-public 8080
open https://localhost:8080/#/overview/list

Step 8 - Add external Load Balancer (optional)

kubectl get services
kubectl expose service cockroachdb --port=26257 --target-port=26257 --name=cockroachdb-external --type=LoadBalancer
kubectl get services
(wait for external ip)

Example:

NAME                   TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)              AGE
cockroachdb            ClusterIP      None           <none>         26257/TCP,8080/TCP   11m
cockroachdb-external   LoadBalancer   10.3.248.15    34.140.51.58   26257:30143/TCP      48s
cockroachdb-public     ClusterIP      10.3.244.188   <none>         26257/TCP,8080/TCP   11m
kubernetes             ClusterIP      10.3.240.1     <none>         443/TCP              26m

Try connecting which should fail:

cockroach sql --url "postgres://root@34.140.51.58:26257" --certs-dir=certs
ERROR: x509: certificate is valid for 127.0.0.1, not 34.140.51.58

Update certificates with external IP:

cockroach cert create-node 34.140.51.58 localhost 127.0.0.1 cockroachdb-public cockroachdb-public.default cockroachdb-public.default.svc.cluster.local "*.cockroachdb" "*.cockroachdb.default" "*.cockroachdb.default.svc.cluster.local" --certs-dir=certs --ca-key=my-safe-directory/ca.key --overwrite
kubectl delete secret cockroachdb.node --ignore-not-found
kubectl create secret generic cockroachdb.node --from-file=certs

Restart pods and reconnect with success:

cockroach sql --url "postgres://root@34.140.51.58:26257" --certs-dir=certs

Benchmark Steps

Step 1 - Import dataset

2,500 warehouses is about 200GiB of data - see jobs in db console.

Use either alternative:

Option 1 - via public ip:

cockroach workload fixtures import tpcc --warehouses=2500 'postgres://root@:26257?sslmode=verify-full&sslrootcert=certs/ca.crt&sslcert=certs/node.crt&sslkey=certs/node.key'

Option 2 - via client pod and public service:

kubectl exec -it cockroachdb-client-secure -- ./cockroach workload fixtures import tpcc --warehouses=2500 'postgres://root@cockroachdb-public:26257?sslmode=verify-full&sslrootcert=/cockroach-certs/ca.crt&sslcert=/cockroach-certs/client.root.crt&sslkey=/cockroach-certs/client.root.key'

Step 2 - Run TPC-C workload for 30m

Use either alternative:

Option 1 - via external lb:

ulimit -n 100000 && cockroach workload run tpcc --tolerate-errors --warehouses=2500 --ramp=1m --duration=30m 'postgres://root@34.140.51.58:26257?sslmode=verify-full&sslrootcert=certs/ca.crt&sslcert=certs/node.crt&sslkey=certs/node.key'

Option 2 - via client pod and public service:

kubectl exec -it cockroachdb-client-secure -- ./cockroach workload run tpcc --tolerate-errors --warehouses=2500 --ramp=1m --duration=30m 'postgres://root@cockroachdb-public:26257?sslmode=verify-full&sslrootcert=/cockroach-certs/ca.crt&sslcert=/cockroach-certs/client.root.crt&sslkey=/cockroach-certs/client.root.key'

Option 3 - via client pod directly to pods:

create an addrs file:

postgres://root@cockroachdb-0.cockroachdb.default.svc.cluster.local:26257?sslmode=verify-full&sslrootcert=/cockroach-certs/ca.crt&sslcert=/cockroach-certs/client.root.crt&sslkey=/cockroach-certs/client.root.key postgres://root@cockroachdb-1.cockroachdb.default.svc.cluster.local:26257?sslmode=verify-full&sslrootcert=/cockroach-certs/ca.crt&sslcert=/cockroach-certs/client.root.crt&sslkey=/cockroach-certs/client.root.key postgres://root@cockroachdb-2.cockroachdb.default.svc.cluster.local:26257?sslmode=verify-full&sslrootcert=/cockroach-certs/ca.crt&sslcert=/cockroach-certs/client.root.crt&sslkey=/cockroach-certs/client.root.key

run:

kubectl exec -it cockroachdb-client-secure -- ./cockroach workload run tpcc --tolerate-errors --warehouses=2500 --ramp=1m --duration=30m $(cat addrs)

Step 3 - Review results

Ex:

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
1800.1s    30837.5  95.9%    326.2    302.0    604.0    704.6    973.1   3623.9

Benchmark passing criteria for our derivative TPC-C results:

P90 Latency < 5 Seconds
Efficiency rate over 95%.

TPC-C requirements are P95<10s and efficiency rate over 85%.
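If you run the benchmark repeatedly, checking the summary row by hand gets old. Below is a rough sketch that parses a row in the column order shown above and applies both sets of criteria. The TpccSummary class is hypothetical and the parsing assumes exactly the nine whitespace-separated columns from the example output:

```java
class TpccSummary {
    final double tpmC;
    final double efficiencyPercent;
    final double p90Millis;
    final double p95Millis;

    TpccSummary(double tpmC, double efficiencyPercent, double p90Millis, double p95Millis) {
        this.tpmC = tpmC;
        this.efficiencyPercent = efficiencyPercent;
        this.p90Millis = p90Millis;
        this.p95Millis = p95Millis;
    }

    // Expected columns: elapsed, tpmC, efc, avg, p50, p90, p95, p99, pMax.
    static TpccSummary parse(String row) {
        String[] cols = row.trim().split("\\s+");
        if (cols.length != 9) {
            throw new IllegalArgumentException("expected 9 columns but got " + cols.length);
        }
        return new TpccSummary(
                Double.parseDouble(cols[1]),
                Double.parseDouble(cols[2].replace("%", "")),
                Double.parseDouble(cols[5]),
                Double.parseDouble(cols[6]));
    }

    // Criteria used in this article: P90 < 5s and efficiency over 95%.
    boolean passesArticleCriteria() {
        return p90Millis < 5_000 && efficiencyPercent > 95.0;
    }

    // TPC-C criteria as quoted above: P95 < 10s and efficiency over 85%.
    boolean passesTpccCriteria() {
        return p95Millis < 10_000 && efficiencyPercent > 85.0;
    }
}
```

The example row above passes both checks: P90 is 604 ms, P95 is 704.6 ms and the efficiency rate is 95.9%.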

Cleanup Steps

Ensure the storage claims are deleted as well since it's not automatic in GKE.

kubectl delete pods,statefulsets,services,poddisruptionbudget,jobs,rolebinding,clusterrolebinding,role,clusterrole,serviceaccount -l app=cockroachdb
kubectl delete pod cockroachdb-client-secure
gcloud container clusters delete cockroachdb --region europe-west1

Conclusion

This was a tutorial of running the TPC-C workload in CockroachDB on a self-hosted GKE cluster.

https://www.cockroachlabs.com/guides/2022-cloud-report/
https://www.cockroachlabs.com/docs/v22.1/kubernetes-performance.html
https://www.cockroachlabs.com/docs/v22.1/operate-cockroachdb-kubernetes.html
https://www.cockroachlabs.com/docs/v22.1/recommended-production-settings.html
https://cloud.google.com/compute/docs/disks
https://www.cockroachlabs.com/docs/v22.1/performance-benchmarking-with-tpcc-small
https://www.cockroachlabs.com/docs/v22.1/performance#scale
http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf

Trading Engine using CockroachDB

Kai Niemi — Wed, 26 Oct 2022 16:00:42 GMT

Overview

This article describes a simple online trading engine sandbox implemented using Spring Boot and CockroachDB. The purpose is to showcase this technology stack in a slightly more involved use case than ordinary "hello world". It demonstrates a few common mechanisms and microservice architecture patterns such as retryable transactions and event-driven data redundancy via CDC.

Example Code

The code for this service is available on GitHub.

Design

The system provides the ability to place buy and sell orders on stock products and review order history. It models customer and market accounts. Accounts must not be overdrawn, i.e. have a negative balance at any given time, and the total balance across all accounts in the system must remain constant.

The system uses double entry bookkeeping for maintaining a transaction history, even though all orders (transactions) are two-legged. When a buy order is placed, the market (trading) account is debited and the customer (system) account credited. When a sell order is placed, the market account is credited and the customer account debited. Portfolios containing the holdings are created when accounts are created.
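The two invariants can be illustrated with a tiny in-memory ledger. This is not the code from the trading service, just a sketch with made-up names (MiniLedger, transfer) that moves an amount between two existing accounts as two legs that cancel each other out, leaving aside which side counts as debit or credit:

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory ledger; accounts are assumed to be opened before use.
class MiniLedger {
    private final Map<String, BigDecimal> balances = new HashMap<>();

    void open(String account, BigDecimal balance) {
        balances.put(account, balance);
    }

    BigDecimal balance(String account) {
        return balances.get(account);
    }

    // Sum of all balances, which must stay constant across transfers.
    BigDecimal total() {
        BigDecimal sum = BigDecimal.ZERO;
        for (BigDecimal b : balances.values()) {
            sum = sum.add(b);
        }
        return sum;
    }

    // Applies two legs (-amount, +amount) or nothing at all.
    void transfer(String from, String to, BigDecimal amount) {
        if (amount.signum() <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        BigDecimal remaining = balances.get(from).subtract(amount);
        if (remaining.signum() < 0) {
            throw new IllegalStateException("insufficient funds: " + from);
        }
        balances.put(from, remaining);
        balances.put(to, balances.get(to).add(amount));
    }
}
```

In the real service both legs are written in one database transaction, so the all-or-nothing behavior comes from ACID guarantees rather than from checking before writing.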

Structure

The system is composed of the following artifacts:

product-server - Technical authority for stock products, feeds trading engine via CDC
trading-api - REST API value objects
trading-client - Spring shell application for submitting order requests
trading-server - Main trading engine service

The code is organized by feature rather than domain responsibility. Each feature, such as products, orders, accounts and portfolios, has a dedicated package that includes API controllers, business services, persistent domain entities and repositories.

Product Server

This is a very small service and the authority for stock products. It stores products and uses an internal scheduler to simulate creation and updates of products. It uses CDC via the webhook sink to feed the trading service that keeps a materialized view of the product inventory.

Trading Server

This service provides an API for placing buy and sell orders. It receives CDC events from the product service to keep a materialized view of the product inventory updated.

It uses the following entity model:

table	description
account	Trading and system account balance with currency
booking_order	Buy and sell order header
booking_order_item	Buy and sell order item or leg (2-legged per order)
limits	Market buy and sell limits
portfolio	Product portfolio per market
portfolio_item	Product item for a portfolio (one item per placed order where the order sum is aggregated)
product	Stock products with buy and sell prices (shallow copy)

Architectural Mechanisms

An architectural mechanism represents a common solution to a frequently encountered architectural problem that is not specific to a project or business domain. Architectural mechanisms can be divided into three categories: analysis, design and implementation. For instance, if persistence is needed (analysis) and ACID properties are relevant for protecting invariants, then an RDBMS (design) should be used, where CockroachDB (implementation) is a suitable option.

Interface

The service uses a hypermedia driven API for accepting buy and sell orders and for reviewing order history and products.

Persistence

The service uses JPA with Hibernate and Spring Data JPA. Both CockroachDB and PostgreSQL are supported. Flyway is used to version the database schema and load initial data.

Transactions

The application relies on database ACID properties and uses local transactions. Pessimistic locking is used in critical sections to reduce the likelihood of retries due to transient serialization rollback errors.

Retries

Server-side transaction retries of transient rollback errors (code 40001) are done through AOP, mainly at order placement.
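Stripped of the AOP machinery, the retry logic boils down to a loop around the transaction. The sketch below is plain Java rather than the aspect used in the service. SqlWork, withRetry and the backoff values are made up for illustration, and only the SQL state 40001 comes from the text above:

```java
import java.sql.SQLException;

class TransactionRetry {
    // Hypothetical callback representing one attempt at a whole transaction.
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    static final String SERIALIZATION_FAILURE = "40001";

    // Runs the work, retrying only on SQL state 40001, up to maxAttempts times.
    static <T> T withRetry(SqlWork<T> work, int maxAttempts) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                boolean retryable = SERIALIZATION_FAILURE.equals(e.getSQLState());
                if (!retryable || attempt >= maxAttempts) {
                    throw e;
                }
                backoff(attempt);
            }
        }
    }

    // Exponential backoff (20ms, 40ms, ...) capped at one second.
    private static void backoff(int attempt) throws SQLException {
        long millis = Math.min(1000L, 10L << Math.min(attempt, 7));
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new SQLException("interrupted while backing off", ie);
        }
    }
}
```

The important detail is that the callback must cover the entire transaction, from begin to commit, since a rolled back transaction has to be replayed from the start.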

Eventing

The system does not use any messaging system but instead CockroachDB's integrated CDC feature with the webhook sink to keep a materialized view of stock products up-to-date.
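On the receiving side, keeping such a view current mostly means applying upserts and deletes per key while ignoring duplicate or stale deliveries. A simplified sketch follows. ProductChange and ProductView are hypothetical types, not the service's actual classes or the webhook payload format, and the per-event timestamp is assumed to be available to the consumer:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified change event. A null "after" means the row was deleted.
class ProductChange {
    final String key;
    final String after;
    final long updatedNanos;

    ProductChange(String key, String after, long updatedNanos) {
        this.key = key;
        this.after = after;
        this.updatedNanos = updatedNanos;
    }
}

class ProductView {
    private final Map<String, String> rows = new HashMap<>();
    private final Map<String, Long> lastApplied = new HashMap<>();

    // Applies an event unless one with the same or a newer timestamp was
    // already applied for that key. Returns true if the event was applied.
    synchronized boolean apply(ProductChange change) {
        Long seen = lastApplied.get(change.key);
        if (seen != null && seen >= change.updatedNanos) {
            return false; // duplicate or stale delivery
        }
        lastApplied.put(change.key, change.updatedNanos);
        if (change.after == null) {
            rows.remove(change.key);
        } else {
            rows.put(change.key, change.after);
        }
        return true;
    }

    synchronized String get(String key) {
        return rows.get(key);
    }
}
```

Making the apply step idempotent like this means the consumer does not have to care whether an event is delivered once or several times.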

Technology Stack

Summary of used technology stack:

JDK 1.8+
CockroachDB 22.1
PostgreSQL 10+
JDBC 4
JPA 2 with Hibernate 5
Flyway
SLF4J and Logback
Spring Boot with Jetty
Spring Data JPA
Spring HATEOAS
JUnit 5 and Mockito

Conclusion

This article demonstrated a simple online trading system using CockroachDB, with full source code available on GitHub.

Cluster Singletons using ShedLock

Kai Niemi — Tue, 25 Oct 2022 16:00:45 GMT

Overview

Imagine you have a service deployed in multiple instances across the entire globe that needs to run scheduled tasks, exclusively one at a time, as cluster singletons.

How would you ensure that such a task, deployed on multiple processors, only runs on one node at any given time in a distributed environment?

This is a fairly common requirement for batch ingest, sending user notifications, performing cleanups and other data housekeeping operations.

It suggests using some kind of distributed locking mechanism to help coordinate actions across a network of machines (nodes). Distributed locks are unfortunately an unsafe concept in a system of independent processors interacting over an asynchronous network. The problem is that you can't tell the difference between a slow and a failed processor, which is conceptualized in the Two Generals' Paradox.

There are only two hard things in Computer Science: cache invalidation and naming things. Phil Karlton

Cache invalidation, naming things, exactly-once delivery and distributed locks are ultimately about the same tradeoffs, which are conceptualized by the CAP theorem.

A distributed lock service would need to provide for both safety and liveness. A safety property asserts that nothing bad happens during execution. A liveness property asserts that something good eventually happens. When mapped to lock semantics, a contradicting unsafe property would be to hand out the same lock to multiple clients. Then it wouldn't really be a lock anymore, in violation of contract.

The lack of liveness could be exemplified by having to wait indefinitely for a claimed lock to be released. For example, a client can hold a lock for an arbitrarily long period of time and prevent other clients from acquiring the lock and making forward progress. That's the expected behavior. However, what if that client has just gone fishing or is stuck doing nothing? Then nobody can make any progress until something eventually happens.

Adding a lock timeout or putting a TTL (auto-timeout) on the lock doesn't fix the problem since (again) we can't tell the difference between a client gone fishing and one just taking a very long time doing its task.

To preserve safety and deliver a fault-tolerant locking service, one option is to rely on a storage system with a transactional guarantee and a monotonically increasing fencing token. A fencing token is simply a number that increases monotonically whenever a lock is acquired. This token can then be used in a CAS operation to reject expired tokens that may come back later and haunt the lock service.

This also requires the component that the lock protects to become a stateful observer that takes part in the protocol and is responsible for rejecting old tokens. It is the type of approach you find in Google's Chubby, etcd and recently also Hazelcast, with slightly different terminology.
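To make the idea concrete, here is a single-process sketch of the two roles. TokenIssuer stands in for the lock service and FencedResource for the protected component. Both names are made up, and a real lock service would persist the counter in transactional storage rather than in memory:

```java
import java.util.concurrent.atomic.AtomicLong;

// Stand-in for the lock service: every acquisition gets a larger token.
class TokenIssuer {
    private final AtomicLong counter = new AtomicLong();

    long acquire() {
        return counter.incrementAndGet();
    }
}

// Stand-in for the protected component, which remembers the highest token seen.
class FencedResource {
    private long highestToken = -1;
    private String value;

    // Accepts the write only if the token is not older than one already seen.
    synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestToken) {
            return false; // stale lock holder, reject
        }
        highestToken = fencingToken;
        value = newValue;
        return true;
    }

    synchronized String read() {
        return value;
    }
}
```

If client A acquires token 1, stalls past its timeout, and client B then acquires token 2 and writes, a late write from A carrying token 1 is rejected instead of silently overwriting B's result.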

In summary, if you need scheduled tasks to run exclusively as singletons rather than concurrently, then a cluster singleton lock that can span beyond a single processor is your friend.

For the rest of this article we are going to look at ShedLock with CockroachDB, but without fencing tokens since ShedLock doesn't have them.

A geo-distributed database such as CockroachDB has the necessary mechanisms to manage consensus across a cluster, and that's what we are going to leverage for task scheduling locks.

Introduction to ShedLock

In this example we are going to use a small Java library called ShedLock along with Spring's built-in scheduling mechanism. Spring scheduling only provides scheduling and not any type of locking. For that, we will use ShedLock that will provide a simple locking mechanism needed to ensure cluster-singleton task execution. ShedLock on the other hand does not do any scheduling, only locking.

ShedLock provides an API for acquiring and releasing locks as well as connectors to a wide variety of lock providers such as Etcd, Zookeeper, Consul, Cassandra, Redis, Hazelcast and virtually any SQL database through JDBC.

As a side note, there are also plenty of distributed schedulers, like Quartz and Chronos on Mesos, that achieve similar goals.

These mechanisms combined, along with CockroachDB as a distributed SQL database, provide a simple and easy-to-use cluster singleton task framework. CockroachDB's strong consistency, fault-tolerance and distributed coordination provide a solid foundation for the underpinning mutex mechanism. A CockroachDB cluster can span multiple regions and it provides the same transactional guarantees and strong consistency properties regardless of the deployment topology.

Limitations

One caveat with ShedLock is that it relies on lock timeouts. If a method holding a lock exceeds the set time limit and the lock expires, then it can be claimed by another thread / process. Although it's documented behavior, it fails on the safety property and can hand out the same lock to multiple threads.

It means that a paused or busy method that exceeds the timeout can come back to life again and cause multiple side-effects since it's no longer holding the lock.

There is a KeepAliveLockProvider that can extend the lock period if needed, but it doesn't solve the underlying problem, it just postpones the expiry. Lock timeouts are a problem that can be solved by using fencing tokens as described earlier.

Rule of thumb for ShedLock:

You have to set lockAtMostFor to a value which is much longer than normal execution time. If the task takes longer than lockAtMostFor the resulting behavior may be unpredictable (more than one process will effectively hold the lock).

So be aware of this limitation before using ShedLock for anything serious in a production environment.
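The hazard is easy to reproduce with a toy lease lock and an explicit clock. The class below only mimics the idea of a lock_until column and is not ShedLock code. The LeaseLock name and the millisecond timestamps passed in by the caller are assumptions made for the sake of the example:

```java
import java.util.HashMap;
import java.util.Map;

class LeaseLock {
    // Lock name -> time in milliseconds until which the lock counts as held.
    private final Map<String, Long> lockUntil = new HashMap<>();

    // Grants the lock if it is free or if its lease has expired at "now".
    synchronized boolean tryLock(String name, long now, long lockAtMostForMillis) {
        Long until = lockUntil.get(name);
        if (until != null && until > now) {
            return false;
        }
        lockUntil.put(name, now + lockAtMostForMillis);
        return true;
    }
}
```

Say instance A takes the lock at t=0 with a 5 minute limit and then stalls. Instance B is refused at t=4 minutes but granted the lock at t=6 minutes, while A may still be in the middle of its task. From that point on, both believe they are the singleton.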

Source Code

The source code for this example is available in GitHub.

Configuration

To use ShedLock with Spring Boot, first add the Maven dependencies:

        <dependency>
            <groupId>net.javacrumbs.shedlock</groupId>
            <artifactId>shedlock-spring</artifactId>
            <version>4.42.0</version>
        </dependency>
        <dependency>
            <groupId>net.javacrumbs.shedlock</groupId>
            <artifactId>shedlock-provider-jdbc-template</artifactId>
            <version>4.42.0</version>
        </dependency>

Next, create a database table that will store information about the locks:

create table shedlock
(
    name       varchar(64)  not null,
    lock_until timestamp    not null,
    locked_at  timestamp    not null,
    locked_by  varchar(255) not null,
    primary key (name)
);

Next, define the lock provider which will be using JDBC:

@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "10m")
public class LockingConfiguration {
    @Bean
    public LockProvider lockProvider(DataSource dataSource) {
        // builder() is JdbcTemplateLockProvider.Configuration.builder(), statically imported
        return new JdbcTemplateLockProvider(builder()
                .withJdbcTemplate(new JdbcTemplate(dataSource))
                .usingDbTime()
                .build()
        );
    }
}

In the configuration above, notice @EnableScheduling and @EnableSchedulerLock, which are the annotations used to enable scheduling and to set the default for how long a lock may be held at most (defaultLockAtMostFor), 10 minutes in this example.

Usage

Now we are ready to annotate the methods we would like to run following a schedule, and execute as cluster singletons. These methods will run exclusively one at a time (per method) no matter how many application instances or threads per instance we use.

For this purpose we are using the @Scheduled and @SchedulerLock annotations.

@Service
public class ProductNewsFeed {
    private final Logger logger = LoggerFactory.getLogger(getClass());

    @Autowired
    private ProductService productService;

    @Scheduled(cron = "0/30 * * * * ?")
    @SchedulerLock(lockAtLeastFor = "1m", lockAtMostFor = "5m", name = "publishNews_clusterSingleton")
    public void publishNews() {
        logger.info(">> Entered cluster singleton method");
        // gone fishing
        logger.info("<< Exiting cluster singleton method");
    }
}

The @Scheduled annotation makes Spring create an underlying scheduled task to execute this method at a specified interval. It uses a cron expression, which in this case means every 30 seconds.

The @SchedulerLock is used by ShedLock to acquire a lock when this method goes into scope. The name must be unique since it's used as a key. The parameter lockAtLeastFor is optional and means this method will hold the lock for 1 minute at a minimum. The parameter lockAtMostFor is also optional and means this method will hold the lock for at most 5 minutes.

In case the lock holder or method execution goes beyond this time, the lock is up for grabs for another session/thread. Otherwise the lock is released when the method goes out of scope.
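How the two parameters interact is easier to see as a small calculation. The sketch below models only the semantics described here and is not ShedLock's implementation. LockWindow and its method names are made up:

```java
import java.time.Duration;
import java.time.Instant;

class LockWindow {
    final Instant lockedAt;
    final Duration lockAtLeastFor;
    final Duration lockAtMostFor;

    LockWindow(Instant lockedAt, Duration lockAtLeastFor, Duration lockAtMostFor) {
        if (lockAtLeastFor.compareTo(lockAtMostFor) > 0) {
            throw new IllegalArgumentException("lockAtLeastFor must not exceed lockAtMostFor");
        }
        this.lockedAt = lockedAt;
        this.lockAtLeastFor = lockAtLeastFor;
        this.lockAtMostFor = lockAtMostFor;
    }

    // Latest instant the lock is held if the task never releases it.
    Instant expiresAt() {
        return lockedAt.plus(lockAtMostFor);
    }

    // Instant at which others can take the lock when the task ends at finishedAt.
    Instant availableAt(Instant finishedAt) {
        Instant earliest = lockedAt.plus(lockAtLeastFor);
        Instant release = finishedAt.isBefore(earliest) ? earliest : finishedAt;
        return release.isAfter(expiresAt()) ? expiresAt() : release;
    }
}
```

With the values from the example (1m and 5m), a task that finishes after 10 seconds keeps the lock until the 1 minute mark, a task that finishes after 2 minutes releases it right away, and a task still running after 7 minutes lost the lock at 5 minutes. A side effect worth noting is that with a 30 second cron schedule and lockAtLeastFor set to 1 minute, the trigger that fires while the lock is still held is skipped.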

Demo

First build the demo app:

git clone git@github.com:kai-niemi/roach-spring-boot.git
cd roach-spring-boot
./mvnw clean install
cd spring-boot-locking/target

Start first instance:

nohup java -jar spring-boot-locking.jar --server.port=8090 > locking-1-stdout.log 2>&1 &

Start second instance:

nohup java -jar spring-boot-locking.jar --server.port=8091 > locking-2-stdout.log 2>&1 &

Start third instance:

nohup java -jar spring-boot-locking.jar --server.port=8092 > locking-3-stdout.log 2>&1 &

You should then see "Gone fishing" only in one of the logs at any given time. You can tailor the timeout so that the task exceeds the time limit, at which point the lock guarantee goes out of the window.

Conclusion

In this article, we learned how to create and synchronize cluster singleton scheduled tasks using ShedLock and CockroachDB. ShedLock does come with the caveat that it relies on lock timeouts, which are an unsafe construct.

Designing Idempotent REST APIs

Kai Niemi — Wed, 19 Oct 2022 18:17:57 GMT

Overview

This article outlines common techniques for implementing REST API idempotency using the Spring Boot stack, or to be specific: idempotent POST methods.

The POST method is not safe or idempotent by specification, yet it's frequently used when creating new resources. How do you guarantee that double posting doesn't result in duplicate side-effects, as in exactly/effectively once semantics?

One way of turning a POST method idempotent is by using a generated ID or token as a pre-condition control element to check whether the request has been previously processed or not. If that is the case, any additional requests will just be de-duplicated as no-ops. This way, the client can safely re-submit a POST request any number of times without concerning itself about causing multiple side-effects in the output of the operation.

Before jumping into the weeds on how, let's first do a brief primer on REST, HTTP and why idempotency is an important and useful API design property.

What is a REST API?

Representational State Transfer (REST), coined by Roy Fielding two decades ago, is an architectural style for distributed hypermedia systems described by a set of constraints: uniform interface, client-server, stateless, cacheable, layered system and code-on-demand.

REST provides a set of architectural constraints that, when applied as a whole, emphasizes scalability of component interactions, generality of interfaces, independent deployment of components, and intermediary components to reduce interaction latency, enforce security, and encapsulate legacy systems.

In this style, every piece of information is a resource, and resources are addressed using a URI, typically links on the Web. Each unique URI refers to a representation of some object or resource that only exists on the server. The resources are acted upon by using a set of simple, uniform and well-defined operations or methods.

Operations on a REST resource follow the HTTP verbs used on the web. You can get the contents of a resource using GET, update it with PUT or PATCH, create a new resource with POST or delete it with DELETE.

Like everything else, implementing REST is not immune to bad practices and anti-patterns, such as breaking the safety and idempotence properties of HTTP methods by tunneling updates via GET or passing all operations via POST.

Method safety and idempotence are key for a successful REST implementation, besides adopting one of the most overlooked sub-constraints of them all: hypermedia as the engine of application state, part of the uniform interface. The demo service used in this article is hypermedia driven, largely because Spring HATEOAS makes it quite straightforward. We'll come back to hypermedia driven APIs at a later time since this article is about idempotency.

Method Safety and Idempotence

Below is a table of the most common HTTP methods or verbs and their semantics:

Method	Description	Safe	Idempotent
GET	Returns a representation of a resource	Yes	Yes
PUT	Create or replace a resource with the given representation	No	Yes
DELETE	Delete an identified resource	No	Yes
POST	Create a resource with the given representation as a subordinate to an identified resource	No	No
HEAD	Same as GET but only retrieves headers and not the body	Yes	Yes
OPTIONS	Returns the methods/verbs supported by an identified resource	Yes	Yes
PATCH	Apply partial modifications to an identified resource	No	No

Safe means that a method call will have no side effects that the client is accountable for. A safe method is typically a read operation without any significance other than retrieval.

Idempotent means that the side effects of numerous identical requests are the same as for a single request. A unary operation (function) is idempotent if, whenever it is applied twice or more to any value, it gives the same result as if it were applied once, for example: abs(abs(a))=abs(a).
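The difference can be seen in a few lines of Java. This is a minimal sketch using a hypothetical in-memory store (not part of the demo service): a PUT-like operation writes to a client-chosen key and can be repeated freely, while a POST-like operation appends a new entry each time it runs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory store contrasting PUT-like and POST-like semantics.
final class IdempotencyDemo {
    private final Map<String, String> byId = new HashMap<>();
    private final List<String> appended = new ArrayList<>();

    // PUT-like: create or replace at a client-chosen key.
    // Repeating the same call leaves the store in the same state.
    void put(String id, String value) {
        byId.put(id, value);
    }

    // POST-like: append a new subordinate resource.
    // Repeating the same call adds a duplicate entry.
    void post(String value) {
        appended.add(value);
    }

    int putCount() {
        return byId.size();
    }

    int postCount() {
        return appended.size();
    }
}
```

Calling put("1", "a") twice leaves one entry behind, while calling post("a") twice leaves two, which is the duplicate side effect the rest of this article sets out to prevent.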

Why is idempotency important?

Idempotency is an important and useful design property in distributed systems because it helps to maintain consistency and integrity across integration boundaries. It also shifts that responsibility to the server rather than burdening the clients. A client should be allowed to be rather ignorant and able to send a sequence of requests multiple times over an unreliable network without worrying about multiple side effects or consistency.

Consider a scenario where a client submits a request to move funds between accounts. The request is accepted by the server and it's completed by writing to the database with a commit, which in turn emits a change event that causes other downstream systems to also act on the business event. These are the visible side effects of POSTing this particular request, the so-called post-conditions.

There's a problem here, however, when the acknowledgement to the client gets delayed or lost due to an I/O error. The client is left hanging, trying to figure out what actually happened. Did the request succeed or not?

Network errors do not always mean failure but rather absence of information. This points back to the Two Generals' Problem: it's impossible to tell the difference between a failed and a slow response over an unreliable channel. The client may decide to re-submit the request after a timeout, which in this case will cause multiple side effects, and that is not the desired outcome or a valid post-condition.

Idempotency in APIs is ultimately a quality of service provided towards clients, where the burden of ensuring a correct outcome from double processing is contained by the server.

Hypermedia also serves the purpose of complexity containment, yet in a different way. Hypermedia controls are used to guide a client throughout a series of workflow steps and thereby protecting it from business rules, domain knowledge and binding entirely to out-of-band information.

There's always a cost involved in providing idempotency unless an operation is naturally idempotent or immutable. Service idempotency ensures that if an operation is called multiple times with the same deterministic input (parameters), the post-conditions are unaffected. A read operation is naturally idempotent in most cases, but not always without side effects. A read-only GET request may result in audit log entries on the server, which under a stricter definition might not be considered idempotent.

For service idempotency however, only the post-conditions must hold true. It is not relevant for a service consumer whether a log entry is created or not on the server for auditing. The choice of the correct HTTP method signals what a service supports in terms of idempotency and safety, which is why GET is always a really bad idea for tunneling writes.

Even though the POST and PATCH methods are not idempotent by specification, they can be made idempotent by the service implementation. POST methods are pretty much "allowed" to do anything, even deleting resources.

That's it for the primer, let's now see this implemented in practice.

Idempotency in Action

Let's go through a tutorial of using idempotent POSTs from the client's point of view using only cURL, with jq for formatting.

Example Code

The code examples are available on GitHub.

Use Case

The use case is to move funds between accounts. Each account holds a current balance which must stay positive (invariant). A transfer is a single synchronous operation where funds are moved between different accounts expressed as legs. Each leg represents a single account balance update, and the total sum of all legs must equal zero. Once a transfer is completed, each leg will have a corresponding transaction describing the balance update.

The transfer operation is what we intend to make idempotent. Because the entire business operation can be expressed in a single TransferRequest, it's just a matter of making that HTTP POST controller endpoint idempotent, meaning that POSTing the same request once will have the same side-effect as posting it multiple times.

Implementation Options

To implement this, we are going to use two slightly different approaches which mainly differ in the workflow steps and operations involved. The client input and service outcome are the same in both approaches.

The options are:

Conditional POST Requests - Conditional POST requests using generated One-Time URIs
Post-once-exactly Method - Storing idempotency keys for de-duplication

Conditional POST Request with One-Time URIs

In this method we will use weak entity tags (ETag) of the accounts to generate a token which will be encoded into the URI. The generated token is only valid for the current state of the accounts targeted for the transfer, which will be used as a pre-condition for POST requests to succeed.

ETags are typically used for two purposes: caching and conditional requests. Conditional requests can be applied in an optimistic locking strategy and for idempotency. Strong ETags are commonly a cryptographic hash of the resource representation, where even the smallest change will result in a new ETag. Weak ETags are a softer version where semantic equivalence is compared. Caching can also be combined with a Last-Modified header, which is the last modified date of the resource.

The client supplies the account IDs involved in the transfer in an initial GET request, and the server returns a link with a hash representing the state of the accounts to initiate the transfer.

Request to get a URI to make a conditional request. Responds with a one-time URI and current state of resources.
Request to complete the transfer with the hash token as pre-condition. Responds with a 201 and the resources created.
Attempt to re-post the same transfer, using the now expired URI.
Responds with a pre-condition failed response since the URI is expired.

In this demo we are using a SHA-256 checksum of the current account balances and IDs. To prevent URI tampering, we could also include a digital signature in the URI. The token does not need to be stored since it can be recomputed from the current entity tags of the resources. If the hashed values do not match, it means the URI is already used.
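As a rough illustration of the idea, the sketch below derives a token from account IDs and balances with SHA-256 and re-checks it as a pre-condition. It is not the demo service's code: the class and method names are hypothetical, and balances are passed as decimal strings to keep the example free of scale and rounding concerns.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: derives a one-time token from the current account state.
final class TransferToken {
    // Hash account IDs and balances in a stable (sorted) order into a hex SHA-256 token.
    static String of(Map<Long, String> balancesById) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<Long, String> e : new TreeMap<>(balancesById).entrySet()) {
                String part = e.getKey() + "=" + e.getValue() + ";";
                sha256.update(part.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha256.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(ex);
        }
    }

    // Pre-condition: the token from the URI must match a token recomputed
    // from the current state, otherwise the URI is treated as already used.
    static boolean preconditionHolds(String tokenFromUri, Map<Long, String> currentBalances) {
        return MessageDigest.isEqual(
                tokenFromUri.getBytes(StandardCharsets.UTF_8),
                of(currentBalances).getBytes(StandardCharsets.UTF_8));
    }
}
```

One consequence of hashing only IDs and balances is that a token would match again if the accounts ever returned to the same balances, so including a version or last-modified value in the hash input may be a sturdier choice.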

Let's go through each of these steps using the demo service.

First get a transfer form template:

curl http://localhost:8090/transfer/form | jq

Form templates are used in REST APIs to pre-populate forms at the client's convenience. In this demo service we use the HAL+forms media type.

In the response there are four account legs which we are going to use (formatted below). The underscore-prefixed elements in the response are hypermedia controls that can be ignored for now.

{
  "legs": [
    {
      "id": 1,
      "amount": 10.0
    },
    {
      "id": 2,
      "amount": -10.0
    },
    {
      "id": 3,
      "amount": -15.0
    },
    {
      "id": 4,
      "amount": 15.0
    }
  ]
}

Next, sign the form by following the roach-spring:transfer-signature link rel in the previous response:

curl -v -d '{"legs":[{"id":1,"amount":10.0},{"id":2,"amount":-10.0},{"id":3,"amount":15.0},{"id":4,"amount":-15.0}]}' -H "Content-Type:application/json" -X GET http://localhost:8090/transfer/signature | jq

In the response you will find an X-transfer header which represents a hash of the current state of the accounts 1, 2, 3 and 4:

X-transfer: 0ed104255363925a54790b0e11eac725a5f66caf4d8d244421c4c53485bb1c85

Next, post the transfer form by following the roach-spring:transfer link rel:

curl -v -d '{"legs":[{"id":1,"amount":10.0},{"id":2,"amount":-10.0},{"id":3,"amount":15.0},{"id":4,"amount":-15.0}]}' -H "Content-Type:application/json" -X POST http://localhost:8090/transfer/signature/0ed104255363925a54790b0e11eac725a5f66caf4d8d244421c4c53485bb1c85 | jq

If all goes well, expect a 201 in return:

HTTP/1.1 201 Created
Date: Sun, 16 Oct 2022 07:41:09 GMT
Content-Type: application/prs.hal-forms+json
Transfer-Encoding: chunked

The rest of the response contains a resource representation of the transactions created as a result of the fund transfer (the side-effect).

The generated hash 0ed104255363925a54790b0e11eac725a5f66caf4d8d244421c4c53485bb1c85 is now considered consumed and no longer valid, so attempting to re-post the same request will fail with a 412:

HTTP/1.1 412 Precondition Failed
Date: Sun, 16 Oct 2022 07:43:16 GMT
Content-Type: application/problem+json
Transfer-Encoding: chunked

And there we have it, idempotent POSTs using generated one-time URIs without storing any keys or tokens. For a production-grade service there are a few more security considerations, like using digital signatures to prevent URI tampering, but the concept is the same.

The main drawback with this approach is that validating the pre-condition for the request is taxing. The accounts must be read from the database and a hash created in a separate GET request before a POST request is possible. The other drawback is that it's not straightforward to return the same response as the original request when the precondition fails, since neither the token nor the response is stored. Lastly, this method depends on there being some entity tags to use as a base for the hash function, like the pre-existing accounts in this example.

Post-Once-Exactly

Another solution very similar to the conditional requests method (it's also conditional) is referred to as POST Once Exactly, or POE, for which there is an expired Internet-Draft (https://datatracker.ietf.org/doc/html/draft-nottingham-http-poe-00).

The principle is to generate a token based on a timestamp, a random number or a UUID. This token is then stored and used for de-duplication by the server. This means the server must store the token for a period of time to be able to tell if a URI has been used or not. It does, however, lend itself well to tagging the original response with the token, so that when de-duplication happens, the server can return the same response but with a 200 OK code rather than 201 Created.

Request to get a URI to make a conditional request. Responds with a POE-Link header containing the idempotency token/key. This step is optional as the client can use a generated token or UUID just as well.
Request with a pre-condition to complete the transfer.
Response with a 201 code and created resources.
Attempt to re-post using expired token.
Responds either with a 200 OK and the original response payload, or with a 405 (Method Not Allowed, as in the POE draft) to signal that the token has already been used.
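The steps above can be sketched with an in-memory map standing in for the token table. This is a simplified illustration under stated assumptions, not the demo service's implementation: the class name is hypothetical, and a real service would store the token and response in the same database transaction as the business write.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical in-memory stand-in for a table of idempotency tokens and responses.
final class PostOnceStore {
    // HTTP-like status code plus response body.
    static final class Result {
        final int status;
        final String body;
        Result(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    private final Map<String, String> responsesByToken = new ConcurrentHashMap<>();

    // Runs the operation only for the first request carrying this token.
    // Repeats get the stored response back with 200 instead of 201.
    // The operation must return a non-null response body.
    Result handle(String token, Supplier<String> operation) {
        boolean[] executed = {false};
        String body = responsesByToken.computeIfAbsent(token, t -> {
            executed[0] = true;
            return operation.get();
        });
        return new Result(executed[0] ? 201 : 200, body);
    }
}
```

ConcurrentHashMap.computeIfAbsent applies the mapping function at most once per key, so two concurrent requests with the same token cannot both run the operation.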

Let's walk through this example as well (note: this deviates a bit from the expired POE spec).

First get a transfer form template:

curl http://localhost:8090/transfer/form | jq

If you look in the _links section, you will find a rel named transfer-once with a UUID token encoded into the URI:

{
  "legs": [
    {
      "id": 1,
      "amount": 0
    },
    {
      "id": 2,
      "amount": 0
    },
    {
      "id": 3,
      "amount": 0
    },
    {
      "id": 4,
      "amount": 0
    }
  ],
  "_links": {
    "roach-spring:transfer-signature": {
      "href": "http://localhost:8090/transfer/signature",
      "title": "Sign request with current account states"
    },
    "roach-spring:transfer-once": {
      "href": "http://localhost:8090/transfer/07f18b6d-9bfd-4a38-af0e-781f21963fcf",
      "title": "Submit transfer request using POE tag"
    },
    "curies": [
      {
        "href": "http://localhost:8090/rels/{rel}",
        "name": "roach-spring",
        "templated": true
      }
    ]
  },
  "_templates": {
    "default": {
      "method": "POST",
      "properties": [
        {
          "name": "legs",
          "readOnly": true
        }
      ],
      "target": "http://localhost:8090/transfer/07f18b6d-9bfd-4a38-af0e-781f21963fcf"
    }
  }
}

Next, we will use the following transfer amounts with a zero sum:

{
  "legs": [
    {
      "id": 1,
      "amount": 10.0
    },
    {
      "id": 2,
      "amount": -10.0
    },
    {
      "id": 3,
      "amount": -15.0
    },
    {
      "id": 4,
      "amount": 15.0
    }
  ]
}

Let's follow the roach-spring:transfer-once link rel in the previous response:

curl -v -d '{"legs":[{"id":1,"amount":10.0},{"id":2,"amount":-10.0},{"id":3,"amount":15.0},{"id":4,"amount":-15.0}]}' -H "Content-Type:application/json" -X POST http://localhost:8090/transfer/07f18b6d-9bfd-4a38-af0e-781f21963fcf | jq

If all goes well, expect a 201 in return:

HTTP/1.1 201 Created
Date: Sun, 16 Oct 2022 14:58:38 GMT
POE-Link: 07f18b6d-9bfd-4a38-af0e-781f21963fcf
Content-Type: application/prs.hal-forms+json
Transfer-Encoding: chunked

In the response, you will find a POE-Link header which represents the UUID token used as idempotency key:

POE-Link: 07f18b6d-9bfd-4a38-af0e-781f21963fcf

The rest of the response is a resource representation of the transactions created as a result of the transfer (the side-effect). The generated token 07f18b6d-9bfd-4a38-af0e-781f21963fcf is considered consumed, so attempting to re-post the same request will return a 200 OK to signal deduplication:

HTTP/1.1 200 OK
Date: Sun, 16 Oct 2022 15:01:07 GMT
POE-Link: 07f18b6d-9bfd-4a38-af0e-781f21963fcf
Content-Type: application/prs.hal-forms+json
Transfer-Encoding: chunked

In addition, the same response used for the original request will be returned in the body.

The main drawback with this approach is that the tokens must be stored along with the response payloads, either with a retention period or indefinitely.

Implementation Notes

The demo application is a pretty typical Spring Boot application with a hypermedia/REST API.

It uses the following stack:

Spring Boot with Jetty
Spring Data JPA and Hibernate with:
- Custom JSONB user type
Spring HATEOAS
Flyway
CockroachDB with:
- JSONB for storing response bodies
- TTLs to expire POE tags

The schema used:

https://gist.github.com/kai-niemi/f011d83e9ca46a848afc280ad8e98241

The key CockroachDB features supporting our idempotency implementation are JSONB, used for storing the responses alongside the POE tags, and row-level TTL, used to clean out tags after 5 minutes. Effectively, this means the idempotency guarantee lasts for 5 minutes.
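What a 5-minute retention means for clients can be shown with a small in-memory model. This sketch only mimics the effect of expiring tags; the class name is hypothetical, timestamps are passed in explicitly, and nothing here reflects how CockroachDB's TTL job works internally.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of idempotency tags that expire after a TTL.
final class ExpiringTokenStore {
    private final Duration ttl;
    private final Map<String, Instant> expiresAt = new HashMap<>();

    ExpiringTokenStore(Duration ttl) {
        this.ttl = ttl;
    }

    // Returns true if the token is newly registered at 'now',
    // false if an unexpired registration exists (a duplicate to de-duplicate).
    synchronized boolean register(String token, Instant now) {
        Instant expiry = expiresAt.get(token);
        if (expiry != null && expiry.isAfter(now)) {
            return false;
        }
        expiresAt.put(token, now.plus(ttl));
        return true;
    }
}
```

A retry that arrives within the TTL is recognized as a duplicate, while the same token arriving after expiry is treated as a brand new request, so client retry timeouts should stay well inside the retention period.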

Conclusions

Idempotency is an important design property for REST APIs. We explored two implementation options for idempotent POST methods and demonstrated the pros and cons of each:

Conditional Requests
- Pros: no token storage
- Cons: read before write + hashing + signing
Post-once-exactly
- Pros: generated idempotency key w/o client involvement, retention of response bodies
- Cons: token and response storage

Connection pooling with Spring Boot and CockroachDB

Kai Niemi — Fri, 14 Oct 2022 14:00:42 GMT

Overview

Hikari is a battle-proven, lightweight, high performance connection pool library for Java. It's also the default connection pool in Spring Boot. This article will dive into some configuration settings that are relevant for CockroachDB.

Why a pool?

Connection pooling is fundamental for high performance since opening and closing database connections are expensive operations. Connections are opened and closed for each transaction, whether it's an implicit (auto-commit) or explicit (begin+commit/rollback) transaction. Since transactional SQL databases should strive for short-lived transactions, this overhead would be significant unless using a common technique to manage expensive resources: resource pooling.

How does pooling work?

Whenever you open a JDBC connection through a pooled javax.sql.DataSource interface, it's actually claimed from a pool of pre-opened connections and lent to the calling thread. When you close the connection (wrapped in a proxy), it's returned to the pool rather than actually being closed.

If you happen to drain or exhaust the pool of available connections, the calling thread will have to wait until a connection becomes available, which offers a simple back-pressure mechanism to control resource usage. If connections become invalid, the pool will do housekeeping and backfill with new valid connections. In addition, idle connections in the pool will have a maximum lifetime until they are closed and the pool gets backfilled.

There are different settings for tweaking the pool behavior for different workload characteristics, so let's look into that next.
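The borrow, return and wait behavior can be sketched with a blocking queue. This is a toy model with hypothetical names, not how HikariCP is implemented; it leaves out validation, housekeeping and maximum lifetime, and it expects at least one pre-opened resource.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy fixed-size resource pool; T stands in for a pre-opened connection.
final class SimplePool<T> {
    private final BlockingQueue<T> idle;

    // The list must contain at least one pre-opened resource.
    SimplePool(List<T> preOpened) {
        this.idle = new ArrayBlockingQueue<>(preOpened.size(), false, preOpened);
    }

    // Waits up to timeoutMillis for an idle resource (simple back-pressure),
    // comparable in spirit to a connection timeout.
    T borrow(long timeoutMillis) {
        try {
            T resource = idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (resource == null) {
                throw new IllegalStateException("Pool exhausted, timed out waiting");
            }
            return resource;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
    }

    // "Closing" a pooled resource just hands it back to the pool.
    void giveBack(T resource) {
        idle.offer(resource);
    }
}
```

With a pool of two resources, a third borrow blocks until one of the first two is given back or the timeout elapses.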

Configuring Hikari

HikariCP is included automatically with Spring Boot, so there's no extra Maven configuration needed. You can however override the dependency to use a more recent version.

Maven dependency

<dependency>
    <groupId>com.zaxxer</groupId>
    <artifactId>HikariCP</artifactId>
    <version>5.0.1</version>
</dependency>

Configuration Parameters

In Spring, the Hikari configuration settings are located under spring.datasource.hikari.*.

The default values are optimized for short-lived, high-frequency transactions that benefit from a fixed-size pool. These settings are often good enough, but sometimes you may want to tweak things, and then it's good to know what the different knobs mean.

These are the most important ones, as described in the Hikari GitHub repo:

spring.datasource.hikari.autoCommit

This property controls the default auto-commit behavior of connections returned from the pool. It is a boolean value. Default: true

If you change this to false, then you should also set hibernate.connection.provider_disables_autocommit to true. This tells Hibernate that auto-commit is already disabled when a connection is acquired, and some operations can be avoided for performance.

One reason you would want to do this is when always using explicit transactions through @Transactional. If you also use many implicit, read-only transactions, then it's better to stick with the default.

spring.datasource.hikari.connectionTimeout

This property controls the maximum number of milliseconds that a client will wait for a connection from the pool. If this time is exceeded without a connection becoming available, a SQLException will be thrown. Default: 30000 (30 seconds)

A shorter timeout is also possible, like 10 seconds.

spring.datasource.hikari.idleTimeout

This property controls the maximum amount of time that a connection is allowed to sit idle in the pool. A value of 0 means that idle connections are never removed from the pool. The minimum allowed value is 10000ms (10 seconds). Default: 600000 (10 minutes)

This only applies if minimumIdle is less than maximumPoolSize, when it's not a fixed-size pool.

spring.datasource.hikari.keepaliveTime

This property controls how frequently HikariCP will attempt to keep a connection alive, in order to prevent it from being timed out by the database or network infrastructure. This value must be less than the maxLifetime value. A keepalive will only occur on an idle connection. The minimum allowed value is 30000ms (30 seconds), but a value in the range of minutes is most desirable. Default: 0 (disabled)

Setting this to a value greater than or equal to 30 seconds will make the pool periodically call the JDBC4 connection method isValid. This method in turn is implemented in the PostgreSQL JDBC driver by passing an empty statement.

It's advised also to align the keep alive time with the load balancer's TCP client keep-alive timeout. In HAProxy for example, a timeout of 5min is common (timeout client).

spring.datasource.hikari.maxLifetime

This property controls the maximum lifetime of a connection in the pool. An in-use connection will never be retired, only when it is closed will it then be removed. The minimum allowed value is 30000ms (30 seconds). Default: 1800000 (30 minutes)

spring.datasource.hikari.minimumIdle

This property controls the minimum number of idle connections that HikariCP tries to maintain in the pool. If the idle connections dip below this value and total connections in the pool are less than maximumPoolSize, HikariCP will make a best effort to add additional connections quickly and efficiently. Default: same as maximumPoolSize

For a bursty workload, this value should be set to the same value as maximumPoolSize to form a fixed-sized connection pool.

spring.datasource.hikari.maximumPoolSize

This property controls the maximum size that the pool is allowed to reach, including both idle and in-use connections. Basically this value will determine the maximum number of actual connections to the database backend. Default: 10

For a bursty workload, this value should be set to the same value as minimumIdle to form a fixed-sized connection pool.

The maximumPoolSize should reflect the total number of vCPUs for the CockroachDB cluster multiplied by 4 and divided by number of pool instances. The formula is cluster_total_vcpus * 4 / num_pool_instances.

For example, a minimum CockroachDB cluster size of 3 nodes x 4 vCPUs would yield 12 * 4 = 48 with a single app instance / pool. If there are instead 4 connection pools, then it's 12 * 4 / 4 = 12 per pool.

This is just a rule of thumb assuming that each VM and connection pool is evenly utilized. Ideally, the total number of active connections shouldn't exceed 4 times the total vCPU count of the CockroachDB cluster.
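The rule of thumb is simple enough to express as a helper. The class and method names below are hypothetical, and integer division rounds down, which errs on the side of fewer connections.

```java
// Hypothetical helper for the rule of thumb:
// cluster_total_vcpus * 4 / num_pool_instances
final class PoolSizing {
    static int maximumPoolSize(int nodes, int vcpusPerNode, int poolInstances) {
        if (nodes < 1 || vcpusPerNode < 1 || poolInstances < 1) {
            throw new IllegalArgumentException("All arguments must be >= 1");
        }
        int clusterTotalVcpus = nodes * vcpusPerNode;
        // Integer division rounds down; never return less than one connection.
        return Math.max(1, clusterTotalVcpus * 4 / poolInstances);
    }
}
```

For the 3 node x 4 vCPU example, maximumPoolSize(3, 4, 1) gives 48 and maximumPoolSize(3, 4, 4) gives 12.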

spring.datasource.hikari.poolName

This property represents a user-defined name for the connection pool and appears mainly in logging and JMX management consoles to identify pools and pool configurations. Default: auto-generated.

YAML Configuration

Example configuration for a Spring Boot application.yml:

spring:
  datasource:
    hikari:
      maximum-pool-size: 12
      minimum-idle: 12
      max-lifetime: 1800000
      connection-timeout: 10000

See common application properties for more details.

Programmatic Configuration

It's also possible to configure Hikari programmatically, perhaps in combination with the YAML and only override values if they need more dynamic settings.

    @Bean
    public DataSourceProperties dataSourceProperties() {
        return new DataSourceProperties();
    }

    @Bean
    @ConfigurationProperties("spring.datasource.hikari")
    public HikariDataSource hikariDataSource() {
        HikariDataSource ds = dataSourceProperties()
                .initializeDataSourceBuilder()
                .type(HikariDataSource.class)
                .build();
        // Configured via application.yml and CLI override
        ds.setMaximumPoolSize(50);
        ds.setMinimumIdle(25);
        // Applies if min idle < max pool size
        ds.setKeepaliveTime(60000);
        ds.setMaxLifetime(1800000);
        ds.setConnectionTimeout(10000);
        ds.setPoolName("spring-boot-pooling");
        // Paired with Environment.CONNECTION_PROVIDER_DISABLES_AUTOCOMMIT=true
        ds.setAutoCommit(false);
        // Batch inserts (PSQL JDBC driver specific, case-sensitive)
        ds.addDataSourceProperty("reWriteBatchedInserts", "true");
        // For observability in DB console
        ds.addDataSourceProperty("application_name", "Spring Boot Pooling");
        return ds;
    }

CLI Configuration

When using the YAML approach, it's easy to override the default settings through the CLI:

java -jar app.jar \
--spring.datasource.hikari.maximum-pool-size=45 \
--spring.datasource.hikari.minimum-idle=25 \
--spring.datasource.hikari.max-lifetime=1800000 \
"$@"

Explicit vs Implicit Transactions

When dealing with both implicit and explicit transactions in the same application, it's better to stick with the default autoCommit=true connection setting. All read-only transactions will then be implicit rather than explicit and can benefit from CockroachDB's server-side retries and time-travel queries.

Implicit transaction example

    @Transactional(propagation = Propagation.NOT_SUPPORTED)
    public Long sumTotalInventory() {
        Assert.isTrue(!TransactionSynchronizationManager.isActualTransactionActive(), "Tx active");
        return productRepository.sumTotalInventory();
    }

Hint: When using JDBC, you can also execute writes in implicit transactions.

Explicit transaction example

    @Transactional(propagation = Propagation.REQUIRES_NEW, readOnly = true)
    public ProductEntity findById(UUID id) {
        Assert.isTrue(TransactionSynchronizationManager.isActualTransactionActive(), "Tx not active");
        return productRepository.findById(id).orElseThrow(
                () -> new IllegalArgumentException("No such product: " + id));
    }

Demo

The following sample application on GitHub is using Spring Boot with HikariCP.

It provides a few REST API endpoints for managing products, available via http://localhost:8090/. It also exposes Spring actuator endpoints that can be used to monitor the Hikari pool stats. The easiest way to interact with the API is to use cURL.

curl -s http://localhost:8090/actuator/metrics | jq
{
  "names": [
    "application.ready.time",
    "application.started.time",
    "disk.free",
    "disk.total",
    "executor.active",
    "executor.completed",
    "executor.pool.core",
    "executor.pool.max",
    "executor.pool.size",
    "executor.queue.remaining",
    "executor.queued",
    "hikaricp.connections",
    "hikaricp.connections.acquire",
    "hikaricp.connections.active",
    "hikaricp.connections.creation",
    "hikaricp.connections.idle",
    "hikaricp.connections.max",
    "hikaricp.connections.min",
    "hikaricp.connections.pending",
    "hikaricp.connections.timeout",
    "hikaricp.connections.usage",
    "http.server.requests",
    "jdbc.connections.max",
    "jdbc.connections.min",
    "jetty.connections.bytes.in",
    "jetty.connections.bytes.out",
    "jetty.connections.current",
    "jetty.connections.max",
    "jetty.connections.messages.in",
    "jetty.connections.messages.out",
    "jetty.connections.request",
    "jetty.threads.busy",
    "jetty.threads.config.max",
    "jetty.threads.config.min",
    "jetty.threads.current",
    "jetty.threads.idle",
    "jetty.threads.jobs",
    "jvm.buffer.count",
    "jvm.buffer.memory.used",
    "jvm.buffer.total.capacity",
    "jvm.classes.loaded",
    "jvm.classes.unloaded",
    "jvm.gc.live.data.size",
    "jvm.gc.max.data.size",
    "jvm.gc.memory.allocated",
    "jvm.gc.memory.promoted",
    "jvm.gc.overhead",
    "jvm.gc.pause",
    "jvm.memory.committed",
    "jvm.memory.max",
    "jvm.memory.usage.after.gc",
    "jvm.memory.used",
    "jvm.threads.daemon",
    "jvm.threads.live",
    "jvm.threads.peak",
    "jvm.threads.states",
    "logback.events",
    "process.cpu.usage",
    "process.files.max",
    "process.files.open",
    "process.start.time",
    "process.uptime",
    "spring.data.repository.invocations",
    "system.cpu.count",
    "system.cpu.usage",
    "system.load.average.1m"
  ]
}

To zoom in on one of the Hikari metrics:

curl -s http://localhost:8090/actuator/metrics/hikaricp.connections.active | jq

Hint: install the JSON processor (jq) via Homebrew:

brew install jq

Conclusion

In this article, we configured the Hikari connection pool DataSource implementation in a Spring Boot application. We also learned about the main configuration parameters and how to optimize for different transaction patterns.

Bulk Update/Upserts with Spring Data JDBC

Kai Niemi — Thu, 22 Sep 2022 10:36:58 GMT

In a previous post, Batch Statements with Spring Boot and Hibernate, we used the PostgreSQL JDBC driver's reWriteBatchedInserts setting to batch INSERT statements for better performance (about 30%). This driver-level "rewrite", however, only works for INSERTs, which leads to the question: is it possible to do a similar thing for UPDATE or INSERT on CONFLICT, aka UPSERT statements? Let's find out.

Example Code

The code examples in this post are available on GitHub.

Introduction

Batch statements have a big performance impact since they reduce the number of roundtrips to the database. When creating records using JPA and Hibernate, we can use the Hibernate batch size setting to enable batch INSERTs. When using plain JDBC, we can use plain batch update statements. The PostgreSQL JDBC driver also requires setting reWriteBatchedInserts=true to translate batched INSERTs to multi-value inserts.

To perform bulk/batch UPDATEs or UPSERTs similarly, we can't rely on the JPA provider or JDBC driver to do any magic for us. Similar to INSERTs without rewrites, using the JDBC-prepared statement batch methods (addBatch/executeBatch) still means passing singleton statements over the wire.

Solution

One solution for UPDATEs is to use SQL values in bulk format and pass individual statements in batches of array values. The bulk array approach also works for INSERT on CONFLICT, aka UPSERTs.

Update Example

Rather than using JPA native queries, let's use JDBC through the Spring Data JDBC abstraction for simplicity. As always, it's important to use prepared statements with placeholders and parameter binding.

UPDATE products SET inventory=data_table.new_inventory, price=data_table.new_price FROM (select unnest(?) as id, unnest(?) as new_inventory, unnest(?) as new_price) as data_table WHERE products.id=data_table.id

The ARRAY type works well in JDBC for parameter binding of List/Collection values. You just create one ordered list collection for each of the statement bind parameters populated with the values. If you have a very large collection, then you can either use pagination queries to narrow things or something like chunkedStream() below to split the stream into chunks matching the appropriate batch size.

Code example:

    private static <T> Stream<List<T>> chunkedStream(Stream<T> stream, int chunkSize) {
        AtomicInteger idx = new AtomicInteger();
        return stream.collect(Collectors.groupingBy(x -> idx.getAndIncrement() / chunkSize))
                .values().stream();
    }
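As an aside, the chunking step is easy to check on its own. The sketch below is not from the original post: it is an alternative that slices a List with subList, so chunk order never depends on map iteration order. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ChunkingSketch {
    // Splits a list into consecutive chunks of at most chunkSize elements,
    // preserving the original element order.
    static <T> List<List<T>> chunked(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            // Copy the subList view so each chunk is independent of the source list
            chunks.add(new ArrayList<>(items.subList(from, to)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> ids = IntStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());
        // Print each chunk with its size
        chunked(ids, 4).forEach(chunk -> System.out.println(chunk.size() + " " + chunk));
    }
}
```

Either way, the result is a sequence of batch-sized lists, where only the last one may be shorter.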

Let's put this into the context of a test method:

    @Order(2)
    @ParameterizedTest
    @ValueSource(ints = {16, 32, 64, 128, 256, 512, 768, 1024})
    public void whenUpdatingProductsUsingValues_thenObserveBatchUpdates(int batchSize) {
        Assertions.assertFalse(TransactionSynchronizationManager.isActualTransactionActive(), "TX active");

        logger.info("Finding all products..");
        Stream<List<Product>> chunked = chunkedStream(productRepository.findAll().stream(), batchSize);

        logger.info("Updating products in batches of {}", batchSize);

        // This does send a single statement batch over the wire
        chunked.forEach(chunk -> {
            transactionTemplate.executeWithoutResult(transactionStatus -> {
                int rows = jdbcTemplate.update(
                        "UPDATE products SET inventory=data_table.new_inventory, price=data_table.new_price "
                                + "FROM "
                                + "(select unnest(?) as id, unnest(?) as new_inventory, unnest(?) as new_price) as data_table "
                                + "WHERE products.id=data_table.id",
                        ps -> {
                            List<Integer> qty = new ArrayList<>();
                            List<BigDecimal> price = new ArrayList<>();
                            List<UUID> ids = new ArrayList<>();
                            chunk.forEach(product -> {
                                qty.add(product.addInventoryQuantity(1));
                                price.add(product.getPrice().add(new BigDecimal("1.00")));
                                ids.add(product.getId());
                            });
                            ps.setArray(1, ps.getConnection()
                                    .createArrayOf("UUID", ids.toArray()));
                            ps.setArray(2, ps.getConnection()
                                    .createArrayOf("BIGINT", qty.toArray()));
                            ps.setArray(3, ps.getConnection()
                                    .createArrayOf("DECIMAL", price.toArray()));
                        });
                Assertions.assertEquals(chunk.size(), rows);
            });
        });
    }
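The row-to-column step inside the statement setter can be looked at in isolation. The following is a minimal sketch with a hypothetical Row class standing in for the product entity (the names Row, Columns and toColumns are assumptions, not part of the example project). It turns a chunk of rows into one array per bind parameter, with position i in every array belonging to the same row.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

final class ColumnArraysSketch {
    // Hypothetical row type standing in for the product entity
    static final class Row {
        final UUID id;
        final int inventory;
        final BigDecimal price;

        Row(UUID id, int inventory, BigDecimal price) {
            this.id = id;
            this.inventory = inventory;
            this.price = price;
        }
    }

    // One array per bind parameter; position i in every array belongs to row i
    static final class Columns {
        final Object[] ids;
        final Object[] inventory;
        final Object[] prices;

        Columns(int size) {
            ids = new Object[size];
            inventory = new Object[size];
            prices = new Object[size];
        }
    }

    // Transposes a list of rows into per-column arrays
    static Columns toColumns(List<Row> rows) {
        Columns columns = new Columns(rows.size());
        for (int i = 0; i < rows.size(); i++) {
            Row row = rows.get(i);
            columns.ids[i] = row.id;
            columns.inventory[i] = row.inventory;
            columns.prices[i] = row.price;
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(UUID.randomUUID(), 10, new BigDecimal("9.99")),
                new Row(UUID.randomUUID(), 20, new BigDecimal("19.99")));
        Columns columns = toColumns(rows);
        System.out.println(Arrays.toString(columns.inventory) + " " + Arrays.toString(columns.prices));
    }
}
```

Each array would then be bound with createArrayOf and setArray as in the test method. Keeping the arrays aligned matters because the statement pairs the unnest values by position.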

Upsert Example

Let's use the same concept for bulk UPSERTs:

INSERT INTO products (id,inventory,price,name,sku)
select unnest(?) as id,
       unnest(?) as inventory,
       unnest(?) as price,
       unnest(?) as name,
       unnest(?) as sku
ON CONFLICT (id) do nothing

Test code example:

    @Order(4)
    @ParameterizedTest
    @ValueSource(ints = {16, 32, 64, 128, 256, 512, 768, 1024})
    public void whenUpsertingProducts_thenObserveBulkUpdates(int batchSize) {
    ...
        transactionTemplate.executeWithoutResult(transactionStatus -> {
            int rows = jdbcTemplate.update(
                    "INSERT INTO products (id,inventory,price,name,sku) "
                            + "select unnest(?) as id, "
                            + "       unnest(?) as inventory, "
                            + "       unnest(?) as price, "
                            + "       unnest(?) as name, "
                            + "       unnest(?) as sku "
                            + "ON CONFLICT (id) do nothing",
                    ps -> {
                        List<Integer> qty = new ArrayList<>();
                        List<BigDecimal> price = new ArrayList<>();
                        List<UUID> ids = new ArrayList<>();
                        List<String> name = new ArrayList<>();
                        List<String> sku = new ArrayList<>();
                        products.forEach(product -> {
                            qty.add(product.getInventory());
                            price.add(product.getPrice());
                            ids.add(product.getId());
                            name.add(product.getName());
                            sku.add(product.getSku());
                        });
                        ps.setArray(1, ps.getConnection()
                                .createArrayOf("UUID", ids.toArray()));
                        ps.setArray(2, ps.getConnection()
                                .createArrayOf("BIGINT", qty.toArray()));
                        ps.setArray(3, ps.getConnection()
                                .createArrayOf("DECIMAL", price.toArray()));
                        ps.setArray(4, ps.getConnection()
                                .createArrayOf("VARCHAR", name.toArray()));
                        ps.setArray(5, ps.getConnection()
                                .createArrayOf("VARCHAR", sku.toArray()));
                    });
        });
    }
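The statement above skips rows whose id already exists. If existing rows should be overwritten instead, PostgreSQL and CockroachDB also accept ON CONFLICT ... DO UPDATE SET with the excluded pseudo-table. This variant is not part of the original example; the sketch below only builds the statement text, and the helper name bulkUpsert is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class BulkUpsertSql {
    // Builds an array-based upsert that overwrites the given columns on key conflict.
    // Table and column names are concatenated as-is, so they must come from
    // trusted code and never from user input.
    static String bulkUpsert(String table, String keyColumn, List<String> columns) {
        String insertColumns = keyColumn + "," + String.join(",", columns);
        String selectList = "unnest(?) as " + keyColumn + ", "
                + columns.stream()
                        .map(c -> "unnest(?) as " + c)
                        .collect(Collectors.joining(", "));
        String updateList = columns.stream()
                .map(c -> c + "=excluded." + c)
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + insertColumns + ") "
                + "select " + selectList + " "
                + "ON CONFLICT (" + keyColumn + ") DO UPDATE SET " + updateList;
    }

    public static void main(String[] args) {
        System.out.println(bulkUpsert("products", "id", Arrays.asList("inventory", "price")));
    }
}
```

One thing to watch with the DO UPDATE form: if the same key appears twice within one batch of arrays, the database rejects the statement because it cannot affect the same row twice, so each chunk should hold distinct ids.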

Performance

In a simple performance test updating 50,000 products, bulk updates were 5x faster than regular prepared statement batch updates.

Conclusion

We looked at how to provide an equivalent for batch INSERTs with rewrites in the PostgreSQL JDBC driver for UPDATEs and UPSERTs. Using the bulk approach and array values can yield a 5x performance improvement.

JPA Best Practices - Identity Generators

Kai Niemi — Tue, 13 Sep 2022 07:14:34 GMT

Introduction

This article is part four of a series on data access best practices when using JPA and CockroachDB. The goal is to help reduce the impact of workload contention and to optimize performance. Although most of the principles are fairly framework and database agnostic, it's mainly targeting the following technology stack:

JPA 2.x and Hibernate 5.+
CockroachDB v22+
Spring Boot 2.7.+
Spring Data JPA
Spring AOP with AspectJ

Example Code

The code examples are available on GitHub.

Chapter 4: Identity Generators

This chapter goes into more detail around JPA/Hibernate identity generators also covered in part III of this series.

4.1 How to use ID generators

Problem

You want to know how to use different primary key generators with JPA/Hibernate in the context of CockroachDB.

Solution

The JPA specification offers four different primary key generation strategies, defined in the javax.persistence.GenerationType enum:

AUTO - The persistence provider attempts to figure out the best strategy based on database dialect and key type (default).
IDENTITY - The persistence provider depends on a database-generated ID.
SEQUENCE - The persistence provider depends on a database sequence.
TABLE - Legacy method to simulate sequences.

One alternative to primary key generators is to assign the IDs to the entities directly, like UUIDs.
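As a minimal sketch of the assigned-ID alternative, the plain class below creates its own UUID at construction time, before any INSERT is issued. It stands in for a JPA entity; the annotations are left out so that it runs without a persistence provider, and the Account class name is only illustrative.

```java
import java.util.UUID;

final class AssignedIdSketch {
    // Plain class standing in for a JPA entity (annotations omitted)
    static final class Account {
        // Assigned by the application, so no database roundtrip is needed to learn the ID
        private final UUID id = UUID.randomUUID();

        UUID getId() {
            return id;
        }
    }

    public static void main(String[] args) {
        Account account = new Account();
        // UUID.randomUUID() produces version 4 (random) UUIDs
        System.out.println(account.getId() + " version=" + account.getId().version());
    }
}
```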

AUTO

This strategy is the default. It allows the persistence provider to choose a strategy that lands in either SEQUENCE or IDENTITY depending on database dialect and ID column data type.

Using a UUID type for example maps to Hibernate's internal UUIDv4 generator:

@Id
@Column(updatable = false, nullable = false)
@GeneratedValue(strategy = GenerationType.AUTO)
private UUID id;

The AUTO example is therefore equal to:

@Id
@Column(name = "id", updatable = false, insertable = true)
@GeneratedValue(generator = "UUID")
@GenericGenerator(
        name = "UUID",
        strategy = "org.hibernate.id.UUIDGenerator")
private UUID id;

In contrast, a numeric ID data type maps to SEQUENCE (see below for details):

@Id
@Column(updatable = false, nullable = false)
@GeneratedValue(strategy = GenerationType.AUTO)
private Long id;

Identity

Caution: Using this strategy with Hibernate will disable batch INSERTs which severely impacts performance.

This strategy relies on a database ID generator method to create the IDs. The method name can be provided through the Hibernate dialect's org.hibernate.dialect.identity.IdentityColumnSupport#getIdentityInsertString method, in which case it will be injected into the SQL statements. If the return value is null then Hibernate will rely on the column default defined in the schema.

create table account
(
    id             int          not null default unordered_unique_rowid(),
    balance        float        not null,
    primary key (id)
);

In CockroachDB, this could be a globally unique, ordered 64-bit integer provided via unique_rowid() or an unordered int via unordered_unique_rowid(). The latter is slightly better for key/range distribution since it doesn't depend on ordering that incurs write hotspots similar to sequences.
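To get a feel for why ordering matters, here is a toy model, not CockroachDB's actual range splitting or the real unordered_unique_rowid() algorithm. It assumes a key space cut into 16 fixed ranges by the top bits of the key and counts how many ranges a run of 1,000 generated keys touches. Ordered keys are modelled as 1, 2, 3, ... and scattered keys as the same counter with its bits reversed; both choices are assumptions made purely for illustration.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.LongUnaryOperator;

final class KeyDistributionSketch {
    static final int RANGE_BITS = 4; // toy key space split into 16 equal ranges

    // Maps a non-negative 63-bit key to one of 16 ranges by its top 4 bits
    static int rangeOf(long key) {
        return (int) (key >>> (63 - RANGE_BITS));
    }

    // Counts how many distinct ranges the first n generated keys fall into
    static int distinctRanges(int n, LongUnaryOperator keyFor) {
        Set<Integer> ranges = new TreeSet<>();
        for (long i = 1; i <= n; i++) {
            ranges.add(rangeOf(keyFor.applyAsLong(i)));
        }
        return ranges.size();
    }

    public static void main(String[] args) {
        // Ordered keys: consecutive values stay next to each other in the key space
        System.out.println("ordered:   " + distinctRanges(1000, i -> i));
        // Scattered keys: reverse the bits, then shift out the sign bit
        System.out.println("scattered: " + distinctRanges(1000, i -> Long.reverse(i) >>> 1));
    }
}
```

In this model all ordered keys land in a single range while the scattered keys spread over all 16, which is the intuition behind preferring unordered IDs for write-heavy tables.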

If the primary key is a UUID, then this strategy will not work since that maps to the pg-uuid type which is not compatible. There is still a way to use database-generated UUIDs rather than JVM generated if that would be preferred. The JDK however uses the same UUID specification (v4) that CockroachDB uses, so there's not much point in doing so. Still, here's how:

@Entity
@Table(name = "account_uuid_db")
@TypeDefs({@TypeDef(name = "crdb-uuid", typeClass = CockroachUUIDType.class)})
@DynamicInsert
@DynamicUpdate
public class DatabaseUUIDAccountEntity extends AccountEntity<UUID> {
    @Id
    @Column(updatable = false, nullable = false)
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Type(type = "crdb-uuid", parameters = @Parameter(name = "column", value = "id"))
    private UUID id;

    @Override
    public UUID getId() {
        return id;
    }
}

In the entity above, we have declared a custom UUID type for the ID column which maps to the standard UUID type. The type implementation is quite straightforward:

public class CockroachUUIDType extends PostgresUUIDType
        implements ResultSetIdentifierConsumer, ParameterizedType {
    private String idColumnName = "id";

    @Override
    public String getName() {
        return "crdb-uuid";
    }

    @Override
    public void setParameterValues(Properties params) {
        idColumnName = params.getProperty("column");
    }

    @Override
    public UUID consumeIdentifier(ResultSet resultSet) throws IdentifierGenerationException {
        try {
            return nullSafeGet(resultSet, idColumnName, wrapperOptions());
        } catch (SQLException e) {
            throw new IdentifierGenerationException("Error converting type", e);
        }
    }

    private WrapperOptions wrapperOptions() {
        return new WrapperOptions() {
            @Override
            public boolean useStreamForLobBinding() {
                return false;
            }

            @Override
            public LobCreator getLobCreator() {
                return null;
            }

            @Override
            public SqlTypeDescriptor remapSqlTypeDescriptor(final SqlTypeDescriptor sqlTypeDescriptor) {
                return PostgresUUIDSqlTypeDescriptor.INSTANCE;
            }

            @Override
            public TimeZone getJdbcTimeZone() {
                return TimeZone.getDefault();
            }
        };
    }
}

Sequence

This strategy uses a database sequence to generate IDs. In CockroachDB, this is not recommended for optimal performance since indexing on sequential keys will cause range hotspots.

One option is to use a virtual sequence that provides unique_rowid() values, but these values are still sequential and therefore also result in hotspots. There is a feature request 87290 however to provide unordered_unique_rowid() instead which will provide for better key distribution.

Sequence strategy example:

@Id
@Column(updatable = false, nullable = false)
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "account_generator")
@SequenceGenerator(name = "account_generator", sequenceName = "account_seq")
private Long id;

In this example, we are using a sequence named account_seq with an increment of 50, which matches the default allocationSize of @SequenceGenerator, and with cached values. This can be further tailored with the @SequenceGenerator annotation.
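The point of the increment is that one sequence call reserves a whole block of IDs, so most inserts need no extra roundtrip. The sketch below is a simplified in-memory model of that idea and not Hibernate's actual optimizer, which differs in details such as which end of the block the returned value represents. The class name and the AtomicLong standing in for account_seq are assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

final class SequenceBlockSketch {
    // Stands in for the database sequence account_seq, starting at 1
    private final AtomicLong databaseSequence = new AtomicLong(1);
    private final int increment;
    private int databaseCalls;
    private long next;  // next ID to hand out
    private long limit; // exclusive upper bound of the current block

    SequenceBlockSketch(int increment) {
        this.increment = increment;
    }

    // Simulates "select nextval('account_seq')" on a sequence with INCREMENT BY n
    private long nextval() {
        databaseCalls++;
        return databaseSequence.getAndAdd(increment);
    }

    // Hands out IDs from the current block and fetches a new block when it runs out
    long nextId() {
        if (next == limit) {
            next = nextval();
            limit = next + increment;
        }
        return next++;
    }

    int databaseCalls() {
        return databaseCalls;
    }

    public static void main(String[] args) {
        SequenceBlockSketch generator = new SequenceBlockSketch(50);
        for (int i = 0; i < 8; i++) {
            generator.nextId();
        }
        System.out.println("sequence calls for 8 IDs: " + generator.databaseCalls());
    }
}
```

In this model, a batch of 8 entities costs a single sequence call, which lines up with the single nextval statement in the trace output.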

Example SQL for the sequence:

create sequence if not exists account_seq increment by 50 cache 10;

Example SQL trace when inserting a batch of 8 entities. The first statement fetches the next sequence value:

20:49:40.443 TRACE [SQL_TRACE] Name:, Connection:1006, Time:4, Success:True
Type:Prepared, Batch:False, QuerySize:1, BatchSize:0
Query:["select nextval ('account_seq')"]
Params:[()]

The second statement is the batched INSERT (notice BatchSize:8):

20:49:40.448 TRACE [SQL_TRACE] Name:, Connection:1006, Time:5, Success:True
Type:Prepared, Batch:True, QuerySize:1, BatchSize:8
Query:["insert into account_sequence (balance, closed, creation_time, currency, description, name, id) values (?, ?, ?, ?, ?, ?, ?)"]
Params:[(2911.81453383622,false,2022-09-12 20:49:40.435252,USD,yzytQ2Ea5FpwTNu3mpoOOfdTiSt7T6PQOtyLM8WfOx5oK-fRbEkIjkkOTSephTMA4eSXOWZQhTXsTYzgRfmEnw,fm6N_uu2YORFXqNOkG3z_K2qxtEBe0MGG9n-tPJcebU,48902),(772.3476591134778,false,2022-09-12 20:49:40.435259,USD,66_-C_R0f-e87GJEl5ZC0jtjOwKLf7IO9ueG_WvvFfQchHbewVGQJg_55W1TZdJ99jS-ZdOdx0Lagm4Xib921g,lldO4yKGhH24p0EufXXRbueScH0j9x_dwyMkLjwh2H0,48903),(812.3109117127407,false,2022-09-12 20:49:40.435259,USD,GvXDU_pZ813sykrIN687gjRBxCtpvSCEzPRgmwxskKLqeOroJuMpCCHWwjmJVxoQjmIt4SdTwNPZ3MX0bsdJag,84oBbEYmNpFPOkq4mNLSmxDbNo6Z9J5fGK4MOnVq3jQ,48904),(117.05857312158308,false,2022-09-12 20:49:40.43526,USD,XvQGd-KM32EdmoOdAUP2suEeb78GbCzbcFSNg0gWs8U-86W4EM7xPVCMQfkEDl_l-2Oij42-v2hvRiJ-KTvVUA,7C_f4orhcv_BL_Tqs5h2O0QQBrrk3ZzC45X2Sul1tvM,48905),(1062.0079187656315,false,2022-09-12 20:49:40.435261,USD,vqaG7maWqE46-yKGP1Kxezj_f0Ln_Lbm3RDOZ_On2e94TWJ7HizifPen_mfO0ag3ZSep-ebqU6A3vsaXudUlHw,tQK-JKkFm-wOKEJt90a16yTQXwi3dR-uyBVEmHNgwbA,48906),(1794.0608391535081,false,2022-09-12 20:49:40.435261,USD,KgdU0UmMVjn5gJdFa_6RHt8bbUvuUSDWbExoWsopKdQ3enFNXddksovabEa1GDx0n24B4L7-hAjTFPZYkyh-7A,FRE8gbGCRfKevZRIAKAe7Ek0BzAbQ4WqJL00-udxGYE,48907),(4679.81994020478,false,2022-09-12 20:49:40.435262,USD,Z62QaX7BKGFmTVx-mFKbpYKNOn8-aMa9pVPkMD9QikyCXHi2iAwvwHyI2XqETjvXRExnrrUH9vi0iCkEOuJBnQ,erbBpAappWiZTmvOsnmBrKYhsZzb2PRMukOfaGwAxbI,48908),(1689.8465103685626,false,2022-09-12 20:49:40.435263,USD,huFjltN3XcLsD78nggj3Glw7xEL6BWNy-4xwm4fncEHZeQNew_koiC_FHtgDCCI_MOSEnz5UBRaIhaU4Oolt8g,SaTUThr5VuaVDo5zAu__B6nhZu_9OSMM6OAd7RLL3JE,48909)]

Custom

Last but not least, we can also create our very own custom ID generator. One example of this could be to use a numeric primary key with auto-generated IDs that can also be batched. Batch inserts are critical for performance if creating entities through JPA.

Remember that combining numeric IDs with IDENTITY disables batch inserts, which narrows the options to sequences, which again aren't great for performance due to hotspots.

So what are our options if we want all of the following?

Numeric primary keys
Support for JDBC batch inserts
Database generated keys
Not using a sequence

First, let's define our custom ID generator strategy:

@Id
@Column(updatable = false, nullable = false)
@GeneratedValue(generator = "custom-generator")
@GenericGenerator(name = "custom-generator",
        parameters = @Parameter(name = "batchSize", value = "64"),
        strategy = "io.roach.spring.identity.config.hibernate.CustomIDGenerator")
private Long id;

Next, let's look at the custom ID generator:

public class CustomIDGenerator implements IdentifierGenerator {
    private final Deque<Long> cachedIds = new LinkedList<>();

    private int batchSize;

    private String idQuery;

    @Override
    public void configure(Type type, Properties properties,
                          ServiceRegistry serviceRegistry) throws MappingException {
        this.batchSize = Integer.parseInt(properties.getProperty("batchSize", "32"));
        StringBuilder sb = new StringBuilder("select ");
        IntStream.rangeClosed(1, batchSize).forEach(value -> {
            sb.append("unordered_unique_rowid() as id").append(value);
            if (value < batchSize) {
                sb.append(",");
            }
        });
        this.idQuery = sb.toString();
    }

    @Override
    public Serializable generate(SharedSessionContractImplementor session, Object obj)
            throws HibernateException {
        if (cachedIds.isEmpty()) {
            Stream<Object[]> ids = session.createNativeQuery(idQuery).stream();
            ids.collect(Collectors.toList()).forEach(arr -> {
                Arrays.stream(arr).forEach(o1 -> {
                    BigInteger bi = (BigInteger) o1;
                    this.cachedIds.add(bi.longValue());
                });
            });
        }
        return cachedIds.poll();
    }
}

This strategy will fetch batches of unique, unordered IDs using the unordered_unique_rowid() method and then consume these for each ID generation method call. Ordering doesn't matter and the IDs are globally unique so caching them in each JVM instance is safe.
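To make the generated statement concrete, the query-building logic from configure() can be run standalone. The sketch below reproduces the same string with a stream collector; the class and method names are made up for illustration, and it only builds the text without touching a database.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class IdQuerySketch {
    // Builds one SELECT that returns batchSize freshly generated IDs as columns id1..idN
    static String buildIdQuery(int batchSize) {
        return IntStream.rangeClosed(1, batchSize)
                .mapToObj(i -> "unordered_unique_rowid() as id" + i)
                .collect(Collectors.joining(",", "select ", ""));
    }

    public static void main(String[] args) {
        System.out.println(buildIdQuery(3));
    }
}
```

With a batch size of 3, this yields a single-row result with three ID columns, so one roundtrip refills the cache with three IDs. With the batchSize of 64 configured above, one roundtrip covers 64 inserts.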

Table

This strategy simulates a sequence by using a custom database table. It's rarely used anymore, in particular when real sequences are available.

Recommendations

For best performance, use UUID primary keys and the AUTO generation type. The UUIDs will then be generated by the JVM and batch inserts are fully supported.

If you prefer a numeric primary key, consider either using sequences or the custom strategy above with unordered 64-bit integers that also support batch inserts.

Conclusion

In this article, we looked at primary key generation strategies for Hibernate and best practices for CockroachDB.