Using the CDC Webhook Sink in CockroachDB
Create a simple streaming data pipeline using the Change Data Capture Webhook (HTTPS) Sink
Introduction
In this article, we'll demonstrate creating a simple streaming data pipeline using a small micro-batching tool and CockroachDB's CDC Webhook sink.
Setup
Prerequisites:
CockroachDB, with a trial enterprise license (changefeeds are an enterprise feature).
Pipeline, an open-source Java tool built on top of Spring Batch
Java 17+ Runtime
Linux / macOS
CockroachDB Setup
First, create a local three-node cluster (a single node works just as well):
cockroach start --port=26257 --http-port=8080 --advertise-addr=localhost:26257 --join=localhost:26257 --insecure --store=datafiles/n1 --background
cockroach start --port=26258 --http-port=8081 --advertise-addr=localhost:26258 --join=localhost:26257 --insecure --store=datafiles/n2 --background
cockroach start --port=26259 --http-port=8082 --advertise-addr=localhost:26259 --join=localhost:26257 --insecure --store=datafiles/n3 --background
cockroach init --insecure --host=localhost:26257
Next, set up a source database called tpcc and a target database called tpcc_copy:
cockroach sql --insecure --host=localhost:26257 -e "CREATE database tpcc"
cockroach sql --insecure --host=localhost:26257 -e "CREATE database tpcc_copy"
Lastly, load a small TPC-C workload fixture (schema and data) into the source database:
cockroach workload fixtures import tpcc --warehouses=10 'postgres://root@localhost:26257?sslmode=disable'
The objective is to have a few of these tables copied and mirrored in the target database.
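To sanity-check the import, you can count rows in the tables we'll be mirroring (a quick sketch, assuming the local cluster from the setup step is still running):

```shell
# Count rows in the source tables we'll be mirroring
for t in warehouse district customer; do
  cockroach sql --insecure --host=localhost:26257 \
    -e "SELECT '$t' AS table_name, count(*) FROM tpcc.$t"
done
```

With 10 warehouses, the TPC-C scaling rules give 10 warehouses, 100 districts and 300,000 customers.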
Pipeline Setup
First, clone the repo and build it locally:
git clone git@github.com:kai-niemi/roach-pipeline.git pipeline
cd pipeline
chmod +x mvnw
./mvnw clean install
The executable jar is now available under the target folder.
Try it out with:
java -jar target/pipeline.jar --help
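The REST calls below assume the Pipeline service is up and listening on port 8090 (with the webhook endpoint on 8443), so start it in a separate terminal before continuing; a minimal sketch, assuming default settings:

```shell
# Start the Pipeline service (by default: REST API on 8090, webhook endpoint on 8443)
java -jar target/pipeline.jar
```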
Configure the Pipeline
We are now ready to create cdc2sql jobs for each TPC-C table we want streamed from the source to the target database.
We'll use Pipeline's REST API. If you remember, REST is really about following links and POSTing forms (rather than concatenating URIs). The tool provides a genuine hypermedia-driven API, but we're not going to build a smart client app for it; we'll just use cURL instead.
Generate Form Templates
First off, we fetch a form template for each table, pre-populated with SQL statements for that table. We are only using a subset of the TPC-C workload tables, but the process is the same for all tables *).
*) There's also an option to get a form bundle with all tables in one go, but it's WIP.
curl -X GET "http://localhost:8090/cdc2sql/form?table=warehouse" > warehouse-cdc2sql.json
curl -X GET "http://localhost:8090/cdc2sql/form?table=district" > district-cdc2sql.json
curl -X GET "http://localhost:8090/cdc2sql/form?table=customer" > customer-cdc2sql.json
Feel free to inspect the JSON files which should give you an idea of how the batch jobs are configured and run. At this point, we haven't started anything yet. The JSON files typically don't need much editing if the template settings are properly set (everything defaults to using localhost).
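For instance, to pretty-print one of the templates and confirm that the connection settings default to localhost (any JSON formatter will do):

```shell
# Pretty-print the warehouse form template for inspection
python3 -m json.tool warehouse-cdc2sql.json
# Quick check that the pre-populated settings point at localhost
grep -i localhost warehouse-cdc2sql.json
```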
Submit Batch Jobs
The next step is to POST the forms back, which registers the jobs and starts them. The jobs must be registered in topological order of the foreign key constraints (warehouse <- district <- customer), since we'll be creating the tables on the fly.
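If you want to verify that ordering yourself, the foreign-key relationships can be listed from the source database's information schema:

```shell
# List foreign-key constraints in the source database to confirm the dependency order
cockroach sql --insecure --host=localhost:26257 -d tpcc \
  -e "SELECT table_name, constraint_name FROM information_schema.table_constraints WHERE constraint_type = 'FOREIGN KEY'"
```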
curl -d "@warehouse-cdc2sql.json" -H "Content-Type:application/json" -X POST http://localhost:8090/cdc2sql
curl -d "@district-cdc2sql.json" -H "Content-Type:application/json" -X POST http://localhost:8090/cdc2sql
curl -d "@customer-cdc2sql.json" -H "Content-Type:application/json" -X POST http://localhost:8090/cdc2sql
Take note of the CREATE CHANGEFEED statements in the responses. We will need them for the next step, which is to configure the changefeeds.
Connect to the source database and execute the following (after substituting your own endpoint URIs):
CREATE CHANGEFEED FOR TABLE warehouse
INTO 'webhook-https://localhost:8443/cdc2sql/5803c5a2-707a-4fb1-8faf-615d95896664?insecure_tls_skip_verify=true'
WITH updated, resolved='15s';
CREATE CHANGEFEED FOR TABLE district
INTO 'webhook-https://localhost:8443/cdc2sql/5803c5a2-707a-4fb1-8faf-615d95896664?insecure_tls_skip_verify=true'
WITH updated, resolved='15s';
CREATE CHANGEFEED FOR TABLE customer
INTO 'webhook-https://localhost:8443/cdc2sql/5803c5a2-707a-4fb1-8faf-615d95896664?insecure_tls_skip_verify=true'
WITH updated, resolved='15s';
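The statements differ only in the table name and the job id (UUID) in the webhook URL path, which is taken from each job's POST response. A small shell helper can generate them; a sketch with a hypothetical function name:

```shell
# Sketch: build a CREATE CHANGEFEED statement for a given table and job id.
# The job id (UUID path segment) comes from the cdc2sql POST response.
changefeed_stmt() {
  local table="$1" job_id="$2"
  printf "CREATE CHANGEFEED FOR TABLE %s INTO 'webhook-https://localhost:8443/cdc2sql/%s?insecure_tls_skip_verify=true' WITH updated, resolved='15s';\n" \
    "$table" "$job_id"
}

changefeed_stmt warehouse 5803c5a2-707a-4fb1-8faf-615d95896664
```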
You should now see the target database start to fill up and eventually reach the same state as the source. If you also run the TPC-C workload against the source, the changes will be reflected in the target as well.
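One way to check convergence is to compare row counts between the two databases (a sketch; counts will only match once the feed has caught up):

```shell
# Compare row counts between source and target for each mirrored table
for t in warehouse district customer; do
  cockroach sql --insecure --host=localhost:26257 --format=csv \
    -e "SELECT '$t', (SELECT count(*) FROM tpcc.$t) AS source, (SELECT count(*) FROM tpcc_copy.$t) AS target"
done
```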
Conclusion
In this article, we looked at creating a simple table-level streaming data pipeline between two separate CockroachDB databases using the CDC webhook sink.