Using Spring Batch to migrate to CockroachDB

Using Spring Batch to migrate to CockroachDB

Build a simple table by table data migration flow using Spring Batch

·

5 min read

Introduction

Two previous articles covered the creation of simple data pipelines to keep systems in sync by using change data capture (CDC).

In this article, we'll use the same technique to migrate both PostgreSQL schema and data to CockroachDB in a streamlined manner by using Spring Batch. It's a batch framework wrapped in the same pipeline tool used in the previous posts. The main difference is that it's not using CDC but just batches of SQL DDL and DML statements.

Setup

Prerequisites:

  • PostgreSQL

  • CockroachDB

  • Pipeline, an open-source Java tool built on top of Spring Batch

  • Java 17+ Runtime

  • Linux / macOS

PostgreSQL Setup

First, create a sample schema in PostgreSQL with FK constraints and load it with some test data:

DDL

create table account
(
    id             int,
    balance        numeric(19, 2) not null,
    currency       varchar(64)    not null,
    name           varchar(128)   not null,

    primary key (id)
);

create table transaction
(
    id               int             not null,
    booking_date     date             null,
    primary key (id)
);

create table transaction_item
(
    transaction_id  int           not null,
    account_id      int           not null,
    amount          numeric(19, 2) not null,
    currency        varchar(64)    not null,
    running_balance numeric(19, 2) not null,
    note            varchar(255),

    primary key (transaction_id, account_id)
);

alter table transaction_item
    add constraint fk_txn_item_ref_transaction
        foreign key (transaction_id) references transaction (id);

alter table transaction_item
    add constraint fk_txn_item_ref_account
        foreign key (account_id) references account (id);

DML

insert into account (id,balance,currency,name)
select no,
       500.00 + random() * 500.00,
       'USD',
       md5(random()::text)
from generate_series(1, 10) no;

insert into transaction (id,booking_date)
select no,
       now()::date
from generate_series(1, 1000) no;

insert into transaction_item (transaction_id,account_id,amount,currency,running_balance,note)
select no,
       round(1 + random() * 9),
       500.00 + random() * 500.00,
       'USD',
       500.00 + random() * 500.00,
       'Cockroaches can eat anything'
from generate_series(1, 1000) no;

CockroachDB Setup

Create a local cluster of three nodes (or one, it doesn't matter for this article):

cockroach start --port=26257 --http-port=8080 --advertise-addr=localhost:26257 --join=localhost:26257 --insecure --store=datafiles/n1 --background

cockroach start --port=26258 --http-port=8081 --advertise-addr=localhost:26258 --join=localhost:26257 --insecure --store=datafiles/n2 --background

cockroach start --port=26259 --http-port=8082 --advertise-addr=localhost:26259 --join=localhost:26257 --insecure --store=datafiles/n3 --background

cockroach init --insecure --host=localhost:26257

Next, set up a database called crdb_test that will become populated with the PostgreSQL schema and data:

cockroach sql --insecure --host=localhost:26257 -e "CREATE database crdb_test"

Pipeline Setup

Pipeline is a tool that wraps Spring Batch and it will do all the work for us. Initially, clone the repo and build it locally:

git clone git@github.com:kai-niemi/roach-pipeline.git pipeline
cd pipeline
chmod +x mvnw
./mvnw clean install

The executable jar is now available under the target folder.

Start it up with (change URL and credentials to match your PSQL setup):

java -jar target/pipeline.jar \
--pipeline.template.source.url=jdbc:postgresql://localhost:5432/crdb_test \
--pipeline.template.source.username=postgres \
--pipeline.template.source.password=**** \
--pipeline.template.target.url=jdbc:postgresql://localhost:26257/crdb_test ?sslmode=disable

This configures the source and target datasources which are used to pre-populate batch job forms.

Configure the Pipeline

Now we are ready to copy data from PostgreSQL to CockroachDB using the sql2sql REST endpoint.

It works by first requesting a form for each table to be copied. The form will be pre-filled with information (from pipeline.template.source.*) except for the CREATE TABLE statement that you need to enter manually. The reason is that there's no support for SHOW CREATE TABLE in PostgreSQL, which is what this tool uses for introspection.

Generate Form Templates

Let's get a form template for each table in three separate REST API calls.

curl -X GET http://localhost:8090/sql2sql/form?table=account > account.json

curl -X GET http://localhost:8090/sql2sql/form?table=transaction > transaction.json

curl -X GET http://localhost:8090/sql2sql/form?table=transaction_item > transaction_item.json

Add a CREATE TABLE statement to each form JSON file. It will be used to create the tables in CockroachDB. You can copy the statements from PSQL using pg_dump:

pg_dump --dbname=crdb_test --table=account --schema-only
pg_dump --dbname=crdb_test --table=transaction --schema-only
pg_dump --dbname=crdb_test --table=transaction_item --schema-only

Then add the DDL parts to the createQuery parameter in the JSON form files. If you also add IF NOT EXISTS to createQuery, then the batch jobs will be fully repeatable.

Example:

"CREATE TABLE IF NOT EXISTS public.account (\nid integer NOT NULL,\nbalance numeric(19,2) NOT NULL,\ncurrency character varying(64) NOT NULL,\nname character varying(128) NOT NULL\n); ALTER TABLE ONLY public.account\n    ADD CONSTRAINT IF NOT EXISTS account_pkey PRIMARY KEY (id);"

Shows just the account.json file:

{
  "_links" : {
    "self" : {
      "href" : "http://localhost:8090/sql2sql/form?table=account"
    }
  },
  "table" : "account",
  "restartExecutionId" : 0,
  "sourceUrl" : "jdbc:postgresql://localhost:5432/crdb_test",
  "sourceUsername" : "postgres",
  "sourcePassword" : "****",
  "targetUrl" : "jdbc:postgresql://localhost:26257/crdb_test?sslmode=disable",
  "targetUsername" : "root",
  "targetPassword" : "",
  "concurrency" : 8,
  "chunkSize" : 32,
  "linesToSkip" : 0,
  "pageSize" : 32,
  "sortKeys" : "id ASC",
  "selectClause" : "SELECT *",
  "fromClause" : "FROM account",
  "whereClause" : "WHERE 1=1",
  "insertQuery" : "UPSERT INTO account(id,balance,currency,name) VALUES (:id,:balance,:currency,:name)",
  "createQuery" : "CREATE TABLE IF NOT EXISTS public.account (\nid integer NOT NULL,\nbalance numeric(19,2) NOT NULL,\ncurrency character varying(64) NOT NULL,\nname character varying(128) NOT NULL\n); ALTER TABLE ONLY public.account\n    ADD CONSTRAINT IF NOT EXISTS account_pkey PRIMARY KEY (id);"
}

Submit Batch Jobs

The next step is to POST the forms back, which will register and start the jobs. Because we use foreign keys in PSQL, the jobs will need to be registered in a sorted topology order based on the foreign key constraints.

To find out the order, we can simply ask the tool:

curl -X GET http://localhost:8090/datasource/source-tables | grep topologyOrder

Which tells you something like:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left    "topologyOrder" : "transaction,account,transaction_item"

All we need to do now is submit the forms in that given order:

curl -d "@transaction.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql

curl -d "@account.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql

curl -d "@transaction_item.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql

On each of these POST requests, you will get a 202 (Accepted) response with a pipeline:execution link relation and an HREF to a resource if the job was successfully submitted. A 202 only means that the request was accepted for async processing. If you follow that link URI, it takes you to the registered job with the current status.

$ curl -d "@account.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1342    0   336  100  1006   1648   4935 --:--:-- --:--:-- --:--:--  6610{
  "_links" : {
    "pipeline:execution" : {
      "href" : "http://localhost:8090/jobs/execution/future/19353fac-8578-46b0-b417-7ba5bcdd03e6"
    },
    "curies" : [ {
      "href" : "http://localhost:8090/rels/{rel}",
      "name" : "pipeline",
      "templated" : true
    } ]
  },
  "message" : "SQL2SQL Job Accepted"
}
$ curl -X GET http://localhost:8090/jobs/execution/future/19353fac-8578-46b0-b417-7ba5bcdd03e6

Once complete, you should see a complete copy of the PostgreSQL database in CockroachDB.

Conclusion

In this article, we looked at creating a simple batch-oriented SQL to SQL pipeline at the table level between PostgreSQL and CockroachDB.