Using Spring Batch to migrate to CockroachDB
Build a simple table-by-table data migration flow using Spring Batch
Introduction
Two previous articles covered the creation of simple data pipelines to keep systems in sync by using change data capture (CDC).
https://blog.cloudneutral.se/using-the-cdc-webhook-sink-in-cockroachdb
https://blog.cloudneutral.se/using-the-cdc-kafka-sink-in-cockroachdb
In this article, we'll use the same pipeline tool to migrate both the PostgreSQL schema and the data to CockroachDB in a streamlined manner, this time powered by Spring Batch, the batch framework the tool wraps. The main difference from the previous posts is that we're not using CDC, just batches of plain SQL DDL and DML statements.
Setup
Prerequisites:
PostgreSQL
Pipeline, an open-source Java tool built on top of Spring Batch
Java 17+ Runtime
Linux / macOS
PostgreSQL Setup
First, create a sample schema in PostgreSQL with FK constraints and load it with some test data:
DDL
create table account
(
id int,
balance numeric(19, 2) not null,
currency varchar(64) not null,
name varchar(128) not null,
primary key (id)
);
create table transaction
(
id int not null,
booking_date date null,
primary key (id)
);
create table transaction_item
(
transaction_id int not null,
account_id int not null,
amount numeric(19, 2) not null,
currency varchar(64) not null,
running_balance numeric(19, 2) not null,
note varchar(255),
primary key (transaction_id, account_id)
);
alter table transaction_item
add constraint fk_txn_item_ref_transaction
foreign key (transaction_id) references transaction (id);
alter table transaction_item
add constraint fk_txn_item_ref_account
foreign key (account_id) references account (id);
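To double-check that the foreign key constraints are in place, you can describe the transaction_item table, assuming psql can connect to the crdb_test database with your default credentials:
psql --dbname=crdb_test -c "\d transaction_item"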
DML
insert into account (id,balance,currency,name)
select no,
500.00 + random() * 500.00,
'USD',
md5(random()::text)
from generate_series(1, 10) no;
insert into transaction (id,booking_date)
select no,
now()::date
from generate_series(1, 1000) no;
insert into transaction_item (transaction_id,account_id,amount,currency,running_balance,note)
select no,
round(1 + random() * 9),
500.00 + random() * 500.00,
'USD',
500.00 + random() * 500.00,
'Cockroaches can eat anything'
from generate_series(1, 1000) no;
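It can also be useful to note the source row counts before migrating, so you have something to compare against after the copy. For example, again assuming psql connects with your default credentials:
psql --dbname=crdb_test -c "select count(*) from account"
psql --dbname=crdb_test -c "select count(*) from transaction"
psql --dbname=crdb_test -c "select count(*) from transaction_item"
This should report 10 accounts and 1,000 rows each in transaction and transaction_item.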
CockroachDB Setup
Create a local cluster of three nodes (or one, it doesn't matter for this article):
cockroach start --port=26257 --http-port=8080 --advertise-addr=localhost:26257 --join=localhost:26257 --insecure --store=datafiles/n1 --background
cockroach start --port=26258 --http-port=8081 --advertise-addr=localhost:26258 --join=localhost:26257 --insecure --store=datafiles/n2 --background
cockroach start --port=26259 --http-port=8082 --advertise-addr=localhost:26259 --join=localhost:26257 --insecure --store=datafiles/n3 --background
cockroach init --insecure --host=localhost:26257
Next, create a database called crdb_test that will be populated with the PostgreSQL schema and data:
cockroach sql --insecure --host=localhost:26257 -e "CREATE database crdb_test"
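Optionally, verify that the database was created:
cockroach sql --insecure --host=localhost:26257 -e "SHOW DATABASES"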
Pipeline Setup
Pipeline is a tool that wraps Spring Batch and does all the work for us. First, clone the repo and build it locally:
git clone git@github.com:kai-niemi/roach-pipeline.git pipeline
cd pipeline
chmod +x mvnw
./mvnw clean install
The executable jar is now available under the target folder.
Start it up with the following command (change the URL and credentials to match your PostgreSQL setup):
java -jar target/pipeline.jar \
--pipeline.template.source.url=jdbc:postgresql://localhost:5432/crdb_test \
--pipeline.template.source.username=postgres \
--pipeline.template.source.password=**** \
--pipeline.template.target.url=jdbc:postgresql://localhost:26257/crdb_test?sslmode=disable
This configures the source and target data sources, which are used to pre-populate the batch job forms.
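Before moving on, you can verify that Pipeline can reach the source database by listing the source tables (assuming the default listen port of 8090 used throughout this article; the same endpoint is used later to determine the copy order):
curl -X GET http://localhost:8090/datasource/source-tables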
Configure the Pipeline
Now we are ready to copy data from PostgreSQL to CockroachDB using the sql2sql REST endpoint.
It works by first requesting a form for each table to be copied. The form will be pre-filled with information (taken from pipeline.template.source.*) except for the CREATE TABLE statement, which you need to enter manually. The reason is that PostgreSQL has no support for SHOW CREATE TABLE, which is what this tool uses for introspection.
Generate Form Templates
Let's get a form template for each table in three separate REST API calls.
curl -X GET http://localhost:8090/sql2sql/form?table=account > account.json
curl -X GET http://localhost:8090/sql2sql/form?table=transaction > transaction.json
curl -X GET http://localhost:8090/sql2sql/form?table=transaction_item > transaction_item.json
Add a CREATE TABLE statement to each form JSON file. It will be used to create the tables in CockroachDB. You can copy the statements from PostgreSQL using pg_dump:
pg_dump --dbname=crdb_test --table=account --schema-only
pg_dump --dbname=crdb_test --table=transaction --schema-only
pg_dump --dbname=crdb_test --table=transaction_item --schema-only
Then add the DDL parts to the createQuery parameter in the JSON form files. If you also add IF NOT EXISTS to createQuery, the batch jobs will be fully repeatable.
Example:
"CREATE TABLE IF NOT EXISTS public.account (\nid integer NOT NULL,\nbalance numeric(19,2) NOT NULL,\ncurrency character varying(64) NOT NULL,\nname character varying(128) NOT NULL\n); ALTER TABLE ONLY public.account\n ADD CONSTRAINT IF NOT EXISTS account_pkey PRIMARY KEY (id);"
The complete account.json file then looks like this:
{
"_links" : {
"self" : {
"href" : "http://localhost:8090/sql2sql/form?table=account"
}
},
"table" : "account",
"restartExecutionId" : 0,
"sourceUrl" : "jdbc:postgresql://localhost:5432/crdb_test",
"sourceUsername" : "postgres",
"sourcePassword" : "****",
"targetUrl" : "jdbc:postgresql://localhost:26257/crdb_test?sslmode=disable",
"targetUsername" : "root",
"targetPassword" : "",
"concurrency" : 8,
"chunkSize" : 32,
"linesToSkip" : 0,
"pageSize" : 32,
"sortKeys" : "id ASC",
"selectClause" : "SELECT *",
"fromClause" : "FROM account",
"whereClause" : "WHERE 1=1",
"insertQuery" : "UPSERT INTO account(id,balance,currency,name) VALUES (:id,:balance,:currency,:name)",
"createQuery" : "CREATE TABLE IF NOT EXISTS public.account (\nid integer NOT NULL,\nbalance numeric(19,2) NOT NULL,\ncurrency character varying(64) NOT NULL,\nname character varying(128) NOT NULL\n); ALTER TABLE ONLY public.account\n ADD CONSTRAINT IF NOT EXISTS account_pkey PRIMARY KEY (id);"
}
Submit Batch Jobs
The next step is to POST the forms back, which will register and start the jobs. Because we use foreign keys in PostgreSQL, the jobs need to be registered in topological order based on the foreign key constraints.
To find out the order, we can simply ask the tool:
curl -X GET http://localhost:8090/datasource/source-tables | grep topologyOrder
Which tells you something like:
"topologyOrder" : "transaction,account,transaction_item"
All we need to do now is submit the forms in that given order:
curl -d "@transaction.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql
curl -d "@account.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql
curl -d "@transaction_item.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql
On each of these POST requests, you will get a 202 (Accepted) response with a pipeline:execution link relation and an href to a resource, provided the job was successfully submitted. A 202 only means that the request was accepted for asynchronous processing. If you follow that link URI, it takes you to the registered job and its current status.
$ curl -d "@account.json" -H "Content-Type:application/json" -X POST http://localhost:8090/sql2sql
{
"_links" : {
"pipeline:execution" : {
"href" : "http://localhost:8090/jobs/execution/future/19353fac-8578-46b0-b417-7ba5bcdd03e6"
},
"curies" : [ {
"href" : "http://localhost:8090/rels/{rel}",
"name" : "pipeline",
"templated" : true
} ]
},
"message" : "SQL2SQL Job Accepted"
}
$ curl -X GET http://localhost:8090/jobs/execution/future/19353fac-8578-46b0-b417-7ba5bcdd03e6
Once the jobs have finished, you should see a complete copy of the PostgreSQL database in CockroachDB.
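As a final sanity check, you can compare the row counts in CockroachDB with the source counts in PostgreSQL, for example:
cockroach sql --insecure --host=localhost:26257 --database=crdb_test -e "select count(*) from account" -e "select count(*) from transaction" -e "select count(*) from transaction_item"
The counts should match the source: 10 accounts and 1,000 rows each in transaction and transaction_item.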
Conclusion
In this article, we looked at creating a simple batch-oriented SQL to SQL pipeline at the table level between PostgreSQL and CockroachDB.