Cluster Singletons using ShedLock

How to execute scheduled tasks as cluster singletons using ShedLock and CockroachDB

Overview

You have a service deployed in multiple instances across the globe that must run scheduled tasks exclusively, as cluster singletons. How do you ensure that a task scheduled on multiple processors runs only once at any given time in a distributed environment?

This is a fairly common requirement for batch data ingestion, sending notifications, performing cleanups and other data housekeeping operations. It leans towards using some kind of locking mechanism to coordinate actions across a network of machines. Distributed locks are, however, a fundamentally unsafe concept in a system of independent processors interacting over an asynchronous network: we can't tell the difference between a slow processor and a failed one. Put in other words:

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton


Add distributed locks to the list, or don't, since in the end it comes down to the same tradeoffs, conceptualized by the CAP theorem. Except for naming things, which is covered by some other theorem.

A distributed lock service would need to provide both safety and liveness. To recap, a safety property asserts that nothing bad happens during execution, while a liveness property asserts that something good eventually happens. Mapped to lock semantics, an "unsafe" lock would be one that hands itself out to multiple clients at the same time. It wouldn't really be a lock anymore, since it no longer provides mutual exclusion (mutex).

A lack of "liveness" would be to wait indefinitely for a claimed lock to be released. For example, if a client holding a lock has gone away fishing for an arbitrary long period of time (unreachable) and thereby preventing other clients from acquiring the lock and make forward progress. Adding a timeout or putting a TTL on the lock guarantee itself doesn't fix the problem since again, we can't tell if the client on the fishing trip eventually returns back with a catch.

To preserve safety, one option is to rely on the storage system itself and its transactional guarantees, or to use a monotonically increasing fencing token in the lock service. The latter requires the component that the lock protects to become a stateful participant in the protocol, responsible for rejecting old tokens. This is the type of approach you find in Google's Chubby, etcd and more recently also Hazelcast, with slightly different terminology.
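
To make the fencing token idea concrete, here is a minimal sketch (hypothetical names, not tied to any particular lock service) of a storage component that rejects writes carrying a stale token:

public class FencedStorage {
    // Highest fencing token observed so far; the lock service hands out
    // a monotonically increasing token with every lock acquisition.
    private long highestToken = 0;

    // Rejects any write carrying a token older than one already seen, so a
    // client that paused while holding an expired lock cannot overwrite newer data.
    public synchronized void write(long fencingToken, String key, String value) {
        if (fencingToken < highestToken) {
            throw new IllegalStateException("Stale fencing token: " + fencingToken);
        }
        highestToken = fencingToken;
        // ... apply the write to the underlying storage ...
    }
}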

OK, what has this got to do with task scheduling? Nothing by itself, but if you need scheduled tasks to run exclusively as singletons rather than concurrently, then a mutex is your friend, and in a distributed deployment you need a mutex that spans beyond a single processor.

For the remainder of this article, we are not going to use a lock service with fencing tokens but instead CockroachDB, which is a distributed SQL database with strong consistency. A distributed database like CockroachDB has all the underlying mechanisms needed to manage consensus across a cluster, and that's what we are going to leverage for task scheduling locks (with a caveat).

Introduction to ShedLock

In this example, we are going to combine a small Java library called ShedLock with Spring's built-in scheduling mechanism. Spring scheduling only does scheduling, not locking; ShedLock, on the other hand, does not do any scheduling, only locking. Together they provide the simple locking mechanism needed to ensure cluster-singleton task execution.

ShedLock provides an API for acquiring and releasing locks as well as connectors to a wide variety of lock providers such as Etcd, Zookeeper, Consul, Cassandra, Redis, Hazelcast and virtually any SQL database through JDBC.
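
As a rough illustration of that API (a sketch only; the exact LockConfiguration constructor has changed between ShedLock versions), acquiring and releasing a named lock programmatically looks something like this:

import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

import net.javacrumbs.shedlock.core.LockConfiguration;
import net.javacrumbs.shedlock.core.LockProvider;
import net.javacrumbs.shedlock.core.SimpleLock;

public class ManualLockExample {
    private final LockProvider lockProvider;

    public ManualLockExample(LockProvider lockProvider) {
        this.lockProvider = lockProvider;
    }

    public void runExclusively(Runnable task) {
        // Try to acquire the named lock; if another instance already holds it,
        // the Optional is empty and we simply skip this round.
        Optional<SimpleLock> lock = lockProvider.lock(new LockConfiguration(
                Instant.now(),              // time of the lock attempt
                "manual_clusterSingleton",  // lock name, used as the key
                Duration.ofMinutes(5),      // lockAtMostFor
                Duration.ofMinutes(1)));    // lockAtLeastFor

        lock.ifPresent(l -> {
            try {
                task.run();
            } finally {
                l.unlock();                 // release once the work is done
            }
        });
    }
}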

Combining these two mechanisms, with a distributed database as the lock provider managing the lock lifecycle, gives a simple and easy-to-use cluster singleton task framework. There are distributed schedulers like Quartz and Chronos on Mesos that achieve similar goals, but that's a separate topic.

CockroachDB's strong consistency, fault tolerance and distributed coordination provide the perfect foundation for the underpinning mutex mechanism. You could build a global task execution service offering the same guarantees. A CockroachDB cluster can span multiple regions, and it provides the same transactional guarantees and strong consistency properties regardless of the deployment topology.

ShedLock does have its limitations though: there are situations where it fails to provide lock semantics, and it's important to be aware of them. Let's move on.

Limitations

The caveat with ShedLock is that it relies on lock timeouts. If a method exceeds the set time limit and the lock expires, the lock can be claimed by another thread or process. Although this is documented behavior, it fails the safety property and can hand out the same lock to multiple threads.

This in turn means that a paused or delayed method that comes back to life again must avoid causing side effects that are no longer within the scope of the lock. But how can it tell? There is no safe way to make that determination. There is a KeepAliveLockProvider that can extend the lock period if needed, but it doesn't solve the problem, it just extends the time. This is the problem ultimately solved by the fencing tokens described earlier.

Rule of thumb:

You have to set lockAtMostFor to a value which is much longer than normal execution time. If the task takes longer than lockAtMostFor the resulting behavior may be unpredictable (more than one process will effectively hold the lock).

Bottom line, you need to be aware of this limitation before using ShedLock for anything serious.

Source Code

The source code for this example is available on GitHub.

Configuration

To use ShedLock with Spring Boot, first add the Maven dependencies:

        <dependency>
            <groupId>net.javacrumbs.shedlock</groupId>
            <artifactId>shedlock-spring</artifactId>
            <version>4.42.0</version>
        </dependency>
        <dependency>
            <groupId>net.javacrumbs.shedlock</groupId>
            <artifactId>shedlock-provider-jdbc-template</artifactId>
            <version>4.42.0</version>
        </dependency>

Next, we need to create a database table that will store information about the locks:

create table shedlock
(
    name       varchar(64)  not null,
    lock_until timestamp    not null,
    locked_at  timestamp    not null,
    locked_by  varchar(255) not null,
    primary key (name)
);

Next, we define the lock provider bean, which uses JDBC:

import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;
import net.javacrumbs.shedlock.core.LockProvider;
import net.javacrumbs.shedlock.provider.jdbctemplate.JdbcTemplateLockProvider;
import net.javacrumbs.shedlock.spring.annotation.EnableSchedulerLock;
import static net.javacrumbs.shedlock.provider.jdbctemplate.JdbcTemplateLockProvider.Configuration.builder;

@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "10m")
public class LockingConfiguration {
    @Bean
    public LockProvider lockProvider(DataSource dataSource) {
        // Use the database clock for lock expiry so all instances agree on time
        return new JdbcTemplateLockProvider(builder()
                .withJdbcTemplate(new JdbcTemplate(dataSource))
                .usingDbTime()
                .build()
        );
    }
}

In the configuration above, notice the @EnableScheduling and @EnableSchedulerLock annotations, which enable Spring scheduling and ShedLock's scheduler locking, and set the default lockAtMostFor value (10 minutes in this example).
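
ShedLock reuses the application's DataSource, so pointing the service at CockroachDB is just ordinary Spring Boot datasource configuration. CockroachDB speaks the PostgreSQL wire protocol, so a minimal application.properties could look like the following (host, port, database name and credentials are placeholders, not taken from the demo project):

# Placeholder connection settings for a local, insecure CockroachDB node;
# adjust host, port, database and credentials for your own cluster.
spring.datasource.url=jdbc:postgresql://localhost:26257/defaultdb?sslmode=disable
spring.datasource.username=root
spring.datasource.password=
spring.datasource.driver-class-name=org.postgresql.Driver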

Usage

We are now ready to annotate the methods we would like to run on a schedule as cluster singletons, meaning that each method will run exclusively in one instance at a time no matter how many application instances we deploy. For this purpose we use the @Scheduled and @SchedulerLock annotations.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
import net.javacrumbs.shedlock.spring.annotation.SchedulerLock;

@Service
public class ProductNewsFeed {
    private final Logger logger = LoggerFactory.getLogger(getClass());

    @Autowired
    private ProductService productService;

    @Scheduled(cron = "0/30 * * * * ?")
    @SchedulerLock(lockAtLeastFor = "1m", lockAtMostFor = "5m", name = "publishNews_clusterSingleton")
    public void publishNews() {
        logger.info(">> Entered cluster singleton method");
        // gone fishing
        logger.info("<< Exiting cluster singleton method");
    }
}

The @Scheduled annotation makes Spring create an underlying scheduled task that executes the method at a specified interval. It uses a cron expression, which in this case means every 30 seconds.

The @SchedulerLock annotation is used by ShedLock to acquire a lock when the method is invoked. The name must be unique since it's used as the lock key. The optional lockAtLeastFor parameter means the lock will be held for at least 1 minute, and the optional lockAtMostFor parameter means the lock will be held for at most 5 minutes.

If the method execution goes beyond this time, the lock is up for grabs for another session or thread. Otherwise the lock is released when the method returns, but no earlier than lockAtLeastFor.

Demo

First build the demo app:

git clone git@github.com:kai-niemi/roach-spring-boot.git
./mvnw clean install
cd spring-boot-locking/target

Start first instance:

nohup java -jar spring-boot-locking.jar --server.port=8090 > locking-1-stdout.log 2>&1 &

Start second instance:

nohup java -jar spring-boot-locking.jar --server.port=8091 > locking-2-stdout.log 2>&1 &

Start third instance:

nohup java -jar spring-boot-locking.jar --server.port=8092 > locking-3-stdout.log 2>&1 &

You should then see the cluster singleton log messages in only one of the logs at any given time. You can tailor the timeouts so that the task exceeds the time limit, at which point the lock guarantee goes out the window.
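
You can also watch the lock handover in the database itself, since each lock is just a row in the shedlock table created earlier:

-- Shows which instance (locked_by) currently holds each named lock
-- and when that lock expires (lock_until).
select name, locked_by, locked_at, lock_until from shedlock;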

Conclusion

In this article, we learned how to create and synchronize cluster singleton scheduled tasks using ShedLock and CockroachDB. ShedLock does, however, come with the caveat that it relies on lock timeouts.
