Multi-active systems

A guide to the features of multi-active and multi-region systems

Jul 31, 2023

Some business drivers that could justify adopting a multi-active and multi-region deployment strategy:

  • Secure business continuity in the event of regional data centre disruptions

  • Deliver a good customer experience worldwide

  • Deliver business adaptability and scalability to different market needs/volumes

  • Comply with data locality/placement regulations

  • Keep operational costs sustainable as the business grows

Multi-active systems are a prerequisite to effectively adopting a multi-region strategy. Let's find out how.

Multi-Active Systems

A multi-active system can operate and serve online traffic simultaneously from multiple active data centres and regions. With that come characteristics that satisfy the goals stated above without adding too much infrastructure/app complexity and cost overhead.

Multi-active systems run live in multiple data centres in different regions all the time, and workloads are dynamically shared across these data centres. A multi-active system processes requests simultaneously for either domestic or global markets without any assumptions about locality, traffic affinity or replication delays.

There are no actual concepts of traffic failover or failback. Instead, failures and disruptions are handled transparently through regional and/or global load balancing and traffic rerouting. If one failure domain (data centre or region) begins to fall behind due to disruptions, parts of its workload can move to other domains transparently. If one domain is completely offline, all its work is rebalanced to the remaining domains.
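The rebalancing idea can be illustrated with a minimal sketch. It assumes traffic is split by per-domain weights and that a health signal already exists. The function name `rebalance` and the region names are hypothetical, and a real global load balancer does considerably more than this.

```python
def rebalance(weights: dict[str, float], healthy: set[str]) -> dict[str, float]:
    """Spread the traffic share of unhealthy domains across the healthy ones.

    Healthy domains keep their relative proportions; shares are renormalised
    so that they sum to 1.0 again.
    """
    alive = {domain: w for domain, w in weights.items() if domain in healthy}
    total = sum(alive.values())
    if total == 0:
        raise RuntimeError("no healthy failure domain left to serve traffic")
    return {domain: w / total for domain, w in alive.items()}


# Hypothetical steady-state split across three regions.
shares = {"eu-west": 0.5, "us-east": 0.25, "ap-south": 0.25}

# eu-west goes completely offline: its work moves to the remaining domains.
print(rebalance(shares, healthy={"us-east", "ap-south"}))
# {'us-east': 0.5, 'ap-south': 0.5}
```

Note that there is no separate failover step in this model: a domain dropping out is just another input to the same routing decision.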

Using three failure domains allows one to go completely dark while the remaining two still form a majority, so systems can keep making forward progress through consensus decisions. Using five failure domains allows two domains to go dark, and so on. This allows systems to be both highly available and consistent.
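The arithmetic behind these numbers is simple majority quorum. The sketch below assumes one voting replica per failure domain; the function names are illustrative only.

```python
def quorum_size(domains: int) -> int:
    """Smallest number of domains that forms a strict majority."""
    if domains < 1:
        raise ValueError("need at least one failure domain")
    return domains // 2 + 1


def tolerated_failures(domains: int) -> int:
    """How many domains can go dark while a majority remains."""
    return domains - quorum_size(domains)


for n in (3, 4, 5, 7):
    print(f"{n} domains: quorum={quorum_size(n)}, tolerates {tolerated_failures(n)} down")
# 3 domains: quorum=2, tolerates 1 down
# 4 domains: quorum=3, tolerates 1 down
# 5 domains: quorum=3, tolerates 2 down
# 7 domains: quorum=4, tolerates 3 down
```

An even number of domains adds cost without adding fault tolerance, which is why odd counts such as three or five are the usual choice.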

A multi-active system is not without certain challenges:

  • Needs a fit-for-purpose solution to manage state and replication (data distribution at a global scale is a difficult problem), covering concerns such as:

    • Protecting rule invariants, even during contention and failures

    • Ensuring effectively-once outcomes of event processing

  • Higher service latency due to mandatory cross-data-centre coordination

  • Existing design assumptions on single DC/region deployment

  • Constraints around auxiliary system integrations

  • Breaking current assumptions on system design
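To make the first challenge more concrete, here is a minimal in-memory sketch of effectively-once processing combined with an invariant check. The `Account` class, its `apply` method and the event IDs are hypothetical. In a real multi-active system the duplicate check and the state change would have to happen atomically in a consistently replicated store, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    balance: int = 0
    processed: set[str] = field(default_factory=set)  # IDs of events already applied

    def apply(self, event_id: str, amount: int) -> bool:
        """Apply an event at most once and keep the invariant balance >= 0.

        Returns True if the event changed state, False if it was a duplicate.
        """
        if event_id in self.processed:
            return False  # redelivery: the outcome is already recorded
        if self.balance + amount < 0:
            raise ValueError("invariant violated: balance must not go negative")
        self.balance += amount
        self.processed.add(event_id)
        return True


acct = Account()
acct.apply("evt-1", 100)   # first delivery is applied
acct.apply("evt-1", 100)   # duplicate delivery is ignored
print(acct.balance)        # 100
```

Delivery may happen more than once, but the observable outcome is as if each event was processed once. Keeping that guarantee under concurrent writers in several regions is the hard part.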

To contain complexity, most of these challenges are preferably pushed down to the resource tier - the database - instead of being managed in the app tier. CockroachDB is one such system with a wide range of multi-region deployment options and first-class support for crafting multi-active systems.

Failover-based Systems

To contrast with multi-active systems, let's quickly look at the main predecessor: singly-homed, failover-based systems. A singly-homed system is crafted to operate and serve online traffic from a single data centre or at most a single region. "Crafted" here covers design choices, technology selections, network assumptions and supporting infrastructure components.

In the event of a primary domain disruption or disaster, a singly-homed system may fail over traffic to an alternative secondary data centre. After the disruption is cleared, it may "fail back" traffic to the original primary domain.

This type of setup has many limitations:

  • Unable to scale horizontally beyond a single data centre/region

  • Unable to load-balance traffic freely across multiple active data centres

  • Dependent on standby, underutilized resources, increasing TCO (300% capacity for steady state)

  • Must use asynchronous replication for availability and performance, with the risk of data loss

  • Long recovery times after failures

  • Complex and error-prone failover protocols with manual checkpoints/sign-offs

  • Unclear when and if a standby system can resume traffic from a safe point

  • Difficult and risky to test and verify that the protocol works

Disaster Recovery Spectrum

The main objective of a disaster recovery plan is to minimize the time it takes to recover from a severe disruption event and reduce the amount of data loss and other business impacts.

The spectrum of disaster recovery solutions typically ranges from offline backups to full-blown multi-region deployments, which are still complemented by backups.

  • Backups - Data is frequently backed up and sent off-site or to cloud storage.

    • The recovery time objective (RTO) is governed by the time it takes to restore the database to a new setup

    • The recovery point objective (RPO) is governed by the frequency of incremental and full backups

  • Cold Standby - A minimally provisioned environment with the ability to take over core services from a failed primary data centre.

    • Higher TCO due to under-utilized standby capacity

    • RTO is governed by how fast a switchover can be made to the secondary

    • RPO is governed by the async replication delay from the primary to the secondary

  • Warm Standby - A fully provisioned environment with the ability to take over a failed primary data centre.

    • Higher TCO due to excessive amounts of under-utilized standby capacity

    • Could serve certain read-only traffic at the same time as the primary

    • RTO and RPO are quite similar to a cold standby, only a bit lower due to higher readiness

  • Multi-Active - Each deployment site serves production traffic simultaneously.

    • All data centres serve traffic at the same time for the entire keyspace

    • There is no actual notion of failover or failback; failures and recovery are handled transparently to the app tier

    • RTO is governed by how quickly an isolated or crashed node can drop its authority over reads and writes to local data (typically a few seconds)

    • RPO is zero due to consensus-based replication
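As a rough illustration of what governs RPO across this spectrum, the sketch below maps each strategy to its worst-case data loss window. It is a simplification under stated assumptions: the strategy labels, the function name `worst_case_rpo` and the example durations are hypothetical, and real figures depend on the specific setup.

```python
from datetime import timedelta


def worst_case_rpo(
    strategy: str,
    backup_interval: timedelta = timedelta(0),
    replication_lag: timedelta = timedelta(0),
) -> timedelta:
    """Approximate upper bound on data loss for a given DR strategy."""
    if strategy == "backups":
        # Anything written since the last backup can be lost.
        return backup_interval
    if strategy in ("cold-standby", "warm-standby"):
        # Anything not yet shipped by async replication can be lost.
        return replication_lag
    if strategy == "multi-active":
        # Writes are acknowledged only after a majority has them.
        return timedelta(0)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical figures, for comparison only.
print(worst_case_rpo("backups", backup_interval=timedelta(hours=1)))
print(worst_case_rpo("warm-standby", replication_lag=timedelta(seconds=5)))
print(worst_case_rpo("multi-active"))
```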

Multi-active systems stand out from most failover-based models in terms of cost and complexity reduction. They are far more resilient against different categories of disruptions, but not immune to disasters. If a multi-active system loses a majority of its failure domains (like 2 zones in a 3-zone region) or if an operator error corrupts a database, then the music stops. Therefore, backups are still commonly used alongside multi-active systems as a safety harness for recovery.

Combined with multiple regions, the range of survivable failures is extended to cover most conditions, and you can also improve customer experience for a global market.

Multi-region deployments

A single data centre is a single point of failure, and so is a single region. If that data centre/region goes offline for a longer period without any recovery option, it may have a severe impact on the business and the company's reputation.

Adding two or more data centres to a single region will contain the blast radius of a single DC outage and decrease the likelihood of severe, long-lasting service disruptions.

Deploying a system (as in many services/components working in concert) across multiple, geo-separated regions extends resilience to region-wide outages. Single-region assumptions cannot, however, be transferred to this new ecosystem due to how we traditionally manage state and consistency. Leveraging multi-region effectively requires a multi-active system architecture. That is not exclusively a state/database undertaking, but very much so, and it needs a fit-for-purpose solution like CockroachDB.

Summary

This article discusses the advantages of adopting a multi-region deployment strategy for businesses, focusing on multi-active systems. These systems operate simultaneously across multiple data centres, providing increased resiliency and adaptability to market needs. The article also contrasts multi-active systems with traditional failover-based systems and examines the disaster recovery spectrum, including backups, cold standby, warm standby, and multi-active solutions. Ultimately, multi-active systems offer significant benefits in terms of cost and complexity reduction, while still requiring backups to ensure data safety.