Multi-active systems
A guide to the features of multi-active and multi-region systems
Several business drivers can justify adopting a multi-active, multi-region deployment strategy:
Securing business continuity in the event of regional data centre disruptions
Delivering a good customer experience worldwide
Adapting and scaling the business to different market needs and volumes
Complying with data locality/placement regulations
Keeping operational costs sustainable as the business grows
Multi-active systems are a prerequisite to effectively adopting a multi-region strategy. Let's find out how.
Multi-Active Systems
A multi-active system is capable of operating and serving online traffic simultaneously from multiple active data centres and regions. With that come characteristics that satisfy the goals stated above without adding too much infrastructure or application complexity and cost overhead.
Multi-active systems run live in multiple data centres in different regions all the time, and workloads are dynamically shared across these data centres. A multi-active system processes requests simultaneously for either domestic or global markets without any assumptions about locality, traffic affinity or replication delays.
There are no actual concepts of traffic failover or failback. Instead, failures and disruptions are handled transparently through regional and/or global load balancing and traffic rerouting. If one failure domain (data centre or region) begins to fall behind due to disruptions, parts of its workload can move to other domains transparently. If one domain is completely offline, all its work is rebalanced to the remaining domains.
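The rebalancing idea can be sketched in a few lines of Python. This is only an illustration of proportional traffic redistribution, not how any particular load balancer works; the function name, the region names and the weights are all hypothetical.

```python
def rebalance(weights: dict[str, float], healthy: set[str]) -> dict[str, float]:
    """Return the share of traffic each failure domain should receive.

    Domains that are not healthy get a share of zero, and their traffic is
    spread over the remaining domains in proportion to their weights.
    """
    live = {d: w for d, w in weights.items() if d in healthy and w > 0}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy failure domain left to serve traffic")
    return {d: (live[d] / total if d in live else 0.0) for d in weights}


# Hypothetical failure domains with equal capacity.
domains = {"eu-west": 1.0, "us-east": 1.0, "ap-south": 1.0}

# Steady state: every domain serves an equal share.
print(rebalance(domains, {"eu-west", "us-east", "ap-south"}))

# One domain is falling behind: lower its weight to move part of its work away.
print(rebalance({**domains, "us-east": 0.5}, {"eu-west", "us-east", "ap-south"}))

# One domain is completely offline: all of its work moves to the others.
print(rebalance(domains, {"eu-west", "ap-south"}))
```

A degraded domain is modelled here by a lower weight and an offline domain by leaving it out of the healthy set, so the same code path covers both cases, without a separate failover procedure.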
Using three failure domains allows one to go completely dark, while still allowing systems to make forward progress through consensus decisions. Using five failure domains allows two domains to go dark, and so on. This allows for systems to be both highly available and always consistent and correct.
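The arithmetic behind these numbers is majority quorum: a decision needs more than half of the failure domains, so the rest may be lost. A minimal sketch, with hypothetical function names:

```python
def quorum_size(domains: int) -> int:
    """Smallest number of failure domains that forms a strict majority."""
    return domains // 2 + 1


def tolerated_failures(domains: int) -> int:
    """How many failure domains can go dark while a majority remains."""
    return domains - quorum_size(domains)


for n in (3, 4, 5, 7):
    print(f"{n} domains: quorum of {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} dark")
```

Note that an even number of domains adds no extra tolerance: four domains survive one failure, exactly like three, which is why odd counts are the usual choice.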
A multi-active system is not without certain challenges:
It needs a fit-for-purpose solution to manage state and replication (data distribution at a global scale is a difficult problem), which includes:
Protecting rule invariants, even during contention and failures
Ensuring effectively-once outcomes of event processing
Higher service latency due to mandatory cross-datacenter coordination
Existing design assumptions on single DC/region deployment
Constraints around auxiliary system integrations
Breaking current assumptions on system design
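To make the first two challenges concrete, here is a minimal in-memory sketch of an idempotent event handler that also guards an invariant. The `Account` class, the event IDs and the "balance must not go negative" rule are assumptions made for illustration. In a real multi-active system, the duplicate check, the invariant check and the state change would have to commit together in one transaction.

```python
class Account:
    """Toy aggregate: applies each event once and never lets the balance go negative."""

    def __init__(self, balance: int = 0) -> None:
        self.balance = balance
        self.applied: set[str] = set()  # IDs of events already applied

    def apply(self, event_id: str, amount: int) -> bool:
        """Apply an event; return True if state changed, False for a duplicate."""
        if event_id in self.applied:
            return False  # redelivered event: the outcome is already recorded
        if self.balance + amount < 0:
            raise ValueError("invariant violated: balance must not go negative")
        self.balance += amount
        self.applied.add(event_id)
        return True


account = Account()
account.apply("evt-1", 100)   # first delivery changes the balance
account.apply("evt-1", 100)   # duplicate delivery is ignored
print(account.balance)
```

Events may be delivered more than once, but each one affects the state at most once, which is what "effectively once" refers to.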
To contain complexity, most of these challenges are preferably pushed down to the resource tier - the database - instead of being managed in the app tier. CockroachDB is one such system with a wide range of multi-region deployment options and first-class support for crafting multi-active systems.
Failover-based Systems
To put multi-active systems in contrast, let's quickly look at their main predecessor: singly-homed, failover-based systems. A singly-homed system is crafted to operate and serve online traffic from a single data centre or at most a single region. It is crafted that way in terms of design choices, technology selections, network assumptions and supporting infrastructure components.
In the event of a primary domain disruption or disaster, a singly-homed system may fail over traffic to a secondary data centre. After the disruption has cleared, it may fail back traffic to the original primary domain.
This type of setup has many limitations:
Unable to scale horizontally beyond a single data centre/region
Unable to load-balance traffic freely across multiple active data centres
Dependent on standby, underutilized resources, increasing TCO (300% capacity for steady state)
Must use asynchronous replication for availability and performance, with the risk of data loss
Long recovery times after failures
Complex and error-prone failover protocols with manual checkpoints/sign-offs
Unclear when and if a standby system can resume traffic from a safe point
Difficult and risky to test and verify that the protocol works
Disaster Recovery Spectrum
The main objective of a disaster recovery plan is to minimize the time it takes to recover from a severe disruption event and reduce the amount of data loss and other business impacts.
The spectrum of disaster recovery solutions typically ranges from offline backups to full-blown multi-region deployments, also with backups.
Backups - Data is frequently backed up and sent off-site or to cloud storage.
The recovery time objective (RTO) is governed by the time it takes to restore the database to a new setup
The recovery point objective (RPO) is governed by the frequency of incremental and full backups
Cold Standby - A minimally provisioned environment with the ability to take over core services from a failed primary data centre.
Higher TCO due to under-utilized standby capacity
RTO is governed by how fast a switchover can be made to the secondary
RPO is governed by the async replication delay from the primary to the secondary
Warm Standby - A fully provisioned environment with the ability to take over a failed primary data centre.
Higher TCO due to excessive amounts of under-utilized standby capacity
Could serve certain read-only traffic at the same time as the primary
RTO and RPO are similar to those of a cold standby, only somewhat lower due to higher readiness
Multi-Active - Each deployment site serves production traffic simultaneously.
All data centres provide traffic at the same time for the entire keyspace
There is no actual notion of fail-over or fail-back; failures and recovery are handled transparently towards the app tier
RTO is governed by how quickly an isolated or crashed node can drop its authority over reads and writes to local data (typically a few seconds)
RPO is zero due to consensus-based replication
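The RPO rules across this spectrum can be summarised as a small worst-case calculation. This is a rough sketch under the assumptions stated in the list above; the strategy labels, the function name and the example durations are hypothetical and are not measurements.

```python
from datetime import timedelta
from typing import Optional


def worst_case_rpo(
    strategy: str,
    backup_interval: Optional[timedelta] = None,
    replication_lag: Optional[timedelta] = None,
) -> timedelta:
    """Upper bound on lost data, expressed as a time window before the failure."""
    if strategy == "backups":
        if backup_interval is None:
            raise ValueError("backups need a backup_interval")
        return backup_interval  # everything since the last backup may be lost
    if strategy in ("cold-standby", "warm-standby"):
        if replication_lag is None:
            raise ValueError("standby setups need a replication_lag")
        return replication_lag  # writes not yet shipped to the secondary
    if strategy == "multi-active":
        return timedelta(0)  # a write is acknowledged only after a quorum has it
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical figures, for illustration only.
print(worst_case_rpo("backups", backup_interval=timedelta(hours=1)))
print(worst_case_rpo("warm-standby", replication_lag=timedelta(seconds=5)))
print(worst_case_rpo("multi-active"))
```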
Multi-active systems stand out from most fail-over-based models in terms of cost and complexity reduction. They are far more resilient against different categories of disruptions, but not immune to disasters. If a multi-active system loses a majority of its failure domains (such as 2 zones in a 3-zone region), or if an operator error corrupts a database, then the music stops. Therefore, backups are still commonly used alongside multi-active systems, adding a safety harness for recovery.
Combined with multiple regions, the survivable blast radius is extended to cover most conditions, and you can also improve customer experience for a global market.
Multi-region deployments
A single data centre is a single point of failure, and so is a single region. If that data centre or region goes offline for a longer period without any recovery option, it may have a severe impact on the business and the company's reputation.
Adding two or more data centres to a single region will increase the survivable blast radius and decrease the likelihood of severe, long-lasting service disruptions due to a single DC outage.
Deploying a system (as in many services/components working in concert) across multiple, geo-separated regions extends the survivable blast radius even further. Single-region assumptions cannot, however, be transferred to this new ecosystem because of how we traditionally manage state and consistency. Leveraging multi-region effectively requires a multi-active system architecture. Not exclusively, but it is very much a state/database undertaking that needs a fit-for-purpose solution like CockroachDB.
Summary
This article discusses the advantages of adopting a multi-region deployment strategy for businesses, focusing on multi-active systems. These systems operate simultaneously across multiple data centres and regions, providing increased resiliency and adaptability to market needs. The article also contrasts multi-active systems with traditional failover-based systems and examines the disaster recovery spectrum, including backups, cold standby, warm standby, and multi-active solutions. Ultimately, multi-active systems offer significant benefits in terms of cost and complexity reduction, while still requiring backups to ensure data safety.