Highly available multi-region web application

May 17, 2022 · 5 min read

Updated: May 24, 2024

This reference architecture shows how to run an Azure App Service application in multiple regions to achieve high availability.

There are several general approaches to achieving high availability across regions:

Active/Passive with hot standby: traffic goes to one region while the other waits on hot standby. Hot standby means the VMs in the secondary region are allocated and running at all times.
Active/Passive with cold standby: traffic goes to one region while the other waits on cold standby. Cold standby means the VMs in the secondary region are not allocated until needed for failover. This approach costs less to run but will generally take longer to come online during a failure.
Active/Active: both regions are active, and requests are load-balanced between them. If one region becomes unavailable, it is taken out of rotation.

This reference focuses on active/passive with hot standby. It extends the single region design for a scalable web application.

Use Cases

Design a business continuity and disaster recovery plan for LoB applications
Deploy mission-critical applications running on Windows or Linux
Improve user experience by keeping applications available

Architecture

Architecture Workflow

This architecture builds on the one shown in Improve scalability in a web application.

The main differences are:

Primary and secondary regions. This architecture uses two regions to achieve higher availability. The application is deployed to each region. During normal operations, network traffic is routed to the primary region. If the primary region becomes unavailable, traffic is routed to the secondary region.
Front Door. Front Door routes incoming requests to the primary region. If the application running in that region becomes unavailable, Front Door fails over to the secondary region.
Geo-replication of SQL Database and/or Cosmos DB.

A multi-region architecture can provide higher availability than deploying to a single region. If a regional outage affects the primary region, you can use Front Door to fail over to the secondary region. This architecture can also help if an individual subsystem of the application fails.

Components

Azure Active Directory
Azure DNS
Azure Content Delivery Network
Azure Front Door
Azure App Service
Azure Functions
Azure Storage
Azure Redis Cache
Azure SQL Database
Azure Cosmos DB
Azure Search

Recommendations

Your requirements might differ from the architecture described here. Use the recommendations in this section as a starting point.

Regional Pairing

Each Azure region is paired with another region within the same geography. In general, choose regions from the same regional pair (for example, East US 2 and Central US).

Benefits of doing so include:

If there is a broad outage, recovery of at least one region out of every pair is prioritized.
Planned Azure system updates are rolled out to paired regions sequentially to minimize possible downtime.
In most cases, regional pairs reside within the same geography to meet data residency requirements.

Resource Group

Consider placing the primary region, secondary region, and Front Door into separate resource groups. This lets you manage the resources deployed to each region as a single collection.

Azure Front Door

Routing - Front Door supports several routing mechanisms. For this scenario, use priority routing: Front Door sends all requests to the primary region unless the endpoint for that region becomes unreachable, at which point it automatically fails over to the secondary region.
Health Probe - Front Door uses an HTTP (or HTTPS) probe to monitor the availability of each back end. The probe gives Front Door a pass/fail test for failing over to the secondary region.
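The routing and health probe behavior described above can be modeled as priority-based selection over backends with a pass/fail health flag. The following is a minimal sketch of that idea in Python, not Front Door's actual implementation; the `Backend` class, the `pick_backend` function, and the region names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backend:
    """One regional deployment behind the router (hypothetical model)."""
    name: str
    priority: int   # lower number = preferred
    healthy: bool   # pass/fail result of the latest health evaluation


def pick_backend(backends: Sequence[Backend]) -> Optional[Backend]:
    """Return the healthy backend with the lowest priority number, or None."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.priority)


secondary = Backend("secondary-region", priority=2, healthy=True)

# Normal operations: the primary passes its probe and receives the traffic.
normal = pick_backend([Backend("primary-region", 1, True), secondary])

# Primary fails its probe: traffic moves to the secondary.
failed_over = pick_backend([Backend("primary-region", 1, False), secondary])

print(normal.name, "->", failed_over.name)
```

In this simplified model the secondary only receives traffic while the primary is marked unhealthy, which mirrors the active/passive design this reference focuses on.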

SQL Database - Use Active Geo-Replication to create a readable secondary replica in a different region. You can have up to four readable secondary replicas. Fail over to a secondary database if your primary database fails or needs to be taken offline. Active Geo-Replication can be configured for any database in any elastic database pool.

Cosmos DB - Cosmos DB supports geo-replication across regions in an active-active pattern with multiple write regions. Alternatively, you can designate one region as the writable region and the others as read-only replicas. If there is a regional outage, you can fail over by selecting another region to be the write region.

Storage - For Azure Storage, use read-access geo-redundant storage (RA-GRS). With RA-GRS storage, the data is replicated in a secondary region. You have read-only access to the data in the secondary region through a separate endpoint. If there is a regional outage or disaster, the Azure Storage team might decide to perform a geo-failover to the secondary region. There is no customer action required for this failover.

Considerations

Availability

Consider these points when designing for high availability across regions.

Azure Front Door

Azure Front Door automatically fails over if the primary region becomes unavailable. When Front Door fails over, there is a period of time (usually about 20-60 seconds) when clients cannot reach the application. The duration is affected by the following factors:

Frequency of health probes. The more frequently the health probes are sent, the faster Front Door can detect that a backend has gone down or has returned to a healthy state.
Sample size configuration. This configuration controls how many samples are required for the health probe to detect that the primary backend has become unreachable. If this value is too low, you could get false positives from intermittent issues.
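To see how probe frequency and sample size interact, it can help to model health evaluation as a sliding window over recent probe results. The sketch below is an illustrative assumption, not Front Door's documented algorithm: the `HealthWindow` class, the `successes_required` threshold, and the 30-second interval are all hypothetical values chosen for the example.

```python
from collections import deque
from typing import Deque, Iterable, Optional


class HealthWindow:
    """Sliding window over the most recent probe results for one backend."""

    def __init__(self, sample_size: int, successes_required: int) -> None:
        if not 0 < successes_required <= sample_size:
            raise ValueError("need 0 < successes_required <= sample_size")
        self.sample_size = sample_size
        self.successes_required = successes_required
        self.samples: Deque[bool] = deque(maxlen=sample_size)

    def record(self, probe_ok: bool) -> bool:
        """Store one probe result and return the resulting health verdict."""
        self.samples.append(probe_ok)
        # Slots not yet filled count as successes so a new backend starts healthy.
        unfilled = self.sample_size - len(self.samples)
        return sum(self.samples) + unfilled >= self.successes_required


def probes_until_unhealthy(window: HealthWindow,
                           results: Iterable[bool]) -> Optional[int]:
    """Feed results in order; return how many it took to flip to unhealthy."""
    for count, ok in enumerate(results, start=1):
        if not window.record(ok):
            return count
    return None


PROBE_INTERVAL_SECONDS = 30  # hypothetical probe frequency

# A healthy backend with a full window of successes, then a hard outage.
window = HealthWindow(sample_size=4, successes_required=2)
for _ in range(4):
    window.record(True)
failed_probes = probes_until_unhealthy(window, [False] * 10)
detection_seconds = failed_probes * PROBE_INTERVAL_SECONDS

# A one-sample window reacts to a single intermittent failure.
twitchy = HealthWindow(sample_size=1, successes_required=1)
twitchy.record(True)
flipped_on_one_blip = not twitchy.record(False)

print(failed_probes, detection_seconds, flipped_on_one_blip)
```

Under these assumptions, a larger window tolerates isolated probe failures but needs more consecutive failed probes, and therefore more time, before the backend is declared unreachable. A shorter probe interval reduces that time at the cost of more probe traffic.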

SQL Database

The recovery point objective (RPO) and estimated recovery time (ERT) for SQL Database are documented in the Overview of Business Continuity with Azure SQL Database.

Cosmos DB

RPO and recovery time objective (RTO) for Cosmos DB are configurable via the consistency levels used, which provide trade-offs between availability, data durability, and throughput. Cosmos DB provides a minimum RTO of 0 for a relaxed consistency level with multi-master or an RPO of 0 for strong consistency with single-master. To learn more, see the Cosmos DB documentation on consistency levels.

Storage

RA-GRS storage provides durable storage, but it's important to understand what can happen during an outage:

If a storage outage occurs, there will be a period of time when you don't have write access to the data. You can still read from the secondary endpoint during the outage.
If a regional outage or disaster affects the primary location and the data there cannot be recovered, the Azure Storage team may decide to perform a geo-failover to the secondary region.
Data replication to the secondary region is performed asynchronously. Therefore, if a geo-failover is performed, some data loss is possible if the data can't be recovered from the primary region.
Transient failures, such as a network outage, will not trigger a storage failover. Design your application to be resilient to transient failures. Mitigation options include:
- Read from the secondary region.
- Temporarily switch to another storage account for new write operations (for example, to queue messages).
- Copy data from the secondary region to another storage account.
- Provide reduced functionality until the system fails back.
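The first mitigation, reading from the secondary region, can be sketched as a read path that falls back to the RA-GRS secondary endpoint when the primary is unreachable. With RA-GRS, the secondary endpoint uses the account name with a `-secondary` suffix. Everything else here is an assumption for illustration: the account name `contosostore`, the `read_with_fallback` helper, and the injected `fetch` callable, which stands in for whatever HTTP or SDK client the application really uses.

```python
from typing import Callable, Tuple


def blob_endpoints(account: str) -> Tuple[str, str]:
    """Return the (primary, secondary) blob endpoints for an RA-GRS account."""
    primary = f"https://{account}.blob.core.windows.net"
    secondary = f"https://{account}-secondary.blob.core.windows.net"
    return primary, secondary


def read_with_fallback(account: str, path: str,
                       fetch: Callable[[str], bytes]) -> bytes:
    """Try the primary endpoint first; on a connection error, read the replica.

    The secondary is read-only and replicated asynchronously, so the data
    returned from it may lag behind the primary.
    """
    primary, secondary = blob_endpoints(account)
    try:
        return fetch(f"{primary}/{path}")
    except OSError:  # ConnectionError and TimeoutError are subclasses
        return fetch(f"{secondary}/{path}")


def fake_fetch(url: str) -> bytes:
    """Stand-in client that simulates an outage of the primary endpoint."""
    if "-secondary" not in url:
        raise ConnectionError("primary endpoint unavailable")
    return b"data from " + url.encode()


result = read_with_fallback("contosostore", "images/logo.png", fake_fetch)
print(result.decode())
```

A production client would normally add retries with backoff before falling back, since transient failures on the primary are more common than sustained outages.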

Manageability

If the primary database fails, perform a manual failover to the secondary database. The secondary database remains read-only until you fail over.

Pricing

Use the pricing calculator to estimate costs. The recommendations in this section may help you reduce costs.

Azure Front Door

Azure Front Door billing has three components: outbound data transfers, inbound data transfers, and routing rules. The pricing chart does not include the cost of accessing data from the backend services and transferring it to Front Door. Those costs are billed based on data transfer charges, described in Bandwidth Pricing Details.

Azure Cosmos DB

There are two factors that determine Azure Cosmos DB pricing:

The provisioned throughput or Request Units per second (RU/s). There are two types of throughput that can be provisioned in Cosmos DB: standard and autoscale. Standard throughput allocates the resources required to guarantee the RU/s that you specify. For autoscale, you provision the maximum throughput, and Cosmos DB instantly scales up or down depending on the load, with a minimum of 10% of the maximum autoscale throughput. Standard throughput is billed hourly for the throughput provisioned. Autoscale throughput is billed hourly for the maximum throughput consumed.
Consumed storage. You are billed a flat rate for the total amount of storage (GBs) consumed for data and the indexes for a given hour.
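The throughput billing rules above can be worked through with a small calculation. This is a rough sketch under stated assumptions: the function names and the sample workload are hypothetical, and per-RU/s rates are deliberately left out because they vary by region and throughput type, so the result is the billed RU/s per hour rather than a cost.

```python
from typing import List


def standard_billed_rus(provisioned_rus: int) -> int:
    """Standard throughput is billed for what is provisioned, used or not."""
    return provisioned_rus


def autoscale_billed_rus(max_rus: int, peak_rus_in_hour: int) -> int:
    """Autoscale is billed for the hour's peak, never below 10% of the maximum."""
    floor = max_rus // 10
    return min(max_rus, max(floor, peak_rus_in_hour))


MAX_RUS = 4000                                  # hypothetical autoscale maximum
hourly_peaks: List[int] = [200, 1500, 4000]     # hypothetical peak RU/s per hour

autoscale_hours = [autoscale_billed_rus(MAX_RUS, p) for p in hourly_peaks]
standard_hours = [standard_billed_rus(MAX_RUS) for _ in hourly_peaks]

print("autoscale billed RU/s per hour:", autoscale_hours)
print("standard billed RU/s per hour: ", standard_hours)
```

In the quiet first hour the autoscale charge is based on the 10% floor (400 RU/s) rather than the 200 RU/s actually consumed. For spiky workloads the billed RU/s totals diverge noticeably between the two types, which is the main input to compare once the applicable rates are known.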