Disaster Recovery and High Availability for SAP-C02: Strategies and Architectures

Disaster recovery (DR) and high availability (HA) are among the most critical topics on the AWS Certified Solutions Architect Professional (SAP-C02) exam. The exam tests your ability to choose the right DR strategy based on business requirements, design multi-region architectures, and evaluate whether a solution meets defined RTO and RPO targets.

This guide covers all four DR strategies, RTO/RPO analysis, multi-region architecture patterns, database replication, and Route 53 failover configurations — everything you need for this domain of the SAP-C02. For a broader overview of all exam domains, see our SAP-C02 exam guide.

Understanding RTO and RPO

Before diving into strategies, you must understand the two metrics that drive every DR decision.

Recovery Time Objective (RTO) is the maximum acceptable time between a disaster and full service restoration. It answers: “How long can we be down?”

Recovery Point Objective (RPO) is the maximum acceptable amount of data loss measured in time. It answers: “How much data can we afford to lose?”

Metric	Question It Answers	Example
RTO	How long until service is restored?	RTO of 4 hours means the system must be operational within 4 hours of a disaster
RPO	How much data can be lost?	RPO of 1 hour means you can lose at most 1 hour of data

On the SAP-C02, you will be given specific RTO and RPO requirements and asked to select the most cost-effective strategy that meets them. Understanding the cost-availability trade-off is essential.
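
As a minimal sketch of how the two metrics are compared against a design, the snippet below checks an estimated recovery time and data-loss window against stated targets. The names `RecoveryTargets` and `meets_targets` and all sample figures are illustrative assumptions, not part of any AWS API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    """Business requirements: how long we may be down, how much data we may lose."""
    rto: timedelta
    rpo: timedelta


def meets_targets(targets: RecoveryTargets,
                  estimated_recovery_time: timedelta,
                  estimated_data_loss: timedelta) -> bool:
    """A design passes only if BOTH estimates are within the targets."""
    return (estimated_recovery_time <= targets.rto
            and estimated_data_loss <= targets.rpo)


# Hypothetical requirement: RTO of 4 hours, RPO of 1 hour.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(hours=1))

# Hypothetical design A: restores in 3 hours but backs up only every 6 hours.
design_a_ok = meets_targets(targets, timedelta(hours=3), timedelta(hours=6))

# Hypothetical design B: restores in 30 minutes with about 1 minute of lost data.
design_b_ok = meets_targets(targets, timedelta(minutes=30), timedelta(minutes=1))

print(design_a_ok, design_b_ok)  # False True
```

Design A fails even though its recovery time is fine, because a design has to satisfy both metrics at once.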

The Four DR Strategies

AWS defines four DR strategies, ordered from lowest cost (and longest recovery) to highest cost (and fastest recovery).

Strategy 1: Backup and Restore

How it works: You regularly back up data and configurations to a durable store (usually S3 or AWS Backup). In a disaster, you restore from backups and rebuild infrastructure — often using Infrastructure as Code (CloudFormation, Terraform).

Typical RTO: Hours to days
Typical RPO: Hours (depends on backup frequency)
Cost: Lowest (you pay only for backup storage)
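
To make the link between backup frequency and RPO concrete, here is a rough sketch of the worst-case arithmetic. It assumes a simple model in which backups are taken at a fixed interval and each one needs a fixed time to be copied to the DR region; the function names and figures are hypothetical.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta,
                   copy_to_dr_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst case: the disaster hits just before the next backup lands in DR.

    Everything written since the last usable backup was taken is lost, so the
    loss window is the backup interval plus the time a backup needs to reach
    the DR region.
    """
    return backup_interval + copy_to_dr_duration


def max_backup_interval(rpo_target: timedelta,
                        copy_to_dr_duration: timedelta = timedelta(0)) -> timedelta:
    """Longest backup interval that keeps the worst case inside the RPO target."""
    interval = rpo_target - copy_to_dr_duration
    if interval <= timedelta(0):
        raise ValueError("copy time alone exceeds the RPO target")
    return interval


# Hypothetical: backups every 4 hours, 30 minutes to copy each one to DR.
print(worst_case_rpo(timedelta(hours=4), timedelta(minutes=30)))       # 4:30:00
print(max_backup_interval(timedelta(hours=4), timedelta(minutes=30)))  # 3:30:00
```

Under these assumptions, a 4-hour backup schedule does not quite meet a 4-hour RPO once copy time is included, which is the kind of mismatch scenario questions like to hide.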

Key AWS services:

AWS Backup for centralized backup management
S3 Cross-Region Replication for backup durability
CloudFormation or Terraform for infrastructure rebuild
AMI copies across regions

When to use: Non-critical workloads, development environments, or when budget is severely constrained.

Exam trap: If the question specifies an RTO of minutes, backup and restore is never the correct answer, regardless of cost constraints.

Strategy 2: Pilot Light

How it works: A minimal version of your core infrastructure is always running in the DR region. Typically, this means database replicas are active, but application and web tiers are provisioned but not running (AMIs ready, launch templates configured, but no instances running).

Typical RTO: Tens of minutes to hours
Typical RPO: Minutes (continuous database replication)
Cost: Low to moderate (you pay for running database replicas and minimal infrastructure)

Key AWS services:

RDS Cross-Region Read Replicas (promote on failover)
Aurora Global Database (typically under 1 second of replication lag)
EC2 AMIs pre-copied to DR region
Auto Scaling Groups with desired count set to 0 (scale up on failover)

When to use: Core business systems where recovery within an hour is acceptable and you want to minimize standing costs.

Strategy 3: Warm Standby

How it works: A scaled-down but fully functional version of your production environment runs continuously in the DR region. All tiers (web, application, database) are running but at reduced capacity. On failover, you scale up to production capacity.

Typical RTO: Minutes
Typical RPO: Seconds to minutes (continuous replication)
Cost: Moderate to high (you pay for continuously running, reduced-capacity infrastructure)

Key AWS services:

Aurora Global Database or RDS Cross-Region Read Replicas
Auto Scaling Groups running at minimum capacity
Elastic Load Balancers provisioned and health-checking
Route 53 health checks and failover routing

When to use: Business-critical systems that require recovery within minutes and can tolerate a brief period of reduced capacity during scale-up.
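
The practical difference between pilot light and warm standby shows up in how much has to be launched during failover. The sketch below models that with a hypothetical `TierCapacity` record; the tier names and instance counts are made up for illustration and do not call any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierCapacity:
    """Desired instance counts for one tier in the DR region (hypothetical model)."""
    name: str
    standby_desired: int      # what runs day to day in DR
    production_desired: int   # what must run after failover


def instances_to_launch(tiers: list[TierCapacity]) -> dict[str, int]:
    """How many instances each tier must add during failover."""
    return {t.name: t.production_desired - t.standby_desired for t in tiers}


def serves_traffic_before_scale_up(tiers: list[TierCapacity]) -> bool:
    """True only if every tier already has at least one running instance."""
    return all(t.standby_desired > 0 for t in tiers)


# Pilot light: Auto Scaling Groups exist but desired count is 0.
pilot_light = [TierCapacity("web", 0, 6), TierCapacity("app", 0, 4)]
# Warm standby: every tier runs, at reduced capacity.
warm_standby = [TierCapacity("web", 2, 6), TierCapacity("app", 1, 4)]

print(instances_to_launch(pilot_light))    # {'web': 6, 'app': 4}
print(instances_to_launch(warm_standby))   # {'web': 4, 'app': 3}
print(serves_traffic_before_scale_up(pilot_light))   # False
print(serves_traffic_before_scale_up(warm_standby))  # True
```

Warm standby can accept traffic immediately at reduced capacity, while pilot light has to boot every application instance first, which is where its longer RTO comes from.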

Strategy 4: Multi-Site Active-Active

How it works: Full production environments run simultaneously in two or more regions. Traffic is distributed across all regions at all times. There is no failover in the traditional sense — if one region fails, the others absorb its traffic.

Typical RTO: Near zero (seconds)
Typical RPO: Near zero (cross-region replication is usually asynchronous, with lag typically under a second)
Cost: Highest (you pay for full infrastructure in multiple regions)

Key AWS services:

DynamoDB Global Tables (multi-region, multi-active)
Aurora Global Database with write forwarding
Route 53 latency-based or weighted routing
Global Accelerator for performance optimization
S3 Cross-Region Replication (bidirectional)

When to use: Mission-critical systems with zero tolerance for downtime or data loss.

DR Strategy Comparison Table

This table is essential exam reference material:

Strategy	RTO	RPO	Cost	Running Infra in DR	Failover Automation
Backup & Restore	Hours to days	Hours	Lowest	None (backups only)	Manual or scripted
Pilot Light	10 min to hours	Minutes	Low-moderate	Database replicas only	Semi-automated
Warm Standby	Minutes	Seconds-minutes	Moderate-high	Scaled-down full stack	Automated
Multi-Site Active-Active	Near zero	Near zero	Highest	Full production	Automatic (no failover needed)

Exam strategy: When the question asks for the “most cost-effective” solution that meets specific RTO/RPO requirements, start from the cheapest strategy and work up until you find one that satisfies the requirements. If the question asks for “minimal downtime” or “business continuity,” lean toward warm standby or active-active.
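
The "start from the cheapest and work up" approach can be sketched as a simple loop. The recovery figures attached to each strategy below are illustrative assumptions chosen to sit inside the ranges in the comparison table; they are not AWS-published limits, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    best_rto: timedelta  # ASSUMED fastest recovery the strategy reliably achieves
    best_rpo: timedelta  # ASSUMED smallest data-loss window it reliably achieves


# Ordered cheapest first. All figures are assumptions for illustration only.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=4), timedelta(hours=1)),
    Strategy("Pilot Light", timedelta(minutes=30), timedelta(minutes=1)),
    Strategy("Warm Standby", timedelta(minutes=5), timedelta(seconds=30)),
    Strategy("Multi-Site Active-Active", timedelta(seconds=10), timedelta(seconds=1)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that satisfies both targets."""
    for strategy in STRATEGIES:
        if strategy.best_rto <= rto and strategy.best_rpo <= rpo:
            return strategy
    return None  # nothing in the list meets the targets


choice = cheapest_strategy(rto=timedelta(minutes=15), rpo=timedelta(minutes=1))
print(choice.name if choice else "no strategy fits")  # Warm Standby
```

Because the list is ordered by cost, the first match is also the most cost-effective one under these assumed figures.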

Multi-Region Architecture Patterns

Data Replication Across Regions

Choosing the right data replication strategy is central to DR design. Here are the key options:

Amazon Aurora Global Database

Typical cross-region replication lag of under 1 second
Up to 5 secondary regions
Managed planned failover (RPO = 0) and unplanned failover (RPO in seconds)
Write forwarding allows secondary regions to forward writes to the primary

DynamoDB Global Tables

Multi-region, multi-active (read and write in any region)
Eventually consistent replication (typically under 1 second)
No primary/secondary distinction: all replicas are active
Best for active-active patterns

RDS Cross-Region Read Replicas

Asynchronous replication
Can be promoted to standalone instance (minutes of downtime)
Available for MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
Higher replication lag than Aurora Global Database

S3 Cross-Region Replication (CRR)

Object-level replication between S3 buckets in different regions
Can be configured for entire bucket or prefix/tag-based filtering
Supports bidirectional replication for active-active patterns
Replication Time Control (RTC) replicates 99.99% of new objects within 15 minutes, backed by an SLA

Stateful vs. Stateless Components

When designing multi-region DR, treat components differently based on their state:

Stateless components (web servers, API servers, Lambda functions): Deploy identically in both regions using IaC. No data synchronization needed. Scale independently.

Stateful components (databases, caches, session stores): Require active replication. Choose replication method based on RPO requirements. Consider data consistency implications.

Shared state (user sessions, shopping carts): Use DynamoDB Global Tables or ElastiCache Global Datastore for cross-region session management.

Route 53 Failover Patterns

Route 53 is the traffic management layer for most DR architectures. The SAP-C02 tests your knowledge of routing policies extensively.

Failover Routing Policy

The simplest DR pattern. Route 53 directs traffic to a primary resource and automatically fails over to a secondary resource when the primary is unhealthy.

Requires health checks on the primary endpoint
Can be combined with Evaluate Target Health for alias records
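Failover typically completes within 1-2 minutes of the primary becoming unhealthy; the exact time depends on the health check request interval (10 or 30 seconds), the failure threshold, and the record TTL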
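
As a rough illustration of what a failover pair looks like, the sketch below builds a dictionary in the shape of the `ChangeBatch` parameter accepted by Route 53's ChangeResourceRecordSets API (for example through boto3's `change_resource_record_sets`). It only constructs the structure and makes no AWS call. The domain, IP addresses, health check ID, and the `failover_record` helper are placeholders invented for this example.

```python
from typing import Optional


def failover_record(name: str, role: str, ip_address: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one change entry for a failover-routed A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-{ip_address}",  # unique per record
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the failover sooner
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Failover pair for the application endpoint",
    "Changes": [
        # The primary carries the health check that triggers failover.
        failover_record("app.example.com.", "PRIMARY", "192.0.2.10",
                        health_check_id="hypothetical-health-check-id"),
        failover_record("app.example.com.", "SECONDARY", "198.51.100.20"),
    ],
}

for change in change_batch["Changes"]:
    rrset = change["ResourceRecordSet"]
    print(rrset["Failover"], rrset["ResourceRecords"][0]["Value"])
```

Both records share the same name and type; the `Failover` value and the health check on the primary are what turn them into a failover pair.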

Latency-Based Routing

Routes users to the region with the lowest latency. In an active-active setup, this distributes traffic geographically. If a region fails and its health check fails, Route 53 removes it from responses.
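
The selection logic can be pictured with a small model: drop unhealthy regions, then pick the lowest latency among the rest. This is a simplified sketch, not Route 53's implementation, and it does not reproduce edge cases such as every record being unhealthy. The latency figures and the `pick_region` helper are hypothetical.

```python
from typing import Optional


def pick_region(latency_ms: dict[str, float],
                healthy: dict[str, bool]) -> Optional[str]:
    """Return the healthy region with the lowest latency, or None if none is healthy."""
    candidates = {region: ms for region, ms in latency_ms.items()
                  if healthy.get(region, False)}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical latencies measured from one user's location.
latency = {"us-east-1": 20.0, "eu-west-1": 95.0}

print(pick_region(latency, {"us-east-1": True, "eu-west-1": True}))   # us-east-1
print(pick_region(latency, {"us-east-1": False, "eu-west-1": True}))  # eu-west-1
```

When the nearest region is unhealthy, the user is served from the next-best region at higher latency, which is the active-active behavior described above.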

Weighted Routing

Distributes traffic by percentage across endpoints. Useful for gradual failover (e.g., shift 10% of traffic to the DR region, then 50%, then 100%) or canary deployments.
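
With weighted routing, each record receives a share of traffic equal to its weight divided by the sum of all weights. The sketch below works through the gradual shift mentioned above; the record names and the `traffic_share` helper are illustrative, and the all-zero-weights case is simply rejected here rather than modeled.

```python
def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Share of traffic per record: weight / sum of all weights."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {name: weight / total for name, weight in weights.items()}


# Gradual failover: 10% to DR, then 50%, then 100%.
SHIFT_STEPS = [
    {"primary": 90, "dr": 10},
    {"primary": 50, "dr": 50},
    {"primary": 0, "dr": 100},
]

for step in SHIFT_STEPS:
    print(traffic_share(step))
```

Only the ratio between weights matters, so weights of 9 and 1 produce the same split as 90 and 10.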

Geolocation and Geoproximity Routing

Geolocation: Routes based on the user’s geographic location. Useful for compliance (for example, serving EU users only from EU endpoints).
Geoproximity: Routes based on physical distance, with configurable bias to shift traffic between regions.

Health Check Configuration

Route 53 health checks are critical for automated failover:

Health Check Type	What It Monitors	Use Case
Endpoint	HTTP/HTTPS/TCP endpoint	Direct service health
Calculated	Aggregates other health checks (AND/OR logic)	Complex multi-component health
CloudWatch Alarm	CloudWatch alarm state	Metric-based failover (CPU, latency, error rate)
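
For a feel of how health check settings translate into failover time, here is a back-of-the-envelope estimate. It assumes an endpoint is marked unhealthy after a number of consecutive failed checks (the failure threshold) and that clients keep using the old answer until the record TTL expires. Treat it as an approximation only; `estimated_failover_seconds` is a hypothetical helper.

```python
def estimated_failover_seconds(request_interval: int,
                               failure_threshold: int,
                               record_ttl: int) -> int:
    """Rough estimate: time to detect the failure plus time for DNS caches to expire."""
    if request_interval not in (10, 30):
        raise ValueError("health check interval is either 10 or 30 seconds")
    if failure_threshold < 1:
        raise ValueError("failure threshold must be at least 1")
    detection = request_interval * failure_threshold
    return detection + record_ttl


# Standard interval, 3 consecutive failures, 60-second TTL.
print(estimated_failover_seconds(30, 3, 60))  # 150
# Fast interval with the same threshold and TTL.
print(estimated_failover_seconds(10, 3, 60))  # 90
```

The estimate shows why a long TTL can dominate failover time even when health checks detect the failure quickly.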

Exam tip: Calculated health checks are the answer when the question describes a scenario where failover should occur only when multiple components fail, not just one endpoint.
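
The AND/OR logic of a calculated health check boils down to a threshold: the parent is healthy when at least a given number of child checks are healthy. A minimal sketch of that rule follows; the component names are hypothetical and no AWS API is called.

```python
def calculated_health(child_statuses: list[bool], health_threshold: int) -> bool:
    """Parent is healthy when at least `health_threshold` children are healthy."""
    return sum(child_statuses) >= health_threshold


# Hypothetical children: web tier, API tier, database.
one_failure = [True, False, True]
two_failures = [True, False, False]

# Threshold of 2 out of 3: a single failing component does not trigger failover.
print(calculated_health(one_failure, 2))   # True  (stay on primary)
print(calculated_health(two_failures, 2))  # False (fail over)

# AND logic: threshold equals the number of children.
print(calculated_health(one_failure, 3))   # False
# OR logic: threshold of 1.
print(calculated_health(two_failures, 1))  # True
```

Attaching the parent check to the primary failover record means failover is driven by the combined result rather than by any single endpoint.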

Database-Specific DR Patterns

Aurora Global Database Failover

Planned failover (switchover):

Zero data loss (RPO = 0)
Demotes the primary to secondary, promotes a secondary to primary
Takes 1-2 minutes
Use for planned maintenance or region migration

Unplanned failover (detach and promote):

RPO is typically 1 second (replication lag)
Detaches a secondary region and promotes it to standalone
Application connection strings must be updated (use RDS Proxy or Route 53 CNAME for automation)
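
Because the data lost in an unplanned failover is bounded by the replication lag at the moment of failure, it helps to compare observed lag against the RPO target. The sketch below does this for a list of lag samples in milliseconds; the sample values and the `fraction_within_rpo` helper are made up for illustration.

```python
def fraction_within_rpo(lag_samples_ms: list[float], rpo_seconds: float) -> float:
    """Fraction of lag samples at or below the RPO target."""
    if not lag_samples_ms:
        raise ValueError("need at least one sample")
    within = sum(1 for lag in lag_samples_ms if lag / 1000 <= rpo_seconds)
    return within / len(lag_samples_ms)


def worst_lag_seconds(lag_samples_ms: list[float]) -> float:
    """Largest observed lag: the data-loss window if failure had hit at that moment."""
    return max(lag_samples_ms) / 1000


# Hypothetical replication lag samples in milliseconds.
samples = [180.0, 220.0, 950.0, 3400.0, 210.0]

print(fraction_within_rpo(samples, rpo_seconds=1.0))  # 0.8
print(worst_lag_seconds(samples))                     # 3.4
```

A typical lag under one second does not guarantee a one-second RPO; the spike in the sample data is what the RPO has to survive.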

DynamoDB Global Tables

DynamoDB Global Tables provide automatic multi-region replication with conflict resolution (last writer wins). Key considerations:

All replicas accept writes simultaneously
Conflict resolution is automatic, but replication between regions is eventually consistent
Adding a new region to an existing table is supported (replication starts automatically)
Capacity planning must account for replicated write capacity in each region
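
Last-writer-wins can be illustrated with a toy model: when two regions write the same item concurrently, the version with the later timestamp is kept and the other is discarded. This is a simplified sketch, not DynamoDB's actual implementation; the `ItemVersion` type, the tie-breaking rule, and the sample values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemVersion:
    value: str
    written_at: float  # timestamp of the write, in seconds
    region: str


def last_writer_wins(a: ItemVersion, b: ItemVersion) -> ItemVersion:
    """Keep the version with the later timestamp (ties go to `a`, an assumption)."""
    return a if a.written_at >= b.written_at else b


# Two regions update the same shopping cart within half a second of each other.
us = ItemVersion("cart=[book]", written_at=1700000000.0, region="us-east-1")
eu = ItemVersion("cart=[book, pen]", written_at=1700000000.4, region="eu-west-1")

winner = last_writer_wins(us, eu)
print(winner.region)  # eu-west-1
print(winner.value)   # cart=[book, pen]
```

The earlier write is silently overwritten rather than merged, so applications that cannot tolerate lost updates need to avoid concurrent writes to the same item from different regions.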

ElastiCache Global Datastore

For Redis-based caching layers:

Cross-region replication with sub-second lag
One primary region (read-write), secondary regions (read-only)
Promotion of secondary to primary takes minutes
Useful for session store DR

High Availability Within a Single Region

While DR focuses on cross-region resilience, high availability within a single region is equally important for SAP-C02.

Multi-AZ Patterns

Service	Multi-AZ Behavior
EC2 + ALB	Distribute instances across AZs; ALB routes around unhealthy targets
RDS Multi-AZ	Synchronous standby in another AZ; automatic failover in 1-2 minutes
Aurora	Up to 15 read replicas across 3 AZs; storage replicated 6 ways across 3 AZs
ElastiCache	Multi-AZ with automatic failover for Redis clusters
EFS	Automatically stores data across multiple AZs
S3	Automatically stores data across a minimum of 3 AZs
DynamoDB	Automatically replicates across 3 AZs

Auto Scaling for Resilience

Auto Scaling Groups contribute to HA by:

Replacing unhealthy instances automatically
Distributing instances across specified AZs
Maintaining desired capacity even during AZ failures

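Exam tip: Configure Auto Scaling Groups across at least 3 AZs. If an AZ fails, the remaining AZs should have enough capacity headroom to carry the full load. With 3 AZs, losing one removes a third of your capacity, so each AZ must be sized for half of peak load, which means provisioning about 150% of peak in total (N+1 across 3 AZs).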
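
The headroom arithmetic can be sketched as follows. Losing one of N equally loaded AZs removes 1/N of capacity, and for the survivors to carry the full peak, each AZ has to be sized for peak / (N - 1). With three AZs that is a third of capacity lost and 50% extra capacity provisioned up front. The helper names and the 12-instance peak are hypothetical.

```python
import math


def capacity_lost_fraction(az_count: int) -> float:
    """Fraction of capacity that disappears when one AZ fails."""
    return 1 / az_count


def required_overprovision(az_count: int) -> float:
    """Extra capacity, as a fraction of peak, needed to survive one AZ failure."""
    if az_count < 2:
        raise ValueError("need at least 2 AZs to survive an AZ failure")
    return 1 / (az_count - 1)


def instances_per_az(peak_instances: int, az_count: int) -> int:
    """Instances each AZ must run so the remaining AZs still cover peak load."""
    if az_count < 2:
        raise ValueError("need at least 2 AZs to survive an AZ failure")
    return math.ceil(peak_instances / (az_count - 1))


per_az = instances_per_az(peak_instances=12, az_count=3)
print(per_az)                     # 6
print(per_az * 3)                 # 18 instances in total for a peak of 12
print(required_overprovision(3))  # 0.5
```

Adding more AZs lowers the overhead: with four AZs the same calculation gives roughly 33% extra capacity instead of 50%.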

Real-World Exam Scenarios

Scenario 1: E-Commerce Platform DR

Requirements: RTO of 15 minutes, RPO of 1 minute, cost-optimized.

Correct strategy: Warm standby. Pilot light cannot reliably achieve 15-minute RTO (starting application tiers takes time). Active-active exceeds requirements and costs more.

Architecture: Aurora Global Database in DR region, scaled-down Auto Scaling Groups, pre-provisioned ALB, Route 53 failover routing with health checks.

Scenario 2: Financial Trading Platform

Requirements: Near-zero RTO, zero data loss, regulatory requirement for two active regions.

Correct strategy: Multi-site active-active. The regulatory requirement and near-zero targets make this the only viable option.

Architecture: DynamoDB Global Tables, application deployed at full scale in two regions, Route 53 latency-based routing, Global Accelerator for TCP optimization.

Scenario 3: Internal Analytics Platform

Requirements: RTO of 24 hours, RPO of 4 hours, minimal budget.

Correct strategy: Backup and restore. The relaxed RTO/RPO allows the cheapest approach.

Architecture: AWS Backup with cross-region copy rules (4-hour frequency), CloudFormation templates stored in S3, AMIs copied to DR region nightly.

Exam Tips for DR and HA Questions

Always match the strategy to the specific RTO/RPO numbers in the question — do not over-engineer
“Cost-effective” means choose the cheapest strategy that meets the requirements
Aurora Global Database is the go-to for relational database DR with near-zero RPO
DynamoDB Global Tables is the go-to for active-active NoSQL scenarios
Route 53 failover routing is the standard DNS-based failover mechanism
Multi-AZ is HA; multi-region is DR — the exam distinguishes between these
Remember that RDS Multi-AZ failover takes 1-2 minutes, not zero
For “automate failover” questions, look for Route 53 health checks + failover routing or Global Accelerator

Frequently Asked Questions

What is the difference between high availability and disaster recovery?

High availability (HA) minimizes downtime within a single region by distributing workloads across multiple Availability Zones. Disaster recovery (DR) ensures business continuity when an entire region or site becomes unavailable. HA addresses component failures; DR addresses catastrophic failures.

Which DR strategy should I choose for the SAP-C02 exam?

The exam expects you to choose based on stated RTO, RPO, and cost requirements. Match the cheapest strategy that satisfies all requirements. Never over-engineer unless the question specifically asks for maximum resilience regardless of cost.

What is the replication lag for Aurora Global Database?

Aurora Global Database typically has under 1 second of replication lag between the primary and secondary regions. Planned failover achieves zero data loss (RPO = 0). Unplanned failover RPO is typically 1 second based on replication lag at the time of failure.

Can DynamoDB Global Tables handle write conflicts?

Yes. DynamoDB Global Tables use a last-writer-wins reconciliation strategy based on timestamps. All replicas accept writes simultaneously, and conflicts are resolved automatically. This means the most recently written value takes precedence.

How does Route 53 detect a regional failure?

Route 53 uses health checks that continuously monitor endpoint availability. Health checks can monitor HTTP/HTTPS/TCP endpoints directly, aggregate multiple health checks using calculated health checks, or monitor CloudWatch alarm states. When a health check fails, Route 53 stops routing traffic to the unhealthy endpoint.

What is the cost difference between pilot light and warm standby?

Pilot light keeps only database replicas and minimal infrastructure running, so you pay primarily for database instance costs. Warm standby runs a scaled-down but complete environment (compute, load balancers, databases), which costs significantly more. The exact difference depends on your architecture, but warm standby typically costs 3-5x more than pilot light.

How do I handle stateful sessions in a multi-region active-active architecture?

Use DynamoDB Global Tables or ElastiCache Global Datastore for cross-region session management. Store session data externally (not on the instance) so any region can serve any user. DynamoDB Global Tables is the preferred option for active-active because all replicas accept reads and writes.

Does the SAP-C02 exam test specific RTO/RPO calculations?

The exam does not require mathematical calculations, but it tests your ability to evaluate whether a proposed architecture meets stated RTO and RPO requirements. You need to understand the typical RTO and RPO achievable with each strategy and service, and identify mismatches in scenario-based questions.

Conclusion

Disaster recovery and high availability design is a core competency for the SAP-C02 exam. Understanding the four DR strategies — their trade-offs, costs, and achievable RTO/RPO — allows you to quickly eliminate wrong answers and select the correct architecture for any given scenario. Combined with deep knowledge of multi-account architecture and advanced networking, DR patterns form the backbone of professional-level AWS architecture.

Practice applying these strategies to realistic scenarios with Sailor.sh’s SAP-C02 mock exams. The exam presents complex, multi-constraint scenarios where you must balance RTO, RPO, cost, and operational overhead — and the best way to build that skill is through deliberate practice.