Building Highly Resilient Systems with DynamoDB: Lessons from Amazon Ads
How Amazon Ads migrated to DynamoDB for higher resilience
Anyone who has operated a production system long enough has lived through the same failures: a disk dies in the middle of the night, a database can’t keep up with a sudden traffic surge, or a critical patch requires planned downtime.
These incidents teach a hard lesson: an application can only be as resilient as the database that supports it.
At AWS re:Invent, teams from Amazon DynamoDB and Amazon Ads shared a detailed look into what resilience really means at scale and how DynamoDB enables systems to remain available, elastic, and operationally simple even under extreme conditions.
Their experience offers practical guidance for anyone building large, customer-facing systems.
What Resilience Really Means in Practice
Resilience is often confused with disaster recovery, but the two are not the same.
Disaster recovery focuses on how a system recovers after a failure, while resilience is about continuing to operate despite change. That change can come from infrastructure failures, traffic spikes, software updates, or even human error.
To reason about resilience, the industry typically relies on two metrics.
RPO: Recovery Point Objective
RTO: Recovery Time Objective
RPO defines how much data loss is acceptable when something goes wrong, while RTO defines how quickly a system must return to normal operation.
Lower RPOs and RTOs provide better guarantees but come with higher cost and complexity. A development environment may tolerate hours of data loss, while a revenue-critical system may not tolerate seconds.
AWS frames these tradeoffs through a small set of resilience strategies:
backup and restore
pilot light
warm standby
active/active
Each step along this spectrum reduces downtime and data loss while increasing operational and financial cost.
The idea here is that resilience is not binary. Systems must be designed to meet the resilience level they actually need.
DynamoDB as a Foundation for Resilient Architecture
DynamoDB starts from a strong baseline. It is a serverless database, meaning there are no servers to provision, patch, or scale.
Every write is synchronously stored across multiple Availability Zones before it is acknowledged, and updates are applied with zero downtime. These properties eliminate entire classes of failures that plague self-managed databases.
Capacity management is another critical part of resilience. DynamoDB offers both provisioned and on-demand capacity modes.
Provisioned capacity works well for predictable workloads, especially when paired with auto scaling.
On-demand capacity, however, is best when traffic is unpredictable or spiky.
DynamoDB automatically scales throughput up or down based on demand, allowing systems to absorb sudden surges without manual intervention.
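As a concrete illustration of the on-demand mode described above, here is a minimal boto3 sketch. The table and attribute names are assumptions for the example, not taken from the Amazon Ads system; the client is injected so the function can be exercised without AWS credentials.

```python
def create_events_table(client, table_name="ad-events"):
    """Create a table in on-demand capacity mode.

    BillingMode=PAY_PER_REQUEST selects on-demand capacity, so no
    read/write throughput has to be provisioned or auto-scaled;
    DynamoDB absorbs traffic surges automatically.
    """
    return client.create_table(
        TableName=table_name,
        AttributeDefinitions=[
            {"AttributeName": "pk", "AttributeType": "S"},
            {"AttributeName": "sk", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "pk", "KeyType": "HASH"},
            {"AttributeName": "sk", "KeyType": "RANGE"},
        ],
        BillingMode="PAY_PER_REQUEST",  # on-demand mode
    )

# usage (requires AWS credentials):
#   import boto3
#   create_events_table(boto3.client("dynamodb"))
```

Switching the same table back to provisioned capacity for a predictable workload is a single `update_table` call changing the billing mode.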
For recovery and data protection, DynamoDB provides point-in-time recovery (PITR), which continuously tracks changes for 35 days and allows restores to any second in that window.
This protects against accidental overwrites and bad deployments, not just infrastructure failures. For multi-region resilience, DynamoDB Global Tables replicate data across regions in a multi-active model.
Applications can write to any region and shift traffic without performing failovers, dramatically increasing availability.
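The PITR workflow above can be sketched with two boto3 calls: one to enable continuous backups, one to restore the table as it was before a bad deployment. Table names and the 30-minute window are illustrative assumptions; the client is injected for testability.

```python
from datetime import datetime, timedelta, timezone

def enable_pitr(client, table_name):
    """Turn on continuous backups, making any second in the
    trailing 35-day window restorable."""
    return client.update_continuous_backups(
        TableName=table_name,
        PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
    )

def restore_before_bad_deploy(client, source, target, minutes_ago=30):
    """Restore the table into a new table as it was `minutes_ago`
    minutes earlier, e.g. just before a faulty release went out."""
    restore_time = datetime.now(timezone.utc) - timedelta(minutes=minutes_ago)
    return client.restore_table_to_point_in_time(
        SourceTableName=source,
        TargetTableName=target,
        RestoreDateTime=restore_time,
    )

# usage (requires AWS credentials):
#   import boto3
#   client = boto3.client("dynamodb")
#   enable_pitr(client, "ad-events")
#   restore_before_bad_deploy(client, "ad-events", "ad-events-restored")
```

Note that PITR restores always create a new table; the application then repoints reads at it, which pairs naturally with the service-layer indirection described later.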
These features form the building blocks, but resilience ultimately depends on how they are used.
How Amazon Ads Achieved Resilience at Massive Scale
Amazon Ads operates one of the highest-traffic data pipelines in the world. Its attribution system processes more than 100 billion events per day, stores petabytes of data, and must respond in near real time.
Originally, this system relied on large HBase clusters running on EC2. Despite heavy automation, the setup required constant operational effort, frequent scaling ahead of peak events, and manual intervention during failures.
Unplanned outages were common, and engineers were regularly paged during off-hours.
The team set out to migrate this workload to DynamoDB with four strict requirements: zero downtime, equal or better latency, a fully managed solution, and completion within a single month.
They introduced a service layer to decouple storage from ingestion, enabled dual writes to HBase and DynamoDB, and validated data and latency in parallel.
Once confidence was established, they completed the cutover and decommissioned the legacy database.
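The talk did not publish the migration code, but the dual-write pattern it describes can be sketched as follows, with plain dictionaries standing in for the HBase and DynamoDB clients. The class and field names are assumptions for illustration.

```python
class DualWriteStore:
    """Service-layer sketch of the migration: write to both stores,
    read from the legacy store, and shadow-compare against DynamoDB
    to build confidence before cutover."""

    def __init__(self, legacy, dynamo):
        self.legacy = legacy      # stand-in for the HBase client
        self.dynamo = dynamo      # stand-in for the DynamoDB table
        self.mismatches = 0       # validation counter watched during migration
        self.cutover = False      # flipped once parity is established

    def put(self, key, value):
        # Dual write: every record lands in both stores.
        self.legacy[key] = value
        self.dynamo[key] = value

    def get(self, key):
        new = self.dynamo.get(key)
        if self.cutover:
            return new            # post-cutover: DynamoDB is authoritative
        old = self.legacy.get(key)
        if old != new:
            self.mismatches += 1  # shadow-read validation
        return old                # pre-cutover: legacy stays authoritative

store = DualWriteStore(legacy={}, dynamo={})
store.put("event#1", {"clicks": 3})
assert store.get("event#1") == {"clicks": 3}
store.cutover = True  # final step before decommissioning the legacy database
```

The key property is that the cutover is a flag flip in the service layer, not a data move, which is what makes zero-downtime migration possible.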
The results were dramatic.
Availability increased from four nines to five nines after enabling Global Tables, cutting the yearly downtime budget from roughly 52.6 minutes to about 5.3 minutes (around 47.3 fewer minutes of downtime per year, a 10x reduction).
Developer onboarding dropped from months to weeks, operational tickets fell by 40%, and the system survived Prime Day, Black Friday, and global sporting events without a single operational page [1].
Most importantly, the migration was cost-neutral compared to the self-managed solution.
A key part of this success was data modeling.
Rather than optimizing for minimal tables, the team optimized for throughput and operational simplicity. They adopted a fully time-sorted access pattern, replicated data to increase read throughput, used GSIs as read replicas, and leveraged Global Tables for regional redundancy.
Instead of a single monolithic table, they treated DynamoDB as a set of elastic shards, allowing capacity and replication to be adjusted incrementally without large, irreversible changes.
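The time-sorted access pattern mentioned above can be sketched as a key design: partition by advertiser, sort by event timestamp so range queries come back in time order. The key layout, attribute names, and `ADV#`/`TS#` prefixes are illustrative assumptions, not the actual Amazon Ads schema.

```python
def event_key(advertiser_id, ts_iso, event_id):
    """Build pk/sk so events sort chronologically within a partition.

    ISO-8601 timestamps sort lexicographically, so the sort key
    gives a fully time-sorted item order for free.
    """
    return {
        "pk": f"ADV#{advertiser_id}",
        "sk": f"TS#{ts_iso}#{event_id}",
    }

def query_window(table, advertiser_id, start_iso, end_iso):
    """Fetch one advertiser's events in a time window
    (expects a boto3 Table resource)."""
    from boto3.dynamodb.conditions import Key  # local import: optional dependency
    return table.query(
        KeyConditionExpression=(
            Key("pk").eq(f"ADV#{advertiser_id}")
            # trailing "~" sorts after any event_id suffix, making the
            # upper bound inclusive of the whole end second
            & Key("sk").between(f"TS#{start_iso}", f"TS#{end_iso}~")
        ),
    )
```

Because each advertiser is its own partition, hot advertisers can be further split or replicated, which is what lets the table behave like a set of elastic shards rather than one monolith.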
Conclusion: Resilience Is an Architectural Choice
Resilience does not come from a single feature or configuration. It comes from deliberate architectural decisions that balance availability, cost, and operational complexity.
DynamoDB provides powerful tools (serverless scaling, multi-AZ durability, point-in-time recovery, and global replication), but it is still the responsibility of the application architect to use them correctly.
Amazon Ads’ experience shows that resilience is not just about surviving failures. It’s about eliminating operational burden, enabling rapid growth, and allowing teams to focus on business logic instead of infrastructure emergencies.
When resilience is designed in from the start, systems don't just stay online; they stay boring, and boring is exactly what you want in production.
👋 My name is Uriel Bitton and I hope you learned something in this edition of Excelling With DynamoDB.
📅 If you're looking for help with DynamoDB, let's have a quick chat.
🙌 I hope to see you in next week's edition!