Why the Internet Broke and What We Can Learn From It

The Day the Web Vanished

The morning of October 20th, 2025, started like any other for millions of users around the world. Then, one by one, apps stopped responding. Banking platforms timed out. Video calls froze. Smart devices ignored commands. For some, it felt like their phone was broken. For others, the world simply went quiet.

This wasn’t a glitch. It was a widespread infrastructure failure.

Amazon Web Services (AWS), the world’s largest cloud provider, experienced a major outage in its Northern Virginia data centres. That single region, US-EAST-1, hosts a massive share of the global internet’s operational backbone. In the space of a few hours, over 1,000 digital platforms became unusable, including Snapchat, Reddit, Hulu, Venmo, Alexa, and several global banks. Companies relying on AWS for backend systems faced transaction errors, data access failures and total service blackouts.

The disruption exposed how fragile modern digital systems become when they are concentrated in a handful of cloud infrastructure nodes. Even multinational firms with large IT budgets did not see it coming.

What Went Wrong

AWS confirmed the issue stemmed from a network connectivity error that impacted its Elastic Compute Cloud (EC2) and other core services in the US-EAST-1 region. DNS resolution failures, particularly those linked to DynamoDB API endpoints, created a ripple effect: dependent services could no longer resolve or reach the endpoints they relied on.
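To a client application, this kind of failure typically surfaces as endpoint resolution or connection errors rather than an explicit outage notice. The sketch below illustrates how a service might detect such errors and back off rather than fail instantly; the table name, region, and retry settings are illustrative assumptions, not details from AWS's incident report.

```python
# Hypothetical sketch: detecting and backing off from endpoint/DNS failures.
# Table name, region, and retry settings are illustrative assumptions.
import time

import boto3
from botocore.exceptions import EndpointConnectionError

def get_item_with_backoff(table_name, key, region="us-east-1", max_retries=5):
    """Read one item, retrying with exponential backoff on connectivity errors."""
    dynamodb = boto3.resource("dynamodb", region_name=region)
    table = dynamodb.Table(table_name)
    for attempt in range(max_retries):
        try:
            return table.get_item(Key=key).get("Item")
        except EndpointConnectionError:
            # DNS/endpoint resolution failed: wait, then retry (1s, 2s, 4s, ...).
            time.sleep(2 ** attempt)
    raise RuntimeError(f"{table_name} unreachable in {region} after {max_retries} attempts")
```

Backoff alone does not survive a multi-hour regional outage, but it does stop transient resolution failures from cascading into immediate application errors.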

Amazon commands around 30% of the global cloud services market. Its dominance, while commercially successful, creates a single point of failure. When one region falters, businesses on every continent feel the impact.

The Global Impact

This outage didn’t stay within the walls of Silicon Valley or an American cloud farm. Its reach extended across industries and continents.

In the UK and Europe, app-based banking was interrupted. Several NHS services reliant on cloud platforms had limited functionality, delaying healthcare responses. In India, users faced service errors on WhatsApp and Paytm. Across Asia and Africa, smaller startups found themselves unable to access backend tools, leaving their users in the dark.

Payment systems in South America timed out during peak business hours. In Australia, service dashboards monitoring emergency response systems reported blind spots. IoT-based home security feeds shut off mid-stream. What was initially reported as a U.S. outage became a global technology incident.

How Brands Were Affected

Snapchat went offline during peak hours in Europe. Ring’s security cameras stopped streaming live footage. Online retailers experienced cart abandonments, failed checkouts and inventory mismatches.

The technical root cause may have been external, but the disruption affected user trust, brand perception, and daily business functions.

Build a Resilient Cloud Strategy

Diversify your cloud infrastructure to limit exposure to single-region outages. Consider multi-region deployments, even within the same cloud provider, and spread workloads across multiple availability zones so that the failure of a single data centre never takes down the entire system.
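One way to express this at the application level is an ordered list of regions with automatic failover: if the primary region's endpoint stops responding, the client retries the same call against a standby. The region order below is an assumption, and the sketch presumes the data is already replicated to each standby region (for example via DynamoDB global tables); client-side failover is the easy half of the problem.

```python
# Illustrative multi-region failover for a read path; region order is an assumption.
import boto3
from botocore.config import Config
from botocore.exceptions import EndpointConnectionError, ConnectTimeoutError

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # primary first, standbys after

def read_item(table_name, key):
    """Try each region in order; return the first successful read."""
    for region in REGIONS:
        dynamodb = boto3.resource(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=3, retries={"max_attempts": 2}),
        )
        try:
            return dynamodb.Table(table_name).get_item(Key=key).get("Item")
        except (EndpointConnectionError, ConnectTimeoutError):
            continue  # region unreachable: fall through to the next one
    raise RuntimeError("all configured regions are unreachable")
```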

Content delivery networks (CDNs) can serve static assets and absorb load that would otherwise fall on your core infrastructure. Because cached content remains available at the edge, they can preserve basic functionality even when primary systems fail.

Businesses running hybrid or multi-cloud strategies appear to have fared best. The highest risk was borne by those relying solely on the AWS US-EAST-1 region.

Audit and Document Your Dependencies

Map all services that rely on cloud infrastructure, down to specific regions and APIs. Many businesses discovered the hard way that some vendors were indirectly tied to US-EAST-1. Understand where these silent dependencies exist.
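A lightweight starting point is a machine-readable inventory of external dependencies that can be probed on a schedule. The entries below are hypothetical examples of the idea, not a complete dependency scanner:

```python
# Hypothetical dependency inventory: each entry names the vendor, the region it
# depends on, and a hostname that can be probed for reachability.
import socket

DEPENDENCIES = [
    {"vendor": "payments-api",  "region": "us-east-1", "host": "api.payments.example.com"},
    {"vendor": "email-service", "region": "eu-west-1", "host": "mail.example.net"},
]

def audit(deps):
    """Resolve each host via DNS and report which dependencies are reachable."""
    for dep in deps:
        try:
            addr = socket.gethostbyname(dep["host"])
            status = f"resolves to {addr}"
        except socket.gaierror:
            status = "DNS resolution FAILED"  # the same failure mode seen in this outage
        print(f'{dep["vendor"]:15} ({dep["region"]:10}) {status}')

if __name__ == "__main__":
    audit(DEPENDENCIES)
```

Even a list this simple, kept current, answers the first question of any incident: which of our vendors sit behind the region that just failed?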

Keep clear documentation of these dependencies. When outages happen, knowing what’s impacted helps reduce downtime. This also aids in vendor accountability and improves response coordination.

Prepare Internal Teams

Crisis management is not just an IT function. Cross-functional response plans are essential. Marketing, operations, and support teams need to know how to act when core tools go offline.

Create fallback communication systems. When Slack or Teams is down, internal coordination should continue. Whether through SMS, phone trees, or alternative platforms, ensure team connectivity remains intact.
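As a sketch, a fallback notifier might try the primary chat channel first and degrade to SMS when it fails. The webhook URL and phone numbers below are placeholders, and the SMS function is a stub for whatever gateway your organisation already uses:

```python
# Sketch of a tiered internal alert: primary chat webhook first, SMS fallback second.
# SLACK_WEBHOOK and ONCALL_NUMBERS are placeholders for your own configuration.
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
ONCALL_NUMBERS = ["+447700900001", "+447700900002"]             # placeholder

def send_sms(number, text):
    """Stub: wire this to your SMS gateway (Twilio, carrier API, etc.)."""
    print(f"SMS to {number}: {text}")

def alert(message):
    try:
        resp = requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=5)
        resp.raise_for_status()
    except requests.RequestException:
        # Chat platform unreachable: fall back to SMS for the on-call roster.
        for number in ONCALL_NUMBERS:
            send_sms(number, message)

alert("Primary region degraded. Move coordination to the incident bridge.")
```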

Train customer-facing staff on outage protocols. Equip them with the information needed to respond to complaints or confusion with confidence and consistency.

Improve Customer Communication

Informed users will usually forgive a disruption. Silence destroys trust.

Pre-draft communication templates for service interruptions. Deliver them via email, SMS, and status pages hosted on infrastructure independent of your primary provider. Do not rely solely on your main domain, or on social media profiles linked to the affected systems.

Speed and consistency beat an overly detailed technical explanation. All users want to know is: What happened? How does it affect me? When will it be fixed?
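Pre-drafting can be as simple as a parameterised template kept in version control and answering exactly those three questions. The fields below are an illustrative guess at what an outage notice needs:

```python
# Illustrative pre-drafted outage notice; the fields are assumptions about what
# customers need to know, kept deliberately non-technical.
from string import Template

OUTAGE_NOTICE = Template(
    "We are currently experiencing issues with $service. "
    "$impact We expect to post an update by $next_update. "
    "Live status: $status_url"
)

print(OUTAGE_NOTICE.substitute(
    service="payments",
    impact="Card payments may fail or appear delayed; no data has been lost.",
    next_update="14:30 UTC",
    status_url="https://status.example.com",  # hosted off the primary provider
))
```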

Push for Vendor Accountability

Review your SLAs with infrastructure providers. Understand what is actually covered, and which kinds of outages fall outside the agreement.

Demand transparent post-incident reports and measure the provider's response against the declared SLA terms. This helps both in your internal assessment and in raising standards across the industry.
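SLA terms are ultimately arithmetic, so it is worth computing them yourself. The credit tiers below mirror the general shape of cloud SLAs, where credits are keyed to monthly uptime percentage, but the exact thresholds are an assumption; substitute the ones from your actual agreement.

```python
# Sketch: translate measured downtime into monthly uptime % and an SLA credit tier.
# The tiers below are illustrative; use the thresholds from your own agreement.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def monthly_uptime_pct(downtime_minutes):
    return 100 * (1 - downtime_minutes / MINUTES_PER_MONTH)

def credit_pct(uptime):
    if uptime >= 99.99:
        return 0      # within SLA, no credit
    if uptime >= 99.0:
        return 10
    if uptime >= 95.0:
        return 25
    return 100

# A 3-hour regional outage: ~99.58% monthly uptime, the 10% tier in this example.
uptime = monthly_uptime_pct(180)
print(f"uptime {uptime:.2f}% -> credit {credit_pct(uptime)}%")
```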

Choose vendors that provide multi-region failover options, and make that capability part of your procurement criteria.

Invest in Testing and Recovery

Conduct regular outage simulations to observe how systems respond, how teams communicate, and where the gaps are.
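A simple way to start is fault injection in non-production environments: wrap a dependency call so that it fails some percentage of the time, then watch how the rest of the system, and the team, react. The wrapper below is a minimal sketch of that idea:

```python
# Minimal fault-injection sketch for outage drills: wrap a dependency call so it
# fails at a configurable rate. Intended for staging environments, not production.
import functools
import random

def inject_faults(failure_rate=0.2):
    """Decorator that makes the wrapped call raise ConnectionError some of the time."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError(f"injected fault in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.3)
def fetch_user_profile(user_id):
    return {"id": user_id, "name": "example"}  # stand-in for a real backend call

# Run the drill: does calling code retry, degrade gracefully, or crash?
for _ in range(5):
    try:
        print(fetch_user_profile(42))
    except ConnectionError as exc:
        print(f"handled: {exc}")
```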

Keep warm backups for core databases and services. Determine how long it will take to switch to another region or another provider.
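Measuring switchover time can be as simple as timing the drill itself. In the sketch below, promote_standby() is a placeholder for whatever your real switchover procedure involves, such as DNS re-pointing, replica promotion, or traffic shifting:

```python
# Sketch: time a failover drill to get a concrete recovery-time number.
# promote_standby() is a placeholder for your actual switchover procedure.
import time

def promote_standby():
    time.sleep(1.5)  # stand-in for the real promotion work

start = time.monotonic()
promote_standby()
elapsed = time.monotonic() - start
print(f"failover drill completed in {elapsed:.1f}s")  # compare against your RTO target
```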

Recovery is not just about restoring systems. It's about restoring service to users and rebuilding trust. Even an hour's delay can damage how customers view you in the long term.

Shift Mindsets, Not Just Architecture

Technical fixes alone won’t future-proof your brand. Business continuity needs to be treated as a company-wide priority.

Encourage leadership to integrate downtime resilience into strategic planning. Include cloud diversity, vendor evaluation, and team preparedness as part of your annual goals.

Make infrastructure resilience a brand value, not just a back-end concern. When your systems fail gracefully, your brand maintains credibility.

Focused Action for Digital Continuity

The AWS outage was a clear demonstration of what can go wrong. It’s also a roadmap for what to do right.

Evaluate where your systems are hosted. Build redundancy into every layer of your stack. Train your teams beyond their everyday functions. Maintain open lines of communication with users, even when things go wrong.

Business continuity isn’t about preventing failure. It’s about minimising its impact.
