Replication Is Not a Backup: The UniSuper Wipeout and the Limits of Cloud-Native Resilience

Google Cloud deleted a $124B fund's entire infrastructure, replicas included. The only thing that saved it was a backup Google didn't control.

i for one · 5 min read

For roughly two weeks in May 2024, one of Australia’s largest pension funds did not exist as far as Google Cloud was concerned. UniSuper, with $124 billion in assets and 615,000 members, had its entire Google Cloud subscription accidentally deleted. Not a region. Not a service. The whole account, and with it the data inside.

What makes this incident worth dwelling on isn’t the size of the blast radius. It’s why the standard defenses didn’t help, and what that says about assumptions baked into nearly every cloud architecture diagram drawn in the last decade.

Replication propagated the deletion

UniSuper had done what good cloud architecture guides tell you to do. Their infrastructure was replicated across two Google Cloud regions. In the mental model most engineers carry, that is resilience: a fire, a flood, a regional power event takes out one location, and traffic fails over to the other.

But that model only defends against a specific failure class — infrastructure failures localized to a region. The UniSuper deletion was an account-level action. When the subscription was deleted, the replica didn’t stand in as a survivor. It was deleted too, because from the control plane’s perspective both regions were the same logical thing being torn down.

This is the part worth internalizing: regional replication is a copy that obeys the same authority as the original. Anything with permission to destroy the primary can destroy the replica in the same gesture. Replication multiplies availability. It does not multiply authority boundaries, and authority boundaries are exactly what you need when the threat is a mistaken or malicious delete rather than a hardware fault.
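To make the authority argument concrete, here is a minimal sketch in Python. It is a toy model, not a description of Google Cloud’s control plane: the `Copy` type, the `survivors` helper, and the authority labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    authority: str  # hypothetical label: who has the power to delete this copy


def survivors(copies: list[Copy], deleted_authority: str) -> list[Copy]:
    """Return the copies left after everything under one authority is deleted."""
    return [c for c in copies if c.authority != deleted_authority]


copies = [
    Copy("primary-region", authority="provider-account"),
    Copy("replica-region", authority="provider-account"),
    Copy("third-party-backup", authority="backup-vendor"),
]

# An account-level delete removes every copy that answers to that account.
remaining = survivors(copies, deleted_authority="provider-account")
print([c.name for c in remaining])  # ['third-party-backup']
```

Adding a third or fourth region under the same account changes nothing in this model. Only a copy with a different `authority` value survives the delete.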

Think of it in terms of what each layer protects against:

RAID / multi-AZ      → disk + datacenter hardware failure
Multi-region replica → regional outage (fire, flood, power)
Independent backup   → deletion, corruption, ransomware,
                       account compromise, provider error

Each row guards a different failure mode. Teams routinely buy the first two and assume they’ve covered the third. They haven’t. The threat that actually erased UniSuper lived in the bottom row, and the bottom row requires something the cloud-native pitch quietly discourages: a copy outside the provider’s blast radius.

The thing that actually saved them

UniSuper recovered because it kept backups with a third-party provider, entirely outside Google Cloud. That decision — declining to fully trust Google’s own internal replication and durability guarantees — is the only reason this is a story about a two-week outage and not about a national pension fund losing its records.

It’s worth sitting with how counter-cultural that choice is. The dominant cloud narrative for fifteen years has been: trust the platform’s durability numbers (eleven nines for object storage), lean on managed replication, and let the provider worry about the rest. UniSuper essentially said that a single provider’s promises are not something you can build a regulated financial institution on, regardless of how many nines are printed on the datasheet. Eleven nines of durability says nothing about a control-plane action that deletes the bucket itself.
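A quick piece of arithmetic shows what eleven nines does and does not promise. The sketch below assumes the usual reading of the figure, an annual per-object loss probability of about 1e-11, and a hypothetical bucket of ten million objects.

```python
# Eleven nines of annual durability: each object has roughly a 1e-11
# chance of being lost in a given year (assumed reading of the figure).
ANNUAL_LOSS_PROBABILITY = 1e-11

objects_stored = 10_000_000  # hypothetical bucket size

expected_losses_per_year = objects_stored * ANNUAL_LOSS_PROBABILITY
years_per_expected_loss = 1 / expected_losses_per_year

print(f"{expected_losses_per_year:.4f} objects lost per year on average")
print(f"about one object every {years_per_expected_loss:,.0f} years")
```

The number describes how unlikely it is that stored bytes decay or go missing. A deliberate or accidental delete issued through the control plane is not a term in that calculation at all.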

Google’s response was unusually direct: Google Cloud CEO Thomas Kurian co-signed a joint statement with UniSuper accepting full responsibility. That candor is welcome, and rare. But “we take full responsibility” restores nothing if you don’t have an independent copy. Accountability is not recovery.

“Cloud-native resilience” hides a category error

The broader lesson cuts against a comfortable assumption. We tend to treat the major clouds as effectively infallible substrates — the modern equivalent of “nobody got fired for buying IBM.” The incident is a reminder that the failure modes that destroy data are not primarily the ones the marketing defends against.

Most published cloud failures are mundane and recoverable when you understand the contract. Consider a separate, much smaller example: a Spark workload that kept dying with OOMKilled errors on Azure Kubernetes Service. The team chased application-level heap tuning — bumping executor memory from 8GB to 10GB — and got nowhere, because the real cause was two infrastructure misconfigurations: scratch directories backed by RAM (spark.kubernetes.local.dirs.tmpfs=true) plus a hard podAffinity rule pinning every executor onto one 64GB node. Shuffle spill went into memory instead of disk and exhausted the node in seconds.
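A back-of-the-envelope memory budget shows why heap tuning could not help. The 64 GB node and 10 GB executor size come from the description above. The executor count and per-executor spill size are hypothetical, and the model ignores memory overhead and other pods on the node.

```python
NODE_MEMORY_GB = 64          # from the incident description
EXECUTOR_MEMORY_GB = 10      # after the heap bump described above
EXECUTORS_ON_NODE = 5        # hypothetical: all pinned there by hard podAffinity
SPILL_PER_EXECUTOR_GB = 4    # hypothetical shuffle spill per executor


def node_memory_demand_gb(tmpfs_scratch: bool) -> int:
    """Memory demanded on the node; tmpfs spill counts as RAM, disk spill does not."""
    heap = EXECUTORS_ON_NODE * EXECUTOR_MEMORY_GB
    spill_in_ram = EXECUTORS_ON_NODE * SPILL_PER_EXECUTOR_GB if tmpfs_scratch else 0
    return heap + spill_in_ram


for tmpfs in (True, False):
    demand = node_memory_demand_gb(tmpfs)
    verdict = "exceeds" if demand > NODE_MEMORY_GB else "fits within"
    print(f"tmpfs={tmpfs}: {demand} GB {verdict} the {NODE_MEMORY_GB} GB node")
```

Under these assumptions, raising executor memory only increases the demand on the one node. Moving scratch space to disk, or spreading executors across nodes, is what brings the total back under capacity.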

The fix was about respecting the actual contract of the platform:

# Don't back scratch dirs with RAM: let shuffle spill go to disk.
spark.kubernetes.local.dirs.tmpfs: "false"
# Replace the hard podAffinity rule with soft anti-affinity so
# executors spread across nodes instead of piling onto one.
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution: [...]

Different scale, same underlying theme: the cloud is not a magic abstraction that obeys your intuitions. It has contracts — about storage semantics, scheduling, and crucially who is authorized to delete what — and lift-and-shift assumptions break when those contracts differ from what you imagined.

What to actually do

The remediation is unglamorous and old:

  • Keep at least one backup outside your primary provider’s authority. Another cloud, on-prem, or an offline/immutable store. The test is simple: could a single compromised or fat-fingered admin account in provider X destroy this copy? If yes, it isn’t a backup.
  • Distinguish replication from backup in your runbooks and in your head. They solve different problems. Owning both is the point.
  • Rehearse restoration, not just failover. A backup you’ve never restored from is a hypothesis.

There’s a human dimension too. In his talk on the human toll of incidents, Kyle Lexmond frames the on-call goal as mitigation — neutralizing impact — rather than solving root cause under pressure, and argues incidents are valuable mostly because they correct your mental model of a system. The UniSuper event is a mental-model correction delivered at maximum severity: the model where “multi-region equals safe” was simply wrong for the threat that hit.

If your organization holds data that cannot be reconstructed from another source — financial records, member accounts, anything legally or existentially load-bearing — the question is not whether your cloud provider is reliable. Even a hyperscaler running sophisticated coordinated experiments across its global fleet is one control-plane mistake away from deleting you. The question is whether your last copy is somewhere they can’t reach. UniSuper’s answer was yes. That’s the entire reason they’re still here.

Sources

  1. Google Cloud deletes Australian trading fund’s infra
  2. Inside Google’s System for Coordinated A/B Testing Across Its Global Service Fleet
  3. Two Misconfigurations That Caused Spark OOM Failures on Kubernetes
  4. The Human Toll of Incidents & Ways To Mitigate It
