Coinbase Details Eight‑Hour Service Breakdown in Postmortem of May 7 Outage

01-Jun-2026 Crypto Economy

TL;DR

Root Cause: AWS cooling failure shut down racks, triggering widespread Coinbase service outages across trading and account actions.
Compounding Issues: Matching‑engine quorum loss and a silent Kafka control‑plane defect significantly extended recovery time.
Next Steps: Coinbase is enhancing cross‑zone resiliency, expanding failover testing, and upgrading Kafka deployments to prevent similar incidents.

Coinbase’s May 7 service disruption left retail and institutional customers facing hours of halted activity, and the company is now outlining how a localized AWS cooling failure escalated into a prolonged, multi‑layer outage. Coinbase says the eight‑hour interruption, followed by a twelve‑hour full‑system recovery window, fell short of its standards and warranted a detailed accounting of the technical breakdowns that compounded the incident.

AWS Cooling Failure Triggered Widespread Shutdowns

According to the company, the chain of events began at 7:20 PM ET when multiple chiller units failed inside a single data hall in AWS’s us‑east‑1 region. The cooling loss forced a thermal‑safety shutdown of racks hosting EC2 instances and EBS volumes, impacting Coinbase systems alongside other major internet services. By 7:48 PM ET, nearly all trading on the platform had halted, leaving retail users unable to buy, sell, send, receive, deposit, or withdraw.

Institutional clients on Prime also saw order‑routing degradation as Coinbase Exchange markets went offline. Recovery progressed unevenly. The matching engine returned in cancel‑only mode at 2:25 AM ET on May 8, full trading resumed at 3:49 AM ET, and coinbase.com and the mobile app regained full functionality by 9:53 AM ET. Event‑streaming backlogs cleared by 2:00 PM ET. Coinbase also notified regulators within required windows and is completing formal impact assessments.
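To put the reported timestamps in perspective, the short sketch below computes how long each recovery milestone took, measured from the trading halt. The times are the ones quoted above. The year is assumed from the article's dateline, and the event labels and helper names are purely illustrative.

```python
from datetime import datetime

# Timestamps (ET) as reported above. The year is an assumption taken from
# the article's dateline; only the differences between times matter here.
EVENTS = [
    ("chillers fail", datetime(2026, 5, 7, 19, 20)),
    ("trading halted", datetime(2026, 5, 7, 19, 48)),
    ("cancel-only mode", datetime(2026, 5, 8, 2, 25)),
    ("full trading", datetime(2026, 5, 8, 3, 49)),
    ("web and mobile restored", datetime(2026, 5, 8, 9, 53)),
    ("streaming backlog cleared", datetime(2026, 5, 8, 14, 0)),
]


def elapsed_since(start_label, events=EVENTS):
    """Return {label: timedelta} for every event after the starting event."""
    times = dict(events)
    start = times[start_label]
    return {label: t - start for label, t in events if t > start}


def fmt(delta):
    """Format a timedelta as hours and minutes, e.g. '8h 01m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    for label, delta in elapsed_since("trading halted").items():
        print(f"{label:<28}{fmt(delta)}")
```

Measured this way, full trading resumed roughly eight hours after the halt, which is consistent with the headline figure, while the event‑streaming backlog took a little over eighteen hours to clear.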

Two Architectural Weaknesses Prolonged the Outage

The company identified two core issues that turned a single‑zone AWS event into a multi‑hour platform disruption. First, the matching engine was pinned to a single building due to its Raft‑based cluster design, which prioritizes low‑latency co‑location. When AWS terminated instances at 9:29 PM ET, three of five nodes went down, eliminating quorum. With no automated cross‑zone failover, engineers had to ship an emergency code change, create a new node group, and restore a 3‑of‑5 quorum manually.

Second, AWS’s managed Kafka service failed silently. A defect in the MSK control plane prevented automatic partition‑leader reelection, leaving producers unable to write and blocking downstream systems, including fees, quoting, ledger components, payments, and data pipelines. Manual partition reassignments began at 3:00 AM ET, with priority topics restored by 9:30 AM ET.

Coinbase says it is improving cross‑zone standby design for its matching engines, increasing failover testing, collaborating with AWS on the Kafka defect, enhancing internal tooling, and migrating the remaining 2‑AZ Kafka clusters to 3‑AZ deployments.
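The quorum arithmetic behind the matching‑engine failure, and behind the move from two to three availability zones, can be illustrated with a minimal sketch. It assumes a generic majority‑quorum rule of the kind Raft uses. The zone names and node layouts are hypothetical and do not describe Coinbase's actual topology.

```python
def majority(total_nodes):
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes, alive_nodes):
    """A majority-quorum cluster can make progress only with a majority alive."""
    return alive_nodes >= majority(total_nodes)


def survives_any_single_zone_loss(placement):
    """placement maps a zone name to the number of voting nodes it hosts.

    Returns True only if the cluster keeps quorum no matter which
    single zone disappears.
    """
    total = sum(placement.values())
    return all(has_quorum(total, total - lost) for lost in placement.values())


if __name__ == "__main__":
    # The case described above: five nodes, three terminated.
    print(majority(5))           # 3 nodes are needed
    print(has_quorum(5, 5 - 3))  # False: two survivors cannot form a majority

    # Hypothetical layouts, for illustration only.
    layouts = {
        "single building": {"zone-a": 5},
        "two zones": {"zone-a": 3, "zone-b": 2},
        "three zones": {"zone-a": 2, "zone-b": 2, "zone-c": 1},
    }
    for name, placement in layouts.items():
        print(name, survives_any_single_zone_loss(placement))
```

With five voters, any two‑zone split must place at least three of them in one zone, so losing that zone always breaks quorum. A three‑zone split can keep every zone at two nodes or fewer. The trade‑off is that spreading voters across zones adds network round trips to every commit, which is the latency cost the co‑located design was avoiding.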
