The Update That Crashed WordPress
TL;DR: I once shipped an iOS update that crashed WordPress for millions of users. The lessons from that failure shaped how I approach software engineering today: test relentlessly, ship small increments, and build systems that fail gracefully.
The Crash
I was Lead iOS Developer at Automattic, the company behind WordPress. We shipped an update to the WordPress iOS app. Within hours, the app was crashing on launch for millions of users.
Not crashing sometimes. Not crashing under specific conditions. Crashing immediately, every time, for a huge percentage of our user base. The App Store reviews went from four stars to one star overnight. Support tickets flooded in. People who ran their businesses on WordPress couldn't access their sites from their phones.
I still remember the moment I realized the scale. You see a crash report and think, "OK, a bug." Then you see a thousand crash reports. Then ten thousand. Then you check the analytics dashboard and the line is vertical.
Matt Mullenweg, co-founder of WordPress, gave out my personal Skype username to some of our most dissatisfied users. I had people in Sweden Skyping me just to curse me out in Swedish.
What Went Wrong
The bug was a data migration issue related to XML-RPC. Back then, XML-RPC itself felt like a fatal error waiting to happen; paired with Objective-C, it was cruel, unforgiving, and crash-prone. When the app updated, it needed to transform some locally stored data into a new format. The migration code worked perfectly on clean test data. It did not work on the messy, inconsistent, real-world data that lived on actual users' devices.
Some users had cached data from years of use, with edge cases and formats that no longer matched our assumptions. When the migration encountered data it didn't expect, instead of handling the error gracefully, the app crashed before it could even finish launching. No fallback. No recovery path. Just a crash loop.
The kicker: our test suite didn't catch it because our test devices had clean data. We were testing the migration against the data we expected, not the data that actually existed in the wild.
The First 24 Hours
Incident response at Automattic was fast but constrained by a hard reality: you can't push a hotfix to the App Store in an hour. Apple's review process meant even an emergency update would take at least a day, probably more.
So we did what we could. We communicated with users through every channel we had. We worked with Apple to expedite the review. And we wrote the fix, making the migration resilient to every edge case we could find, plus a blanket catch that would skip the migration entirely and rebuild from the server rather than crash.
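The shape of that fix translates to roughly this pattern. This is a minimal Python sketch, not the actual code (which was Objective-C); transform, migrate_local_store, and rebuild_from_server are hypothetical names:

```python
def transform(record):
    """Convert one cached record to the new format (toy example)."""
    return {"id": int(record["id"]), "title": str(record.get("title") or "")}

def migrate_local_store(records, rebuild_from_server):
    """Migrate cached records; if anything unexpected happens,
    fall back to a server rebuild instead of crashing at launch."""
    try:
        migrated = []
        for record in records:
            try:
                migrated.append(transform(record))
            except (KeyError, TypeError, ValueError):
                continue  # skip one bad record, keep the rest
        return migrated
    except Exception:
        # Blanket catch: abandon the migration entirely and
        # rebuild local state from the server.
        return rebuild_from_server()

# Messy real-world cache: missing keys, wrong types, legacy formats.
cached = [{"id": "42", "title": "Hello"}, {"title": "no id"}, {"id": None}]
result = migrate_local_store(cached, rebuild_from_server=lambda: [])
```

The key design choice is the two-level fallback: a known-bad record costs you one record, and anything truly unforeseen costs you a slower launch while local state rebuilds, never a crash loop.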
The patched version went live about 36 hours later. It felt like a week.
Three Lessons I Still Use
Test the upgrade path, not just the clean install. Migrations, stale caches, and "old state meets new code" are where the real bugs live. Your test suite should include devices with years of accumulated data, not just fresh installs.
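In practice that means keeping a library of legacy fixtures, snapshots of the weird states real devices accumulate, and running every migration against all of them. A sketch of the idea (migrate and the fixtures are hypothetical, not from the actual WordPress codebase):

```python
def migrate(records):
    """Toy stand-in for the app's migration logic (hypothetical)."""
    out = []
    for r in records:
        try:
            out.append({"id": int(r["id"]), "title": str(r.get("title") or "")})
        except (KeyError, TypeError, ValueError):
            continue  # drop unmigratable records rather than raise
    return out

# Fixtures modeled on real devices, not clean installs:
LEGACY_FIXTURES = [
    [],                                   # fresh install
    [{"id": 7, "title": "clean"}],        # current format
    [{"id": "7"}],                        # string IDs from an old release
    [{"title": "orphan"}],                # record missing its ID
    [{"id": None, "title": None}],        # nulled-out fields
]

for fixture in LEGACY_FIXTURES:
    result = migrate(fixture)             # must never raise
    assert isinstance(result, list)
```

Every production incident caused by bad data should add its reproduction to the fixture list, so the same state can never bite twice.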
Ship smaller. A big release has a big blast radius. If we had shipped the migration as part of a staged rollout to 5% of users first, we would have caught the crash before it hit millions of people. Smaller releases mean smaller failures, and smaller failures mean faster recovery. But LaunchDarkly didn't exist back then, nor did phased rollouts in the app stores.
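The mechanism behind a staged rollout is simple enough to sketch: deterministically bucket each user with a stable hash, and only enroll the first N percent. This is a hypothetical helper, the same idea feature-flag services and app-store phased releases implement for you today:

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Put a user in the first `percent` of the population, stably:
    the same user always gets the same answer for the same percent."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent

# Ship the risky migration to 5% first; widen only when crash-free.
users = [f"user-{n}" for n in range(10_000)]
enrolled = sum(in_rollout(u, 5) for u in users)
```

Because bucketing is deterministic, raising the percentage from 5 to 20 keeps the original 5% enrolled and only adds new users, so a crash spike in the first wave stays contained to that wave.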
Fail gracefully, always. If your app can't complete a migration, it should degrade, not die. Crash reporting, kill switches, and sane defaults aren't luxuries. They're the difference between "some users see stale data for a day" and "millions of users can't open the app."
Key Takeaways
- Real-world data is nothing like test data. If your test suite only runs against clean inputs, you're testing a fantasy.
- Staged rollouts aren't optional for apps at scale. Roll out to a small percentage first. Always.
- A crash loop with no recovery path is the worst failure mode in mobile. Design your app to survive its own bugs.
- Incident response speed is limited by your distribution channel. Plan for the App Store review delay before you need it.
- The bug that takes your system down won't be the one you thought about. It'll be the one hiding in data you forgot existed.
- Avoid XML-RPC at all costs.