The Update That Crashed WordPress
TL;DR: I once shipped an iOS update that crashed WordPress for millions of users. The lessons from that failure shaped how I approach software engineering today: test relentlessly, ship small increments, and build systems that fail gracefully.
The Crash
I was Lead iOS Developer at Automattic, the company behind WordPress. We shipped an update to the WordPress iOS app. Within hours, the app was crashing on launch for millions of users.
Not crashing sometimes. Not crashing under specific conditions. Crashing immediately, every time, for a huge percentage of our user base. The App Store reviews went from four stars to one star overnight. Support tickets flooded in. People who ran their businesses on WordPress couldn't access their sites from their phones.
I still remember the moment I realized the scale. You see a crash report and think, "OK, a bug." Then you see a thousand crash reports. Then ten thousand. Then you check the analytics dashboard and the line is vertical.
Matt Mullenweg, co-founder of WordPress, gave out my personal Skype username to some of our most dissatisfied users. I had people in Sweden Skyping me just to curse me out in Swedish.
What Went Wrong
The bug was a data migration issue related to XML-RPC. Back then, XML-RPC itself felt like a fatal error waiting to happen; paired with Objective-C, it was cruel, unforgiving, and crash-prone. When the app updated, it needed to transform some locally stored data into a new format. The migration code worked perfectly on clean test data. It did not work on the messy, inconsistent, real-world data that lived on actual users' devices.
Some users had cached data from years of use, with edge cases and formats that no longer matched our assumptions. When the migration encountered data it didn't expect, instead of handling the error gracefully, the app crashed before it could even finish launching. No fallback. No recovery path. Just a crash loop.
The kicker: our test suite didn't catch it because our test devices had clean data. We were testing the migration against the data we expected, not the data that actually existed in the wild.
The First 24 Hours
Incident response at Automattic was fast but constrained by a hard reality: you can't push a hotfix to the App Store in an hour. Apple's review process meant even an emergency update would take at least a day, probably more.
So we did what we could. We communicated with users through every channel we had. We worked with Apple to expedite the review. And we wrote the fix, making the migration resilient to every edge case we could find, plus a blanket catch that would skip the migration entirely and rebuild from the server rather than crash.
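The shape of that fix translates to roughly this pattern. This is a minimal Python sketch, not the actual code (which was Objective-C); transform, migrate_local_store, and rebuild_from_server are hypothetical names:

```python
def transform(record):
    """Convert one cached record to the new format (toy example)."""
    return {"id": int(record["id"]), "title": str(record.get("title") or "")}

def migrate_local_store(records, rebuild_from_server):
    """Migrate cached records; if anything unexpected happens,
    fall back to a server rebuild instead of crashing at launch."""
    try:
        migrated = []
        for record in records:
            try:
                migrated.append(transform(record))
            except (KeyError, TypeError, ValueError):
                continue  # skip one bad record, keep the rest
        return migrated
    except Exception:
        # Blanket catch: abandon the migration entirely and
        # rebuild local state from the server.
        return rebuild_from_server()

# Messy real-world cache: missing keys, wrong types, legacy formats.
cached = [{"id": "42", "title": "Hello"}, {"title": "no id"}, {"id": None}]
result = migrate_local_store(cached, rebuild_from_server=lambda: [])
```

The key design choice is the two-level fallback: a known-bad record costs you one record, and anything truly unforeseen costs you a slower launch while local state rebuilds, never a crash loop.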
The patched version went live about 36 hours later. It felt like a week.
Three Lessons I Still Use
Test the upgrade path, not just the clean install. Migrations, stale caches, and "old state meets new code" are where the real bugs live. Your test suite should include devices with years of accumulated data, not just fresh installs.
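In practice that means keeping a library of legacy fixtures, snapshots of the weird states real devices accumulate, and running every migration against all of them. A sketch of the idea (migrate and the fixtures are hypothetical, not from the actual WordPress codebase):

```python
def migrate(records):
    """Toy stand-in for the app's migration logic (hypothetical)."""
    out = []
    for r in records:
        try:
            out.append({"id": int(r["id"]), "title": str(r.get("title") or "")})
        except (KeyError, TypeError, ValueError):
            continue  # drop unmigratable records rather than raise
    return out

# Fixtures modeled on real devices, not clean installs:
LEGACY_FIXTURES = [
    [],                                   # fresh install
    [{"id": 7, "title": "clean"}],        # current format
    [{"id": "7"}],                        # string IDs from an old release
    [{"title": "orphan"}],                # record missing its ID
    [{"id": None, "title": None}],        # nulled-out fields
]

for fixture in LEGACY_FIXTURES:
    result = migrate(fixture)             # must never raise
    assert isinstance(result, list)
```

Every production incident caused by bad data should add its reproduction to the fixture list, so the same state can never bite twice.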
Ship smaller. A big release has a big blast radius. If we had shipped the migration as part of a staged rollout to 5% of users first, we would have caught the crash before it hit millions of people. Smaller releases mean smaller failures, and smaller failures mean faster recovery. But LaunchDarkly didn't exist back then, nor did phased rollouts in the app stores.
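The mechanism behind a staged rollout is simple enough to sketch: deterministically bucket each user with a stable hash, and only enroll the first N percent. This is a hypothetical helper, the same idea feature-flag services and app-store phased releases implement for you today:

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Put a user in the first `percent` of the population, stably:
    the same user always gets the same answer for the same percent."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent

# Ship the risky migration to 5% first; widen only when crash-free.
users = [f"user-{n}" for n in range(10_000)]
enrolled = sum(in_rollout(u, 5) for u in users)
```

Because bucketing is deterministic, raising the percentage from 5 to 20 keeps the original 5% enrolled and only adds new users, so a crash spike in the first wave stays contained to that wave.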
Fail gracefully, always. If your app can't complete a migration, it should degrade, not die. Crash reporting, kill switches, and sane defaults aren't luxuries. They're the difference between "some users see stale data for a day" and "millions of users can't open the app."
Key Takeaways
- Real-world data is nothing like test data. If your test suite only runs against clean inputs, you're testing a fantasy.
- Staged rollouts aren't optional for apps at scale. Roll out to a small percentage first. Always.
- A crash loop with no recovery path is the worst failure mode in mobile. Design your app to survive its own bugs.
- Incident response speed is limited by your distribution channel. Plan for the App Store review delay before you need it.
- The bug that takes your system down won't be the one you thought about. It'll be the one hiding in data you forgot existed.
- Avoid XML-RPC at all costs.