Major outage today from 4:49am - 8:49am EST
This morning, while we were running a database update, Postmark had a major outage that lasted about four hours. It’s by far one of the longest outages we’ve had since we launched in 2010. I’d like to explain what happened and what was affected.
What happened?
At around 2am EST, Russ and Milan started planned maintenance on our database. The update involved changes to a few important tables that are used to save and process outbound messages. We had tested it thoroughly and did not anticipate any issues, so no maintenance window was posted.
Each update was going smoothly until one table processed the change incorrectly. In short, an update we made to a column in one table caused a completely different table to change. This caused our background services to fail to process messages, so we immediately put the app in offline mode. In offline mode, Postmark will still accept messages via the API and SMTP, but other API calls and the UI are completely offline. During this time both inbound and outbound messages are queued, but not sent. It’s a nice failsafe we have to ensure that messages are not lost during an outage.
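For readers who want to picture how an offline mode like this behaves, here is a minimal Python sketch. It is not Postmark's actual code: `MailService`, `submit`, `other_api_call`, and `go_online` are hypothetical names made up for illustration, and the in-memory queue stands in for whatever durable storage a real system would use.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MailService:
    """Hypothetical service with an offline-mode failsafe (illustration only)."""

    offline: bool = False
    queue: deque = field(default_factory=deque)  # accepted but not yet sent
    sent: list = field(default_factory=list)     # stand-in for real delivery

    def submit(self, message: str) -> str:
        """Accept a message (API or SMTP). Accepted even in offline mode."""
        if self.offline:
            self.queue.append(message)  # save now, send later
            return "queued"
        self.sent.append(message)
        return "sent"

    def other_api_call(self) -> str:
        """Any non-submission endpoint; unavailable in offline mode."""
        if self.offline:
            raise RuntimeError("service is in offline mode")
        return "ok"

    def go_online(self) -> int:
        """Leave offline mode and replay queued messages in arrival order."""
        self.offline = False
        replayed = 0
        while self.queue:
            self.sent.append(self.queue.popleft())
            replayed += 1
        return replayed
```

The point of the design is that message submission never fails while the service is offline: messages are saved and replayed in order once the service comes back, while everything else is refused.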
The real problems started when we tried to recover the table. Each attempt to revert the change required a long process to import the records back, and each time it failed. We attempted three hour-long imports, and each one froze completely at the end. Meanwhile, the app was offline and we were still accepting and saving messages.
At this point it was all hands on deck in the HipChat room. We managed to get it working again around 8:30am by recreating the table from scratch and recovering it by hand. We then brought all services back online at 8:49am EST, exactly four hours after the outage began. Once services were back we worked on recovering the messages that we accepted and saved during the outage.
What was the impact?
So far we can’t see any evidence of lost emails. The biggest impact was that both sending and inbound processing were offline during the entire outage. In addition, the web UI and other API calls were also offline.
The only good news is that messages sent to both SMTP and the API were correctly saved and resent after we brought everything back.
What will we do in the future to avoid this?
I’ve done my fair share of postmortems over the years for both Beanstalk and Postmark. In most cases I have a long explanation of how we will prevent the issue from happening again. With this outage, it’s not as easy. I really have to say that this maintenance was well tested and executed, both in our test environments weeks before it happened and today when the issues occurred. The biggest thing we learned from this is to always have a plan B in case something blows up. It sounds easy, but it’s not always clear what problems might actually occur.
We’re in contact with Percona to go through the steps and see if they have ideas. We still have to run the maintenance, so when that happens the next time you can be sure we will be armed with more information and confidence.
The entire team and I are truly sorry for all of the trouble this has caused. We’re exhausted, and we worked as smart as we could to resolve it quickly. We know how much each of you relies on Postmark to run your apps, and any downtime at all is not acceptable. In fact, we push Twitter updates into our HipChat through Zapier, so when tweets about an outage show up while we are working on the problem, it really has a way of showing the impact our outages have on all of you.