Services outage on March 29th, 2022
Last Tuesday, March 29th, Plumsail services experienced an outage. During the outage, users were not able to access or use our services.
The root cause was that Microsoft disabled our Microsoft Azure subscription without prior notification. They just shut down the whole thing. This happened because of their automated security system, which falsely detected that we were intentionally spreading phishing web forms. They don't expose their algorithms, but we know that this affects other businesses. We were in contact with different companies, including one from Northern Ireland that experienced the same issue a month ago.
We have been on the market for over 10 years and have never had an incident of this magnitude, especially with Microsoft Azure. We rely on Microsoft Azure to host our infrastructure for many reasons: 99.99% availability, security compliance, and scalability. Our infrastructure includes virtual machines, SQL databases, cache, queues, etc. It is a scalable and stable multi-part infrastructure built on a modern technology stack in Microsoft Azure. However, when Microsoft disables a subscription, you lose access to all your data, including daily and weekly backups, which we also handled using Azure infrastructure.
That doesn't mean we had no backup infrastructure. An hour after the incident, we redirected our services to our Frankfurt infrastructure. This allowed us to restore part of our services within a few hours and kept our email processing services from losing email messages. However, setting up a replacement for this kind of multi-part infrastructure takes significant time.
One of our products is Plumsail Forms for public sites. It provides a free plan for creating public web forms. Unfortunately, some users use this free plan to create phishing forms. This problem affects any public form service, including Microsoft Forms and Google Forms, and we are not the first to be affected by users like this. We were aware of the problem, and our moderation team continuously worked on moderating, removing, and blocking such forms. We take security very seriously and enforce several measures to detect and prevent spam and phishing. However, some forms may slip through the cracks. The form that caused the incident had been proactively removed long before Microsoft disabled our subscription. Apparently, Microsoft's support did not consider this a valid argument.
In the end, Microsoft realized that disabling our subscription was a mistake and reactivated it. It took tremendous effort from our team and our customers. Thank you to everyone who contacted Microsoft support. It really helped to speed up the reevaluation of our subscription.
What we are doing to avoid this in the future
As a temporary measure, new public Plumsail Forms users will need to pass manual approval before they can publish web forms on our plumsail.io domain.
We are improving our backup infrastructure outside of Microsoft Azure. After this incident, we can't rely on Microsoft's 99.99% availability and backups. Our recovery plans were built mostly on top of Microsoft Azure. We have already taken initial steps to guarantee the successful recovery of our services outside of Microsoft Azure.
Timeline (Pacific Time)
- 7:20 AM - We receive notifications from our infrastructure monitoring tools.
- 7:23 AM - We receive an email notification from Microsoft about disabling our subscription. We get no upfront warning. They just shut down all Azure resources on our paid plan instantly.
- 7:25 AM - We reactivate the subscription manually. Services are restored for some time.
- 8:00 AM - Microsoft automatically disables the subscription again without any ability to reactivate.
- 8:17 AM - All services are redirected to our backup infrastructure in Frankfurt. However, most of the customers are not able to use our services because the most recent SQL database backups are blocked by Microsoft.
- 8:21 AM - We start communicating with Microsoft through various channels: paid support, public communities, and direct contacts with Microsoft employees. In most cases, it leads to a dead end with automated replies about us "violating" the rules of Azure, even for support requests with "A" priority.
- 1:03 PM - Plumsail Forms for Microsoft 365 is restored.
- 1:19 PM - Plumsail Documents is partially restored. All Power Automate actions except triggers and "Start process" are working.
- 1:33 PM - Org Chart for Microsoft 365 is restored.
- 1:54 PM - Workflow Actions Pack is restored.
- 2:20 PM - Communication with Microsoft doesn't help to resolve the incident. We send a newsletter to our customers.
- 9:03 PM - Microsoft activates our subscription.
- 9:10 PM - All services are restored.
- After 9:10 PM - We are working on improving our backup infrastructure outside Microsoft Azure.
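
To put the timeline in numbers, the elapsed times can be worked out from the timestamps listed above. The following is only a minimal sketch: the event labels and the helper function are illustrative names, not part of any Plumsail tooling, and the figures are wall-clock time since the first alert rather than exact downtime (services were briefly back between 7:25 AM and 8:00 AM).

```python
from datetime import datetime

# Date of the incident; all times below are Pacific Time, as in the timeline.
DAY = "2022-03-29"

# Timestamps copied from the timeline. The labels are illustrative only.
events = {
    "failover_to_frankfurt": "8:17 AM",
    "first_product_restored": "1:03 PM",
    "subscription_activated": "9:03 PM",
    "all_services_restored": "9:10 PM",
}
FIRST_ALERT = "7:20 AM"


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day clock times like '7:20 AM'."""
    fmt = "%Y-%m-%d %I:%M %p"
    t0 = datetime.strptime(f"{DAY} {start}", fmt)
    t1 = datetime.strptime(f"{DAY} {end}", fmt)
    return int((t1 - t0).total_seconds() // 60)


for label, clock_time in events.items():
    total = minutes_between(FIRST_ALERT, clock_time)
    hours, minutes = divmod(total, 60)
    print(f"{label}: {hours} h {minutes} min after the first alert")
```

Under these assumptions, the failover to Frankfurt comes 57 minutes after the first alert, and all services are restored 13 hours 50 minutes after it.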
If you still experience any issues after the outage, send us a message at email@example.com.