The main goal of this document is to show our customers that we take the availability of our services seriously. The high-level application infrastructure is described in the Architecture section. This document covers the main disaster scenarios, our reactions to them, and the recovery plans.
According to the high-level architecture, we have the following systems:
DNS
CloudFlare
Load balancer
API servers
Database
Queues & Storage
Worker servers
We use Nagios to monitor our services. If there is an issue with the API, we receive a notification about the incident. The available engineer should perform an initial diagnosis and classify the incident. The next steps depend on the result of the diagnosis. If the incident cannot be fixed immediately, it should be escalated to a manager and the dev team.
Time to diagnose: 5-30 min
We use CloudFlare as our DNS provider and DDoS protection firewall. If there are any issues with the network, the administrator should switch the DNS records to point directly to the load balancer.
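The DNS switch described above could be scripted roughly as follows. This is a minimal sketch, assuming the CloudFlare v4 API endpoint for updating a DNS record and an API token with DNS edit rights. The zone ID, record ID, token, and IP address are placeholders, and the function name is hypothetical. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

CLOUDFLARE_API = "https://api.cloudflare.com/client/v4"


def build_dns_switch_request(zone_id, record_id, token, load_balancer_ip):
    """Build (but do not send) a request that points an A record at the load balancer."""
    payload = {
        "type": "A",
        "content": load_balancer_ip,  # new target: the load balancer's public IP
        "proxied": False,             # DNS only, traffic no longer passes through the proxy
    }
    return urllib.request.Request(
        url=f"{CLOUDFLARE_API}/zones/{zone_id}/dns_records/{record_id}",
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only (203.0.113.10 is a documentation address).
request = build_dns_switch_request("ZONE_ID", "RECORD_ID", "API_TOKEN", "203.0.113.10")
# urllib.request.urlopen(request) would send it; omitted here on purpose.
```

Keeping the request construction separate from sending makes it easy to review the exact change before applying it during an incident.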
We use a Linux machine running HAProxy as the load balancer. It allows us to distribute network traffic between the API servers very flexibly and guarantees zero-downtime deployment. However, if there are issues with the load balancer (invalid configuration or VM maintenance/corruption), the administrator should switch the DNS records to the backup server or directly to an API server.
The API servers use the database to authenticate requests and to collect and save statistics. If the database is unavailable, the service will not work. To guarantee DB availability we use Azure SQL. If the DB is corrupted, we can restore it to any point in time within the last 7 days. Additionally, Azure guarantees geo-distributed storage for backups, which allows us to rely on the Azure infrastructure.
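Before starting a restore, it can help to confirm that the desired restore point falls inside the 7-day window. The check below is a small illustrative sketch. The function name is hypothetical and it does not call any Azure API.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # retention window stated in the text above


def can_restore_to(restore_point, now=None):
    """Return True if restore_point lies inside the 7-day point-in-time window."""
    now = now or datetime.now(timezone.utc)
    return now - RETENTION <= restore_point <= now


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
print(can_restore_to(now - timedelta(days=3), now))  # True: inside the window
print(can_restore_to(now - timedelta(days=8), now))  # False: older than 7 days
```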
To guarantee the SLA, we have configured a backup server that can receive the load if the main server is corrupted. The load balancer supports health checks and constantly monitors the health of the API servers. Additionally, the load can be switched manually.
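The failover behaviour can be illustrated with a short sketch. In practice HAProxy performs this selection itself (for example through the `check` and `backup` server options), so the code below only models the logic. The server names and the function are hypothetical.

```python
def choose_backend(servers, is_healthy, manual_override=None):
    """Pick the first healthy server in priority order, unless an operator forces one."""
    if manual_override is not None:
        return manual_override  # manual switch takes precedence over health checks
    for server in servers:
        if is_healthy(server):
            return server
    return None  # no healthy server available


servers = ["api-main", "api-backup"]              # hypothetical names, main server first
health = {"api-main": False, "api-backup": True}  # simulated health-check results

print(choose_backend(servers, health.get))                              # api-backup
print(choose_backend(servers, health.get, manual_override="api-main"))  # api-main
```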
We rely heavily on Azure Queue and Azure Storage, and we do not have any redundancy mechanism for them (see the Azure Storage SLA). If they become unavailable, the only thing we can do is try to switch to storage in a different region.
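A region switch could follow a simple "try primary, then secondary" pattern. The sketch below simulates this with plain functions instead of real Azure clients. All names are hypothetical, and it assumes that storage in the second region already exists and holds the required data, which the plan above does not guarantee.

```python
def read_with_region_fallback(fetchers, key):
    """Try each regional fetcher in order; return the first successful result."""
    errors = []
    for region, fetch in fetchers:
        try:
            return region, fetch(key)
        except ConnectionError as exc:
            errors.append(f"{region}: {exc}")
    raise ConnectionError("all regions failed: " + "; ".join(errors))


def primary_fetch(key):
    raise ConnectionError("primary region unavailable")  # simulated outage


def secondary_fetch(key):
    return f"value-for-{key}"  # simulated healthy region


result = read_with_region_fallback(
    [("primary", primary_fetch), ("secondary", secondary_fetch)], "job-42"
)
print(result)  # ('secondary', 'value-for-job-42')
```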