
Microsoft Azure DevOps suffers a 10-hour outage after a typo deletes 17 databases

Microsoft's Azure DevOps service in the Brazil South region stopped working for about ten hours. Eric Mattingly, a Principal Software Engineering Manager at Microsoft, subsequently issued a public apology for the outage and revealed that its cause was a simple typo that led to the deletion of 17 production databases.

Mattingly said that Azure DevOps engineers occasionally take snapshots of production databases to investigate reported issues or to test performance improvements, and rely on a background system that runs daily and deletes snapshots once they pass a certain age.
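Microsoft has not published the cleanup job's code; the following is a minimal Python sketch of the age-based selection such a job performs, with hypothetical names and a hypothetical retention window.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; the real job's threshold is not public.
SNAPSHOT_RETENTION = timedelta(days=7)

def find_expired_snapshots(snapshots: dict[str, datetime]) -> list[str]:
    """Return the names of snapshot databases older than the retention window."""
    cutoff = datetime.now(timezone.utc) - SNAPSHOT_RETENTION
    return [name for name, created in snapshots.items() if created < cutoff]

# Example: only the snapshot created 30 days ago is selected for deletion.
now = datetime.now(timezone.utc)
print(find_expired_snapshots({
    "snapshot-recent": now - timedelta(days=1),
    "snapshot-stale": now - timedelta(days=30),
}))
```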

In a recent code upgrade, Azure DevOps engineers replaced the deprecated Microsoft.Azure.Management.* packages with the supported Azure.ResourceManager.* NuGet packages, producing a large pull request that swapped API calls against the old packages for their equivalents in the new ones.

However, a typo in the pull request changed the call that deletes a snapshot database into a call that deletes the Azure SQL Server hosting the database, causing the background snapshot-cleanup job to delete the entire server.
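The pull request itself has not been published, and the affected code was .NET (the Azure.ResourceManager.* NuGet packages). As a rough illustration of how similar the two operations look, here is the corresponding pair of calls in the Python management SDK (azure-mgmt-sql), with placeholder resource names: deleting one snapshot database and deleting the whole server differ by only a small edit.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Intended call: delete a single stale snapshot database.
client.databases.begin_delete(
    "example-resource-group", "example-sql-server", "example-snapshot-db"
).wait()

# What the typo effectively produced: delete the Azure SQL Server itself,
# taking every database it hosts along with it.
client.servers.begin_delete(
    "example-resource-group", "example-sql-server"
).wait()
```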

Cause of the Incident

Mattingly noted that Azure DevOps has tests designed to catch such issues, but the faulty code only runs under certain conditions and so was not well covered by the existing tests: it is exercised only when a snapshot database exists that is old enough to meet the deletion criterion.

Mattingly further noted that, with no snapshot databases present, the internal rollout of Sprint 222 (Ring 0) went without incident. A few days later the changes were deployed to the customer environment (Ring 1) for the South Brazil scale unit (a cluster of role-specific servers, the SBCU). That environment did contain a snapshot database old enough to trigger the bug, and the background job ended up deleting "the entire Azure SQL server and all 17 production databases" for the scale unit.

After more than ten hours of work, Microsoft fully restored the databases and has since applied various fixes and configuration changes to prevent such issues from recurring. The recovery took as long as it did for the following reasons:

First: since customers cannot restore an Azure SQL Server themselves, the recovery had to be handled by Azure engineers, a process that took about an hour;

Second: the databases had different backup configurations, with some set up for zone-redundant backups within the region and others for geo-redundant backups replicated to a paired region, and reconciling this mismatch took several hours;

Third: even after the databases began to come back online, customers still could not access the scale unit because of a series of complications with Azure DevOps' own web servers.

These complications were reportedly caused by a server warm-up task that iterates through the list of available databases with a test call; an error from a database still being recovered caused the warm-up test to retry with exponential backoff, stretching warm-up to an average of 90 minutes, when under normal circumstances it takes only a few seconds.
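The warm-up code is not public; the sketch below, in Python with hypothetical names, only shows the shape of the problem: a test call retried with exponential backoff means one database that stays unavailable can stall the loop for a long time.

```python
import time

def warm_up(databases, is_ready, max_attempts=8, base_delay_s=1.0):
    """Probe each database with a test call, retrying with exponential backoff.

    A single database that keeps failing forces the loop to sit through
    base_delay_s * (1 + 2 + 4 + ...) seconds of waiting, which is how one
    recovering database can stretch a warm-up that normally finishes in
    seconds into many minutes.
    """
    for db in databases:
        delay = base_delay_s
        for _ in range(max_attempts):
            if is_ready(db):
                break
            time.sleep(delay)
            delay *= 2

# Example: every database answers except one that is still recovering.
warm_up(["db1", "db2", "db3"],
        is_ready=lambda db: db != "db2",
        max_attempts=4, base_delay_s=0.1)
```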

To further complicate matters, the recovery was staggered, and as soon as one or two servers started receiving customer traffic they became overloaded and went down again. As a result, the recovery team had to block all traffic to the SBCU until everything was fully ready, then rejoin the servers to the load balancer and resume processing traffic.


