Welcome to our Azure incident retrospective. I'm David Steele. And I'm Sami Kubba. We work in the Azure Communications team. In addition to providing a written post-incident review after major outages, we also now host these retrospective conversations. You're about to watch a recording of a live stream where we invited impacted customers, through Azure Service Health, to join our panel of experts in a live Q&A. We had a conversation about reliability as a shared responsibility, so this includes our learnings at Microsoft as well as guidance for customers and partners to be more resilient. We hope you enjoy the conversation that follows.

Today we're here to talk about a pretty major service incident that impacted customers. We're running two sessions to support our customers in multiple time zones, so if you're watching session one you're watching this on a 30-second delay, and if you're watching session two you're watching this on a 12-hour delay. It won't matter, because the experience for you is the same. During both sessions we've got subject matter experts from the relevant service teams standing by, ready to answer your questions, so please feel free to use the Q&A, the question and answer side panel in the top right-hand corner of your Teams live event screen. That Q&A is open now: go ahead and ask us anything about the talk track, about this incident, or about incident readiness generally.

Today's incident: we're going to focus on an Azure Storage issue that led to unreachable Blob and Data Lake storage. One of the key characteristics of this outage was that, in addition to impacting multiple services, it was multi-regional, and that's obviously a big problem. We don't have multi-region outages very often here in Azure, so it's definitely one we wanted to talk about. This incident impacted customers back on the 13th of November, and some customers experienced it as an almost 12-hour outage, so this was quite a big deal. With that, I'd like to introduce our guest speaker for today: the corporate vice president of Azure Storage, Ahmed Shab. Welcome, and thank you very much for joining us, Ahmed. Can I kick off by asking: with this incident, with this outage, what did it feel like for our impacted customers? What did they experience?

Thank you, and good morning, good day to everybody who joined us today. Let's start by thinking about it from the customer perspective: what did it feel like, what did it look like? You would have experienced an almost total loss of access to certain resources, certain accounts. The event impacted a subset of the accounts in those regions, but it was high impact to those customers in the sense that they lost access to that data. Although the data was retained and remained intact, access to it was curtailed and was almost zero for a certain amount of time. For a lot of customers the outage ranged from a couple of hours up to 12 hours for certain accounts. Not everybody experienced all 12 hours; a lot of customers were back on the air with their services within 4 to 6 hours, but there was a long tail of repairs that we had to do to bring back the final accounts that needed more complicated recoveries. The overall incident took about 12 hours.

Thank you for that. It sounds like, as you say, mileage varied for different customers as far as how long the outage was.
And I'm glad you mentioned Data Lake and Blob Storage there, as well as the different services that depend on those under the hood. Before we get into the specifics of what contributed to impact here, could you help us introduce the relevant components? Who are the characters in our story, and how do these components usually work together?

Sure. As you can imagine, storage services like Blob and the Data Lake services are very complicated services behind the scenes. There are many moving parts, and many layers of defense that allow us to operate the service even under degradation and certain failure scenarios, so customers don't see those impacts. One of the services that we use internally is what's called Azure Traffic Manager. That service takes one endpoint, an IP address if you will, and then in the background distributes the data across the different physical data points, the racks where we store the data, so that we can aggregate performance, aggregate resiliency, and deliver the experience that customers want for capacity and performance. Behind that sit a couple of layers of our services to do with how we receive the data and the requests, and how we then distribute and serve the data in a very structured and safe way, so we always guarantee the durability of the data.

In this particular case, when you first come in there is what's called a traffic profile that's used by the Traffic Manager service. That profile describes, for the particular account the customer is trying to access, how to distribute the request, whether it's a read or a write request. Those profiles allow you to come in from any region, from any zone, and have the same experience in terms of latency and performance. This is the thing that was impacted in this particular outage: we lost access to those profiles, and therefore customers making requests to those endpoints couldn't reach where the data was located. That's what manifested in this particular case.

As you can imagine, the service is designed to be resilient against such things and to limit the blast radius. The way we handle those profiles, and the way we manage those resources, is to distribute them among subscriptions, such that the loss of any one of those resources doesn't have a wide impact on customers. To some extent this worked here, and to some extent it didn't. Clearly we affected more than one customer; we affected about 1,200 accounts in all, and the impact to those customers was obviously very large. The other thing that we typically like to do is make sure those resources are not multi-regional, so that again we limit the blast radius and allow customers to have certain backup options. In this case we made mistakes, and there were issues associated with how we built things and how certain mistakes propagated to create this large outage for customers.
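To make the customer-facing symptom concrete, here is a minimal sketch of the kind of reachability probe a client could run against a storage endpoint. The account name is a hypothetical placeholder and the check is deliberately simple; during an outage like the one described here, requests such as these would fail to reach the front end even though the data itself remained intact.

```python
import socket
import requests

# Hypothetical storage account name, for illustration only.
ACCOUNT = "contosoexampleacct"
BLOB_ENDPOINT = f"https://{ACCOUNT}.blob.core.windows.net/"

def probe_endpoint(url: str, timeout: float = 5.0) -> str:
    """Check whether a storage endpoint is reachable at the network level.

    An unauthenticated request is expected to be rejected with an HTTP error
    (e.g. 400/403/404), but receiving *any* HTTP response proves the endpoint
    resolved and routed to a live front end. A DNS or connection failure is
    roughly what this incident looked like from the client's side.
    """
    try:
        response = requests.get(url, timeout=timeout)
        return f"reachable (HTTP {response.status_code})"
    except (requests.ConnectionError, requests.Timeout, socket.gaierror) as exc:
        return f"unreachable: {exc}"

if __name__ == "__main__":
    print(ACCOUNT, "->", probe_endpoint(BLOB_ENDPOINT))
```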
Ahmed, I know that Azure is a complex system, and complex systems fail in very complicated ways. While the trigger event was an unintentional, accidental deletion, when we think about the resiliency in the service that you alluded to, what were the other contributing factors that led to this outage being bigger than perhaps it should have been?

Yeah. The trigger event, as we pointed out, was the accidental deletion of the resources that describe the traffic profiles. All of the services operated as expected. When we delete a resource profile, it's normally because we deleted the account, because the customer doesn't want that service anymore, so we remove it from production. That's normal behavior for the service, for security reasons and obviously for correctness reasons. In this case, the services that we operate in Azure actually behaved as expected: when we deleted the resource, they went ahead and deleted the profile, which then led to this particular outage. So the services themselves were operating as expected. The mistake, the trigger event that led to this, was the accidental deletion of a resource group within a subscription that contained those resource profiles. That trigger event was part of a routine cleanup that we do for security and operational reasons. But in this particular case, that resource group contained more resources than it should have done, and they were cross-connected to our service rather than being part of the service that was originally being deleted. So the impact was larger than we would have liked. The other contributing factor to the scale of this event was that this particular resource group contained resources for multiple regions, traffic profiles describing multiple regions. So when the resource group was deleted, we clearly affected multiple regions, and that was a particularly difficult thing to accept, that we had that exposure.

Great summary, thank you, Ahmed. We'll move on to our learnings and repair items towards the end, but I appreciate that understanding of the trigger event. The other thing that I noticed in the post-incident review was that we had some detection issues here. It seems like we only learned of the issues once customers were impacted and brought them to our attention. Is that right? It sounds like there was a monitoring gap here.

Unfortunately, that is right. I'll talk about the particular lessons and what we're going to do about that shortly, but let's describe what we do normally. Normally we have monitors that sit around our service and check the availability of our endpoints, so in the same way a customer would come and talk to our service and access it, we have monitors that look at the accessibility of our service. The monitors we had running at the time of this particular incident had endpoints that weren't affected by this particular resource group, so they did not fire as quickly as they would have. Had they been affected, that monitor would have fired and created a high-severity incident for us. So while we did have that monitor, it just wasn't in the blast radius of the event. Clearly that highlighted a gap for us that we are actively repairing, and we'll talk about that shortly. In this particular case a customer actually called us and said, hey, there's a problem going on here, and that incident very quickly escalated to a high-severity incident for us. However, once we became aware of the incident, we were able to diagnose what had led to it very quickly based on our internal monitoring. Our service is extremely well instrumented, with monitors that tell us how different parts of the service are operating, so we were able to understand what happened reasonably quickly.
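The trigger event described above, an accidental resource group deletion, is the class of operation that management locks are designed to add friction to. As an illustration of a guard customers can apply to their own critical resource groups (not a description of the internal tooling the storage team uses), here is a minimal sketch that places a CanNotDelete lock using the azure-mgmt-resource SDK; the subscription ID and resource group name are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource.locks import ManagementLockClient

# Placeholder values for illustration only.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-critical-config"

credential = DefaultAzureCredential()
lock_client = ManagementLockClient(credential, SUBSCRIPTION_ID)

# A CanNotDelete lock blocks delete operations on the resource group (and the
# resources inside it) until the lock itself is explicitly removed, adding
# deliberate friction in front of destructive cleanup jobs.
lock = lock_client.management_locks.create_or_update_at_resource_group_level(
    resource_group_name=RESOURCE_GROUP,
    lock_name="do-not-delete",
    parameters={
        "level": "CanNotDelete",
        "notes": "Guard against accidental deletion during routine cleanup.",
    },
)
print(f"Applied lock '{lock.name}' at level {lock.level}")
```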
Thank you very much, Ahmed. I'd like to double-click on that a little. We said that once it was flagged to our attention, unfortunately not through our monitoring, you were able to determine the cause very quickly. How did you go about mitigating it? Because it was quite a long outage.

Yes, it was a long outage, longer than we would like or accept for our customers. The diagnosis for us was straightforward: we looked at the traffic patterns, we understood our different monitors inside the service, and we were able to determine there was a broken link between the external IP addresses, the VIPs, and the internal service. That very quickly led us to the traffic profiles, and we then understood that a deletion had been going on at the time. To mitigate that, the simple thing to do is to recover those profiles, or to rebuild them. Rebuilding those profiles isn't a difficult process; it's reasonably straightforward. Unfortunately, the tools that we use for such recoveries weren't up to the scale that we faced. Even though only roughly 1,200 accounts were affected, there were many more traffic profiles that we had to recover, and there was some confusion about customer priorities, with customers asking us to do certain things in certain ways, that led us down some dead ends or cul-de-sacs we had to work through. But that wasn't the major factor in why we were late. The major factor is that our recovery tools, although they did work, didn't operate at the speed and scale that we would have liked in order to recover more quickly, and those are certainly some of the lessons we had to learn. We ended up having to operate those tools manually. We crowdsourced our internal team, working across three different time zones, and brought in our entire engineering team to help with those mitigations. For the large majority of customers, their resources were back on the air within a few hours, reasonably quickly. For some customers we specifically chose different mitigations to get them back more quickly, because our goal really is to recover the customer's services, especially anything business impacting. But the long tail, as I said, was where we had to do manual recoveries, and that took a lot longer than we expected. That is certainly a lesson we take away from this: that mitigation should really have taken us a lot less time.

Great summary there. So it was all hands on deck, but perhaps a little too manual for our liking. We've now talked through the different phases of the incident life cycle, so I want to pivot to our learnings and repair items. When we talk about this accidental deletion, it's important to clarify for customers that we try to resist framing anything as human error. If one of our engineers triggered an unexpected outcome, that's a signal that our system, our process, or our tooling didn't provide that operator with sufficient visibility into the state of the system, or perhaps that we lacked appropriate safeguards. So I wonder if you could talk us through what your team is doing to de-risk this kind of accidental operation.

For the operator who was connected to this incident, there's no blame.
Ultimately, as you said, we needed to have created more friction in some of these high blast radius environments, and to have given the operator the right cues and the right automation to prevent such things from happening. That is one of the key lessons we've taken, and one we put into effect immediately: we create a lot more friction in the system for the short term. Ultimately friction cannot solve everything, but for the short term it allows us to slow down, exercise more judgment, ask more questions, and look at the specifics of these operations in much more fine-grained detail. The second set of learnings is that, as we discussed, this was a multi-regional outage, and the blast radius associated with those resource groups was just too large. This is not our policy; our policy is to minimize the blast radius by containing resources that are zonal, or regional at the most. Clearly in this case we had a gap in the way we were building things, and in the automation to test for large blast radius subscriptions and resource groups. That's something we will obviously be addressing, and have actually addressed already. The changes we're making introduce the friction, but we're also introducing better cues for approvals, so that when you come to make a mutating change to the service you have a lot more detail about the impact of that change. This is going to be an evolving process for us. Clearly we understood this particular incident, but our focus is now much wider than this very specific incident. We're looking across our service with different eyes, understanding where else we might be exposed, where else we might have problems, and how we not only fix it once but continue to monitor and make sure this doesn't occur again. We don't want another incident like this; we don't want an incident anywhere near this level of severity for customers. So it's on us to build the tooling and make sure we are monitoring this in an automated way, so that any time there's a deviation or defect we're able to flag it or fix it.

Thank you, Ahmed. It is difficult to go on a journey and uncover blind spots, and I do empathize. But returning to the monitoring issue, we spoke about the gap in being able to detect this issue, and I do want to double-click on that: what's our approach to detecting this class of issue more robustly in the future? Is there any more information you can provide for us there?

Yes, there are two types of things. One I just talked about, in the sense of making sure our service monitoring for resource blast radius is maintained, and that's operational discipline we're building internally. The second question is about detection of availability loss for customers, and how we make that more robust such that customers shouldn't have to tell us there's a problem at any scale. That is where we've already made changes. We're working with some of our customers already to build automation and monitoring that looks at all of the resources, rather than focusing on a canary or a specific leading indicator that may not be affected, which is what happened here. We're building automation that is much more widespread for the monitoring of our endpoints, monitoring that really takes care of all of our customers' experience.
We are in the process of deploying that right now. We've already built some automation, we're fine-tuning it for this monitoring, and we're already seeing small-scale detection of fine-grained issues that we are flagging to customers, to some of our alpha and beta customers working with us on this. So this is something that's already in place, in the same way that we have already instituted the changes for blast radius, and we will continue to scale it over the next few weeks. That allows us to monitor not only for this specific action but for any kind of availability loss that customers experience, so that we can detect and mitigate more quickly.

Thank you, Ahmed, that's a great summary of some of the detection improvements here. You mentioned mitigation, and I wanted to click into that as well, because of this big long-tail recovery for some customers. Can you talk a little more about the investments we're making to recover more quickly? You mentioned that tooling to help solve some of these more manual repairs is something that will set us up for success moving forward.

Yeah, there are a number of things. The worst time to find these lessons is obviously during an outage that affects customers in very negative ways, and frankly we don't want that. So, a number of things. We are reviewing all of our recovery tools as part of the repair items we're working through, and making sure that we test for large-scale recoveries. In this particular case we had procedures, we had the ability to do things, and in theory we should have been able to recover more quickly; unfortunately the testing and the scaling wasn't done as rigorously as I would have liked. That's the pivot we're making now: we are going systematically through all our recovery tools, making sure that we test them to operate at very large scale. We'll simulate multi-regional outages, we'll simulate large outages, and actually put the operational tools through their paces, and we'll do that on a regular basis so any regressions will be detected well ahead of any type of incident. That operational posture helps make sure of two things: one, our tools work, and we are confident they work and operate as expected to help us with mitigation. In this particular outage we knew what the steps were to mitigate; it's just that the pace was slower than we'd like, and we obviously ran into some corner cases that we had to work through. So we're taking the lessons here and applying them to the specific tools for this particular outage, and we are increasing our focus on our repair tools for other types of incidents, other types of issues that could happen. The third part is that we are building a much more focused operational practice to test these tools on a regular basis, through game days and operational reviews that will probe and poke at these things more carefully. So, combining those actions across prevention, detection, and mitigation, we're proceeding down these three dimensions in parallel, and we're moving very quickly. We're not waiting for another incident, we're not waiting for learnings; we're now forcing those scenarios and simulations so that we can take those learnings ourselves without customer impact.
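As an illustration of the monitoring shift described above, from a single canary to probing the whole fleet of endpoints, here is a minimal sketch that checks availability across a list of storage endpoints concurrently and flags any that fail. The endpoint names are placeholders, and this is a simplified stand-in for the internal automation being discussed, not a description of it.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoints; in practice this list would cover every monitored
# account or front end rather than a single canary.
ENDPOINTS = [
    "https://contosoexample001.blob.core.windows.net/",
    "https://contosoexample002.blob.core.windows.net/",
    "https://contosoexample003.dfs.core.windows.net/",
]

def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Treat any HTTP response as 'reachable'; only network-level failures count as loss."""
    try:
        requests.get(url, timeout=timeout)
        return True
    except requests.RequestException:
        return False

def sweep(endpoints: list[str]) -> list[str]:
    """Probe all endpoints in parallel and return the ones that appear unavailable."""
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = pool.map(is_reachable, endpoints)
    return [url for url, ok in zip(endpoints, results) if not ok]

if __name__ == "__main__":
    failed = sweep(ENDPOINTS)
    if failed:
        # In a real monitor this would raise an alert or open an incident.
        print("Availability loss detected for:", failed)
    else:
        print("All monitored endpoints reachable.")
```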
Thank you very much, Ahmed. I know a lot of the time you say that while outages are inevitable, impact doesn't always have to be. Depending on how customers architect their applications, and which resiliency features in Azure they use, there's a chance they can protect themselves from impact when there is an outage. Was there anything customers could have done, or any way they could have architected, to survive or avoid this kind of issue?

Let me describe the typical mitigations that customers can choose for themselves, and then we can talk about why some of those did not necessarily apply here and what we're doing to make sure they always apply in the future. As you know, we have a very extensive set of products and services that allow you to store things within one zone or across multiple zones (zone-redundant), and we also have multi-regional storage that allows you to create copies of your data in two different regions, so that should a zone go down, or a region go down, you can continue to operate your business. A lot of our customers take advantage of those resiliency features to put their most important data and most critical business operations in the more resilient storage types, and a lot of customers who, for example, use zonally resilient storage weren't affected by this particular outage; they were able to ride through it. However, in this particular case there was a particularly nasty effect in that it was multi-regional, so even though you may have chosen to put your data in two regions, some customers, depending on which regions were affected, could not access their data from either region; it affected them in two places. Typically we would say: build your resilience for your data according to how important it is for your business continuity, and we will work to make sure you have those backup options as you need them. One of the things we are very clearly disappointed with was the multi-regional nature of this event, which precluded customers from being able to use the regional backup option, and we are working through that to make sure that the blast radius of any future events will be limited at the most to one region, and ideally to a zone or less. These are things we are continuing to work on, to help customers who make those choices be able to use them in the event of outages. As you said, Sami, we live in a very complex world; there is a lot of sophisticated software, a lot of sophisticated hardware, and physical layers in the world that could, through some random act, cause some kind of outage. But we want customers to be able to depend on our resiliency features, and that's the thing we are frankly most disappointed with in this particular outage.
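As a small illustration of the resiliency choices described above, here is a sketch that lists the storage accounts in a subscription and flags the ones that aren't zone- or geo-redundant, using the azure-mgmt-storage SDK. The subscription ID is a placeholder, and which SKU counts as "resilient enough" is a business decision for each workload rather than something this snippet can decide.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Placeholder subscription ID for illustration only.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"

# SKUs that replicate data across zones and/or regions.
RESILIENT_SKUS = {
    "Standard_ZRS",     # zone-redundant within a region
    "Standard_GRS",     # geo-redundant to the paired region
    "Standard_RAGRS",   # geo-redundant with read access to the secondary
    "Standard_GZRS",    # zone-redundant plus geo-redundant
    "Standard_RAGZRS",  # zone- and geo-redundant with readable secondary
}

credential = DefaultAzureCredential()
storage_client = StorageManagementClient(credential, SUBSCRIPTION_ID)

# Print each account with a marker showing whether its redundancy SKU
# already spans zones/regions or deserves a review.
for account in storage_client.storage_accounts.list():
    sku = account.sku.name
    marker = "OK    " if sku in RESILIENT_SKUS else "REVIEW"
    print(f"{marker}  {account.name:30s} {account.location:15s} {sku}")
```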
Thank you, Ahmed. It's good to hear there are some storage resiliency best practices, even though they might not have saved customers from impact in this specific outage. I'm mindful of time, so I've saved my last question for my co-host; Sami runs our Azure Communications team. Now that Ahmed has run us through the outage itself, how did our communications do? Did we keep customers up to date?

Initially, of course, there was a delay in detection, which led to a delay in communication. Once we understood that there was an issue, we initially thought it was with Databricks; Azure Databricks was the first service to raise a flag and say that they were impacted. We immediately went to our public banner. We weren't able to identify our customers quickly enough or in a targeted enough way, but within about 10 minutes of us raising a flag on our status page we had sent out corresponding portal communications, and that was at around 3:00 a.m. UTC, plus or minus 5 minutes. The issue was really that we didn't communicate in a targeted fashion to the impacted storage customers until over an hour and a half later. So net-net, when customers think about it, it was really 4 hours into the outage before a customer leveraging storage would have received an Azure Service Health alert. We don't want to rely on that status page; it's not very durable, and customers shouldn't have to depend on it. So there was a gap. It was great that we sent out portal communications pretty quickly for Azure Databricks, from when we knew, not from when the impact started, but again, getting the right message to the right customers in time was an issue. As Ahmed alluded to, the scenario is quite an edge case. We've spoken before about our 'Brain' automation, our AIOps, which communicates to impacted customers within 15 minutes; this scenario wasn't high up on the list to onboard there. We do assess what value we're going to get from onboarding each service within a very quick period of time, and we may have to re-evaluate that, but communications and the customer experience for this were pretty painful, I would say.

Thank you for watching this Azure incident retrospective. At the scale at which our cloud operates, incidents are inevitable. Just as Microsoft is always learning and improving, we hope our customers and partners can learn from these too, and we provide a lot of reliability guidance through the Azure Well-Architected Framework. To ensure that you get post-incident reviews after an outage, and invites to join these live stream Q&A sessions, please ensure that you have Azure Service Health alerts set up. We're really focused on being as transparent as possible, and on showing up and being accountable after these major incidents, whether it's an outage, a security event, or a privacy event. Microsoft is investing heavily in these events to ensure that we earn, maintain, and at times rebuild your trust. Thanks for joining us.
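Since the closing call to action is to have Azure Service Health alerts set up, here is a minimal sketch of creating a Service Health activity log alert through the Azure Resource Manager REST API. The subscription ID, resource group, action group, and api-version are placeholders or assumptions; the same alert can also be created from the portal or an ARM/Bicep template.

```python
import requests
from azure.identity import DefaultAzureCredential

# Placeholder identifiers for illustration only.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-monitoring"
ALERT_NAME = "service-health-alert"
ACTION_GROUP_ID = (
    f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/microsoft.insights/actionGroups/oncall-notifications"
)
API_VERSION = "2020-10-01"  # assumed activityLogAlerts api-version

# Alert on Service Health events (incidents, maintenance, advisories) for the
# whole subscription; notifications fan out through the action group.
alert_body = {
    "location": "Global",
    "properties": {
        "scopes": [f"/subscriptions/{SUBSCRIPTION_ID}"],
        "condition": {
            "allOf": [{"field": "category", "equals": "ServiceHealth"}]
        },
        "actions": {"actionGroups": [{"actionGroupId": ACTION_GROUP_ID}]},
        "enabled": True,
    },
}

token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/microsoft.insights/activityLogAlerts/{ALERT_NAME}"
    f"?api-version={API_VERSION}"
)
response = requests.put(
    url, json=alert_body, headers={"Authorization": f"Bearer {token.token}"}
)
response.raise_for_status()
print("Created/updated alert:", response.json()["name"])
```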