Azure Incident Retrospective: Entra ID Seamless SSO Connect Sync, Feb 2025 (Tracking ID: TMS9-J_8)

- Welcome to our Azure Incident Retrospective. I'm David Steele. - I'm Sami Kubba.

We work in the Azure Communications Team. - In addition to providing a written post-incident review after major outages, we also now host these retrospective conversations. - You're about to watch a recording of a live stream where we invited impacted customers through Azure Service Health to join our panel of experts in a live Q&A.

- We had a conversation about reliability as a shared responsibility. So this includes our learnings at Microsoft as well as guidance for customers and partners to be more resilient. - We hope you enjoy the conversation that follows. Super, thank you.

And today's incident, we're gonna focus on an Entra ID issue. This was an issue that caused authentication failures, due to a DNS problem, for customers using Seamless Single Sign-On and Entra Connect Sync. While the issue lasted about eight or nine hours and spanned from the 25th to the 26th of February, most customers, around 94%, were mitigated within two hours.

But a small number of customers had an extended impact, and a lot of our conversation today is gonna focus on that. That extended impact lasted around nine hours. - Got it.

This was a big deal. This is why we have this conversation. So with that, I'd like to introduce our guest speakers, our engineering leaders who are gonna be talking us through the impact of this incident and what we learned from it. I'd like to introduce Soren Dreijer, who's our principal group engineering manager for Seamless SSO.

We've got Vojtech Vondra, our principal group engineering manager from the DNS and networking side here, as well as Ganesh Panchanathan, who's our principal group engineering manager from the Connect Sync side. So welcome, all. Thank you for being here. Soren, should we start with you? - Absolutely.

Soren, if you don't mind, could you please introduce us to some of the characters of the story? In particular, I'm very interested to know: what is Seamless SSO? Is this something customers provision or are they just given, or how do customers and services interact with Seamless SSO? - Yes, so let's start there. I think it's good context. So the preferred method today to obtain single sign-on on Windows 10 and newer devices, which are Entra joined or Entra hybrid joined, is through what's called primary refresh tokens, also known as PRTs. Seamless SSO was built specifically for use cases where users cannot use these PRTs to obtain single sign-on, such as older versions of Windows, so Windows 7 and Windows 8.1, as well as non-Entra-joined Windows machines that cannot use PRTs.

So at a high level, Entra Seamless SSO is a silent authentication mechanism that provides single sign-on for users when they're on their corporate devices connected to the corporate network. When it's enabled, users don't need to manually provide their credentials like you usually do at a login screen. You basically just get signed in automatically.

That's why we call it single sign-on. So to set the stage for this particular incident, let me quickly go through a high-level overview of how Seamless SSO works. I brought a diagram from the public documentation that might help ground this discussion. First of all, it's worth pointing out that Seamless SSO works only on cloud-managed domains.

So if you have a federated domain, for instance using AD FS, Seamless SSO does not apply. Seamless SSO has essentially two usage patterns. There's web browser authentication, which is used when you're in Outlook Web Access for instance and you get redirected over to authenticate in your browser. And then there's native client authentication, such as through the Office desktop clients like Word, Outlook, Excel, OneDrive, or Entra Connect Sync. Since this incident primarily impacted native clients, we'll focus on that flow here.

So the way it works is, when a user tries to access a native application, it could be, yeah, an Office client, and the user isn't already signed in, the native application goes to Entra ID to retrieve the username of the user. It gets a response back from Entra ID that says it should go over to this WS-Trust metadata exchange endpoint. This is also known as the MEX endpoint.

This endpoint is specifically used by Seamless SSO. It's not a general implementation of the WS-Trust protocol or anything like that. This endpoint tells the client that it can use, for instance, Integrated Windows Authentication, which is what Seamless SSO uses to do the magic behind the scenes so you don't need to enter your credentials. Once it's gotten to that point, Entra ID will issue a Kerberos challenge to the client. The client will go to, for instance, Active Directory on-premises, get a Kerberos ticket, and send it back to Entra ID. We crack it open, we validate that you are who you say you are, and you're signed in: you get an access token, an ID token, and so on.

In this incident, the step that broke was around this metadata exchange endpoint, the MEX endpoint, which is right at the very beginning, step four in the diagram. The hostname where this MEX endpoint is hosted became unavailable because the DNS name disappeared; clients couldn't resolve it anymore.

So they essentially couldn't initiate Seamless SSO. Now specifically, this comes down to how these native clients handle network errors. Network errors are common on client devices, like when you step outside a coffee shop or move around your house, so the clients are resilient to those kinds of things. But it also meant that when they got this network error, they didn't prompt for interactive auth; they just swallowed the error, and users essentially couldn't log in to these native clients.
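To illustrate the failure mode Soren describes, here is a minimal, hypothetical diagnostic sketch, not an official Microsoft tool, that simply checks whether the Seamless SSO autologon hostname resolves. It assumes the dnspython package is installed.

```python
# Hypothetical diagnostic: check whether the Seamless SSO autologon endpoint resolves.
# Assumes the dnspython package (pip install dnspython); not an official Microsoft tool.
import dns.exception
import dns.resolver

HOSTNAME = "autologon.microsoftazuread-sso.com"  # the MEX/autologon hostname named in the incident

def check_resolution(hostname: str) -> None:
    resolver = dns.resolver.Resolver()
    try:
        answer = resolver.resolve(hostname, "A")
        print(f"{hostname} resolves to: {[r.address for r in answer]}")
    except dns.resolver.NXDOMAIN:
        print(f"{hostname}: NXDOMAIN - the name does not resolve (the failure mode in this incident)")
    except dns.resolver.NoAnswer:
        print(f"{hostname}: no A records returned")
    except dns.exception.Timeout:
        print(f"{hostname}: DNS query timed out")

if __name__ == "__main__":
    check_resolution(HOSTNAME)
```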

- Got it. Thank you. That's a great summary of kind of the characters of the story, the different pieces. Thank you for explaining the relevant architecture here, Soren.

So you started to mention some of what the customer impact felt like, but there were some scenarios that weren't impacted. Could you help us to understand what customers would've seen, or what they wouldn't have been able to do? - Yes, of course. So again, like I said earlier, the impact was specifically for users in cloud-managed domains, so not federated. And again, users that had PRTs, primary refresh tokens, were not affected either. Now, like I mentioned, there are these two flows, right? If you used the web browser to go to Outlook Web Access for instance, you would not have been able to get single sign-on, because that was broken this time.

But you would've been able to log in interactively, because it would fall back to your usual credentials like username and password. So the web-based flows were unimpacted from a login experience. However, as I mentioned earlier, customers did notice impact on native clients due to this behavior they have around network issues. Specifically, apps like Office (Word and Outlook), Intune, OneDrive, and the SQL client were the biggest impacted native apps we saw, and Entra Connect Sync as well. As I mentioned before, these native clients are specifically fine-tuned for handling transient network errors; this is why they have this behavior and why they don't show UI. But in this case, it meant that users on these corporate devices couldn't actually log in.

And as mentioned before, since these devices retry in the background, when the incident was mitigated, users were logged in again automatically. - Super. Thank you very much. And Vojtech, if you don't mind, I'd like to turn my attention to you. Entra's a huge service.

We make changes to the service all of the time, and this time something didn't go to plan. What should have happened during this change, and what actually did happen? - The teams in my group own DNS, traffic management, and essentially global load balancing and routing to the right data center for all of the Entra use cases. So we partner with Soren's teams and all the other authentication teams that together make up Entra, and DNS changes are very well understood by us. Due to the global nature of DNS, any change affects all customers at the same time. There are no natural safe deployment procedures for gradual rollout, and generally, with these types of changes, we need to take very good care.

So internally, we have a wide set of tooling and layers that should help protect us from these types of changes. If you resolve some of the main authentication endpoints, you will even find active-active setups with multiple providers serving DNS, and we've even blogged about this as one of our core resiliency features. For Seamless SSO and the autologon endpoint, though, we had a single-active setup in place.

We leverage Azure Traffic Manager and Azure DNS capabilities. And generally speaking, even with a single-active hosting setup, we have multiple layers in place. First of all, we track all of our configuration records in configuration files and have a desired state of the system in place. Naturally, in some circumstances, for example during incident mitigation, those can come out of sync when a change is made in an expedited fashion.

And we have drift detection that detects if the configuration of production ever drifts from the desired configuration. We also have manual process review, so change management procedures: an engineer documents the intended change, explains the deployment procedure, explains the rollback procedure, explains the health signals they would observe to understand if the change is proceeding as it should, under what circumstances they would abort the change, and how they would revert to the last known good state. Unfortunately, we had a miss in both of these areas, which ultimately led to the outage.

The DNS record that was in place was managed in a different environment than we typically scanned. But even then, we have an exhaustive list of endpoints used for authentication, so it is within our means to scan all of these endpoints to see whether they are in use. We also have the option to substitute a temporary backend behind a DNS record to assess whether there is any traffic going to it, and we have internal monitoring that would allow us to understand whether any DNS resolutions are occurring on a given hostname. So what happened: the first thing is that the drift detection system did not kick in.

And the endpoint was not correctly documented as being in use. Additionally, because this endpoint is used by Kerberos authentication, which for a small portion of clients is sensitive to the name being resolved using an A record, it was set up using the alias DNS record feature, and our drift detection system did not fully support those before this incident.

We've since put this in as a repair item. We typically do not use these, so this was a rare, sort of snowflake type of situation. And then there's the change management process, where we declare intent, explain how the change is going to be made, what we are going to be watching, under what circumstances we would roll back, how we would roll back, and what the typical time to mitigate would be. That process actually identified this change as high risk, and the blast radius was correctly declared as global, but the approvals within our tool were not configured correctly, so it did not get escalated to the proper approvals and reviews.

So this was a process miss for us, because typically we would have asked: can this change be done more safely? Does it need to be done at all? Do we need to do it right now? These are the sorts of questions we want the chain of approval to ask for these high-risk changes. And a lot of the time the answer is that we need to invest in a way to execute the change more safely, and we might need to deprioritize other asks, investments, or features.

So this is essentially how we work with this. - Thank you, Vojtech. That's cool to understand kind of the different layers involved and I appreciate you kind of letting us in there.

You mentioned drift, and you explained the defect in the drift detection system. I just wonder if we could pause briefly on the second safety check there; I think that process point is really important. What should have happened in that scenario? Because, like you say, it was correctly flagged as something that could be high risk, but where did we miss out on that next step? - So you can do drift detection from multiple directions.

You can take a look at your reference configuration, then look at the production systems, and flag any expected resource that is not present in production. So let's say I expect there to be a CNAME for this particular use case. I go and resolve DNS from a public resolver, ideally from multiple locations, and verify that the record is present and isn't configured differently, pointing to a different target.

In those cases, that triggers an incident for internal teams to reconcile those configurations as fast as possible and block any changes to that zone until the configuration is reconciled. You can also take a look from the outside, and this is where the gap happened. Our team serves what we call first-party use cases, so Microsoft and the teams within Entra are our primary customers.

And we know we have an exhaustive list of hostnames and endpoints used in authentication scenarios. So you go from the other direction: you exhaustively resolve those hostnames, and for each of them you track what the expected answer is, essentially a version of end-to-end testing. That did not cover this endpoint, or more specifically, for this endpoint the alert was not correctly configured.

The third line of defense is scanning all resources in use across a set of subscriptions that we expect to be pointing to valid endpoints. And again, this was not in place. So the primary gap in the drift detection system was correctly understanding the model of what is deployed in production and making sure you exhaustively check all in-use domains against the desired configuration.
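To make the "outside-in" check Vojtech describes concrete, here is a minimal sketch of that kind of drift detection, assuming the dnspython package. The inventory entries, record types, and expected values are hypothetical placeholders, not Microsoft's actual desired-state configuration.

```python
# Minimal sketch of an outside-in drift check: resolve a list of known endpoints and
# compare each answer against the expected record type and target.
# The inventory below is illustrative only. Assumes dnspython (pip install dnspython).
import dns.resolver

# Hypothetical desired-state inventory: hostname -> (record type, expected substring of the answer)
EXPECTED = {
    "login.example.com": ("CNAME", "example.trafficmanager.net"),
    "sso.example.com": ("A", "203.0.113."),
}

def check_drift(expected: dict[str, tuple[str, str]]) -> list[str]:
    problems = []
    for hostname, (rdtype, expected_fragment) in expected.items():
        try:
            answers = [str(r) for r in dns.resolver.resolve(hostname, rdtype)]
        except Exception as exc:  # NXDOMAIN, NoAnswer, timeouts, ...
            problems.append(f"{hostname}: query for {rdtype} failed ({exc})")
            continue
        if not any(expected_fragment in a for a in answers):
            problems.append(f"{hostname}: got {answers}, expected something containing {expected_fragment!r}")
    return problems

if __name__ == "__main__":
    for problem in check_drift(EXPECTED):
        print("DRIFT:", problem)  # in production this would page an engineer, per the transcript
```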

And in this case, as I said, and this is no excuse for us, the DNS record itself was not deleted. autologon.microsoftazuread-sso.com was not deleted; it pointed, as an alias record, to a Traffic Manager profile. A Traffic Manager is a dynamic DNS resolver that will give you a different set of A records depending on where you call it from. Typically, Traffic Managers in our use cases serve two purposes. The first is performance-based routing: you authenticate against the closest data center so the latency of your authentication flow is lower.

The second is routing to a healthy scale unit: Traffic Managers automatically evaluate the health of the upstream, and if the upstream becomes unhealthy, they take it out of rotation and give the client a different set of A records to resolve to. This particular Traffic Manager was in place because Traffic Manager allows us to shape traffic globally, and it got introduced during Entra's adoption of IPv6 over the previous three years. This was a big undertaking; Entra ID supports IPv6 across its fleet today.

And as part of that migration effort, this endpoint and this Traffic Manager were tracked in an unexpected location for us, again not in scope of the drift detection, where we have this configuration and these resources centralized in well-known subscriptions within Azure. - Vojtech, it's fascinating, and it sounds like there's a lot of rigor that goes into making sure beforehand. I like the checks and balances, and I like asking, hey, what happens here, and running through the scenarios.

When this eventually happened, how did we detect it? How did we investigate it? It sounds like we'd done a lot of pre-work just in the checks. And also, if we can, let's talk about why most customers, 94%, were mitigated within two hours, but some customers had a long-tail recovery of up to almost nine hours. Can we touch on those points as well? - Yeah. The outage actually got detected from three sources. The first one was from the authentication scenario itself, so a customer-oriented scenario, and that was the first actionable detection for us.

The second detection was strictly infrastructural: does the DNS record point to a valid location? That detection was not actionable, as it was raised at a lower severity that would not trigger paging an engineer. And the third was from our own internal incident management system. This is a very core system that runs on the lowest layers of our infrastructure, because it needs to be able to support us if some of our most foundational systems are down. We uncovered a partial dependency there: a feature we call "request assistance", where you request assistance from another subject matter expert, was not working properly because it was itself affected by the service-to-service flow. The team very quickly identified that and correlated it with the outages we had declared and internally published around the authentication scenario. The incident had an unusually high time to engage, and time to essentially start fixing the incident, because this incident management system was affected as well.

This affected our capability to request assistance and to request bridges. We have contingencies for that; this is something we do plan for, and we have backup solutions that could be put in place. But using those, even though they are exercised, is not the beaten path, so it added some time to assembling all the right people on the bridge. Once we'd connected all the dots, the mitigation was fairly quick.

With a big caveat, and that is the answer to your second question, Sami. The behavior when you delete a Traffic Manager profile that is referenced by an alias record is a cascading delete that also takes out the DNS record that was pointing to it as an alias. So our rollback essentially recreated the Traffic Manager internally, but did not correctly set that record to point to the recreated Traffic Manager. We identified that very quickly, because by that moment we knew which FQDN, which hostname, was affected. But we did a partial rollback: we used a CNAME record, not an alias record, to point it to the Traffic Manager. This mitigated the impact for 94% of customers, and any client able to recursively resolve using CNAMEs, or not sensitive to which exact record type gets returned, started working again.

And so within the next hour or two, we started observing additional reports, mostly from customers, that some flows were still not working. Specifically, clients like, I believe it was SQL Management Studio, and some other flows still couldn't authenticate and were reporting issues. This time it was not DNS resolution failures anymore; they were surfacing as 403 Forbidden errors.

And the cause of that was that when Kerberos, during the flow of the protocol, forms something called a service principal name, it is sensitive to the DNS record being an A record rather than a CNAME. It took us some time to root-cause that, and then the fix was very quick: we restored the last known good configuration in its fullness. So again, the learning for us there is that a partial rollback is never good enough. You need to have 100% confidence that you're exactly at the last known good state, which was not the case.
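To illustrate the distinction that caused the long tail, here is a minimal sketch, assuming the dnspython package, that prints whether a hostname is answered through a CNAME chain or directly with A records (as an Azure DNS alias record is). The hostname shown is a placeholder.

```python
# Minimal sketch: show whether a name is answered via a CNAME chain or directly with
# A records. Hostname is illustrative; assumes dnspython (pip install dnspython).
import dns.rdatatype
import dns.resolver

def describe_answer(hostname: str) -> None:
    answer = dns.resolver.resolve(hostname, "A")
    # The full response may contain the CNAME chain followed by the final A records.
    for rrset in answer.response.answer:
        rtype = dns.rdatatype.to_text(rrset.rdtype)
        values = ", ".join(str(r) for r in rrset)
        print(f"{rrset.name} {rtype}: {values}")
    # Per the transcript, some Kerberos clients build the service principal name
    # differently when the record is a CNAME rather than an A record, which is why
    # the partial rollback (CNAME instead of alias/A) left those clients failing with 403s.

if __name__ == "__main__":
    describe_answer("www.example.com")  # replace with the endpoint you want to inspect
```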

- That's a great summary, Vojtech. Appreciate you kind of walking us through those detection challenges and then ultimately how we mitigated it, including the long tail mitigation there. That's a great segue.

You started touching on some of our repairs or learnings coming out of this. Could you talk us through how we're improving that change management element? That's where you started, in terms of how we were rolling out that change initially, but what are we gonna do to make that even safer next time? - Any system that's backed by a process relies on one of two things: either people are perfectly trained to use it in the right way, or the gating doesn't allow any sort of bypass. So in the case of the change management system, we're working to make sure that we link the control plane to verifying that these approvals exist. And internally, we are actually in the process of rolling out a system that has a more rigorous approval escalation process.

So that is already in place. Connecting these two, putting in the system that has this rigorous approval escalation process and making sure that the control plane gates any change on there being a ticket with the right set of approvals, is the process side of things. Again, if we were doing a change like that today, it would still be considered high risk. So the second question you're asking is: how do I make that risk lower? First of all, execution of such changes needs to be done under more supervision. There are ways to verify whether the domains are in use, and again, the control plane operations can gate on that: they will not allow you to do a modification, or a deletion in this case, without you having confirmation that a metric of DNS resolutions for this particular record is zero. The control plane will simply not let you execute a removal if it has any signal that the domain is still in use.
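As a rough illustration of the gate Vojtech describes, here is a minimal sketch of a control-plane-style check that refuses a DNS record deletion unless a resolution metric is zero. The metric source and delete operation are stand-ins, since Microsoft's internal control plane and telemetry pipeline are not public.

```python
# Minimal sketch of a pre-deletion gate: refuse to delete a DNS record unless a
# resolution metric for it has been zero over a lookback window. The metric source
# and delete function are placeholders, not real Microsoft APIs.
from typing import Callable

class RecordStillInUse(Exception):
    """Raised when a delete is attempted on a record that still receives resolutions."""

def gated_delete(record_name: str,
                 resolution_count_last_30d: Callable[[str], int],
                 delete_fn: Callable[[str], None]) -> None:
    count = resolution_count_last_30d(record_name)
    if count > 0:
        # Block the change: any signal of use means the removal must not proceed.
        raise RecordStillInUse(f"{record_name} had {count} resolutions in the lookback window")
    delete_fn(record_name)

if __name__ == "__main__":
    # Illustrative wiring with fake telemetry and a fake delete operation.
    fake_metric = lambda name: 12345          # pretend the record is still being resolved
    fake_delete = lambda name: print("deleted", name)
    try:
        gated_delete("autologon.example.com", fake_metric, fake_delete)
    except RecordStillInUse as err:
        print("Change blocked:", err)
```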

So we're putting in those checks so that even if we missed the process side of things, these types of changes would not be possible. And lastly, we are reviewing and doing an inventory of all the endpoints and hostnames we offer to customers, for both core and opportunistic authentication scenarios, and ensuring we continuously monitor and evaluate them. That's yet another line of defense. - And Vojtech, on that, you mentioned something about the endpoints, and you spoke earlier about the global nature of DNS.

And I know that for Seamless SSO you have global endpoints. Have you considered, or is there any assessment of the feasibility of, regional endpoints so that the blast radius isn't so big? - Yeah, so regionalization and availability, for example availability zone retrofitting and usage when it comes to compute, is a huge topic within both Azure and Entra. And for us, whenever there's an opportunity to leverage regional endpoints or generally regionalize our services, that is by far the preferred option.

And especially in partnership with Azure, that is the solution in place today. For global authentication scenarios, though, if you're familiar with the model of tenants and subscriptions, tenants are scoped globally, and so core authentication scenarios leverage global endpoints today.

But for every particular use case, service-to-service or user-to-service, we're always evaluating the feasibility: even if it's a lot of work, can we put regionalization in place? Can we reduce the blast radius? So this is a follow-up for us on the mid-term horizon, to walk through the Seamless SSO and Entra Connect Sync use cases, but also all the other scenarios we support, and evaluate whether regionalization is an available strategy. But no matter what, login.microsoftonline.com, and login.live.com for our consumers as well, will be endpoints we support indefinitely. So this is something where we need to have safe strategies even if the endpoints are global.

- Sure, that's great to understand. It's good to know that you and the team are exploring those options anyway. I wanna bring Ganesh here quickly.

We have Ganesh here from the Entra Connect side. We've heard a lot from Vojtech about how we're going to improve our change management. Ganesh, can you talk us through what Entra Connect learned, and whether there are any improvements or repairs from that side? - Absolutely. Thank you, David.

Just to set some background, Entra Connect Sync is used by our customers to synchronize their identity objects from their Active Directory to their Entra ID. It's a mission-critical application that our customers use to seamlessly connect their on-premises and cloud infrastructure. In the background, it uses a dedicated username and password to authenticate between the application and the cloud, and that user authentication flow was indeed impacted by this incident.

So as part of the incident review and the learnings, we do recognize that usernames and passwords are not the most modern way of authenticating. We are looking at ways of modernizing this by moving that authentication model to leverage applications and certificates. So stay tuned; this is something we are actively working on as part of the Secure Future Initiative, and we expect to announce more about it by the end of next month. The key learning for us is to move away from user-based authentication to application-based authentication for Entra Connect.
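As a general illustration of the application-plus-certificate model Ganesh mentions, not Entra Connect Sync's actual implementation, here is a minimal sketch using MSAL for Python; the tenant, client ID, certificate, and scope are placeholders.

```python
# Illustration of application-plus-certificate authentication with MSAL for Python.
# This is not Entra Connect Sync's actual code; all identifiers below are placeholders.
import msal

TENANT_ID = "<your-tenant-id>"
CLIENT_ID = "<your-app-registration-client-id>"
CERT_PATH = "service-cert.pem"          # private key in PEM format
CERT_THUMBPRINT = "<certificate-thumbprint>"

with open(CERT_PATH) as f:
    private_key = f.read()

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential={"private_key": private_key, "thumbprint": CERT_THUMBPRINT},
)

# Acquire an app-only token (client credentials flow) instead of using a sync-account password.
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" in result:
    print("Got an app-only access token")
else:
    print("Failed:", result.get("error"), result.get("error_description"))
```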

- Super. Thank you, Ganesh. I'm a big fan of no more usernames and passwords and having more sophisticated methods, so that's exciting. Soren, if I can turn my attention back to you again: for our customers who are dialed in, and for customers who are watching the stream later on YouTube, is there anything they could have done to be resilient? Is there anything coming down the line that could help with a situation like this? Or do customers just have to endure these types of events? - No, I think we already shared some of the repair items we're doing on our side to hopefully make sure these things don't happen again. There are a few general things I want to share that customers could think about, in addition to the resilience work we're doing that we mentioned earlier.

First of all, I would encourage customers to re-evaluate whether Seamless SSO is still a feature they need in their organization. Like I mentioned in the beginning, Windows 10, Windows Server 2016, and newer versions support primary refresh tokens, which are a much more reliable and secure alternative to Seamless SSO, and they're covered by Entra ID's backup authentication system for additional reliability.

So in general, if you're no longer using Seamless SSO, or any other auth feature really, I would recommend turning that feature off. And learning from the impact on Entra Connect like Ganesh mentioned: where possible, use managed identities for Azure resources in lieu of user accounts pretending to be service accounts. And if you can't use managed identities, look at confidential clients in the authentication libraries that Microsoft supports.
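As a minimal sketch of that guidance, assuming the azure-identity and azure-keyvault-secrets packages, here is how a workload running on an Azure resource can use a managed identity instead of a stored username and password. The vault URL and secret name are placeholders.

```python
# Minimal sketch: prefer a managed identity when the code runs on an Azure resource.
# Assumes azure-identity and azure-keyvault-secrets; the vault URL and secret name
# below are placeholders.
from azure.identity import ManagedIdentityCredential, DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://<your-key-vault-name>.vault.azure.net"

# On an Azure VM, App Service, AKS, etc., the platform provides the identity;
# no username, password, or client secret is stored with the application.
credential = ManagedIdentityCredential()

# DefaultAzureCredential() is a common alternative: it tries managed identity,
# environment variables, developer sign-in, and so on, in order.
# credential = DefaultAzureCredential()

client = SecretClient(vault_url=VAULT_URL, credential=credential)
secret = client.get_secret("example-secret-name")  # placeholder secret name
print(secret.name, "retrieved without any stored password")
```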

And finally, I know this wouldn't have saved the day here, but as a general best practice, if you're building your own native clients, make sure you leverage the latest auth libraries like the Microsoft Authentication Library (MSAL), because they include safe-by-default implementations of authentication flows and they know how to fall back, not in cases exactly like this, but in general. - Got it. That's a great summary of some best practices that can help from the customer side of things. Thank you for that, Soren.

I've saved my last question for my co-host Sami. Sami runs our incident communications team. So in addition to hearing from our service team leaders about what went wrong, Sami, how did we do from a comms perspective? Were we able to keep impacted customers up to date with what was going on? - Yes and no. In the beginning, as Vojtech mentioned, when our RA system, our request assistance system, went down, that impacted the communications team too, so we weren't able to get there quickly.

But we did get there: around two hours after impact started is when we began sharing updates. Because we didn't have the exact list of impacted customers, tenants, or subscriptions, we communicated broadly, and very soon after that we communicated in the portal with what we knew. A key thing to note here is that by then, two or three hours in, we had also mitigated for 94% of customers, so those 94% would already have been mitigated by the time they saw our comms.

So our comms were really most relevant for that long tail of customers who were still experiencing pain. A couple of things to note as well. When we initially communicated, we under-communicated; we didn't have a full list of who was impacted, so some customers may have only seen things once we offered our preliminary PIR or the final PIR. The other thing to note, which is a shame as well, is that today we don't have tenant-level alerting. We can communicate to customers in the Azure portal, to their tenants.

But that last-mile configuration, where customers configure SMS and email alerts and plug into solutions like PagerDuty and ServiceNow, that piece hasn't happened yet for tenant-level events, and we're looking to land it in October. Lastly, I've had a few customers ask us about Brain, which is our automatic communication system. It communicates to impacted subscriptions within 15 minutes. But Brain doesn't work for every kind of scenario, particularly when resources are healthy and networking is down.

Our Brain system isn't great at dealing with scenarios like that. So we have a lot of work to do on the comms front. There are delays when we're going off the beaten path, as Vojtech referred to it, but really making sure that those tenant-level, last-mile alerts are set up is a priority for us. - Thank you for watching this Azure Incident Retrospective. - At the scale at which our cloud operates, incidents are inevitable. - Just as Microsoft is always learning and improving, we hope our customers and partners can learn from these too, and we provide a lot of reliability guidance through the Azure Well-Architected Framework.

- To ensure that you get post-incident reviews after an outage and invites to join these livestream Q&A sessions, please make sure that you have Azure Service Health set up. - We're really focused on being as transparent as possible and showing up and being accountable after these major incidents. - Whether it's an outage, a security event, or a privacy event,

Microsoft is investing heavily in these events to ensure that we earn, maintain, and at times rebuild your trust. Thanks for joining us.
