Forum Discussion
Azure Virtual Desktop - Black Screens on logins - What we've tried so far
After a day of production (Mondays have been the worst by far), I can report that the KIR "fixed" the issue entirely. We are back to getting a very timely response on connections, and no user complaints at all. Ready to close out this incident, and we will apply the KIR to all of our AVD servers in our environments.
Strange that this issue and the KIR don't seem to be published (if I'm wrong please correct me). I'm only finding forum posts about it. I happened across this one by accident after noticing an interesting error code, and it turned out to be the exact root cause and mitigation we needed. Everyone's been yelling at us about this for almost a month; it went all the way up the chain, and no one understood what to do until now.
Thanks everyone for sharing info.
- dit-chris | Oct 15, 2024 | Brass Contributor
We've got a one-week-old session host, completely rebuilt from the ground up, that has been giving us trouble every day for the last couple of days: AAD Broker issues, black screens, login delays. I'm even going to double check the KIR is definitely present on it! 🤯
- HelenaKohler | Oct 15, 2024 | Copper Contributor
Microsoft finally announced there is an issue, but it's weird to me because the mentioned KIR did not solve the issue entirely (at least for us).
https://learn.microsoft.com/en-us/windows/release-health/status-windows-10-22h2#3421msgdesc
- dit-chris | Oct 10, 2024 | Brass Contributor
JPlendo my understanding is that the regression the KIR targets was introduced in the July Preview/August Patch Tuesday releases and relates to the AppX/AppReadiness services. What I don't understand is why it didn't seem to start impacting people until mid-to-late September if the code had been out there for six weeks. My suspicion is that it maybe did from August's preview, but not in big enough numbers. We (like, I imagine, most here) don't apply the Week D Preview to our AVD hosts and aren't exactly "fast" on the Week B Patch Tuesday stuff unless there are things we really need to target, so we didn't roll out the Sept 10th releases to production until the evening of Friday 20th.
Whilst the underlying issue the KIR targets isn't being fixed until October Week D/November Week B, the question is what else changed in September. My suspicion is that whilst the service itself was broken back in July/August, its job is to process AppX packages, and maybe no AppX package (these will be core components of Windows, as we don't have any others, MS-only standard stuff) was modified in August, so the AppX service didn't have to process a redeployment. But let's say hypothetically the calculator app went from v1.1 to v1.2 in September: then the AppX service needed to do stuff, and maybe something in that AppX package triggered, say, a race condition in the AppX/AppReadiness service. So whilst the underlying issue in the service code is still there, it's possible that the October release also includes in it, or alongside it, updates to those system component/in-box AppX packages that the service was choking on. It all seems a bit "well try it, it might make things better, it might make things worse, we'll let you test it for us and let us know how it goes... but just in case make sure you take a backup so you can revert your environment".
- JPlendo | Oct 10, 2024 | Brass Contributor
dit-chris Agreed. It's really funny that they will admit there is an issue when talking to them on the phone, as our TAM also told us they acknowledge there was an issue. But to post to their known issues site would require them admitting publicly to another issue with AVD after the multiple outages we had in September. Besides the fact this is a publicly traded company, Microsoft has all these users and companies paying BIG money for support contracts, and they aren't doing their part by giving full information when issues arise or having any sense of urgency to get them fixed.
It's almost like Microsoft knows how big they are. They know people can't really leave, as non-IT folks aren't going to learn a whole new OS to do their jobs, so Microsoft is kind of saying "deal with these issues until we see fit to go fix them". It wasn't always this way. We used to get good support from them. If the issue is the folks Microsoft hands its support over to, then that needs to change, as they aren't doing the job they are supposed to. We had a Sev A dropped down to a B now. I send emails to our support and sometimes don't hear back for days. I ask a bunch of questions and all I get in reply is "let us know when the issue occurs so we can look at it". I am sending them information on what is happening as well as questions on things we would like to try, but get no answers to those questions. I even submitted a survey about how bad their support has been with us recently. I got an email reply telling me how much they care about support and will work to make it better... basically a form letter. Frustrated isn't the word I'd use for how I feel.
All that aside, we were told to install the October patches on our VMs and see how things went. But I see you were told the October patches will NOT affect or fix these black screen issues, right?
- Marius62991325 | Oct 10, 2024 | Copper Contributor
Could not agree more. We're at our wits' end and the MS case is going in circles. Trying so many different paths just to create workarounds has been fruitless. As soon as we think we've got a 'temp fix', that same 'fix' no longer works a few hours later.
- dit-chris | Oct 09, 2024 | Brass Contributor
JPlendo I could just about have given them the benefit of the doubt for not updating the known issues list on the existing release KBs, until yesterday, when they pushed out KB5044273 whilst still making no mention of the issue. Clearly they know about it: they have acknowledged to people with support cases that the issue would continue in yesterday's 10B release, whilst at the same time sending out a KIR (the giveaway is in the name, KIR standing for KNOWN ISSUE Rollback). That is now just unacceptable. They have an established method in the Known Issues list on the impacted KBs and also centrally on the Windows Release Health Dashboard here: https://learn.microsoft.com/en-us/windows/release-health/status-windows-10-22H2. Ironically, even with a load of notes on the case from previous engineers at the weekend, I had a support engineer try to tell me there weren't any known issues because he couldn't see one on that very Release Health Dashboard!
Certainly my suspicion too is that they are shying away from publicly acknowledging there is an issue with one of their flagship enterprise desktop virtualisation solutions because it wouldn't be a good look... that said, heaven forbid a prospective AVD customer wanders over to the Microsoft Tech Community and looks at the AVD pages and the 120-odd replies in this thread!
You would think they would want to put a proper statement out there, if only for damage containment purposes, setting out the issue, the workarounds, the KIR and the timescales on a fix. I think most people who need to know about this are big enough to see beyond headlines, and frankly are probably more interested in the detail, the resolution and how the issue is dealt with. As frustrating as it is, we probably all get that **** happens; what matters more is how issues are dealt with. Had we had a clearly published known issue, with details of the fault, how to mitigate it and timescales, that would have saved the 3 or 4 calls I had at 1:40am every day from the CritSit managers as the 24x7 Sev A case was handed over between regions on a follow-the-sun basis. Clearly the documentation exists internally, as the wording that different people are eventually getting is virtually identical, and it must be costing Microsoft a small fortune to manage these cases when half the engineers you speak to are unaware of the issue.
Probably the best commentary I saw after the Crowdstrike problems in July was from the CTO of one of their major competitors, whose view was basically "there but for the grace of God go I": that kind of low kernel-level bug is probably pretty much every security vendor's worst nightmare, and people who live in glass houses shouldn't throw stones, as the reality is no one is totally immune from the risk of occasionally, unintentionally pushing bad code. What then matters is that you find out why, communicate and remediate it, and learn from it for the future. You just hope it is someone else and not you, and that when it's your turn it turns out to be something fairly cosmetic and patchable. Crowdstrike's biggest headache wasn't that, once they became aware of the issue, they failed to pull it or tried to deny it, or that they terminally broke stuff irrevocably; it was managing to get devices up long enough to connect, pull down and apply the fix before they crashed again. Leaving aside the "how did that get out the door in the first place" issue, arguably they didn't do too bad a job of the "ok, this has happened, this is what we have done to fix it, this is what you need to do to recover" side of things. Yes, it was painful for those having to manually remediate devices, but those people pretty quickly had the information to know what they needed to be doing; it then just became a case of hours in the day and whether anyone could use their out-of-band tooling, like vPro, to streamline the fix.
- dit-chris | Oct 09, 2024 | Brass Contributor
One day you might even be allowed to see my reply to KevHal's message at Oct 09 2024 02:45 PM with that screenshot in... I seem to have been sent to the censors, sorry moderators, probably to be cancelled for not being very glowing about Microsoft's handling of this (not that there was anything that should have been remotely objectionable content-wise). Maybe Microsoft don't like critical analysis of what event log messages say! 😂
- JPlendo | Oct 09, 2024 | Brass Contributor
I don't know the answer to that one. In the mail lab we use our VMs, Teams is there, but doesn't autostart and isn't mandatory. Same with Outlook. I don't think anyone uses Outlook as these are students accessing various software packages for class.
- JPlendo | Oct 09, 2024 | Brass Contributor
We have mentioned that to our TAM... why is there no statement or ack from MS? Is it because it would be another cloud issue? I'm sure they want to shy away from any public statements of something not working after the last month they had.
- Golo33 | Oct 09, 2024 | MCT
For me it's a very random error and I don't know its history on this pool.
- dit-chris | Oct 09, 2024 | Brass Contributor
These Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy issues have been plaguing us for 4 years now in AVD, and I'm not sure something similar didn't apply before that on RDS 2016 and 2019 to AAD authentication and Outlook not authenticating and saying "Needs Password", as the error symptom is the same and the fix is the same: drain the host and reboot it.
- HelenaKohler | Oct 09, 2024 | Copper Contributor
Same here! You can mitigate the impact by rolling out hosts from an old image (19045.4780, August patch level) and disabling further updates. This solved all problems like black screens or app loading issues for us, but it is not a recommended solution at all!
There is no public statement from Microsoft, no mention of a known issue, and I don't even want to start talking about the MS support quality!
- dit-chris | Oct 09, 2024 | Brass Contributor
Same here. We normally log them off and stick the host into drain mode, as usually if you get one failure you get repeated ones.
I am just looking at a newly built session host we put into production yesterday morning: 221 events for Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy from AppModel-State against 7 users over the last couple of days, some of them repeats for the same people. It appears to have had 60 user sessions on that host over, if I am reading it correctly, 89 individual connections.
Looking through those events, at least one of those users had reported an issue with Outlook (although having left it saying Loading Profile for 90 min whilst they did something else in the background, it apparently has now started working. Normally I would expect it to just say Needs Password, with no way to enter it, but maybe no one had left it 90 min for Outlook to open before). However, I have just checked and although Outlook is working OK, neither Teams nor OneDrive is authenticating. Normally my earliest indicator post-login is "has the OneDrive icon gone blue on the solid taskbar or is it stuck with a progress wheel on".
If you read the logs you get the following logged. What I don't know is whether this is actually just "expected" behaviour: that FSLogix breaks it when roaming a user between hosts and it is self-healing automatically (or trying to)?
- ERROR: Failure to load the application settings for package Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy. Error Code: -2147024893
- ERROR:Failure to load the application settings for package Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy. Error Code: -2147024893
- WARNING: Triggered repair of state locations because operation SettingsInitialize against package Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy hit error -2147009096.
- WARNING: Repair of state locations for operation SettingsInitialize against package Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy with error -2147009096 returned Error Code: -2147009275
- WARNING: Triggered repair because operation LocalSettings against package Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy hit error -2147009096.
- WARNING: Repair for operation LocalSettings against package Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy with error -2147009096 returned Error Code: 0
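The error codes in those events are logged as signed 32-bit decimals, which makes them hard to search for. A minimal Python sketch, assuming the codes are HRESULT values (the function names and the regular expression are illustrative, not from any Microsoft tool), that pulls the codes out of a message and shows them in the more familiar hex form:

```python
import re

# Matches "Error Code: -2147024893" as well as "error -2147009096".
# This pattern is an assumption based on the messages quoted in this thread.
ERROR_CODE = re.compile(r"(?i)error(?: code:)?\s+(-?\d+)")

def error_codes(message: str) -> list[int]:
    """Extract every decimal error code mentioned in one event message."""
    return [int(code) for code in ERROR_CODE.findall(message)]

def to_hresult_hex(code: int) -> str:
    """Render a signed 32-bit code as unsigned hex, e.g. 0x80070003."""
    return f"0x{code & 0xFFFFFFFF:08X}"

def split_hresult(code: int) -> tuple[int, int]:
    """Return (facility, low 16-bit code) of an HRESULT-style value."""
    value = code & 0xFFFFFFFF
    return (value >> 16) & 0x7FF, value & 0xFFFF

if __name__ == "__main__":
    # Two of the messages quoted above, copied verbatim.
    messages = [
        "ERROR: Failure to load the application settings for package "
        "Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy. Error Code: -2147024893",
        "WARNING: Repair of state locations for operation SettingsInitialize "
        "against package Microsoft.AAD.BrokerPlugin_cw5n1h2txyewy with error "
        "-2147009096 returned Error Code: -2147009275",
    ]
    for message in messages:
        for code in error_codes(message):
            facility, low = split_hresult(code)
            print(code, to_hresult_hex(code), f"facility={facility}", f"code={low}")
```

For example, -2147024893 comes out as 0x80070003, i.e. facility 7 (Win32) with code 3, which is the standard Win32 "path not found" error. The hex forms are usually easier to search for than the decimals.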
- KevHal | Oct 09, 2024 | Iron Contributor
That isn't practical at all. They really need to come up with some answers quickly.
- djordan1910 | Oct 09, 2024 | Copper Contributor
Same here, but at least it's less pervasive: only a few times a day for some users after doing the KIR. All we're doing is forcing their session to disconnect so FSLogix can have it back, then having them reconnect, which in most cases gets them onto a different host. They log right in with no issues.
- KevHal | Oct 09, 2024 | Iron Contributor
What is the resolution for these issues where Outlook and Teams do not authenticate? This is even after applying the KIR. There must be some kind of solution to this:
I am getting to the point where I'm not sure I can carry on like this. It's been going on far too long.
- JPlendo | Oct 09, 2024 | Brass Contributor
We have a ticket open with MS Support, but they are not providing us with anything decent. They want to see the issue as it happens, but since ours is intermittent, it's hard to give them that. I have asked them repeatedly about the answers some on this thread have gotten from MS and was told they wouldn't have that information as they are only support, which makes 0 sense. They take forever to answer our emails and provide nothing helpful. I've provided them the KBs listed in this thread and the KIR information, and asked for info about Windows 11, and was told to run an update from October 2020 that isn't even compatible with Windows 11. If I could use expletives in the emails I would, I'm so frustrated with them. I don't think their support teams even talk to each other or have any communication between support centers. It's a f'n joke and I am sick and tired of it.
So I don't know what info I will get from the brain trust today; hopefully others have more luck.
Are we sure the October patches released yesterday do NOT contain anything to address this issue?
- Oliver_Krage | Oct 08, 2024 | Brass Contributor
Yes, and if anyone gets the information regarding a KIR for Windows 11 please do share here so others can use it.
- dit-chris | Oct 08, 2024 | Brass Contributor
JPlendo so the installer for the KIR is called “Windows 10 20H2, 21H1, 21H2 and 22H2 KB5040525 241001_01051 Known Issue Rollback.msi”, KB5040525 being the July Preview release that initially introduced the regression causing the issue. That regression has persisted through 8B, 8D, 9B and 9D and is still present in 10B released yesterday, but will be remediated in the 10D release later this month. That earlier KB number confused me at first, as I thought the issue was something introduced in the September 9B release (or the August week D preview release), but it appears the problematic change has been in the code much longer and maybe something else changed last month to “trigger” the issue… that could be a change to an AppX package that it is trying to process, I guess.
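For anyone not used to the 8B/9D shorthand, here is a small Python sketch that turns such a label into a calendar date. It assumes "B" means the second Tuesday of the month (Patch Tuesday) and "D" means the fourth Tuesday (the optional preview); that mapping and the function names are assumptions for illustration, and actual release dates can differ.

```python
from datetime import date, timedelta

# Assumed mapping from week letter to "n-th Tuesday of the month".
WEEK_LETTER_TO_N = {"B": 2, "D": 4}

def nth_tuesday(year: int, month: int, n: int) -> date:
    """Return the n-th Tuesday (1-based) of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, so Tuesday is 1.
    offset = (1 - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

def release_date(label: str, year: int) -> date:
    """Resolve a label such as '9B' or '10D' to a date in the given year."""
    month, letter = int(label[:-1]), label[-1].upper()
    return nth_tuesday(year, month, WEEK_LETTER_TO_N[letter])

if __name__ == "__main__":
    for label in ("7D", "8B", "8D", "9B", "9D", "10B", "10D"):
        print(label, release_date(label, 2024).isoformat())
```

Under those assumptions 9B resolves to 10 September 2024, which matches the "Sept 10th releases" mentioned elsewhere in this thread.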
The way a Known Issue Rollback works is that when you enable it in the GPO it places an entry in the registry. When the machine boots, it checks that known base path for entries which tell the system to basically disable (or enable, for preview code) a specific fix ID; the specific code change that this KIR disables is referenced by the ID in that entry. So in theory you can check for its presence by running, say, REG QUERY against that key. Equally, according to the guidance I had, you could roll out the setting centrally via an AD GPO rather than local group policy, which is perhaps more applicable if you had a KIR for an issue on hundreds of endpoints.
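As an alternative to REG QUERY, the same presence check can be sketched in Python with the standard-library `winreg` module (Windows only). The key path and value name are the ones posted elsewhere in this thread; the function names and the wording of the status messages are hypothetical, and this only reads the registry, it does not apply the KIR.

```python
import sys

# Key and value name as posted in this thread (relative to HKLM).
KEY_PATH = r"SYSTEM\CurrentControlSet\Policies\Microsoft\FeatureManagement\Overrides"
VALUE_NAME = "595276428"

def read_override(key_path: str = KEY_PATH, value_name: str = VALUE_NAME):
    """Return the DWORD stored under the override, or None if it is absent."""
    import winreg  # standard library, only available on Windows
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, value_type = winreg.QueryValueEx(key, value_name)
    except FileNotFoundError:
        # Raised when either the key or the value does not exist.
        return None
    return value if value_type == winreg.REG_DWORD else None

def describe_override(value) -> str:
    """Turn the raw registry result into a short status message."""
    if value is None:
        return "KIR override not found"
    if value == 0:
        return "KIR override present (value 0, as posted in the thread)"
    return f"KIR override present with unexpected value {value}"

if __name__ == "__main__":
    if sys.platform == "win32":
        print(describe_override(read_override()))
    else:
        print("This check only runs on Windows")
```

Run it on a session host after the policy has applied; a "not found" result would suggest the GPO has not reached that machine.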
As you say, the equivalent Windows 11 update released in week 7D has a different KB number to the Windows 10 one, which is why I said I was doubtful of this working based on how the KIR installer is named. But depending on what component they were modifying with that change, it is fairly conceivable that the same fundamental change was applied to both operating systems' code bases, bearing in mind how similar they are under the hood. Whether such a change in Windows 11 would have the same or a different fix/change ID is anyone’s guess.
If you are having similar issues with Windows 11 then I would certainly be pushing support to ask if there is a similar known issue in Windows 11 and a similar KIR as I note the KIR package I have seen appears to reference a Windows 10 KB ID.
- JPlendo | Oct 08, 2024 | Brass Contributor
dit-chris KB5040527 appears to be the July 2024 Cumulative Preview Update for Windows 11. The KB you posted, KB5040525, is the July 2024 Cumulative Preview Update for Windows 10. How is the July 2024 Cumulative Preview Update the KIR? The changes made in that update would roll up with the following months and so on.
Are you saying that reg key value is what to change it BACK to, as the update changed it to something different when it was applied months ago?
Wouldn't that already be installed on machines if you've patched since July?
- dit-chris | Oct 08, 2024 | Brass Contributor
Oliver_Krage sorry that should have been tagging you in that last post 🙂
- dit-chris | Oct 08, 2024 | Brass Contributor
chrissmith585 I suspect it may not be, as the KIR installer is named for KB5040525 and the ADMX says <supportedOn ref="SUPPORTED_Windows_10_0_22H2To20H2_Only" />
Now there might be a different KIR for Windows 11 with a different fix code (or it might be the same), but I haven't seen one or heard anyone reference there being one. Here is what the ADMX file actually adds... but if you were going to give it a try, make sure you take a backup of the machine. That said, as you can deploy a KIR via a centralised Active Directory GPO, I suspect it probably does nothing if it doesn't find a matching fix ID to target; after all, you wouldn't want smoke to start pouring out of a machine if it got put in the wrong GPO or the WMI filter or group filter wasn't applied correctly.
As they say caveat emptor (or should that maybe be caveat lector) - the information below is provided for informational and educational purposes only and if you were to use it you'd do so at your own risk! 😂
Key = HKLM\SYSTEM\CurrentControlSet\Policies\Microsoft\FeatureManagement\Overrides
Type = DWORD
Name = 595276428
Decimal value = 0
- chrissmith585 | Oct 08, 2024 | Copper Contributor
I assume it's W10/11, but I don't know. You would have to test it.