Azure Virtual Desktop - Black Screens on logins - What we've tried so far

dit-chris
Brass Contributor
Oct 15, 2024
We've got a one-week-old session host, completely rebuilt from the ground up, that has been causing us problems every day for the last couple of days: AAD Broker issues, black screens, login delays. I'm even going to double-check the KIR is definitely present on it! 🤯
HelenaKohler
Copper Contributor
Oct 15, 2024
Microsoft finally announced there is an issue, but it’s weird to me because the mentioned KIR did not solve the issue entirely (at least for us).

https://learn.microsoft.com/en-us/windows/release-health/status-windows-10-22h2#3421msgdesc
dit-chris
Brass Contributor
Oct 10, 2024
JPlendo my understanding is that the regression the KIR targets was introduced in the July Preview/August Patch Tuesday releases and relates to the AppX/AppReadiness services. What I don't understand is why it didn't seem to start impacting people until mid-to-late September if the code had been out there for six weeks. My suspicion is that maybe it did hit people from August's preview, but not in big enough numbers. We (like, I imagine, most here) don't apply Week D Preview releases to our AVD hosts and aren't exactly "fast" on the Week B Patch Tuesday stuff unless there are things we really need to target, so we didn't roll the Sept 10th releases out to production until the evening of Friday 20th.

Whilst the underlying issue the KIR targets isn't being fixed until October Week D/November Week B, the question is what else changed in September. My suspicion is that whilst the service itself was broken back in July/August, its job is to process AppX packages, and maybe some AppX package (which will be some core component of Windows, as we don't have any others, MS-only standard stuff) was not modified in August, so the AppX service didn't have to process a redeployment. But let's say, hypothetically, the calculator app was modified from v1.1 to v1.2 in September: then the AppX service needed to do stuff, and maybe something in that AppX package triggered, say, a race condition in the AppX/AppReadiness service. So whilst the underlying issue in the service code is still there, it's possible that the October release also includes in it, or alongside it, updates to those system component/in-box AppX packages that the service was choking on. It all seems a bit "well, try it, it might make things better, it might make things worse, we'll let you test it for us and let us know how it goes... but just in case, make sure you take a backup so you can revert your environment".
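
One way to test the "which in-box package changed in September" hypothesis above would be to compare package inventories taken from a host at two patch levels. This is only a minimal sketch: it assumes you have already captured name/version pairs yourself (for example from the Name and Version properties in `Get-AppxPackage -AllUsers` output), and the package names and versions below are made up for illustration, not real in-box packages.

```python
def diff_package_versions(before, after):
    """Return {name: (old_version, new_version)} for every package that was
    added, removed, or changed between two inventories.
    A version of None means the package was absent from that inventory."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Hypothetical inventories: names and versions are illustrative only.
august = {
    "Contoso.Calculator": "1.1.0.0",
    "Contoso.Photos": "3.0.0.0",
}
september = {
    "Contoso.Calculator": "1.2.0.0",
    "Contoso.Photos": "3.0.0.0",
    "Contoso.Notes": "1.0.0.0",
}

for name, (old, new) in diff_package_versions(august, september).items():
    print(f"{name}: {old or 'absent'} -> {new or 'absent'}")
```

Any package that shows up in the diff is only a candidate for what the AppX/AppReadiness service had to reprocess; it doesn't prove that package is the trigger.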
JPlendo
Brass Contributor
Oct 10, 2024
dit-chris Agreed. It's really funny that they will admit there is an issue when talking to them on the phone; our TAM also told us they acknowledge there is an issue. But posting it to their known issues site would require them to admit publicly to another issue with AVD after the multiple outages we had in September. Besides the fact that this is a publicly traded company, Microsoft has all these users and companies paying BIG money for support contracts, and they aren't doing their part by giving full information when issues arise or showing any sense of urgency to get them fixed. It's almost like Microsoft knows how big they are and knows people can't really leave, as non-IT folks aren't going to learn a whole new OS to do their jobs, so Microsoft is kind of saying "deal with these issues until we see fit to go fix them". It wasn't always this way. We used to get good support from them. If the issue is the folks Microsoft hands its support over to, then that needs to change, as they aren't doing the job they are supposed to. We had a Sev A that has now been dropped down to a B. I send emails to our support and sometimes don't hear back for days. I ask a bunch of questions and all I get in reply is "let us know when the issue occurs so we can look at it". I am sending them information on what is happening as well as questions on things we would like to try, but I get no answers to those questions. I even submitted a survey about how bad their support has been with us recently. I got an email reply telling me how much they care about support and will work to make it better... basically a form letter. Frustrated isn't the word I'd use for how I feel.

All that aside, we were told to install the October patches on our VMs and see how things went. But I see you were told the October patches will NOT affect or fix these black screen issues, right?
Marius62991325
Copper Contributor
Oct 10, 2024
dit-chris
Could not agree more. We're at our wits' end and the MS case is going in circles. Trying so many different paths just to create workarounds has been fruitless. As soon as we think we've got a 'temp fix', that same 'fix' no longer works a few hours later.
dit-chris
Brass Contributor
Oct 09, 2024
JPlendo I could just about have given them the benefit of the doubt for not updating the known issues list on the existing release KBs, until yesterday, when they pushed out KB5044273 whilst still making no mention of the issue. Clearly they know about it, having acknowledged to people with support cases that the issue would continue in yesterday's 10B release, whilst at the same time sending out a KIR (the giveaway is in the name, given that KIR stands for KNOWN ISSUE Rollback). That is now just unacceptable; they have an established method in the Known Issues list on the impacted KBs and also centrally on the Windows Release Health Dashboard here https://learn.microsoft.com/en-us/windows/release-health/status-windows-10-22H2. Ironically, even with a load of notes on the case from previous engineers at the weekend, I had a support engineer try to tell me there weren't any known issues because he couldn't see it on that very release health dashboard!

Certainly my suspicion too is that they are shying away from publicly acknowledging there is an issue with one of their flagship enterprise desktop virtualisation solutions because it wouldn't be a good look... that said, heaven forbid a prospective AVD customer wanders over to the Microsoft Tech Community and looks at the AVD pages and the 120-odd replies in this thread!

You would think they would want to put a proper statement out there, if only for damage containment purposes, setting out the issue, the workarounds, the KIR and the timescales for a fix. I think most people who need to know about this are big enough to see beyond headlines, and frankly are probably more interested in the detail, the resolution and how the issue is dealt with. As frustrating as it is, we probably all get that **** happens; what matters more is how issues are dealt with. Had we had a clearly published known issue, with details of the fault, how to mitigate it and timescales, that would have saved the 3 or 4 calls I had at 1:40am every day from the CritSit managers as the 24x7 Sev A case was handed over between regions on a follow-the-sun basis. Clearly the documentation exists internally, as the wording that different people are eventually getting is virtually identical, and it must be costing Microsoft a small fortune to manage these cases when half the engineers you speak to are unaware of the issue.

Probably the best commentary I saw after the CrowdStrike problems in July was from the CTO of one of their major competitors, whose view was basically "there but for the grace of God go I": having that kind of low, kernel-level bug is probably pretty much every security vendor's worst nightmare, and people who live in glass houses shouldn't throw stones, as the reality is that no one is totally immune from the risk of occasionally, unintentionally pushing bad code. What then matters is that you find out why, communicate, remediate it and learn from it for the future. You just hope it is someone else and not you, and that when it's your turn it turns out to be something fairly cosmetic and patchable. CrowdStrike's biggest headache wasn't that they failed to pull the update once they became aware of the issue, or tried to deny it, or terminally broke stuff irrevocably; it was getting devices up long enough to connect, pull down and apply the rollback before they crashed again. Leaving aside the "how did that get out the door in the first place" issue, arguably they didn't do too bad a job of dealing with the "ok, this has happened, this is what we have done to fix it, this is what you need to do to recover" side of things. Yes, it was painful for those having to manually remediate devices, but those people pretty quickly had the information to know what they needed to be doing; it then just became a case of hours in the day and whether anyone could use their out-of-band tooling, like vPro etc., to streamline the fix.
