Forum Discussion
JPlendo
Sep 19, 2024 · Brass Contributor
Azure Virtual Desktop - Black Screens on logins - What we've tried so far
TLDR - Azure Virtual Desktop Black Screens. Could be 2 Min long, could be much longer. Tried removing stuck profiles, spun up all new VMs to see if that would fix it, finally disabled an applicatio...
dit-chris
Oct 09, 2024 · Brass Contributor
JPlendo I could just about give them the benefit of the doubt for not updating the known issues lists on the existing release KBs, until yesterday when they pushed out KB5044273 while still making no mention of the issue... They clearly know about it, having acknowledged to people with support cases that the issue would continue in yesterday's week 10B release, while at the same time sending out a KIR (the giveaway is in the name: KIR stands for Known Issue Rollback). That is now just unacceptable. They have an established method in the Known Issues list on the impacted KBs, and also centrally on the Windows Release Health Dashboard here: https://learn.microsoft.com/en-us/windows/release-health/status-windows-10-22H2. Ironically, even with a load of notes on the case from previous engineers over the weekend, I had a support engineer try to tell me there weren't any known issues because he couldn't see it on that very release health dashboard!
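For anyone else triaging this, a quick sanity check is confirming which session hosts actually have the cumulative update in question installed (KB5044273 is the one mentioned above; swap in whichever KB applies to your build). A minimal sketch in Python that wraps the built-in Get-HotFix cmdlet - the helper name is just for illustration, not anything official:

import subprocess

def kb_installed(kb_id: str) -> bool:
    """Return True if the given KB (e.g. KB5044273) appears in Get-HotFix output."""
    result = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", f"Get-HotFix -Id {kb_id}"],
        capture_output=True,
        text=True,
    )
    # Get-HotFix prints a table containing the HotFixID when the update is present;
    # if it is not installed, stdout is empty and the error goes to stderr.
    return kb_id.lower() in result.stdout.lower()

if __name__ == "__main__":
    print("KB5044273 installed:", kb_installed("KB5044273"))

Run it on each session host (or push it out with whatever tooling you already use) so you at least know which image versions you are arguing about when you get handed to the next support engineer.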
Certainly my suspicion too is that they are shying away from publicly acknowledging there is an issue with one of their flagship enterprise desktop virtualisation solutions because it wouldn't be a good look... that said, heaven forbid a prospective AVD customer wanders over to the Microsoft Tech Community, looks at the AVD pages and finds the 120-odd replies in this thread!
You would think they would want to put a proper statement out there, if only for damage containment purposes, setting out the issue, the workarounds, the KIR and the timescales for a fix. Most people who need to know about this are big enough to see beyond headlines - and frankly are probably more interested in the detail, the resolution and how the issue is dealt with. As frustrating as it is, we probably all accept that **** happens; what matters more is how issues are dealt with. Had we had a clearly published known issue, details of the fault, how to mitigate it and timescales, that would have saved the 3 or 4 calls I had at 1:40am every day from the CritSit managers as the 24x7 Sev A case was handed over between regions on a follow-the-sun basis. The documentation clearly exists internally - the wording that different people are eventually getting is virtually identical - and it must be costing Microsoft a small fortune to manage these cases when half the engineers you speak to are unaware of the issue.
Probably the best commentary I saw after the CrowdStrike problems in July was from the CTO of one of their major competitors, whose view was basically "there but for the grace of God go I": a low-level kernel bug like that is pretty much every security vendor's worst nightmare, and people who live in glass houses shouldn't throw stones, because the reality is no one is totally immune from occasionally pushing bad code unintentionally. What matters then is that you find out why, communicate, remediate and learn from it for the future - you just hope it is someone else and not you, and that when it's your turn it turns out to be something fairly cosmetic and patchable. CrowdStrike's biggest headache wasn't that they failed to pull the update once they became aware of the issue, or that they tried to deny it, or that they broke things terminally and irrevocably; it was getting devices to stay up long enough to connect, pull down and apply the fix in the race before they crashed again. Leaving aside the "how did that get out the door in the first place" question, they arguably didn't do too bad a job on the "OK, this has happened, this is what we have done to fix it, this is what you need to do to recover" side of things. Yes, it was painful for those having to manually remediate devices, but those people pretty quickly had the information they needed; it then just became a question of hours in the day and whether anyone could use out-of-band tooling like vPro to streamline the fix.
Marius62991325
Oct 10, 2024 · Copper Contributor
Could not agree more. We're at our wits' end and our MS case is going in circles. We've tried so many different paths just to create workarounds, and it's been fruitless. As soon as we think we've got a 'temp fix', that same 'fix' stops working a few hours later.