Anyone experiencing session hosts becoming unavailable at random

Since the end of last week we have had three occasions where one of the session hosts randomly becomes unavailable. This happened in two separate AVD environments.

  • Users get kicked out of their session and cannot reconnect.
    • The user sessions are still marked as Active/Disconnected according to the Azure portal.
  • We cannot RDP to the session host through the internal network.

After we shut down and restart the session host, everything works fine again.

We noticed the following notable things:

  1. There are no event logs generated at all, starting 30-60 min prior to the 'crash'.
  2. Since the 28th of October, Event Viewer has been spammed with the following warning:
    1. Microsoft.RDInfra.RDAgent.Service.AgentUpdateStateImpl
      1. Unexpected last recorded state
  3. The "Remote Desktop Services Infrastructure Agent" was updated on the 25th of October, to version 1.0.5555.1008
  4. The "Remote Desktop Services SxS Network Stack" was updated on the 31st of October, to version 1.0.2208.17300
    1. This is also the first day that we experienced the problem.
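
For anyone who wants to check their own hosts for the silent window described in point 1, here is a minimal sketch. It assumes you have already exported the event timestamps from the affected host (for example from Event Viewer) and can feed them in as ISO 8601 strings. The function name `find_log_gaps`, the 30-minute threshold, and the sample timestamps are all made up for illustration and are not taken from our logs.

```python
from datetime import datetime, timedelta


def find_log_gaps(timestamps, min_gap=timedelta(minutes=30)):
    """Return (gap_start, gap_end, duration) for every pause between
    consecutive events that lasts at least min_gap."""
    ordered = sorted(timestamps)
    gaps = []
    # Compare each event with the one that follows it.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier >= min_gap:
            gaps.append((earlier, later, later - earlier))
    return gaps


# Hypothetical sample data: event times from one session host.
sample = [
    "2000-01-01T08:00:00",
    "2000-01-01T08:05:00",
    "2000-01-01T08:10:00",
    "2000-01-01T09:05:00",  # first event logged after the reboot
]
events = [datetime.fromisoformat(s) for s in sample]

for start, end, length in find_log_gaps(events):
    print(f"No events between {start} and {end} ({length})")
```

A gap that ends at the reboot time would match what we saw. A gap at some other time may simply mean the host was idle, so treat the output as a hint, not as proof.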

 

I have yet to find anything on this problem. Is anyone else experiencing this with their AVD environments?

92 Replies
Hi Dmvinay85, we had a similar issue yesterday, but I'm not sure it's the same problem as in this post, since we had that one before and this seems different.

Please see this post: https://techcommunity.microsoft.com/t5/azure-virtual-desktop/some-session-hosts-became-unavailable-t...

I also logged a ticket. So far so good today, but we really need to get a handle on the issue.
I think it's caused by certain redirected client hardware. I turned off USB Redirection and it looks much better so far.

@AndreasJ5325 we also have this issue running Citrix on multi-user Windows 10 in Azure. Out of 50+ session hosts running per day, we get 2 or 3 issues per week. The session host becomes unavailable in Citrix and RDP won't work. We can still access the machine via the UNC path to c$ and reach its services remotely. The machine isn't completely dead, but all sessions on it freeze. The event logs usually show a Terminal Services svchost failure. I have been working with MS on a ticket for a few weeks, with logs and memory dumps. The host needs to be rebooted to bring it back to a working state. It could be down to certain things we redirect, as we also usually see audiobuilder service issues around the same time, but that could just be because termservice is dead. The issue is completely random and intermittent, though: we went a whole month between December and January with no issues, and then it started up again in mid-January. We've had these issues since November.

@steveturnbull1975

Any solution? Our agent is 1.0.6028.2200 and we have the same problem. Hello?

We are not seeing this issue, so the first step is to open a ticket with Microsoft. We have no issue on 40-50 hosts with 1.0.6028.2200. It could very well be that the symptoms are the same but the underlying issue is different.

@PioWi Still waiting for MS to come back after analysing the full memory dump.

@steveturnbull1975

Please confirm whether you see the same thing: once the pool deallocates a VM, it is not possible to power it on again, but when you power it on manually it rejoins the pool. The agent is the newest, 1.60 something. I have also tried Start VM on Connect, but with no result.

Did you ever solve this?

@EricT8 Still working through this with MS. Funnily enough, their latest update appears to point to a HW failure on the AMD EPYC 7763 64-bit CPU that runs in their servers, caused by the GPU driver that's integrated with the CPU.

They advise updating the graphics drivers on the VM, but that's not possible since our VMs are not GPU-enabled and just use the Citrix display adapter.

So something running on the VM could be causing the HW failure with the type of AMD CPU used in the hosts.

They've asked for a couple more things to check. The failures are totally random, though; we haven't had any for a week or two. I wouldn't be surprised if some of the hosts in Azure are missing updated HW or GPU drivers, and sometimes the VMs just happen to run on an out-of-date one, causing the failure.

Thanks for the info. I tried switching to Intel after seeing your message but it didn't help at all, so perhaps don't waste too much time on that avenue. I'll let you know if we find anything.
Must be Microsoft going down another rabbit hole. With all these issues of VMs crashing on AVD, we still have VMs on-prem on 2k19 and never see any of these failure types.
How do you access your profiles? Azure Blob/Cloud Cache or a share with SMB?
Is there any more information about this case yet? I have the same issue with my AVD environment.
This issue has long been resolved, so whatever issue you have, I'd suggest creating a new topic, as it's probably a totally different issue.

@KristofH Great! But I cannot find the solution in this topic...

I don't see the solution in the replies either. The closest is to disable USB redirection. We still see the problem: RDAgent randomly shuts down unexpectedly. The log says this has happened 139 times!
This ticket was about an issue that was caused by an internal Microsoft Azure thing. Your issues have nothing to do with this. Best thing to do is open a ticket with Microsoft as they can help troubleshoot.
Thank you! I already submitted a ticket and I will create a new post.
I have been experiencing this issue for a couple of months now. I contacted Microsoft support and they noticed we had both scheduled updates and the validation environment turned on. They advised that we turn off those features, and I am currently monitoring to see if it goes down again. It has been really stressful having our AVD environment randomly go down and only recover after a restart.