SOLVED

Azure AD Endpoint Manager User Profile Corruption: Black Screen Flashing Taskbar Explorer Crash Loop


We are in the midst of an Azure/Endpoint Manager (Intune) migration of 300+ endpoints and are running into a deployment nightmare:

 

We are experiencing a very odd, seemingly random issue: a previously synced Hybrid Azure AD user logs into their endpoint (which had been working without issue for weeks or months) and the profile suddenly fails to load.

This issue only seems to occur when NEW endpoints are added to the Azure AD tenant/domain.

 

We know the issue is about to happen when you receive a call from an end-user stating their previously working credentials are "no longer working".

 

When the user attempts to log in via "other user", the login proceeds, but the user lands on a black desktop/screen with a flashing taskbar.

 

Windows Task Manager is not responsive; Safe-mode options will not produce a better end result.

 

Upon reviewing the logs you will see an "explorer.exe" crash loop pointing to ucrtbase.dll.

 

  • Azure AD homed user accounts and local user accounts are able to login without issue into the endpoint.
  • The issue is specific to Hybrid Azure AD user profiles (on-premises cached/homed accounts). I think it has to do with a conflict involving the on-premises SAM account name. I'm not sure why adding new endpoints to the tenant causes the issue.

This issue is happening across all makes, models, and Windows image variations. It is specific to Azure AD profiles that attempt to log in to the endpoint.

 

Precursors:

Incorrect password prompt. Requires users to select "other user"

After selecting other user, user profile experiences delayed "Welcome"

Black screen appears with flashing taskbar, rendering the profile useless

 

If we attempt a Wipe/Restore the issue will randomly reoccur on another workstation.

 

I believe the issue is specific to the way Windows tries to load/create the profile for Azure AD users. I'm not sure if AutoPilot is attempting to configure these endpoints in Hybrid mode. However, we've noticed discrepancies in the naming convention of some profiles and domains. For example:

 

  • AzureAD\FirstLastName
  • shortdomain\FLast

I believe the User Profile Service is somehow bugged and is causing a mismatch with the SID recorded in the registry for the user profile.

 

Has anyone else experienced this issue? We are desperate for answers; this is worse than any virus, because its random, intermittent nature means it returns even after a fresh system restore.

 

I've received a call from another organization stating they are seeing the same issue occur throughout their deployment. I believe this is now a wide-spread issue.

 

We have a ticket open with Microsoft on this. The Windows Performance Team is reaching out to the Azure team.

49 Replies
Did you have to reinstall/reset the hybrid client PC, and did this fix it? Or disable AD Sync? Any other workarounds/fixes? Thanks.

Hope this helps some still fighting this issue. We had already reimaged the fewer than 20 impacted PCs to restore them to service, but continued to work with MS on troubleshooting one of the devices. Microsoft confirmed that the resolution was related to the 'Default Impersonation Level': changing the setting to the recommended value resolved the issue after the device was restarted. Unfortunately, Microsoft was unable to determine what changed the 'Default Impersonation Level' in the first place, or how.

Per Microsoft, the only way to change the setting is to perform the steps below:
1. Run dcomcnfg.exe.
2. Expand "Component Services".
3. Expand "Computers".
4. Right-click "My Computer" and select "Properties".
5. Switch to the "Default Properties" tab.
6. Under "Default Impersonation Level", select "Identify".
7. Reboot the machine.

Per Microsoft, this cannot be done outside of changing the setting using dcomcnfg.exe.
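
For anyone who wants to see what a machine is currently set to before opening dcomcnfg.exe, here is a minimal read-only sketch. It assumes the "Default Impersonation Level" shown in dcomcnfg is stored in the LegacyImpersonationLevel value under HKLM\SOFTWARE\Microsoft\Ole (1 = Anonymous, 2 = Identify, 3 = Impersonate, 4 = Delegate); that mapping is my assumption, not something Microsoft stated in this thread. The function names are made up for illustration, and nothing here changes the setting.

```powershell
# Translate the assumed LegacyImpersonationLevel DWORD into a readable name.
function ConvertTo-ImpersonationLevelName {
    param([int]$Value)
    switch ($Value) {
        1 { 'Anonymous' }
        2 { 'Identify' }
        3 { 'Impersonate' }
        4 { 'Delegate' }
        default { "Unknown ($Value)" }
    }
}

# Read-only lookup (hypothetical helper); it does not modify the registry.
function Get-DcomImpersonationLevel {
    $key = 'HKLM:\SOFTWARE\Microsoft\Ole'
    $props = Get-ItemProperty -Path $key -ErrorAction Stop
    $prop = $props.PSObject.Properties['LegacyImpersonationLevel']
    if ($null -eq $prop) {
        return 'Not set'
    }
    ConvertTo-ImpersonationLevelName -Value $prop.Value
}
```

Run Get-DcomImpersonationLevel from an elevated PowerShell prompt on an affected machine and on a healthy one to compare the two.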

Thanks for the info. I just reinstalled the 5 laptops in question. Microsoft (via the O365/Azure support portal) provided no help. I gave them the link to this discussion, and they just closed the case after I said I had reinstalled them! It appears to be related to the new Azure Cloud Sync (not AD Connect). The tenant in question used to have AD Connect installed (it was removed a couple of years ago), so I am unsure whether this played a part in the corruption. Anyway, thanks for the workaround. I'm just concerned that MS won't fix the issue, so I will be lumbered with more issues on some of the larger clients where I want to enable Cloud Sync. I imagine I'll just use AD Connect on the larger clients.

Hello,
thanks a lot for the post. I am glad that we are not alone and that the problem is not on our side. Well, of course, I'm not happy that Microsoft, seeing this problem, cannot even triage it.

Briefly about the problem: a problem with the profile (or so it looks).
Temporary (crude) workaround: delete the profile before adding the device to the tenant.

Briefly about my task: we are migrating from one AAD tenant to another. This is my first time participating in a migration project and it may not surprise anyone here, but it seemed to me that ours is not quite ordinary for the following reasons:
- we must disconnect ADsync (that's what I call the Azure AD Connect utility and will continue to call it that in the text) from the old tenant
- change the UPN suffixes for all users in the old tenant so that we can unbind the main domain name, let's call it contuso.com
- bind this domain name to a new tenant
- enable ADsync synchronization with the new tenant

As far as I understand, tenant migrations are usually stretched out over time and usually involve a merger/acquisition with a change in UPN suffixes.
In our case, the on-premises infrastructure remains and, accordingly, so do "On-premises SAM account name" and "On-premises domain name".

So, in order:
- all users are created in the on-premises domain contuso.com,
- synced to AAD (including the aforementioned sAMAccountName etc. attributes)
- then two options:
1) You add devices to the on-premises domain (the device then syncs and becomes hybrid joined). In this scenario, if you remove the device from the domain and add it to AAD, there are no problems (according to my tests). I think this is because joining the domain involves the user logon name (pre-Windows 2000) rather than the UPN.
I was able to remove these devices from the on-premises domain without any problems and add them to the new tenant. As far as I can tell, there are no issues in this migration scenario.
2) You add the device to AAD and use the UPN when you sign in. If you go to the folder with user profiles, you will see that a profile will be created in the SAMAccountName format.
This second option is the one that matters to us, since we have more than 600 people connected this way, and they are exactly the target group that needs to be migrated to the new tenant.
By the way, this problem does not apply to Endpoint Manager (Intune), but is related to AAD in general. Even if you don't have Intune, you will run into this issue.

To test the device migration we did the following:
- deployed new ADsync
- set up intune, aad, licenses, etc. in the new tenant
- defined an OU scope for it, into which we placed several accounts for synchronization
- for the convenience of testing set up a password hash sync
- for convenience, added a simple name contuso.io as an upn for users (so as not to print long and uncomfortable contusonew.onmicrosoft.com)
That is, we got a working migration, except that it will be "io" instead of "com"

When adding new "clean" devices to the new tenant, there are no problems, a profile is created with the same SAMaccountName as in AD. Everything works perfectly.
But, in the case where the device was previously added to the old tenant (there is already a profile), I get the same Explorer problem described in this thread.

And it’s quite funny, but in this blinking state you can open Task Manager and start ProcMon from it, which is what I did to record what happens with Explorer.
Let's imagine that our user is John Snow. Pre-windows 2000 format: Contuso\JSnow.
In the old tenant, your profile folder will be JSnow. When the device is added to a new tenant, a new JSnow.Contuso folder is created; if you have already done this more than once, it will be .001 and so on.
The reason for the crash is that the system tries to write not to the newly created profile folder but to the already existing JSnow folder, for which it lacks sufficient rights (ACLs etc.).
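
To spot this situation on a device without ProcMon, a rough sketch like the one below may help. It reads the profile folder paths from the standard ProfileList registry key and reports pairs where one folder name is another folder name plus a dot suffix (for example JSnow and JSnow.Contuso). Both function names are hypothetical, and a match is only a hint to look closer, not proof that the profile is broken.

```powershell
# Return pairs of profile folders where one name is another name plus ".suffix".
function Find-ShadowedProfileFolder {
    param([string[]]$ProfilePath)
    # Keep only the last path segment, e.g. C:\Users\JSnow -> JSnow.
    $leaves = $ProfilePath | ForEach-Object { ($_ -split '[\\/]')[-1] }
    foreach ($leaf in $leaves) {
        foreach ($other in $leaves) {
            if ($other.StartsWith(($leaf + '.'), [System.StringComparison]::OrdinalIgnoreCase)) {
                [pscustomobject]@{ Original = $leaf; Suffixed = $other }
            }
        }
    }
}

# Read every ProfileImagePath registered on this machine (read-only).
function Get-ProfileListPath {
    $root = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList'
    Get-ChildItem -Path $root | ForEach-Object {
        (Get-ItemProperty -Path $_.PSPath).ProfileImagePath
    }
}
```

Usage on an affected device: Find-ShadowedProfileFolder -ProfilePath (Get-ProfileListPath)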

Deleting the profile (through the profiles menu) helps to solve the problem, but in our case this is a potential loss of data / settings, because our migration plan included migrating the profile using a specialized utility.
We also decided to make an unusual workaround. It works, but it doesn't suit us as a company heavily tied to legacy applications.
Workaround: in ADsync, change the sync rule to append the letter N to sAMAccountName. The profile in the new tenant then becomes JSnowN, so there is no conflict and the user can log in without Explorer blinking.
Cons of this method: Kerberos and NTLM stop working in the on-premises domain, because the account no longer links to an existing user.

We created a support ticket, but to be honest, looking at this post and understanding how Microsoft works, I do not expect quick solutions.
I'm sure someone will benefit from our experiments and hope that Microsoft can fix this problem. Perhaps, after reading this text, you can offer something as a working workaround.
Thanks to all!

@creatoni41245 

 

It's been a few months since we last had a recurrence of the problem. We ended up decommissioning our entire on-premises infrastructure and getting rid of Azure AD Connect and Azure AD Cloud Provisioning/Sync.

 

While Microsoft has not confirmed this, I'm almost positive the UPN mismatch issue is due to having both products installed at the same time or when transitioning from one to the other. Something about having Azure AD Connect and Azure AD Cloud Sync installed messes with Azure and prevents the endpoint from being able to choose which UPN to match the UIDSID when Win Logon is running.

We also ran into this issue and believe that it's a User Profile Service issue too. I don't experience your password/"other user" issue, but since this is the top result when searching for the problem I thought it best to add our solution.

We get this problem after joining existing PCs to Azure AD. You need to log in as an admin, open the User Profiles list (shortcut: rundll32.exe sysdm.cpl,EditUserProfiles) and delete BOTH profiles for your problem users: the old local profile and the Azure one. Make sure you have a backup of any data you need first.

Then log off admin, and the Azure version of the user profile will be able to log on without problems.

We have been experiencing the same issue and in our situation managed to find a fix. On machines with the issue we had two subkeys under the following registry key; on working machines there was only one.

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\IdentityStore\LogonCache\B16898C6-A148-4967-9171-64D755DA8520\SubPkgs

The keys should represent the NetBIOS name of the on-prem domain. In our situation we had ABC and ABC-GLOBAL: ABC was the actual NetBIOS name, while ABC-GLOBAL was the first part of the FQDN (in the domain properties in PowerShell it was the "Name"). Both of these keys had a value for "AuthenticatingAuthorityDns" that matched our domain FQDN, so the assumption here is some kind of conflict. The solution was simply to delete the key that wasn't our domain NetBIOS name (this required taking ownership etc.). After a reboot the users could log in again.
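
A quick way to check a machine for this condition is sketched below. It lists the subkeys under the SubPkgs key quoted above together with their "AuthenticatingAuthorityDns" value, and then groups them to show any DNS name that more than one subkey claims. The function names are hypothetical, the script only reads the registry, and it assumes the value sits directly on each subkey as described.

```powershell
# List each SubPkgs subkey with its AuthenticatingAuthorityDns value (read-only).
function Get-LogonCacheSubPkg {
    $path = 'HKLM:\SOFTWARE\Microsoft\IdentityStore\LogonCache\B16898C6-A148-4967-9171-64D755DA8520\SubPkgs'
    Get-ChildItem -Path $path -ErrorAction Stop | ForEach-Object {
        $props = Get-ItemProperty -Path $_.PSPath
        [pscustomobject]@{
            SubKey                     = $_.PSChildName
            AuthenticatingAuthorityDns = $props.PSObject.Properties['AuthenticatingAuthorityDns'].Value
        }
    }
}

# Return groups of subkeys that point at the same authority DNS name.
function Find-DuplicateAuthority {
    param([object[]]$Entry)
    $Entry |
        Where-Object { $_.AuthenticatingAuthorityDns } |
        Group-Object -Property AuthenticatingAuthorityDns |
        Where-Object { $_.Count -gt 1 }
}
```

Usage from an elevated prompt: Find-DuplicateAuthority -Entry (Get-LogonCacheSubPkg). An empty result means no two subkeys share an authority on that machine.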

The value for netBIOSName in Azure AD is populated using %Domain.Netbios% for AD Connect and %DomainNetBios% in Cloud Sync. I've not been able to find any more information about how these values are obtained. I can only assume that Cloud Sync isn't using exactly the same value as AD Connect. Maybe this only has an impact where the "Name" and "NetBIOSName" (from Get-ADDomain) are different.
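
To check whether a domain falls into that category, a small sketch follows. The helper name is made up for illustration; the comparison itself is just a case-insensitive string check on the two properties that Get-ADDomain returns, and Get-ADDomain needs the ActiveDirectory module (RSAT) on the machine where you run it.

```powershell
# True when the domain "Name" and "NetBIOSName" differ (case-insensitive).
function Test-DomainNameMismatch {
    param(
        [string]$Name,
        [string]$NetBIOSName
    )
    -not [string]::Equals($Name, $NetBIOSName, [System.StringComparison]::OrdinalIgnoreCase)
}

# Example on a domain-joined machine with the ActiveDirectory module:
#   $d = Get-ADDomain
#   Test-DomainNameMismatch -Name $d.Name -NetBIOSName $d.NetBIOSName
```

With the names from this post, Test-DomainNameMismatch -Name 'ABC-GLOBAL' -NetBIOSName 'ABC' returns True, which would be the case to watch before moving to Cloud Sync.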

Anyway, that was my experience. I hope this helps people in the future or gives people pointers to allow Cloud Sync migrations without this issue.

best response confirmed by Edmundo Pena (Brass Contributor)
Solution

Daniel,

 

We just discovered the same thing and rolled out a fix for it in our environment. For users with an email address in on-prem AD, Azure AD Connect Sync was creating the accounts in Azure online with the pre-Windows 2000 NetBIOS domain name which matches the pre-Windows 2000 NetBIOS user logon name. However, for those without an email, it was creating the account in Azure with the subdomain of the domain FQDN instead of the pre-Windows 2000 name as specified on the account or in Domains and Trusts. Azure AD Cloud Sync was trying to update all accounts to the subdomain and completely ignoring the pre-Windows 2000 names entirely.

 

As far as experiencing the taskbar issue, once it occurred for one account on the machine, it would then impact all accounts on the machine both pre-existing and new sign-ins. However, accounts that did not have an AD mail attribute would not experience the issue. We found the same SubPkgs key and those that were in the NetBIOS subkeys would have the taskbar, permission, and general SID mismatch errors but those that were in the subdomain subkey would not.

 

We shut down our Azure AD Connect and are now relying entirely on Cloud Sync. Then, to fix the machines without a reimage, we performed a full Cloud Sync and then ran the following PowerShell script on Azure AD joined machines to clean up the broken accounts. This allowed users to sign in fresh with the subdomain instead of NetBIOS prefix and the issue has not reoccurred, but it's only been about a week.

 

 

# Collect cached Azure AD profiles (SIDs starting with S-1-12-1-), skipping special profiles and the current user.
$CurrentUserSID = (C:\Windows\System32\whoami.exe /User /Fo CSV | ConvertFrom-Csv).SID
$CachedAccounts = Get-CimInstance -ClassName Win32_UserProfile | Where-Object { (!$_.Special) -and ($_.SID -like 'S-1-12-1-*') -and ($_.SID -notlike $CurrentUserSID) }
foreach ($Account in $CachedAccounts) {
    $SIDtoUser = $null
    $SID = New-Object System.Security.Principal.SecurityIdentifier($Account.SID)
    try {
        # If the SID still translates to an account name, take the profile off the removal list.
        $SIDtoUser = $SID.Translate([System.Security.Principal.NTAccount])
        Write-Host "Removing $SIDtoUser from list of cached accounts."
        if ($null -ne $SIDtoUser) {
            $CachedAccounts = @($CachedAccounts | Where-Object SID -ne $SID.Value)
        }
    } catch {
        # Translation failed, so the profile stays on the removal list.
        Write-Host "Unable to translate SID ($SID) to user."
    }
}
# Show what is left and ask for confirmation before deleting anything.
if ($CachedAccounts.Count -gt 0) {
    Write-Host 'Accounts to be removed:'
    $CachedAccounts | Select-Object LocalPath, SID | Format-Table
    $Confirmation = Read-Host "Do you want to remove those accounts? (Yes or No)"
    if ($Confirmation.ToLower() -like "y*") {
        Write-Host "Removing accounts..."
        # Removing the Win32_UserProfile instance deletes the cached profile.
        $CachedAccounts | Remove-CimInstance -Verbose
    } else {
        Write-Host 'Accounts not removed.'
    }
} else {
    Write-Host 'No accounts to remove.'
}

 

 

Hopefully, this script and info help someone else.

 

Rexford

Thank you to the Microsoft Community for finally shedding light on this issue. Great work team!

@Daniel_Gatley I'm trying to take ownership of the key, but I still receive an error saying the key can't be deleted. It's frustrating that MS can't figure this out.