FastTrack for Azure

Troubleshooting Azure Stack HCI 23H2 Preview Deployments

Microsoft

Jan 22, 2024

The Azure Stack HCI 23H2 preview release significantly changes how clusters are deployed, enabling low-touch deployments at edge sites. Running these deployments at customer sites or in lab environments may require some troubleshooting while kinks in the process are ironed out. This post offers guidance on that troubleshooting.

The following is written using a rapidly changing preview release, based on field and lab experience. We’re focused on how to start troubleshooting, rather than digging into specific issues you may encounter.

Understanding the deployment process

Deployment is completed in two steps: first, the target environment and configuration are validated; then the validated configuration is applied to the cluster nodes by a deployment. When the validation step completes, a ‘deploymentSettings’ sub-resource is created on your HCI cluster Azure resource. Ideally, validation catches any issue with the configuration, but this is not always the case, so you may work through validation issues only to hit further issues during deployment. We’ll start with tips on working through validation issues, then move to deployment issues.

Logs Everywhere!

When you run into errors in validation or deployment, the error passed through to the Portal may not have enough information or context to understand exactly what is going on. To get to the details, we frequently need to dig into the log files on the HCI nodes. The validation and deployment processes pull in components used in Azure Stack Hub, resulting in log files in various locations, but most logs are on the seed node (the first node when sorted by name).

Viewing Logs on Nodes

When connected to your HCI nodes with Remote Desktop, Notepad is available for opening log files and checking contents. Another useful trick is to use the PowerShell Get-Content command with the -wait parameter to follow a log and -last parameter to show only recent lines. This is especially helpful to watch the CloudDeployment log progress. For example:

Get-Content C:\CloudDeployment\Logs\CloudDeployment.2024-01-20.14-29-13.0.log -wait -last 150
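Because the log file name embeds a timestamp, it can be easier to pick the newest file than to type the name. The following is a minimal sketch, assuming the logs sit under C:\CloudDeployment\Logs as in the example above; Get-LatestLog is a hypothetical helper name, not part of any HCI module.

```powershell
# Hypothetical helper: returns the most recently written file matching a filter.
function Get-LatestLog {
    param(
        [string]$Path = 'C:\CloudDeployment\Logs',
        [string]$Filter = 'CloudDeployment*.log'
    )
    Get-ChildItem -Path $Path -Filter $Filter -File |
        Sort-Object -Property LastWriteTime -Descending |
        Select-Object -First 1
}

# On a node, follow the newest deployment log (Ctrl+C to stop):
# Get-Content -Path (Get-LatestLog).FullName -Wait -Tail 150
```

The -Tail parameter used here is the same parameter as -last in the example above; -Last is its alias.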

Log File Locations

The table below describes important log locations and when to look in each:

Path	Content	When to use…
C:\CloudDeployment\Logs\summary*.xml	Summary of deployment status	A good starting place for rollup deployment status and events
C:\CloudDeployment\Logs\CloudDeployment*	Output of deployment operation	This is the primary log to monitor and troubleshoot deployment activity. Look here when a deployment fails or stalls
C:\CloudDeployment\Logs\EnvironmentValidatorFull*	Output of validation run	When your configuration fails a validation step
C:\MASLogs\LCMECELiteLogs\InitializeDeploymentService*	Logs related to the Life Cycle Manager (LCM) initial configuration	When you can’t start validation, the LCM service may not have been fully configured
C:\ECEStore\MASLogs	PowerShell script transcript for ECE activity	Shows more detail on scripts executed by ECE—this is a good place to look if CloudDeployment shows an error but not enough detail
C:\CloudDeployment\Logs\cluster\* C:\Windows\Temp\StorageClusterValidationReport*	Cluster validation report	Cluster validation runs when the cluster is created; when validation fails, these logs tell you why
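When a log is too long to read top to bottom, searching for error text can narrow things down. This is a minimal sketch: Find-LogError is a hypothetical helper, and the default search pattern is an assumption, so adjust it to match the text of the error shown in the Portal.

```powershell
# Hypothetical helper: returns the last N lines matching a pattern across log files.
function Find-LogError {
    param(
        [string]$Path = 'C:\CloudDeployment\Logs\CloudDeployment*.log',
        [string]$Pattern = 'exception|error',   # assumed pattern; Select-String is case-insensitive by default
        [int]$Last = 20
    )
    Select-String -Path $Path -Pattern $Pattern |
        Select-Object -Last $Last -Property Filename, LineNumber, Line
}

# On a node:
# Find-LogError
# Find-LogError -Path 'C:\CloudDeployment\Logs\EnvironmentValidatorFull*' -Last 50
```

The line numbers in the output make it easy to open the file in Notepad and read the surrounding context.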

Retrying Validations and Deployments

Retrying Validation

In the Portal, you can usually retry validation with the “Try Again…” button. If you are using an ARM template, you can redeploy the template. During the Validation stage, your node is running a series of scripts and checks to ensure it is ready for deployment. Most of these scripts are part of the modules found here: C:\Program Files\WindowsPowerShell\Modules\AzStackHci.EnvironmentChecker. Sometimes it can be insightful to run the modules individually, with verbose or debug output enabled.
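To see which checks the module exposes before running any of them individually, the exported commands can be listed. This is a minimal sketch; Get-ModuleCommandName is a hypothetical helper, and the default module name is taken from the path above.

```powershell
# Hypothetical helper: lists the names of the commands a module exports.
function Get-ModuleCommandName {
    param(
        [string]$ModuleName = 'AzStackHci.EnvironmentChecker'
    )
    Get-Command -Module $ModuleName |
        Sort-Object -Property Name |
        Select-Object -ExpandProperty Name
}

# On a node:
# Get-ModuleCommandName
```

From the resulting list, Get-Help with the -Full switch shows each command's parameters. -Verbose and -Debug are standard PowerShell common parameters, so they only add output where the command itself writes verbose or debug messages.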

Retrying Deployment

The ‘deploymentSettings’ resource under your cluster contains the configuration to deploy and is used to track the status of your deployment. Sometimes it can be helpful to view this resource; an easy way to do this is to navigate to your Azure Stack HCI cluster in the Portal and append ‘deploymentsettings/default’ after your cluster name in the browser address bar.
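The same resource can also be read from PowerShell. The sketch below only builds the resource ID; it assumes the cluster lives under the Microsoft.AzureStackHCI/clusters resource type and that the sub-resource is named 'default', matching the Portal URL suffix above. Get-DeploymentSettingsId is a hypothetical helper name.

```powershell
# Hypothetical helper: builds the ARM resource ID of the deploymentSettings sub-resource.
function Get-DeploymentSettingsId {
    param(
        [Parameter(Mandatory)][string]$SubscriptionId,
        [Parameter(Mandatory)][string]$ResourceGroupName,
        [Parameter(Mandatory)][string]$ClusterName
    )
    "/subscriptions/$SubscriptionId/resourceGroups/$ResourceGroupName" +
        "/providers/Microsoft.AzureStackHCI/clusters/$ClusterName/deploymentSettings/default"
}

# With the Az.Resources module installed and a signed-in session (Connect-AzAccount):
# $id = Get-DeploymentSettingsId -SubscriptionId '<sub-id>' -ResourceGroupName '<rg>' -ClusterName '<cluster>'
# Get-AzResource -ResourceId $id -ExpandProperties | ConvertTo-Json -Depth 20
```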

Image 1 - the deploymentSettings Resource in the Portal

From the Portal

In the Portal, if your Deployment stage fails part-way through, you can usually restart the deployment by clicking the ‘Rerun Deployment’ button under Deployments at the cluster resource.

Image 2 - access the deployment in the Portal so you can retry

Alternatively, you can navigate to the cluster resource group deployments. Find the deployment matching the name of your cluster and initiate a redeploy using the Redeploy option.

Image 3 - the 'Redeploy' button on the deployment view in the Portal

If Azure or the Portal shows your deployment as still in progress, you won’t be able to start it again until you cancel it or it fails.

From an ARM Template

To retry a deployment when you used the ARM template approach, resubmit the deployment. With this approach, you submit the same template twice: once with deploymentMode: “Validate” and again with deploymentMode: “Deploy”. To retry validation, use “Validate”; to retry deployment, use “Deploy”.
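If you submit the template with Az PowerShell, the resubmission might look like the sketch below. It assumes deploymentMode is exposed as a template parameter (so it can be overridden inline), and the file names are placeholders; Get-HciDeploymentArgs is a hypothetical helper that only assembles the arguments.

```powershell
# Hypothetical helper: assembles arguments for New-AzResourceGroupDeployment.
function Get-HciDeploymentArgs {
    param(
        [Parameter(Mandatory)][string]$ResourceGroupName,
        [Parameter(Mandatory)][ValidateSet('Validate', 'Deploy')][string]$DeploymentMode,
        [string]$TemplateFile = '.\azuredeploy.json',                      # assumed file name
        [string]$TemplateParameterFile = '.\azuredeploy.parameters.json'   # assumed file name
    )
    @{
        ResourceGroupName     = $ResourceGroupName
        TemplateFile          = $TemplateFile
        TemplateParameterFile = $TemplateParameterFile
        deploymentMode        = $DeploymentMode   # assumed template parameter, passed inline
    }
}

# With the Az.Resources module installed and a signed-in session:
# $deployArgs = Get-HciDeploymentArgs -ResourceGroupName '<rg>' -DeploymentMode 'Validate'
# New-AzResourceGroupDeployment @deployArgs
```

An inline parameter value takes precedence over the same parameter in the parameter file, so one parameter file can serve both the “Validate” and the “Deploy” submission.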

Image 4 - ARM template showing deploymentMode setting

Locally on the Seed Node

[Starting deployment manually from the seed node when the deployment is not in an in-progress state in Azure is not supported. Instead, open a support case.]

~~In most cases, you’ll want to initiate deployment, validation, and retries from Azure. This ensures that your deploymentSettings resource is at the same stage as the local deployment.~~

However, in some instances, the deployment status as Azure understands it becomes out of sync with what is going on at the node level, leaving you unable to retry a stuck deployment. For example, Azure has your deploymentSettings status as “Provisioning” but the logs in CloudDeployment show the activity has stopped and/or the ‘LCMAzureStackDeploy’ scheduled task on the seed node is stopped. In this case, you may be able to rerun the deployment by restarting the ‘LCMAzureStackDeploy’ scheduled task on the seed node:

~~Start-ScheduledTask -TaskName LCMAzureStackDeploy~~

~~If this does not work, you may need to delete the deploymentSettings resource and start again. See: The big hammer: full reset.~~
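Before opening a support case, it can help to record what state the scheduled task is in. The sketch below is read-only and does not start or change the task; the task name comes from this post, and Get-DeployTaskStatus is a hypothetical helper name.

```powershell
# Hypothetical helper: read-only status of the deployment scheduled task on the seed node.
function Get-DeployTaskStatus {
    param(
        [string]$TaskName = 'LCMAzureStackDeploy'
    )
    $task = Get-ScheduledTask -TaskName $TaskName
    $info = $task | Get-ScheduledTaskInfo
    [pscustomobject]@{
        TaskName       = $task.TaskName
        State          = $task.State            # for example Ready, Running, or Disabled
        LastRunTime    = $info.LastRunTime
        LastTaskResult = $info.LastTaskResult   # 0 generally indicates success
    }
}

# On the seed node:
# Get-DeployTaskStatus
```

Comparing this output with the deploymentSettings status in Azure shows whether the two are out of sync, which is useful detail to include in the support case.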

Advanced Troubleshooting

Invoking Deployment from PowerShell

[Invoking deployment locally on the seed node when the deployment is not in an in-progress state is unsupported because it can cause Azure and the local deployment to become out of sync. Depending on where you are in the deployment, it can take quite a while for the status to update in the Portal. Instead, open a support case.]

Although deployment activity has lots of logging, sometimes either you can’t find the right log file or seem to be missing what is causing the failure. In this case, it is sometimes helpful to retry the deployment directly in PowerShell, executing the script which is normally called by the Scheduled Task mentioned above.

Local Group Membership

In a few cases, we’ve found that the local Administrators group on the cluster nodes does not get populated with the necessary domain and virtual service account users. The issues this causes have been difficult to track down through logs, and they likely have a root cause which will soon be addressed.

Check group membership with: Get-LocalGroupMember Administrators

Add group membership with: Add-LocalGroupMember Administrators -Member <domain\username|local username|SID>[,…]

Here’s what we expect on a fully deployed cluster:

Type	Accounts	Comments
Domain Users	DOMAIN\<LCMUser>	This is the domain account created during AD Prep and specified during deployment
Local Users	AsBuiltInAdmin (renamed from Administrator) ECEAgentService HCIOrchestrator	These accounts don’t exist initially but are created at various stages during deployment. Try adding them—if they are not provisioned, you’ll get a message that they don’t exist.
Virtual Service Accounts	S-1-5-80-1219988713-3914384637-3737594822-3995804564-465921127 S-1-5-80-949177806-3234840615-1909846931-1246049756-1561060998 S-1-5-80-2317009167-4205082801-2802610810-1010696306-420449937 S-1-5-80-3388941609-3075472797-4147901968-645516609-2569184705 S-1-5-80-463755303-3006593990-2503049856-378038131-1830149429 S-1-5-80-649204155-2641226149-2469442942-1383527670-4182027938 S-1-5-80-1010727596-2478584333-3586378539-2366980476-4222230103 S-1-5-80-3588018000-3537420344-1342950521-2910154123-3958137386	These are the SIDs of the various virtual service accounts used to run services related to deployment and continued lifecycle management. The SIDs seem to be hard coded, so these can be added any time. When these accounts are missing, there are issues as early as the JEA deployment step.
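To compare a node against the table above, the expected SIDs can be checked against the current group membership. This is a minimal sketch: the SID list is copied from the table, and Get-MissingSid is a hypothetical helper that only reports differences and changes nothing.

```powershell
# Virtual service account SIDs, copied from the table above.
$expectedSids = @(
    'S-1-5-80-1219988713-3914384637-3737594822-3995804564-465921127'
    'S-1-5-80-949177806-3234840615-1909846931-1246049756-1561060998'
    'S-1-5-80-2317009167-4205082801-2802610810-1010696306-420449937'
    'S-1-5-80-3388941609-3075472797-4147901968-645516609-2569184705'
    'S-1-5-80-463755303-3006593990-2503049856-378038131-1830149429'
    'S-1-5-80-649204155-2641226149-2469442942-1383527670-4182027938'
    'S-1-5-80-1010727596-2478584333-3586378539-2366980476-4222230103'
    'S-1-5-80-3588018000-3537420344-1342950521-2910154123-3958137386'
)

# Hypothetical helper: returns the expected SIDs that are absent from the actual list.
function Get-MissingSid {
    param(
        [string[]]$Expected,
        [string[]]$Actual
    )
    $Expected | Where-Object { $Actual -notcontains $_ }
}

# On each node:
# $actual = (Get-LocalGroupMember -Group Administrators).SID.Value
# Get-MissingSid -Expected $expectedSids -Actual $actual
```

Any SID the helper returns can then be added with the Add-LocalGroupMember command shown earlier.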

ECEStore

The files in the ECEStore directory show state and status information of the ECE service, which handles some lifecycle and configuration management. The JSON files in this directory may be helpful to troubleshoot stuck states, but most events also seem to be reported in standard logs. The MASLogs directory in the ECEStore directory shows PowerShell transcripts, which can be helpful as well.

NuGet Packages

During initialization, several NuGet packages are downloaded and extracted on the seed node. We’ve seen issues where these packages are incomplete or corrupted, which is usually noted in the MASLogs directory. In this case, the full reset described under The Big Hammer: Full Reset for Failed Validation seems to be required.

The Big Hammer: Full Reset for Failed Validation

[UPDATE 5/7/2024: Due to recent changes in the deployment engine, deleting directories as described below may lead to an unrecoverable scenario, forcing you to rebuild. If you are stuck in validation, we recommend opening a support case to see if you can avoid a rebuild.]

If you’ve pulled the last of your hair out, the following steps usually perform a full reset of the environment without having to reinstall the OS and reconfigure networking, etc. (the biggest hammer). This is not usually necessary, and you don’t want to go through it only to run into the same problem, so spend some time with the other troubleshooting options first.

~~Uninstall the Arc agents on all nodes with the Remove-AzStackHciArcInitialization command~~
~~Delete the deploymentSettings resource in Azure~~
~~Delete the cluster resource in Azure~~
~~Reboot the seed node~~
~~Delete the following directories on the seed node:~~

~~C:\CloudContent~~
~~C:\CloudDeployment~~
~~C:\Deployment~~
~~C:\DeploymentPackage~~
~~C:\EceStore~~
~~C:\NugetStore~~

~~Remove the LCMAzureStackStampInformation\InitializationComplete registry property on the seed node:~~
~~Set-ItemProperty -path HKLM:\SOFTWARE\Microsoft\LCMAzureStackStampInformation -Name InitializationComplete -Value '' -WhatIf~~
~~Reinitialize Arc on each node with Invoke-AzStackHciArcInitialization and retry the complete deployment~~

Conclusion

Hopefully this guide has helped you troubleshoot issues with your deployment. Please feel free to comment with additional suggestions or questions and we’ll try to get those incorporated in this post.

If you’re still having issues, a Support Case is your next step!

Updated May 07, 2024

Version 10.0

Infra

mtbmsft

Microsoft

Toastgun

Copper Contributor

Jun 10, 2024

Hello

Maybe someone knows where my deployment goes wrong... My deployment always ends with:

2024-06-10 09:32:35 Warning  [DeploymentService:InvokeEnvironmentChecker] Task: Invocation of interface 'ValidateNetwork' of role 'Cloud\Infrastructure\EnvironmentValidator' failed: 

Type 'ValidateNetwork' of Role 'EnvironmentValidator' raised an exception:

No MSFT_NetIPAddress objects found with property 'InterfaceAlias' equal to 'Ethernet'.  Verify the value of the property and retry.

I am trying to deploy a single node with a single NIC.

jcookintegy
Copper Contributor
Apr 06, 2024
Hi, posting here out of desperation. I am consistently seeing the issue mentioned in this article where the local administrator group is not populated with the correct accounts and NT service accounts. The first failure this causes is at the "Install the update orchestrator" stage where we get this error

WatsonBuckets": null }, "Message": "AzureStack File Copy Agent is not running.", "Type": "HealthCheckException", "CallStack": "HealthCheckException: AzureStack File Copy Agent is not running.\r\n" } }, { "ErrorType": "ResourceFailureLocalGroupMembership", "ErrorInfo": { "Exception": { "ClassName": "Microsoft.AzureStack.Infrastructure.Orchestration.AgentLifecycleAgent.ResourceManagement.Exceptions.LocalGroupMembershipException", "Message": "Failed to set up membership for NT SERVICE\\AzureStack File Copy Agent", "Data": null, "InnerException": { "ClassName":

You can watch the event logs while the deployment is running and see it constantly trying to create this service, which I assume fails because it cannot add the service account to the local admin group.

Permissions have been verified for the deployment user (it's in the Administrators group on both nodes). I have a long-running case open with MS, but we are going around in circles at the moment.

I notice this exact issue is mentioned here and thought I'd reach out to see if anyone else is having or has had this and knows of a cause or fix. I am nearly in double digits for the times I've flattened and rebuilt / re-deployed this cluster!

All help greatly appreciated whilst I still have some hair

Karl-WE

MVP

Feb 23, 2024

These steps helped here:

Workaround:
- removing LCM extension via PS or Portal
- creating the mentioned HKLM Key (as removing will delete it) via PS
- reboot the nodes via PS
- register LCM extension via PS

Root Cause:
The function Test-NodeInitialization runs before or during LCM extension creation. It tests for a registry key which is not present until the LCM deployment has succeeded.
Only after the successful creation of the LCM extension is this registry key created and populated with the expected string value.

Solution:
The exception in the function should not terminate the deployment of the LCM extension. Instead, the function should check for the existence of this registry key (Test-Path) and create the key if it does not exist.

Extension Message: Enable-Extension.ps1 : Installing LCM extention failed: 
Extension Error: Get-ItemProperty : Cannot find path 'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\LCMAzureStackStampInformation' because it 
does not exist.
At C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Dep
loymentScripts\LCMControllerWinService.psm1:128 char:30

the error refers to this function below and in fact the complete hive is empty. And on a fresh 23H2 box it does not exist.
NOTHING in the whole LCMControllerWinService.psm1 module does actually create this hive in the registry.
Especially NOT the function Install-LCMController.

Please try https://github.com/DellGEOS/AzureStackHOLs/tree/main/lab-guides/01a-DeployAzureStackHCICluster-CloudBasedDeployment in WestEU.

LCMControllerWinService.psm1 reference

function Test-NodeInitialization
{
    $existingService = Get-WmiObject Win32_Service | where {$_.Name -eq 'LcmController'}
    $eceLiteDir = Split-Path $existingService.PathName
    $eceLite = Join-Path $eceLiteDir "EnterpriseCloudEngine.psd1"
    $InitializeAction = "InitializeDeploymentService"
    $initializationStatus = (Get-ItemProperty -path Registry::HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\LCMAzureStackStampInformation -Name InitializationComplete).InitializationComplete
    if (![string]::IsNullOrEmpty($initializationStatus) -and ($initializationStatus -eq "Complete"))
    {
        Write-Verbose "Node Initialization successfull."
    }
    elseif (Test-path -path $eceLite)
    {
        Import-Module $eceLite
        $actionProgress = Get-ActionProgress -ActionType $InitializeAction
        if (!$actionProgress)
        {
            throw "Node Initialization not started."
        }
        elseif ($actionProgress -and $actionProgress.Attribute("Status").Value -eq "Success")
        {
            Write-Verbose "Node Initialization successfull."
        }
        elseif ($actionProgress -and $actionProgress.Attribute("Status").Value -eq "Error")
        {
            Write-Verbose "Node Initialization failed with error: $($actionProgress.Attribute("Status").Parent.Value)"
            throw "Node Initialization failed with error: $($actionProgress.Attribute("Status").Parent.Value)"
        }
        elseif ($actionProgress -and $actionProgress.Attribute("Status").Value -eq "InProgress")
        {
            Write-Verbose "Initialization action plan status:$($actionProgress.Attribute("Status").Value)"
            throw "Node Initialization in progress."
        }
        else
        {
            Write-Verbose "Node Initialization failed."
        }
    }
    else
    {
        throw "Node Initialization failed."
    }
}

Karl-WE

MVP

Feb 23, 2024

mtbmsft Islam Gomaa on our Azure Stack HCI Slack there are several reports, spanning months, that the most frequently failing resource in pre-deployment is LCM

This one is also handled here:
https://techcommunity.microsoft.com/t5/fasttrack-for-azure/common-deployment-challenges-and-workarounds-for-hci-23h2/ba-p/4044172

Can you confirm that the telemetry collected with every 23H2 deployment and (failed) attempt is collected even at this command?

   Invoke-AzStackHciArcInitialization

----

Having the following outputs:

Azure Portal Azure Stack HCI Wizard page 1
Validate Selected Server:

Resource validation failed. Details: [{"Code":"ValidationFailed","Message":"Arc extensions installed on Arc Machine /subscriptions/a69d12f1-e62b-49b1-a483-6bc11b28a923/resourceGroups/AzSHCI-Clu-112-rg/providers/Microsoft.HybridCompute/machines/ASNode1 are 
- DeviceManagementExtension
- EdgeRemoteSupport
- TelemetryAndDiagnostics 

while required list of mandatory arc extensions are 
- DeviceManagementExtension, 
- LcmController,
- TelemetryAndDiagnostics, 
,"Target":null,"Details":null},

So LCMController is the culprit once again, also visible on both fresh nodes
CU: 23H2 MBR 709
LCM Version: 30.2402.0.15
Scriptversion: 10.2402.0.15

The error log in the extension is extremely helpful and verbose, but to me it looks like an internal code issue.

Microsoft.AzureStack.Orchestration.LcmController
Status
Failed
Automatic upgrade
Not supported
Version
30.2402.0.15
Status level
Error
Status message
Extension Message: Enable-Extension.ps1 : Installing LCM extention failed: 
Extension Error: Get-ItemProperty : Cannot find path 'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\LCMAzureStackStampInformation' because it 
does not exist.
At C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Dep
loymentScripts\LCMControllerWinService.psm1:128 char:30
+ ... onStatus = (Get-ItemProperty -path Registry::HKEY_LOCAL_MACHINE\SOFTW ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (HKEY_LOCAL_MACH...tampInformation:String) [Get-ItemProperty], ItemNotFo 
   undException
    + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetItemPropertyCommand
 
Get-ItemProperty : Cannot find path 'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\LCMAzureStackStampInformation' because it 
does not exist.
At C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Dep
loymentScripts\LCMControllerWinService.psm1:128 char:30
+ ... onStatus = (Get-ItemProperty -path Registry::HKEY_LOCAL_MACHINE\SOFTW ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (HKEY_LOCAL_MACH...tampInformation:String) [Get-ItemProperty], ItemNotFo 
   undException
    + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetItemPropertyCommand
 
Transcript started, output file is C:\MASLogs\Install-LCMController_20240223-023842Z.log
VERBOSE: Loading module from path 
'C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Deplo
ymentScripts\LogmanHelpers.psm1'.
VERBOSE: Importing function 'Register-LogManScheduledTask'.
VERBOSE: Start-LCMControllerServiceLogman : Registering LogMan ScheduledTask.
VERBOSE: Register-LogManScheduledTask : starts logman script started.
VERBOSE: Register-LogManScheduledTask : Creating scheduled task: Start logman for LcmController service
VERBOSE: Register-LogManScheduledTask : Scheduled task Start logman for LcmController service created, register it to 
task scheduler

TaskPath                                       TaskName                          State     
--------                                       --------                          -----     
\                                              Start logman for LcmController... Ready     
VERBOSE: Register-LogManScheduledTask : Scheduled task Start logman for LcmController service is now registered
VERBOSE: Start-LCMControllerServiceLogman : Starting LogMan ScheduledTask.
VERBOSE:  : Installing LCMController Service.

Status      : Stopped
Name        : LcmController
DisplayName : LcmController

[SC] ChangeServiceConfig2 SUCCESS
VERBOSE:  : LcmController service created.
VERBOSE: LcmController starting.

__GENUS          : 2
__CLASS          : __PARAMETERS
__SUPERCLASS     : 
__DYNASTY        : __PARAMETERS
__RELPATH        : 
__PROPERTY_COUNT : 1
__DERIVATION     : {}
__SERVER         : 
__NAMESPACE      : 
__PATH           : 
ReturnValue      : 0
PSComputerName   : 

VERBOSE:  : Waiting 300 seconds for LcmController to start, attempt 1 of 5 ...
VERBOSE:  : LcmController service Running.
VERBOSE: Loading module from path 
'C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Enter
priseCloudEngine.psd1'.
VERBOSE: Exporting function 'Get-EceInterfaceParameters'.
VERBOSE: Exporting function 'Test-EceInterface'.
VERBOSE: Exporting function 'Get-DeploymentActionPlanLog'.
VERBOSE: Exporting cmdlet 'Trace-ECEScript'.
VERBOSE: Exporting cmdlet 'Set-EceSecret'.
VERBOSE: Exporting cmdlet 'Get-EceConfiguration'.
VERBOSE: Exporting cmdlet 'Get-ActionProgress'.
VERBOSE: Exporting cmdlet 'Get-JsonTemplate'.
VERBOSE: Exporting cmdlet 'Invoke-EceAction'.
VERBOSE: Exporting cmdlet 'Join-RoleTemplate'.
VERBOSE: Exporting cmdlet 'Import-EceCustomerConfiguration'.
VERBOSE: Exporting cmdlet 'Set-RoleDefinition'.
VERBOSE: Importing cmdlet 'Get-ActionProgress'.
VERBOSE: Importing cmdlet 'Get-EceConfiguration'.
VERBOSE: Importing cmdlet 'Get-JsonTemplate'.
VERBOSE: Importing cmdlet 'Import-EceCustomerConfiguration'.
VERBOSE: Importing cmdlet 'Invoke-EceAction'.
VERBOSE: Importing cmdlet 'Join-RoleTemplate'.
VERBOSE: Importing cmdlet 'Set-EceSecret'.
VERBOSE: Importing cmdlet 'Set-RoleDefinition'.
VERBOSE: Importing cmdlet 'Trace-ECEScript'.
VERBOSE: Importing function 'Get-DeploymentActionPlanLog'.
VERBOSE: Importing function 'Get-EceInterfaceParameters'.
VERBOSE: Importing function 'Test-EceInterface'.
VERBOSE: Initialization action plan status:InProgress
VERBOSE: Node Initialization in progress.
Transcript stopped, output file is C:\MASLogs\Install-LCMController_20240223-023842Z.log

FMerizalde
Copper Contributor
Feb 21, 2024
This has been very helpful in figuring out the install process and navigating the logs thank you.
Karl-WE
MVP
Feb 08, 2024
mtbmsft Islam Gomaa is there guidance or a script that also helps clean up objects in the Azure Portal for removed or failed deployments?

Today I found a couple of applications.

Entra ID > Enterprise Applications:
Remove filter (All Applications)
clustername.arb
Islam Gomaa
Microsoft
Jan 22, 2024
Great Work mtbmsft