Blog Post

FastTrack for Azure
7 MIN READ

Troubleshooting Azure Stack HCI 23H2 Preview Deployments

mtbmsft
Jan 22, 2024

With Azure Stack HCI release 23H2 preview, there are significant changes to how clusters are deployed, enabling low-touch deployments in edge sites. Running these deployments in customer sites or lab environments may require some troubleshooting as kinks in the process are ironed out. This post provides guidance on how to approach that troubleshooting.

 

The following is written using a rapidly changing preview release, based on field and lab experience. We’re focused on how to start troubleshooting, rather than digging into specific issues you may encounter.

 

Understanding the deployment process

 

Deployment is completed in two steps: first, the target environment and configuration are validated, then the validated configuration is applied to the cluster nodes by a deployment. While ideally any issues with the configuration will be caught in validation, this is not always the case. Consequently, you may find yourself working through issues in validation only to also have more issues during deployment to troubleshoot. We’ll start with tips on working through validation issues then move to deployment issues. When the validation step completes, a ‘deploymentSettings’ sub-resource is created on your HCI cluster Azure resource.

Logs Everywhere!

 

When you run into errors in validation or deployment, the error passed through to the Portal may not have enough information or context to understand exactly what is going on. To get to the details, we frequently need to dig into the log files on the HCI nodes. The validation and deployment processes pull in components used in Azure Stack Hub, resulting in log files in various locations, but most logs are on the seed node (the first node, sorted by name).

 

Viewing Logs on Nodes

When connected to your HCI nodes with Remote Desktop, Notepad is available for opening log files and checking contents. Another useful trick is the PowerShell Get-Content command with the -Wait parameter to follow a log as it grows and the -Tail (alias -Last) parameter to show only the most recent lines. This is especially helpful for watching the CloudDeployment log progress. For example:

 

Get-Content C:\CloudDeployment\Logs\CloudDeployment.2024-01-20.14-29-13.0.log -wait -last 150
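
Each run writes a new timestamped log file, so it can save a step to pick up the newest one automatically. A small sketch, assuming the same default log directory used above:

# Find the most recent CloudDeployment log and follow it as it grows
$log = Get-ChildItem C:\CloudDeployment\Logs\CloudDeployment*.log |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 1
Get-Content $log.FullName -Wait -Tail 150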

Log File Locations

The following are the most important log locations and when to look in each:

C:\CloudDeployment\Logs\summary*.xml
    Content: Summary of deployment status
    When to use: A good starting place for rollup deployment status and events

C:\CloudDeployment\Logs\CloudDeployment*
    Content: Output of the deployment operation
    When to use: This is the primary log to monitor and troubleshoot deployment activity. Look here when a deployment fails or stalls.

C:\CloudDeployment\Logs\EnvironmentValidatorFull*
    Content: Output of the validation run
    When to use: When your configuration fails a validation step

C:\MASLogs\LCMECELiteLogs\InitializeDeploymentService*
    Content: Logs related to the Life Cycle Manager (LCM) initial configuration
    When to use: When you can't start validation; the LCM service may not have been fully configured

C:\ECEStore\MASLogs
    Content: PowerShell script transcripts for ECE activity
    When to use: Shows more detail on scripts executed by ECE. This is a good place to look if CloudDeployment shows an error but not enough detail.

C:\CloudDeployment\Logs\cluster\*
C:\Windows\Temp\StorageClusterValidationReport*
    Content: Cluster validation report
    When to use: Cluster validation runs when the cluster is created; when validation fails, these logs tell you why
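
If you're not sure which file holds the failure, a plain text search across the deployment logs can narrow things down. A minimal sketch; the pattern and paths are just a starting point, so adjust as needed:

# Show the most recent lines that mention an error or exception
Select-String -Path C:\CloudDeployment\Logs\CloudDeployment*.log -Pattern 'Error|Exception' |
    Select-Object -Last 25 |
    Format-Table Filename, LineNumber, Line -AutoSize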

 

Retrying Validations and Deployments

 

Retrying Validation

In the Portal, you can usually retry validation with the “Try Again…” button. If you are using an ARM template, you can redeploy the template. During the Validation stage, your node is running a series of scripts and checks to ensure it is ready for deployment. Most of these scripts are part of the modules found here: C:\Program Files\WindowsPowerShell\Modules\AzStackHci.EnvironmentChecker. Sometimes it can be insightful to run the modules individually, with verbose or debug output enabled.
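
For example, you can import the module and run one of its validators directly with verbose output. The cmdlet names below are assumed from the PSGallery version of AzStackHci.EnvironmentChecker; use Get-Command to confirm what your installed copy exposes:

Import-Module AzStackHci.EnvironmentChecker
Get-Command -Module AzStackHci.EnvironmentChecker     # list the individual validators
Invoke-AzStackHciConnectivityValidation -Verbose      # run the connectivity checks on their own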

Retrying Deployment

The ‘deploymentSettings’ resource under your cluster contains the configuration to deploy and is used to track the status of your deployment. Sometimes it can be helpful to view this resource; an easy way to do this is to navigate to your Azure Stack HCI cluster in the Portal and append ‘deploymentsettings/default’ after your cluster name in the browser address bar.

 

Image 1 - the deploymentSettings Resource in the Portal
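
If you prefer working outside the Portal, the same resource can be read with Az PowerShell. A sketch assuming the Az.Resources module; substitute your own subscription, resource group, and cluster names:

$clusterId = "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.AzureStackHCI/clusters/<cluster-name>"
Get-AzResource -ResourceId "$clusterId/deploymentSettings/default" -ExpandProperties |
    Select-Object -ExpandProperty Properties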

 

From the Portal

In the Portal, if your Deployment stage fails part-way through, you can usually restart it by clicking the 'Rerun deployment' button under Deployments on the cluster resource.

 

Image 2 - access the deployment in the Portal so you can retry

Alternatively, you can navigate to the cluster resource group deployments. Find the deployment matching the name of your cluster and initiate a redeploy using the Redeploy option.

 

Image 3 - the 'Redeploy' button on the deployment view in the Portal

If Azure/the Portal shows your deployment as still in progress, you won't be able to start it again until you cancel it or it fails.

 

From an ARM Template

To retry a deployment when you used the ARM template approach, simply resubmit the deployment. With the ARM template approach, you submit the same template twice: once with deploymentMode: "Validate" and again with deploymentMode: "Deploy". To retry validation, resubmit with "Validate"; to retry deployment, resubmit with "Deploy".

Image 4 - ARM template showing deploymentMode setting
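
Resubmitting with Az PowerShell might look like the following. The template and parameter file names are placeholders, and the deploymentMode parameter is assumed to match the parameter name exposed by your template:

# Pass "Validate" to retry validation, or "Deploy" to retry the deployment itself
New-AzResourceGroupDeployment -ResourceGroupName "<resource-group>" `
    -TemplateFile .\azuredeploy.json `
    -TemplateParameterFile .\azuredeploy.parameters.json `
    -deploymentMode "Deploy"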

 

Locally on the Seed Node

[Starting deployment manually from the seed node when the deployment is not in an in progress state in Azure is not supported. Instead, open a support case.]

In most cases, you’ll want to initiate deployment, validation, and retries from Azure. This ensures that your deploymentSettings resource is at the same stage as the local deployment.

 

However, in some instances, the deployment status as Azure understands it becomes out of sync with what is going on at the node level, leaving you unable to retry a stuck deployment. For example, Azure has your deploymentSettings status as “Provisioning” but the logs in CloudDeployment show the activity has stopped and/or the ‘LCMAzureStackDeploy’ scheduled task on the seed node is stopped. In this case, you may be able to rerun the deployment by restarting the ‘LCMAzureStackDeploy’ scheduled task on the seed node:

Start-ScheduledTask -TaskName LCMAzureStackDeploy
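
Standard ScheduledTasks cmdlets can confirm the task state and its last result before and after you restart it:

Get-ScheduledTask -TaskName LCMAzureStackDeploy | Select-Object TaskName, State
Get-ScheduledTaskInfo -TaskName LCMAzureStackDeploy | Select-Object LastRunTime, LastTaskResult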


If this does not work, you may need to delete the deploymentSettings resource and start again. See: The big hammer: full reset.

Advanced Troubleshooting

Invoking Deployment from PowerShell

[Invoking deployment locally on the seed node when the deployment is not in an 'in progress' state is unsupported because it can cause Azure and the local deployment to become out of sync. Depending on where you are in the deployment, it can take quite a while for the status to update in the Portal. Instead, open a support case.]

Although deployment activity has lots of logging, sometimes either you can’t find the right log file or seem to be missing what is causing the failure. In this case, it is sometimes helpful to retry the deployment directly in PowerShell, executing the script which is normally called by the Scheduled Task mentioned above.
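
Even without invoking anything locally, it can help to see exactly what the scheduled task calls so you know which script and log names to search for. This is a read-only check:

# Show the executable and arguments behind the LCMAzureStackDeploy task
(Get-ScheduledTask -TaskName LCMAzureStackDeploy).Actions |
    Format-List Execute, Arguments, WorkingDirectory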

 

 

Local Group Membership

In a few cases, we've found that the local Administrators group membership on the cluster nodes does not get populated with the necessary domain and virtual service account users. The issues this causes have been difficult to track down through logs, and they likely have a root cause that will be addressed soon.

Check group membership with: Get-LocalGroupMember Administrators

Add group membership with: Add-LocalGroupMember Administrators -Member <domain\username|local username|SID>[,…]

Here’s what we expect on a fully deployed cluster:

Domain Users
    Accounts: DOMAIN\<LCMUser>
    Comments: This is the domain account created during AD Prep and specified during deployment

Local Users
    Accounts: AsBuiltInAdmin (renamed from Administrator), ECEAgentService, HCIOrchestrator
    Comments: These accounts don't exist initially but are created at various stages during deployment. Try adding them; if they have not been provisioned yet, you'll get a message that they don't exist.

Virtual Service Accounts
    Accounts:
    S-1-5-80-1219988713-3914384637-3737594822-3995804564-465921127
    S-1-5-80-949177806-3234840615-1909846931-1246049756-1561060998
    S-1-5-80-2317009167-4205082801-2802610810-1010696306-420449937
    S-1-5-80-3388941609-3075472797-4147901968-645516609-2569184705
    S-1-5-80-463755303-3006593990-2503049856-378038131-1830149429
    S-1-5-80-649204155-2641226149-2469442942-1383527670-4182027938
    S-1-5-80-1010727596-2478584333-3586378539-2366980476-4222230103
    S-1-5-80-3588018000-3537420344-1342950521-2910154123-3958137386
    Comments: These are the SIDs of the various virtual service accounts used to run services related to deployment and continued lifecycle management. The SIDs seem to be hard coded, so these can be added at any time. When these accounts are missing, there are issues as early as the JEA deployment step.
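
A quick sketch for comparing the current membership against the expected list and adding whatever is missing; the domain account below is a placeholder, and the S-1-5-80-* SIDs come from the list above:

# Expected members of the local Administrators group (see the list above)
$expected = @(
    'CONTOSO\LCMUser',      # placeholder; use the LCM deployment account for your domain
    'AsBuiltInAdmin',
    'ECEAgentService',
    'HCIOrchestrator'
    # ...plus the eight S-1-5-80-* virtual service account SIDs listed above
)
$current = (Get-LocalGroupMember -Group Administrators).Name
foreach ($member in $expected) {
    # Membership is reported as COMPUTER\name or DOMAIN\name, so match on the trailing name too
    if (-not ($current | Where-Object { $_ -eq $member -or $_ -like "*\$member" })) {
        # Accounts that have not been provisioned yet will throw a "not found" error, as described above
        Add-LocalGroupMember -Group Administrators -Member $member
    }
}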

 

ECEStore

The files in the ECEStore directory show state and status information of the ECE service, which handles some lifecycle and configuration management. The JSON files in this directory may be helpful to troubleshoot stuck states, but most events also seem to be reported in standard logs. The MASLogs directory in the ECEStore directory shows PowerShell transcripts, which can be helpful as well.
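
To find the most recent transcripts quickly, assuming the same path shown in the log table above:

Get-ChildItem C:\ECEStore\MASLogs -File |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 5 -Property FullName, LastWriteTime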

NuGet Packages

During initialization, several NuGet packages are downloaded and extracted on the seed node. We've seen issues where these packages are incomplete or corrupted, usually noted in the MASLogs directory. In this case, the full reset described in The Big Hammer section below seems to be required.

The Big Hammer: Full Reset for Failed Validation

[UPDATE 5/7/2024: Due to recent changes in the deployment engine, deleting directories as described below may lead to an unrecoverable scenario, forcing you to rebuild. If you are stuck in validation, we recommend opening a support case to see if you can avoid a rebuild.]

If you've pulled the last of your hair out, the following steps usually perform a full reset of the environment while avoiding the need to reinstall the OS and reconfigure networking (the biggest hammer). This is rarely necessary, and you don't want to go through it only to run into the same problem, so spend some time with the other troubleshooting options first.

  1. Uninstall the Arc agents on all nodes with the Remove-AzStackHciArcInitialization command
  2. Delete the deploymentSettings resource in Azure
  3. Delete the cluster resource in Azure
  4. Reboot the seed node
  5. Delete the following directories on the seed node:
    1. C:\CloudContent
    2. C:\CloudDeployment
    3. C:\Deployment
    4. C:\DeploymentPackage
    5. C:\EceStore
    6. C:\NugetStore
  6. Clear the InitializationComplete property under the LCMAzureStackStampInformation registry key on the seed node:
    Set-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\LCMAzureStackStampInformation -Name InitializationComplete -Value ''
  7. Reinitialize Arc on each node with Invoke-AzStackHciArcInitialization (see the sketch below) and retry the complete deployment
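
For reference, a sketch of the Arc re-initialization in the final step. The parameter names follow the documented manual registration flow for the AzSHCI.ARCInstaller module at the time of writing, so verify them against the version installed on your nodes; the subscription, resource group, tenant, and region values are placeholders:

# Requires Az.Accounts and an existing Connect-AzAccount session
$armToken = (Get-AzAccessToken).Token
$account  = (Get-AzContext).Account.Id
Invoke-AzStackHciArcInitialization -SubscriptionID "<subscription-id>" -ResourceGroup "<resource-group>" `
    -TenantID "<tenant-id>" -Region "<region>" -Cloud "AzureCloud" `
    -ArmAccessToken $armToken -AccountID $account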

Conclusion

Hopefully this guide has helped you troubleshoot issues with your deployment. Please feel free to comment with additional suggestions or questions and we’ll try to get those incorporated in this post.

 

If you’re still having issues, a Support Case is your next step!

Updated May 07, 2024
Version 10.0
  • Toastgun
    Copper Contributor

    Hello

     

    Maybe someone knows where my deployment goes wrong... My deployment always ends up with:

     

    2024-06-10 09:32:35 Warning  [DeploymentService:InvokeEnvironmentChecker] Task: Invocation of interface 'ValidateNetwork' of role 'Cloud\Infrastructure\EnvironmentValidator' failed: 
    
    Type 'ValidateNetwork' of Role 'EnvironmentValidator' raised an exception:
    
    No MSFT_NetIPAddress objects found with property 'InterfaceAlias' equal to 'Ethernet'.  Verify the value of the property and retry.

     

     

    I am trying to deploy a single node with a single NIC.

     

  • jcookintegy
    Copper Contributor

    Hi, posting here out of desperation. I am consistently seeing the issue mentioned in this article where the local Administrators group is not populated with the correct accounts and NT service accounts. The first failure this causes is at the "Install the update orchestrator" stage, where we get this error:

     

    WatsonBuckets": null }, "Message": "AzureStack File Copy Agent is not running.", "Type": "HealthCheckException", "CallStack": "HealthCheckException: AzureStack File Copy Agent is not running.\r\n" } }, { "ErrorType": "ResourceFailureLocalGroupMembership", "ErrorInfo": { "Exception": { "ClassName": "Microsoft.AzureStack.Infrastructure.Orchestration.AgentLifecycleAgent.ResourceManagement.Exceptions.LocalGroupMembershipException", "Message": "Failed to set up membership for NT SERVICE\\AzureStack File Copy Agent", "Data": null, "InnerException": { "ClassName":

     

    You can watch the event logs whilst the deployment is running and see it constantly trying to create this service, which I assume is failing because it cannot add the service account to the local admin group.

     

    Permissions have been verified for the deployment user (it's in the Administrators group on both nodes). I have a long-running case open with MS but we are going around in circles at the moment.

     

    I notice this exact issue is mentioned here and thought I'd reach out to see if anyone else is having or has had this and knows of a cause or fix. I am nearly in double digits for the number of times I've flattened and rebuilt/redeployed this cluster!

     

    All help greatly appreciated whilst I still have some hair

     

  • These steps helped here:

    Workaround:
    - removing LCM extension via PS or Portal
    - creating the mentioned HKLM Key (as removing will delete it) via PS
    - reboot the nodes via PS
    - register LCM extension via PS

    Root Cause:
    The function Test-NodeInitialization runs before or during LCM extension creation. It tests for a registry key that is not present until the LCM deployment has already succeeded.
    Only after the LCM extension has been successfully created is this registry key created and populated with the expected string value.

    Solution:
    The exception in the function should not terminate the deployment of the LCM extension; instead, the function should check for the existence of this registry key (Test-Path) and, if it does not exist, create it.

     

     

    Extension Message: Enable-Extension.ps1 : Installing LCM extention failed: 
    Extension Error: Get-ItemProperty : Cannot find path 'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\LCMAzureStackStampInformation' because it 
    does not exist.
    At C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Dep
    loymentScripts\LCMControllerWinService.psm1:128 char:30

     

     

     

     

    The error refers to the function below, and in fact the complete hive is empty; on a fresh 23H2 box it does not exist.
    NOTHING in the whole LCMControllerWinService.psm1 module actually creates this hive in the registry,
    especially NOT the function Install-LCMController.

    Please try https://github.com/DellGEOS/AzureStackHOLs/tree/main/lab-guides/01a-DeployAzureStackHCICluster-CloudBasedDeployment in WestEU.

    LCMControllerWinService.psm1 reference 

     

     

     

    function Test-NodeInitialization
    {
        $existingService = Get-WmiObject Win32_Service | where {$_.Name -eq 'LcmController'}
        $eceLiteDir = Split-Path $existingService.PathName
        $eceLite = Join-Path $eceLiteDir "EnterpriseCloudEngine.psd1"
        $InitializeAction = "InitializeDeploymentService"
        $initializationStatus = (Get-ItemProperty -path Registry::HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\LCMAzureStackStampInformation -Name InitializationComplete).InitializationComplete
        if (![string]::IsNullOrEmpty($initializationStatus) -and ($initializationStatus -eq "Complete"))
        {
            Write-Verbose "Node Initialization successfull."
        }
        elseif (Test-path -path $eceLite)
        {
            Import-Module $eceLite
            $actionProgress = Get-ActionProgress -ActionType $InitializeAction
            if (!$actionProgress)
            {
                throw "Node Initialization not started."
            }
            elseif ($actionProgress -and $actionProgress.Attribute("Status").Value -eq "Success")
            {
                Write-Verbose "Node Initialization successfull."
            }
            elseif ($actionProgress -and $actionProgress.Attribute("Status").Value -eq "Error")
            {
                Write-Verbose "Node Initialization failed with error: $($actionProgress.Attribute("Status").Parent.Value)"
                throw "Node Initialization failed with error: $($actionProgress.Attribute("Status").Parent.Value)"
            }
            elseif ($actionProgress -and $actionProgress.Attribute("Status").Value -eq "InProgress")
            {
                Write-Verbose "Initialization action plan status:$($actionProgress.Attribute("Status").Value)"
                throw "Node Initialization in progress."
            }
            else
            {
                Write-Verbose "Node Initialization failed."
            }
        }
        else
        {
            throw "Node Initialization failed."
        }
    }

     

     

     

     

  • mtbmsft Islam Gomaa: on our Azure Stack HCI Slack there are several reports, spanning months, that the most common failing resource in pre-deployment is LCM.

    This one is also handled here:
    https://techcommunity.microsoft.com/t5/fasttrack-for-azure/common-deployment-challenges-and-workarounds-for-hci-23h2/ba-p/4044172


    Can you confirm whether the telemetry collected with every 23H2 deployment and (failed) attempt is also collected when running this command?

       Invoke-AzStackHciArcInitialization



    ----

     
    Having the following outputs:

     

     

    Azure Portal Azure Stack HCI Wizard page 1
    Validate Selected Server:
    
    Resource validation failed. Details: [{"Code":"ValidationFailed","Message":"Arc extensions installed on Arc Machine /subscriptions/a69d12f1-e62b-49b1-a483-6bc11b28a923/resourceGroups/AzSHCI-Clu-112-rg/providers/Microsoft.HybridCompute/machines/ASNode1 are 
    - DeviceManagementExtension
    - EdgeRemoteSupport
    - TelemetryAndDiagnostics 
    
    while required list of mandatory arc extensions are 
    - DeviceManagementExtension, 
    - LcmController,
    - TelemetryAndDiagnostics, 
    ,"Target":null,"Details":null},
    
    

     

     

     


    So LCMController is the culprit once again, also visible on both fresh nodes
    CU: 23H2 MBR 709
    LCM Version: 30.2402.0.15
    Scriptversion: 10.2402.0.15

    The error log in the extension is extremely helpful and verbose, but to me it looks like an internal code issue.

     

     

    Microsoft.AzureStack.Orchestration.LcmController
    Status
    Failed
    Automatic upgrade
    Not supported
    Version
    30.2402.0.15
    Status level
    Error
    Status message
    Extension Message: Enable-Extension.ps1 : Installing LCM extention failed: 
    Extension Error: Get-ItemProperty : Cannot find path 'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\LCMAzureStackStampInformation' because it 
    does not exist.
    At C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Dep
    loymentScripts\LCMControllerWinService.psm1:128 char:30
    + ... onStatus = (Get-ItemProperty -path Registry::HKEY_LOCAL_MACHINE\SOFTW ...
    +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : ObjectNotFound: (HKEY_LOCAL_MACH...tampInformation:String) [Get-ItemProperty], ItemNotFo 
       undException
        + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetItemPropertyCommand
     
    Get-ItemProperty : Cannot find path 'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\LCMAzureStackStampInformation' because it 
    does not exist.
    At C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Dep
    loymentScripts\LCMControllerWinService.psm1:128 char:30
    + ... onStatus = (Get-ItemProperty -path Registry::HKEY_LOCAL_MACHINE\SOFTW ...
    +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        + CategoryInfo          : ObjectNotFound: (HKEY_LOCAL_MACH...tampInformation:String) [Get-ItemProperty], ItemNotFo 
       undException
        + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetItemPropertyCommand
     
    Transcript started, output file is C:\MASLogs\Install-LCMController_20240223-023842Z.log
    VERBOSE: Loading module from path 
    'C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Deplo
    ymentScripts\LogmanHelpers.psm1'.
    VERBOSE: Importing function 'Register-LogManScheduledTask'.
    VERBOSE: Start-LCMControllerServiceLogman : Registering LogMan ScheduledTask.
    VERBOSE: Register-LogManScheduledTask : starts logman script started.
    VERBOSE: Register-LogManScheduledTask : Creating scheduled task: Start logman for LcmController service
    VERBOSE: Register-LogManScheduledTask : Scheduled task Start logman for LcmController service created, register it to 
    task scheduler
    
    TaskPath                                       TaskName                          State     
    --------                                       --------                          -----     
    \                                              Start logman for LcmController... Ready     
    VERBOSE: Register-LogManScheduledTask : Scheduled task Start logman for LcmController service is now registered
    VERBOSE: Start-LCMControllerServiceLogman : Starting LogMan ScheduledTask.
    VERBOSE:  : Installing LCMController Service.
    
    Status      : Stopped
    Name        : LcmController
    DisplayName : LcmController
    
    [SC] ChangeServiceConfig2 SUCCESS
    VERBOSE:  : LcmController service created.
    VERBOSE: LcmController starting.
    
    __GENUS          : 2
    __CLASS          : __PARAMETERS
    __SUPERCLASS     : 
    __DYNASTY        : __PARAMETERS
    __RELPATH        : 
    __PROPERTY_COUNT : 1
    __DERIVATION     : {}
    __SERVER         : 
    __NAMESPACE      : 
    __PATH           : 
    ReturnValue      : 0
    PSComputerName   : 
    
    VERBOSE:  : Waiting 300 seconds for LcmController to start, attempt 1 of 5 ...
    VERBOSE:  : LcmController service Running.
    VERBOSE: Loading module from path 
    'C:\NugetStore\Microsoft.AzureStack.Solution.LCMControllerWinService.10.2402.0.15\content\LCMControllerWinService\Enter
    priseCloudEngine.psd1'.
    VERBOSE: Exporting function 'Get-EceInterfaceParameters'.
    VERBOSE: Exporting function 'Test-EceInterface'.
    VERBOSE: Exporting function 'Get-DeploymentActionPlanLog'.
    VERBOSE: Exporting cmdlet 'Trace-ECEScript'.
    VERBOSE: Exporting cmdlet 'Set-EceSecret'.
    VERBOSE: Exporting cmdlet 'Get-EceConfiguration'.
    VERBOSE: Exporting cmdlet 'Get-ActionProgress'.
    VERBOSE: Exporting cmdlet 'Get-JsonTemplate'.
    VERBOSE: Exporting cmdlet 'Invoke-EceAction'.
    VERBOSE: Exporting cmdlet 'Join-RoleTemplate'.
    VERBOSE: Exporting cmdlet 'Import-EceCustomerConfiguration'.
    VERBOSE: Exporting cmdlet 'Set-RoleDefinition'.
    VERBOSE: Importing cmdlet 'Get-ActionProgress'.
    VERBOSE: Importing cmdlet 'Get-EceConfiguration'.
    VERBOSE: Importing cmdlet 'Get-JsonTemplate'.
    VERBOSE: Importing cmdlet 'Import-EceCustomerConfiguration'.
    VERBOSE: Importing cmdlet 'Invoke-EceAction'.
    VERBOSE: Importing cmdlet 'Join-RoleTemplate'.
    VERBOSE: Importing cmdlet 'Set-EceSecret'.
    VERBOSE: Importing cmdlet 'Set-RoleDefinition'.
    VERBOSE: Importing cmdlet 'Trace-ECEScript'.
    VERBOSE: Importing function 'Get-DeploymentActionPlanLog'.
    VERBOSE: Importing function 'Get-EceInterfaceParameters'.
    VERBOSE: Importing function 'Test-EceInterface'.
    VERBOSE: Initialization action plan status:InProgress
    VERBOSE: Node Initialization in progress.
    Transcript stopped, output file is C:\MASLogs\Install-LCMController_20240223-023842Z.log
    

     

     

  • FMerizalde
    Copper Contributor

    This has been very helpful in figuring out the install process and navigating the logs, thank you.

  • mtbmsft Islam Gomaa is there guidance or a script that helps to clean up objects in the Azure Portal as well for removed or failed deployments?

    Today I found a couple of applications.

    Entra ID > Enterprise Applications:
    Remove filter (All Applications)
    clustername.arb