Forum Discussion

shotime
Brass Contributor
Jan 15, 2024

About running the MPI service in HPC

Hi All,

 

I am a beginner with Microsoft HPC Pack.

In my environment, the head nodes (in an HA configuration) and the compute nodes are in different network segments, and each node has only one network adapter. The MPI diagnostic tests fail for the compute nodes, but they pass between the head nodes, which are on the same segment with no firewall in between.

Does MPI use broadcast, so that compute nodes on a different network segment would cause these failures?

 

Thank you

 

Warning

  • This node did not return diagnostics results. Possible reasons are: there was a network issue, or the files that are required for running the diagnostic test are not available on the node. If this is a custom diagnostic test that you added to the cluster, you need to verify that you have copied to all the nodes the files that are required for running the diagnostic test, especially to any new nodes that have joined the cluster.

kyazaferr
Iron Contributor
Different network segments: your head nodes (in a high-availability setup) and your compute nodes sit on different network segments, with only one network adapter each. MPI does not rely on broadcast for its traffic; MS-MPI opens point-to-point TCP connections between processes, so what matters is that the two segments can route to each other and that no firewall blocks the MPI ports. If the nodes cannot reach each other, MPI communication fails.

Diagnostic test failure: the warning indicates that the MPI diagnostic test did not return results. The likely causes are:

• Network issues between the nodes.
• Files required for running the test missing from the compute nodes.
• Firewall rules blocking the necessary ports or traffic types.

MPI configuration:

• Verify that your MPI settings are correct: all nodes must be configured in the MPI environment and able to resolve each other's hostnames.
• You may need to point MPI at a specific interface or network that is routable across both segments. A minimal cross-segment smoke test is sketched below.
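
To confirm that ranks on the two segments can actually exchange point-to-point traffic, a minimal MPI smoke test in C helps. This is just a sketch against the standard MPI API, not HPC Pack's built-in diagnostic; the node names headnode1 and compute1 in the comment are placeholders for your own hosts.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI smoke test: every rank reports its host, then rank 0
 * collects a token from each rank to force real cross-node traffic.
 * Build against MS-MPI and run across both segments, e.g.:
 *   mpiexec -hosts 2 headnode1 1 compute1 1 hello.exe
 * (headnode1/compute1 are placeholder node names.) */
int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);

    printf("rank %d of %d running on %s\n", rank, size, name);

    /* Force point-to-point traffic: every rank sends its rank number
     * to rank 0. If the segments cannot route to each other, this
     * hangs or fails rather than silently passing. */
    if (rank == 0) {
        int i, token;
        for (i = 1; i < size; i++) {
            MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received token from rank %d\n", token);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

If this hangs or fails between segments while succeeding head node to head node, the cause is routing or firewall rules rather than missing diagnostic files.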

Diagnostics and logs:

• Review the diagnostic logs from the HPC cluster for specific errors about network or file access.
• Use tools such as ping or tracert to check connectivity between nodes, and start with simpler tests to isolate network problems; a small name-resolution and port check is sketched after this list.
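
When ping is blocked or inconclusive, a small Winsock sketch in C can separate DNS failures from routing/firewall failures: for each node name you pass in, it checks whether the name resolves and whether a TCP connection to one port succeeds. The port 445 and the helper name check_node are illustrative assumptions, not anything HPC Pack requires; substitute a port your cluster actually has open.

```c
#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>
#include <string.h>

#pragma comment(lib, "ws2_32.lib")

/* Resolve a node name and attempt a TCP connect to one port, to
 * distinguish name-resolution problems from routing/firewall ones.
 * Port 445 (SMB) is only an example of a commonly open port. */
static int check_node(const char *host, const char *port)
{
    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0) {
        printf("%s: name resolution FAILED\n", host);
        return 1;
    }
    for (p = res; p != NULL; p = p->ai_next) {
        SOCKET s = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (s == INVALID_SOCKET)
            continue;
        if (connect(s, p->ai_addr, (int)p->ai_addrlen) == 0) {
            printf("%s:%s reachable\n", host, port);
            closesocket(s);
            freeaddrinfo(res);
            return 0;
        }
        closesocket(s);
    }
    printf("%s resolved but port %s NOT reachable (routing/firewall?)\n",
           host, port);
    freeaddrinfo(res);
    return 1;
}

int main(int argc, char **argv)
{
    WSADATA wsa;
    int i, failures = 0;

    if (argc < 2) {
        printf("usage: %s node1 [node2 ...]\n", argv[0]);
        return 1;
    }
    WSAStartup(MAKEWORD(2, 2), &wsa);
    for (i = 1; i < argc; i++)
        failures += check_node(argv[i], "445");
    WSACleanup();
    return failures;
}
```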

File availability:

• Make sure every file that the MPI diagnostic test needs is present on all nodes. For a custom diagnostic test, confirm that the required files have been deployed to every compute node, especially nodes that joined the cluster recently; a per-node check is sketched below.
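
If you want to script that verification, something like the following can be run locally on each compute node. The paths in it are made-up placeholders, since the files a diagnostic needs depend entirely on the test you defined.

```c
#include <stdio.h>
#include <io.h>

/* Hypothetical per-node check: verify that the files a custom
 * diagnostic needs are present. The paths below are placeholders;
 * substitute the files your own test actually requires. */
int main(void)
{
    const char *required[] = {
        "C:\\DiagTests\\mytest\\run.cmd",      /* placeholder path */
        "C:\\DiagTests\\mytest\\payload.dat",  /* placeholder path */
    };
    int i, missing = 0;
    for (i = 0; i < (int)(sizeof required / sizeof required[0]); i++) {
        if (_access(required[i], 0) != 0) {    /* mode 0 = exists? */
            printf("MISSING: %s\n", required[i]);
            missing++;
        } else {
            printf("ok:      %s\n", required[i]);
        }
    }
    return missing;
}
```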
