Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. With HDInsight, you can use open-source frameworks such as Apache Spark, Apache Hive, LLAP, Apache Kafka, Hadoop, and more in your Azure environment.
Microsoft periodically upgrades the open-source frameworks and the HDInsight resource provider to include new improvements and features.
HDInsight 5.1 has been generally available since November 1, 2023. This release contains the latest versions of all supported software, along with all the improvements made in the open-source versions and the integrations from Microsoft. This version is fully compatible with HDInsight 5.0 and 4.0. The Sqoop and Pig add-ons have been discontinued starting with HDInsight 5.1.
This guide contains detailed information about the new version and the migration steps for Hadoop clusters.
Introduction
Key advantages
- Latest version – HDInsight 5.1 comes with the latest stable open-source versions available. Customers benefit from all the latest features, improvements, and bug fixes.
- Secure – The new versions come with more secure controls and fixes. Open-source security fixes are part of this release, as are security improvements from Microsoft.
- Lower TCO – The new version performs better, and customers can leverage the performance improvements to reduce operating costs.
For a Hadoop cluster upgrade, the following points need to be considered:
Migration guide for Ambari and Oozie
Ambari User Configs Migration
After setting up HDI 5.1, it is necessary to carry over the user-defined configurations from the HDI 4.0 cluster. Ambari does not currently provide a feature to export and import configurations, so we have created scripts that download the configurations and compare them across clusters. This process involves a few manual steps: uploading the configurations to a storage directory, downloading them, and then comparing them.
Script Details
- This process uses two Python scripts.
- Script to download the local cluster service configs from Ambari (a minimal sketch follows this list).
- Script to compare the service config files and generate the differences.
- All service configurations will be downloaded, but certain services and properties have been excluded from the comparison process. The excluded properties are: 'dfs.namenode.shared.edits.dir', 'hadoop.registry.zk.quorum', 'ha.zookeeper.quorum', 'hive.llap.zk.sm.connectionString', 'hive.cluster.delegation.token.store.zookeeper.connectString', 'hive.zookeeper.quorum', 'hive.metastore.uris', 'yarn.resourcemanager.hostname', 'yarn.node-labels.fs-store.root', 'javax.jdo.option.ConnectionURL', 'javax.jdo.option.ConnectionUserName', 'hive_database_name', 'hive_existing_mssql_server_database', 'yarn.log.server.url', 'yarn.timeline-service.sqldb-store.connection-username', 'yarn.timeline-service.sqldb-store.connection-url', 'fs.defaultFS', 'address'
- Excluded Services: 'AMBARI_METRICS','WEBHCAT'
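Ambari exposes all service configurations through its REST API, which is what makes the download step possible. The following is a minimal sketch of what the download script may do, assuming the default local Ambari endpoint on an HDInsight headnode and a JSON dump as the `.out` format; the actual ambari_export_cluster_configs.py may differ in its details:

```python
# Minimal sketch: export all desired service configs from the Ambari REST API.
# Assumptions: local Ambari endpoint on the headnode, admin credentials,
# and a JSON dump keyed by config type as the .out format.
import json
import requests

AMBARI = "http://headnodehost:8080/api/v1/clusters"
AUTH = ("admin", "<ambari-admin-password>")  # placeholder credentials

# Discover the cluster name, then the currently desired config tag per type.
cluster = requests.get(AMBARI, auth=AUTH).json()["items"][0]["Clusters"]["cluster_name"]
desired = requests.get(
    f"{AMBARI}/{cluster}?fields=Clusters/desired_configs", auth=AUTH
).json()["Clusters"]["desired_configs"]

# Fetch the properties behind each (type, tag) pair.
configs = {}
for cfg_type, meta in desired.items():
    items = requests.get(
        f"{AMBARI}/{cluster}/configurations?type={cfg_type}&tag={meta['tag']}",
        auth=AUTH,
    ).json()["items"]
    if items:
        configs[cfg_type] = items[0]["properties"]

# Save as <clustername>.out, for example plutos.out.
with open(f"{cluster.lower()}.out", "w") as f:
    json.dump(configs, f, indent=2, sort_keys=True)
```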
Workflow
To execute the migration process, follow the steps outlined below:
- Run the script on the HDI 4.0 cluster to obtain the current service configurations from Ambari. The output will be saved on the local system.
- Upload the output file to a public storage location, as it will be required for downloading on the HDI 5.1 cluster.
- Execute the script on the HDI 5.1 cluster to retrieve the current service configurations from Ambari. Save the output on the local system.
- Download the HDI 4.0 cluster configurations from the storage account to the HDI 5.1 cluster.
- Run the comparison script on the HDI 5.1 cluster, where both the HDI 4.0 and HDI 5.1 configurations are now present (a minimal sketch of the comparison follows this list).
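For reference, here is a minimal sketch of the comparison step, assuming the `.out` files are the JSON dumps produced by the download sketch above; the shipped compare_ambari_cluster_configs.py may differ:

```python
# Minimal sketch: diff two config dumps, skipping excluded properties.
import json
import sys

# Subset of the excluded properties listed above; extend as needed.
EXCLUDED = {
    "fs.defaultFS", "yarn.resourcemanager.hostname", "hive.zookeeper.quorum",
    "hive.metastore.uris", "ha.zookeeper.quorum", "hadoop.registry.zk.quorum",
}

with open(sys.argv[1]) as f:
    old = json.load(f)  # e.g. plutos.out (HDI 4.0)
with open(sys.argv[2]) as f:
    new = json.load(f)  # e.g. sugar.out (HDI 5.1)

diffs = []
for cfg_type in sorted(set(old) | set(new)):
    old_props = old.get(cfg_type, {})
    new_props = new.get(cfg_type, {})
    for key in sorted(set(old_props) | set(new_props)):
        if key in EXCLUDED:
            continue
        if old_props.get(key) != new_props.get(key):
            diffs.append(
                f"{cfg_type}/{key}: old={old_props.get(key)!r} new={new_props.get(key)!r}"
            )

# Print the differences and also save them locally, mirroring the utility.
print("\n".join(diffs))
with open("config_differences.out", "w") as f:
    f.write("\n".join(diffs))
```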
Execution
On HDI 4 Cluster (Old Cluster)
- ssh to the headnode.
- mkdir hdinsights_ambari_utils
- cd hdinsights_ambari_utils
- Download the ambari_export_cluster_configs.py script.
- Execute the script:
python ambari_export_cluster_configs.py
- Check for the config files:
ls -ltr
- The above shows an out file named after the cluster, for example plutos.out.
- Upload the file to a storage container so that it can be downloaded on the new cluster (a sketch of one upload option follows this list).
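Any upload method works here (Azure portal, Azure CLI, or azcopy). As one option, a sketch using the azure-storage-blob Python package and a SAS URL for a container that both clusters can reach; the URL below is a placeholder:

```python
# Sketch: upload the exported configs to a blob container via a SAS URL.
# The SAS URL is a placeholder; any container both clusters can reach works.
from azure.storage.blob import BlobClient

sas_url = "https://<account>.blob.core.windows.net/<container>/plutos.out?<sas-token>"
with open("plutos.out", "rb") as data:
    BlobClient.from_blob_url(sas_url).upload_blob(data, overwrite=True)
```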
On HDI 5 Cluster (New Cluster)
- ssh to headnode.
- mkdir hdinsights_ambari_utils
- cd hdinsights_ambari_utils
- Download the ambari_export_cluster_configs.py script.
- Execute the script:
python ambari_export_cluster_configs.py
- Check for the config files:
ls -ltr
- The above shows an out file named after the cluster, for example sugar.out.
- Download the old cluster's out file. In our case, it was uploaded to our storage container.
- Download the compare_ambari_cluster_configs.py script.
- Run the compare_ambari_cluster_configs.py script.
sshuser@hn0-sugar:~/hdinsights_ambari_utils$ python compare_ambari_cluster_configs.py plutos.out sugar.out
- The differences are printed to the console.
- In addition, the differences between the cluster configs are saved locally.
- ls -ltr
Oozie DB Migration
When migrating Oozie jobs from HDInsight 4.0 to HDInsight 5.1, the common approach is to manually edit the 'job.properties' file and resubmit the Oozie jobs. Customers who have access to the 'job.properties' file can simply copy the content from HDI 4 to HDI 5 and resubmit the jobs. After copying the content, modify the Namenode and Resourcemanager addresses and validate any static URLs. We provide a sed command to replace them automatically.
However, some customers don't have the 'job.properties' file; for them, there is a risk of losing the job, and recreating 'job.properties' becomes necessary. To address this, we have developed a utility that can download 'job.properties' from the Oozie database for all job types. By default, the script also automates the replacement of the Namenode and Resourcemanager hostname references within the 'job.properties' file.
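Conceptually, this recovery works because the Oozie database keeps each job's submitted configuration. Here is a rough sketch of the idea, assuming pyodbc, a SQL Server Oozie database, and the standard Oozie WF_JOBS table; the shipped utility's internals may differ:

```python
# Conceptual sketch: recover per-job configuration XML from the Oozie DB.
# Table/column names follow the standard Oozie schema; connection details
# are placeholders.
import os
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<sql-server>;DATABASE=<oozie-db>;UID=<user>;PWD=<password>"
)

os.makedirs("oozie_jobs_output", exist_ok=True)
for job_id, app_name, conf_xml in conn.execute(
    "SELECT id, app_name, conf FROM WF_JOBS"
):
    # 'conf' holds the job's submitted configuration as Hadoop XML;
    # save one file per job so job.properties can be reconstructed from it.
    with open(os.path.join("oozie_jobs_output", f"{app_name}_{job_id}.xml"), "w") as f:
        f.write(conf_xml)
```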
Key Considerations
- If the job.properties file is available:
- During the migration process, in both cases it is necessary to copy the entire job content, including 'workflow.xml' and other dependency libraries. After copying, the Namenode (NN) and Resourcemanager (RM) URLs need to be modified.
- For the NN and RM modification, we provide a sed command for replacement.
- If the job.properties file is not available:
- This utility provides a convenient solution to avoid manual edits and submissions of 'job.properties' during the migration process.
- This utility is not designed for resuming jobs from an HDI 4 cluster.
- It does not initiate job starts.
- Default behavior: automatic replacement of NN and RM addresses based on property-file keywords (a sketch of this rewrite follows this list).
- In both cases:
- Copying the job configs and libs is required.
- Manual review of 'job.properties' is required: apart from NN and RM, any properties referring to specific storage or static URLs need validation. The output of each 'job.properties' is saved in a designated directory.
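As an illustration of the default keyword-based replacement, here is a minimal sketch; the hostnames are placeholders and the utility's exact keyword list may differ:

```python
# Sketch: rewrite NN/RM entries in a recovered job.properties by keyword.
import re

NEW_NN = "wasb://<container>@<account>.blob.core.windows.net"  # new fs.defaultFS
NEW_RM = "<new-rm-host>:8050"                                  # new Resourcemanager

with open("job.properties") as f:
    text = f.read()

# Replace any property whose key looks like a Namenode or Resourcemanager
# entry, preserving the original key's casing via the backreference.
text = re.sub(r"(?im)^(nameNode)=.*$", rf"\1={NEW_NN}", text)
text = re.sub(r"(?im)^(jobTracker|resourceManager)=.*$", rf"\1={NEW_RM}", text)

with open("job.properties", "w") as f:
    f.write(text)
```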
Workflow
- Copy the job content from HDI 4.x to HDI 5.x.
- If job.properties is available:
- Run the sed command to modify the Namenode and Resourcemanager URLs.
- Review and validate the configs.
- If job.properties is not available:
- If custom DB (only the MSSQL connector is supported):
- Run the utility.
- Sort and recover the required job properties.
- If external DB:
- Retrieve the DB connection string and password.
- Supply the inputs.
- Run the utility.
- Sort and recover the required job properties.
- Manual Validation
- Submit jobs.
Execution
Type 1 – job.properties file is available.
- Ensure that the Oozie server is stopped and no jobs are in a running state.
- Copy the job content from HDI 4.x HDFS to the HDI 5.x local filesystem.
- On the HDI 5.x VM, run the sed commands on the local filesystem.
- Switch to the job directory
- Run the following, which reads the new cluster's default filesystem from core-site.xml and substitutes it into the job.properties or job.xml file:
- currentNN=$(sed -n '/<name>fs.default/,/<\/value>/p' /etc/hadoop/conf/core-site.xml | grep -o '<value>.*</value>' | sed 's/<\/\?value>//g')
- sed -i "s|namenode=.*$|namenode=${currentNN}|" job.properties
- Review and validate it (a helper sketch follows this list).
- Upload the content to HDFS.
- Submit the jobs.
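To help with the review step, here is a small sketch that flags lines still pointing at the old cluster; the marker strings are placeholders for your old hostnames and storage account:

```python
# Sketch: flag job.properties lines that may still reference the old cluster.
OLD_MARKERS = ("<old-headnode-host>", "<old-storage-account>")  # placeholders

with open("job.properties") as f:
    for lineno, line in enumerate(f, start=1):
        if any(marker in line for marker in OLD_MARKERS):
            print(f"line {lineno} may need updating: {line.strip()}")
```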
Type 2a – job.properties file is not available, and Oozie DB type is custom DB
- Ensure that the Oozie server is stopped and no jobs are in a running state.
- Copy the job content from HDI 4.x HDFS to the HDI 5.x local filesystem.
- Run the below commands
- mkdir oozieMigrationUtility ; cd oozieMigrationUtility
- wget https://hdiconfigactions2.blob.core.windows.net/hdi-sre-workspace/hdinsights_upgrade_utils/hdinsights_oozie_migration_utility/JavaPlayground-1.0-SNAPSHOT-jar-with-dependencies.jar
- wget https://hdiconfigactions2.blob.core.windows.net/hdi-sre-workspace/hdinsights_upgrade_utils/hdinsights_oozie_migration_utility/hdi_sre_oozie_utility.py
- Edit hdi_sre_oozie_utility.py
- By default, the DB type is set to custom and the start day is set to 1, so the script pulls the jobs for the last one day. Note: pulling down jobs for a long interval might cause slowness or consume more disk space, so set the startDay variable accordingly.
- Execute the command to download the job.properties.
- python hdi_sre_oozie_utility.py
- The oozie_jobs_output directory is created.
- Copy the required jobs.
- Review and validate it.
- Upload the content to HDFS.
- Submit the jobs.
Type 2b – job.properties file is not available, and Oozie DB type is external DB
- Follow steps 1–3 from Type 2a.
- Edit hdi_sre_oozie_utility.py
- Comment out the dbType='custom' line.
- Uncomment the below:
- dbType='external'
- ext_sql_server=''
- ext_db_name=''
- ext_db_username=''
- ext_db_password=''
- Set the value for the above variables.
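For reference, the edited block of hdi_sre_oozie_utility.py might look like the following; the values are placeholders:

```python
# dbType='custom'            # commented out when using an external DB
dbType='external'
ext_sql_server='<server>.database.windows.net'
ext_db_name='<oozie-database>'
ext_db_username='<username>'
ext_db_password='<password>'
```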
- Execute the command to download the job.properties.
- python hdi_sre_oozie_utility.py
- The oozie_jobs_output directory is created.
- Copy the required jobs.
- Review and validate it.
- Upload the content to HDFS.
- Submit the jobs.
Migration guide for Hive/LLAP
The HDInsight 4.0 Hive service is fully compatible with the HDInsight 5.1 version. Users can reuse the same metastore and storage container in the new version, and the application code does not require any changes.
To avoid conflicts with the existing clusters, it is recommended to copy the metastore and use the copy in the new cluster.
- If the cluster uses a default Hive metastore, follow this guide to export metadata to an external metastore. Then, create a copy of the external metastore for upgrade.
- If the cluster uses an external Hive metastore, create a copy of it. Options include export/import and point-in-time restore.
The new and old HDInsight clusters must have access to the same storage accounts.
While creating the cluster, choose the copied metastore database and the storage account mentioned above.
In case you have further questions, please reach out to Azure Support.