HDInsight 4.0 brings upgraded versions for all Apache components , but for this lab we specifically focus on the Hive versions.
Component | HDInsight 4.0 | HDInsight 3.6 |
---|---|---|
Hive | 3.1.0 | 2.1.0 ; 1.2.1 |
Some key areas that Hive 3.x differs from earlier versions are.
Client
Transaction Processing
Catalogs
For an exhaustive overview of advancements in Hive 3.0 , listen to this presentation.
This lab is a simulation of a real migration and will consist of the following steps in sequence.
Run the same TPCDS tests to ensure everything is working as intended.
1. Create storage account and click create
2. Populate the Basics tab of Storage account with the below information.
3. Populate the Advanced tab of Storage account with the below information.
< Insert Image >
4. Make no changes to the Tags Tab and post validation click Create on the Review + Create tab to create an Azure Blob storage account.
5. In the next section we would create the external metastore for the HDInsight Hadoop 3.6 cluster.
1. On the Azure Portal Navigate to the SQL Database blade and click create
2. Populate the Basics tab of SQL Server with the below information.
Resource group: Use the same resource group used for storage
Database Details :
Database name: Choose any allowed name for the metastore ( I used aghive36db to represent a Hive 2.1 metastore on an HDInsight 3.6 cluster type
Server :- Click Create new
Want to use SQL Elastic Pool : N
3. Compute+Storage : Click on the Configure Database link to Navigate to the database configuration page
4. In the Database Configuration page select the Standard Tier with the below settings and click Apply
5. In the Networking tab ensure the following settings are met
6.Leave the Additional settings and Tags to their default state
7. In the Review+Create tab click Create
8. In this section we created an External Hive Metastore(aghive36db) which we will use subsequently in an HDInsight 3.6 cluster.
To provision HDInsight LLAP with Azure Management Portal, perform the below steps.
Go to the Azure Portal portal.azure.com. Login using your azure account credentials.
Select Create a resource -> Azure HDInsight -> Create
Click the Custom(size ,settings, apps) slider
2. In the Basics tab populate the following values.
5. In the Storage tab populate the following values.
6. Metastore Settings: Enter the name for the SQL Database/SQL Server combination that was created in the last step
7. Click Authenticate to enter the username and password for the Metastore. Enter the username and password that was set for the SQL Server in the last exercise.
8. In the Configuration+Pricing tab select the node sizes for the cluster. There are no hard and fast rules and the recommendation is to select larger nodes for faster data processing. Note that choosing nodes that are too small may result in failures.
9. In the Review+Create tab , review the cluster specifics and click Create.
10. In this step we created an HDInsight 3.6 Hadoop cluster with preconfigured external metastore and storage account.
11. In the next step we will create some test data in the cluster to represent a production workload.
Goal of this step is to help generate TPCDS data with hive as a close representation of production data.
git clone https://github.com/hdinsight/tpcds-hdinsight && cd tpcds-hdinsight
hdfs dfs -copyFromLocal resources /tmp
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f TPCDSDataGen.hql -hiveconf SCALE=1 -hiveconf PARTS=1 -hiveconf LOCATION=/HiveTPCDS/ -hiveconf TPCHBIN=`grep -A 1 "fs.defaultFS" /etc/hadoop/conf/core-site.xml | grep -o "wasb[^<]*"`/tmp/resources
SCALE
is a scale factor for TPCDS. Scale factor 10 roughly generates 10 GB data, Scale factor 1000 generates 1 TB of data and so on.
PARTS
is a number of task to use for datagen (parrellelization). This should be set to the same value as SCALE
.
LOCATION
is the directory where the data will be stored on HDFS.
TPCHBIN
is where the resources are found. You can specify specific settings in settings.hql file.
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCDS/ -hiveconf DBNAME=tpcds
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpcds_orc -hiveconf SOURCE=tpcds
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/analyze.hql -hiveconf ORCDBNAME=tpcds_orc
beeline -u "jdbc:hive2://`hostname -f`:10001/tpcds_orc;transportMode=http" -n "" -p "" -i settings.hql -f queries/query12.sql
In this section we have created test data on the cluster and then tested a few queries representative of production datasets.
In the next section we would upgrade the Hive Metastore from 2.1.1. to 3.1.
2. Click on the Metastore to open the SQL DB portal.
3. On the SQL DB portal click on Restore.
Click OK to continue with creation of a copy
5. After creation, the new SQL db(aghive40db) appears as an additional database in the same SQL Server.
2. On the Script action page populated the parameters as described below and click Create
https://hdiconfigactions.blob.core.windows.net/hivemetastoreschemaupgrade/launch-schema-upgrade.sh
SQLservername NEWSQLdbname SQLServerusername SQLServerpassword
Note : Use single space with no commas or semicolons Example: agclusterdbserver aghive40db username password
3. The script action starts on the cluster
4. The script comes back with a green check mark which indicates successful completion.
After the script finishes , we would need to validate if the Hive Metastore is indeed upgraded
Enter the below query in the query editor and click Run. Select * from [dbo].version
Validate to see the if the schema version was upgraded to 3.1.0 . This would indicate that the Hive metastore was succesfully upgraded.
4. Post up-gradation , delete the older HDInsight 3.6 cluster.
In this section we upgraded the new Hive Metastore aghive40db from version 2.1.2 to 3.1.0 post which we deleted the older HDInsight cluster.
In the next section , we would create a new HDInsight 4.0( Hive 3.1) cluster with the new Hive Metastore and the older storage account.
To provision HDInsight LLAP with Azure Management Portal, perform the below steps.
Go to the Azure Portal portal.azure.com. Login using your azure account credentials.
Select Create a resource -> Azure HDInsight -> Create
Click the Custom(size ,settings, apps) slider
In the Basics tab populate the following values.
6. Click Authenticate to enter the username and password for the Metastore. Enter the username and password that was set for the SQL Server.
7. In the Configuration+Pricing tab select the node sizes for the cluster. There are no hard and fast rules and the recommendation is to select larger nodes for faster data processing. Note that choosing nodes that are too small may result in failures.
8. In the Review+Create tab , review the cluster specifics and click Create.
In this step we created an HDInsight 4.0 Hadoop cluster with preconfigured upgraded external metastore and mapped its storage to a preexisting storage container.
In the next step we will create some test data in the cluster to represent a production workload.
Goal of this step is to run TPCDS tests with an upgraded Hive Metastore to represent regression tests of Hive workloads.
The TPCDS repo should already be cloned and TPCDS data should already exist from what we created earlier.
SSH into the cluster using the ssh credentials provided durinng cluster creation.
Run a few TPCDS queries to represent a production regression test . Change the query number in the end to test various queries.
beeline -u "jdbc:hive2://`hostname -f`:10001/tpcds_orc;transportMode=http" -n "" -p "" -i settings.hql -f queries/query12.sql
Further reference:
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.