Blobs with Index Tags - How to identify the blobs with blob index tags and how to remove those tags
Published Nov 15 2023 08:49 AM
Microsoft

Background

 

This article describes how to identify the blobs with index tags and how to remove those tags using the Blob Inventory Service and Python SDK. 

 

Approach

 

This article is divided into two sections. These sections are independent, which means that you can perform the steps in section 1 and not perform the steps in section 2, or vice versa:

  1. Use the Blob Inventory service to identify the blobs with index tags

    • Follow the steps in this section if you want to identify which blobs in your Storage Account have index tags, and which index tags are associated with them.
  2. Remove the Blob Index Tags

    • You should follow the steps in this section if you want to remove all index tags from the Blobs under a specific container.

 

1. Use the Blob Inventory Service to identify the blobs with index tags

 

Azure Storage blob inventory provides a list of the containers, blobs, blob versions, and snapshots in your storage account, along with their associated properties. It generates an output report in either comma-separated values (CSV) or Apache Parquet format on a daily or weekly basis. You can use the report to audit the retention, legal hold, or encryption status of your storage account contents, or to understand the total data size, age, tier distribution, or other attributes of your data. For more information, see Azure Storage blob inventory.

For the steps to enable inventory reports, see Enable Azure Storage blob inventory reports.
 

To define a blob inventory rule that covers all blobs:

  1. Open the Blob Inventory area in your storage account
  2. Add a new inventory rule by filling in the following fields:
    1. Rule name: The name of your blob inventory rule.
    2. Container: The container in which to place the inventory files.
    3. Object type to inventory: Select Blob.
    4. Blob types: Select all [Block blobs, Page blobs, Append blobs].
    5. Subtypes: Select [Include blob versions, Include snapshots].
    6. Blob inventory fields: See Custom schema fields supported for blob inventory for the full list.
      1. In this scenario, select at least the following fields: [Name, Etag, Snapshot, Current version status, VersionId, Tags, TagCount].
    7. Inventory frequency: With Daily, an inventory run is scheduled automatically every day. With Weekly, the inventory run is triggered only on Sundays.
      1. In this scenario, select Daily to get the results sooner.
    8. Export format: The format of the inventory report, either CSV or Apache Parquet.
    9. Prefix match: Filters blobs by name or by the first letters of the name. To find items in a specific container, enter the name of the container followed by a forward slash, then the blob name or its first letters. For example, to include all blobs starting with “a”, type “myContainer/a”.
      1. Use this field to set the path from which blob information is collected. It can be just the container name.
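
The rule described above can also be written down as an inventory policy document, which is convenient when the same rule has to be created for several containers. The following is a minimal sketch that builds such a document as a Python dictionary. The rule name, destination container, and prefix are hypothetical placeholders, and the key names (including "IsCurrentVersion" for the portal's "Current version status" option) reflect our understanding of the blob inventory policy schema, so please verify them against the current documentation before applying the policy.

```python
import json

# Fields needed for this scenario. "IsCurrentVersion" is assumed to be the
# schema name behind the portal option "Current version status".
SCHEMA_FIELDS = [
    "Name", "Etag", "Snapshot", "IsCurrentVersion", "VersionId", "Tags", "TagCount",
]


def build_inventory_policy(rule_name, destination_container, prefix=None):
    """Return a policy dict with one daily CSV rule that covers all blob types."""
    filters = {
        "blobTypes": ["blockBlob", "pageBlob", "appendBlob"],
        "includeBlobVersions": True,
        "includeSnapshots": True,
    }
    if prefix:
        # For example "mycontainer/" limits the rule to a single container.
        filters["prefixMatch"] = [prefix]

    rule = {
        "enabled": True,
        "name": rule_name,
        "destination": destination_container,
        "definition": {
            "objectType": "Blob",
            "format": "Csv",
            "schedule": "Daily",
            "schemaFields": SCHEMA_FIELDS,
            "filters": filters,
        },
    }
    return {"enabled": True, "type": "Inventory", "rules": [rule]}


# Hypothetical names, for illustration only.
policy = build_inventory_policy("blobindextagsrule", "inventory-output", prefix="mycontainer/")
print(json.dumps(policy, indent=2))
```

Calling the function once per container, each time with a different prefix, yields one rule per container.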

 

The blob inventory result will have the information as follows:

 

[Screenshot: sample blob inventory result]

 

Support documentation:

  • Inventory run - A blob inventory run is automatically scheduled every day. It can take up to 24 hours for an inventory run to complete.
    • If your storage account has a very large number of blobs, the run can take longer. In that case, it can be better to create one rule per container, which you can do by using the prefix match field.
  • Inventory output - Each inventory rule generates a set of files in the specified inventory destination container for that rule.
  • Inventory files - Each inventory run for a rule generates the following files: Inventory file, Checksum file, Manifest file.
  • Pricing and billing - Pricing for inventory is based on the number of blobs and containers that are scanned during the billing period.
  • Known issues - Please find the known issues associated with blob inventory here.

 

2. Remove the Blob Index Tags

 

In this section, you can find a script that uses the Python SDK to remove the blob index tags from all the blobs under a specific container.


Please note that once these blob index tags are removed, they cannot be recovered. So, apply these steps only when you are sure that you no longer need to use your blob index tags.
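
Because the removal cannot be undone, you may want to keep a copy of the tags first. The following is a minimal sketch, not part of the original scripts, that writes the tags of every tagged blob in a container to a JSON file. The function name export_index_tags and the output file name are hypothetical. The function only assumes an object that offers list_blobs() and get_blob_client(name).get_blob_tags(), which is how we understand the ContainerClient of azure-storage-blob to behave.

```python
import json


def export_index_tags(container_client, output_path):
    """Save {blob name: tags} for every tagged blob to a JSON file.

    Returns the number of blobs that had at least one index tag.
    """
    saved = {}
    for blob in container_client.list_blobs():
        tags = container_client.get_blob_client(blob.name).get_blob_tags()
        if tags:
            saved[blob.name] = dict(tags)

    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(saved, handle, indent=2)
    return len(saved)


# Example usage (requires azure-storage-blob and real credentials, so it is
# left commented out here):
# from azure.storage.blob import BlobServiceClient
# service = BlobServiceClient(account_url="https://XXX.blob.core.windows.net", credential="XXX")
# export_index_tags(service.get_container_client("XXX"), "index_tags_backup.json")
```

If the tags are needed again later, the saved dictionary for a blob could be passed back to set_blob_tags on that blob.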

 

Prerequisites

 

Use any Python IDE of your choice, and install the packages that the scripts import: pandas (Script 1) and azure-storage-blob (Script 2).

 

Sample scripts:

 

Special notes:

  • These scripts were developed and tested with the following versions, but they are expected to work with earlier versions as well:
    • Python 3.11.5
    • azure-storage-blob 12.18.2
  • The specifications of the computer or Virtual Machine (VM) have a significant impact on script performance.
  • Running the script on a VM in the same region as the storage account can also improve script performance.
  • Script 2 sets the number of concurrent threads in the concurrency parameter (line 15) and passes it to ThreadPoolExecutor as max_workers (line 32). To decide which value to use, please review the description of max_workers in the Python documentation for concurrent.futures (Launching parallel tasks).
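
To illustrate what max_workers controls, the following minimal sketch runs a batch of short tasks in a ThreadPoolExecutor and records how many of them were running at the same time. It uses only the Python standard library. The function name peak_concurrency and the sleep that stands in for a network call are assumptions made for this illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(task_count, max_workers):
    """Run task_count short tasks and return the highest number running at once."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def task():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # Stand-in for a network call such as a tag removal
        with lock:
            running -= 1

    # Leaving the "with" block waits until all submitted tasks have finished.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for _ in range(task_count):
            executor.submit(task)
    return peak


# The reported peak never exceeds max_workers, however many tasks are queued.
print(peak_concurrency(task_count=50, max_workers=5))
```

A higher value allows more requests in flight at once, but each thread also consumes resources on the machine that runs the script, so the best value depends on the computer or VM being used.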

 

If you generated the blob inventory report to identify the blobs with blob index tags (Section 1), you can use the script below (Script 1) to identify the containers with blobs that have index tags. Script 1 reads a CSV file, so it requires a report exported in CSV format. If you already know the names of the containers, please skip this script.

 

Script 1

 

# Please update the parameter below with your own information before executing this script:
   # inventoryPath: The path to the blob inventory report file

import pandas as pd

inventoryPath = "C:\\XXX\\blobindextagsruleFILE.csv"

df = pd.read_csv(inventoryPath, sep = ",")

df['container'] = df['Name'].str.split('/').str[0]

df = df[df['TagCount'] > 0]

df = df['container'].drop_duplicates()

for i in df:
    print(i)
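
Script 1 prints only the container names. When deciding which container to process first, it can also help to know how many tagged blobs each container holds. The following is a minimal sketch of that idea using only the Python standard library. It assumes the same Name and TagCount columns that Script 1 uses, and the inventory excerpt is made up for the illustration.

```python
import csv
import io
from collections import Counter


def count_tagged_blobs(csv_file):
    """Return a Counter mapping container name to the number of rows with TagCount > 0."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # An empty TagCount cell is treated as zero tags.
        tag_count = row.get("TagCount") or "0"
        if int(float(tag_count)) > 0:
            container = row["Name"].split("/", 1)[0]
            counts[container] += 1
    return counts


# Hypothetical inventory excerpt; for a real report, pass
# open(inventoryPath, newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "Name,TagCount\n"
    "container-a/file1.txt,2\n"
    "container-a/dir/file2.txt,\n"
    "container-b/file3.txt,1\n"
    "container-a/file4.txt,1\n"
)
for container, count in count_tagged_blobs(sample).most_common():
    print(container, count)
```

Note that when the inventory rule includes versions and snapshots, each of them appears as its own row and is therefore counted separately.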

 

After identifying the containers with Blobs with index tags, you can run the next script below (Script 2) to remove all index tags. We advise you to run the script once for each container. Please note that you can run several script instances in parallel.

 

Script 2

 

# Please update the below parameters with your own information before executing this script:
   # account_name: Storage account name.
   # account_key: Storage account key.
   # container_name: Name of the container where the blobs with index tags are.

from azure.storage.blob import BlobServiceClient
from concurrent.futures import ThreadPoolExecutor

# Define your storage account name and key, and the container name
account_name = "XXX"
account_key = "XXX"
container_name = "XXX"

# Define the number of concurrent threads
concurrency = 250

# Counter for the number of blobs with index tags
blob_count = 0

# Create a BlobServiceClient object
blob_service_client = BlobServiceClient(account_url=f"https://{account_name}.blob.core.windows.net", credential=account_key)

# Function to remove index tag from a blob
def remove_blob_index_tag(blob_name):
    # Get the blob client
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

    # Remove the index tag
    blob_client.set_blob_tags(tags=None)

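# Create a ThreadPoolExecutor with the specified concurrency
with ThreadPoolExecutor(max_workers=concurrency) as executor:
    container_client = blob_service_client.get_container_client(container_name)

    # Keep every submitted task (mapped to its blob name) so that failures can be reported
    futures = {}

    for blob in container_client.list_blobs():
        # Get the blob client
        blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob.name)

        blob_tags = blob_client.get_blob_tags()

        # Submit a removal task only if the blob has at least one index tag
        if blob_tags:
            futures[executor.submit(remove_blob_index_tag, blob.name)] = blob.name

    # Wait for each task; result() re-raises any exception raised in the worker thread
    for future, blob_name in futures.items():
        try:
            future.result()
            blob_count += 1
        except Exception as error:
            print(f"Failed to remove index tags from {blob_name}: {error}")

print(f"This script removed index tags on {blob_count} blobs")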
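
Script 2 calls get_blob_tags once per blob before deciding whether to submit a removal task. As a possible optimization, the listing itself can be asked to return the tags, which avoids that extra request per blob. The following is a minimal sketch under the assumption that list_blobs(include=["tags"]) fills the tags attribute of each listed blob, as we understand azure-storage-blob to do. The helper name names_with_tags is hypothetical, and the demonstration uses stand-in objects instead of a real container.

```python
from types import SimpleNamespace


def names_with_tags(blobs):
    """Return the names of listed blobs that carry at least one index tag."""
    return [blob.name for blob in blobs if getattr(blob, "tags", None)]


# Stand-in objects that mimic the name and tags attributes of listed blobs.
listed = [
    SimpleNamespace(name="file1.txt", tags={"project": "alpha"}),
    SimpleNamespace(name="file2.txt", tags=None),
    SimpleNamespace(name="file3.txt", tags={}),
]
print(names_with_tags(listed))

# With the container_client from Script 2 (assumption, not executed here):
# tagged = names_with_tags(container_client.list_blobs(include=["tags"]))
```

Please test this on a small container first to confirm that the tags are returned by the listing for your SDK version and permissions.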

 

Disclaimer:

  • These steps are provided for the purpose of illustration only. 
  • These steps and any related information are provided "as is" without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
  • We grant You a nonexclusive, royalty-free right to use and modify the steps and to reproduce and distribute the steps, provided that You agree:
    • to not use Our name, logo, or trademarks to market Your software product in which the steps are embedded;
    • to include a valid copyright notice on Your software product in which the steps are embedded; and
    • to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of steps.
Last update: Nov 16 2023 01:20 AM