Azure Storage
How to get blob Total Blob Count and Total Capacity with Blob Inventory
Approach

This article presents how to take advantage of the Blob Inventory service to get the total blob count and the total capacity per storage account, per container, or per directory. I will present the steps to create the blob inventory rule and how to get the needed information just by using the prefix match field, without having to process the blob inventory rule results. Additional support documentation is presented at the end of the article.

Introduction to the Blob Inventory Service

Azure Storage blob inventory provides a list of the containers, blobs, blob versions, and snapshots in your storage account, along with their associated properties. It generates an output report in either comma-separated values (CSV) or Apache Parquet format on a daily or weekly basis. You can use the report to audit the retention, legal hold, or encryption status of your storage account contents, or you can use it to understand the total data size, age, tier distribution, or other attributes of your data. Please find here our documentation about the blob inventory service. In this article, I will focus on using this service to get the blob count and the capacity.

Steps to enable the inventory report

Please see below how to define a blob inventory rule to get the intended information, using the Azure portal:

1. Sign in to the Azure portal to get started.
2. Locate your storage account and display the account overview.
3. Under Data management, select Blob inventory.
4. Select Add your first inventory rule if you do not have any rule defined, or select Add a rule if you already have at least one rule defined.
5. Add a new inventory rule by filling in the following fields:
   5.1 Rule name: The name of your blob inventory rule.
   5.2 Container: The container that stores the result of the blob inventory rule execution.
   5.3 Object type to inventory: Select blob.
   5.4 Blob types: Blob Storage: select all (Block blobs, Page blobs, Append blobs). Data Lake Storage: select all (Block blobs, Append blobs).
   5.5 Subtypes: Blob Storage: select all (Include blob versions, Include snapshots, Include deleted blobs). Data Lake Storage: select all (Include snapshots, Include deleted blobs).
   5.6 Blob inventory fields: Please find here all custom schema fields supported for blob inventory. In this scenario, we need to select at least the following fields. Blob Storage: Name, Creation-Time, ETag, Content-Length, Snapshot, VersionId, IsCurrentVersion, Deleted, RemainingRetentionDays. Data Lake Storage: Name, Creation-Time, ETag, Content-Length, Snapshot, DeletionId, Deleted, DeletedTime, RemainingRetentionDays.
   5.7 Inventory frequency: A blob inventory run is automatically scheduled every day when daily is chosen. Selecting a weekly schedule will only trigger the inventory run on Sundays. A daily execution returns results faster.
   5.8 Export format: The export format, either a CSV file or a Parquet file.
   5.9 Prefix match: Filter blobs by name or first letters. To find items in a specific container, enter the name of the container followed by a forward slash, then the blob name or first letters. For example, to show all blobs starting with "a", type: "myContainer/a". This is the place to add the path where to start collecting the blob information.

Step 5.9 above (the prefix match field) is the main point of this article. Consider a storage account with a container named work, and a directory named items inside the container named work. Please see below how to configure the prefix match field to get the needed result:

- Leave it empty to get the information at the storage account level.
- Add the container name in the prefix match field to get the information at the container level. Put prefix match = work/
- Add the directory path in the prefix match field to get the information at the directory level.
  Put prefix match = work/items/

The blob inventory execution will generate a file named <ruleName>-manifest.json (please see more information about this file in the support documentation section). This file captures the rule definition provided by the user and the path to the inventory for that rule, and it already contains the information that we want, without having to process the blob inventory rule files:

{
  "destinationContainer": "inventory-destination-container",
  "endpoint": "https://testaccount.blob.core.windows.net",
  "files": [
    { "blob": "2021/05/26/13-25-36/Rule_1/Rule_1.csv", "size": 12710092 }
  ],
  "inventoryCompletionTime": "2021-05-26T13:35:56Z",
  "inventoryStartTime": "2021-05-26T13:25:36Z",
  "ruleDefinition": {
    "filters": {
      "blobTypes": [ "blockBlob" ],
      "includeBlobVersions": false,
      "includeSnapshots": false,
      "prefixMatch": [ "penner-test-container-100003" ]
    },
    "format": "csv",
    "objectType": "blob",
    "schedule": "daily",
    "schemaFields": [ "Name", "Creation-Time", "BlobType", "Content-Length", "LastAccessTime", "Last-Modified", "Metadata", "AccessTier" ]
  },
  "ruleName": "Rule_1",
  "status": "Succeeded",
  "summary": { "objectCount": 110000, "totalObjectSize": 23789775 },
  "version": "1.0"
}

The objectCount value is the total blob count, and totalObjectSize is the total capacity in bytes.

Special notes:
- A rule needs to be defined for each path (container or directory) for which you want the total blob count and the total capacity.
- The blob inventory rule generates CSV or Apache Parquet formatted file(s). These files can be deleted if the blob inventory rule is used only to get the information presented in this article.

Support Documentation

- Enable Azure Storage blob inventory reports: The steps to enable the inventory report.
- Inventory run: If you configure a rule to run daily, it is scheduled to run every day. If you configure a rule to run weekly, it is scheduled to run each week on Sunday, UTC time.
  The time taken to generate an inventory report depends on various factors, and the maximum amount of time that an inventory run can take before it fails is six days.
- Inventory output: Each inventory rule generates a set of files in the specified inventory destination container for that rule. The inventory output is generated under the following path: https://<accountName>.blob.core.windows.net/<inventory-destination-container>/YYYY/MM/DD/HH-MM-SS/<ruleName>, where accountName is your Azure Blob Storage account name, inventory-destination-container is the destination container you specified in the inventory rule, YYYY/MM/DD/HH-MM-SS is the time when the inventory began to run, and ruleName is the inventory rule name.
- Inventory files: Each inventory run for a rule generates the following files. Inventory file: an inventory run for a rule generates a CSV or Apache Parquet formatted file; each such file contains matched objects and their metadata. Checksum file: a checksum file contains the MD5 checksum of the contents of the manifest.json file; the name of the checksum file is <ruleName>-manifest.checksum, and generation of the checksum file marks the completion of an inventory rule run. Manifest file: a manifest.json file contains the details of the inventory file(s) generated for that rule; the name of the file is <ruleName>-manifest.json, and this file also captures the rule definition provided by the user and the path to the inventory for that rule.
- Pricing and billing: Pricing for inventory is based on the number of blobs and containers that are scanned during the billing period.
- Known issues and limitations: This section describes limitations and known issues of the Azure Storage blob inventory feature.

Disclaimer

These steps are provided for the purpose of illustration only.
These steps and any related information are provided "as is" without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose. We grant You a nonexclusive, royalty-free right to use and modify the Steps and to reproduce and distribute the steps, provided that You agree: to not use Our name, logo, or trademarks to market Your software product in which the steps are embedded; to include a valid copyright notice on Your software product in which the steps are embedded; and to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys' fees, that arise or result from the use or distribution of the steps.

Remove Unnecessary Azure Storage Account Dependencies in VM Diagnostics
This post explains how to reduce unnecessary Azure Storage Account dependencies, and the associated SAS token usage, by simplifying VM diagnostics configurations: specifically, by removing the retiring legacy IaaS Diagnostics extension and migrating VM boot diagnostics from customer-managed Storage Accounts to Microsoft-managed storage. Using Azure Resource Graph to identify affected virtual machines at scale, the article shows that both changes can be implemented without VM reboots or guest OS impact, reduce storage sprawl and operational overhead, and help organizations stay ahead of platform deprecations, with automation options available to standardize these improvements across environments.

SSMS 21/22 Error Upload BACPAC file to Azure Storage
Hello All,

In SSMS 20, I can use "Export Data-tier Application" to export a BACPAC file of an Azure SQL database and upload it to Azure Storage from the same machine. SSMS 21 gives an error message when doing the same export: it creates the BACPAC file but fails on the last step, "Uploading BACPAC file to Microsoft Azure Storage". The error message is: "Could not load file or assembly 'System.IO.Hashing, Version=6.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51' or one of its dependencies. The system cannot find the file specified. (Azure.Storage.Blobs)". I tried a fresh installation of SSMS 21 on a brand-new machine (Windows 11), same issue. Can anyone advise? Thanks
[Design Pattern] Handling race conditions and state in serverless data pipelines
Hello community,

I recently faced a tricky data engineering challenge involving a lot of Parquet files (about 2 million records) that needed to be ingested, transformed, and split into different entities. The hard part wasn't the volume, but the logic. We needed to generate globally unique, sequential IDs for specific columns while keeping the execution time under two hours, and we were restricted to using only Azure Functions, ADF, and Storage. This created a conflict: we needed parallel processing to meet the time limit, but parallel processing usually breaks sequential ID generation due to race conditions on the counters.

I documented the three architecture patterns we tested to solve this:
1. Sequential processing with ADF (safe, but failed the 2-hour time limit).
2. Parallel processing with external locking/e-tags on Table Storage (too complex, and we still hit issues with inserts).
3. A "Fan-Out/Fan-In" pattern using Azure Durable Functions and Durable Entities.

We ended up going with Durable Entities. Since they act as stateful actors, they allowed us to handle the ID counter state sequentially in memory while the heavy lifting (transformation) ran in parallel. It solved the race condition issue without killing performance.

I wrote a detailed breakdown of the logic and trade-offs here if anyone is interested in the implementation details: https://medium.com/@yahiachames/data-ingestion-pipeline-a-data-engineers-dilemma-and-azure-solutions-7c4b36f11351

I am curious if others have used Durable Entities for this kind of ETL work, or if you usually rely on an external database sequence to handle ID generation in serverless setups?

Thanks, Chameseddine

Azure SQL Database : Can I use same primary key column and foreign key column for multiple tables?
CREATE TABLE Table1(
    Table1ID int PRIMARY KEY,
    Column2 int
);

CREATE TABLE Table2(
    Table1ID int PRIMARY KEY,
    Column2 int,
    FOREIGN KEY (Table1ID) REFERENCES Table1(Table1ID)
);

CREATE TABLE Table3(
    Table1ID int PRIMARY KEY,
    Column2 int,
    FOREIGN KEY (Table1ID) REFERENCES Table1(Table1ID)
);

Azure Logic App workflow (Standard) Resubmit and Retry
Hello Experts,

A workflow is scheduled to run daily at a specific time and retrieves data from different systems using REST API calls (8-9 of them). The data is then sent to another system through API calls using multiple child flows. We receive more than 1500 input records, and for each record an API call needs to be made. During the API invocation, there is a possibility of failure due to server errors (5xx) and client errors (4xx). To handle this, we have implemented a "Retry" mechanism with a fixed interval. However, there is still a chance of flow failure for various reasons. Although there is a "Resubmit" feature available at the action level, I cannot apply it in this case because we are using multiple child workflows and the response is sent back from one flow to another. Is it necessary to utilize the "Resubmit" functionality?

The Retry functionality has been developed to handle any server API errors (5xx) that may occur with connectors (both custom and standard), including client API errors 408 and 429. In this specific scenario, it is reasonable to attempt retrying or resubmitting the API call from the Azure Logic Apps workflow. Nevertheless, there are other situations where implementing the retry and resubmit logic would result in the same error outcome. Is it acceptable to proceed with the Retry functionality in this particular scenario? It would be highly appreciated if you could provide guidance on the appropriate methodology.

Thanks
-Sri

Deriving expiry days and remaining retention days for blobs through blob inventory
In managing data within Azure Blob Storage accounts and Azure Data Lake Gen2 storage accounts, organizations often encounter scenarios where blobs have been deleted but remain in a soft-deleted state. Calculating the remaining retention days for all such blobs across an entire storage account can be a critical requirement for customers seeking to optimize data management and ensure compliance with retention policies. Additionally, certain blobs may have an expiry time set, scheduling their deletion for a future date. To facilitate the identification and monitoring of these blobs and their respective expiry times, a custom query has been written to efficiently list and calculate expiry information, enabling users to proactively manage their storage resources.

The expiry time for Azure blobs is set using the Set Blob Expiry operation. This feature is present only in hierarchical-namespace-enabled storage accounts. We can set the expiry with the below steps:

i) Azure Storage Actions - About Azure Storage Actions - Azure Storage Actions | Microsoft Learn. A storage action can be used to set blob expiry; a high-level snippet for the operation is shared below.
ii) REST API - Set Blob Expiry (REST API) - Azure Storage | Microsoft Learn to set the expiry time for your blobs. This ensures that each blob has a defined lifecycle and will be deleted after the specified period.

This blog is a step-by-step walkthrough of listing the expiry time and retention of the blobs using the Blob Inventory report and then parsing it using Synapse.

1. Set a blob inventory rule. Get the CSV file from the blob inventory run: go to the container where the inventory reports are stored, navigate to the most recent date folder, and get the URL of the blob inventory CSV file. Sharing the below snippet for reference:

2. Create an Azure Synapse workspace. Next, create an Azure Synapse workspace where you will execute a SQL query to report the inventory results.
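Before moving to Synapse, the container/blob split that the inventory query performs can also be sketched offline for a quick sanity check. This is illustrative only: the sample rows are made up, and a real inventory CSV contains more columns.

```shell
# Make a tiny stand-in for an inventory CSV (made-up rows).
cat > inventory.csv <<'EOF'
Name,Expiry-time,RemainingRetentionDays
work/items/test1.txt,2024-06-01T00:00:00Z,
work/items/test2.txt,,5
EOF

# Split Name into container (before the first "/") and blob (after it),
# and print them together with the expiry and remaining-retention columns.
awk -F, 'NR > 1 {
  container = $1; sub(/\/.*/, "", container)
  blob = $1;      sub(/^[^\/]*\//, "", blob)
  print container "," blob "," $2 "," $3
}' inventory.csv
```

An empty Expiry-time or RemainingRetentionDays column means no expiry is set or the blob is not soft-deleted, mirroring the NULL values returned by the Synapse query.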
Create the SQL query. After you create your Azure Synapse workspace, do the following steps:
- Navigate to https://web.azuresynapse.net.
- Select the Develop tab on the left edge.
- Select the large plus sign (+) to add an item.
- Select SQL script.

3. Use the sample queries below to get the expiry time and the remaining retention days of blobs, respectively.

select
    LEFT([Name], CHARINDEX('/', [Name]) - 1) AS Container,
    RIGHT([Name], LEN([Name]) - CHARINDEX('/', [Name])) AS Blob,
    [Expiry-time]
from OPENROWSET(
    bulk '<URL to your inventory CSV file>',
    format = 'csv',
    parser_version = '2.0',
    header_row = true
) as Source

For blobs which were deleted directly, you can calculate the remaining retention days, since the data is present in a soft-deleted state and will be deleted permanently once the retention days are over.

select
    LEFT([Name], CHARINDEX('/', [Name]) - 1) AS Container,
    RIGHT([Name], LEN([Name]) - CHARINDEX('/', [Name])) AS Blob,
    [Expiry-time],
    RemainingRetentionDays
from OPENROWSET(
    bulk '<URL to your inventory CSV file>',
    format = 'csv',
    parser_version = '2.0',
    header_row = true
) as Source

In the above snippet, a NULL value means that the blob is not deleted and no expiry time is set on the blob yet.

Please note: calculating blob expiry from the blob inventory is one way; customers can explore other options such as PowerShell and Azure CLI to achieve the same.

Reference links:
- Set Blob Expiry (REST API) - Azure Storage | Microsoft Learn
- Create a storage task - Azure Storage Actions | Microsoft Learn
- Azure Storage blob inventory | Microsoft Learn
- Calculate blob count and size using Azure Storage inventory | Microsoft Learn

How to configure directory level permission for SFTP local user
SFTP is a feature supported for Azure Blob Storage with a hierarchical namespace (ADLS Gen2 storage accounts). As documented, the permission system used by the SFTP feature is different from the normal permission system in an Azure storage account: it uses a form of identity management called local users. Normally, the permissions that can be set up on local users while creating them are at the container level. But in real use cases, it is common to need multiple local users, each with permission on only one specific directory. In this scenario, using ACLs (access control lists) for local users is a great solution. In this blog, we'll set up an environment using ACLs for local users and see how it meets this aim.

Attention! As mentioned in the Caution part of the document, ACLs for local users are supported, but still in preview. Please do not use this for your production environment.

Preparation

Before configuring local users and ACLs, the following things are already prepared:
- One ADLS Gen2 storage account (in this example, called zhangjerryadlsgen2).
- A container (testsftp) with two directories (dir1 and dir2).
- One file uploaded into each directory (test1.txt and test2.txt).

The file system in this blog looks like:

testsftp/
  dir1/
    test1.txt
  dir2/
    test2.txt

Aim

The aim is to have user1, which can only list files saved in dir1, and user2, which can only list files saved in dir2. Both of them should be unable to do any other operations in the matching directory (dir1 for user1, dir2 for user2) and should be unable to do any operations in the root directory or in the other directory.

Configuring local users

From the Azure portal, it's easy to enable the SFTP feature and create local users. Besides user1 and user2, one additional user is necessary: it will be used as the administrator to assign ACLs for user1 and user2. In this blog, it's called admin.
While creating the admin, its landing directory should be the root directory of the container, and all permissions should be granted. While creating user1 and user2, since their permissions will be controlled using ACLs, the containers and permissions should be left empty and the Allow ACL authorization option should be checked. The landing directory should be set to the directory on which the user will have permission later (in this blog, dir1 for user1 and dir2 for user2).

User1:

User2:

After the local users are created, one more step needed before configuring ACLs is to note down the user IDs of user1 and user2. Clicking a created local user opens a page for editing the local user, which includes the user ID. In this blog, the user ID of user1 is 1002 and the user ID of user2 is 1003.

Configuring ACLs

Before starting to configure ACLs, it is necessary to clarify which permissions to assign. As explained in this document, ACLs contain three different permissions: Read (R), Write (W), and Execute (X). The "Common scenarios related to ACL permissions" part of the same document contains a table with most operations and their corresponding required permissions. Since the aim of this blog is to allow user1 only to list dir1, according to the table, the correct permissions for user1 are X on the root directory plus R and X on dir1 (for user2: X on the root directory plus R and X on dir2).

After clarifying the needed permissions, the next step is to assign the ACLs. The first step is to connect to the storage account over SFTP as admin. (In this blog, a PowerShell session with OpenSSH is used, but that is not the only way; you can use any other SFTP client to connect to the storage account.)
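As an illustration, the admin session and the ACL assignments described in the next paragraphs might look roughly like the following OpenSSH sftp transcript. The host name and mode bits here are assumptions based on this blog's setup; adjust them to your environment and existing permissions.

```
# Connect as the admin local user (username format: <account>.<localuser>).
sftp zhangjerryadlsgen2.admin@zhangjerryadlsgen2.blob.core.windows.net

# Give Execute (X) to "other" users on the root directory so that
# user1/user2 can traverse into their own directories.
sftp> chmod 711 /

# Make each local user the owner of its directory, by user ID
# (1002 = user1, 1003 = user2, as noted earlier).
sftp> chown 1002 /dir1
sftp> chown 1003 /dir2
```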
Since it is not possible to assign ACLs to one specific named local user, and the owner of the root directory is a built-in user controlled by Azure, the easiest way here is to give the X permission to all other users. (For the concept of other users, please refer to this document.)

The next step is to assign the R and X permissions. For the same reason, it is not an option to give R and X permissions to all other users again: if that were done, user1 would also have R and X permissions on dir2, which does not match the aim. The best way here is to change the owner of each directory: change the owner of dir1 to user1 and the owner of dir2 to user2. (This way, user1 will not have permission to touch dir2.)

After the above configuration, when connecting to the storage account over SFTP as user1 or user2, only the list operation under the corresponding directory is allowed.

User1:

User2: (The following test result proves that only the list operation under /dir2 is allowed. All other operations return a permission denied or not found error.)

About the landing directory

What happens if all other configurations are correct but the landing directory is configured as the root directory for user1 or user2? The answer is quite simple: the configuration will still work, but it will impact the user experience. To show the result of that case, one more local user called user3, with user ID 1005, is created, but its landing directory is configured like admin's, on the root directory. The ACL permissions assigned to it are the same as user2's (the owner of dir2 is changed to user3). When connecting to the storage account by SFTP as user3, it lands on the root directory. But per the ACL configuration, it only has permission to list files in dir2, hence the operations in the root directory and dir1 are expected to fail.
To apply further operations, the user needs to add dir2/ in the command or cd dir2 first.

Exclude Prefix in Azure Storage Action: Smarter Blob Management
Azure Storage Actions is a powerful platform for automating data management tasks across Blob and Data Lake Storage. Among its many features, Exclude Prefix stands out as a subtle yet critical capability that helps fine-tune task assignments.

What Is the "Exclude Prefix" Feature?

The Exclude Prefix option allows users to omit specific blobs or folders from being targeted by Azure Storage Actions. This is particularly useful when applying actions such as:
- Moving blobs to a cooler tier
- Deleting blobs
- Rehydrating archived blobs
- Triggering workflows based on blob changes

For example, if you're running a task to archive blobs older than 30 days but want to exclude logs or config files, you can define a prefix like logs/ or config/ in the exclusion list.

How to Use It — Example Scenario

In the following example, blobs across the entire storage account were deleted based on a condition: if a blob's access tier was set to Hot, it was deleted, except for those blobs or paths explicitly listed under the Exclude blob prefixes property.

Create a task:
1. Navigate to the Azure portal and search for Storage tasks. Then, under Services, click on Storage tasks – Azure Storage Actions.
2. On the Azure Storage Actions | Storage Tasks page, click Create to begin configuring a new task.
3. Complete all the required fields, then click Next to proceed to the Conditions page.
4. To configure blob deletion, add the following conditions on the Conditions page.

Add the assignment:
1. Click Add assignment in the Select scope section, choose your subscription and storage account, then provide a name for the assignment.
2. In the Role assignment section, select Storage Blob Data Owner from the Role drop-down list to assign this role to the system-assigned managed identity of the storage task.
3. In the Filter objects section, specify the Exclude blob prefix filter if you want to exclude specific blobs or folders from the task.
In the example specified above, blobs will be deleted, except for those under the path "excludefiles" listed in the Exclude blob prefixes property.

4. In the Trigger details section, choose when the task runs and then select the container where you'd like to store the execution reports. Select Add.
5. In the Tags tab, select Next, and in the Review + create tab, select Review + create.
6. When the task is deployed, the "Your deployment is complete" page appears; select Go to resource to open the Overview page of the storage task.

Enable the task assignment:

In the Trigger details section, there is an Enable task assignment checkbox, which is checked by default. If the Enable task assignments checkbox is unchecked, you can still enable assignments manually from the Assignments page. To do this, go to Assignments, select the relevant assignment, and then click Enable. The task assignment is queued and will run at the specified time.

Monitoring the runs:

After the task completes running, you can view the results of the run. With the Assignments page still open, select View task runs. Select the View report link to download a report. You can also view these comma-separated reports in the container that you specified when you configured the assignment.

Conclusion

The Exclude Prefix feature in Azure Storage Actions provides enhanced control and flexibility when managing blob data at scale. By allowing you to exclude specific prefixes from actions like delete or tier changes, it helps you safeguard critical data, reduce mistakes, and fine-tune automation workflows. This targeted approach not only improves operational efficiency but also supports more granular data management in Azure Blob Storage.

Note: Azure Storage Actions are generally available in the following public regions: https://learn.microsoft.com/en-us/azure/storage-actions/overview#supported-regions

We can also exclude certain blobs using the "Not" operator when building task conditions.
Blobs may be excluded based on specific blob or container attributes from the task conditions side as well, not just through task assignments. In the screenshot below, we are using the Not operator (!) to exclude blobs where the blob name is equal to "Test". Please refer to: https://learn.microsoft.com/en-us/azure/storage-actions/storage-tasks/storage-task-conditions#multiple-clauses-in-a-condition

Reference links:
- About Azure Storage Actions - Azure Storage Actions | Microsoft Learn
- Storage task best practices - Azure Storage Actions | Microsoft Learn
- Known issues and limitations with storage tasks - Azure Storage Actions | Microsoft Learn
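To summarize the selection logic described in this article, here is a small offline sketch of the behavior (an illustration with made-up sample data, not the Storage Actions engine): keep Hot-tier blobs as deletion candidates, then drop anything under an excluded prefix.

```shell
# Sample blob listing: name,accessTier (made-up data).
cat > blobs.txt <<'EOF'
logs/app.log,Hot
excludefiles/keep.txt,Hot
data/old.parquet,Cool
EOF

# Condition: access tier equals Hot; exclusion: prefix "excludefiles/".
# Only logs/app.log,Hot survives both filters.
awk -F, '$2 == "Hot"' blobs.txt | grep -v '^excludefiles/'
```

The same two-stage idea applies in the portal: the task condition selects candidates, and the Exclude blob prefixes filter on the assignment removes protected paths before the operation runs.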