Azure Storage blob inventory
Azure Storage blob inventory provides a list of the containers, blobs, blob versions, and snapshots in your storage account, along with their associated properties. It generates an output report in either comma-separated values (CSV) or Apache Parquet format on a daily or weekly basis. You can use the report to audit retention, legal hold or encryption status of your storage account contents, or you can use it to understand the total data size, age, tier distribution, or other attributes of your data. You can also use blob inventory to simplify your business workflows or speed up data processing jobs, by using blob inventory as a scheduled automation of the List Containers and List Blobs APIs. Blob inventory rules allow you to filter the contents of the report by blob type, prefix, or by selecting the blob properties to include in the report.
Azure Storage blob inventory is available for the following types of storage accounts:
- Standard general-purpose v2
- Premium block blob storage
- Blob storage
Inventory features
The following list describes features and capabilities that are available in the current release of Azure Storage blob inventory.
Inventory reports for blobs and containers
You can generate inventory reports for blobs and containers. A report for blobs can contain base blobs, snapshots, and blob versions, along with their associated properties such as content length, creation time, and last modified time. Empty containers aren't listed in the blob inventory report. A report for containers describes containers and their associated properties, such as immutability policy status and legal hold status.
Custom Schema
You can choose which fields appear in reports. Choose from a list of supported fields. That list appears later in this article.
CSV and Apache Parquet output format
You can generate an inventory report in either CSV or Apache Parquet output format.
Manifest file and Azure Event Grid event per inventory report
A manifest file and an Azure Event Grid event are generated per inventory report. These are described later in this article.
Enabling inventory reports
Enable blob inventory reports by adding a policy with one or more rules to your storage account. For guidance, see Enable Azure Storage blob inventory reports.
Upgrading an inventory policy
If you're an existing Azure Storage blob inventory user who has configured inventory prior to June 2021, you can start using the new features by loading the policy, and then saving the policy back after making changes. When you reload the policy, the new fields in the policy will be populated with default values. You can change these values if you want. Also, the following two features will be available.
- A destination container is now supported for every rule instead of just being supported for the policy.
- A manifest file and Azure Event Grid event are now generated per rule instead of per policy.
Inventory policy
An inventory report is configured by adding an inventory policy with one or more rules. An inventory policy is a collection of rules in a JSON document.
{
"enabled": true,
"rules": [
{
"enabled": true,
"name": "inventoryrule1",
"destination": "inventory-destination-container",
"definition": {. . .}
},
{
"enabled": true,
"name": "inventoryrule2",
"destination": "inventory-destination-container",
"definition": {. . .}
}]
}
View the JSON for an inventory policy by selecting the Code view tab in the Blob inventory section of the Azure portal.
Parameter name | Parameter type | Notes | Required? |
---|---|---|---|
enabled | boolean | Enables or disables the entire policy. When set to true, the rule-level enabled field determines whether each individual rule runs. When set to false, inventory is disabled for all rules. | Yes |
rules | Array of rule objects | At least one rule is required in a policy. Up to 100 rules are supported per policy. | Yes |
Inventory rules
A rule captures the filtering conditions and output parameters for generating an inventory report. Each rule creates an inventory report. Rules can have overlapping prefixes. A blob can appear in more than one inventory depending on rule definitions.
Each rule within the policy has several parameters:
Parameter name | Parameter type | Notes | Required? |
---|---|---|---|
name | string | A rule name can include up to 256 case-sensitive alphanumeric characters. The name must be unique within a policy. | Yes |
enabled | boolean | A flag allowing a rule to be enabled or disabled. The default value is true. | Yes |
definition | JSON inventory rule definition | Each definition is made up of a rule filter set. | Yes |
destination | string | The destination container where all inventory files are generated. The destination container must already exist. |
The global Blob inventory enabled flag takes precedence over the enabled parameter in a rule.
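The constraints in the two tables above can be checked locally before a policy is saved. The following Python sketch is illustrative only: the function name validate_inventory_policy is hypothetical, and it checks only limits stated in this article (1 to 100 rules, unique case-sensitive rule names of up to 256 characters, and the parameters marked as required). It does not check the character set of rule names, and the service may enforce additional rules.

```python
def validate_inventory_policy(policy: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems: list[str] = []
    if not isinstance(policy.get("enabled"), bool):
        problems.append("policy: 'enabled' must be a boolean")
    rules = policy.get("rules")
    # The article states that a policy needs at least 1 and at most 100 rules.
    if not isinstance(rules, list) or not 1 <= len(rules) <= 100:
        problems.append("policy: 'rules' must be a list of 1 to 100 rules")
        return problems
    seen_names: set[str] = set()
    for index, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {index}: must be a JSON object")
            continue
        name = rule.get("name")
        if not isinstance(name, str) or not 1 <= len(name) <= 256:
            problems.append(f"rule {index}: 'name' must be a string of 1 to 256 characters")
        elif name in seen_names:  # comparison is case-sensitive
            problems.append(f"rule {index}: name {name!r} is not unique")
        else:
            seen_names.add(name)
        if not isinstance(rule.get("enabled"), bool):
            problems.append(f"rule {index}: 'enabled' must be a boolean")
        if not isinstance(rule.get("definition"), dict):
            problems.append(f"rule {index}: 'definition' must be a JSON object")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in one pass.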
Rule definition
Parameter name | Parameter type | Notes | Required |
---|---|---|---|
filters | json | Filters decide whether a blob or container is part of inventory or not. | Yes |
format | string | Determines the output of the inventory file. Valid values are csv (for CSV format) and parquet (for Apache Parquet format). | Yes |
objectType | string | Denotes whether this is an inventory rule for blobs or containers. Valid values are blob and container. | Yes |
schedule | string | Schedule on which to run this rule. Valid values are daily and weekly. | Yes |
schemaFields | Json array | List of Schema fields to be part of inventory. | Yes |
Rule filters
Several filters are available for customizing a blob inventory report:
Filter name | Filter type | Notes | Required? |
---|---|---|---|
blobTypes | Array of predefined enum values | Valid values are blockBlob and appendBlob for hierarchical namespace enabled accounts, and blockBlob, appendBlob, and pageBlob for other accounts. This field isn't applicable for inventory on a container (objectType: container). | Yes |
creationTime | Number | Specifies the number of days ago within which the blob must have been created. For example, a value of 3 includes in the report only those blobs that were created in the last three days. | No |
prefixMatch | Array of up to 10 strings for prefixes to be matched. | If you don't define prefixMatch or provide an empty prefix, the rule applies to all blobs within the storage account. A prefix must be a container name prefix or a container name. For example, container, container1/foo. | No |
excludePrefix | Array of up to 10 strings for prefixes to be excluded. | Specifies the blob paths to exclude from the inventory report. An excludePrefix must be a container name prefix or a container name. An empty excludePrefix means that all blobs with names matching any prefixMatch string are listed. To include a certain prefix but exclude a specific subset of it, use the excludePrefix filter. For example, to include all blobs under container-a except those under the folder container-a/folder, set prefixMatch to container-a and excludePrefix to container-a/folder. | No |
includeSnapshots | boolean | Specifies whether the inventory should include snapshots. Default is false. This field isn't applicable for inventory on a container (objectType: container). | No |
includeBlobVersions | boolean | Specifies whether the inventory should include blob versions. Default is false. This field isn't applicable for inventory on a container (objectType: container). | No |
includeDeleted | boolean | Specifies whether the inventory should include deleted blobs. Default is false. In accounts that have a hierarchical namespace, this filter includes folders and also includes blobs that are in a soft-deleted state. Only the folders and files (blobs) that are explicitly deleted appear in reports. Child folders and files that are deleted as a result of deleting a parent folder aren't included in the report. | No |
View the JSON for inventory rules by selecting the Code view tab in the Blob inventory section of the Azure portal. Filters are specified within a rule definition.
{
"destination": "inventory-destination-container",
"enabled": true,
"rules": [
{
"definition": {
"filters": {
"blobTypes": ["blockBlob", "appendBlob", "pageBlob"],
"prefixMatch": ["inventorytestcontainer1", "inventorytestcontainer2/abcd", "etc"],
"excludePrefix": ["inventorytestcontainer10", "etc/logs"],
"includeSnapshots": false,
"includeBlobVersions": true
},
"format": "csv",
"objectType": "blob",
"schedule": "daily",
"schemaFields": ["Name", "Creation-Time"]
},
"enabled": true,
"name": "blobinventorytest",
"destination": "inventorydestinationContainer"
},
{
"definition": {
"filters": {
"prefixMatch": ["inventorytestcontainer1", "inventorytestcontainer2/abcd", "etc"]
},
"format": "csv",
"objectType": "container",
"schedule": "weekly",
"schemaFields": ["Name", "HasImmutabilityPolicy", "HasLegalHold"]
},
"enabled": true,
"name": "containerinventorytest",
"destination": "inventorydestinationContainer"
}
]
}
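To make the interaction between prefixMatch and excludePrefix concrete, the following Python sketch evaluates both filters against a path of the form container/blob-name. It is a simplified model, not the service's implementation: the function name matches_prefix_filters is hypothetical, and plain string-prefix comparison is an assumption based on the filter descriptions in the table above.

```python
def matches_prefix_filters(path: str, prefix_match=(), exclude_prefix=()) -> bool:
    """Return True if `path` ("container/blob-name") passes both prefix filters.

    Assumption: a prefix matches when the path starts with it as a plain string.
    """
    # An undefined or empty prefixMatch applies the rule to every blob.
    included = not prefix_match or any(path.startswith(p) for p in prefix_match)
    # An empty excludePrefix excludes nothing, so empty strings are skipped.
    excluded = any(path.startswith(p) for p in exclude_prefix if p)
    return included and not excluded


# The example from the excludePrefix description: everything under container-a
# except the folder container-a/folder.
print(matches_prefix_filters("container-a/report.csv", ["container-a"], ["container-a/folder"]))
print(matches_prefix_filters("container-a/folder/report.csv", ["container-a"], ["container-a/folder"]))
```

Under this model the first call is included and the second is excluded. Note that a plain prefix such as container-a/folder would also exclude a sibling named container-a/folder2, so a trailing slash may be worth considering if only one folder should be excluded.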
Custom schema fields supported for blob inventory
Note
The Data Lake Storage column shows support in accounts that have the hierarchical namespace feature enabled.
Field | Blob Storage (default support) | Data Lake Storage |
---|---|---|
Name (Required) | ||
Creation-Time | ||
Last-Modified | ||
LastAccessTime [1] | ||
ETag | ||
Content-Length | ||
Content-Type | ||
Content-Encoding | ||
Content-Language | ||
Content-CRC64 | ||
Content-MD5 | ||
Cache-Control | ||
Cache-Disposition | ||
BlobType | ||
AccessTier | ||
AccessTierChangeTime | ||
LeaseStatus | ||
LeaseState | ||
ServerEncrypted | ||
CustomerProvidedKeySHA256 | ||
Metadata | ||
Expiry-Time | ||
hdi_isfolder | ||
Owner | ||
Group | ||
Permissions | ||
Acl | ||
Snapshot (Available and required when you choose to include snapshots in your report) | ||
Deleted | ||
DeletedId | ||
DeletedTime | ||
RemainingRetentionDays | ||
VersionId (Available and required when you choose to include blob versions in your report) | ||
IsCurrentVersion (Available and required when you choose to include blob versions in your report) | ||
TagCount | ||
Tags | ||
CopyId | ||
CopySource | ||
CopyStatus | ||
CopyProgress | ||
CopyCompletionTime | ||
CopyStatusDescription | ||
ImmutabilityPolicyUntilDate | ||
ImmutabilityPolicyMode | ||
LegalHold | ||
RehydratePriority | ||
ArchiveStatus | ||
EncryptionScope | ||
IncrementalCopy | ||
x-ms-blob-sequence-number |
[1] Disabled by default. Optionally enable access time tracking.
Custom schema fields supported for container inventory
Note
The Data Lake Storage column shows support in accounts that have the hierarchical namespace feature enabled.
Field | Blob Storage (default support) | Data Lake Storage |
---|---|---|
Name (Required) | ||
Last-Modified | ||
ETag | ||
LeaseStatus | ||
LeaseState | ||
LeaseDuration | ||
Metadata | ||
PublicAccess | ||
DefaultEncryptionScope | ||
DenyEncryptionScopeOverride | ||
HasImmutabilityPolicy | ||
HasLegalHold | ||
ImmutableStorageWithVersioningEnabled | ||
Deleted (Appears only if include deleted containers is selected) | ||
Version (Appears only if include deleted containers is selected) | ||
DeletedTime (Appears only if include deleted containers is selected) | ||
RemainingRetentionDays (Appears only if include deleted containers is selected) |
Inventory run
If you configure a rule to run daily, then it will be scheduled to run every day. If you configure a rule to run weekly, then it will be scheduled to run each week on Sunday UTC time.
Most inventory runs complete within 24 hours. For hierarchical namespace enabled accounts, a run can take as long as two days and, depending on the number of files being processed, might not complete by the end of those two days. A run that doesn't complete within six days fails.
Runs don't overlap so a run must complete before another run of the same rule can begin. For example, if a rule is scheduled to run daily, but the previous day's run of that same rule is still in progress, then a new run won't be initiated that day. Rules that are scheduled to run weekly will run each Sunday regardless of whether a previous run succeeds or fails. If a run doesn't complete successfully, check subsequent runs to see if they complete before contacting support. The performance of a run can vary, so if a run doesn't complete, it's possible that subsequent runs will.
Inventory policies are read or written in full. Partial updates aren't supported. Inventory rules are evaluated daily. Therefore, if you change the definition of a rule, but the rules of a policy have already been evaluated for that day, then your updates won't be evaluated until the following day.
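The scheduling rules above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the choice to treat a Sunday as its own run date is an assumption, and measuring the six-day limit from the start of the run is an assumption as well, since this article doesn't state the exact time of day at which runs begin.

```python
from datetime import date, datetime, timedelta


def next_weekly_run_date(today: date) -> date:
    """Return the first Sunday on or after `today` (dates interpreted as UTC)."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    days_until_sunday = (6 - today.weekday()) % 7
    return today + timedelta(days=days_until_sunday)


def run_deadline(run_start: datetime) -> datetime:
    """Return the latest time a run can finish, assuming the six-day limit
    is counted from the moment the run starts."""
    return run_start + timedelta(days=6)


print(next_weekly_run_date(date(2021, 5, 26)))  # a Wednesday
print(run_deadline(datetime(2021, 5, 26, 13, 25, 36)))
```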
Inventory completed event
The BlobInventoryPolicyCompleted event is generated when the inventory run completes for a rule. This event also occurs if the inventory run fails with a user error before it starts to run. For example, an invalid policy, or an error that occurs when a destination container isn't present, will trigger the event. The following JSON shows an example BlobInventoryPolicyCompleted event.
{
"topic": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/BlobInventory/providers/Microsoft.EventGrid/topics/BlobInventoryTopic",
"subject": "BlobDataManagement/BlobInventory",
"eventType": "Microsoft.Storage.BlobInventoryPolicyCompleted",
"id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"data": {
"scheduleDateTime": "2021-05-28T03:50:27Z",
"accountName": "testaccount",
"ruleName": "Rule_1",
"policyRunStatus": "Succeeded",
"policyRunStatusMessage": "Inventory run succeeded, refer manifest file for inventory details.",
"policyRunId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
"manifestBlobUrl": "https://testaccount.blob.core.chinacloudapi.cn/inventory-destination-container/2021/05/26/13-25-36/Rule_1/Rule_1-manifest.json"
},
"dataVersion": "1.0",
"metadataVersion": "1",
"eventTime": "2021-05-28T15:03:18Z"
}
The following table describes the schema of the BlobInventoryPolicyCompleted event.
Field | Type | Description |
---|---|---|
scheduleDateTime | string | The time that the inventory rule was scheduled. |
accountName | string | The storage account name. |
ruleName | string | The rule name. |
policyRunStatus | string | The status of the inventory run. Possible values are Succeeded, PartiallySucceeded, and Failed. |
policyRunStatusMessage | string | The status message for the inventory run. |
policyRunId | string | The policy run ID for the inventory run. |
manifestBlobUrl | string | The blob URL for manifest file for inventory run. |
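An event handler typically needs only a few of these fields. The following Python sketch parses a single event object shaped like the example above and extracts the rule name, run status, and manifest URL. The function name summarize_inventory_event and the needs_attention flag are hypothetical, and how events are delivered to your handler (for example, individually or in a batch) is outside the scope of this sketch.

```python
import json

EXPECTED_EVENT_TYPE = "Microsoft.Storage.BlobInventoryPolicyCompleted"


def summarize_inventory_event(event_json: str) -> dict:
    """Extract the fields of interest from one BlobInventoryPolicyCompleted event."""
    event = json.loads(event_json)
    if event.get("eventType") != EXPECTED_EVENT_TYPE:
        raise ValueError("not a BlobInventoryPolicyCompleted event")
    data = event["data"]
    status = data["policyRunStatus"]
    return {
        "rule": data["ruleName"],
        "status": status,
        "message": data.get("policyRunStatusMessage", ""),
        # PartiallySucceeded and Failed both warrant a closer look.
        "needs_attention": status != "Succeeded",
        "manifest_url": data.get("manifestBlobUrl"),
    }
```

Because the event is also raised for user errors such as a missing destination container, checking the status before reading the manifest URL avoids following a link to a manifest that might not exist.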
Inventory output
Each inventory rule generates a set of files in the specified inventory destination container for that rule. The inventory output is generated under the following path:
https://<accountName>.blob.core.chinacloudapi.cn/<inventory-destination-container>/YYYY/MM/DD/HH-MM-SS/<ruleName>
where:
- accountName is your Azure Blob Storage account name.
- inventory-destination-container is the destination container you specified in the inventory rule.
- YYYY/MM/DD/HH-MM-SS is the time when the inventory began to run.
- ruleName is the inventory rule name.
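As a rough illustration of how this path is assembled, the following Python sketch builds the output location from its four components. The function name inventory_output_prefix is hypothetical. The run start time is assumed to be expressed in UTC, which is inferred from the sample manifest later in this article, where the path timestamp matches the UTC inventoryStartTime value.

```python
from datetime import datetime


def inventory_output_prefix(
    account_name: str,
    destination_container: str,
    run_start_utc: datetime,
    rule_name: str,
    endpoint_suffix: str = "blob.core.chinacloudapi.cn",
) -> str:
    """Build the URL prefix under which a rule's inventory output is written.

    `run_start_utc` is assumed to already be in UTC; no conversion is done here.
    """
    timestamp = run_start_utc.strftime("%Y/%m/%d/%H-%M-%S")
    return (
        f"https://{account_name}.{endpoint_suffix}/"
        f"{destination_container}/{timestamp}/{rule_name}"
    )


print(inventory_output_prefix(
    "testaccount",
    "inventory-destination-container",
    datetime(2021, 5, 26, 13, 25, 36),
    "Rule_1",
))
```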
Inventory files
Each inventory run for a rule generates the following files:
Inventory file: An inventory run for a rule generates a CSV or Apache Parquet formatted file. Each such file contains matched objects and their metadata.
Important
Starting in October 2023, inventory runs will produce multiple files if the object count is large. To learn more, see Multiple inventory file output FAQ.
Reports in the Apache Parquet format present dates in the following format: timestamp_millis [number of milliseconds since 1970-01-01 00:00:00 UTC]. For a CSV formatted file, the first row is always the schema row.
[Image: an inventory CSV file opened in Microsoft Excel]
Important
The blob paths that appear in an inventory file might not appear in any particular order.
Checksum file: A checksum file contains the MD5 checksum of the contents of the manifest.json file. The name of the checksum file is <ruleName>-manifest.checksum. Generation of the checksum file marks the completion of an inventory rule run.
Manifest file: A manifest.json file contains the details of the inventory file(s) generated for that rule. The name of the file is <ruleName>-manifest.json. This file also captures the rule definition provided by the user and the path to the inventory for that rule. The following JSON shows the contents of a sample manifest.json file.
{
"destinationContainer": "inventory-destination-container",
"endpoint": "https://testaccount.blob.core.chinacloudapi.cn",
"files": [
{
"blob": "2021/05/26/13-25-36/Rule_1/Rule_1.csv",
"size": 12710092
}
],
"inventoryCompletionTime": "2021-05-26T13:35:56Z",
"inventoryStartTime": "2021-05-26T13:25:36Z",
"ruleDefinition": {
"filters": {
"blobTypes": ["blockBlob"],
"includeBlobVersions": false,
"includeSnapshots": false,
"prefixMatch": ["penner-test-container-100003"]
},
"format": "csv",
"objectType": "blob",
"schedule": "daily",
"schemaFields": ["Name", "Creation-Time", "BlobType", "Content-Length", "LastAccessTime", "Last-Modified", "Metadata", "AccessTier"]
},
"ruleName": "Rule_1",
"status": "Succeeded",
"summary": {
"objectCount": 110000,
"totalObjectSize": 23789775
},
"version": "1.0"
}
This file is created when the run begins. The status field of this file is set to Pending until the run completes. After the run completes, this field is set to a completion status (for example, Succeeded or Failed).
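Because the manifest lists every inventory file that a run produced, a consumer can read it first and then fetch each file. The following Python sketch returns the full URL of each inventory file once the run has succeeded. The function name inventory_file_urls is hypothetical, and treating each blob value as a path relative to the destination container is an assumption inferred from the sample manifest and event shown in this article.

```python
import json


def inventory_file_urls(manifest_json: str) -> list[str]:
    """Return the URL of every inventory file listed in a manifest.

    Raises RuntimeError unless the manifest status is "Succeeded", because the
    manifest is created with status "Pending" when the run begins.
    """
    manifest = json.loads(manifest_json)
    status = manifest.get("status")
    if status != "Succeeded":
        raise RuntimeError(f"inventory output is not ready: status={status!r}")
    # Assumption: each "blob" entry is relative to the destination container.
    base = f"{manifest['endpoint'].rstrip('/')}/{manifest['destinationContainer']}"
    return [f"{base}/{entry['blob']}" for entry in manifest["files"]]
```

Iterating over the whole files array, rather than reading only the first entry, keeps the consumer correct when a run produces multiple inventory files.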
Feature support
Support for this feature might be impacted by enabling Data Lake Storage Gen2, Network File System (NFS) 3.0 protocol, or the SSH File Transfer Protocol (SFTP). If you've enabled any of these capabilities, see Blob Storage feature support in Azure Storage accounts to assess support for this feature.
Known issues and limitations
This section describes limitations and known issues of the Azure Storage blob inventory feature.
Inventory report object count and data size should not be compared to billing
An inventory report does not include metadata, system logs, and properties, so it shouldn't be compared to the billed object count and data size for the storage account.
Inventory jobs take a longer time to complete in certain cases
An inventory job can take a longer amount of time in these cases:
- A large amount of new data is added.
- A rule or set of rules is being run for the first time. The first inventory run might take longer than subsequent inventory runs.
- An inventory run is processing a large amount of data in hierarchical namespace enabled accounts. An inventory job might take more than one day to complete for hierarchical namespace enabled accounts that have hundreds of millions of blobs. Sometimes the inventory job fails and doesn't create an inventory file. If a job doesn't complete successfully, check subsequent jobs to see if they complete before contacting support.
There's no option to generate a report retrospectively for a particular date.
Inventory jobs can't write reports to containers that have an object replication policy
An object replication policy can prevent an inventory job from writing inventory reports to the destination container. Some other scenarios can archive the reports or make them immutable while they're only partially completed, which can cause inventory jobs to fail.
Inventory and Immutable Storage
You can't configure an inventory policy in the account if support for version-level immutability is enabled on that account, or if support for version-level immutability is enabled on the destination container that is defined in the inventory policy.
Reports might exclude soft-deleted blobs in accounts that have a hierarchical namespace
If a container or directory is deleted with soft-delete enabled, then the container or directory and all its contents are marked as soft-deleted. However, only the container or directory (reported as a zero-length blob) appears in an inventory report and not the soft-deleted blobs in that container or directory, even if you set the includeDeleted field of the policy to true. This can lead to a difference between what appears in capacity metrics that you obtain in the Azure portal and what is reported by an inventory report.
Only blobs that are explicitly deleted appear in reports. Therefore, to obtain a complete listing of all soft-deleted blobs (directory and all child blobs), workloads should delete each blob in a directory before deleting the directory itself.