Indexing blobs and files to produce multiple search documents

Applies to: Blob indexers, File indexers

By default, an indexer treats the contents of a blob or file as a single search document. If you want a more granular representation in a search index, you can set the parsingMode parameter to create multiple search documents from one blob or file. The parsingMode values that produce multiple search documents are delimitedText (for CSV) and jsonArray or jsonLines (for JSON).
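
As a minimal sketch, the parsing mode is set in the indexer definition's configuration parameters. The indexer, data source, and index names below are placeholders:

{
    "name": "my-blob-indexer",
    "dataSourceName": "my-blob-datasource",
    "targetIndexName": "my-index",
    "parameters": {
        "configuration": { "parsingMode": "jsonLines" }
    }
}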

When you use any of these parsing modes, the new search documents that emerge must have unique document keys, and a problem arises in determining where that value comes from. The parent blob has at least one unique value in the form of the metadata_storage_path property, but if it contributes that value to more than one search document, the key is no longer unique in the index.

To address this problem, the blob indexer generates an AzureSearch_DocumentKey that uniquely identifies each child search document created from the single blob parent. This article explains how this feature works.

One-to-many document key

A document key uniquely identifies each document in an index. When no parsing mode is specified and there's no explicit field mapping for the search document key in the indexer definition, the blob indexer automatically maps the metadata_storage_path property to the document key. This default mapping ensures that each blob appears as a distinct search document, and it eliminates the need for you to create this field mapping manually. Normally, only fields with identical names and types are mapped automatically.
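
This default behavior is roughly equivalent to declaring the following field mapping yourself. The key field name id is an assumption for this sketch; base64 encoding is applied so that the path forms a valid document key.

{
    "sourceFieldName" : "metadata_storage_path",
    "targetFieldName" : "id",
    "mappingFunction" : { "name" : "base64Encode" }
}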

In a one-to-many search document scenario, an implicit document key based on the metadata_storage_path property isn't possible. As a workaround, Azure AI Search can generate a document key for each individual entity extracted from a blob. The system generates a key called AzureSearch_DocumentKey and adds it to each search document. The indexer keeps track of the "many documents" created from each blob and can target updates to the search index when the source data changes over time.

By default, when no explicit field mapping for the key index field is specified, AzureSearch_DocumentKey is mapped to it by using the base64Encode field-mapping function.

Example

Assume an index definition with the following fields (a sketch of the full definition follows the list):

  • id
  • temperature
  • pressure
  • timestamp
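
For concreteness, such an index might be defined as follows. The field types shown are assumptions for this example, with id serving as the key field.

{
    "name": "my-index",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true },
        { "name": "temperature", "type": "Edm.Double" },
        { "name": "pressure", "type": "Edm.Double" },
        { "name": "timestamp", "type": "Edm.DateTimeOffset" }
    ]
}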

And your blob container has blobs with the following structure:

Blob1.json

{ "temperature": 100, "pressure": 100, "timestamp": "2024-02-13T00:00:00Z" }
{ "temperature" : 33, "pressure" : 30, "timestamp": "2024-02-14T00:00:00Z" }

Blob2.json

{ "temperature": 1, "pressure": 1, "timestamp": "2023-01-12T00:00:00Z" }
{ "temperature" : 120, "pressure" : 3, "timestamp": "2022-05-11T00:00:00Z" }

When you create an indexer with parsingMode set to jsonLines, and you don't specify any explicit field mappings for the key field, the following mapping is applied implicitly:

{
    "sourceFieldName" : "AzureSearch_DocumentKey",
    "targetFieldName": "id",
    "mappingFunction": { "name" : "base64Encode" }
}

This setup results in disambiguated document keys, as shown in the following table (base64-encoded IDs shortened for brevity). Each generated key encodes the parent blob's metadata_storage_path plus the position of the entity within the blob; in this example, the first key decodes to the path of Blob1.json followed by ;1. Treat this format as an implementation detail (see the note later in this article).

id                     temperature  pressure  timestamp
aHR0 ... YjEuanNvbjsx  100          100       2024-02-13T00:00:00Z
aHR0 ... YjEuanNvbjsy  33           30        2024-02-14T00:00:00Z
aHR0 ... YjIuanNvbjsx  1            1         2023-01-12T00:00:00Z
aHR0 ... YjIuanNvbjsy  120          3         2022-05-11T00:00:00Z

Custom field mapping for index key field

Assuming the same index definition as the previous example, suppose your blob container has blobs with the following structure:

Blob1.csv

recordid, temperature, pressure, timestamp
1, 100, 100,"2024-02-13T00:00:00Z" 
2, 33, 30,"2024-02-14T00:00:00Z" 

Blob2.csv

recordid, temperature, pressure, timestamp
1, 1, 1,"20123-01-12T00:00:00Z" 
2, 120, 3,"2022-05-11T00:00:00Z" 
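
As a sketch, the indexer configuration for this scenario sets delimitedText as the parsing mode, with firstLineContainsHeaders indicating that the first line of each blob is the header row. The indexer, data source, and index names are placeholders.

{
    "name": "my-csv-indexer",
    "dataSourceName": "my-blob-datasource",
    "targetIndexName": "my-index",
    "parameters": {
        "configuration": {
            "parsingMode": "delimitedText",
            "firstLineContainsHeaders": true
        }
    }
}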

When you create an indexer with the delimitedText parsingMode, it might feel natural to set up a field mapping to the key field as follows:

{
    "sourceFieldName" : "recordid",
    "targetFieldName": "id"
}

However, this mapping doesn't result in four documents showing up in the index, because the recordid field isn't unique across blobs. For this reason, we recommend relying on the implicit field mapping from the AzureSearch_DocumentKey property to the key index field for one-to-many parsing modes.

If you do want to set up an explicit field mapping, make sure that the source field's values are distinct for each individual entity across all blobs, as in the following sketch.
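
For example, if the source data carried a record identifier that's globally unique across blobs (the values below are hypothetical), explicitly mapping it to the key field would be safe:

Blob1.csv

recordid, temperature, pressure, timestamp
blob1-1, 100, 100,"2024-02-13T00:00:00Z"
blob1-2, 33, 30,"2024-02-14T00:00:00Z"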

Note

The approach that AzureSearch_DocumentKey uses to ensure uniqueness per extracted entity is subject to change. Don't rely on its value for your application's needs.

Specify the index key field in your data

Assume the same index definition as in the previous examples, with parsingMode set to jsonLines and no explicit field mappings, so the implicit mapping from the first example applies. Now suppose your blob container has blobs with the following structure:

Blob1.json

{ "id": "1", "temperature": 100, "pressure": 100, "timestamp": "2024-02-13T00:00:00Z" }
{ "id": "2", "temperature": 33, "pressure": 30, "timestamp": "2024-02-14T00:00:00Z" }

Blob2.json

{ "id": "1", "temperature": 1, "pressure": 1, "timestamp": "2023-01-12T00:00:00Z" }
{ "id": "2", "temperature": 120, "pressure": 3, "timestamp": "2022-05-11T00:00:00Z" }

Each entry contains an id field, which is defined as the key field in the index. In this situation, the system still generates a unique AzureSearch_DocumentKey for each document, but it isn't used as the key. Instead, the value of the id field is mapped to the key field.
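
In effect, the automatic same-name mapping described earlier takes precedence, behaving like the following explicit mapping. No encoding function appears in this sketch because the id values are already valid document keys; that detail is an assumption for illustration.

{
    "sourceFieldName" : "id",
    "targetFieldName" : "id"
}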

As in the previous example, this mapping doesn't result in four documents in the index, because the id field isn't unique across blobs. When this situation occurs, any JSON entry that specifies an existing id is merged into the existing document instead of being uploaded as a new one. The index then reflects the latest state of the entry with the specified id.
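
For example, if Blob2.json happens to be processed after Blob1.json, the index ends up with just two documents whose values come from Blob2.json. The processing order here is an assumption for illustration.

id  temperature  pressure  timestamp
1   1            1         2023-01-12T00:00:00Z
2   120          3         2022-05-11T00:00:00Z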

Limitations

When a document entry in the index is created from a line in a file, as explained in this article, deleting that line from the file doesn't automatically remove the corresponding entry from the index. To delete the document entry, you must submit a deletion request to the index by using the REST API.
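
As a sketch, a deletion request posts a delete action with the document's key value to the index's docs collection. The service name, index name, API key, and document key below are placeholders.

POST https://[service-name].search.windows.net/indexes/my-index/docs/index?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
    "value": [
        {
            "@search.action": "delete",
            "id": "[document-key]"
        }
    ]
}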

Next steps

If you aren't already familiar with the basic structure and workflow of blob indexing, you should review Indexing Azure Blob Storage with Azure AI Search first. For more information about parsing modes for different blob content types, review the following articles.