ai_parse_document function

Applies to: Databricks SQL and Databricks Runtime

Important

This feature is in Beta.

The ai_parse_document() function invokes a state-of-the-art generative AI model from Databricks Foundation Model APIs to extract structured content from unstructured documents.

Requirements

Important

The model powering this function is part of the Llama family of models and is made available using Mosaic AI Model Serving Foundation Model APIs. See Applicable model developer licenses and terms for information about which Llama models are available on Databricks and the licenses and policies that govern the use of those models. If models emerge in the future that perform better according to Databricks' internal benchmarks, Databricks may change the models and update the documentation.

  • A workspace in one of the supported regions: eastus, eastus2, westus, centralus, or northcentralus.
  • Mosaic AI Agent Bricks Beta enabled.
  • Databricks Runtime 16.4 LTS or above.
  • If you are using serverless compute, the following is also required:
    • Must be compatible with Databricks Runtime 16.4 or above.
    • The serverless environment version must be set to 2, as this enables features like VARIANT.
    • Must use either Python or SQL. For additional serverless features and limitations, see Serverless compute limitations.
  • The ai_parse_document function is available in Databricks notebooks, the SQL editor, Databricks workflows, jobs, and Lakeflow Declarative Pipelines.
  • See the Beta products pricing page for billing details.

Data security

Your document data is processed within the Databricks security perimeter. Databricks does not store the parameters that are passed into ai_parse_document function calls, but does retain run metadata, such as the Databricks Runtime version used.

Supported input file formats

Your input data files must be stored as blob data in bytes, meaning a binary type column in a DataFrame or Delta table. If the source documents are stored in a Unity Catalog volume, the binary type column can be generated using the Spark binaryFile format reader, as shown in the sketch after the following list.

The following file formats are supported:

  • PDF
  • JPG / JPEG
  • PNG
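For example, the following is a minimal sketch that reads all PDFs under a placeholder Unity Catalog volume path into a BINARY content column using read_files with the binaryFile format; the path and glob pattern are illustrative only:

SQL

-- read_files with format => 'binaryFile' returns path, modificationTime, length, and a BINARY content column
SELECT
  path,
  content  -- BINARY column that can be passed to ai_parse_document
FROM READ_FILES('/Volumes/path/to/source/*.pdf', format => 'binaryFile');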

Syntax

ai_parse_document(content)
ai_parse_document(content, Map("version" -> "1.0"))

Arguments

  • content: A BINARY expression representing the input byte array data.
  • version: Optional. The version of the output schema. Supported values: "1.0".

Returns

The ai_parse_document function extracts contextual layout metadata from the document, such as page_number, header, and footer. It also extracts the content of the document, such as text paragraphs or tables, and represents it in Markdown. The output is of VARIANT type.

Important

The function output schema is versioned using a major.minor format, such as "1.0". Databricks might upgrade the supported or default version to reflect improved representations based on ongoing research.

  • Minor version upgrades are backward-compatible and might only introduce new fields.
  • Major version upgrades might include breaking changes such as field additions, removals, or renamings.
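Because the output carries its schema version in metadata.version (shown in the schema below), a downstream pipeline can record or assert the version it expects. The following is a minimal sketch, assuming a table or common table expression named corpus with a parsed column produced by ai_parse_document; both names are placeholders:

SQL

-- Surface the output schema version alongside each parsed document
SELECT
  path,
  parsed:metadata:version::string AS output_schema_version
FROM corpus;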

The following is the output schema:

{
  "document": {
    "pages": [
      {
        "id": INT,                 // 0-based page index
        "page_number": STRING,     // Extracted page number (NULL if not found)
        "header": STRING,          // Extracted page header (NULL if not found)
        "footer": STRING,          // Extracted page footer (NULL if not found)
        "content": STRING          // Text content (markdown) of the entire page
      }
    ],
    "elements": [
      {
        "id": INT,                 // 0-based element index
        "type": STRING,            // Supported: text, table, figure
        "content": STRING,         // Text content (markdown) of the target element
        "page_id": INT             // 0-based page index where the element appears
      }
    ]
  },
  "corrupted_data": [
    {
      "malformed_response": STRING  // The response in malformed json format
      "page_id": INT                // 0-based page index
    }
  ],
  "error_status": [
    {
      "error_message": STRING       // The detailed error message
      "page_id": INT                // 0-based page index
    }
  ],
  "metadata": {
    "version": STRING,              // The version of the output schema
    "backend_id": STRING            // The backend id where the document is parsed
  }
}
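Because document.pages and document.elements are VARIANT arrays, they can be flattened into one row per page or element. The following is a minimal sketch using the variant_explode table-valued function; the volume path is a placeholder, and the same pattern applies to document.elements:

SQL

WITH corpus AS (
  SELECT
    path,
    ai_parse_document(content) AS parsed
  FROM
    READ_FILES('/Volumes/path/to/source/file.pdf', format => 'binaryFile')
)
SELECT
  corpus.path,
  page.pos                    AS page_index,    -- 0-based page index
  page.value:content::string  AS page_markdown  -- Markdown content of the page
FROM corpus,
  LATERAL variant_explode(corpus.parsed:document:pages) AS page;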

Examples

The following example uses ai_parse_document to extract document layouts as VARIANT output.

SQL

SELECT
  path,
  ai_parse_document(content)
FROM READ_FILES('/Volumes/path/to/source/file.pdf', format => 'binaryFile');

Python

from pyspark.sql.functions import *


df = spark.read.format("binaryFile") \
  .load("/Volumes/path/to/source/file.pdf") \
  .withColumn(
    "parsed",
    ai_parse_document("content"))
display(df)

Scala

import org.apache.spark.sql.functions._


val df = spark.read.format("binaryFile")
  .load("/Volumes/path/to/source/file.pdf")
  .withColumn(
    "parsed",
    ai_parse_document($"content"))
display(df)

The following example uses ai_parse_document to separate each top-level field of the output, for example document.pages, document.elements, corrupted_data, error_status, and metadata, into individual columns.

SQL

WITH corpus AS (
  SELECT
    path,
    ai_parse_document(content) AS parsed
  FROM
    READ_FILES('/Volumes/path/to/source/file.pdf', format => 'binaryFile')
)
SELECT
  path,
  parsed:document:pages,
  parsed:document:elements,
  parsed:corrupted_data,
  parsed:error_status,
  parsed:metadata
FROM corpus;

Python

from pyspark.sql.functions import *


df = spark.read.format("binaryFile") \
 .load("/Volumes/path/to/source/file.pdf") \
 .withColumn(
   "parsed",
   ai_parse_document("content")) \
 .withColumn(
   "parsed_json",
   parse_json(col("parsed").cast("string"))) \
 .selectExpr(
   "path",
   "parsed_json:document:pages",
   "parsed_json:document:elements",
   "parsed_json:corrupted_data",
   "parsed_json:error_status",
   "parsed_json:metadata")
display(df)

Scala


import com.databricks.sql.catalyst.unstructured.DocumentParseResultV1_0
import org.apache.spark.sql.functions._


val df = spark.read.format("binaryFile")
 .load("/Volumes/path/to/source/file.pdf")
 .withColumn(
   "parsed",
   ai_parse_document($"content").cast(DocumentParseResultV1_0.SCHEMA))
 .select(
   $"path",
   $"parsed.*")
display(df)

Limitations

  • While Databricks is continuously working to improve all of its features, LLMs are an emerging technology and may produce errors.
  • The ai_parse_document function can take time to extract document content while preserving structural information, especially for documents that contain highly dense content or content with poor resolution. In some cases, the function might take a while to run or might ignore some content. Databricks is continuously working to improve latency. Per-page failures are surfaced in the corrupted_data and error_status fields of the output; see the sketch after this list.
  • See Supported input file formats. Databricks welcomes feedback on which additional formats are most important for your organization.
  • Customizing the model that powers ai_parse_document or using a customer-provided model for ai_parse_document is not supported.
  • The underlying model might not perform optimally when handling images that contain text in non-Latin alphabets, such as Japanese or Korean.
  • Documents with digital signatures may not be processed accurately.
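To find pages that did not parse cleanly, the error_status and corrupted_data arrays in the output can be flattened in the same way as document.pages. The following is a minimal sketch, assuming a corpus CTE like the one in the examples above; the volume path is a placeholder:

SQL

WITH corpus AS (
  SELECT
    path,
    ai_parse_document(content) AS parsed
  FROM
    READ_FILES('/Volumes/path/to/source/file.pdf', format => 'binaryFile')
)
SELECT
  corpus.path,
  err.value:page_id::int           AS page_id,       -- 0-based page index
  err.value:error_message::string  AS error_message  -- detailed error message
FROM corpus,
  LATERAL variant_explode(corpus.parsed:error_status) AS err;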