Build and trace retriever tools for unstructured data

2025-06-11

Use the Mosaic AI Agent Framework to build tools that let AI agents query unstructured data, such as a collection of documents. This page shows how to:

Develop a retriever locally
Create a retriever using Unity Catalog functions
Query external vector indexes
Add MLflow tracing for observability

To learn more about agent tools, see AI agent tools.

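Locally develop Vector Search retriever tools with AI Bridge

The fastest way to start building a Databricks Vector Search retriever tool is to develop and test it locally using Databricks AI Bridge packages such as databricks-langchain and databricks-openai.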

LangChain/LangGraph

Install the latest version of databricks-langchain, which includes Databricks AI Bridge.

%pip install --upgrade databricks-langchain

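The following code prototypes a retriever tool that queries a hypothetical vector search index and binds it to an LLM locally, so you can test its tool-calling behavior.

Provide a descriptive tool_description to help the agent understand the tool and decide when to invoke it.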

from databricks_langchain import VectorSearchRetrieverTool, ChatDatabricks

# Initialize the retriever tool.
vs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.my_databricks_docs_index",
  tool_name="databricks_docs_retriever",
  tool_description="Retrieves information about Databricks products from official Databricks documentation."
)

# Run a query against the vector search index locally for testing
vs_tool.invoke("Databricks Agent Framework?")

# Bind the retriever tool to your Langchain LLM of choice
llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet")
llm_with_tools = llm.bind_tools([vs_tool])

# Chat with your LLM to test the tool calling functionality
llm_with_tools.invoke("Based on the Databricks documentation, what is Databricks Agent Framework?")
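
The description matters because the LLM sees only the tool's name, description, and parameter schema when it decides whether to call the tool. As a rough illustration, a retriever tool expressed in the OpenAI function-calling format might look like the following. The `build_tool_spec` helper and the single `query` parameter are assumptions made for this sketch; the specification that VectorSearchRetrieverTool actually generates may differ.

```python
import json


def build_tool_spec(name: str, description: str) -> dict:
    """Hypothetical helper: build a function-calling tool definition."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query to run against the index.",
                    }
                },
                "required": ["query"],
            },
        },
    }


spec = build_tool_spec(
    "databricks_docs_retriever",
    "Retrieves information about Databricks products from official Databricks documentation.",
)
print(json.dumps(spec, indent=2))
```

A vague description such as "Purpose of the tool" gives the model little to go on, which is why a specific description tends to produce more reliable tool calls.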

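For scenarios that use direct-access indexes or Delta Sync indexes with self-managed embeddings, you must configure the VectorSearchRetrieverTool and specify a custom embedding model and text column. See the options for providing embeddings.

The following example shows how to configure a VectorSearchRetrieverTool with the columns and embedding keys.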

from databricks_langchain import VectorSearchRetrieverTool
from databricks_langchain import DatabricksEmbeddings

embedding_model = DatabricksEmbeddings(
    endpoint="databricks-bge-large-en",
)

vs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.index_name", # Index name in the format 'catalog.schema.index'
  num_results=5, # Max number of documents to return
  columns=["primary_key", "text_column"], # List of columns to include in the search
  filters={"text_column LIKE": "Databricks"}, # Filters to apply to the query
  query_type="ANN", # Query type ("ANN" or "HYBRID").
  tool_name="name of the tool", # Used by the LLM to understand the purpose of the tool
  tool_description="Purpose of the tool", # Used by the LLM to understand the purpose of the tool
  text_column="text_column", # Specify text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
  embedding=embedding_model # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)

For details, see the API documentation for VectorSearchRetrieverTool.

OpenAI

Install the latest version of databricks-openai, which includes Databricks AI Bridge.

%pip install --upgrade databricks-openai

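The following code prototypes a retriever that queries a hypothetical vector search index and integrates it with OpenAI's GPT models.

Provide a descriptive tool_description to help the agent understand the tool and decide when to invoke it.

For more information about OpenAI's recommendations for tools, see the OpenAI function calling documentation.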

from databricks_openai import VectorSearchRetrieverTool
from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key="<your_API_key>")  # Replace the placeholder with your OpenAI API key

# Initialize the retriever tool
dbvs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.my_databricks_docs_index",
  tool_name="databricks_docs_retriever",
  tool_description="Retrieves information about Databricks products from official Databricks documentation"
)

messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {
    "role": "user",
    "content": "Using the Databricks documentation, answer what is Spark?"
  }
]
first_response = client.chat.completions.create(
  model="gpt-4o",
  messages=messages,
  tools=[dbvs_tool.tool]
)

# Parse the model's response, then execute the tool call it requested.
tool_call = first_response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = dbvs_tool.execute(query=args["query"])  # For self-managed embeddings, optionally pass in openai_client=client

# Supply model with results – so it can incorporate them into its final response.
messages.append(first_response.choices[0].message)
messages.append({
  "role": "tool",
  "tool_call_id": tool_call.id,
  "content": json.dumps(result)
})
second_response = client.chat.completions.create(
  model="gpt-4o",
  messages=messages,
  tools=[dbvs_tool.tool]
)
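
The example above assumes that the model returns exactly one tool call. In practice, a response can contain no tool calls or several. The following sketch shows one way to handle those cases. `run_tool_calls` is a hypothetical helper, and `execute` stands in for a callable such as `dbvs_tool.execute`; the objects built with `SimpleNamespace` only mimic the shape of a chat completion message so the sketch runs without an API key.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


def run_tool_calls(message: Any, execute: Callable[..., Any]) -> List[Dict[str, str]]:
    """Run every tool call on an assistant message and build tool-role messages."""
    tool_messages = []
    # message.tool_calls is None when the model answers without calling a tool.
    for tool_call in message.tool_calls or []:
        args = json.loads(tool_call.function.arguments)
        result = execute(query=args["query"])
        tool_messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            }
        )
    return tool_messages


# Stand-in for first_response.choices[0].message, used only to exercise the helper.
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            id="call_1",
            function=SimpleNamespace(arguments='{"query": "What is Spark?"}'),
        )
    ]
)


def fake_execute(query: str) -> List[Dict[str, str]]:
    # Stand-in for dbvs_tool.execute; returns a canned result.
    return [{"page_content": f"Result for: {query}"}]


tool_messages = run_tool_calls(fake_message, fake_execute)
no_tool_messages = run_tool_calls(SimpleNamespace(tool_calls=None), fake_execute)
```

With a real response, you would extend `messages` with the returned tool messages before requesting the final completion.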

The following example shows how to configure a VectorSearchRetrieverTool with the columns and embedding keys.

from databricks_openai import VectorSearchRetrieverTool

vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.index_name", # Index name in the format 'catalog.schema.index'
    num_results=5, # Max number of documents to return
    columns=["primary_key", "text_column"], # List of columns to include in the search
    filters={"text_column LIKE": "Databricks"}, # Filters to apply to the query
    query_type="ANN", # Query type ("ANN" or "HYBRID").
    tool_name="name of the tool", # Used by the LLM to understand the purpose of the tool
    tool_description="Purpose of the tool", # Used by the LLM to understand the purpose of the tool
    text_column="text_column", # Specify text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
    embedding_model_name="databricks-bge-large-en" # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)

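For details, see the API documentation for VectorSearchRetrieverTool.

Once your local tool is ready, you can productionize it directly as part of your agent code or migrate it to a Unity Catalog function. A Unity Catalog function provides better discoverability and governance but has certain limitations.

The following section shows how to migrate the retriever to a Unity Catalog function.

Vector Search retriever tool with Unity Catalog functions

You can create a Unity Catalog function that wraps a Mosaic AI Vector Search index query. This approach: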

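Supports production use cases with governance and discoverability
Uses the vector_search() SQL function under the hood
Supports automatic MLflow tracing
- You must align the function's output with the MLflow retriever schema by using the page_content and metadata aliases.
- You must add any additional metadata columns to the metadata column by using the SQL map function, rather than returning them as top-level output keys.

Run the following code in a notebook or SQL editor to create the function.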

CREATE OR REPLACE FUNCTION main.default.databricks_docs_vector_search (
  -- The agent uses this comment to determine how to generate the query string parameter.
  query STRING
  COMMENT 'The query string for searching Databricks documentation.'
) RETURNS TABLE
-- The agent uses this comment to determine when to call this tool. It describes the types of documents and information contained within the index.
COMMENT 'Executes a search on Databricks documentation to retrieve text documents most relevant to the input query.' RETURN
SELECT
  chunked_text as page_content,
  map('doc_uri', url, 'chunk_id', chunk_id) as metadata
FROM
  vector_search(
    -- Specify your Vector Search index name here
    index => 'catalog.schema.databricks_docs_index',
    query => query,
    num_results => 5
  )
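
The SELECT in the function reshapes each search hit into two output columns: page_content holds the text, and metadata holds everything else as a map. The same reshaping can be sketched in Python. `to_retriever_row` is a hypothetical helper, and the sample row is made up for illustration.

```python
from typing import Any, Dict


def to_retriever_row(hit: Dict[str, Any]) -> Dict[str, Any]:
    """Mirror the SQL aliases: text goes to page_content, the rest to metadata."""
    return {
        "page_content": hit["chunked_text"],
        # Extra attributes are nested under metadata, not returned as top-level keys.
        "metadata": {"doc_uri": hit["url"], "chunk_id": hit["chunk_id"]},
    }


sample_hit = {
    "chunked_text": "Unity Catalog provides centralized governance.",
    "url": "https://docs.example.com/unity-catalog",
    "chunk_id": "42",
}
row = to_retriever_row(sample_hit)
print(row)
```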

To use this retriever tool in your AI agent, wrap it with UCFunctionToolkit. The toolkit automatically generates RETRIEVER span types in MLflow logs, which enables automatic tracing through MLflow.

from unitycatalog.ai.langchain.toolkit import UCFunctionToolkit

toolkit = UCFunctionToolkit(
    function_names=[
        "main.default.databricks_docs_vector_search"
    ]
)
tools = toolkit.tools

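Unity Catalog retriever tools have the following caveats:

SQL clients may limit the maximum number of rows or bytes returned. To prevent data truncation, truncate the column values that the UDF returns. For example, use substring(chunked_text, 0, 8192) to reduce the size of large content columns and avoid row truncation during execution.
Because this tool is a wrapper for the vector_search() function, it is subject to the same limitations as vector_search(). See Limitations.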
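
The truncation advice above amounts to capping the size of each text value before it is returned. A minimal Python sketch of the same idea follows. `truncate_text_columns` and its default limit are illustrative only; in a Unity Catalog function you would apply substring in SQL instead.

```python
from typing import Any, Dict, List


def truncate_text_columns(rows: List[Dict[str, Any]], max_chars: int = 8192) -> List[Dict[str, Any]]:
    """Cap every string value at max_chars so large chunks don't inflate rows."""
    return [
        {
            key: value[:max_chars] if isinstance(value, str) else value
            for key, value in row.items()
        }
        for row in rows
    ]


rows = [{"chunk_id": 1, "chunked_text": "x" * 10_000}]
truncated = truncate_text_columns(rows)
print(len(truncated[0]["chunked_text"]))
```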

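For more information about UCFunctionToolkit, see the Unity Catalog documentation.

Retriever that queries a vector index hosted outside of Databricks

If your vector index is hosted outside of Azure Databricks, you can create a Unity Catalog connection to the external service and use that connection in your agent code. See Connect AI agent tools to external services.

The following example creates a retriever that calls a vector index hosted outside of Databricks for a PyFunc-flavored agent.

Create a Unity Catalog connection to the external service, in this case, Azure.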

CREATE CONNECTION ${connection_name}
TYPE HTTP
OPTIONS (
  host 'https://example.search.windows.net',
  base_path '/',
  bearer_token secret ('<secret-scope>','<secret-key>')
);

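Define the retriever tool in your agent code using the Unity Catalog connection. This example uses MLflow decorators to enable agent tracing.

Note

To conform to the MLflow retriever schema, the retriever function should return a List[Document] object and use the metadata field of the Document class to add attributes to the returned document, such as doc_uri and similarity_score. See the MLflow Document documentation.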

import mlflow
import json

from mlflow.entities import Document
from typing import List, Dict, Any
from dataclasses import asdict

class VectorSearchRetriever:
  """
  Class that queries an externally hosted vector index to retrieve relevant documents.
  """

  def __init__(self):
    self.azure_search_index = "hotels_vector_index"

  @mlflow.trace(span_type="RETRIEVER", name="vector_search")
  def __call__(self, query_vector: List[Any], score_threshold=None) -> List[Document]:
    """
    Performs vector search to retrieve relevant chunks.
    Args:
      query_vector: Embedding vector of the search query.
      score_threshold: Score threshold to use for the query.

    Returns:
      List of retrieved Documents.
    """
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import ExternalFunctionRequestHttpMethod

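    # Request body for the external search service. Named payload so it
    # does not shadow the json module imported above.
    payload = {
      "count": True,
      "select": "HotelId, HotelName, Description, Category",
      "vectorQueries": [
        {
          "vector": query_vector,
          "k": 7,
          "fields": "DescriptionVector",
          "kind": "vector",
          "exhaustive": True,
        }
      ],
    }

    # connection_name must hold the name of the Unity Catalog connection created above.
    response = (
      WorkspaceClient()
      .serving_endpoints.http_request(
        conn=connection_name,
        method=ExternalFunctionRequestHttpMethod.POST,
        path=f"indexes/{self.azure_search_index}/docs/search?api-version=2023-07-01-Preview",
        json=payload,
      )
      .text
    )

    # Parse the JSON response body before converting the results to Documents.
    documents = self.convert_vector_search_to_documents(json.loads(response), score_threshold)
    return [asdict(doc) for doc in documents]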

  @mlflow.trace(span_type="PARSER")
  def convert_vector_search_to_documents(
    self, vs_results, score_threshold
  ) -> List[Document]:
    docs = []

    for item in vs_results.get("value", []):
      score = item.get("@search.score", 0)

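      # Keep every result when no threshold is given (score_threshold defaults to None).
      if score_threshold is None or score >= score_threshold: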
        metadata = {
          "score": score,
          "HotelName": item.get("HotelName"),
          "Category": item.get("Category"),
        }

        doc = Document(
          page_content=item.get("Description", ""),
          metadata=metadata,
          id=item.get("HotelId"),
        )
        docs.append(doc)

    return docs

To run the retriever, run the following Python code. Optionally, include Vector Search filters in the request to filter the results.
```
retriever = VectorSearchRetriever()
query = [0.01944167, 0.0040178085 . . .  TRIMMED FOR BREVITY 010858015, -0.017496133]
results = retriever(query, score_threshold=0.1)
```
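
To make the parsing step concrete, the following sketch runs the same filtering logic on a hand-written response body. It uses a local `Doc` dataclass as a stand-in for mlflow.entities.Document so that it runs without MLflow installed. The sample payload is invented and only mimics the `value` and `@search.score` fields that the example reads.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Stand-in for mlflow.entities.Document (an assumption for this sketch)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


def parse_search_response(body: str, score_threshold: Optional[float] = None) -> List[Doc]:
    """Convert a JSON search response into documents, dropping low-scoring hits."""
    docs = []
    for item in json.loads(body).get("value", []):
        score = item.get("@search.score", 0)
        # Keep every hit when no threshold is given.
        if score_threshold is not None and score < score_threshold:
            continue
        docs.append(
            Doc(
                page_content=item.get("Description", ""),
                metadata={"score": score, "HotelName": item.get("HotelName")},
                id=item.get("HotelId"),
            )
        )
    return docs


sample_body = json.dumps(
    {
        "value": [
            {"@search.score": 0.82, "HotelId": "1", "HotelName": "Harbor Inn", "Description": "Waterfront rooms."},
            {"@search.score": 0.05, "HotelId": "2", "HotelName": "City Lodge", "Description": "Downtown stay."},
        ]
    }
)
docs = parse_search_response(sample_body, score_threshold=0.1)
print([doc.id for doc in docs])
```

Testing the parser against a canned payload like this lets you verify threshold handling before you call the external service.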

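Add tracing to a retriever

Add MLflow tracing to monitor and debug your retriever. Tracing lets you view the inputs, outputs, and metadata for each step of execution.

The previous example adds the @mlflow.trace decorator to both the __call__ method and the parsing method. The decorator creates a span that starts when the function is invoked and ends when it returns. MLflow automatically records the function's inputs and outputs, along with any exceptions raised.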
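
Conceptually, a tracing decorator wraps the function, records when it starts and ends, and captures its inputs, outputs, and exceptions. The toy decorator below illustrates that behavior using only the standard library. It is not how MLflow implements @mlflow.trace; the `SPANS` list and the `toy_trace` name are invented for this sketch.

```python
import functools
import time
from typing import Any, Callable, Dict, List

SPANS: List[Dict[str, Any]] = []  # Collected spans, for illustration only.


def toy_trace(span_type: str, name: str = "") -> Callable:
    """Toy decorator that records one span per call."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            span = {
                "name": name or func.__name__,
                "span_type": span_type,
                "inputs": {"args": args, "kwargs": kwargs},
                "start": time.time(),
            }
            try:
                span["outputs"] = func(*args, **kwargs)
                return span["outputs"]
            except Exception as exc:
                span["exception"] = repr(exc)
                raise
            finally:
                # The span ends when the function returns or raises.
                span["end"] = time.time()
                SPANS.append(span)
        return wrapper
    return decorator


@toy_trace(span_type="RETRIEVER", name="vector_search")
def retrieve(query: str) -> List[str]:
    return [f"doc about {query}"]


retrieve("tracing")
print(SPANS[0]["name"], SPANS[0]["outputs"])
```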

Note

LangChain, LlamaIndex, and OpenAI library users can use MLflow autologging in addition to manually defining traces with the decorator. See Instrument your app: tracing approaches.

import mlflow
from mlflow.entities import Document

## This code snippet has been truncated for brevity, see the full retriever example above
class VectorSearchRetriever:
  ...

  # Create a RETRIEVER span. The span name must match the retriever schema name.
  @mlflow.trace(span_type="RETRIEVER", name="vector_search")
  def __call__(...) -> List[Document]:
    ...

  # Create a PARSER span.
  @mlflow.trace(span_type="PARSER")
  def parse_results(...) -> List[Document]:
    ...

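To ensure that downstream applications such as Agent Evaluation and the AI Playground render the retriever trace correctly, make sure the decorator meets the following requirements:

Use span_type="RETRIEVER" (https://mlflow.org/docs/latest/tracing/tracing-schema.html#retriever-spans) and ensure that the function returns a List[Document] object.
The trace name and the retriever_schema name must match for the trace to be configured correctly. See the following section to learn how to set the retriever schema.

Set the retriever schema to ensure MLflow compatibility

If the trace returned from the retriever or from a span_type="RETRIEVER" span does not conform to MLflow's standard retriever schema, you must manually map the returned schema to MLflow's expected fields. This ensures that MLflow can properly trace your retriever and render traces in downstream applications.

To set the retriever schema manually:

Call mlflow.models.set_retriever_schema when you define your agent. Use set_retriever_schema to map the column names in the returned table to MLflow's expected fields, such as primary_key, text_column, and doc_uri.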
```
# Define the retriever's schema by providing your column names
mlflow.models.set_retriever_schema(
  name="vector_search",
  primary_key="chunk_id",
  text_column="text_column",
  doc_uri="doc_uri"
  # other_columns=["column1", "column2"],
)
```
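Specify additional columns in your retriever's schema by providing a list of column names in the other_columns field.
If you have multiple retrievers, you can define multiple schemas by using a unique name for each retriever schema.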
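
One way to think about the schema is as a lookup from MLflow's expected field names to your own column names. The sketch below applies such a mapping to a returned row. It only illustrates the idea: `RETRIEVER_SCHEMA`, `normalize_row`, and the extra `title` column are hypothetical and do not reflect how MLflow stores or applies the schema internally.

```python
from typing import Any, Dict

# Mirrors the set_retriever_schema call above, plus one illustrative extra column.
RETRIEVER_SCHEMA = {
    "primary_key": "chunk_id",
    "text_column": "text_column",
    "doc_uri": "doc_uri",
    "other_columns": ["title"],
}


def normalize_row(row: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Rename a row's columns to the expected field names."""
    normalized = {
        "primary_key": row[schema["primary_key"]],
        "text_column": row[schema["text_column"]],
        "doc_uri": row[schema["doc_uri"]],
    }
    # Additional columns are carried along under their original names.
    for column in schema.get("other_columns", []):
        normalized[column] = row.get(column)
    return normalized


row = {"chunk_id": "c-7", "text_column": "Delta Lake basics", "doc_uri": "docs/delta.md", "title": "Delta"}
normalized = normalize_row(row, RETRIEVER_SCHEMA)
print(normalized)
```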

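The retriever schema set during agent creation affects downstream applications and workflows, such as the review app and evaluation sets. Specifically, the doc_uri column serves as the primary identifier for documents returned by the retriever.

The review app displays the doc_uri to help reviewers assess responses and trace document origins. See Review App UI.
Evaluation sets use doc_uri to compare the retriever's results against predefined evaluation datasets to determine the retriever's recall and precision. See Evaluation sets (MLflow 2).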
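
As an illustration of how doc_uri supports retrieval metrics, the sketch below computes precision and recall by comparing retrieved URIs with the expected ones. These are the textbook definitions; the exact calculation that Agent Evaluation uses is not shown on this page and may differ. The function name and sample URIs are made up.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], expected: Iterable[str]) -> Tuple[float, float]:
    """Compute precision and recall over sets of doc_uri values."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    hits = len(retrieved_set & expected_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    return precision, recall


retrieved_uris = ["docs/a.md", "docs/b.md", "docs/c.md", "docs/d.md"]
expected_uris = ["docs/a.md", "docs/b.md", "docs/e.md", "docs/f.md"]
precision, recall = precision_recall(retrieved_uris, expected_uris)
print(precision, recall)  # 0.5 0.5
```

Because the comparison is keyed on doc_uri, inconsistent or missing URIs in the retriever output would make these metrics meaningless, which is one reason to map the column explicitly.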

Next steps

After you build the retriever, the final step is to integrate it into an AI agent definition. To learn how to add a tool to an agent, see Add Unity Catalog tools to agents.
