
Using LangChain with Azure Database for PostgreSQL

Azure Database for PostgreSQL seamlessly integrates with leading large language model (LLM) orchestration packages such as LangChain, enabling developers to harness the power of advanced AI capabilities within their applications. LangChain can streamline the management and use of LLMs, embedding models, and databases, making it even easier to develop Generative AI applications.

This tutorial shows you how to use the Azure Database for PostgreSQL integrated vector database to store and manage documents in collections with LangChain. It also shows how to create indexes and run approximate nearest neighbor vector search queries with distance metrics such as cosine distance, L2 (Euclidean distance), and inner product (IP) to locate documents close to the query vectors.

Vector Support

Azure Database for PostgreSQL - Flexible Server enables you to efficiently store and query millions of vector embeddings in PostgreSQL and scale your AI use cases from POC (proof of concept) to production:

  • Provides a familiar SQL interface for querying vector embeddings and relational data.
  • Boosts pgvector with faster and more precise similarity search across 100M+ vectors using the DiskANN indexing algorithm.
  • Simplifies operations by integrating relational metadata, vector embeddings, and time-series data into a single database.
  • Uses the power of the robust PostgreSQL ecosystem and the Azure cloud for enterprise-grade features, including replication and high availability.

Authentication

Azure Database for PostgreSQL - Flexible Server supports password-based authentication as well as Microsoft Entra (formerly Azure Active Directory) authentication. Entra authentication allows you to use your Entra identity to authenticate to your PostgreSQL server. It eliminates the need to manage separate usernames and passwords for your database users and lets you use the same security mechanisms that you use for other Azure services.

This notebook is set up to use either authentication method. You can configure whether or not to use Entra authentication later in the notebook.

Setup

Azure Database for PostgreSQL uses LangChain's open-source Postgres support (the langchain_postgres partner package) to connect to Azure Database for PostgreSQL. First, install the required packages:

%pip install -qU langchain_postgres
%pip install -qU langchain-openai
%pip install -qU azure-identity

Enable pgvector on Azure Database for PostgreSQL - Flexible Server

See enablement instructions for Azure Database for PostgreSQL.
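As a rough sketch (not part of the official enablement steps): once the vector extension has been allow-listed on the server through the azure.extensions server parameter, you can enable it in your database with a short psycopg call. The connection values below are placeholders:

import psycopg  # psycopg 3, also used by langchain_postgres

# Placeholder connection details; assumes the "vector" extension has already
# been allow-listed in the server's azure.extensions parameter.
conninfo = "host=<server-name> dbname=<database-name> user=<username> password=<password> sslmode=require"

with psycopg.connect(conninfo) as conn:
    # Enable pgvector in the current database (no-op if already enabled)
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    # The transaction commits automatically when the block exits without error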

Credentials

To run this notebook, you need your Azure Database for PostgreSQL connection details, set as environment variables.

Set the USE_ENTRA_AUTH flag to True if you want to use Microsoft Entra authentication. If using Entra authentication, you only need to supply the host and database name. If using password authentication, you'll also need to set the username and password.

import getpass
import os

USE_ENTRA_AUTH = True

# Supply the connection details for the database
os.environ["DBHOST"] = "<server-name>"
os.environ["DBNAME"] = "<database-name>"
os.environ["SSLMODE"] = "require"

if not USE_ENTRA_AUTH:
    # If using a username and password, supply them here
    os.environ["DBUSER"] = "<username>"
    os.environ["DBPASSWORD"] = getpass.getpass("Database Password:")

Set up Azure OpenAI embeddings

os.environ["AZURE_OPENAI_ENDPOINT"] = "<azure-openai-endpoint>"
os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass("Azure OpenAI API Key:")
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]

from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_deployment="text-embedding-3-small",
)
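As an optional sanity check (not part of the original setup), you can embed a test string and inspect the vector size; text-embedding-3-small returns 1536-dimensional vectors by default:

# Quick check that the embeddings client is configured correctly
sample_vector = embeddings.embed_query("the quick brown fox")
print(len(sample_vector))  # 1536 by default for text-embedding-3-small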

Initialization

Microsoft Entra Authentication

The cell below contains functions that set up LangChain to use Entra authentication. It provides a function, get_token_and_username, that retrieves tokens for the Azure Database for PostgreSQL service using DefaultAzureCredential from the azure.identity library. It ensures the sqlalchemy engine has a valid token with which to create new connections. It also parses the token, which is a JSON Web Token (JWT), to extract the username used to connect to the database.

The create_postgres_engine function creates a sqlalchemy Engine that dynamically sets the username and password based on the token fetched by get_token_and_username. This Engine can be passed into the connection parameter of the PGVector LangChain VectorStore.

Logging into Azure

To sign in to Azure, make sure you have the Azure CLI installed, then run the following command in your terminal:

az login

Once you log in, the following code fetches the token.

import base64
import json
from functools import lru_cache

from azure.identity import DefaultAzureCredential
from sqlalchemy import create_engine, event
from sqlalchemy.engine.url import URL


@lru_cache(maxsize=1)
def get_credential():
    """Memoized function to create the Azure credential, which caches tokens."""
    return DefaultAzureCredential()


def decode_jwt(token):
    """Decode the JWT payload to extract claims."""
    payload = token.split(".")[1]
    padding = "=" * (-len(payload) % 4)  # pad to a multiple of 4 for base64 decoding
    decoded_payload = base64.urlsafe_b64decode(payload + padding)
    return json.loads(decoded_payload)


def get_token_and_username():
    """Fetches a token returns the username and token."""
    # Fetch a new token and extract the username
    token = get_credential().get_token(
        "https://ossrdbms-aad.database.windows.net/.default"
    )
    claims = decode_jwt(token.token)
    username = claims.get("upn")
    if not username:
        raise ValueError("Could not extract username from token. Have you logged in?")

    return username, token.token


def create_postgres_engine():
    db_url = URL.create(
        drivername="postgresql+psycopg",
        username="",  # This will be replaced dynamically
        password="",  # This will be replaced dynamically
        host=os.environ["DBHOST"],
        port=os.environ.get("DBPORT", 5432),
        database=os.environ["DBNAME"],
    )

    # Create a sqlalchemy engine
    engine = create_engine(db_url, echo=True)

    # Listen for the connection event to inject dynamic credentials
    @event.listens_for(engine, "do_connect")
    def provide_dynamic_credentials(dialect, conn_rec, cargs, cparams):
        # Fetch the dynamic username and token
        username, token = get_token_and_username()

        # Override the connection parameters
        cparams["user"] = username
        cparams["password"] = token

    return engine

Password authentication

If you're not using Entra authentication, the get_connection_uri function builds a connection URI from the username and password stored in environment variables.

import urllib.parse


def get_connection_uri():
    # Read URI parameters from the environment
    dbhost = os.environ["DBHOST"]
    dbname = os.environ["DBNAME"]
    dbuser = urllib.parse.quote(os.environ["DBUSER"])
    password = os.environ["DBPASSWORD"]
    sslmode = os.environ["SSLMODE"]

    # Construct connection URI
    # Use psycopg 3!
    db_uri = (
        f"postgresql+psycopg://{dbuser}:{password}@{dbhost}/{dbname}?sslmode={sslmode}"
    )
    return db_uri

Creating the vector store

from langchain_core.documents import Document
from langchain_postgres import PGVector

collection_name = "my_docs"

# The connection is either a sqlalchemy engine or a connection URI
connection = create_postgres_engine() if USE_ENTRA_AUTH else get_connection_uri()

vector_store = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

Manage vector store

Add items to vector store

Adding documents by ID overwrites any existing documents that match that ID.

docs = [
    Document(
        page_content="there are cats in the pond",
        metadata={"id": 1, "___location": "pond", "topic": "animals"},
    ),
    Document(
        page_content="ducks are also found in the pond",
        metadata={"id": 2, "___location": "pond", "topic": "animals"},
    ),
    Document(
        page_content="fresh apples are available at the market",
        metadata={"id": 3, "___location": "market", "topic": "food"},
    ),
    Document(
        page_content="the market also sells fresh oranges",
        metadata={"id": 4, "___location": "market", "topic": "food"},
    ),
    Document(
        page_content="the new art exhibit is fascinating",
        metadata={"id": 5, "___location": "museum", "topic": "art"},
    ),
    Document(
        page_content="a sculpture exhibit is also at the museum",
        metadata={"id": 6, "___location": "museum", "topic": "art"},
    ),
    Document(
        page_content="a new coffee shop opened on Main Street",
        metadata={"id": 7, "___location": "Main Street", "topic": "food"},
    ),
    Document(
        page_content="the book club meets at the library",
        metadata={"id": 8, "___location": "library", "topic": "reading"},
    ),
    Document(
        page_content="the library hosts a weekly story time for kids",
        metadata={"id": 9, "___location": "library", "topic": "reading"},
    ),
    Document(
        page_content="a cooking class for beginners is offered at the community center",
        metadata={"id": 10, "___location": "community center", "topic": "classes"},
    ),
]

vector_store.add_documents(docs, ids=[doc.metadata["id"] for doc in docs])

Update items in vector store

docs = [
    Document(
        page_content="Updated - cooking class for beginners is offered at the community center",
        metadata={"id": 10, "___location": "community center", "topic": "classes"},
    )
]
vector_store.add_documents(docs, ids=[doc.metadata["id"] for doc in docs])

Delete items from vector store

vector_store.delete(ids=["3"])

Query vector store

Once your vector store has been created and the relevant documents have been added, you can query the vector store in your chain or agent.

Filtering support

The vector store supports a set of filters that can be applied against the metadata fields of the documents.

Operator   Meaning/Category
$eq        Equality (==)
$ne        Inequality (!=)
$lt        Less than (<)
$lte       Less than or equal (<=)
$gt        Greater than (>)
$gte       Greater than or equal (>=)
$in        Special cased (in)
$nin       Special cased (not in)
$between   Special cased (between)
$like      Text (like)
$ilike     Text (case-insensitive like)
$and       Logical (and)
$or        Logical (or)

Query directly

Performing a simple similarity search can be done as follows:

results = vector_store.similarity_search(
    "kitty", k=10, filter={"id": {"$in": [1, 5, 2, 9]}}
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
    * there are cats in the pond [{'id': 1, 'topic': 'animals', '___location': 'pond'}]
    * ducks are also found in the pond [{'id': 2, 'topic': 'animals', '___location': 'pond'}]
    * the new art exhibit is fascinating [{'id': 5, 'topic': 'art', '___location': 'museum'}]
    * the library hosts a weekly story time for kids [{'id': 9, 'topic': 'reading', '___location': 'library'}]

If you provide a dictionary with multiple fields but no operators, the top level is interpreted as a logical AND filter:

vector_store.similarity_search(
    "ducks",
    k=10,
    filter={"id": {"$in": [1, 5, 2, 9]}, "___location": {"$in": ["pond", "market"]}},
)
[Document(id='2', metadata={'id': 2, 'topic': 'animals', '___location': 'pond'}, page_content='ducks are also found in the pond'),
 Document(id='1', metadata={'id': 1, 'topic': 'animals', '___location': 'pond'}, page_content='there are cats in the pond')]
vector_store.similarity_search(
    "ducks",
    k=10,
    filter={
        "$and": [
            {"id": {"$in": [1, 5, 2, 9]}},
            {"___location": {"$in": ["pond", "market"]}},
        ]
    },
)
[Document(id='2', metadata={'id': 2, 'topic': 'animals', '___location': 'pond'}, page_content='ducks are also found in the pond'),
 Document(id='1', metadata={'id': 1, 'topic': 'animals', '___location': 'pond'}, page_content='there are cats in the pond')]
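The remaining operators from the table above follow the same pattern. As an illustration (the query text and filter values are made up for demonstration), the following combines $or with $ilike:

results = vector_store.similarity_search(
    "snacks",
    k=10,
    filter={
        "$or": [
            {"topic": {"$eq": "food"}},
            {"___location": {"$ilike": "%library%"}},
        ]
    },
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")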

If you want to execute a similarity search and receive the corresponding scores, you can run:

results = vector_store.similarity_search_with_score(query="cats", k=1)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")
* [SIM=0.528338] there are cats in the pond [{'id': 1, 'topic': 'animals', '___location': 'pond'}]

For a full list of the different searches you can execute on a PGVector vector store, refer to the API reference.

Query by turning into retriever

You can also transform the vector store into a retriever for easier usage in your chains.

retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 1})
retriever.invoke("kitty")
[Document(id='1', metadata={'id': 1, 'topic': 'animals', '___location': 'pond'}, page_content='there are cats in the pond')]
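As a minimal sketch of using the retriever in a chain, the following assumes you also have a chat model deployed in the same Azure OpenAI resource; the deployment name gpt-4o and the API version below are placeholders and aren't configured elsewhere in this notebook:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import AzureChatOpenAI

# Placeholder deployment name and API version; replace with values that match
# your Azure OpenAI resource. Endpoint and key are reused from the environment
# variables set earlier in this notebook.
llm = AzureChatOpenAI(
    azure_deployment="gpt-4o",
    api_version="2024-06-01",
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)


def format_docs(docs):
    # Join the retrieved documents into a single context string
    return "\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

chain.invoke("What animals can be found at the pond?")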

Current limitations

  • langchain_postgres works only with psycopg 3. Update your connection strings from postgresql+psycopg2://... to postgresql+psycopg://...
  • The schema of the embedding store and collection has changed to make add_documents work correctly with user specified IDs.
  • You now have to pass an explicit connection object (a connection string or a sqlalchemy Engine).

Currently, there is no mechanism that supports easy data migration on schema changes. Any schema change in the vector store requires you to recreate the tables and re-add the documents.