AI_GENERATE_CHUNKS (Transact-SQL)

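Applies to: SQL Server 2025 (17.x) Preview

AI_GENERATE_CHUNKS is a table-valued function that creates "chunks", or fragments of text, based on a type, a size, and a source expression.

Compatibility level 170

AI_GENERATE_CHUNKS requires the compatibility level to be at least 170. When the level is less than 170, the Database Engine can't find the AI_GENERATE_CHUNKS function.

To change the compatibility level of a database, refer to "View or change the compatibility level of a database".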
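
As a minimal sketch of that check, the following batch reads the compatibility level of the current database from sys.databases and then raises it to 170. It assumes you are connected to the target database and have permission to alter it.

```sql
-- Check the compatibility level of the current database.
SELECT name, compatibility_level
FROM sys.databases
WHERE name = DB_NAME();
GO

-- Raise the level to 170 so that AI_GENERATE_CHUNKS can be resolved.
ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 170;
GO
```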

Syntax

Transact-SQL syntax conventions

AI_GENERATE_CHUNKS (source = text_expression
                    , chunk_type = 'FIXED'
                   [ [ , ] chunk_size = numeric_expression ]
                   [ [ , ] overlap = numeric_expression ]
)

Arguments

source

An expression of any character type (for example, nvarchar, varchar, nchar, or char).

chunk_type

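A string literal naming the type or method used to chunk the text or document. It can't be NULL or a value from a column.

Accepted values for this release: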

  • FIXED

chunk_size

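When chunk_type is FIXED, this parameter sets the size of each chunk, in characters. It can be specified as a variable, a literal, or a scalar expression of type smallint, int, or bigint. chunk_size can't be NULL, negative, or zero (0).

overlap

The overlap parameter determines the percentage of the preceding text that is included in the current chunk. This percentage is applied to the chunk_size parameter to calculate the overlap size in characters. The overlap value can be specified as a variable, a literal, or a scalar expression of type tinyint, smallint, int, or bigint. It must be a whole number between zero (0) and 50, inclusive, and can't be NULL or negative. The default value is zero (0).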
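
The sketch below illustrates passing chunk_size and overlap as variables and shows the arithmetic described above: 10 percent of a 100-character chunk is 10 characters. The variable names, the sample text, and the VALUES wrapper are assumptions made for illustration. How the function rounds a percentage that doesn't yield a whole number of characters isn't stated here, so treat the computed column as an approximation rather than a description of the engine's behavior.

```sql
-- Hypothetical sample input: 300 characters of repeated digits.
DECLARE @text nvarchar(max) = REPLICATE(CAST(N'0123456789' AS nvarchar(max)), 30);
DECLARE @chunk_size int = 100;  -- characters per chunk
DECLARE @overlap int = 10;      -- percent of chunk_size, 0 through 50

SELECT
    c.chunk_order,
    c.chunk_offset,
    c.chunk_length,
    @chunk_size * @overlap / 100 AS overlap_characters  -- 100 * 10 / 100 = 10
FROM (VALUES (@text)) AS s (text_to_chunk)
CROSS APPLY
    AI_GENERATE_CHUNKS(source = s.text_to_chunk, chunk_type = N'FIXED', chunk_size = @chunk_size, overlap = @overlap) c;
```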

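Return types

AI_GENERATE_CHUNKS returns a table with the following columns:

| Column name | Data type | Description |
| --- | --- | --- |
| chunk | Same as the source expression data type | The text that was chunked from the source expression. |
| chunk_set_id | int | An ID that groups all chunks of a document or row. If multiple documents or rows are chunked in a single transaction, each document or row gets a different chunk_set_id. |
| chunk_order | int | A sequence number giving the order of each chunk, starting at 1 and increasing by 1. |
| chunk_offset | int | The position of the chunk in the source data or document, relative to the beginning of the chunking process. |
| chunk_length | int | The length, in characters, of the returned text chunk. |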

Return example

Here's an example of the results that AI_GENERATE_CHUNKS returns with the following parameters:

  • A chunk type of FIXED.

  • A chunk size of 50 characters.

  • Text to chunk: All day long we seemed to dawdle through a country which was full of beauty of every kind. Sometimes we saw little towns or castles on the top of steep hills such as we see in old missals; sometimes we ran by rivers and streams which seemed from the wide stony margin on each side of them to be subject to great floods.

| chunk | chunk_set_id | chunk_order | chunk_offset | chunk_length |
| --- | --- | --- | --- | --- |
| All day long we seemed to dawdle through a country | 1 | 1 | 1 | 50 |
| which was full of beauty of every kind. Sometimes | 1 | 2 | 51 | 50 |
| we saw little towns or castles on the top of stee | 1 | 3 | 101 | 50 |
| p hills such as we see in old missals; sometimes w | 1 | 4 | 151 | 50 |
| e ran by rivers and streams which seemed from the | 1 | 5 | 201 | 50 |
| wide stony margin on each side of them to be subje | 1 | 6 | 251 | 50 |
| ct to great floods. | 1 | 7 | 301 | 19 |
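
A query along the following lines would be expected to produce a result set like the one above. It is only a sketch: the variable name @passage and the VALUES wrapper are illustrative choices, not part of the function.

```sql
-- Hypothetical variable holding the sample passage from the example.
DECLARE @passage nvarchar(max) = N'All day long we seemed to dawdle through a country which was full of beauty of every kind. Sometimes we saw little towns or castles on the top of steep hills such as we see in old missals; sometimes we ran by rivers and streams which seemed from the wide stony margin on each side of them to be subject to great floods.';

-- Split the passage into fixed 50-character chunks and return every output column.
SELECT
    c.chunk,
    c.chunk_set_id,
    c.chunk_order,
    c.chunk_offset,
    c.chunk_length
FROM (VALUES (@passage)) AS s (text_to_chunk)
CROSS APPLY
    AI_GENERATE_CHUNKS(source = s.text_to_chunk, chunk_type = N'FIXED', chunk_size = 50) c
ORDER BY c.chunk_order;
```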

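Remarks

AI_GENERATE_CHUNKS can be used on a table with multiple rows. Depending on the chunk size and the amount of text being chunked, the result set uses the chunk_set_id column to indicate when a new row or document starts. In the following example, chunk_set_id changes when the function finishes chunking the text of the first row and moves on to the second. The values of chunk_order and chunk_offset also reset to indicate the new starting point.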

CREATE TABLE textchunk (text_id INT IDENTITY(1,1) PRIMARY KEY, text_to_chunk nvarchar(max));
GO

INSERT INTO textchunk (text_to_chunk)
VALUES
('All day long we seemed to dawdle through a country which was full of beauty of every kind. Sometimes we saw little towns or castles on the top of steep hills such as we see in old missals; sometimes we ran by rivers and streams which seemed from the wide stony margin on each side of them to be subject to great floods.'),
('My Friend, Welcome to the Carpathians. I am anxiously expecting you. Sleep well to-night. At three to-morrow the diligence will start for Bukovina; a place on it is kept for you. At the Borgo Pass my carriage will await you and will bring you to me. I trust that your journey from London has been a happy one, and that you will enjoy your stay in my beautiful land. Your friend, DRACULA')
GO

SELECT c.*
FROM textchunk t
CROSS APPLY
   AI_GENERATE_CHUNKS(source = text_to_chunk, chunk_type = N'FIXED', chunk_size = 50) c

Here is the result set.

| chunk | chunk_set_id | chunk_order | chunk_offset | chunk_length |
| --- | --- | --- | --- | --- |
| All day long we seemed to dawdle through a country | 1 | 1 | 1 | 50 |
| which was full of beauty of every kind. Sometimes | 1 | 2 | 51 | 50 |
| we saw little towns or castles on the top of stee | 1 | 3 | 101 | 50 |
| p hills such as we see in old missals; sometimes w | 1 | 4 | 151 | 50 |
| e ran by rivers and streams which seemed from the | 1 | 5 | 201 | 50 |
| wide stony margin on each side of them to be subje | 1 | 6 | 251 | 50 |
| ct to great floods. | 1 | 7 | 301 | 19 |
| My Friend, Welcome to the Carpathians. I am anxi | 2 | 1 | 1 | 50 |
| ously expecting you. Sleep well to-night. At three | 2 | 2 | 51 | 50 |
| to-morrow the diligence will start for Bukovina; | 2 | 3 | 101 | 50 |
| a place on it is kept for you. At the Borgo Pass m | 2 | 4 | 151 | 50 |
| y carriage will await you and will bring you to me | 2 | 5 | 201 | 50 |
| . I trust that your journey from London has been a | 2 | 6 | 251 | 50 |
| happy one, and that you will enjoy your stay in m | 2 | 7 | 301 | 50 |
| y beautiful land. Your friend, DRACULA | 2 | 8 | 351 | 38 |
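
To summarize the chunks for each source row rather than list them, the output can be grouped by the table's own key. The following is a sketch against the textchunk table created above; the column aliases are illustrative. Going by the result set above, it should report 7 chunks for the first row and 8 for the second.

```sql
-- Count the chunks and total chunked characters produced for each source row.
SELECT
    t.text_id,
    COUNT(*) AS chunk_count,
    SUM(c.chunk_length) AS total_characters
FROM textchunk t
CROSS APPLY
    AI_GENERATE_CHUNKS(source = t.text_to_chunk, chunk_type = N'FIXED', chunk_size = 50) c
GROUP BY t.text_id
ORDER BY t.text_id;
```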

Examples

A. Chunk a text column with type FIXED and a size of 100 characters

The following example uses AI_GENERATE_CHUNKS to chunk a text column. It uses a chunk_type of FIXED and a chunk_size of 100 characters.

SELECT
    c.chunk
FROM
   docs_table t
CROSS APPLY
   AI_GENERATE_CHUNKS(source = text_column, chunk_type = N'FIXED', chunk_size = 100) c

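B. Chunk a text column with overlap

The following example uses AI_GENERATE_CHUNKS to chunk a text column with overlap. It uses a chunk_type of FIXED, a chunk_size of 100 characters, and an overlap of 10 percent.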

SELECT
    c.chunk
FROM
   docs_table t
CROSS APPLY
   AI_GENERATE_CHUNKS(source = text_column, chunk_type = N'FIXED', chunk_size = 100, overlap = 10) c

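C. Use AI_GENERATE_EMBEDDINGS with AI_GENERATE_CHUNKS

This example uses AI_GENERATE_EMBEDDINGS with AI_GENERATE_CHUNKS to create embeddings from text chunks, and then inserts the vector arrays returned from the AI model inference endpoint into a table.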

INSERT INTO
    my_embeddings (chunked_text, vector_embeddings)
SELECT
    c.chunk,
    AI_GENERATE_EMBEDDINGS(c.chunk USE MODEL MyAzureOpenAiModel)
FROM
    table_with_text t
CROSS APPLY
    AI_GENERATE_CHUNKS(source = t.text_to_chunk, chunk_type = N'FIXED', chunk_size = 100) c
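
The INSERT above assumes that the destination table already exists. One possible definition is sketched below. The table and column names match the example, but the identity key column and the vector dimension of 1536 are assumptions; set the dimension to whatever your embedding model returns.

```sql
-- Hypothetical destination table for the chunk text and its embedding.
CREATE TABLE my_embeddings
(
    embed_id INT IDENTITY(1,1) PRIMARY KEY,  -- assumed surrogate key
    chunked_text nvarchar(max),
    vector_embeddings vector(1536)           -- assumed dimension; must match the model output
);
GO
```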