
Reference for ultralytics/nn/text_model.py

Note

This file is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/nn/text_model.py. If you spot a problem, please help fix it by contributing a Pull Request 🛠️. Thank you 🙏!


ultralytics.nn.text_model.TextModel

TextModel()

Bases: Module

Abstract base class for text encoding models.

This class defines the interface for text encoding models used in vision-language tasks. Subclasses must implement the tokenize and encode_text methods to provide text tokenization and encoding functionality.

Methods:

    tokenize: Convert input texts to tokens for model processing.
    encode_text: Encode tokenized texts into normalized feature vectors.
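
A minimal sketch of a custom subclass is shown below; the bag-of-characters encoding is a toy scheme for illustration only, not part of Ultralytics:

import torch

from ultralytics.nn.text_model import TextModel


class BagOfCharsTextModel(TextModel):
    """Toy TextModel subclass: hash characters into a fixed-size, L2-normalized feature vector."""

    def __init__(self, dim: int = 64, max_len: int = 16):
        super().__init__()
        self.dim, self.max_len = dim, max_len

    def tokenize(self, texts):
        """Pad/truncate every prompt to max_len character codes (0 is padding)."""
        texts = [texts] if isinstance(texts, str) else texts
        ids = [([ord(c) for c in t] + [0] * self.max_len)[: self.max_len] for t in texts]
        return torch.tensor(ids)

    def encode_text(self, texts, dtype=torch.float32):
        """Count character buckets per prompt, then normalize rows to unit length like the real encoders."""
        feats = torch.zeros(texts.shape[0], self.dim, dtype=dtype)
        for i, row in enumerate(texts):
            for tok in row[row > 0]:
                feats[i, tok % self.dim] += 1
        return feats / feats.norm(p=2, dim=-1, keepdim=True)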

Source code in ultralytics/nn/text_model.py
def __init__(self):
    """Initialize the TextModel base class."""
    super().__init__()

encode_text abstractmethod

encode_text(texts, dtype)

Encode tokenized texts into normalized feature vectors.

Source code in ultralytics/nn/text_model.py
@abstractmethod
def encode_text(self, texts, dtype):
    """Encode tokenized texts into normalized feature vectors."""
    pass

tokenize abstractmethod

tokenize(texts)

Convert input texts to tokens for model processing.

Source code in ultralytics/nn/text_model.py
@abstractmethod
def tokenize(self, texts):
    """Convert input texts to tokens for model processing."""
    pass





ultralytics.nn.text_model.CLIP

CLIP(size: str, device: device)

Bases: TextModel

Implements OpenAI's CLIP (Contrastive Language-Image Pre-training) text encoder.

This class provides a text encoder based on OpenAI's CLIP model, which can convert text into feature vectors that are aligned with corresponding image features in a shared embedding space.

Attributes:

    model (CLIP): The loaded CLIP model.
    device (torch.device): Device where the model is loaded.

Methods:

    tokenize: Convert input texts to CLIP tokens.
    encode_text: Encode tokenized texts into normalized feature vectors.

Examples:

>>> import torch
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> clip_model = CLIP(size="ViT-B/32", device=device)
>>> tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> text_features = clip_model.encode_text(tokens)
>>> print(text_features.shape)

This class implements the TextModel interface using OpenAI's CLIP model for text encoding. It loads a pre-trained CLIP model of the specified size and prepares it for text encoding tasks.

Parameters:

    size (str): Model size identifier (e.g., 'ViT-B/32'). Required.
    device (torch.device): Device to load the model on. Required.

Examples:

>>> import torch
>>> clip_model = CLIP("ViT-B/32", device=torch.device("cuda:0"))
>>> text_features = clip_model.encode_text(["a photo of a cat", "a photo of a dog"])
Source code in ultralytics/nn/text_model.py
def __init__(self, size: str, device: torch.device) -> None:
    """
    Initialize the CLIP text encoder.

    This class implements the TextModel interface using OpenAI's CLIP model for text encoding. It loads
    a pre-trained CLIP model of the specified size and prepares it for text encoding tasks.

    Args:
        size (str): Model size identifier (e.g., 'ViT-B/32').
        device (torch.device): Device to load the model on.

    Examples:
        >>> import torch
        >>> clip_model = CLIP("ViT-B/32", device=torch.device("cuda:0"))
        >>> text_features = clip_model.encode_text(["a photo of a cat", "a photo of a dog"])
    """
    super().__init__()
    self.model, self.image_preprocess = clip.load(size, device=device)
    self.to(device)
    self.device = device
    self.eval()

encode_image

encode_image(
    image: Union[Image, Tensor], dtype: dtype = torch.float32
) -> torch.Tensor

Encode preprocessed images into normalized feature vectors.

This method processes preprocessed image inputs through the CLIP model to generate feature vectors, which are then normalized to unit length. These normalized vectors can be used for text-image similarity comparisons.

Parameters:

    image (PIL.Image | torch.Tensor): Preprocessed image input. If a PIL Image is provided, it will be converted to a tensor using the model's image preprocessing function. Required.
    dtype (torch.dtype): Data type for output features. Default: torch.float32.

Returns:

    (torch.Tensor): Normalized image feature vectors with unit length (L2 norm = 1).

Examples:

>>> from ultralytics.nn.text_model import CLIP
>>> from PIL import Image
>>> clip_model = CLIP("ViT-B/32", device="cuda")
>>> image = Image.open("path/to/image.jpg")
>>> image_tensor = clip_model.image_preprocess(image).unsqueeze(0).to("cuda")
>>> features = clip_model.encode_image(image_tensor)
>>> features.shape
torch.Size([1, 512])
Source code in ultralytics/nn/text_model.py
@smart_inference_mode()
def encode_image(self, image: Union[Image.Image, torch.Tensor], dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """
    Encode preprocessed images into normalized feature vectors.

    This method processes preprocessed image inputs through the CLIP model to generate feature vectors, which are then
    normalized to unit length. These normalized vectors can be used for text-image similarity comparisons.

    Args:
        image (PIL.Image | torch.Tensor): Preprocessed image input. If a PIL Image is provided, it will be
            converted to a tensor using the model's image preprocessing function.
        dtype (torch.dtype, optional): Data type for output features.

    Returns:
        (torch.Tensor): Normalized image feature vectors with unit length (L2 norm = 1).

    Examples:
        >>> from ultralytics.nn.text_model import CLIP
        >>> from PIL import Image
        >>> clip_model = CLIP("ViT-B/32", device="cuda")
        >>> image = Image.open("path/to/image.jpg")
        >>> image_tensor = clip_model.image_preprocess(image).unsqueeze(0).to("cuda")
        >>> features = clip_model.encode_image(image_tensor)
        >>> features.shape
        torch.Size([1, 512])
    """
    if isinstance(image, Image.Image):
        image = self.image_preprocess(image).unsqueeze(0).to(self.device)
    img_feats = self.model.encode_image(image).to(dtype)
    img_feats = img_feats / img_feats.norm(p=2, dim=-1, keepdim=True)
    return img_feats
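
Because encode_image and encode_text both return unit-length vectors, text-image similarity reduces to a matrix product between the two feature sets. A minimal sketch, with the image path and prompts as placeholders:

>>> import torch
>>> from PIL import Image
>>> from ultralytics.nn.text_model import CLIP
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> clip_model = CLIP("ViT-B/32", device=device)
>>> img_feats = clip_model.encode_image(Image.open("path/to/image.jpg"))  # shape (1, 512)
>>> txt_feats = clip_model.encode_text(clip_model.tokenize(["a cat", "a dog"]))  # shape (2, 512)
>>> similarity = img_feats @ txt_feats.T  # cosine similarities, shape (1, 2)
>>> similarity.argmax(dim=-1)  # index of the best-matching prompt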

encode_text

encode_text(texts: Tensor, dtype: dtype = torch.float32) -> torch.Tensor

Encode tokenized texts into normalized feature vectors.

This method processes tokenized text inputs through the CLIP model to generate feature vectors, which are then normalized to unit length. These normalized vectors can be used for text-image similarity comparisons.

Parameters:

    texts (torch.Tensor): Tokenized text inputs, typically created using the tokenize() method. Required.
    dtype (torch.dtype): Data type for output features. Default: torch.float32.

Returns:

    (torch.Tensor): Normalized text feature vectors with unit length (L2 norm = 1).

Examples:

>>> clip_model = CLIP("ViT-B/32", device="cuda")
>>> tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = clip_model.encode_text(tokens)
>>> features.shape
torch.Size([2, 512])
Source code in ultralytics/nn/text_model.py
@smart_inference_mode()
def encode_text(self, texts: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """
    Encode tokenized texts into normalized feature vectors.

    This method processes tokenized text inputs through the CLIP model to generate feature vectors, which are then
    normalized to unit length. These normalized vectors can be used for text-image similarity comparisons.

    Args:
        texts (torch.Tensor): Tokenized text inputs, typically created using the tokenize() method.
        dtype (torch.dtype, optional): Data type for output features.

    Returns:
        (torch.Tensor): Normalized text feature vectors with unit length (L2 norm = 1).

    Examples:
        >>> clip_model = CLIP("ViT-B/32", device="cuda")
        >>> tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog"])
        >>> features = clip_model.encode_text(tokens)
        >>> features.shape
        torch.Size([2, 512])
    """
    txt_feats = self.model.encode_text(texts).to(dtype)
    txt_feats = txt_feats / txt_feats.norm(p=2, dim=-1, keepdim=True)
    return txt_feats

tokenize

tokenize(texts: Union[str, List[str]]) -> torch.Tensor

Convert input texts to CLIP tokens.

Parameters:

    texts (str | List[str]): Input text or list of texts to tokenize. Required.

Returns:

    (torch.Tensor): Tokenized text tensor with shape (batch_size, context_length) ready for model processing.

Examples:

>>> model = CLIP("ViT-B/32", device="cpu")
>>> tokens = model.tokenize("a photo of a cat")
>>> print(tokens.shape)  # torch.Size([1, 77])
Source code in ultralytics/nn/text_model.py
def tokenize(self, texts: Union[str, List[str]]) -> torch.Tensor:
    """
    Convert input texts to CLIP tokens.

    Args:
        texts (str | List[str]): Input text or list of texts to tokenize.

    Returns:
        (torch.Tensor): Tokenized text tensor with shape (batch_size, context_length) ready for model processing.

    Examples:
        >>> model = CLIP("ViT-B/32", device="cpu")
        >>> tokens = model.tokenize("a photo of a cat")
        >>> print(tokens.shape)  # torch.Size([1, 77])
    """
    return clip.tokenize(texts).to(self.device)





ultralytics.nn.text_model.MobileCLIP

MobileCLIP(size: str, device: device)

Bases: TextModel

Implement Apple's MobileCLIP text encoder for efficient text encoding.

This class implements the TextModel interface using Apple's MobileCLIP model, providing efficient text encoding capabilities for vision-language tasks with reduced computational requirements compared to standard CLIP models.

Attributes:

    model (MobileCLIP): The loaded MobileCLIP model.
    tokenizer (callable): Tokenizer function for processing text inputs.
    device (torch.device): Device where the model is loaded.
    config_size_map (dict): Mapping from size identifiers to model configuration names.

Methods:

    tokenize: Convert input texts to MobileCLIP tokens.
    encode_text: Encode tokenized texts into normalized feature vectors.

Examples:

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> text_encoder = MobileCLIP(size="s0", device=device)
>>> tokens = text_encoder.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = text_encoder.encode_text(tokens)

This class implements the TextModel interface using Apple's MobileCLIP model for efficient text encoding.

Parameters:

    size (str): Model size identifier (e.g., 's0', 's1', 's2', 'b', 'blt'). Required.
    device (torch.device): Device to load the model on. Required.

Examples:

>>> import torch
>>> model = MobileCLIP("s0", device=torch.device("cpu"))
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
Source code in ultralytics/nn/text_model.py
def __init__(self, size: str, device: torch.device) -> None:
    """
    Initialize the MobileCLIP text encoder.

    This class implements the TextModel interface using Apple's MobileCLIP model for efficient text encoding.

    Args:
        size (str): Model size identifier (e.g., 's0', 's1', 's2', 'b', 'blt').
        device (torch.device): Device to load the model on.

    Examples:
        >>> import torch
        >>> model = MobileCLIP("s0", device=torch.device("cpu"))
        >>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
        >>> features = model.encode_text(tokens)
    """
    try:
        import warnings

        # Suppress 'timm.models.layers is deprecated, please import via timm.layers' warning from mobileclip usage
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=FutureWarning)
            import mobileclip
    except ImportError:
        # Ultralytics fork preferred since Apple MobileCLIP repo has incorrect version of torchvision
        checks.check_requirements("git+https://github.com/ultralytics/mobileclip.git")
        import mobileclip

    super().__init__()
    config = self.config_size_map[size]
    file = f"mobileclip_{size}.pt"
    if not Path(file).is_file():
        from ultralytics import download

        download(f"https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/{file}")
    self.model = mobileclip.create_model_and_transforms(f"mobileclip_{config}", pretrained=file, device=device)[0]
    self.tokenizer = mobileclip.get_tokenizer(f"mobileclip_{config}")
    self.to(device)
    self.device = device
    self.eval()

encode_text

encode_text(texts: Tensor, dtype: dtype = torch.float32) -> torch.Tensor

Encode tokenized texts into normalized feature vectors.

Parameters:

    texts (torch.Tensor): Tokenized text inputs. Required.
    dtype (torch.dtype): Data type for output features. Default: torch.float32.

Returns:

    (torch.Tensor): Normalized text feature vectors with L2 normalization applied.

Examples:

>>> model = MobileCLIP("s0", device="cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
>>> features.shape
torch.Size([2, 512])  # Actual dimension depends on model size
Source code in ultralytics/nn/text_model.py
@smart_inference_mode()
def encode_text(self, texts: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """
    Encode tokenized texts into normalized feature vectors.

    Args:
        texts (torch.Tensor): Tokenized text inputs.
        dtype (torch.dtype, optional): Data type for output features.

    Returns:
        (torch.Tensor): Normalized text feature vectors with L2 normalization applied.

    Examples:
        >>> model = MobileCLIP("s0", device="cpu")
        >>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
        >>> features = model.encode_text(tokens)
        >>> features.shape
        torch.Size([2, 512])  # Actual dimension depends on model size
    """
    text_features = self.model.encode_text(texts).to(dtype)
    text_features /= text_features.norm(p=2, dim=-1, keepdim=True)
    return text_features
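
Because the returned features are L2-normalized, the dot product of two rows gives the cosine similarity of the corresponding prompts. A minimal sketch, with placeholder prompts:

>>> import torch
>>> model = MobileCLIP("s0", device=torch.device("cpu"))
>>> feats = model.encode_text(model.tokenize(["a photo of a cat", "a photo of a kitten"]))
>>> cosine = (feats[0] @ feats[1]).item()  # scalar in [-1, 1]; higher means more similar prompts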

tokenize

tokenize(texts: List[str]) -> torch.Tensor

Convert input texts to MobileCLIP tokens.

Parameters:

    texts (List[str]): List of text strings to tokenize. Required.

Returns:

    (torch.Tensor): Tokenized text inputs with shape (batch_size, sequence_length).

Examples:

>>> model = MobileCLIP("s0", "cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
Source code in ultralytics/nn/text_model.py
def tokenize(self, texts: List[str]) -> torch.Tensor:
    """
    Convert input texts to MobileCLIP tokens.

    Args:
        texts (List[str]): List of text strings to tokenize.

    Returns:
        (torch.Tensor): Tokenized text inputs with shape (batch_size, sequence_length).

    Examples:
        >>> model = MobileCLIP("s0", "cpu")
        >>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
    """
    return self.tokenizer(texts).to(self.device)





ultralytics.nn.text_model.MobileCLIPTS

MobileCLIPTS(device: device)

Bases: TextModel

Load a TorchScript traced version of MobileCLIP.

This class implements the TextModel interface using Apple's MobileCLIP model in TorchScript format, providing efficient text encoding capabilities for vision-language tasks with optimized inference performance.

Attributes:

    encoder (ScriptModule): The loaded TorchScript MobileCLIP text encoder.
    tokenizer (callable): Tokenizer function for processing text inputs.
    device (torch.device): Device where the model is loaded.

Methods:

    tokenize: Convert input texts to MobileCLIP tokens.
    encode_text: Encode tokenized texts into normalized feature vectors.

Examples:

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> text_encoder = MobileCLIPTS(device=device)
>>> tokens = text_encoder.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = text_encoder.encode_text(tokens)

This class implements the TextModel interface using Apple's MobileCLIP model in TorchScript format for efficient text encoding with optimized inference performance.

Parameters:

    device (torch.device): Device to load the model on. Required.

Examples:

>>> model = MobileCLIPTS(device=torch.device("cpu"))
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
Source code in ultralytics/nn/text_model.py
def __init__(self, device: torch.device):
    """
    Initialize the MobileCLIP TorchScript text encoder.

    This class implements the TextModel interface using Apple's MobileCLIP model in TorchScript format for
    efficient text encoding with optimized inference performance.

    Args:
        device (torch.device): Device to load the model on.

    Examples:
        >>> model = MobileCLIPTS(device=torch.device("cpu"))
        >>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
        >>> features = model.encode_text(tokens)
    """
    super().__init__()
    from ultralytics.utils.downloads import attempt_download_asset

    self.encoder = torch.jit.load(attempt_download_asset("mobileclip_blt.ts"), map_location=device)
    self.tokenizer = clip.clip.tokenize
    self.device = device

encode_text

encode_text(texts: Tensor, dtype: dtype = torch.float32) -> torch.Tensor

Encode tokenized texts into normalized feature vectors.

Parameters:

    texts (torch.Tensor): Tokenized text inputs. Required.
    dtype (torch.dtype): Data type for output features. Default: torch.float32.

Returns:

    (torch.Tensor): Normalized text feature vectors with L2 normalization applied.

Examples:

>>> model = MobileCLIPTS(device="cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
>>> features.shape
torch.Size([2, 512])  # Actual dimension depends on model size
Source code in ultralytics/nn/text_model.py
@smart_inference_mode()
def encode_text(self, texts: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """
    Encode tokenized texts into normalized feature vectors.

    Args:
        texts (torch.Tensor): Tokenized text inputs.
        dtype (torch.dtype, optional): Data type for output features.

    Returns:
        (torch.Tensor): Normalized text feature vectors with L2 normalization applied.

    Examples:
        >>> model = MobileCLIPTS(device="cpu")
        >>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
        >>> features = model.encode_text(tokens)
        >>> features.shape
        torch.Size([2, 512])  # Actual dimension depends on model size
    """
    # NOTE: no need to do normalization here as it's embedded in the torchscript model
    return self.encoder(texts).to(dtype)
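
Since the traced model applies L2 normalization internally, the returned rows should already have (approximately) unit length. A quick check, assuming the mobileclip_blt.ts asset is available for download:

>>> import torch
>>> model = MobileCLIPTS(device=torch.device("cpu"))
>>> feats = model.encode_text(model.tokenize(["a photo of a cat"]))
>>> feats.norm(p=2, dim=-1)  # expected to be approximately 1.0 per row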

tokenize

tokenize(texts: List[str]) -> torch.Tensor

Convert input texts to MobileCLIP tokens.

Parameters:

    texts (List[str]): List of text strings to tokenize. Required.

Returns:

    (torch.Tensor): Tokenized text inputs with shape (batch_size, sequence_length).

Examples:

>>> model = MobileCLIPTS("cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
Source code in ultralytics/nn/text_model.py
def tokenize(self, texts: List[str]) -> torch.Tensor:
    """
    Convert input texts to MobileCLIP tokens.

    Args:
        texts (List[str]): List of text strings to tokenize.

    Returns:
        (torch.Tensor): Tokenized text inputs with shape (batch_size, sequence_length).

    Examples:
        >>> model = MobileCLIPTS("cpu")
        >>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
    """
    return self.tokenizer(texts).to(self.device)





ultralytics.nn.text_model.build_text_model

build_text_model(variant: str, device: device = None) -> TextModel

Build a text encoding model based on the specified variant.

Parameters:

    variant (str): Model variant in format "base:size" (e.g., "clip:ViT-B/32" or "mobileclip:s0"). Required.
    device (torch.device): Device to load the model on. Default: None.

Returns:

    (TextModel): Instantiated text encoding model.

Examples:

>>> model = build_text_model("clip:ViT-B/32", device=torch.device("cuda"))
>>> model = build_text_model("mobileclip:s0", device=torch.device("cpu"))
Source code in ultralytics/nn/text_model.py
def build_text_model(variant: str, device: torch.device = None) -> TextModel:
    """
    Build a text encoding model based on the specified variant.

    Args:
        variant (str): Model variant in format "base:size" (e.g., "clip:ViT-B/32" or "mobileclip:s0").
        device (torch.device, optional): Device to load the model on.

    Returns:
        (TextModel): Instantiated text encoding model.

    Examples:
        >>> model = build_text_model("clip:ViT-B/32", device=torch.device("cuda"))
        >>> model = build_text_model("mobileclip:s0", device=torch.device("cpu"))
    """
    base, size = variant.split(":")
    if base == "clip":
        return CLIP(size, device)
    elif base == "mobileclip":
        return MobileCLIPTS(device)
    else:
        raise ValueError(f"Unrecognized base model: '{base}'. Supported base models: 'clip', 'mobileclip'.")
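
Whichever backend is selected, the returned object exposes the shared TextModel interface, so downstream code can stay backend-agnostic. A minimal sketch with a placeholder prompt:

>>> import torch
>>> from ultralytics.nn.text_model import build_text_model
>>> text_model = build_text_model("mobileclip:s0", device=torch.device("cpu"))
>>> feats = text_model.encode_text(text_model.tokenize(["a photo of a cat"]))
>>> feats.shape[0]  # one feature vector per prompt
1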




