Reference for ultralytics/nn/text_model.py
Note
This file is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/nn/text_model.py. If you spot a problem please help fix it by contributing a Pull Request 🛠️. Thank you 🙏!
ultralytics.nn.text_model.TextModel
TextModel()
Bases: Module
Abstract base class for text encoding models.
This class defines the interface for text encoding models used in vision-language tasks. Subclasses must implement the tokenize and encode_text methods to provide text tokenization and encoding functionality.
Methods:

| Name | Description |
| --- | --- |
| `tokenize` | Convert input texts to tokens for model processing. |
| `encode_text` | Encode tokenized texts into normalized feature vectors. |

Source code in ultralytics/nn/text_model.py (lines 33–35)
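Since TextModel is abstract, a concrete encoder only needs to supply `tokenize` and `encode_text`. A minimal sketch of a hypothetical subclass built against the documented interface (the `DummyTextModel` name and its toy hash-style encoding are illustrative only, not part of the library):

```python
from typing import List, Union

import torch

from ultralytics.nn.text_model import TextModel


class DummyTextModel(TextModel):
    """Toy TextModel subclass; illustrative only, not a real encoder."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.dim = dim

    def tokenize(self, texts: Union[str, List[str]]) -> torch.Tensor:
        # Map each character to its ordinal value, padded/truncated to a fixed length of 77.
        texts = [texts] if isinstance(texts, str) else texts
        tokens = torch.zeros(len(texts), 77, dtype=torch.long)
        for i, t in enumerate(texts):
            ids = torch.tensor([ord(c) for c in t[:77]], dtype=torch.long)
            tokens[i, : len(ids)] = ids
        return tokens

    def encode_text(self, texts: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
        # Bag-of-token-ids projection followed by L2 normalization, mimicking the real encoders' unit-length output.
        feats = torch.nn.functional.one_hot(texts % self.dim, self.dim).sum(dim=1).to(dtype)
        return feats / feats.norm(p=2, dim=-1, keepdim=True)
```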
encode_text
abstractmethod
encode_text(texts, dtype)
Encode tokenized texts into normalized feature vectors.
Source code in ultralytics/nn/text_model.py (lines 42–45)
tokenize
abstractmethod
tokenize(texts)
Convert input texts to tokens for model processing.
Source code in ultralytics/nn/text_model.py (lines 37–40)
ultralytics.nn.text_model.CLIP
CLIP(size: str, device: device)
Bases: TextModel
Implements OpenAI's CLIP (Contrastive Language-Image Pre-training) text encoder.
This class provides a text encoder based on OpenAI's CLIP model, which can convert text into feature vectors that are aligned with corresponding image features in a shared embedding space.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `model` | `CLIP` | The loaded CLIP model. |
| `device` | `device` | Device where the model is loaded. |

Methods:

| Name | Description |
| --- | --- |
| `tokenize` | Convert input texts to CLIP tokens. |
| `encode_text` | Encode tokenized texts into normalized feature vectors. |
Examples:
>>> import torch
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> clip_model = CLIP(size="ViT-B/32", device=device)
>>> tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> text_features = clip_model.encode_text(tokens)
>>> print(text_features.shape)
This class implements the TextModel interface using OpenAI's CLIP model for text encoding. It loads a pre-trained CLIP model of the specified size and prepares it for text encoding tasks.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `size` | `str` | Model size identifier (e.g., 'ViT-B/32'). | required |
| `device` | `device` | Device to load the model on. | required |
Examples:
>>> import torch
>>> clip_model = CLIP("ViT-B/32", device=torch.device("cuda:0"))
>>> text_features = clip_model.encode_text(["a photo of a cat", "a photo of a dog"])
Source code in ultralytics/nn/text_model.py (lines 72–92)
encode_image
encode_image(image: Union[Image, Tensor], dtype: dtype = torch.float32) -> torch.Tensor
Encode preprocessed images into normalized feature vectors.
This method processes preprocessed image inputs through the CLIP model to generate feature vectors, which are then normalized to unit length. These normalized vectors can be used for text-image similarity comparisons.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `image` | `Image \| Tensor` | Preprocessed image input. If a PIL Image is provided, it will be converted to a tensor using the model's image preprocessing function. | required |
| `dtype` | `dtype` | Data type for output features. | `float32` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Normalized image feature vectors with unit length (L2 norm = 1). |
Examples:
>>> from ultralytics.nn.text_model import CLIP
>>> from PIL import Image
>>> clip_model = CLIP("ViT-B/32", device="cuda")
>>> image = Image.open("path/to/image.jpg")
>>> image_tensor = clip_model.image_preprocess(image).unsqueeze(0).to("cuda")
>>> features = clip_model.encode_image(image_tensor)
>>> features.shape
torch.Size([1, 512])
Source code in ultralytics/nn/text_model.py (lines 137–167)
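Because both encode_text and encode_image return unit-length vectors, text-image similarity reduces to a dot product. A brief sketch combining the two documented methods (the image path is a placeholder):

```python
import torch
from PIL import Image

from ultralytics.nn.text_model import CLIP

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
clip_model = CLIP("ViT-B/32", device=device)

# Encode candidate captions and one preprocessed image.
tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog"])
text_features = clip_model.encode_text(tokens)  # (2, 512), L2-normalized

image = Image.open("path/to/image.jpg")  # placeholder path
image_tensor = clip_model.image_preprocess(image).unsqueeze(0).to(device)
image_features = clip_model.encode_image(image_tensor)  # (1, 512), L2-normalized

# Cosine similarity is just the dot product of unit vectors.
similarity = image_features @ text_features.T  # (1, 2)
best_caption_index = similarity.argmax(dim=-1)
```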
encode_text
encode_text(texts: Tensor, dtype: dtype = torch.float32) -> torch.Tensor
Encode tokenized texts into normalized feature vectors.
This method processes tokenized text inputs through the CLIP model to generate feature vectors, which are then normalized to unit length. These normalized vectors can be used for text-image similarity comparisons.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `texts` | `Tensor` | Tokenized text inputs, typically created using the tokenize() method. | required |
| `dtype` | `dtype` | Data type for output features. | `float32` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Normalized text feature vectors with unit length (L2 norm = 1). |
Examples:
>>> clip_model = CLIP("ViT-B/32", device="cuda")
>>> tokens = clip_model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = clip_model.encode_text(tokens)
>>> features.shape
torch.Size([2, 512])
Source code in ultralytics/nn/text_model.py (lines 111–135)
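The dtype argument controls the precision of the returned features, so half-precision output can be requested when memory matters. A small sketch, assuming a CUDA device is available:

```python
import torch

from ultralytics.nn.text_model import CLIP

clip_model = CLIP("ViT-B/32", device=torch.device("cuda:0"))
tokens = clip_model.tokenize(["a photo of a cat"])

# Request float16 features instead of the default float32.
features_fp16 = clip_model.encode_text(tokens, dtype=torch.float16)
assert features_fp16.dtype == torch.float16
```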
tokenize
tokenize(texts: Union[str, List[str]]) -> torch.Tensor
Convert input texts to CLIP tokens.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `texts` | `str \| List[str]` | Input text or list of texts to tokenize. | required |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Tokenized text tensor with shape (batch_size, context_length) ready for model processing. |
Examples:
>>> model = CLIP("ViT-B/32", device="cpu")
>>> tokens = model.tokenize("a photo of a cat")
>>> print(tokens.shape) # torch.Size([1, 77])
Source code in ultralytics/nn/text_model.py (lines 94–109)
ultralytics.nn.text_model.MobileCLIP
MobileCLIP(size: str, device: device)
Bases: TextModel
Implement Apple's MobileCLIP text encoder for efficient text encoding.
This class implements the TextModel interface using Apple's MobileCLIP model, providing efficient text encoding capabilities for vision-language tasks with reduced computational requirements compared to standard CLIP models.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `model` | `MobileCLIP` | The loaded MobileCLIP model. |
| `tokenizer` | `callable` | Tokenizer function for processing text inputs. |
| `device` | `device` | Device where the model is loaded. |
| `config_size_map` | `dict` | Mapping from size identifiers to model configuration names. |

Methods:

| Name | Description |
| --- | --- |
| `tokenize` | Convert input texts to MobileCLIP tokens. |
| `encode_text` | Encode tokenized texts into normalized feature vectors. |
Examples:
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> text_encoder = MobileCLIP(size="s0", device=device)
>>> tokens = text_encoder.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = text_encoder.encode_text(tokens)
This class implements the TextModel interface using Apple's MobileCLIP model for efficient text encoding.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `size` | `str` | Model size identifier (e.g., 's0', 's1', 's2', 'b', 'blt'). | required |
| `device` | `device` | Device to load the model on. | required |
Examples:
>>> import torch
>>> model = MobileCLIP("s0", device=torch.device("cpu"))
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
Source code in ultralytics/nn/text_model.py (lines 196–235)
encode_text
encode_text(texts: Tensor, dtype: dtype = torch.float32) -> torch.Tensor
Encode tokenized texts into normalized feature vectors.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `texts` | `Tensor` | Tokenized text inputs. | required |
| `dtype` | `dtype` | Data type for output features. | `float32` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Normalized text feature vectors with L2 normalization applied. |
Examples:
>>> model = MobileCLIP("s0", device="cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
>>> features.shape
torch.Size([2, 512]) # Actual dimension depends on model size
Source code in ultralytics/nn/text_model.py (lines 253–274)
tokenize
tokenize(texts: List[str]) -> torch.Tensor
Convert input texts to MobileCLIP tokens.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `texts` | `List[str]` | List of text strings to tokenize. | required |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Tokenized text inputs with shape (batch_size, sequence_length). |
Examples:
>>> model = MobileCLIP("s0", "cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
Source code in ultralytics/nn/text_model.py (lines 237–251)
ultralytics.nn.text_model.MobileCLIPTS
MobileCLIPTS(device: device)
Bases: TextModel
Load a TorchScript traced version of MobileCLIP.
This class implements the TextModel interface using Apple's MobileCLIP model in TorchScript format, providing efficient text encoding capabilities for vision-language tasks with optimized inference performance.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `encoder` | `ScriptModule` | The loaded TorchScript MobileCLIP text encoder. |
| `tokenizer` | `callable` | Tokenizer function for processing text inputs. |
| `device` | `device` | Device where the model is loaded. |

Methods:

| Name | Description |
| --- | --- |
| `tokenize` | Convert input texts to MobileCLIP tokens. |
| `encode_text` | Encode tokenized texts into normalized feature vectors. |
Examples:
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> text_encoder = MobileCLIPTS(device=device)
>>> tokens = text_encoder.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = text_encoder.encode_text(tokens)
This class implements the TextModel interface using Apple's MobileCLIP model in TorchScript format for efficient text encoding with optimized inference performance.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `device` | `device` | Device to load the model on. | required |
Examples:
>>> model = MobileCLIPTS(device=torch.device("cpu"))
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
Source code in ultralytics/nn/text_model.py (lines 300–320)
encode_text
encode_text(texts: Tensor, dtype: dtype = torch.float32) -> torch.Tensor
Encode tokenized texts into normalized feature vectors.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `texts` | `Tensor` | Tokenized text inputs. | required |
| `dtype` | `dtype` | Data type for output features. | `float32` |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Normalized text feature vectors with L2 normalization applied. |
Examples:
>>> model = MobileCLIPTS(device="cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
>>> features = model.encode_text(tokens)
>>> features.shape
torch.Size([2, 512]) # Actual dimension depends on model size
Source code in ultralytics/nn/text_model.py (lines 338–358)
tokenize
tokenize(texts: List[str]) -> torch.Tensor
Convert input texts to MobileCLIP tokens.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `texts` | `List[str]` | List of text strings to tokenize. | required |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | Tokenized text inputs with shape (batch_size, sequence_length). |
Examples:
>>> model = MobileCLIPTS("cpu")
>>> tokens = model.tokenize(["a photo of a cat", "a photo of a dog"])
Source code in ultralytics/nn/text_model.py (lines 322–336)
ultralytics.nn.text_model.build_text_model
build_text_model(variant: str, device: device = None) -> TextModel
Build a text encoding model based on the specified variant.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `variant` | `str` | Model variant in format "base:size" (e.g., "clip:ViT-B/32" or "mobileclip:s0"). | required |
| `device` | `device` | Device to load the model on. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `TextModel` | Instantiated text encoding model. |
Examples:
>>> model = build_text_model("clip:ViT-B/32", device=torch.device("cuda"))
>>> model = build_text_model("mobileclip:s0", device=torch.device("cpu"))
Source code in ultralytics/nn/text_model.py (lines 361–382)
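Putting the pieces together, a typical end-to-end flow builds the encoder from a variant string and then tokenizes and encodes text through the shared TextModel interface. A brief sketch; swap the variant for "mobileclip:s0" or another supported value as needed:

```python
import torch

from ultralytics.nn.text_model import build_text_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The variant string is "base:size", e.g. "clip:ViT-B/32" or "mobileclip:s0".
text_model = build_text_model("clip:ViT-B/32", device=device)

tokens = text_model.tokenize(["a photo of a cat", "a photo of a dog"])
features = text_model.encode_text(tokens)  # unit-length feature vectors
print(features.shape)
```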