Supported Models#

Use this documentation to learn the details of supported models for NVIDIA NIM for LLMs. For the list of available models, refer to Models.

GPUs#

The GPUs listed in the following sections have the following specifications.

| GPU | Family | Memory |
| --- | --- | --- |
| DGX B200 | | 1.4 TB |
| H200 | SXM | 141 GB |
| H100 | SXM | 80 GB |
| H100 | NVL | 94 GB |
| A100 | SXM | 80 GB |
| L40S | PCIe | 48 GB |
| A10G | PCIe | 24 GB |
| NVIDIA RTX 6000 Ada Generation | | 48 GB |
| GeForce RTX 5090 | | 32 GB |
| GeForce RTX 5080 | | 16 GB |
| GeForce RTX 4090 | | 24 GB |
| GeForce RTX 4080 | | 16 GB |

Important: This NIM does not support Multi-Instance GPU (MIG) mode.

Optimized Models#

The following models are optimized using TRT-LLM and are available as pre-built, optimized engines on NGC; use the Chat Completions endpoint with these models. In vGPU environments, the GPU memory values in the following sections refer to the total GPU memory, including the memory reserved for the vGPU setup.
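As a quick orientation, the request body for the Chat Completions endpoint follows the OpenAI-compatible schema. The sketch below builds a minimal payload; the server URL, port, and model name are placeholders — substitute the model you actually deploy.

```python
import json

# Minimal OpenAI-compatible chat completions request body for a NIM server.
# URL and model name are placeholders, not values from this page.
url = "http://localhost:8000/v1/chat/completions"  # assumed default NIM port
payload = {
    "model": "meta/llama-3.1-8b-instruct",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    "max_tokens": 64,
}

body = json.dumps(payload)
print(body)
```

Send `body` as the POST payload with the `Content-Type: application/json` header.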

NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. Generic model profiles can be identified by the presence of local_build or vllm in the profile name. On systems where there are no compatible optimized profiles, generic profiles are chosen automatically. Optimized profiles are preferred over generic profiles when available, but you can choose to deploy a generic profile on any system by following the steps at Profile Selection.
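The selection rule above can be sketched in a few lines: generic profiles are recognized by `local_build` or `vllm` in the profile name, and optimized profiles win when any are compatible. The profile names below are hypothetical, chosen only to illustrate the rule.

```python
def is_generic(profile_name: str) -> bool:
    # Per the text above, generic profiles have "local_build" or "vllm" in the name.
    return "local_build" in profile_name or "vllm" in profile_name

def pick_profile(compatible_profiles):
    # Optimized profiles are preferred; fall back to a generic profile otherwise.
    optimized = [p for p in compatible_profiles if not is_generic(p)]
    if optimized:
        return optimized[0]
    generic = [p for p in compatible_profiles if is_generic(p)]
    return generic[0] if generic else None

profiles = [
    "tensorrt_llm-h100-fp8-tp2-throughput",  # hypothetical optimized profile
    "vllm-bf16-tp2",                         # hypothetical generic profile
]
print(pick_profile(profiles))
```

On a system with no compatible optimized profile, the same function falls through to the generic entry, mirroring the automatic selection described above.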

Refer to Models for additional information about the features, such as LoRA, that these models support.

Code Llama 13B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP16 | Throughput | 2 | 24.63 |
| H100 SXM | FP16 | Latency | 4 | 25.32 |
| A100 SXM | FP16 | Throughput | 2 | 24.63 |
| A100 SXM | FP16 | Latency | 4 | 25.31 |
| L40S | FP16 | Throughput | 2 | 25.32 |
| L40S | FP16 | Latency | 2 | 24.63 |
| A10G | FP16 | Throughput | 4 | 25.32 |
| A10G | FP16 | Latency | 8 | 26.69 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
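The generic-profile conditions above can be expressed as a small feasibility check. This is an illustrative sketch, not a NIM API: the function and parameter names are invented here, and it assumes homogeneous GPUs as the text requires.

```python
def meets_generic_requirements(
    gpu_free_fraction,        # free-memory fraction per GPU (0.0-1.0), one entry per GPU
    gpu_memory_gb,            # memory per GPU (homogeneous GPUs assumed)
    required_memory_gb,       # aggregate memory the model needs
    compute_capability,
    precision="fp16",
):
    """Illustrative check of the generic-profile conditions described above."""
    # bfloat16 requires compute capability 8.0; other precisions require 7.0.
    min_cc = 8.0 if precision == "bfloat16" else 7.0
    if compute_capability < min_cc:
        return False
    # At least one GPU must have 95% or more of its memory free.
    if not any(f >= 0.95 for f in gpu_free_fraction):
        return False
    # Aggregate memory across the homogeneous GPUs must cover the model.
    return len(gpu_free_fraction) * gpu_memory_gb >= required_memory_gb

# Two 80 GB GPUs at compute capability 9.0 for a model needing ~130 GB:
print(meets_generic_requirements([0.99, 0.97], 80, 130, 9.0, "bfloat16"))
```

Even when this check passes, the model is not guaranteed to run; it only filters out configurations that clearly cannot.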

Code Llama 34B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.42 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.42 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Code Llama 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Throughput | 4 | 65.47 |
| H100 SXM | FP8 | Latency | 8 | 66.37 |
| H100 SXM | FP16 | Throughput | 4 | 130.35 |
| H100 SXM | FP16 | Latency | 8 | 66.37 |
| A100 SXM | FP16 | Throughput | 4 | 130.35 |
| A100 SXM | FP16 | Latency | 8 | 132.71 |
| A10G | FP16 | Throughput | 8 | 132.69 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

DeepSeek R1#

Supported Configurations#

The following configurations support this model:

  • 1 node of [8 x H200] for 8 total H200 GPUs

  • 2 nodes of [8 x H100] for 16 total H100 GPUs

  • 2 nodes of [8 x H20] for 16 total H20 GPUs

Refer to the NGC catalog entry for further information.

DeepSeek R1 Distill Llama 8B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H200 SXM | FP8 | Throughput | 1 | 8.58 |
| H200 SXM | FP8 | Latency | 2 | 8.72 |
| H200 SXM | BF16 | Throughput | 1 | 15.05 |
| H200 SXM | BF16 | Latency | 2 | 16.12 |
| H100 SXM | FP8 | Throughput | 1 | 8.58 |
| H100 SXM | FP8 | Latency | 2 | 8.74 |
| H100 SXM | BF16 | Throughput | 1 | 15.05 |
| H100 SXM | BF16 | Latency | 2 | 16.12 |
| H100 NVL | FP8 | Throughput | 1 | 8.58 |
| H100 NVL | FP8 | Latency | 2 | 8.73 |
| H100 NVL | BF16 | Latency | 2 | 16.12 |
| H100 NVL | BF16 | Throughput | 1 | 15.0 |
| A100 SXM | BF16 | Throughput | 1 | 15.16 |
| A100 SXM | BF16 | Latency | 2 | 16.36 |
| L40S | FP8 | Throughput | 1 | 8.58 |
| L40S | FP8 | Latency | 2 | 8.71 |
| L40S | BF16 | Throughput | 1 | 15.14 |
| L40S | BF16 | Latency | 2 | 16.32 |
| A10G | BF16 | Throughput | 2 | 16.12 |
| A10G | BF16 | Latency | 4 | 18.25 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

DeepSeek R1 Distill Llama 70B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H200 | FP8 | Latency | 4 | 68.66 |
| H200 | FP8 | Throughput | 2 | 68.12 |
| H200 | BF16 | Latency | 8 | 146.18 |
| H200 | BF16 | Throughput | 4 | 137.77 |
| H100 | FP8 | Latency | 4 | 68.65 |
| H100 | FP8 | Throughput | 2 | 68.18 |
| H100 | FP8 | Latency | 8 | 69.6 |
| H100 | BF16 | Latency | 8 | 146.18 |
| H100 | BF16 | Throughput | 4 | 137.77 |
| A100 | BF16 | Latency | 8 | 146.19 |
| A100 | BF16 | Throughput | 4 | 137.82 |
| L40S | FP8 | Throughput | 4 | 68.57 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

DeepSeek R1 Distill Llama 8B RTX#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 5.42 |

DeepSeek-R1-Distill-Qwen-32B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H200 | BF16 | Throughput | 1 | 61.19 |
| H100 | BF16 | Throughput | 1 | 61.19 |
| H200 | BF16 | Throughput | 2 | 62.77 |
| H20 | BF16 | Throughput | 1 | 61.19 |
| A100 | BF16 | Throughput | 1 | 61.18 |
| L40S | BF16 | Throughput | 2 | 62.79 |
| L20 | BF16 | Throughput | 2 | 62.8 |
| L40S | FP8 | Throughput | 2 | 32.49 |
| H200 | FP8 | Throughput | 1 | 32.15 |
| H200 | FP8 | Throughput | 2 | 32.45 |
| H100 | FP8 | Throughput | 1 | 32.14 |
| H20 | FP8 | Throughput | 1 | 32.12 |
| L20 | FP8 | Throughput | 1 | 32.16 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

DeepSeek-R1-Distill-Qwen-7B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H200 | BF16 | Throughput | 1 | 21.93 |
| H200 | FP8 | Throughput | 1 | 15.85 |
| H100 | BF16 | Throughput | 1 | 21.94 |
| H100 | FP8 | Throughput | 1 | 15.84 |
| H20 | BF16 | Throughput | 1 | 22.00 |
| H20 | FP8 | Throughput | 1 | 15.83 |
| L20 | BF16 | Throughput | 1 | 21.97 |
| L20 | FP8 | Throughput | 1 | 15.84 |
| A100 | BF16 | Throughput | 1 | 21.98 |
| A10G | BF16 | Throughput | 1 | 21.97 |

DeepSeek-R1-Distill-Qwen-14B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H20 | FP8 | Throughput | 1 | 22.52 |
| H20 | BF16 | Throughput | 1 | 34.98 |
| L20 | FP8 | Throughput | 1 | 22.54 |
| L20 | BF16 | Throughput | 1 | 34.96 |
| H100 | FP8 | Throughput | 1 | 22.54 |
| H200 | FP8 | Throughput | 1 | 22.54 |
| H200 | BF16 | Throughput | 1 | 34.87 |
| L40S | FP8 | Throughput | 1 | 22.55 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1

Qwen2.5 72B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H20 | FP8 | Throughput | 4 | 77.71 |
| H20 | FP8 | Throughput | 8 | 77.96 |
| H20 | FP8 | Latency | 4 | 78.22 |
| H20 | FP8 | Latency | 8 | 78.98 |
| L20 | FP8 | Throughput | 4 | 78.14 |
| L20 | FP8 | Throughput | 8 | 79.15 |
| L20 | FP8 | Latency | 4 | 78.14 |
| L20 | FP8 | Latency | 8 | 78.89 |
| A100 SXM | BF16 | Throughput | 4 | 150.35 |
| A100 SXM | BF16 | Latency | 8 | 160.18 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Qwen2.5 7B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| L20 | FP16 | Throughput | 1 | 21.66 |
| A100 PCIe 40GB | FP16 | Latency | 1 | 21.66 |
| A100 PCIe 40GB | BF16 | Throughput | 1 | 21.66 |
| A100 PCIe 40GB | FP16 | Balanced | 1 | 21.66 |
| A100 SXM/NVLink | FP16 | Latency | 1 | 21.66 |
| A100 SXM/NVLink | BF16 | Throughput | 1 | 21.66 |
| A100 SXM/NVLink | BF16 | Balanced | 1 | 21.66 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16, FP16

  • # of GPUs: 1

Gemma 2 2B#

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2

Gemma 2 9B#

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2, or 4

(Meta) Llama 2 7B Chat#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Throughput | 1 | 6.57 |
| H100 SXM | FP8 | Latency | 2 | 6.66 |
| H100 SXM | FP16 | Throughput | 1 | 12.62 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| H100 SXM | FP16 | Latency | 2 | 12.93 |
| A100 SXM | FP16 | Throughput | 1 | 15.54 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| A100 SXM | FP16 | Latency | 2 | 12.92 |
| L40S | FP8 | Throughput | 1 | 6.57 |
| L40S | FP8 | Latency | 2 | 6.64 |
| L40S | FP16 | Throughput | 1 | 12.64 |
| L40S | FP16 | Throughput LoRA | 1 | 12.65 |
| L40S | FP16 | Latency | 2 | 12.95 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

(Meta) Llama 2 13B Chat#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Latency | 2 | 12.6 |
| H100 SXM | FP16 | Throughput | 1 | 24.33 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 24.35 |
| H100 SXM | FP16 | Latency | 2 | 24.71 |
| A100 SXM | FP16 | Throughput | 1 | 24.34 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 24.37 |
| A100 SXM | FP16 | Latency | 2 | 24.74 |
| L40S | FP8 | Throughput | 1 | 12.49 |
| L40S | FP8 | Latency | 2 | 12.59 |
| L40S | FP16 | Throughput | 1 | 24.33 |
| L40S | FP16 | Throughput LoRA | 1 | 24.37 |
| L40S | FP16 | Latency | 2 | 24.7 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

(Meta) Llama 2 70B Chat#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Throughput | 2 | 65.08 |
| H100 SXM | FP8 | Latency | 4 | 65.36 |
| H100 SXM | FP16 | Throughput | 4 | 130.52 |
| H100 SXM | FP16 | Throughput LoRA | 4 | 130.6 |
| H100 SXM | FP16 | Latency | 8 | 133.18 |
| A100 SXM | FP16 | Throughput | 4 | 130.52 |
| A100 SXM | FP16 | Throughput LoRA | 4 | 130.5 |
| A100 SXM | FP16 | Latency | 8 | 133.12 |
| L40S | FP8 | Throughput | 4 | 63.35 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3 SQLCoder 8B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Throughput | 1 | 8.52 |
| H100 SXM | FP8 | Latency | 2 | 8.61 |
| H100 SXM | FP16 | Throughput | 1 | 15 |
| H100 SXM | FP16 | Latency | 2 | 16.02 |
| L40S | FP8 | Throughput | 1 | 8.53 |
| L40S | FP8 | Latency | 2 | 8.61 |
| L40S | FP16 | Throughput | 1 | 15 |
| L40S | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Throughput | 1 | 15 |
| A10G | FP16 | Throughput | 2 | 16.02 |
| A10G | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Latency | 4 | 18.06 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3 Swallow 70B Instruct V0.1#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 69.3 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 145.94 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 137.7 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3 Taiwan 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 145.94 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 137.7 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 145.94 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.1 8B Base#

Optimized Configurations#

The Profile column indicates what the model is optimized for.

| GPU | Precision | Profile | # of GPUs |
| --- | --- | --- | --- |
| H100 SXM | BF16 | Latency | 2 |
| H100 SXM | FP8 | Latency | 2 |
| H100 SXM | BF16 | Throughput | 1 |
| H100 SXM | FP8 | Throughput | 1 |
| A100 SXM | BF16 | Latency | 2 |
| A100 SXM | BF16 | Throughput | 1 |
| L40S | BF16 | Latency | 2 |
| L40S | BF16 | Throughput | 2 |
| A10G | BF16 | Latency | 4 |
| A10G | BF16 | Throughput | 2 |

Generic Configuration#

GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
| --- | --- | --- | --- |
| Any single NVIDIA GPU, or multiple homogeneous NVIDIA GPUs, with sufficient (aggregate) memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95%+ free memory; not guaranteed | 24 | FP16 | 15 |

Llama 3.1 8B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| B200 | FP8 | Throughput | 1 | 8.56 |
| B200 | FP8 | Throughput-LoRA | 1 | 8.6 |
| B200 | FP8 | Latency | 2 | 8.68 |
| B200 | BF16 | Throughput | 1 | 15.05 |
| B200 | BF16 | Throughput-LoRA | 1 | 15.06 |
| B200 | BF16 | Latency | 2 | 16.09 |
| H200 | FP8 | Throughput | 1 | 8.58 |
| H200 | FP8 | Throughput-LoRA | 1 | 8.63 |
| H200 | FP8 | Latency | 2 | 8.72 |
| H200 | BF16 | Throughput | 1 | 22.1 |
| H200 | BF16 | Throughput-LoRA | 1 | 15.13 |
| H200 | BF16 | Latency | 2 | 16.2 |
| H100 | FP8 | Throughput | 1 | 8.58 |
| H100 | FP8 | Throughput-LoRA | 1 | 8.63 |
| H100 | FP8 | Latency | 2 | 8.73 |
| H100 | BF16 | Throughput | 1 | 15.06 |
| H100 | BF16 | Throughput-LoRA | 1 | 15.07 |
| H100 | BF16 | Latency | 2 | 16.12 |
| H100 NVL | FP8 | Throughput | 1 | 8.58 |
| H100 NVL | FP8 | Throughput-LoRA | 1 | 8.63 |
| H100 NVL | FP8 | Latency | 2 | 8.73 |
| H100 NVL | BF16 | Throughput | 1 | 15.06 |
| H100 NVL | BF16 | Throughput-LoRA | 1 | 15.07 |
| H100 NVL | BF16 | Latency | 2 | 16.12 |
| A100 | BF16 | Throughput | 1 | 22.2 |
| A100 | BF16 | Throughput-LoRA | 1 | 15.19 |
| A100 | BF16 | Latency | 2 | 16.12 |
| L40S | FP8 | Throughput | 1 | 8.6 |
| L40S | FP8 | Throughput-LoRA | 1 | 8.64 |
| L40S | FP8 | Latency | 2 | 8.76 |
| L40S | BF16 | Throughput | 1 | 15.18 |
| L40S | BF16 | Throughput | 2 | 16.15 |
| L40S | BF16 | Latency | 2 | 16.43 |
| L40S | BF16 | Latency | 4 | 18.26 |
| A10G | BF16 | Throughput | 2 | 16.36 |
| A10G | BF16 | Throughput-LoRA | 2 | 16.35 |

Generic Configuration#

GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
| --- | --- | --- | --- |
| Any single NVIDIA GPU, or multiple homogeneous NVIDIA GPUs, with sufficient (aggregate) memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95%+ free memory; not guaranteed | 24 | FP16 | 15 |

Llama 3.1 8B Instruct RTX#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 5.41 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 5.42 |

Generic Configuration#

GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
| --- | --- | --- | --- |
| Any single NVIDIA GPU, or multiple homogeneous NVIDIA GPUs, with sufficient (aggregate) memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95%+ free memory; not guaranteed | 24 | FP16 | 15 |

Llama 3.1 Nemotron Nano 8B V1#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H200 | FP8 | Throughput | 1 | 8.58 |
| H200 | FP8 | Latency | 2 | 8.73 |
| H200 | BF16 | Throughput | 1 | 15.06 |
| H200 | BF16 | Latency | 2 | 16.12 |
| H100 | FP8 | Throughput | 1 | 8.58 |
| H100 | FP8 | Latency | 2 | 8.73 |
| H100 | BF16 | Throughput | 1 | 15.96 |
| H100 | BF16 | Latency | 2 | 16.12 |
| H100 NVL | FP8 | Throughput | 1 | 8.57 |
| H100 NVL | FP8 | Latency | 2 | 8.73 |
| H100 NVL | BF16 | Throughput | 1 | 15.06 |
| H100 NVL | BF16 | Latency | 2 | 16.12 |
| A100 | BF16 | Throughput | 1 | 15.06 |
| A100 | BF16 | Latency | 2 | 16.12 |
| L40S | FP8 | Throughput | 1 | 8.6 |
| L40S | FP8 | Latency | 2 | 8.64 |
| L40S | BF16 | Throughput | 1 | 15.06 |
| L40S | BF16 | Latency | 2 | 16.12 |
| L40S | BF16 | Throughput | 2 | 16.15 |
| L40S | BF16 | Latency | 4 | 18.26 |
| A10G | BF16 | Throughput | 2 | 16.12 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs:

    • 1 or 2 H200, H100 SXM, H100 NVL, or A100 GPUs

    • 2 or 4 L40S or A10G GPUs

Llama 3.1 Nemotron Ultra 253B V1#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| B200 | FP8 | Throughput | 8 | 241.43 |
| H200 SXM | FP8 | Throughput | 8 | 242.0 |
| H100 SXM | FP8 | Throughput | 8 | 241.96 |
| H100 NVL | FP8 | Throughput | 8 | 242.01 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 8 (H100 NVL, B200, or H200 SXM)

Llama 3.2 1B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| B200 | FP8 | Throughput | 1 | 1.94 |
| B200 | FP8 | Throughput LoRA | 1 | 1.96 |
| B200 | BF16 | Throughput | 1 | 2.85 |
| B200 | BF16 | Throughput LoRA | 1 | 2.37 |
| H200 SXM | FP8 | Throughput | 1 | 2.41 |
| H200 SXM | FP8 | Throughput LoRA | 1 | 2.41 |
| H200 SXM | BF16 | Throughput | 1 | 1.95 |
| H200 SXM | BF16 | Throughput LoRA | 1 | 1.97 |
| H100 SXM | FP8 | Throughput | 1 | 1.95 |
| H100 SXM | FP8 | Throughput LoRA | 1 | 1.97 |
| H100 SXM | BF16 | Throughput | 1 | 2.89 |
| H100 SXM | BF16 | Throughput LoRA | 1 | 2.9 |
| H100 NVL | FP8 | Throughput | 1 | 1.95 |
| H100 NVL | FP8 | Throughput LoRA | 1 | 1.97 |
| H100 NVL | BF16 | Throughput | 1 | 2.41 |
| H100 NVL | BF16 | Throughput LoRA | 1 | 2.41 |
| A100 SXM | BF16 | Throughput | 1 | 2.89 |
| A100 SXM | BF16 | Throughput LoRA | 1 | 2.9 |
| A100 40GB | BF16 | Throughput | 1 | 2.9 |
| A100 40GB | BF16 | Throughput LoRA | 1 | 2.9 |
| L40S | FP8 | Throughput | 1 | 1.47 |
| L40S | FP8 | Throughput LoRA | 1 | 2.89 |
| L40S | BF16 | Throughput | 1 | 2.9 |
| L40S | BF16 | Throughput LoRA | 2 | 1.97 |
| A10G | BF16 | Throughput | 2 | 3.89 |
| A10G | BF16 | Throughput LoRA | 2 | 2.89 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.2 3B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| B200 | FP8 | Throughput | 1 | 4.16 |
| B200 | BF16 | Throughput | 1 | 6.79 |
| H200 | FP8 | Throughput | 1 | 4.17 |
| H200 | FP8 | Throughput LoRA | 1 | 4.17 |
| H200 | FP16 | Throughput | 1 | 6.79 |
| H200 | FP16 | Throughput LoRA | 1 | 6.79 |
| H100 | FP8 | Throughput | 1 | 4.17 |
| H100 | FP8 | Throughput LoRA | 1 | 4.17 |
| H100 | FP16 | Throughput | 1 | 6.79 |
| H100 | FP16 | Throughput LoRA | 1 | 6.79 |
| H100 NVL | FP8 | Throughput | 1 | 4.17 |
| H100 NVL | FP8 | Throughput LoRA | 1 | 4.17 |
| H100 NVL | FP16 | Throughput | 1 | 6.79 |
| H100 NVL | FP16 | Throughput LoRA | 1 | 6.79 |
| A100 | FP16 | Throughput | 1 | 6.79 |
| A100 | FP16 | Throughput LoRA | 1 | 6.79 |
| A100 40GB | FP16 | Throughput | 1 | 6.79 |
| A100 40GB | FP16 | Throughput LoRA | 2 | 6.79 |
| L40S | FP8 | Throughput | 1 | 4.17 |
| L40S | FP8 | Throughput LoRA | 1 | 4.17 |
| L40S | FP16 | Throughput | 1 | 6.79 |
| L40S | FP16 | Throughput LoRA | 1 | 6.79 |
| A10G | FP16 | Throughput | 1 | 6.79 |
| A10G | FP16 | Throughput LoRA | 1 | 6.79 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: One H100, A100, or L40S

Llama 3.1 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| B200 | FP8 | Throughput | 1 | 67.84 |
| B200 | FP8 | Throughput | 2 | 67.97 |
| B200 | FP8 | Latency | 2 | 68.09 |
| B200 | FP8 | Latency | 4 | 68.34 |
| B200 | BF16 | Throughput | 2 | 68.09 |
| B200 | BF16 | Throughput | 4 | 68.47 |
| B200 | BF16 | Latency | 4 | 68.5 |
| H200 SXM | FP8 | Throughput | 1 | 67.88 |
| H200 SXM | FP8 | Throughput | 2 | 68.1 |
| H200 SXM | FP8 | Throughput LoRA | 2 | 68.23 |
| H200 SXM | FP8 | Latency | 2 | 68.23 |
| H200 SXM | FP8 | Latency | 4 | 68.68 |
| H200 SXM | BF16 | Throughput | 2 | 68.24 |
| H200 SXM | BF16 | Throughput | 4 | 68.76 |
| H200 SXM | BF16 | Latency | 4 | 68.81 |
| H100 SXM | FP8 | Throughput | 2 | 68.11 |
| H100 SXM | FP8 | Throughput LoRA | 2 | 68.23 |
| H100 SXM | FP8 | Throughput | 4 | 68.66 |
| H100 SXM | FP8 | Throughput LoRA | 4 | 68.96 |
| H100 SXM | FP8 | Latency | 4 | 68.65 |
| H100 SXM | FP8 | Latency | 8 | 69.56 |
| H100 SXM | BF16 | Throughput | 4 | 137.83 |
| H100 SXM | BF16 | Throughput | 8 | 146.2 |
| H100 SXM | BF16 | Latency | 8 | 146.2 |
| H100 NVL | FP8 | Throughput | 2 | 68.12 |
| H100 NVL | FP8 | Throughput LoRA | 2 | 68.23 |
| H100 NVL | FP8 | Throughput | 4 | 68.74 |
| H100 NVL | FP8 | Throughput LoRA | 4 | 68.96 |
| H100 NVL | FP8 | Latency | 4 | 68.73 |
| H100 NVL | FP8 | Latency | 8 | 69.71 |
| H100 NVL | BF16 | Throughput | 4 | 137.83 |
| H100 NVL | BF16 | Throughput LoRA | 4 | 138.7 |
| H100 NVL | BF16 | Throughput | 8 | 146.2 |
| H100 NVL | BF16 | Latency | 8 | 146.2 |
| A100 SXM | BF16 | Throughput | 4 | 139.79 |
| A100 SXM | BF16 | Throughput LoRA | 4 | 139.21 |
| A100 SXM | BF16 | Latency | 8 | 149.59 |
| A100 40GB | BF16 | Throughput | 8 | 149.31 |
| A100 40GB | BF16 | Throughput LoRA | 8 | 148.49 |
| A10G | BF16 | Throughput | 8 | 149.71 |
| A10G | BF16 | Throughput LoRA | 8 | 148.24 |
| L40S | FP8 | Throughput | 4 | 68.82 |
| L40S | FP8 | Throughput LoRA | 4 | 69.12 |
| L40S | BF16 | Throughput | 8 | 150.32 |
| L40S | BF16 | Throughput LoRA | 8 | 149.41 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.1 405B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Latency | 8 | 388.75 |
| H100 SXM | FP16 | Latency | 16 | 794.9 |
| A100 SXM | FP16 | Latency | 16 | 798.2 |

Generic Configuration#

GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
| --- | --- | --- | --- |
| Any single NVIDIA GPU, or multiple homogeneous NVIDIA GPUs, with sufficient (aggregate) memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95%+ free memory; not guaranteed | 240 | FP16 | 100 |

Llama 3.1 Nemotron 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| H100 SXM | FP8 | Throughput | 2 | 68.18 |
| H100 SXM | FP8 | Throughput | 4 | 68.64 |
| H100 SXM | FP8 | Latency | 8 | 69.77 |
| H100 SXM | FP16 | Throughput | 4 | 137.94 |
| H100 SXM | FP16 | Latency | 8 | 146.41 |
| A100 SXM | FP16 | Throughput | 4 | 137.93 |
| A100 SXM | FP16 | Latency | 8 | 146.41 |

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.1 Swallow 8B Instruct v0.1#

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2, 4

Llama 3.1 Swallow 70B Instruct v0.1#

Generic Configuration#

This model should run, although it is not guaranteed to, on any single NVIDIA GPU with sufficient memory, or on multiple homogeneous NVIDIA GPUs with sufficient aggregate memory. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 2, 4, 8

Llama 3.3 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
| --- | --- | --- | --- | --- |
| B200 | FP8 | Throughput | 1 | 67.84 |
| B200 | BF16 | Throughput | 2 | 68.09 |
| B200 | FP8 | Latency | 2 | 68.09 |
| B200 | BF16 | Latency | 4 | 68.49 |
| B200 | BF16 | Throughput | 4 | 68.47 |
| B200 | FP8 | Latency | 4 | 68.34 |
| A100 40GB | BF16 | Throughput | 8 | 149.31 |
| A100 40GB | BF16 | Throughput LoRA | 8 | 148.53 |
| A100 | BF16 | Throughput | 4 | 139.77 |
| A100 | BF16 | Throughput LoRA | 4 | 139.13 |
| A100 | BF16 | Latency | 8 | 149.38 |
| H100 NVL | FP8 | Throughput | 2 | 68.13 |
| H100 NVL | FP8 | Throughput LoRA | 2 | 68.22 |
| H100 NVL | FP8 | Latency | 4 | 68.72 |
| H100 NVL | BF16 | Throughput | 4 | 137.83 |
| H100 NVL | FP8 | Throughput | 4 | 68.73 |
| H100 NVL | BF16 | Throughput LoRA | 4 | 138.73 |
| H100 NVL | FP8 | Throughput LoRA | 4 | 68.94 |
| H100 NVL | BF16 | Latency | 8 | 146.19 |
| H100 NVL | FP8 | Latency | 8 | 69.72 |
| H100 NVL | BF16 | Throughput | 8 | 146.19 |
| H100 | FP8 | Throughput | 2 | 68.11 |
| H100 | FP8 | Throughput LoRA | 2 | 68.22 |
| H100 | FP8 | Latency | 4 | 68.64 |
| H100 | BF16 | Throughput | 4 | 137.83 |
| H100 | FP8 | Throughput | 4 | 68.65 |
| H100 | FP8 | Throughput LoRA | 4 | 68.94 |
| H100 | BF16 | Latency | 8 | 146.19 |
| H100 | FP8 | Latency | 8 | 69.56 |
| H100 | BF16 | Throughput | 8 | 146.19 |
| H200 | FP8 | Throughput | 1 | 67.88 |
| H200 | FP8 | Latency | 2 | 68.23 |
| H200 | BF16 | Throughput | 2 | 68.23 |
| H200 | FP8 | Throughput | 2 | 68.1 |
| H200 | FP8 | Throughput LoRA | 2 | 68.22 |
| H200 | BF16 | Latency | 4 | 68.82 |
| H200 | FP8 | Latency | 4 | 68.68 |
| H200 | BF16 | Throughput | 4 | 68.76 |
| L40S | FP8 | Throughput | 4 | 68.97 |
| L40S | FP8 | Throughput LoRA | 4 | 69.17 |
| L40S | BF16 | Throughput | 8 | 150.23 |
| L40S | BF16 | Throughput LoRA | 8 | 149.42 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs:

    • 1 or 2 H200, H100 SXM, H100 NVL, or A100

    • 4 or 8 L40S
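To see which of the optimized or buildable profiles above are compatible with a given system, and to pin a specific one at deployment time, NIM containers provide a profile-listing utility and the `NIM_MODEL_PROFILE` environment variable (refer to Profile Selection for the authoritative steps). A hedged sketch, assuming the Llama 3.3 70B Instruct NIM image name, a valid `NGC_API_KEY` in the environment, and a placeholder profile ID:

```shell
# List model profiles compatible with the detected GPUs.
# The image tag is an illustrative placeholder; use the tag you pulled.
docker run --rm --gpus=all \
  -e NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:latest \
  list-model-profiles

# Deploy with a specific profile, using an ID copied from the listing.
docker run --rm --gpus=all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile-id-from-listing> \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
```

If `NIM_MODEL_PROFILE` is unset, the container picks a compatible optimized profile automatically, falling back to a generic profile when no optimized profile matches the hardware.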

Meta Llama 3 8B Instruct#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP16 | Throughput | 1 | 28 |
| H100 SXM | FP16 | Latency | 2 | 28 |
| A100 SXM | FP16 | Throughput | 1 | 28 |
| A100 SXM | FP16 | Latency | 2 | 28 |
| L40S | FP8 | Throughput | 1 | 20.5 |
| L40S | FP8 | Latency | 2 | 20.5 |
| L40S | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Latency | 2 | 28 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|------|------------|-----------|------------|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory (compute capability >= 7.0, or 8.0 for bfloat16, with at least one GPU having 95% or more free memory); not guaranteed | 24 | FP16 | 16 |

Llama 3.3 Nemotron Super 49B V1#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| B200 | FP8 | Throughput | 1 | 48.53 |
| B200 | FP8 | Throughput | 2 | 48.71 |
| B200 | FP8 | Throughput | 4 | 49.1 |
| B200 | FP8 | Latency | 8 | 49.77 |
| B200 | BF16 | Throughput | 4 | 99.45 |
| B200 | BF16 | Latency | 8 | 107.94 |
| H200 | FP8 | Throughput | 1 | 48.55 |
| H200 | FP8 | Latency | 2 | 48.81 |
| H200 | BF16 | Throughput | 2 | 95.28 |
| H200 | BF16 | Latency | 4 | 99.66 |
| H100 | FP8 | Throughput | 1 | 48.55 |
| H100 | FP8 | Throughput | 2 | 48.81 |
| H100 | FP8 | Throughput | 4 | 49.31 |
| H100 | FP8 | Latency | 8 | 50.22 |
| H100 | BF16 | Throughput | 4 | 99.66 |
| H100 | BF16 | Latency | 8 | 108.42 |
| H100 NVL | FP8 | Throughput | 1 | 48.55 |
| H100 NVL | FP8 | Throughput | 2 | 48.81 |
| H100 NVL | FP8 | Throughput | 4 | 49.34 |
| H100 NVL | FP8 | Latency | 8 | 50.25 |
| H100 NVL | BF16 | Throughput | 4 | 99.65 |
| H100 NVL | BF16 | Latency | 8 | 108.39 |
| A100 | BF16 | Throughput | 4 | 100.78 |
| A100 | BF16 | Latency | 8 | 110.19 |
| A100 40GB | BF16 | Throughput | 8 | 110.18 |
| L40S | FP8 | Throughput | 4 | 49.43 |
| L40S | FP8 | Latency | 8 | 50.41 |
| L40S | BF16 | Throughput | 4 | 100.69 |
| L40S | BF16 | Latency | 8 | 110.62 |
| A10G | BF16 | Throughput | 8 | 110.26 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Meta Llama 3 70B Instruct#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP8 | Throughput | 4 | 82 |
| H100 SXM | FP8 | Latency | 8 | 82 |
| H100 SXM | FP16 | Throughput | 4 | 158 |
| H100 SXM | FP16 | Latency | 8 | 158 |
| A100 SXM | FP16 | Throughput | 4 | 158 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|------|------------|-----------|------------|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory (compute capability >= 7.0, or 8.0 for bfloat16, with at least one GPU having 95% or more free memory); not guaranteed | 240 | FP16 | 100 |

Mistral 7B Instruct V0.3#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP8 | Throughput | 1 | 7.08 |
| H100 SXM | FP8 | Latency | 2 | 7.19 |
| H100 SXM | BF16 | Throughput | 1 | 13.56 |
| H100 SXM | BF16 | Latency | 2 | 7.19 |
| A100 SXM | BF16 | Throughput | 1 | 13.56 |
| A100 SXM | BF16 | Latency | 2 | 13.87 |
| L40S | FP8 | Throughput | 1 | 7.08 |
| L40S | FP8 | Latency | 2 | 7.16 |
| L40S | BF16 | Throughput | 1 | 13.55 |
| L40S | BF16 | Latency | 2 | 13.85 |
| A10G | BF16 | Throughput | 2 | 13.87 |
| A10G | BF16 | Latency | 4 | 14.48 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|------|------------|-----------|------------|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory (compute capability >= 7.0, or 8.0 for bfloat16, with at least one GPU having 95% or more free memory); not guaranteed | 24 | FP16 | 16 |

Mistral NeMo Minitron 8B 8K Instruct#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP8 | Throughput | 1 | 8.91 |
| H100 SXM | FP8 | Latency | 2 | 9.03 |
| H100 SXM | FP16 | Throughput | 1 | 15.72 |
| H100 SXM | FP16 | Latency | 2 | 16.78 |
| A100 SXM | FP16 | Throughput | 1 | 15.72 |
| A100 SXM | FP16 | Latency | 2 | 16.78 |
| L40S | FP8 | Throughput | 1 | 8.92 |
| L40S | FP8 | Latency | 2 | 9.02 |
| L40S | FP16 | Throughput | 1 | 15.72 |
| L40S | FP16 | Latency | 2 | 16.77 |
| A10G | FP16 | Throughput | 2 | 16.81 |
| A10G | FP16 | Latency | 4 | 15.72 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Mistral NeMo 12B Instruct RTX#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 31 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Mistral NeMo 12B Instruct#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP8 | Latency | 2 | 13.82 |
| H100 SXM | FP16 | Throughput | 1 | 23.35 |
| H100 SXM | FP16 | Latency | 2 | 25.14 |
| A100 SXM | FP16 | Throughput | 1 | 23.35 |
| A100 SXM | FP16 | Latency | 2 | 25.14 |
| L40S | FP8 | Throughput | 2 | 13.83 |
| L40S | FP8 | Latency | 4 | 15.01 |
| L40S | FP16 | Throughput | 2 | 25.14 |
| L40S | FP16 | Latency | 4 | 28.71 |
| A10G | FP16 | Throughput | 4 | 28.71 |
| A10G | FP16 | Latency | 8 | 35.87 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Mixtral 8x7B Instruct V0.1#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP8 | Latency | 4 | 100 |
| H100 SXM | INT8WO | Throughput | 2 | 100 |
| H100 SXM | INT8WO | Latency | 4 | 100 |
| H100 SXM | FP16 | Throughput | 2 | 100 |
| H100 SXM | FP16 | Latency | 4 | 100 |
| A100 SXM | FP16 | Throughput | 2 | 100 |
| A100 SXM | FP16 | Latency | 4 | 100 |
| L40S | FP8 | Throughput | 4 | 100 |
| L40S | FP16 | Throughput | 4 | 100 |
| A10G | FP16 | Throughput | 8 | 100 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|------|------------|-----------|------------|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory (compute capability >= 7.0, or 8.0 for bfloat16, with at least one GPU having 95% or more free memory); not guaranteed | 24 | FP16 | 16 |

Mixtral 8x22B Instruct V0.1#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP8 | Throughput | 8 | 132.61 |
| H100 SXM | FP8 | Latency | 8 | 132.56 |
| H100 SXM | INT8WO | Throughput | 8 | 134.82 |
| H100 SXM | INT8WO | Latency | 8 | 132.31 |
| H100 SXM | FP16 | Throughput | 8 | 265.59 |
| A100 SXM | FP16 | Throughput | 8 | 265.7 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

StarCoder2 7B#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 | BF16 | Throughput | 1 | 13.89 |
| H100 | BF16 | Latency | 2 | 14.44 |
| H100 | FP8 | Throughput | 1 | 7.56 |
| H100 | FP8 | Latency | 2 | 7.41 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Nemotron 4 340B Instruct#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Nemotron 4 340B Reward#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Phi 3 Mini 4K Instruct#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP8 | Throughput | 1 | 3.8 |
| H100 SXM | FP16 | Throughput | 1 | 7.14 |
| A100 SXM | FP16 | Throughput | 1 | 7.14 |
| L40S | FP8 | Throughput | 1 | 3.8 |
| L40S | FP16 | Throughput | 1 | 7.14 |
| A10G | FP16 | Throughput | 1 | 7.14 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Phind Codellama 34B V2 Instruct#

Optimized Configurations#

The Profile column indicates the workload the model is optimized for; Disk Space values are in GB and cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|-----|-----------|---------|-----------|------------|
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.41 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.43 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

StarCoderBase 15.5B#

Generic Configuration#

Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more free memory.

Supported TRT-LLM Buildable Profiles#

  • Precision: FP32

  • # of GPUs: 2, 4, 8