# Testing TensorRT-LLM backend
Tests in this CI directory can be run manually to provide extensive testing.
## Run QA Tests
Run the tests inside the Triton container:
```bash
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v /path/to/tensorrtllm_backend:/opt/tritonserver/tensorrtllm_backend \
    nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 bash

# Change directory to the test and run the test.sh script
cd /opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm
bash -x ./test.sh
```
## Run the e2e/benchmark_core_model tests to benchmark
These two tests are run as part of the L0_backend_trtllm test. The instructions below describe how to run them manually.
### Generate the model repository
Follow the instructions in the Create the model repository section to prepare the model repository.
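For reference, the sketch below shows one way to assemble the repository from the model templates shipped in this repo; the engine path is a placeholder and assumes you have already built TensorRT-LLM engines for your model.

```bash
# Sketch only: copy the inflight batching model templates into a fresh model
# repository and place the prebuilt TensorRT-LLM engines in the tensorrt_llm
# model's version directory. Paths are placeholders.
cd /opt/tritonserver/tensorrtllm_backend
mkdir -p triton_model_repo
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
cp /path/to/engines/* triton_model_repo/tensorrt_llm/1/
```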
### Modify the model configuration
Follow the instructions in the Modify the model configuration section to modify the model configuration as needed.
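For example, the `${...}` placeholders in each `config.pbtxt` can be filled with the `tools/fill_template.py` helper. The values below are placeholders, and the exact set of required keys varies between releases, so consult the main README (or the config templates themselves) for the full list.

```bash
# Sketch only: fill the placeholders in each model's config.pbtxt.
# Keys and values here are examples; check the templates for the complete
# set of parameters required by your release.
cd /opt/tritonserver/tensorrtllm_backend
python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt \
    tokenizer_dir:/path/to/tokenizer,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:False,engine_dir:triton_model_repo/tensorrt_llm/1,batching_strategy:inflight_fused_batching
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt \
    tokenizer_dir:/path/to/tokenizer,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:64
```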
### End to end test
The end-to-end test script sends requests to the deployed `ensemble` model.
The `ensemble` model is composed of three models: `preprocessing`, `tensorrt_llm`, and `postprocessing`:

- "preprocessing": This model tokenizes the input, converting prompts (string) to input_ids (list of ints).
- "tensorrt_llm": This model is a wrapper around your TensorRT-LLM model and is used for inference.
- "postprocessing": This model de-tokenizes the output, converting output_ids (list of ints) back to outputs (string).

The end-to-end latency includes the total latency of all three parts of the ensemble model.
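Before benchmarking, you can optionally verify that the ensemble responds to a single request. The sketch below uses Triton's HTTP `generate` endpoint; the input names (`text_input`, `max_tokens`, `bad_words`, `stop_words`) follow the default ensemble configuration in `all_models/inflight_batcher_llm` and may differ if you have customized it.

```bash
# Sketch only: single request to the ensemble via the generate endpoint.
# Assumes the server is running locally on the default HTTP port (8000).
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
    '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```

To run the end-to-end test: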
```bash
cd tools/inflight_batcher_llm
python3 end_to_end_test.py --dataset <dataset path>
```
Expected outputs:

```
[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 11099.243 ms
```
### benchmark_core_model
The `benchmark_core_model` script sends requests directly to the deployed `tensorrt_llm` model, so its reported latency reflects only the inference latency of TensorRT-LLM. It does not include the pre/post-processing latency, which is usually handled by a third-party library such as HuggingFace.
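To illustrate the difference, a direct request to the `tensorrt_llm` model carries already-tokenized input. The sketch below is an assumption based on the default `tensorrt_llm` configuration (tensors named `input_ids`, `input_lengths`, and `request_output_len`); the token IDs are made up, and the exact tensor names, shapes, and endpoint (non-decoupled mode assumed) depend on your configuration.

```bash
# Sketch only: direct request to the core tensorrt_llm model with
# pre-tokenized input. Token IDs are illustrative; tensor names assume the
# default (non-decoupled) tensorrt_llm config.
curl -X POST localhost:8000/v2/models/tensorrt_llm/generate -d \
    '{"input_ids": [101, 2054, 2003, 3698, 4083, 1029], "input_lengths": 6, "request_output_len": 32}'
```

To run the benchmark: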
```bash
cd tools/inflight_batcher_llm
python3 benchmark_core_model.py dataset --dataset <dataset path>
```
Expected outputs:

```
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 10213.462 ms
```
Please note that the expected outputs in this document are for reference only; the actual performance numbers depend on the GPU you are using.