# Triton Server (tritonfrontend) Bindings (Beta)
The `tritonfrontend` python package is a set of bindings to Triton’s existing frontends implemented in C++. Currently, `tritonfrontend` supports starting up the `KServeHttp` and `KServeGrpc` frontends. These bindings, used in combination with Triton’s Python In-Process API (`tritonserver`) and `tritonclient`, extend the ability to use Triton’s full feature set with a few lines of Python.

Let us walk through a simple example:
1. First, we need to load the desired models and start the server with `tritonserver`.
```python
import tritonserver

# Constructing path to Model Repository
model_path = f"server/src/python/examples/example_model_repository"

server_options = tritonserver.Options(
    server_id="ExampleServer",
    model_repository=model_path,
    log_error=True,
    log_warn=True,
    log_info=True,
)
server = tritonserver.Server(server_options).start(wait_until_ready=True)
```
Note: `model_path` may need to be edited depending on your setup.
2. Now, start up the respective services with `tritonfrontend`:
```python
from tritonfrontend import KServeHttp, KServeGrpc, Metrics

http_options = KServeHttp.Options(thread_count=5)
http_service = KServeHttp(server, http_options)
http_service.start()

# Default options (if none provided)
grpc_service = KServeGrpc(server)
grpc_service.start()

# Can start metrics service as well
metrics_service = Metrics(server)
metrics_service.start()
```
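To confirm the frontends came up, you can poll their endpoints directly. The sketch below is illustrative: it assumes the default ports (HTTP on 8000, metrics on 8002) and uses only the standard library; adjust the ports if you passed different options above.

```python
import urllib.request

# Liveness check against the KServe HTTP frontend (default port 8000).
with urllib.request.urlopen("http://localhost:8000/v2/health/live") as resp:
    print("HTTP frontend live:", resp.status == 200)

# The Metrics service exposes Prometheus-format metrics (default port 8002).
with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    print(resp.read().decode()[:300])  # print the first few metric lines
```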
3. Finally, with the services running, we can use `tritonclient` or simple `curl` commands to send requests and receive responses from the frontends.
```python
import tritonclient.http as httpclient
import numpy as np  # Use version numpy < 2

model_name = "identity"  # output == input
url = "localhost:8000"

# Create a Triton client
client = httpclient.InferenceServerClient(url=url)

# Prepare input data
input_data = np.array([["Roger Roger"]], dtype=object)

# Create input and output objects
inputs = [httpclient.InferInput("INPUT0", input_data.shape, "BYTES")]

# Set the data for the input tensor
inputs[0].set_data_from_numpy(input_data)

results = client.infer(model_name, inputs=inputs)

# Get the output data
output_data = results.as_numpy("OUTPUT0")

# Print results
print("[INFERENCE RESULTS]")
print("Output data:", output_data)
```
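The gRPC frontend started above can be exercised in the same way. A minimal sketch using `tritonclient.grpc`, assuming the default gRPC port of 8001 and the same identity model:

```python
import tritonclient.grpc as grpcclient
import numpy as np

# Create a client against the KServe gRPC frontend (default port 8001)
grpc_client = grpcclient.InferenceServerClient(url="localhost:8001")

# Same input as the HTTP example above
input_data = np.array([["Roger Roger"]], dtype=object)
inputs = [grpcclient.InferInput("INPUT0", input_data.shape, "BYTES")]
inputs[0].set_data_from_numpy(input_data)

results = grpc_client.infer("identity", inputs=inputs)
print("Output data:", results.as_numpy("OUTPUT0"))
```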
```python
# Stop respective services and server.
metrics_service.stop()
http_service.stop()
grpc_service.stop()
server.stop()
```
Additionally, `tritonfrontend` provides context manager support, so steps 2-3 could also be achieved through:
```python
from tritonfrontend import KServeHttp
import tritonclient.http as httpclient
import numpy as np  # Use version numpy < 2

with KServeHttp(server) as http_service:
    # The identity model returns an exact duplicate of the input data as output
    model_name = "identity"
    url = "localhost:8000"

    # Create a Triton client
    with httpclient.InferenceServerClient(url=url) as client:
        # Prepare input data
        input_data = np.array(["Roger Roger"], dtype=object)

        # Create input and output objects
        inputs = [httpclient.InferInput("INPUT0", input_data.shape, "BYTES")]

        # Set the data for the input tensor
        inputs[0].set_data_from_numpy(input_data)

        # Perform inference
        results = client.infer(model_name, inputs=inputs)

        # Get the output data
        output_data = results.as_numpy("OUTPUT0")

        # Print results
        print("[INFERENCE RESULTS]")
        print("Output data:", output_data)

server.stop()
```
With this workflow, you can avoid having to stop each service after client requests have terminated.
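Multiple frontends can also be combined in a single `with` statement. A brief sketch (the server itself still needs to be stopped explicitly once the block exits):

```python
from tritonfrontend import KServeHttp, KServeGrpc

# Both services start on entry and stop automatically on exit
with KServeHttp(server) as http_service, KServeGrpc(server) as grpc_service:
    # Send requests with tritonclient or curl while both services are running
    ...

server.stop()
```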
## Known Issues
- The following features are not currently supported when launching the Triton frontend services through the python bindings:
  - VertexAI
  - Sagemaker
- After a running server has been stopped, if the client sends an inference request, a Segmentation Fault will occur.