October 23, 2025 · 12 min read

Testing Triton Inference Server

How we tested NVIDIA Triton Inference Server for AI-powered document processing at scale

Introduction

At AI Linkify, we build AI agents that process complex financial and accounting documents for SMEs. Our workloads involve multiple deep learning models for OCR, classification, and semantic understanding, all requiring fast, scalable inference.

To handle these use cases efficiently, we tested the NVIDIA Triton Inference Server on a Google Cloud VM (Debian) with a T4 GPU. This article walks through our exact test configuration, step by step, along with why each choice matters for performance, reliability, and reproducibility.

1. Preparing the Environment

Verify GPU and OS

First, we confirmed the GPU and OS:

lspci | grep -i nvidia
grep PRETTY_NAME /etc/os-release

This confirms that the T4 GPU is visible on the PCI bus and shows the exact Debian release, so the kernel headers we install match what the NVIDIA driver build expects.

Install Cloud-Optimized Kernel

For GCP, we installed the cloud-optimized kernel for better driver compatibility and performance:

sudo apt install -y linux-image-cloud-amd64 linux-headers-cloud-amd64
sudo init 6
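
After the reboot, it is worth confirming the cloud kernel is actually running; on Debian the cloud kernel reports a -cloud-amd64 suffix:

uname -r   # should end in -cloud-amd64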

2. Installing NVIDIA Drivers

Note:

We chose manual installation over apt install nvidia-driver to ensure a consistent driver version across development and production, which is crucial for reproducible inference workloads.

curl -O https://storage.googleapis.com/nvidia-drivers-us-public/tesla/550.90.12/NVIDIA-Linux-x86_64-550.90.12.run
chmod +x NVIDIA-Linux-x86_64-550.90.12.run
sudo ./NVIDIA-Linux-x86_64-550.90.12.run --silent

The --silent flag allows automated installs, ideal for image baking or reproducible builds.
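
Once the installer finishes, nvidia-smi should report the T4 and the pinned driver version (550.90.12 in our case); a quick sanity check:

nvidia-smi
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader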

3. Mounting Additional Storage

In AI setups, model repositories, logs, and datasets grow fast. Instead of filling the root disk, we attached an external persistent disk (Standard HDD for cost efficiency).

Key advantages:

  • Scalability: Expand disk independently when models grow
  • Flexibility: Detach or delete when no longer needed
  • Reliability: Data persists even if VM is recreated

Format and Mount Disk

sudo mkfs.ext4 /dev/sdb
sudo mkdir /mnt/data
sudo mount /dev/sdb /mnt/data
sudo rsync -av /home/ /mnt/data/home/
sudo rsync -av /var/lib/ /mnt/data/var_lib/

Configure Persistent Mounts

Add to /etc/fstab:

/dev/sdb  /mnt/data  ext4  defaults  0 0
/mnt/data/home     /home     none   bind  0 0
/mnt/data/var_lib  /var/lib  none   bind  0 0
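
One caveat: device names like /dev/sdb are not guaranteed to stay stable across reboots on GCP, so a UUID-based entry is more robust. A sketch of that variant (the UUID placeholder must be replaced with the value blkid prints; the nofail option and fsck order are our own choices):

sudo blkid /dev/sdb                   # prints UUID="xxxxxxxx-..."
# /etc/fstab entry using the UUID instead of the device path:
# UUID=xxxxxxxx-...  /mnt/data  ext4  defaults,nofail  0 2
sudo mount -a                         # re-reads fstab and surfaces any errors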

4. Installing NVIDIA Container Toolkit

Important:

Triton runs inside Docker, which must access the GPU via NVIDIA's Container Toolkit. The legacy nvidia-docker2 is deprecated.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo apt-get install -y \
  nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify Installation

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
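
If pulling an extra image is undesirable, a lighter check is to confirm Docker registered the nvidia runtime configured by nvidia-ctk:

sudo docker info | grep -i runtimes   # should list "nvidia" alongside runc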

5. Deploying Triton Inference Server

Pull Triton Image

docker pull nvcr.io/nvidia/tritonserver:25.09-py3
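
The example model repository referenced below is not part of the image; it comes from the docs/examples directory of the Triton server repo. A sketch of how we populated it (the r25.09 branch name is assumed to match the container tag, and we clone under /home/user so the volume path in the next command resolves):

cd /home/user
git clone -b r25.09 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh    # downloads the sample models into model_repository/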

Run Triton Server

docker run --gpus=1 --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /home/user/server/docs/examples/model_repository:/models \
  nvcr.io/nvidia/tritonserver:25.09-py3 tritonserver --model-repository=/models
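
For reference, Triton expects each model in its own directory with a config.pbtxt and numbered version subdirectories. A minimal sketch of that layout (the inception_onnx name mirrors the client test below; the exact contents depend on fetch_models.sh):

model_repository/
  inception_onnx/
    config.pbtxt      # model name, platform, input/output tensor definitions
    1/
      model.onnx      # version 1 of the model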

Parameter Breakdown

  • --shm-size=1g: allocates shared memory for large models
  • --ulimit memlock=-1: prevents memory swapping for stable GPU inference
  • --ulimit stack=67108864: increases stack size for deep model layers
  • -p8000 / -p8001 / -p8002: REST, gRPC, and Prometheus metrics ports, respectively

Health Check

curl -v localhost:8000/v2/health/ready
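
The ready endpoint returns HTTP 200 with an empty body once all models have loaded. The liveness and metrics endpoints can be probed the same way:

curl -v localhost:8000/v2/health/live   # 200 = server process is up
curl -s localhost:8002/metrics | head   # Prometheus metrics (GPU, request counters)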

6. Testing the Client SDK

Pull SDK Container

docker pull nvcr.io/nvidia/tritonserver:25.09-py3-sdk
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:25.09-py3-sdk

Run Test Inference

/workspace/install/bin/image_client -m inception_onnx -s INCEPTION /workspace/images/mug.jpg
python /workspace/install/python/image_client.py -m inception_onnx -s INCEPTION /workspace/images/mug.jpg
exit

This runs inference using a sample ONNX model and verifies end-to-end GPU acceleration.
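
To push beyond a single request, the SDK container also ships perf_analyzer, which measures latency and throughput under load. A minimal sketch against the same model (the flags beyond -m are our own choices, not part of the original test):

perf_analyzer -m inception_onnx -u localhost:8001 -i grpc --concurrency-range 1:4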

Key Takeaways

  • Pin your versions, both drivers and containers, for reproducible results
  • Use dedicated disks for model repositories to simplify scaling and maintenance
  • Modern container toolkit only: nvidia-docker2 is deprecated, use the NVIDIA Container Toolkit
  • Monitor resource usage: Triton exposes Prometheus metrics on port 8002 for Grafana integration

This configuration has proven reliable for our production workloads at AI Linkify, processing thousands of financial documents daily with consistent performance and scalability.
