Testing Triton Inference Server
How we tested NVIDIA Triton Inference Server for AI-powered document processing at scale
Introduction
At AI Linkify, we build AI agents that process complex financial and accounting documents for SMEs. Our workloads involve multiple deep learning models for OCR, classification, and semantic understanding, all requiring fast, scalable inference.
To handle these use cases efficiently, we tested the NVIDIA Triton Inference Server on a Google Cloud VM (Debian) with a T4 GPU. This article walks through our exact test configuration, step by step, along with why each choice matters for performance, reliability, and reproducibility.
1. Preparing the Environment
Verify GPU and OS
First, we confirmed the GPU and OS:
lspci | grep -i nvidia
cat /etc/os-release | grep PRETTY_NAME
This ensures the T4 GPU is visible to the OS and the Debian kernel matches driver header requirements.
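As an optional extra check, the headers matching the running kernel should be present before any driver build; a quick way to confirm this on Debian:
dpkg -l | grep linux-headers-$(uname -r)   # the installed headers package should match the running kernel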
Install Cloud-Optimized Kernel
For GCP, we installed the cloud-optimized kernel for better driver compatibility and performance:
sudo apt install -y linux-image-cloud-amd64 linux-headers-cloud-amd64
sudo init 6
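After the machine comes back up, it is worth confirming that the cloud kernel is the one actually running:
uname -r   # should report a -cloud-amd64 kernel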
2. Installing NVIDIA Drivers
Note:
We chose manual installation over apt install nvidia-driver to ensure consistent driver versions across development and production, which is crucial for reproducible inference workloads.
curl -O https://storage.googleapis.com/nvidia-drivers-us-public/tesla/550.90.12/NVIDIA-Linux-x86_64-550.90.12.run
chmod +x NVIDIA-Linux-x86_64-550.90.12.run
sudo ./NVIDIA-Linux-x86_64-550.90.12.run --silent
The --silent flag allows automated installs, ideal for image baking or reproducible builds.
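Once the installer finishes, nvidia-smi should report the T4 and the pinned driver version; if it errors out, the driver and kernel headers are likely out of sync:
nvidia-smi --query-gpu=name,driver_version --format=csv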
3. Mounting Additional Storage
In AI setups, model repositories, logs, and datasets grow fast. Instead of filling the root disk, we attached an external persistent disk (Standard HDD for cost efficiency).
Key advantages:
- Scalability: Expand disk independently when models grow
- Flexibility: Detach or delete when no longer needed
- Reliability: Data persists even if the VM is recreated
Format and Mount Disk
sudo mkfs.ext4 /dev/sdb
sudo mkdir /mnt/data
sudo mount /dev/sdb /mnt/data
sudo rsync -av /home/ /mnt/data/home/
sudo rsync -av /var/lib/ /mnt/data/var_lib/
Configure Persistent Mounts
Add to /etc/fstab:
/dev/sdb            /mnt/data   ext4   defaults   0 0
/mnt/data/home      /home       none   bind       0 0
/mnt/data/var_lib   /var/lib    none   bind       0 0
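One hardening step worth considering (not reflected in the fstab above) is referencing the disk by UUID, since device names like /dev/sdb can change across reboots; the UUID below is illustrative only. Running mount -a also validates the fstab entries before the next reboot:
sudo blkid /dev/sdb        # note the UUID of the new filesystem
# e.g.  UUID=1234abcd-...   /mnt/data   ext4   defaults   0 0
sudo mount -a              # errors here indicate a bad fstab entry
findmnt /mnt/data          # confirm the mount is active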
4. Installing NVIDIA Container Toolkit
Important:
Triton runs inside Docker, which must access the GPU via NVIDIA's Container Toolkit. The legacy nvidia-docker2 is deprecated.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify Installation
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
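If the container cannot see the GPU, a quick check is whether the nvidia runtime was actually registered with Docker:
sudo docker info | grep -i runtimes   # should list nvidia alongside runc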
5. Deploying Triton Inference Server
Pull Triton Image
docker pull nvcr.io/nvidia/tritonserver:25.09-py3
Run Triton Server
docker run --gpus=1 --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /home/user/server/docs/examples/model_repository:/models \
  nvcr.io/nvidia/tritonserver:25.09-py3 tritonserver --model-repository=/models
Parameter Breakdown
- --shm-size=1g: allocates shared memory for large models
- --ulimit memlock=-1: prevents memory swapping for stable GPU inference
- --ulimit stack=67108864: increases stack size for deep model layers
- -p8000/8001/8002: REST, gRPC, and Prometheus metrics ports
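Triton only loads models whose directories follow its repository layout: one directory per model, numeric version subdirectories, and typically a config.pbtxt. A quick way to sanity-check the mounted repository (the layout sketched below is indicative; the exact contents depend on the example repository):
ls -R /home/user/server/docs/examples/model_repository
# Expected shape, roughly, using the model name from the client test later on:
#   model_repository/
#     inception_onnx/
#       config.pbtxt
#       1/
#         model.onnx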
Health Check
curl -v localhost:8000/v2/health/ready
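Beyond the readiness probe, the server metadata and Prometheus metrics endpoints are useful quick checks; the metrics output on port 8002 is what Grafana later scrapes:
curl localhost:8000/v2               # server metadata (name, version, extensions)
curl localhost:8002/metrics | head   # GPU utilization, inference counts, latencies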
6. Testing the Client SDK
Pull SDK Container
docker pull nvcr.io/nvidia/tritonserver:25.09-py3-sdk
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:25.09-py3-sdk
Run Test Inference
/workspace/install/bin/image_client -m inception_onnx -s INCEPTION /workspace/images/mug.jpg
python /workspace/install/python/image_client.py -m inception_onnx -s INCEPTION /workspace/images/mug.jpg
exit
This runs inference using a sample ONNX model and verifies end-to-end GPU acceleration.
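For load testing beyond a single image, the SDK container also ships perf_analyzer; a minimal sketch, assuming it is on the container's PATH and reusing the same sample model name:
perf_analyzer -m inception_onnx -i grpc -u localhost:8001 --concurrency-range 1:4
This sweeps client concurrency from 1 to 4 and reports throughput and latency percentiles, which is a more realistic signal for capacity planning than a single request.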
Key Takeaways
- Pin your versions, both drivers and containers, for reproducible results
- Use dedicated disks for model repositories to simplify scaling and maintenance
- Modern container toolkit only: nvidia-docker2 is deprecated, so use the NVIDIA Container Toolkit
- Monitor resource usage: Triton exposes Prometheus metrics on port 8002 for Grafana integration
This configuration has proven reliable for our production workloads at AI Linkify, processing thousands of financial documents daily with consistent performance and scalability.