tensorrt-llm

SKILL.md
name: tensorrt-llm
description: Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version: 1.0.0

TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

When to use TensorRT-LLM

Use TensorRT-LLM when:

  • Deploying on NVIDIA GPUs (A100, H100, GB200)
  • Need maximum throughput (24,000+ tokens/sec on Llama 3)
  • Require low latency for real-time applications
  • Working with quantized models (FP8, INT4, FP4)
  • Scaling across multiple GPUs or nodes

Use vLLM instead when:

  • Need simpler setup and Python-first API
  • Want PagedAttention without TensorRT compilation
  • Working with AMD GPUs or non-NVIDIA hardware

Use llama.cpp instead when:

  • Deploying on CPU or Apple Silicon
  • Need edge deployment without NVIDIA GPUs
  • Want simpler GGUF quantization format

Quick start

Installation

# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
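
To verify the install, import the package and print its version (this assumes the wheel exposes __version__, which recent releases do):

python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"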

Basic inference

from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Serving with trtllm-serve

# Start server (automatic model download and compilation)
# --tp_size 4 = tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
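
Because the server exposes an OpenAI-compatible API, the official openai Python client works as well. A minimal sketch, assuming the openai package is installed and the server above is running locally:

from openai import OpenAI

# trtllm-serve ignores the API key; any placeholder works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)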

Key features

Performance optimizations

  • In-flight batching: Dynamic batching during generation
  • Paged KV cache: Efficient memory management (tunable; see the sketch after this list)
  • Flash Attention: Optimized attention kernels
  • Quantization: FP8, INT4, FP4 for 2-4× faster inference
  • CUDA graphs: Reduced kernel launch overhead
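
Most of these are enabled by default; the paged KV cache is the one you tune most often. A minimal sketch, assuming KvCacheConfig is exported from tensorrt_llm.llmapi (field names may differ slightly between releases):

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Give the paged KV cache 90% of free GPU memory and reuse cached
# blocks across requests that share a common prefix
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,
        enable_block_reuse=True,
    ),
)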

Parallelism

  • Tensor parallelism (TP): Split model across GPUs
  • Pipeline parallelism (PP): Layer-wise distribution (composed with TP in the sketch after this list)
  • Expert parallelism: For Mixture-of-Experts models
  • Multi-node: Scale beyond single machine
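
TP and PP compose: a 70B model on four GPUs can, for example, use 2-way TP within each pipeline stage and 2-way PP across stages. A sketch of that layout (tensor_parallel_size matches the earlier examples; pipeline_parallel_size is assumed to follow the same naming):

from tensorrt_llm import LLM

# 4 GPUs total: 2-way tensor parallel × 2-way pipeline parallel
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
)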

Advanced features

  • Speculative decoding: Faster generation with draft models
  • LoRA serving: Efficient multi-adapter deployment
  • Disaggregated serving: Separate prefill and generation

Common patterns

Quantized model (FP8)

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Quantize weights and activations to FP8 (~2× faster, ~50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
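
The KV cache can be quantized alongside the weights, which compounds the memory savings on long contexts. A sketch under the same QuantConfig assumption (kv_cache_quant_algo is the field name used in the library's quantization examples):

# FP8 weights and activations plus an FP8 KV cache
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quant_config=QuantConfig(
        quant_algo=QuantAlgo.FP8,
        kv_cache_quant_algo=QuantAlgo.FP8,
    ),
)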

Multi-GPU deployment

# Tensor parallelism across 8 GPUs, with FP8 quantization
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8)
)

Batch inference

# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# Automatic in-flight batching for maximum throughput

Performance benchmarks

Meta Llama 3-8B (H100 GPU):

  • Throughput: 24,000 tokens/sec
  • Latency: ~10ms per token
  • vs PyTorch: up to 100× faster than a baseline PyTorch implementation

Llama 3-70B (8× A100 80GB):

  • FP8 quantization: 2× faster than FP16
  • Memory: 50% reduction with FP8

Supported models

  • LLaMA family: Llama 2, Llama 3, CodeLlama
  • GPT family: GPT-2, GPT-J, GPT-NeoX
  • Qwen: Qwen, Qwen2, QwQ
  • DeepSeek: DeepSeek-V2, DeepSeek-V3
  • Mixtral: Mixtral-8x7B, Mixtral-8x22B
  • Vision: LLaVA, Phi-3-vision
  • 100+ models on HuggingFace
