# TensorRT-LLM

## Docs

- [LLM](https://mintlify.wiki/NVIDIA/TensorRT-LLM/api/llm.md): Main class for running LLM inference with TensorRT-LLM
- [RequestOutput](https://mintlify.wiki/NVIDIA/TensorRT-LLM/api/request-output.md): Output structure for LLM generation requests
- [SamplingParams](https://mintlify.wiki/NVIDIA/TensorRT-LLM/api/sampling-params.md): Configuration for text generation sampling strategies
- [TokenizerBase](https://mintlify.wiki/NVIDIA/TensorRT-LLM/api/tokenizer.md): Tokenizer interface for TensorRT-LLM
- [trtllm-bench](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-bench.md): Benchmark TensorRT-LLM models for throughput and latency performance
- [trtllm-build](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-build.md): Build optimized TensorRT engines from model checkpoints
- [trtllm-eval](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-eval.md): Evaluate TensorRT-LLM model accuracy on standard benchmarks
- [trtllm-prune](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-prune.md): Prune TensorRT-LLM checkpoints to reduce size
- [trtllm-refit](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-refit.md): Update TensorRT engine weights from checkpoints
- [trtllm-serve](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-serve.md): Launch an OpenAI-compatible API server for TensorRT-LLM models
- [System Architecture](https://mintlify.wiki/NVIDIA/TensorRT-LLM/concepts/architecture.md): Understanding the core architecture of TensorRT-LLM
- [Execution Backends](https://mintlify.wiki/NVIDIA/TensorRT-LLM/concepts/backends.md): Understanding the three execution backends in TensorRT-LLM
- [Optimization Techniques](https://mintlify.wiki/NVIDIA/TensorRT-LLM/concepts/optimization-techniques.md): Advanced performance optimizations in TensorRT-LLM
- [LLM Arguments Configuration](https://mintlify.wiki/NVIDIA/TensorRT-LLM/config/llm-args.md): Complete reference for LlmArgs configuration options in TensorRT-LLM
- [Model Configuration](https://mintlify.wiki/NVIDIA/TensorRT-LLM/config/model-config.md): PretrainedConfig reference for model-specific configuration in TensorRT-LLM
- [Runtime Configuration](https://mintlify.wiki/NVIDIA/TensorRT-LLM/config/runtime-config.md): Executor, scheduler, and runtime performance configuration for TensorRT-LLM
- [Distributed Inference](https://mintlify.wiki/NVIDIA/TensorRT-LLM/deployment/distributed-inference.md): Scale TensorRT-LLM across multiple GPUs and nodes
- [Python LLM API](https://mintlify.wiki/NVIDIA/TensorRT-LLM/deployment/llm-api.md): Use TensorRT-LLM programmatically with the high-level Python API
- [Production Deployment](https://mintlify.wiki/NVIDIA/TensorRT-LLM/deployment/production.md): Best practices for deploying TensorRT-LLM in production environments
- [OpenAI-Compatible Server](https://mintlify.wiki/NVIDIA/TensorRT-LLM/deployment/trtllm-serve.md): Deploy TensorRT-LLM with trtllm-serve for production inference
- [Adding New Models](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/adding-models.md): Learn how to add custom model architectures to TensorRT-LLM
- [AutoDeploy Backend](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/autodeploy.md): Beta backend for seamless PyTorch-to-TensorRT-LLM deployment
- [Building TensorRT-LLM from Source](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/build-from-source.md): Instructions for building TensorRT-LLM from source code on Linux, including Docker setup, dependencies, and compilation options
- [CI/CD Overview](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/ci-overview.md): Continuous integration pipeline, testing strategy, and Jenkins integration
- [TensorRT-LLM Coding Guidelines](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/coding-guidelines.md): Coding standards and style guidelines for C++ and Python development in TensorRT-LLM
- [Contributing to TensorRT-LLM](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/contributing.md): Guidelines for contributing code, submitting pull requests, and participating in the TensorRT-LLM open source project
- [Custom CUDA Kernels](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/custom-kernels.md): Learn how to write and integrate custom CUDA kernels in TensorRT-LLM
- [Disaggregated Serving](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/disaggregated-serving.md): Separate prefill and decode phases for optimized LLM serving
- [Python Plugins](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/plugins.md): Create Python plugins for TensorRT-LLM's TensorRT backend
- [Attention Mechanisms](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/attention-mechanisms.md): Optimized multi-head, multi-query, and grouped-query attention with Flash Attention, XQA kernels, and paged KV cache
- [KV Cache System](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/kv-cache.md): Optimize memory usage and enable cross-request reuse with paged KV cache, block management, and cache salting
- [LoRA (Low-Rank Adaptation)](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/lora.md): Parameter-efficient fine-tuning with dynamic LoRA adapter loading, multi-LoRA support, and quantization compatibility
- [Multimodal Support](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/multimodal.md): Efficient inference for vision-language and audio models with optimized multimodal encoders and KV cache reuse
- [Parallelism Strategies](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/parallelism.md): Scale LLM inference across multiple GPUs with tensor, pipeline, data, expert, and context parallelism
- [Quantization](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/quantization.md): Reduce memory footprint and accelerate inference with FP8, FP4, INT4, and INT8 quantization techniques
- [Speculative Decoding](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/speculative-decoding.md): Accelerate LLM inference at low batch sizes using draft-target models, EAGLE, N-gram, and MTP techniques
- [Installation](https://mintlify.wiki/NVIDIA/TensorRT-LLM/installation.md): Install TensorRT-LLM using Docker, pip, or build from source on Linux systems
- [Introduction to TensorRT-LLM](https://mintlify.wiki/NVIDIA/TensorRT-LLM/introduction.md): Learn about TensorRT-LLM, NVIDIA's open-source library for optimizing Large Language Model inference on GPUs
- [Custom Models](https://mintlify.wiki/NVIDIA/TensorRT-LLM/models/custom-models.md): Add custom model architectures to TensorRT-LLM
- [Model Configuration](https://mintlify.wiki/NVIDIA/TensorRT-LLM/models/model-configuration.md): Configure TensorRT-LLM models with PretrainedConfig and model-specific parameters
- [Supported Models](https://mintlify.wiki/NVIDIA/TensorRT-LLM/models/supported-models.md): Complete list of model architectures supported by TensorRT-LLM
- [Benchmarking](https://mintlify.wiki/NVIDIA/TensorRT-LLM/performance/benchmarking.md): Comprehensive guide to benchmarking TensorRT-LLM with trtllm-bench
- [Optimization Guide](https://mintlify.wiki/NVIDIA/TensorRT-LLM/performance/optimization-guide.md): Performance tuning best practices for TensorRT-LLM
- [Profiling](https://mintlify.wiki/NVIDIA/TensorRT-LLM/performance/profiling.md): Profile and analyze TensorRT-LLM performance with NVIDIA Nsight Systems
- [Quickstart Guide](https://mintlify.wiki/NVIDIA/TensorRT-LLM/quickstart.md): Get started with TensorRT-LLM in minutes: run your first inference using Docker and the LLM API