# TensorRT-LLM

## Docs

- [LLM](https://mintlify.wiki/NVIDIA/TensorRT-LLM/api/llm.md): Main class for running LLM inference with TensorRT-LLM
- [RequestOutput](https://mintlify.wiki/NVIDIA/TensorRT-LLM/api/request-output.md): Output structure for LLM generation requests
- [SamplingParams](https://mintlify.wiki/NVIDIA/TensorRT-LLM/api/sampling-params.md): Configuration for text generation sampling strategies
- [TokenizerBase](https://mintlify.wiki/NVIDIA/TensorRT-LLM/api/tokenizer.md): Tokenizer interface for TensorRT-LLM
- [trtllm-bench](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-bench.md): Benchmark TensorRT-LLM models for throughput and latency performance
- [trtllm-build](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-build.md): Build optimized TensorRT engines from model checkpoints
- [trtllm-eval](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-eval.md): Evaluate TensorRT-LLM model accuracy on standard benchmarks
- [trtllm-prune](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-prune.md): Prune TensorRT-LLM checkpoints to reduce size
- [trtllm-refit](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-refit.md): Update TensorRT engine weights from checkpoints
- [trtllm-serve](https://mintlify.wiki/NVIDIA/TensorRT-LLM/cli/trtllm-serve.md): Launch an OpenAI-compatible API server for TensorRT-LLM models
- [System Architecture](https://mintlify.wiki/NVIDIA/TensorRT-LLM/concepts/architecture.md): Understanding the core architecture of TensorRT-LLM
- [Execution Backends](https://mintlify.wiki/NVIDIA/TensorRT-LLM/concepts/backends.md): Understanding the three execution backends in TensorRT-LLM
- [Optimization Techniques](https://mintlify.wiki/NVIDIA/TensorRT-LLM/concepts/optimization-techniques.md): Advanced performance optimizations in TensorRT-LLM
- [LLM Arguments Configuration](https://mintlify.wiki/NVIDIA/TensorRT-LLM/config/llm-args.md): Complete reference for LlmArgs configuration options in TensorRT-LLM
- [Model Configuration](https://mintlify.wiki/NVIDIA/TensorRT-LLM/config/model-config.md): PretrainedConfig reference for model-specific configuration in TensorRT-LLM
- [Runtime Configuration](https://mintlify.wiki/NVIDIA/TensorRT-LLM/config/runtime-config.md): Executor, scheduler, and runtime performance configuration for TensorRT-LLM
- [Distributed Inference](https://mintlify.wiki/NVIDIA/TensorRT-LLM/deployment/distributed-inference.md): Scale TensorRT-LLM across multiple GPUs and nodes
- [Python LLM API](https://mintlify.wiki/NVIDIA/TensorRT-LLM/deployment/llm-api.md): Use TensorRT-LLM programmatically with the high-level Python API
- [Production Deployment](https://mintlify.wiki/NVIDIA/TensorRT-LLM/deployment/production.md): Best practices for deploying TensorRT-LLM in production environments
- [OpenAI-Compatible Server](https://mintlify.wiki/NVIDIA/TensorRT-LLM/deployment/trtllm-serve.md): Deploy TensorRT-LLM with trtllm-serve for production inference
- [Adding New Models](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/adding-models.md): Learn how to add custom model architectures to TensorRT-LLM
- [AutoDeploy Backend](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/autodeploy.md): Beta backend for seamless PyTorch to TensorRT-LLM deployment
- [Building TensorRT-LLM from Source](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/build-from-source.md): Instructions for building TensorRT-LLM from source code on Linux, including Docker setup, dependencies, and compilation options
- [CI/CD Overview](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/ci-overview.md): Continuous integration pipeline, testing strategy, and Jenkins integration
- [TensorRT-LLM Coding Guidelines](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/coding-guidelines.md): Coding standards and style guidelines for C++ and Python development in TensorRT-LLM
- [Contributing to TensorRT-LLM](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/contributing.md): Guidelines for contributing code, submitting pull requests, and participating in the TensorRT-LLM open source project
- [Custom CUDA Kernels](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/custom-kernels.md): Learn how to write and integrate custom CUDA kernels in TensorRT-LLM
- [Disaggregated Serving](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/disaggregated-serving.md): Separate prefill and decode phases for optimized LLM serving
- [Python Plugins](https://mintlify.wiki/NVIDIA/TensorRT-LLM/developer/plugins.md): Create Python plugins for TensorRT-LLM's TensorRT backend
- [Attention Mechanisms](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/attention-mechanisms.md): Optimized multi-head, multi-query, and grouped-query attention with Flash Attention, XQA kernels, and paged KV cache
- [KV Cache System](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/kv-cache.md): Optimize memory usage and enable cross-request reuse with paged KV cache, block management, and cache salting
- [LoRA (Low-Rank Adaptation)](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/lora.md): Parameter-efficient fine-tuning with dynamic LoRA adapter loading, multi-LoRA support, and quantization compatibility
- [Multimodal Support](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/multimodal.md): Efficient inference for vision-language and audio models with optimized multimodal encoders and KV cache reuse
- [Parallelism Strategies](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/parallelism.md): Scale LLM inference across multiple GPUs with tensor, pipeline, data, expert, and context parallelism
- [Quantization](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/quantization.md): Reduce memory footprint and accelerate inference with FP8, FP4, INT4, and INT8 quantization techniques
- [Speculative Decoding](https://mintlify.wiki/NVIDIA/TensorRT-LLM/features/speculative-decoding.md): Accelerate LLM inference at low batch sizes using draft-target models, EAGLE, N-gram, and MTP techniques
- [Installation](https://mintlify.wiki/NVIDIA/TensorRT-LLM/installation.md): Install TensorRT-LLM using Docker, pip, or build from source on Linux systems
- [Introduction to TensorRT-LLM](https://mintlify.wiki/NVIDIA/TensorRT-LLM/introduction.md): Learn about TensorRT-LLM, NVIDIA's open-source library for optimizing Large Language Model inference on GPUs
- [Custom Models](https://mintlify.wiki/NVIDIA/TensorRT-LLM/models/custom-models.md): Add custom model architectures to TensorRT-LLM
- [Model Configuration](https://mintlify.wiki/NVIDIA/TensorRT-LLM/models/model-configuration.md): Configure TensorRT-LLM models with PretrainedConfig and model-specific parameters
- [Supported Models](https://mintlify.wiki/NVIDIA/TensorRT-LLM/models/supported-models.md): Complete list of model architectures supported by TensorRT-LLM
- [Benchmarking](https://mintlify.wiki/NVIDIA/TensorRT-LLM/performance/benchmarking.md): Comprehensive guide to benchmarking TensorRT-LLM with trtllm-bench
- [Optimization Guide](https://mintlify.wiki/NVIDIA/TensorRT-LLM/performance/optimization-guide.md): Performance tuning best practices for TensorRT-LLM
- [Profiling](https://mintlify.wiki/NVIDIA/TensorRT-LLM/performance/profiling.md): Profile and analyze TensorRT-LLM performance with NVIDIA Nsight Systems
- [Quickstart Guide](https://mintlify.wiki/NVIDIA/TensorRT-LLM/quickstart.md): Get started with TensorRT-LLM in minutes - run your first inference using Docker and the LLM API