Training large language models (LLMs) is a significant undertaking, but a more pervasive and often overlooked cost challenge is AI inference. Inference is the procedure by which a trained AI model processes new input data and generates an output. As organizations deploy these models in production, the costs can quickly become substantial, especially with high token volumes, long prompts, and growing usage demands. To run LLMs in a cost-effective and high-performing way, a comprehensive strategy is essential.

This approach addresses two critical areas: optimizing the inference runtime and optimizing the model itself.

Optimizing the inference runtime

Basic serving methods often struggle with inefficient GPU memory usage, suboptimal batch processing, and slow token generation. This is where a high-performance inference runtime becomes critical. vLLM is the de facto standard open source library for running LLM inference efficiently and at scale.

vLLM addresses these runtime challenges with advanced techniques, including:

  • Continuous batching: Instead of processing requests one at a time or waiting for a fixed-size batch to fill, vLLM continuously adds new sequences to the running batch and retires finished ones at each generation step. This minimizes GPU idle time and significantly improves GPU utilization and inference throughput.
  • PagedAttention: This memory management strategy efficiently handles large key-value (KV) caches. By dynamically allocating and managing GPU memory pages, PagedAttention greatly increases the number of concurrent requests and supports longer sequences without memory bottlenecks. A minimal usage sketch follows this list.
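
Because these optimizations are built into the engine, taking advantage of them mostly comes down to loading a model through vLLM's API. The following is a minimal offline-inference sketch; the model ID and tuning values are illustrative assumptions rather than recommendations.

```python
# Minimal vLLM offline inference sketch. Continuous batching and
# PagedAttention are handled automatically by the engine; the model name,
# memory fraction, and batch limit below are illustrative, not prescriptive.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the benefits of continuous batching in one sentence.",
    "What is PagedAttention?",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any Hugging Face model ID
    gpu_memory_utilization=0.90,               # fraction of GPU memory for weights + KV cache
    max_num_seqs=256,                          # cap on concurrently batched sequences
)

# All prompts are scheduled together; sequences join and leave the running
# batch as they start and finish, keeping the GPU busy throughout.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```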

Optimizing the AI model

In addition to optimizing the runtime, organizations can also compress models to reduce their memory footprint and computational requirements. The two primary techniques are quantization and sparsity.

  • Quantization: This technique reduces the precision of a model’s numerical values, specifically its weights and activations, by representing them with fewer bits per value. This significantly reduces the memory needed to store model parameters. For example, a 70-billion-parameter Llama model can shrink from approximately 140 GB to as little as 40 GB, so it can run on fewer resources and roughly double computational throughput without significantly degrading accuracy. A quick back-of-the-envelope check of these savings follows this list.
  • Sparsity: This technique reduces computational demands by setting some of the model’s parameters to zero, allowing systems to skip the corresponding operations. This can substantially reduce model complexity, decreasing memory usage and computational load, which results in faster inference and lower operational costs.
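
Weight memory scales roughly linearly with bits per parameter, which is where the figures above come from. The snippet below reproduces the Llama 70B estimate; it counts only weights and ignores KV cache, activations, and quantization metadata, which is why real deployments land a bit higher than the raw math.

```python
# Back-of-the-envelope weight-memory estimate for a 70B-parameter model.
# Only weights are counted; KV cache and activation overhead come on top.
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB for simplicity

params = 70e9
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.0f} GB")

# Prints roughly: FP16 ~140 GB, INT8 ~70 GB, INT4 ~35 GB — consistent with
# the ~140 GB to ~40 GB range cited above once overhead is included.
```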

Red Hat AI: Putting the strategy into practice 

To help organizations implement this strategic approach, the Red Hat AI portfolio provides a unified set of solutions for achieving high-performance inference at scale.

Red Hat AI addresses both model and runtime optimization through its powerful set of tools and assets:

  • Red Hat AI Inference Server: Red Hat provides an enterprise-ready and supported vLLM engine that uses continuous batching and memory-efficient techniques such as PagedAttention. By increasing throughput and making better use of each GPU, the runtime helps organizations maximize the return on their expensive AI hardware.
  • Access to validated and optimized models: Red Hat AI provides access to a repository of pre-evaluated and performance-tested models that are ready for use. These models are rigorously benchmarked against multiple evaluation tasks and can be found on the Red Hat AI Hugging Face repository, which allows organizations to achieve rapid time to value.
  • Included LLM Compressor: Red Hat AI includes LLM Compressor, a toolkit that provides a standardized way to apply compression techniques like quantization. It is the same toolkit Red Hat uses to produce its optimized models, and it lets customers optimize their own fine-tuned or customized models. A sketch of this workflow follows the list.
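
For teams compressing their own fine-tuned models, the open source llm-compressor project (the upstream for LLM Compressor) exposes a one-shot workflow along these lines. Treat this as a sketch under assumptions: import paths, scheme names, and the stand-in model ID may differ by version, so check the project documentation before running it.

```python
# One-shot quantization outline with llm-compressor. Import paths and
# scheme names vary across releases; this shows the shape of the workflow
# rather than copy-paste-ready code.
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # stand-in model ID

# Quantize Linear-layer weights to 4 bits while keeping activations in
# 16-bit, and leave the output head in full precision to protect accuracy.
# Note: some schemes also expect a calibration dataset passed to oneshot().
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=MODEL_ID,
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-W4A16",  # compressed checkpoint
)
```

The resulting checkpoint can then be served by the vLLM-based Red Hat AI Inference Server described above.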

By leveraging Red Hat AI, organizations can deploy high-performing, cost-effective models on a wide variety of hardware setups, helping teams meet rising AI demands while controlling costs and complexity.

To learn more about the fundamentals of inference performance engineering and model optimization, download the free e-book, Get started with AI Inference.


About the author

Carlos Condado is a Senior Product Marketing Manager for Red Hat AI. He helps organizations navigate the path from AI experimentation to enterprise-scale deployment by guiding the adoption of MLOps practices and integration of AI models into existing hybrid cloud infrastructures. As part of the Red Hat AI team, he works across engineering, product, and go-to-market functions to help shape strategy, messaging, and customer enablement around Red Hat’s open, flexible, and consistent AI portfolio.

With a diverse background spanning data analytics, integration, cybersecurity, and AI, Carlos brings a cross-functional perspective to emerging technologies. He is passionate about technological innovations and helping enterprises unlock the value of their data and gain a competitive advantage through scalable, production-ready AI solutions.
