
It seems that every large language model (LLM) gets larger with each release. Training these models requires a large number of GPUs, and more resources are needed throughout their lifecycle for fine tuning, inference and so on. There’s a new Moore’s law for LLMs: model size (measured by the number of parameters) is doubling every four months.

 


LLMs are costly

Training and operating an LLM is expensive in terms of resources, time and money. Those resource requirements have a direct impact on companies deploying an LLM, whether it runs on their own infrastructure or on a hyperscaler. And LLMs are getting even bigger.

Consider Llama 3.1 as an example of how demanding operating an LLM can be. The largest Llama 3.1 model has 405 billion parameters and requires 810GB of memory (FP16) for inference alone, while fine tuning it requires 3.25TB of memory. The Llama 3.1 model family was trained on 15 trillion tokens, consuming 39 million GPU hours on a GPU cluster. With LLM size increasing exponentially, the computing and memory requirements for training and operating these models are growing as well.
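To see where a figure like 810GB comes from, here’s a minimal back-of-the-envelope sketch in Python. It counts weight storage only (no activations, KV cache or framework overhead); the 405-billion-parameter count is the published Llama 3.1 size, and the rest is simple arithmetic.

```python
# Rough inference memory estimate: one stored value per parameter,
# at a given precision. Weights only; activations, KV cache and
# framework overhead are ignored.
PARAMS = 405e9  # Llama 3.1 405B

BYTES_PER_PARAM = {
    "FP32": 4,
    "FP16": 2,
    "INT8": 1,
    "INT4": 0.5,
}

for dtype, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9  # decimal gigabytes
    print(f"{dtype}: ~{gb:,.0f} GB of weights")

# FP32: ~1,620 GB   FP16: ~810 GB   INT8: ~405 GB   INT4: ~202 GB
```

The same arithmetic also hints at why the compression techniques discussed below are attractive: halving the precision halves the weight memory.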

There was a severe shortage of GPUs last year, and just as the supply and demand gap is improving, the next bottleneck is expected to be power. With more data centers coming online and electricity consumption for each data center doubling to almost 150MW, it’s easy to see why this would become a problem for the AI industry.

How to make LLMs less expensive

Before we discuss how to make LLMs less costly, consider an example we’re all familiar with. Cameras keep improving, with new models taking higher resolution pictures than ever before. But the raw image files can be 40 MB or more each. Unless you’re a media professional who needs to manipulate those images, you’re probably happy with a JPEG version, which shrinks the file size by 80%. Sure, JPEG compression reduces the image quality compared with the raw original, but for most purposes JPEG is good enough. Raw images also usually require special applications to process and view, so working with them carries a higher computational cost than working with a JPEG.

Now let’s get back to LLMs. Model size depends on the number of parameters, so one approach is to use a model with fewer parameters. All popular open source models come in a range of parameter sizes, allowing you to choose the one that best suits a specific application.

 


However, an LLM with more parameters generally outperforms one with fewer parameters across most benchmarks. To lower the resource requirements, it may be better to take a larger model and compress it to a smaller size. Tests have shown that GAN compression can reduce computation by almost 20 times.

There are dozens of approaches to compressing an LLM, including quantization, pruning, knowledge distillation and layer reduction.

Quantization

Quantization converts the numerical values in a model from 32-bit floating point to a lower precision data type: 16-bit floating point, 8-bit integer, 4-bit integer or even 2-bit integer. With a lower precision data type, each value needs fewer bits, which reduces both memory use and computation. Quantization can be done after the model is trained (post-training quantization) or during training (quantization-aware training).
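To make the mechanics concrete, here is a minimal sketch of symmetric post-training quantization to int8 in PyTorch. It is an illustration only; production tools such as bitsandbytes or GPTQ use per-channel or group-wise scales and more careful rounding, and the tensor size here is made up.

```python
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric post-training quantization of a weight tensor to int8."""
    scale = weights.abs().max() / 127                      # map the largest magnitude to 127
    q = torch.clamp((weights / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover approximate FP32 values for use during computation."""
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                                # toy FP32 weight matrix
q, scale = quantize_int8(w)

print(f"FP32 size: {w.numel() * 4 / 1e6:.1f} MB")          # ~67.1 MB
print(f"INT8 size: {q.numel() * 1 / 1e6:.1f} MB")          # ~16.8 MB
print(f"max abs error: {(w - dequantize(q, scale)).abs().max():.4f}")
```

The 4x memory saving comes purely from storing 1 byte per value instead of 4; the error printed at the end is the accuracy cost of that saving.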

 


As we move to fewer bits, there is a tradeoff between compression and model performance. This paper highlights the ideal balance offered by 4-bit quantization for larger models (at least 70 billion parameters). Anything lower shows a noticeable performance gap between the LLM and its quantized counterpart. For smaller models, 6-bit or 8-bit quantization may be a better choice.

Thanks to quantization, I can run this LLM RAG demo on my laptop.

Pruning 

Pruning reduces model size by eliminating less important weights or neurons. It requires a delicate balance between reducing model size and preserving accuracy. Pruning can be done before, during or after model training. Layer pruning takes the idea further by removing whole blocks of layers; in this paper, the authors report that up to 50% of layers can be removed with minimal degradation in performance.
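As a rough illustration, here is unstructured magnitude pruning of a single layer using PyTorch's built-in pruning utilities. The layer and the 50% sparsity target are arbitrary, and note that zeroed weights only translate into real memory or speed savings when they are stored in a sparse format or run on hardware that exploits sparsity.

```python
import torch
import torch.nn.utils.prune as prune

# A toy linear layer standing in for one weight matrix of a transformer block.
layer = torch.nn.Linear(4096, 4096)

# Unstructured magnitude pruning: zero the 50% of weights with the
# smallest absolute value (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Fold the pruning mask into the weight tensor permanently.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")                  # ~50%
```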

Knowledge distillation

Knowledge distillation transfers knowledge from a large model (the teacher) to a smaller model (the student). The student is trained to reproduce the teacher’s outputs rather than being trained from scratch on the full training data alone.

This paper shows how distilling Google’s BERT model into DistilBERT reduced the model size by 40% and increased inference speed by 60%, while retaining 97% of BERT’s language understanding capabilities.
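A minimal sketch of the classic distillation loss (soft targets from the teacher blended with the usual hard-label loss) is shown below. The temperature and weighting are illustrative, and DistilBERT’s actual training adds further terms such as a cosine embedding loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target loss (match the teacher) and hard-target loss."""
    # Soften both distributions; the temperature exposes the teacher's
    # knowledge about how similar the classes are to each other.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy batch: 8 examples, 10 classes; in practice the logits come from
# forward passes of the teacher and student models.
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```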

Hybrid approach

While each of these compression techniques is helpful on its own, a hybrid approach that combines different techniques often works best. This paper points out that 4-bit quantization reduces memory needs by four times, but not the amount of computation, as measured in FLOPs (floating-point operations). Combining quantization with layer pruning reduces both memory and computation.
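As a toy illustration of combining the two techniques, the sketch below drops half the blocks of a stand-in model and then applies PyTorch's dynamic int8 quantization to what remains. It shows the mechanics only; the cited work selects which layers to remove based on importance scores rather than simply keeping every other block.

```python
import torch
import torch.nn as nn

# A stand-in "transformer" as a stack of identical feed-forward blocks.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
     for _ in range(24)]
)

# Step 1 - layer pruning: keep every other block (drop 50% of the layers).
pruned = nn.Sequential(*[blocks[i] for i in range(0, len(blocks), 2)])

# Step 2 - quantization: dynamic int8 quantization of the remaining Linear layers.
quantized = torch.quantization.quantize_dynamic(pruned, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)                                  # torch.Size([1, 512])
```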

 


Benefits of a smaller model

Using smaller models can significantly reduce computational requirements, while maintaining a high level of performance and accuracy.

  • Lower computation costs: Smaller models lower CPU and GPU requirements, which can result in significant cost savings. Considering that high-end GPUs can cost as much as $30,000 each, any reduction in computational cost is good news
  • Decreased memory use: Smaller models require less memory than their larger counterparts, which helps with deployment to resource-constrained systems like IoT devices or mobile phones
  • Faster inference: Smaller models load and execute more quickly, which reduces inference latency. Faster inference can make a big difference for real-time applications like autonomous vehicles
  • Reduced carbon footprint: The reduced computational requirements of smaller models improve energy efficiency, thereby reducing environmental impact
  • Deployment flexibility: Lower computational requirements increase flexibility for deploying models where they’re needed. Models can be deployed to fit dynamically changing user needs or system constraints, including at the edge of the network

Smaller and less expensive models are gaining popularity, as evidenced by the recent release of GPT-4o mini (60% cheaper than GPT-3.5 Turbo) and open source innovations such as SmolLM and Mistral NeMo:

  • Hugging Face SmolLM: A family of small models with 135 million, 360 million, and 1.7 billion parameters
  • Mistral NeMo: A small model with 12 billion parameters, built in collaboration with Nvidia

This trend towards SLMs (small language models) is driven by the benefits discussed above. There are lots of options with smaller models: use a pre-built small model, or use compression techniques to shrink an existing LLM. Your use case should drive the approach you take when choosing a small model, so consider your options carefully.


About the author

Ishu Verma is a Technical Evangelist at Red Hat focused on emerging technologies like edge computing, IoT and AI/ML. He and fellow open source hackers work on building solutions with next-gen open source technologies. Before joining Red Hat in 2015, Verma worked at Intel on IoT gateways and on building end-to-end IoT solutions with partners. He has been a speaker and panelist at IoT World Congress, DevConf, Embedded Linux Forum, Red Hat Summit and other on-site and virtual forums. He lives in the Valley of the Sun, Arizona.

