
What if you could get similar results from your large language model (LLM) with 75% less GPU memory? In my previous article, we discussed the benefits of smaller LLMs and some of the techniques for shrinking them. In this article, we'll put that to the test by comparing the results of the smaller and larger versions of the same LLM.

As you'll recall, quantization is one of the techniques for reducing the size of an LLM. Quantization achieves this by representing the LLM parameters (e.g. weights) in lower-precision formats: from 32-bit floating point (FP32) down to 8-bit integer (INT8) or 4-bit integer (INT4). The tradeoff is a possible reduction in accuracy, so the goal is to achieve these savings with a reasonable reduction in the accuracy of predictions.
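To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization of a weight tensor in plain NumPy. This only illustrates the general technique; it is not how Ollama or the published quantized Mistral builds are actually produced.

import numpy as np

# Toy FP32 weight tensor (illustrative values, not real model weights)
weights_fp32 = np.array([0.82, -1.47, 0.003, 2.10, -0.65], dtype=np.float32)

# Symmetric quantization: map the largest absolute weight to the INT8 limit (127)
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# At inference time the INT8 values are scaled back to approximate the originals
weights_dequant = weights_int8.astype(np.float32) * scale

print(weights_int8)      # stored in 1 byte per weight instead of 4
print(weights_dequant)   # close to, but not exactly, the FP32 originals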

Are quantized LLMs accurate enough?

All the benefits of smaller LLMs are moot if the results are not accurate enough to be useful. There are a number of benchmarks available that measure and compare model accuracy, but for the sake of simplicity, let's manually inspect the quality of responses from non-quantized and quantized LLMs.

Setup

The setup is an AWS g6.12xlarge instance (4 NVIDIA L4 Tensor Core GPUs, 96GB GPU memory, 48 vCPUs, 192GB CPU memory). We'll be using the Mistral 7B Instruct LLM at FP32, INT8 and INT4 precision. The model is served with Ollama, and system performance is monitored with the nvtop utility.
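For reference, here is a sketch of how the prompts below can be sent to Ollama's local REST API from Python. The model tags are placeholders; the actual tags for the FP32, INT8 and INT4 builds depend on how the Mistral 7B Instruct models were pulled or imported into your Ollama instance.

import requests

# Hypothetical tags for the three precisions; substitute the tags of your local models
MODELS = ["mistral-7b-instruct-fp32", "mistral-7b-instruct-int8", "mistral-7b-instruct-int4"]

def ask(model: str, prompt: str) -> str:
    # Ollama serves a local REST API on port 11434 by default
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompt = "In 20 words or less, describe the theory of relativity"
for model in MODELS:
    print(f"--- {model} ---")
    print(ask(model, prompt))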

Let’s get started with a set of prompts focusing on different areas of knowledge.

Prompt #1: In 20 words or less, describe the theory of relativity

FP32: Groundbreaking work by Albert Einstein; connected space, time, gravity, & energy, challenging traditional science views. Consists of special & general relativity theories.

INT8: The theory of relativity describes the relationship between space and time, including the effects of gravity on both, and sets the foundational principles for modern physics.

INT4: The theory of relativity describes the laws of physics in a frame of reference where space and time are interconnected; it involves the famous equation E=mc^2.

Observation: For generic factual information, the responses are accurate and similar for FP32, INT8 and INT4.

Prompt #2: Write a Python function that calculates the factorial of a number

FP32, INT8 and INT4 (identical output):

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)

Observation: For coding, the responses are accurate and identical for FP32, INT8 and INT4.

Prompt #3: Write a Python function that sorts numbers 8, 1, 9, 2, 5, 4 in ascending order

FP32, INT8 and INT4 (identical output):

def sort_numbers(numbers):
    sorted_numbers = sorted(numbers)
    return sorted_numbers

numbers = [8, 1, 9, 2, 5, 4]
sorted_numbers = sort_numbers(numbers)
print(sorted_numbers)

Observation: For coding, the responses are accurate and identical for FP32, INT8 and INT4.

Prompt #4:  In 10 words or less, describe which country won the most total medals in the Olympics 2024

FP32: Not specified; as of the time of this response, the Olympic Games 2024 have not yet taken place. The country that won the most total medals and gold medals will be determined during the competition itself.

INT8: I can't predict future events, but as of now, no Olympics have taken place in the year 2024. So it is not possible for me to answer this question at the moment.

INT4: Not specified yet as the Olympic Games 2024 have not taken place.

Observation: The LLM acknowledges its lack of information, as it was trained before the event took place. The INT4 response (12 words) is the most concise and FP32 is the most verbose (36 words).

Prompt #5: In 10 words or less, describe OPC/UA protocol and where it is used

FP32: OPC UA (open platform communications unified architecture) is a standardized communication protocol for industrial automation systems; it is used in manufacturing, energy management, building automation, and other related industries.

INT8: OPC UA (open platform communications unified architecture) is a standardized industrial communication protocol for interoperability among devices and systems, widely used in manufacturing automation and IoT.

INT4: OPC UA (open platform communication unified architecture) is a cross-platform data exchange standard for industrial automation; it is used in manufacturing industries for communication between devices and systems.

Observation: For a topic related to a specific industry segment (e.g. IoT), the responses are accurate and quite similar for FP32, INT8 and INT4.

Prompt #6: Summarize this in a simple sentence:

“The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.”

FP32: The new transformer model based solely on attention mechanisms outperforms existing machine translation models in quality, parallelizability, and training time while also generalizing well to other tasks like English constituency parsing.

INT8: The paper introduces a new deep learning model called transformer for machine translation tasks, which outperforms existing models in quality while being more parallelizable and requiring less time to train, achieving state-of-the-art BLEU scores on WMT 2014 English-to-German and English-to-French translations. The model also demonstrates generalization to other tasks like English constituency parsing.

INT4: The paper introduces a new neural network architecture called the transformer that outperforms existing sequence transduction models, particularly in machine translation tasks, while being more efficient and parallelizable.

Observation: For the task of summarizing the abstract of the "Attention Is All You Need" paper, the responses are accurate and quite similar. INT8 gives the lengthiest response at two sentences vs. one for FP32 and INT4.

Prompt #7:  A farmer has 17 chickens and cows in total. If the total number of legs is 44, how many cows are there, given that each chicken has 2 legs and each cow has 4 legs, and there are no other animals present?

FP32:
To solve this problem, we need to find the number of chickens and cows from the given information about their total number of legs.
Chickens have 2 legs each, so if there are C chickens in total, then the number of chicken legs is:
C * 2 = 17 (total number of chicken legs)
C = 8.5
[......Response truncated]
In this case, we have found a possible solution with C = 9 chickens and cows = 6. It's important to note that there could be another valid solution by rounding down the number of chickens instead. However, assuming no other animals are present, our current answer seems plausible since the total number of animals (chickens + cows) is 15 in both cases, which is within the range of possible solutions based on the total number of legs provided.

INT8:
There are 8 cows, since each chicken has 2 legs (17 * 2 = 34) and each cow has 4 legs, so the total number of legs for all cows is 8 * 4 = 32. Therefore, the total number of legs minus the number of chicken legs should equal the number of cow legs, which gives us (44 - 34) = 10. Since each cow has 4 legs, there are 10 / 4 = 8 cows in total.

INT4:
There are 6 cows since each cow has 4 legs and the total number of legs for all the animals (chickens and cows) is 44. To find out how many chickens there are:
1. We know that there are 6 cows with a total of 24 legs (since each cow has 4 legs).
2. This leaves us with 18 legs for the chickens (44 - 24 = 18).
3. Since each chicken has 2 legs, there must be 9 chickens (18 / 2 = 9).
So, there are 6 cows and 9 chickens in total.

Note: The correct answer is 12 chickens and 5 cows (12*2 + 5*4 = 44).
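As a quick sanity check on the note above, a few lines of Python brute-force the two constraints (17 animals in total, 44 legs in total):

# Brute-force the system: chickens + cows = 17 and 2*chickens + 4*cows = 44
for cows in range(18):
    chickens = 17 - cows
    if 2 * chickens + 4 * cows == 44:
        print(f"{chickens} chickens, {cows} cows")  # prints: 12 chickens, 5 cows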

Observation: All versions of the LLM are pretty bad at this arithmetic problem! The only difference is that the INT8 (71 words) and INT4 (91 words) answers are somewhat more concise than FP32 (272 words).

Savings with a quantized LLM

We can see from the above examples that quantized LLMs are reasonably accurate in their predictions. Now, let’s look at the GPU resource consumption during inference for the various versions of LLMs.

GPU consumption spikes during inference (FP32)

Now let's look at a comparison of resource consumption across different versions of the same LLM.

GPU consumption for various versions of a LLM

The GPU consumption during inference reduces slightly as we move from FP32 to INT4 (28% to 24%), but the GPU memory consumption reduces dramatically (37% to 8%).
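The memory savings line up with a back-of-the-envelope estimate: a 7B-parameter model needs roughly 4 bytes per weight at FP32 but only 1 byte at INT8 and half a byte at INT4. The sketch below counts weight storage only and ignores activations, KV cache and runtime overhead, so the real numbers on the instance will be higher.

params = 7e9  # Mistral 7B has roughly 7 billion parameters

# Approximate weight storage only; activations, KV cache and runtime overhead add more
for precision, bytes_per_param in [("FP32", 4), ("INT8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1024**3
    print(f"{precision}: ~{gb:.1f} GB of GPU memory for the weights")
# FP32: ~26.1 GB, INT8: ~6.5 GB, INT4: ~3.3 GB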

What about enterprise use cases?

For an enterprise looking to benefit from generative AI, an LLM needs to be benchmarked for specific use cases rather than for encyclopedic knowledge or mathematical skills. These use cases could encompass tasks like a customer service chatbot or text summarization focused on a specific domain or on company data.

New hire orientation for an LLM

An off-the-shelf LLM is like a new college graduate: both have good general knowledge and natural language capabilities. When a new college graduate joins a company, the employee needs to be trained on company (and industry) specific knowledge and tasks before they can be productive. For example, when someone hears the word "cloud", they will most likely think of other related words (white, dark, sky, rain, storm…). The context is a key factor in determining the priority of these related words. For an LLM, this priority is denoted by its parameters (e.g. weights).

For an IT enterprise, the term "cloud" will carry different weights than for an agriculture business. For an employee (or LLM) in an IT company, the word "cloud" will have higher weights for related words like computing, AWS and Azure, while for an employee (or LLM) of an agricultural business, the word "cloud" will have higher weights for related words like rain, precipitation and storm.

On-the-job training for a new employee is similar to fine-tuning an LLM. In both instances, the goal is to get better at specific tasks within a given domain. For an employee, this could entail in-house training or shadowing experienced colleagues. For an LLM, this means fine-tuning the model with a specific set of data or tasks, which adjusts the weights and results in improved performance.

Note: To be fair, it's worth mentioning that it's possible to add context or specific knowledge when querying the LLM through retrieval augmented generation (RAG). But providing context with every query is cumbersome and doesn't suit fast-response applications like a support chatbot. This means fine-tuning an LLM with company-specific knowledge or tasks becomes a necessity for enterprise use cases.
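To illustrate what "providing context with every query" means, here is a minimal RAG-style sketch. The retrieve() function and the document text are placeholders invented for this example, not part of any specific framework.

def retrieve(query: str) -> str:
    # Placeholder: a real system would search a vector store of company documents
    return "Policy FAQ: The flux capacitor for the DMC-12 is covered up to $10,000,000."

def build_prompt(query: str) -> str:
    # The retrieved context has to be prepended to every single query
    context = retrieve(query)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How much does the flux capacitor cost for DMC-12?"))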

There are several approaches to fine-tuning an LLM, including reinforcement learning from human feedback (RLHF), knowledge distillation with student/teacher models, and parameter-efficient fine-tuning with LoRA. How can an enterprise fine-tune an off-the-shelf LLM without incurring the massive costs (data science skills, infrastructure resources) involved in retraining a model?
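As one example of what parameter-efficient fine-tuning means in practice, here is a conceptual PyTorch sketch of a LoRA adapter: the original weight matrix stays frozen and only two small low-rank matrices are trained. This is an illustration of the general idea only, not the mechanism any particular tool uses internally.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update (W + B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # original weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen path plus the small trainable low-rank correction
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")  # 65,536 vs. ~16.8M weights in the full layer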

Fine-tuning an LLM for everyday users

InstructLab is an open source project that allows everyday users (as opposed to data scientists) to add new knowledge and skills to AI models using a simple question-and-answer format. This could be used by enterprises that want to customize an LLM with company-specific knowledge or tasks. Let's put it into practice with sample data from this repository.

We'll assume this is an insurance company customer service chatbot being asked about information that it doesn't already have (in this case, a fictional car from the movie "Back to the Future"). Once again, we'll use the Mistral 7B Instruct model that we used earlier.

Prompt (pre fine-tuning):  How much does the flux capacitor cost for DMC-12

The cost of a flux capacitor for a DeLorean DMC-12, as depicted in the Back to the Future movie series, is not specified and is generally considered fictional. In reality, similar time travel technology is not currently available or priced.

Observation: The LLM accurately points out the lack of information for this fictional item.

Now let's fine-tune this LLM by adding the missing knowledge (i.e. the cost of the flux capacitor: $10,000,000). In reality, this could be any proprietary data an enterprise may want to add, e.g. a product catalog, an actuarial table or medical insurance coverage rates.

Now let's see the results after fine-tuning the LLM.

Prompt (post fine-tuning): How much does the flux capacitor cost for DMC-12

The cost of a flux capacitor is $10,000,000.

Observation: The LLM accurately provides the information supplied during fine-tuning. Enterprises can shape the behavior of an LLM by tuning it with their proprietary data or customizing its responses for specific tasks.

If you’re interested in learning more about how to use InstructLab to add new knowledge and skills to a LLM of your choice, check out the interactive step-by-step InstructLab demo here.


About the author

Ishu Verma is Technical Evangelist at Red Hat focused on emerging technologies like edge computing, IoT and AI/ML. He and fellow open source hackers work on building solutions with next-gen open source technologies. Before joining Red Hat in 2015, Verma worked at Intel on IoT Gateways and on building end-to-end IoT solutions with partners. He has been a speaker and panelist at IoT World Congress, DevConf, Embedded Linux Forum, Red Hat Summit and other on-site and virtual forums. He lives in the Valley of the Sun, Arizona.
