Quantizing an LLM to GGML or GGUF Format: A Comprehensive Guide #4068
-
Unless you actually want to do the conversion yourself, the easiest thing is to find a pre-converted model in the correct format. TheBloke has a massive number of models published in various formats: https://huggingface.co/TheBloke

This repo currently uses the GGUF format; GGML was the previous format. The LLM project you linked still uses GGML (though they're working on GGUF support).

This isn't going to be anything like a comprehensive guide, more like a very brief overview, but hopefully it still helps you a bit. If you want to quantize your own model to GGUF format (I'm assuming it's a LLaMA-type model), the basic workflow is to convert the Hugging Face checkpoint to an unquantized GGUF file and then quantize that file, as in the sketch below.
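A minimal sketch of those two steps, assuming a llama.cpp checkout at `./llama.cpp` (with the `quantize` tool built) and a LLaMA-type Hugging Face model in `./models/my-model`; the paths and the `q4_K_M` quant type are illustrative choices, not the only options:

```python
# Sketch of the llama.cpp GGUF conversion + quantization workflow.
# Assumes a llama.cpp checkout in ./llama.cpp and an HF model directory
# in ./models/my-model. Paths and the q4_K_M preset are illustrative.
import subprocess

# Step 1: convert the Hugging Face checkpoint to an unquantized (f16) GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert.py", "models/my-model",
     "--outtype", "f16", "--outfile", "models/my-model-f16.gguf"],
    check=True,
)

# Step 2: quantize the f16 GGUF down to 4-bit with the quantize tool.
subprocess.run(
    ["llama.cpp/quantize", "models/my-model-f16.gguf",
     "models/my-model-q4_K_M.gguf", "q4_K_M"],
    check=True,
)
```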
After writing all this I realized you may be asking about the algorithm itself rather than how to perform quantization with existing tools? If so, the rough idea is below.
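A simplified Python sketch of block-wise 4-bit quantization in the spirit of GGML's q4_0. This is not bit-exact with ggml (the real format packs two 4-bit values per byte and stores the per-block scale as fp16), but it shows the core idea: one scale per block of 32 weights, chosen from the block's absolute maximum:

```python
# Simplified block-wise 4-bit quantization, in the spirit of GGML's q4_0.
# Not bit-exact with ggml; for illustration only.
import numpy as np

BLOCK = 32  # ggml's q4_0 quantizes weights in blocks of 32

def quantize_q4(x: np.ndarray):
    """Quantize a 1-D float array to 4-bit ints plus one scale per block."""
    x = x.reshape(-1, BLOCK)
    # One scale per block, chosen so the largest magnitude maps to the
    # edge of the signed 4-bit range [-8, 7].
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate floats: x_hat = q * scale."""
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q4(w)
w_hat = dequantize_q4(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

The quantization error comes entirely from rounding within each block, which is why outlier weights in a block hurt: they inflate the scale and coarsen the grid for everything else in that block.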
-
Hi, I fine-tuned the mistral-7b model for my question-answering task (loaded in 4-bit and trained with LoRA/QLoRA), roughly as in the sketch below.
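For reference, a minimal sketch of that kind of QLoRA setup, assuming the Hugging Face transformers / peft / bitsandbytes stack; the model name, LoRA rank, and target modules here are illustrative choices, not anything specified in this thread:

```python
# Sketch of a QLoRA setup: load Mistral-7B in 4-bit and attach LoRA adapters.
# Requires transformers, peft, bitsandbytes, and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 is the QLoRA paper's data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```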
-
Is GGML a model-saving format like GGUF, or not? I found one blog saying that it is a library. Does anyone know of a good blog or article they can share with me?
-
I would like to know the details of how to quantize an LLM to GGML or GGUF format. Currently, I have found only a few references that describe this, e.g. https://github.com/rustformers/llm/blob/main/crates/ggml/README.md. In contrast, GPTQ has a reference paper that explains the details.
So, is there any website or blog that describes the GGML quantization technique?
Thank you for your help. :)