
Quantizing Large Language Models With llama.cpp: A Clean Guide for 2024

by Micky Multani, March 6th, 2024

Too Long; Didn't Read

Model quantization is a technique used to reduce the precision of the numbers used in a model's weights and activations. This process significantly reduces the model size and speeds up inference times. It's possible to deploy state-of-the-art models on devices with limited memory and computational power.


Welcome to this "to-the-point" tutorial on how to quantize any Large Language Model (LLM) available on Hugging Face using llama.cpp. Whether you're a data scientist, a machine learning engineer, or simply an AI enthusiast, this guide is designed to clarify the process of model quantization and make it easy.


By the end of this tutorial, you'll have a clear understanding of how to efficiently compress LLMs without significant loss in performance, enabling their deployment in resource-constrained environments. You can also use these models on your favorite local setups using Ollama!


The GitHub repo is here: https://github.com/mickymultani/QuantizeLLMs

What is Model Quantization?

Before we dive into the technicalities, let's briefly discuss what model quantization is and why it's important. Model quantization is a technique used to reduce the precision of the numbers used in a model's weights and activations.


This process significantly reduces the model size and speeds up inference times, making it possible to deploy state-of-the-art models on devices with limited memory and computational power, such as mobile phones and embedded systems.
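
To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization in plain Python. It is a simplified illustration only, not the layout llama.cpp actually uses: the function names and the sample weights are made up for this example. Each block of floats is stored as small integers plus one floating-point scale, and the rounding error stays within half a quantization step.

```python
# Simplified illustration of block-wise 4-bit quantization.
# NOTE: this is NOT llama.cpp's real on-disk format; it only shows the idea
# of storing small integers plus one floating-point scale per block.

def quantize_block(values, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quants = [round(v / scale) for v in values]
    return quants, scale


def dequantize_block(quants, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quants]


# Hypothetical weights, chosen only for the demonstration.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.02, 0.45]
quants, scale = quantize_block(weights)
restored = dequantize_block(quants, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized integers:", quants)
print(f"scale={scale:.4f}, max reconstruction error={max_error:.4f}")
```

The trade-off is visible directly: eight floats shrink to eight 4-bit integers and one scale, at the cost of a small reconstruction error.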

Introducing llama.cpp

llama.cpp is a C/C++ inference engine for LLMs that also ships the tools we need to convert and quantize models. It supports various quantization methods, making it highly versatile for different use cases. Its conversion script works with models downloaded from the Hugging Face Hub, which hosts a wide range of pre-trained models across various languages and domains.

Setting Up Your Workspace

This tutorial is designed with flexibility in mind, catering to both cloud-based environments and local setups. I personally conducted these steps on Google Colab, utilizing the NVIDIA Tesla T4 GPU, which provides a robust platform for model quantization and testing.


However, you should not feel limited to this setup. The beauty of llama.cpp and the techniques covered in this guide is their adaptability to various environments, including local machines with GPU support.


For instance, if you're using a MacBook with Apple Silicon, you can follow along and leverage the GPU support for model quantization, thanks to the cross-platform compatibility of the tools and libraries we are using.

Setting Up on Google Colab

Google Colab provides a convenient, cloud-based environment with access to powerful GPUs like the T4. If you choose Colab for this tutorial, make sure to select a GPU runtime by going to Runtime > Change runtime type > T4 GPU. This ensures that your notebook has access to the necessary computational resources.

Running on MacBook with Apple Silicon

For those opting to run the quantization process on a MacBook with Apple Silicon, ensure that you have the necessary development tools and libraries installed. The setup differs slightly from the Linux-based Colab environment: CUDA is not available on Apple Silicon, so llama.cpp uses Apple's Metal backend there instead. Metal is enabled by default when you build on macOS, which means the LLAMA_CUBLAS=1 flag used in the Colab build step is not needed.

Setting Up Hugging Face Authentication

Regardless of your platform, downloading gated models from the Hugging Face Hub (such as the Gemma model used in this tutorial) and uploading your own models both require authentication. To seamlessly integrate this into your workflow, especially when working in collaborative or cloud-based environments like Google Colab, it's advisable to set up your Hugging Face authentication token securely.


On Google Colab, you can safely store your Hugging Face token by using Colab's "Secrets" feature. This can be done by clicking on the "Key" icon in the sidebar, selecting "Secrets", and adding a new secret with the name HF_TOKEN and your Hugging Face token as the value. This method ensures that your token remains secure and is not exposed in your notebook's code.


For local setups, consider setting the HF_TOKEN environment variable in your shell or utilizing the Hugging Face CLI to log in, thereby ensuring that your scripts have the necessary permissions to download and upload models to the Hub without hard-coding your credentials.
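
As a rough sketch of the local approach, the helper below reads the token from the HF_TOKEN environment variable and fails early with a clear message if it is missing. The function name resolve_hf_token is hypothetical and only used for illustration.

```python
import os


def resolve_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or raise if it is not set.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Export it in your shell or run "
            "`huggingface-cli login` before downloading gated models."
        )
    return token
```

The returned value can be passed as the token argument of snapshot_download. On Colab, a secret saved as HF_TOKEN can be read with userdata.get("HF_TOKEN") from the google.colab module instead.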

Step-by-Step Tutorial

Now, let's walk through the process of quantizing a model using llama.cpp. For this tutorial, we'll quantize the "google/gemma-2b-it" model from Hugging Face, but the steps can be applied to any model of your choice.

1. Setting Up Your Environment

First, we need to clone the llama.cpp repository and install the necessary requirements:

!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && LLAMA_CUBLAS=1 make && pip install -r requirements.txt

These commands clone the llama.cpp repository, compile the necessary binaries with CUDA support for GPU acceleration, and install the Python dependencies required by the conversion script.

2. Downloading the Model

Next, we download the model from Hugging Face Hub using the snapshot_download function. This function ensures that we have a local copy of the model for quantization:

from huggingface_hub import snapshot_download

model_name = "google/gemma-2b-it"
base_model = "./original_model/"
snapshot_download(repo_id=model_name, local_dir=base_model, local_dir_use_symlinks=False)
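
Before converting, it can help to confirm that the download actually contains a config file and weight files. The following is a small sketch under the assumption that the snapshot uses the usual config.json plus .safetensors or .bin layout; the helper name summarize_snapshot is made up for this example.

```python
from pathlib import Path


def summarize_snapshot(model_dir):
    """Report whether a downloaded model folder looks complete."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    # Assumption: weights are stored as .safetensors or .bin files.
    weight_files = [p for p in files if p.suffix in {".safetensors", ".bin"}]
    return {
        "has_config": (root / "config.json").is_file(),
        "weight_files": sorted(p.name for p in weight_files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```

Calling summarize_snapshot("./original_model/") after the download should show has_config as True and at least one weight file; if not, the download was probably interrupted or the token lacked access.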

3. Preparing the Model for Quantization

Before quantizing, we convert the downloaded model to GGUF, the file format llama.cpp works with. This step involves specifying the desired precision (f16 for half-precision floating point) and the output file:

!mkdir ./quantized_model/
!python llama.cpp/convert-hf-to-gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf
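
A quick way to sanity-check the converted file is to read its header. The sketch below assumes a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic "GGUF", a 32-bit version, and two 64-bit counts; the helper name read_gguf_header is hypothetical.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header and return its fields as a dict.

    Assumes little-endian GGUF v2+ (64-bit tensor and metadata counts).
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For example, read_gguf_header("./quantized_model/FP16.gguf") should return a dict with a non-zero tensor count; a ValueError suggests the conversion did not finish.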

4. Quantizing the Model

With the model in the correct format, we proceed to the quantization step. Here, we're using a quantization method called q4_k_m, which is specified in the methods list. q4_k_m is one of llama.cpp's "k-quant" formats: it stores most weights at roughly 4 bits each, grouped in blocks with their own scales, and the "M" (medium) variant keeps a few of the more sensitive tensors at higher precision to preserve output quality:

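import subprocess

methods = ["q4_k_m"]
quantized_path = "./quantized_model"

for m in methods:
    # The output file is named after the method, e.g. ./quantized_model/Q4_K_M.gguf
    qtype = f"{quantized_path}/{m.upper()}.gguf"
    # Arguments: input GGUF file, output GGUF file, quantization type
    subprocess.run(["./llama.cpp/quantize", f"{quantized_path}/FP16.gguf", qtype, m], check=True)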
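
Once the quantize step finishes, a simple way to see what it bought you is to compare the two files on disk. This is a minimal sketch; the helper name compression_report is made up for this example, and the actual ratio depends on the model and the method.

```python
import os


def compression_report(original_path, quantized_path):
    """Compare two model files on disk and report the size reduction."""
    original = os.path.getsize(original_path)
    quantized = os.path.getsize(quantized_path)
    return {
        "original_bytes": original,
        "quantized_bytes": quantized,
        "ratio": original / quantized,
        "saved_percent": 100 * (1 - quantized / original),
    }
```

For this tutorial you would call compression_report("./quantized_model/FP16.gguf", "./quantized_model/Q4_K_M.gguf") and print the result.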

5. Testing the Quantized Model

After quantization, it's important to test the model to ensure it performs as expected. The following command runs the quantized model with a sample prompt from a file, allowing you to assess its output quality:

! ./llama.cpp/main -m ./quantized_model/Q4_K_M.gguf -n 90 --repeat_penalty 1.0 --color -i -r "User: " -f llama.cpp/prompts/chat-with-bob.txt
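
If you prefer a non-interactive check that you can script, the sketch below builds a one-shot command using only the -m (model), -n (tokens to generate), and -p (prompt) options. The helper names are hypothetical, and the binary path is an assumption: it matches the build used in this tutorial, while newer llama.cpp builds name the binary llama-cli.

```python
import subprocess


def build_generate_command(model_path, prompt, n_tokens=90,
                           binary="./llama.cpp/main"):
    """Build the argument list for a single, non-interactive generation."""
    # Assumption: `binary` points at the llama.cpp example binary you built.
    return [binary, "-m", model_path, "-n", str(n_tokens), "-p", prompt]


def run_prompt(model_path, prompt, n_tokens=90, binary="./llama.cpp/main"):
    """Run the model once and return whatever it printed to stdout."""
    cmd = build_generate_command(model_path, prompt, n_tokens, binary)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.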

6. Sharing Your Quantized Model

Finally, if you wish to share your quantized model with the community, you can upload it to the Hugging Face Hub. This step involves creating a new repository with create_repo and uploading the .gguf file with the upload_file method of HfApi:

from huggingface_hub import HfApi, create_repo

model_path = "./quantized_model/Q4_K_M.gguf"
repo_name = "gemma-2b-it-GGUF-quantized"

# Create a public model repository under your account
repo_url = create_repo(repo_name, private=False)

# Upload the quantized file to the root of the new repository
api = HfApi()
api.upload_file(
    path_or_fileobj=model_path,
    path_in_repo="Q4_K_M.gguf",
    repo_id="yourusername/gemma-2b-it-GGUF-quantized",
    repo_type="model",
)

Make sure to replace "yourusername" with your actual Hugging Face username!

Wrapping Up

Congratulations! You've just learned how to quantize a large language model using llama.cpp. This process not only helps in deploying models to resource-constrained environments but also in reducing computational costs for inference.


By sharing your quantized models, you contribute to a growing ecosystem of efficient AI models accessible to a broader audience. Quantization is a powerful tool in the machine learning practitioner's toolkit.


Happy quantizing!