AutoGPTQ: A User-Friendly LLM Quantization Package

Introduction to AutoGPTQ

With the advent of large language models (LLMs) in the AI landscape, optimizing their efficiency has become a crucial endeavor. AutoGPTQ provides a solution: an easy-to-use LLM quantization package built around the GPTQ algorithm. With user-friendly APIs, AutoGPTQ brings an efficient approach to handling quantization tasks in machine learning workflows.

You can check out AutoGPTQ on GitHub here.

AutoGPTQ Updates and Performance

AutoGPTQ is a dynamic project that constantly improves its features and capabilities. Recent updates include integration with performance optimization libraries, support for more model types, and faster CUDA kernels.

One of the primary strengths of AutoGPTQ is inference speed. The GPU comparison below reports throughput in tokens/second, with the AutoGPTQ-quantized models outperforming their fp16 counterparts. For instance, with an input batch size of 1, a beam search decoding strategy, and the model forced to generate 512 tokens, the quantized Llama-7b model outperforms the original fp16 model (25.53 tokens/s vs. 18.87 tokens/s).

# AutoGPTQ Performance Comparison
performance_comparison = {
    "model": ["llama-7b", "moss-moon 16b", "gpt-j 6b"],
    "GPU": ["1xA100-40G", "1xA100-40G", "1xRTX3060-12G"],
    "num_beams": [1, 4, 1],
    "fp16": [18.87, 68.79, None],
    "gptq-int4": [25.53, 91.30, 29.55]
}
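
For a quick sense of the gain, the int4-over-fp16 speedup can be computed directly from the figures above; the snippet below is plain arithmetic on that data, not an AutoGPTQ API:

# Compute the speedup of the GPTQ int4 model over fp16 where a baseline exists
for model, fp16, int4 in zip(performance_comparison["model"], performance_comparison["fp16"], performance_comparison["gptq-int4"]):
    if fp16 is not None:
        print(f"{model}: {int4 / fp16:.2f}x faster")  # e.g. llama-7b: 1.35x faster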

Installing AutoGPTQ

Getting started with AutoGPTQ is straightforward. The latest stable release can be installed from PyPI with pip, enabling a quick setup:

pip install auto-gptq

For certain setups, a pre-built wheel matching your environment can be downloaded from each version's release assets and installed to skip the build stage:

# first, cd into the directory where the wheel was saved, then run the command below
pip install auto_gptq-0.2.0+cu118-cp310-cp310-linux_x86_64.whl
# installs the v0.2.0 auto_gptq pre-built wheel for Linux with Python 3.10 and CUDA 11.8

Additionally, the package offers options to disable CUDA extensions or to support specific models like LLaMa:

# To disable CUDA extensions
BUILD_CUDA_EXT=0 pip install auto-gptq
 
# To support LLaMa model
pip install auto-gptq[llama]
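
If you need the newest features or want to build the CUDA kernels for your own environment, you can also install from source. This is a sketch assuming the repository still lives at github.com/PanQiWei/AutoGPTQ:

# clone the repository and build/install it locally
git clone https://github.com/PanQiWei/AutoGPTQ.git
cd AutoGPTQ
pip install -v .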

AutoGPTQ in Action: Quantization and Inference

The core functionality of AutoGPTQ is to enable the quantization of large language models. The process is simple and can be executed with just a few lines of code. Below is an example where a pre-trained model is quantized to 4-bit and then used for inference:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging
 
# Set up logging
logging.basicConfig(format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S")
 
# Define pretrained and quantized model directories
pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
 
# Set up tokenizer and examples
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.")]
 
# Set up quantization config
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
 
# Load, quantize, and save model
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
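
To complete the inference side of the example, the saved weights can be loaded back with from_quantized and used for generation. The following is a minimal sketch that assumes a single CUDA device is available:

# Load the quantized model onto the GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# Tokenize a prompt, move it to the model's device, and decode the generated continuation
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))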

Customizing Models

AutoGPTQ also allows users to extend its functionalities to support custom models. It's a straightforward process that gives the user more control over their machine learning tasks. This customizable nature sets AutoGPTQ apart from other quantization packages, making it more flexible and adaptive to various use cases.

This customization can be seen in the example below, which extends auto_gptq to support the OPT model.

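For illustration, here is a rough sketch of what such an extension might look like, following the pattern used by auto_gptq's built-in model classes; the module names below are assumptions that you would adapt to your own model's architecture:

from auto_gptq.modeling import BaseGPTQForCausalLM

class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # class name of a single transformer block
    layer_type = "OPTDecoderLayer"
    # chained attribute name pointing at the list of transformer blocks
    layers_block_name = "model.decoder.layers"
    # modules outside the transformer blocks (embeddings, projections, final norm) that are not quantized
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions",
        "model.decoder.project_out", "model.decoder.project_in",
        "model.decoder.final_layer_norm",
    ]
    # linear layers inside each block, grouped in the order they are executed
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"],
    ]

Once defined, such a class can be used in place of AutoGPTQForCausalLM to load, quantize, and save the custom model.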

Evaluation on Downstream Tasks

AutoGPTQ supports evaluating a model's performance on specific downstream tasks before and after quantization, helping ensure that quantization does not unacceptably degrade the model on the tasks it is meant to perform. To illustrate this, the example below evaluates the EleutherAI/gpt-j-6b model on a sequence classification task, specifically sentiment analysis, using the cardiffnlp/tweet_sentiment_multilingual dataset:

from transformers import pipeline, AutoTokenizer
from auto_gptq import AutoGPTQForSequenceClassification, BaseQuantizeConfig
from datasets import load_dataset

# Define pretrained and quantized model directories
pretrained_model_dir = "EleutherAI/gpt-j-6b"
quantized_model_dir = "gpt-j-6b-4bit"

# Load the tokenizer and prepare a small calibration set for quantization
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.")]

# Load, quantize, and save the model
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForSequenceClassification.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)

# Build the sentiment analysis pipeline around the quantized model
sentiment_analysis = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Load the English test split of the dataset (a language config name is required)
dataset = load_dataset("cardiffnlp/tweet_sentiment_multilingual", "english", split="test")

# Evaluate the model on the test dataset
correct, total = 0, 0
for example in dataset:
    prediction = sentiment_analysis(example["text"])[0]
    # Dataset labels are stored as class indices; convert them to their string names
    gold_label = dataset.features["label"].int2str(example["label"])
    if prediction["label"].lower() == gold_label.lower():
        correct += 1
    total += 1

# Print the accuracy of the model on the test dataset
print(f"Accuracy: {correct / total:.2f}")

The code above showcases quantization, saving, and subsequent evaluation of the quantized model, letting you gauge how the quantization process impacts the outcome on the sequence classification task.

FAQ

1. Can AutoGPTQ handle only GPT-based models?

While AutoGPTQ was initially designed with GPT-based models in mind, the developers have extended its functionalities to accommodate a wider range of transformer models. This versatility stems from the modular design of the library, allowing it to be adapted for other models.

2. How do I customize AutoGPTQ for my specific use case?

AutoGPTQ allows customization by extending its classes and methods to support your specific needs. You can create custom classes inheriting from the base classes provided by AutoGPTQ and override the necessary methods.

3. Will quantization affect the performance of my model?

Quantization does involve a trade-off between model performance and model size or computational efficiency. However, AutoGPTQ aims to minimize this impact. It provides options to evaluate your model on downstream tasks before and after quantization, helping ensure that the performance degradation is acceptable for your use case.

Conclusion

In conclusion, AutoGPTQ provides an effective and efficient way to quantize transformer models while maintaining performance standards on specific tasks. Its user-friendly API and customization capabilities make it a versatile tool for machine learning professionals aiming to optimize their models. Whether you're looking to reduce the storage requirements of your model or improve inference speed, AutoGPTQ can be an essential part of your toolkit.