What is Quantization?
Quantization: compressing an AI model by reducing the numerical precision of its weights, allowing it to run faster and fit on less powerful hardware.
Quantization converts model weights from high-precision formats (typically 32-bit or 16-bit floating point) to lower-precision ones (such as 8-bit or 4-bit integers). This reduces model size by roughly 2-8x and speeds up inference, usually with minimal quality loss. It is the primary technique enabling large models to run on consumer hardware.
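The core idea can be sketched in a few lines. Below is a minimal, illustrative absmax int8 quantization of a weight tensor in NumPy; the function names and the scaling scheme are simplified for clarity and do not correspond to any specific library's implementation:

```python
import numpy as np

def quantize_int8(w):
    # Map the largest weight magnitude to 127, the int8 maximum.
    scale = np.max(np.abs(w)) / 127.0
    q = np.round(w / scale).astype(np.int8)  # stored as 1 byte per weight
    return q, scale

def dequantize(q, scale):
    # Recover approximate float32 weights for computation.
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32; the rounding error per
# weight is bounded by half the quantization step (scale / 2).
print("max error:", np.max(np.abs(w - w_hat)))
```

Real quantization methods refine this basic recipe, for example by quantizing in small blocks with a separate scale per block, or by calibrating on sample data to protect the weights that matter most.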
Frequently Asked Questions
Does quantization hurt model quality?
Slightly. 8-bit quantization typically shows negligible quality loss (under 1%). 4-bit quantization may lose 2-5% on benchmarks but remains highly usable for most applications.
What quantization methods are popular?
GPTQ, AWQ, and GGUF are the most common. GGUF (strictly a file format that bundles several quantization schemes) is popular for local deployment via llama.cpp, while GPTQ and AWQ are preferred for GPU-based serving.
Can I quantize any model?
Most transformer-based models can be quantized. Pre-quantized versions of popular models are available on Hugging Face, so you do not need to run the quantization process yourself.