Model optimization

Challenges with running larger models

Even with the latest hardware, such as NVIDIA B200 GPUs offering up to 192GB of memory, the rapid growth in model sizes (e.g., a 1.8T-parameter GPT MoE) means that many state-of-the-art models cannot fit on a single GPU, or even across multiple GPUs, without optimization.

[Figure: Why optimize]

Given below are simple examples that show how much GPU memory is required just to store the parameters of larger models.

Llama 4 Scout: 109B params

Optimization    Params size            GPUs required
BFloat16        109 * 2 ≈ 220 GB       3 x 80GB
INT8/FP8        109 * 1 ≈ 109 GB       2 x 80GB
INT4            109 * 0.5 ≈ 55 GB      1 x 80GB

Llama 4 Maverick: 400B params

Optimization    Params size            GPUs required
BFloat16        400 * 2 ≈ 800 GB       10 x 80GB ← requires multi-node!
INT8/FP8        400 * 1 ≈ 400 GB       5 x 80GB
INT4            400 * 0.5 ≈ 200 GB     3 x 80GB
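
The arithmetic behind these tables is simple enough to script. Below is a minimal sketch; it assumes 80GB GPUs and counts only the weights, ignoring activation and KV cache memory, which add to the real requirement.

import math

def estimate_weights(params_billions: float, bytes_per_param: float,
                     gpu_memory_gb: int = 80) -> tuple[float, int]:
    """Weight-only memory estimate; real deployments also need room for activations and the KV cache."""
    weight_gb = params_billions * bytes_per_param  # 1e9 params x bytes/param ≈ GB
    return weight_gb, math.ceil(weight_gb / gpu_memory_gb)

for name, params in [("Llama 4 Scout", 109), ("Llama 4 Maverick", 400)]:
    for dtype, nbytes in [("BFloat16", 2), ("INT8/FP8", 1), ("INT4", 0.5)]:
        gb, gpus = estimate_weights(params, nbytes)
        print(f"{name} @ {dtype}: ~{gb:.0f} GB -> {gpus} x 80GB GPUs")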

What optimized models offer:

  • Reduces GPU memory requirements

    • Model parameters account for the majority of GPU RAM usage at typical sequence lengths.

    • Optimization allows you to allocate more memory to the KV cache, enabling longer context windows or larger batch sizes.

  • Accelerates linear layers

    • Minimizes data movement, which is a major bottleneck in large models.

    • Enables the use of low-precision tensor cores (as supported by the underlying hardware), significantly speeding up matrix multiplications and inference.

  • Maintains model quality

    • Fine-grained quantization techniques can compress models with minimal or negligible impact on accuracy, preserving performance while reducing resource needs.
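
To make the KV cache point concrete, the sketch below estimates KV cache size for a hypothetical 70B-class configuration (80 layers, 8 KV heads with GQA, head dimension 128; these numbers are illustrative, not taken from any specific model card) and compares an FP16 cache with an FP8 cache.

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_value: int) -> float:
    """Per-token KV cache = 2 (K and V) x layers x KV heads x head_dim x bytes."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size / 1e9

for name, nbytes in [("FP16 KV cache", 2), ("FP8 KV cache", 1)]:
    size = kv_cache_gb(80, 8, 128, seq_len=32_768, batch_size=8, bytes_per_value=nbytes)
    print(f"{name}: ~{size:.0f} GB")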

Below is an example of the DeepSeek R1 model, showing how different quantization methods affect evaluation results after compression. Not all quantization methods are equal: model quality can vary significantly depending on the quantization scheme used, so weigh the impact on accuracy when choosing one.

[Figure: DeepSeek R1 evaluation results across quantization schemes]

In addition to the improvements described above, model optimization offers the following advantages:

  • Cost efficiency: Running large models requires expensive hardware. By optimizing and compressing models, you can reduce the required resources, leading to significant cost savings in both cloud and on-premise deployments.

  • Faster inference: Smaller, optimized models can process requests faster, reducing latency and improving user experience, especially in real-time applications.

  • Energy efficiency: Compressing models reduces the computational load, which in turn lowers power consumption and helps meet sustainability goals.

  • Deployment flexibility: Optimized models are easier to deploy on a wider range of devices, including edge devices and environments with limited resources.

  • Scalability: Smaller models allow you to serve more concurrent users or run more instances on the same hardware, improving scalability.

  • Bandwidth savings: Transferring large models over the network can be slow and costly. Compressed models are smaller and easier to distribute.

  • Regulatory and security constraints: Where data and models must remain on-premises or on specific hardware, optimization enables running advanced models within those constraints.

  • Enabling new use cases: By reducing the size and resource requirements, model optimization makes it feasible to use advanced LLMs in scenarios previously not possible, such as mobile, IoT, or embedded systems.

Quantization in practice

Quantization types in vLLM & LLM Compressor:

  • Weight quantization
    • What it does: Reduces the precision of model weights, lowering storage and memory requirements. LLM Compressor performs the weight quantization and requires a calibration dataset.
    • Example impact: 100B model: BFloat16 → 200GB, FP8 → 100GB
    • Supported schemes: W8A16, W4A16, WNA16

  • Weight and activation quantization
    • What it does: Reduces model size and improves inference performance. LLM Compressor quantizes the weights (calibration dataset required); vLLM quantizes the activations during inference.
    • Example impact: Smaller activation memory footprint, faster inference
    • Supported schemes: W8A8, W4A8, W4A4

  • KV cache quantization
    • What it does: Reduces the KV cache footprint and speeds up attention; crucial for long-context workloads. LLM Compressor calibrates the scales (calibration dataset required); vLLM applies them at inference time.
    • Example impact: Enables longer context or larger batch sizes on the same hardware
    • Supported schemes: FP8
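
On the serving side, vLLM reads the quantization configuration that LLM Compressor stores in the checkpoint, while the KV cache data type is set explicitly. Below is a minimal sketch; the model name is only a placeholder, and argument names may vary across vLLM versions.

from vllm import LLM, SamplingParams

# vLLM picks the weight/activation scheme up from the checkpoint config;
# kv_cache_dtype="fp8" enables the quantized KV cache described above.
llm = LLM(model="your-org/your-model-FP8-KV", kv_cache_dtype="fp8")
outputs = llm.generate(["Explain KV cache quantization in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)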

Supported quantization schemes and when to use what?

  • W4A16
    • Description: 4-bit weights, FP16 activations. High compression; fits small deployments. Requires a calibration dataset for weight quantization.
    • Use cases: Memory-constrained inference at low QPS / online inference; edge devices; low-memory or containerized apps.
    • Recommended GPU type: Any GPU type.

  • W8A8-INT8
    • Description: 8-bit weights, INT8 activations (per-token, at runtime). Requires a calibration dataset for weight quantization.
    • Use cases: High-QPS or offline serving; general-purpose inference on any GPU; high-throughput inference on older GPUs.
    • Recommended GPU type: NVIDIA GPUs with compute capability < 8.9 (Ampere, Turing, Volta, Pascal, or older).

  • W8A8-FP8
    • Description: 8-bit weights, FP8 activations (at runtime). Preserves precision while gaining speed. Requires a calibration dataset for weight quantization.
    • Use cases: High-QPS or offline serving; accuracy-sensitive workloads with memory constraints.
    • Recommended GPU type: NVIDIA GPUs with compute capability >= 9.0 (Hopper and Blackwell).

  • 2:4 Sparsity (FP8 weights/activations)
    • Description: Structured 2:4 sparsity plus FP8 weights and activations. Uses sparsity acceleration for very high performance.
    • Use cases: Speed-focused inference on modern hardware.
    • Recommended GPU type: NVIDIA GPUs with compute capability >= 9.0 (Hopper and Blackwell).

For the full mapping of supported hardware to quantization schemes, refer to the vLLM documentation.
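
A quick way to map your hardware onto these recommendations is to check its CUDA compute capability; the helper below is only a rough guide that mirrors the table above.

import torch

def suggested_scheme() -> str:
    """Rough scheme suggestion based on compute capability, mirroring the table above."""
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (9, 0):
        return "W8A8-FP8 (or 2:4 sparsity with FP8 on Hopper/Blackwell)"
    return "W8A8-INT8 (or W4A16 for memory-constrained, low-QPS serving)"

print(torch.cuda.get_device_name(), "->", suggested_scheme())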

Supported quantization methods/recipes and when to use what?

  • GPTQ
    • Description: Uses second-order, layer-wise optimization to prioritize important weights/activations and updates the remaining weights to compensate.
    • Use case / accuracy needs: High accuracy recovery; best when accuracy is critical and a longer quantization time is acceptable.

  • AWQ
    • Description: Uses channel-wise scaling to better preserve important outliers in weights and activations.
    • Use case / accuracy needs: Moderate accuracy recovery; suitable when faster quantization is needed with reasonable accuracy.

  • SmoothQuant
    • Description: Smooths activation outliers by folding them into the weights, improving accuracy for weight-and-activation quantized models.
    • Use case / accuracy needs: Good accuracy recovery with minimal calibration time; can be combined with other methods for efficiency.

  • SparseGPT
    • Description: One-shot pruning method that solves a layer-wise sparse regression to zero out weights while readjusting the survivors; supports unstructured sparsity up to ≈50–60% without retraining and 2:4 semi-structured (N:M) sparsity for hardware-friendly acceleration; can be stacked with low-bit quantization.
    • Use case / accuracy needs: When latency, throughput, or memory footprint must drop quickly and some accuracy loss is acceptable; 2:4 mode on Hopper/Blackwell-class GPUs gives roughly 1.5–2× speed-up with near-AWQ accuracy on large models, while small models (<7B) may see noticeable drops.
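
To show how these methods are expressed in practice, the sketch below combines SmoothQuant with GPTQ to produce a W8A8-INT8 model with LLM Compressor. The modifier classes follow current llm-compressor examples, but module paths and arguments may differ in your installed version; model and ds refer to the loaded model and calibration dataset prepared in the workflow described later.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Fold activation outliers into the weights, then quantize the weights with GPTQ (W8A8-INT8)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=2048, num_calibration_samples=512)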

Let’s help a client select the quantization method and scheme

  • 1. Inference style: Is the workload online (latency-critical, interactive) or offline (throughput-critical, batch)?
    • Example client answer: "An online customer-service chatbot."
    • How the answer drives the decision: Online ⇒ memory-bandwidth bound ⇒ weight-only quantization (activations stay FP16). Offline ⇒ compute bound ⇒ weight + activation quantization (both operands low-precision).

  • 2. Target GPU architecture: Which GPUs will serve the model?
    • Example client answer: "Ampere A100."
    • How the answer drives the decision: Turing/Ampere have INT8 Tensor Cores ⇒ pick INT8 for speed. Hopper/H100 have native FP8 ⇒ pick FP8 (or INT8 if the tooling is simpler).

  • 3. Expected concurrency / batch size: Are there enough requests to saturate the matrix-multiply units?
    • Example client answer: "About 5 concurrent users; the GPU is often idle."
    • How the answer drives the decision: If the GPU is not fully busy, you gain more by cutting memory traffic (weight-only). If the GPU is fully busy, you gain more by lowering compute cost (weight + activation).

  • 4. Accuracy headroom / SLA: How much accuracy can you afford to lose?
    • Example client answer: "Less than a 0.5 percentage-point drop."
    • How the answer drives the decision: Tight budgets push you toward higher-accuracy methods (GPTQ, SmoothQuant + GPTQ).

Example decision cheat sheet

  • Online, Ampere/Turing, few users, strict latency
    • Quantization scheme: W4/W8 weights, A16 activations (weight-only)
    • Recommended method(s): AWQ (fast) or GPTQ (maximum accuracy)
    • Why this combination: Data movement is the bottleneck and compute is effectively "free"; weight-only avoids per-token FP16 → INT8 conversion of activations.

  • Online, Hopper, few users
    • Quantization scheme: W4/W8 weights, A16 activations (still weight-only)
    • Recommended method(s): AWQ or GPTQ
    • Why this combination: Hopper can run FP8 activations, but with few users the activation traffic is tiny, so stick to weight-only.

  • Offline, Ampere/Turing, large batch
    • Quantization scheme: W8A8 (INT8 weights and activations)
    • Recommended method(s): SmoothQuant + GPTQ (fold activation outliers, then quantize weights)
    • Why this combination: Matrix multiplication dominates; lowering both operands to INT8 doubles Tensor Core throughput.

  • Offline, Hopper, massive batch
    • Quantization scheme: W8A8 or FP8 weights and activations
    • Recommended method(s): SmoothQuant + SparseGPT (optional pruning)
    • Why this combination: Hopper's FP8 Tensor Cores peak at roughly 2× A100 throughput; SmoothQuant tames activation outliers, and SparseGPT can prune 2:4 (semi-structured) for more speed.
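
The cheat sheet collapses into a few lines of logic. The helper below is purely hypothetical, encoding the rules of thumb above rather than any official API.

def pick_quantization(online: bool, compute_capability: tuple[int, int],
                      gpu_often_idle: bool, strict_accuracy: bool) -> tuple[str, str]:
    """Hypothetical rule-of-thumb selector mirroring the cheat sheet above."""
    if online or gpu_often_idle:
        # Memory-bandwidth bound: quantize weights only, keep activations in FP16
        return ("W4A16 / W8A16", "GPTQ" if strict_accuracy else "AWQ")
    if compute_capability >= (9, 0):
        # Compute bound on FP8-capable hardware (Hopper/Blackwell)
        return ("W8A8-FP8", "SmoothQuant + GPTQ, optionally SparseGPT 2:4")
    # Compute bound on INT8 Tensor Core hardware (Ampere/Turing)
    return ("W8A8-INT8", "SmoothQuant + GPTQ")

print(pick_quantization(online=True, compute_capability=(8, 0),
                        gpu_often_idle=True, strict_accuracy=False))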

Quantization workflow

[Figure: Quantization workflow]

  • Model selection and loading

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-model"  # referenced again below when saving the quantized model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
  • Choosing the quantization scheme (Supported quantization schemes)

  • Choosing the quantization method (Supported quantization methods)

  • Preparing calibration data

    • Ensure the calibration data contains a high variety of samples to prevent overfitting towards a specific use case.

    • If the model was fine-tuned, use samples from its fine-tuning training data for calibration.

    • Employ the chat template or instruction template that the model was trained with.

    • Start with 512 samples for calibration data, and increase if accuracy drops.

    • Use a sequence length of 2048 as a starting point.

    • Tune key hyperparameters of the quantization algorithm:

      • dampening_frac sets how much influence the GPTQ algorithm has. Lower values can improve accuracy, but can lead to numerical instabilities that cause the algorithm to fail.

      • actorder sets the activation ordering. When compressing the weights of a layer, the order in which channels are quantized matters. Setting actorder="weight" can improve accuracy without added latency.
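
A sketch of preparing calibration data along these lines is shown below. The dataset is a public chat corpus used purely as an example; substitute your own samples, and note that the sample count and sequence length are just the starting points suggested above.

from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512   # starting point; increase if accuracy drops
MAX_SEQUENCE_LENGTH = 2048      # starting point

# Any sufficiently varied chat-style dataset works; ultrachat is only an example
ds = load_dataset("HuggingFaceH4/ultrachat_200k",
                  split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")

# Render each sample with the chat template the model was trained with
ds = ds.map(lambda sample: {
    "text": tokenizer.apply_chat_template(sample["messages"], tokenize=False)
})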

  • Applying quantization

    • Use the oneshot API and provide recipes to quantize and/or apply sparsity to the model, given a calibration dataset

from llmcompressor import oneshot

# FP8 quantization of Linear weights, input activations, and the KV cache; lm_head is left unquantized
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
  • Saving the model

SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
  • Evaluating accuracy of the quantized model

# $MODEL is the path to the quantized checkpoint saved above (SAVE_DIR)
lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,kv_cache_dtype=fp8,add_bos_token=True \
  --tasks gsm8k --num_fewshot 5 --batch_size auto
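
Once the evaluation looks acceptable, the same checkpoint can be served with vLLM. Below is a minimal example, with the path and port as placeholders (flag names may vary across vLLM versions):

vllm serve ./your-model-FP8-KV \
  --kv-cache-dtype fp8 \
  --port 8000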