Weight and Activation Quantization (W4A16)

In this exercise, we will use a notebook to investigate how LLM weights can be quantized to the W4A16 format, in which weights are stored in 4-bit precision while activations remain in 16-bit, for memory savings and inference acceleration. This quantization method is particularly useful for:

  • Reducing model size

  • Maintaining good performance during inference

The quantization process involves the following steps:

  1. Load the model: Load the pre-trained LLM

  2. Choose the quantization scheme and method: Refer to the slides for a quick recap of the supported schemes and formats

  3. Prepare the calibration dataset: Select a dataset that is representative of the model’s expected inputs

  4. Quantize the model: Convert the model weights to 4-bit precision (W4A16)

    • Using SmoothQuant and GPTQ (see the sketch after this list)

  5. Save the model: Save the quantized model to suitable storage (S3 in this lab)

  6. Evaluate the model: Evaluate the quantized model’s accuracy
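
To make these steps concrete before you open the notebook, the following is a minimal sketch of the flow using llm-compressor. The model id, calibration dataset, and sample counts are placeholders, and exact import paths can differ between llm-compressor versions; the notebook contains the authoritative code.

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Steps 2 and 4: a recipe that combines SmoothQuant (folds activation
# outliers into the weights) with GPTQ (quantizes the weights to 4 bits).
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

# Steps 1, 3, 4, and 5: load the model, calibrate on a small dataset,
# apply the recipe, and save the quantized model to disk.
oneshot(
    model="ibm-granite/granite-3.1-2b-instruct",  # placeholder model id
    dataset="open_platypus",                      # placeholder calibration set
    recipe=recipe,
    output_dir="granite-w4a16",
    max_seq_length=2048,
    num_calibration_samples=512,
)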

Pre-requisites

To start the lab, perform the following pre-requisite setup activities.

  1. Create a Data Science Project

  2. Create Data Connections - To store the quantized model

  3. Deploy a Data Science Pipeline Server

  4. Launch a Workbench

  5. Clone the Git Repo https://github.com/redhat-ai-services/etx-llm-optimization-and-inference-leveraging.git into the workbench

Creating a Data Science Project

  • In the OpenShift AI Dashboard application, navigate to the Data Science Projects menu on the left:

    Figure 1. OpenShift AI Dashboard
  • Create a Data Science project with the name quantization

    Figure 2. Project

Creating a Data Connection for the Pipeline Server

  • To provide S3 storage for the pipeline server and for saving the quantized model, create a new OpenShift project named minio and set up MinIO by applying the manifest available at optimization_lab/minio.yaml. The default credentials for accessing MinIO are minio/minio123.

oc new-project minio
oc apply -f optimization_lab/minio.yaml -n minio
  • Log in to MinIO with the credentials minio/minio123 and create two buckets named pipelines and models.
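
If you prefer to create the buckets programmatically rather than through the MinIO console, a short boto3 sketch (assuming boto3 is available and the code runs inside the cluster, so the service endpoint resolves) accomplishes the same thing:

import boto3

# Connect to the in-cluster MinIO service deployed from minio.yaml,
# using the default credentials noted above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.minio.svc.cluster.local:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
)

# Create the bucket for the pipeline server and the bucket for models.
for bucket in ("pipelines", "models"):
    s3.create_bucket(Bucket=bucket)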

  • Create a new Data Connection that points to the MinIO instance.

Figure 3. Connection
  • Select the connection type S3 compatible object storage - v1 and use the following values to configure the MinIO connection.

    Figure 4. S3 compatible object storage
    • Name: Pipeline
    • Access Key: minio
    • Secret Key: minio123
    • Endpoint: http://minio-service.minio.svc.cluster.local:9000
    • Region: none
    • Bucket: pipelines
  • The result should look similar to:

    Figure 5. Result
  • Create another Data Connection named minio-models using the same MinIO connection details, but with the bucket name models.

Creating a Pipeline Server

  • It is recommended to create the pipeline server before creating a workbench.

  • Go to the Data Science Project quantization → Data science pipelines → Pipelines, and click on Configure Pipeline Server

    Figure 6. Pipeline Server 1
  • Use the same information as in the Data Connection created earlier (Pipeline) and click the Configure Pipeline Server button:

    Figure 7. Pipeline Server 2
  • When the pipeline server is ready, the screen will look like the following:

    Figure 8. Pipeline Server 3

At this point, the pipeline server is configured and deploying.

There is no need to wait for the pipeline server to become ready; this may take more than a couple of minutes. You can move on to the next steps and check back on it later.

Creating a Workbench

  • Once the Data Connection and Pipeline Server are fully created, it’s time to create the workbench

  • Go to Data Science Projects, select the project quantization, and click on Create a workbench

    Figure 9. Create Workbench
  • Make sure it has the following characteristics:

    • Choose a name for it, like: granite-quantization

    • Image Selection: Minimal Python or Standard Data Science

    • Container Size: Medium

    • Accelerator: NVIDIA-GPU

  • That should look like:

    Figure 10. Launch Workbench
  • Add the created Data Connection by clicking on the Connections section and selecting Attach existing connections. Then, click Attach for the minio-models connection created earlier.

    Figure 11. Add Data Connection
    Figure 12. Attach Data Connection
  • Then, click on Create Workbench and wait for the workbench to be fully started.

  • Once it is, click the link beside the name of the workbench to connect to it!

    Figure 13. Open Link
  • Authenticate with the same credentials as earlier.

  • You will be asked to accept the following settings:

    Figure 14. Accept Settings
  • Once you accept them, you should see the following:

    Figure 15. Jupyter

Git clone the Common Repo

We will clone the content of our Git repo so that you can access all the materials created as part of our prototyping exercise.

  • Using the Git UI:

    • Open the Git UI in Jupyter:

      Figure 16. Git UI
    • Enter the URL of the Git repo:

      https://github.com/redhat-ai-services/etx-llm-optimization-and-inference-leveraging.git
      Figure 17. Git Clone

At this point, the project is ready for the quantization work.

Exercise: Quantize the Model with llm-compressor

From the optimization_lab/llm_compressor folder, open the notebook weight_activation_quantization.ipynb and follow the instructions.

Figure 18. Notebook

To execute a cell, select it and either click on the play icon or press Shift + Enter.

Figure 19. Execute Cell

While a cell is executing, you will see [*]. Once the execution has completed, the * is replaced with a number, e.g., [1].

Figure 20. Cell Status
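
The final cells of the notebook evaluate the quantized model’s accuracy (step 6 above). As a rough illustration only, an evaluation with the lm-evaluation-harness library could look like the following sketch; the model path and task are placeholders, and the notebook’s own evaluation cells are the authoritative reference.

import lm_eval

# Evaluate the quantized checkpoint on a small task subset; the model path
# and task are placeholders, and `limit` keeps the GPU run short.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=granite-w4a16,dtype=auto",
    tasks=["hellaswag"],
    limit=100,
)
print(results["results"])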

When done, you can close the notebook and head to the next page.
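
Optionally, before moving on, you can verify that the quantized model reached the models bucket. Assuming the workbench exposes the attached minio-models connection as AWS_* environment variables (the OpenShift AI default for attached connections), a quick check from a notebook cell might look like:

import os
import boto3

# Build a client from the connection details injected by the attached
# Data Connection (environment variable names assumed per OpenShift AI).
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# List what landed in the models bucket.
response = s3.list_objects_v2(Bucket=os.environ["AWS_S3_BUCKET"])
for obj in response.get("Contents", []):
    print(obj["Key"])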

Once you complete all the quantization exercises and no longer need the workbench, make sure you stop it so that the associated GPU is freed and can be used to serve the model.
Figure 21. Workbench Done
Figure 22. Workbench Stop