Evaluating Large Language Models

In this section we will cover how Large Language Models can be evaluated using the LM-Eval framework.

This lab only works with the rhoai-eus-2.16-aws-gpu overlay on a GPU cluster. The Qwen_Instruct model used in the lab requires a GPU node, and the TrustyAI ConfigMap patching instructions do not apply to RHOAI 2.18 because the operator now controls that ConfigMap.

Validate via ArgoCD that all components are deployed and synced. The TrustyAI demo components are deployed via the ai-example-lmeval-lab application, which includes the LLM model from https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct; its source code can be found under 'tenants/ai-example/lmeval-lab'. The MinIO deployment for object storage is handled by the 'ai-example-lmeval-lab-minio' ArgoCD application, with source code under 'tenants/ai-example/lmeval-lab-minio'.
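
If you prefer the command line over the ArgoCD console, the sync status can also be checked with oc; this assumes the applications live in the default openshift-gitops namespace:

oc get applications.argoproj.io -n openshift-gitops ai-example-lmeval-lab ai-example-lmeval-lab-minio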

The 'Qwen2.5-0.5B-Instruct' model requires an NVIDIA GPU, so it might take some time for a GPU machine to be provisioned and for the model to start. You can monitor the process in the OpenShift Console under Compute → Machines. The model is deployed in the ai-example-lmeval-lab namespace. Make sure that the qwen-instruct-predictor* pod is up and running.
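
One quick way to check, assuming the default KServe naming where predictor pods are prefixed with the InferenceService name:

oc get pods -n ai-example-lmeval-lab | grep qwen-instruct-predictor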

By default, the TrustyAI component is configured to prevent online access. For the LLM evaluation to succeed, online access needs to be enabled. Run the following commands to allow online connections:

oc patch configmap trustyai-service-operator-config -n redhat-ods-applications \
--type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'

oc rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications

Wait a couple of minutes before proceeding, or check the status of the trustyai-service-operator-controller-manager pod in the redhat-ods-applications namespace.
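
For example, the following command blocks until the restarted deployment has finished rolling out:

oc rollout status deployment/trustyai-service-operator-controller-manager -n redhat-ods-applications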

An LLM evaluation can be initiated by creating an LMEvalJob custom resource.

+ .lm-eval-job.yaml

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob
  namespace: ai-example-lmeval-lab
spec:
  # 'local-completions' is the lm-evaluation-harness model type for
  # OpenAI-compatible completions endpoints
  model: local-completions
  taskList:
    taskNames:
      - arc_easy              # AI2 Reasoning Challenge (Easy) benchmark
  logSamples: true            # keep individual prompt/response samples in the results
  batchSize: '1'
  allowOnline: true           # required to download the task dataset and tokenizer
  allowCodeExecution: false
  outputs:
    pvcManaged:
      size: 5Gi               # the operator provisions a PVC of this size for results
  modelArgs:
    - name: model
      value: qwen-instruct
    - name: base_url          # in-cluster KServe predictor service endpoint
      value: http://qwen-instruct-predictor:8080/v1/completions
    - name: num_concurrent
      value: "1"
    - name: max_retries
      value: "3"
    - name: tokenized_requests
      value: "False"          # send plain-text prompts rather than token IDs
    - name: tokenizer         # Hugging Face tokenizer matching the served model
      value: Qwen/Qwen2.5-0.5B-Instruct
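
Before applying the resource, you can optionally sanity-check that the model endpoint responds. The predictor service is only reachable inside the cluster, so run the request from a temporary pod; the pod name and image below are arbitrary choices, and the /v1/models endpoint assumes an OpenAI-compatible runtime such as vLLM:

oc run curl-test -n ai-example-lmeval-lab --rm -it --image=curlimages/curl -- \
  curl -s http://qwen-instruct-predictor:8080/v1/models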

Copy the contents of the YAML to a local file on your machine and apply it using the following command:

oc apply -f lm-eval-job.yaml

The LM-Eval job will take some time to complete. Progress can be observed by monitoring the evaljob pod in the ai-example-lmeval-lab namespace.
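
For example, you can watch the pod until it completes, or query the job state directly (the LMEvalJob reports its progress in .status.state, for example Running or Complete):

oc get pod evaljob -n ai-example-lmeval-lab -w

oc get lmevaljobs.trustyai.opendatahub.io evaljob -n ai-example-lmeval-lab -o jsonpath='{.status.state}'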

Pod logs can be observed by running the following command:

oc logs -f pod/evaljob -n ai-example-lmeval-lab

If the 'evaljob' pod fails, make sure that the trustyai-service-operator-config ConfigMap in the redhat-ods-applications namespace has lmes-allow-online set to true. If not, apply the patch command above and restart the trustyai-service-operator-controller-manager deployment.
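
You can verify the current value directly:

oc get configmap trustyai-service-operator-config -n redhat-ods-applications \
  -o jsonpath='{.data.lmes-allow-online}'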

The LLM evaluation job can be restarted using the commands below.

oc delete LMEvalJob/evaljob -n ai-example-lmeval-lab

oc apply -f lm-eval-job.yaml

Once the LM-Eval job completes, the evaluation results can be retrieved by running the following command:

oc get lmevaljobs.trustyai.opendatahub.io evaljob -n ai-example-lmeval-lab \
  -o template --template={{.status.results}} | jq '.results'
{
  "arc_easy": {
    "alias": "arc_easy",
    "acc,none": 0.6561447811447811,
    "acc_stderr,none": 0.009746660584852454,
    "acc_norm,none": 0.5925925925925926,
    "acc_norm_stderr,none": 0.010082326627832872
  }
}

The results show an accuracy (acc) of approximately 0.656 with a standard error of 0.0097, and a normalized accuracy (acc_norm) of approximately 0.593 with a standard error of 0.0101 on the arc_easy task.
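
If you only need a single metric, the same output can be filtered further with jq; for example, to extract just the accuracy score:

oc get lmevaljobs.trustyai.opendatahub.io evaljob -n ai-example-lmeval-lab \
  -o template --template={{.status.results}} | jq '.results.arc_easy["acc,none"]'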