Evaluating Large Language Models
In this section we will cover how Large Language Models can be evaluated using the LM-Eval framework.
This lab only works with the rhoai-eus-2.16-aws-gpu overlay on a GPU cluster. The Qwen_Instruct model used in the lab requires a GPU node, and the trustyai ConfigMap patching instructions are not applicable to RHOAI 2.18 because of the way the operator now controls that ConfigMap.
Validate via ArgoCD that all components are deployed and synced. TrustyAI demo components are deployed via the ai-example-lmeval-lab application, which includes the LLM model from https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct. Source code can be found under 'tenants/ai-example/lmeval-lab'. The Minio deployment for object storage is done via the 'ai-example-lmeval-lab-minio' ArgoCD application; its source code is under 'tenants/ai-example/lmeval-lab-minio'.
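The same check can also be done from the CLI; a minimal sketch, assuming the ArgoCD Application resources live in the default openshift-gitops namespace:

# List both lab applications with their sync and health status
# (assumes the default openshift-gitops namespace for Applications)
oc get applications.argoproj.io ai-example-lmeval-lab ai-example-lmeval-lab-minio \
  -n openshift-gitops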
The 'Qwen2.5-0.5B-Instruct' model requires an NVIDIA GPU, so it might take some time for the GPU machine to be provisioned and the model to start. You can monitor the process in the OpenShift Console under Compute → Machines. The model is deployed in the ai-example-lmeval-lab namespace. Make sure that the qwen-instruct-predictor* pod is up and running, for example as shown below.
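A quick CLI sketch for both checks, assuming standard machine-api naming (the exact predictor pod name suffix will vary per deployment):

# Watch GPU machine provisioning (machine-api objects live in openshift-machine-api)
oc get machines -n openshift-machine-api

# Confirm the predictor pod is Running (pod name suffix varies)
oc get pods -n ai-example-lmeval-lab | grep qwen-instruct-predictor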
By default, the trustyai component is configured to prevent online access. For the LLM evaluation to succeed, online access needs to be enabled. Run the following commands to allow online connections:
oc patch configmap trustyai-service-operator-config -n redhat-ods-applications \
  --type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'

oc rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications
Wait a couple of minutes before proceeding, or check the status of the trustyai-service-operator-controller-manager pod in the redhat-ods-applications namespace.
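Rather than waiting a fixed time, you can verify the rollout and the patched value directly; a small sketch using standard oc subcommands:

# Block until the restarted operator deployment is fully rolled out
oc rollout status deployment/trustyai-service-operator-controller-manager -n redhat-ods-applications

# Should print "true" once the patch has been applied
oc get configmap trustyai-service-operator-config -n redhat-ods-applications \
  -o jsonpath='{.data.lmes-allow-online}'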
LLM evaluation can be initiated by creating the LMEvalJob Custom Resource.
lm-eval-job.yaml:
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob
  namespace: ai-example-lmeval-lab
spec:
  model: local-completions
  taskList:
    taskNames:
      - arc_easy
  logSamples: true
  batchSize: '1'
  allowOnline: true
  allowCodeExecution: false
  outputs:
    pvcManaged:
      size: 5Gi
  modelArgs:
    - name: model
      value: qwen-instruct
    - name: base_url
      value: http://qwen-instruct-predictor:8080/v1/completions
    - name: num_concurrent
      value: "1"
    - name: max_retries
      value: "3"
    - name: tokenized_requests
      value: "False"
    - name: tokenizer
      value: Qwen/Qwen2.5-0.5B-Instruct
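Before applying the job, you can optionally sanity-check that the completions endpoint in base_url answers from inside the cluster. A throwaway-pod sketch; the pod name lmeval-curl-test is arbitrary, and the UBI image is only one assumption for an image that ships curl:

# Send a single test completion request from a temporary pod inside the cluster
oc run lmeval-curl-test --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi9/ubi -n ai-example-lmeval-lab -- \
  curl -s http://qwen-instruct-predictor:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "qwen-instruct", "prompt": "Hello", "max_tokens": 5}'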
Copy the content of the YAML to a local file on your machine and apply it using the following command:
oc apply -f lm-eval-job.yaml
The LM-Eval job will take some time to complete. The progress can be observed by monitoring the evaljob pod in the ai-example-lmeval-lab namespace.
Pod logs can be observed by running the following command:
oc logs -f pod/evaljob -n ai-example-lmeval-lab
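The job's overall state is also reflected on the Custom Resource itself; a hedged sketch, assuming the LMEvalJob status exposes a state field as in TrustyAI's upstream examples:

# Prints the job state (e.g. Running or Complete); field name assumed from TrustyAI examples
oc get lmevaljobs.trustyai.opendatahub.io evaljob -n ai-example-lmeval-lab \
  -o jsonpath='{.status.state}'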
If the 'evaljob' pod fails, make sure that the trustyai-service-operator-config ConfigMap in the redhat-ods-applications namespace has lmes-allow-online set to true. If not, apply the patch command above and restart the trustyai-service-operator-controller-manager deployment.
The LLM evaluation job can be restarted by using the commands below.
oc delete LMEvalJob/evaljob -n ai-example-lmeval-lab
oc apply -f lm-eval-job.yaml
Once the LM-Eval job completes, the evaluation results can be retrieved by running the following command:
oc get lmevaljobs.trustyai.opendatahub.io evaljob -n ai-example-lmeval-lab \
  -o template --template={{.status.results}} | jq '.results'
{
  "arc_easy": {
    "alias": "arc_easy",
    "acc,none": 0.6561447811447811,
    "acc_stderr,none": 0.009746660584852454,
    "acc_norm,none": 0.5925925925925926,
    "acc_norm_stderr,none": 0.010082326627832872
  }
}
The results show an accuracy (acc,none) of 0.6561 and a normalized accuracy (acc_norm,none) of 0.5926 on the arc_easy task, with standard errors of 0.0097 and 0.0101 respectively.
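To pull out a single metric instead of the whole results object, the same template can be piped through a narrower jq filter; a small sketch reusing the command above:

# Extract only the raw accuracy for the arc_easy task
oc get lmevaljobs.trustyai.opendatahub.io evaljob -n ai-example-lmeval-lab \
  -o template --template={{.status.results}} | jq '.results.arc_easy["acc,none"]'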