Evaluating Model Accuracy with TrustyAI lm-eval-harness service
While performance metrics like latency and throughput are critical for deploying efficient GenAI systems, task-level accuracy and reasoning quality are equally essential for selecting or fine-tuning a model. In this activity, we use the popular lm-eval-harness framework to evaluate how well a language model performs across established benchmarks, focusing on reasoning and subject matter understanding.
What is TrustyAI?
TrustyAI is an open-source AI explainability and trustworthiness platform designed to help developers and data scientists understand and monitor their machine learning models. It provides tools to analyze predictions, identify biases, and ensure that AI systems are fair, transparent, and reliable. As part of its comprehensive toolkit, TrustyAI integrates the popular lm-eval harness to specifically benchmark and evaluate the performance of large language models against standardized tests, allowing users not only to understand why a model makes a decision but also to quantitatively measure its accuracy and capabilities. This combination of explainability and performance evaluation enables organizations to build more responsible, ethical, and robust AI applications.
What is lm-eval-harness?
lm-eval-harness is a community-maintained benchmarking toolkit from EleutherAI. It enables consistent, reproducible evaluation of large language models (LLMs) across dozens of academic and real-world benchmarks, such as:
- MMLU (Massive Multitask Language Understanding)
- HellaSwag, ARC, and Winogrande
- Question answering, common sense reasoning, reading comprehension, and more
The framework supports both open-source models and OpenAI-compatible endpoints, and can be customized with additional tasks, prompt templates, and evaluation metrics.
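For context, the same harness can also be run directly from a workstation against any OpenAI-compatible endpoint. The sketch below is illustrative only: the endpoint URL, served model name, and tokenizer are placeholders/assumptions, and exact flags can vary between harness versions. In this lab we will instead launch evaluations in-cluster through TrustyAI’s LMEvalJob resource.

# Minimal sketch: evaluate an OpenAI-compatible completions endpoint with lm-eval-harness.
# The URL, model name, and tokenizer below are placeholders, not values from this lab.
pip install lm-eval
lm_eval --model local-completions \
  --model_args model=granite-8b,base_url=https://<YOUR_ENDPOINT>/v1/completions,tokenizer=microsoft/Phi-3-mini-4k-instruct,num_concurrent=1 \
  --tasks arc_easy \
  --limit 10   # score only a handful of examples as a quick smoke test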
ARC (AI2 Reasoning Challenge)
Today we will be running the ARC evaluation.
ARC (AI2 Reasoning Challenge) is a multiple-choice question answering dataset designed to evaluate advanced reasoning and knowledge application in AI systems. It contains science exam questions from grades 3 to 9, split into two subsets: ARC-Easy, with questions that can often be solved by retrieval or surface-level cues, and ARC-Challenge, which includes more difficult questions requiring reasoning, commonsense understanding, and integration of world knowledge. ARC serves as a benchmark for testing a model’s ability to move beyond factual lookup toward genuine scientific reasoning.
Today’s Activity
In this section of our lab we will:
- Set up the TrustyAI operator
- Create and run the lm-eval job
- Interpret and understand results
Set up TrustyAI
- First, we’ll change the trustyai managementState from "Removed" to "Managed" in the default DataScienceCluster:

oc patch datasciencecluster default-dsc -p '{"spec":{"components":{"trustyai":{"managementState":"Managed"}}}}' --type=merge
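To confirm the change took effect, you can read the field back; the command below assumes the DataScienceCluster is named default-dsc, as in the patch above, and should print "Managed":

oc get datasciencecluster default-dsc -o jsonpath='{.spec.components.trustyai.managementState}{"\n"}'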
- Configure TrustyAI to allow downloading remote datasets from Hugging Face

By default, TrustyAI prevents evaluation jobs from accessing the internet or running downloaded code. A typical evaluation job will download two items from Hugging Face:

  - The dataset of the evaluation task, and any dataset processing code
  - The tokenizer of your model

If you trust the source of your dataset and tokenizer, you can override TrustyAI’s default setting. In our case, we’ll be downloading allenai/ai2_arc and Phi-3-mini-4k-instruct’s tokenizer. To allow those two downloads, run:

oc patch configmap trustyai-service-operator-config -n redhat-ods-applications \
  --type merge -p '{"metadata": {"annotations": {"opendatahub.io/managed": "false"}}}'

oc patch configmap trustyai-service-operator-config -n redhat-ods-applications \
  --type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'

oc rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications
Wait for your trustyai-service-operator-controller-manager pod in the redhat-ods-applications namespace to restart, and then TrustyAI should be ready to go.
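If you prefer not to watch pods manually, you can wait for the rollout to finish and confirm the new settings are in the ConfigMap; this is just a convenience check using the same names as the patches above:

oc rollout status deployment/trustyai-service-operator-controller-manager -n redhat-ods-applications
oc get configmap trustyai-service-operator-config -n redhat-ods-applications -o yaml | grep lmes-allow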
- Ensure Granite model is configured

Add the external endpoint to the workshop_code/evals/trusty/arc_easy.yaml file in the following section:

  - name: base_url
    value: https://<YOUR_EXTERNAL_INFERENCE_ENDPOINT>/v1/completions # the location of your model's completions endpoint
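Before launching the evaluation, it can be worth confirming that the endpoint responds. The curl below is only a sketch: the served model name (granite-8b) is an assumption and may differ in your deployment, so adjust it to match your InferenceService.

# Sanity-check the completions endpoint (model name is an assumption).
curl -sk https://<YOUR_EXTERNAL_INFERENCE_ENDPOINT>/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "granite-8b", "prompt": "The capital of France is", "max_tokens": 5}'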
Create and run the lm-eval job
- Run the evaluation

To start an evaluation, apply an LMEvalJob custom resource as defined in the following file:

oc apply -f workshop_code/evals/trusty/arc_easy.yaml -n vllm

Check out the arc_easy.yaml file to learn more about the LMEvalJob specification.

If everything has worked, you should see a pod called arc-easy-eval-job running in your namespace. You can watch the progress of your evaluation job by running:

watch oc logs -f arc-easy-eval-job -n vllm
You will see progression in percentage points.
Alternatively, view the logs of the model pod:
oc logs -f granite-8b-predictor-<exact-pod-name> -n vllm
You will see the exact questions getting passed to the model endpoint.
This evaluation run will take approximately 10 minutes.
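You can also poll the job’s reported state while you wait; in the LMEvalJob status this is exposed as a state field that eventually reports completion, though exact values may vary by TrustyAI version:

oc get lmevaljob arc-easy-eval-job -n vllm -o jsonpath='{.status.state}{"\n"}'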
- While You Wait: Explore lm-eval-harness

  - Check out this overview notebook to explore extensibility and task definitions.
  - View real Red Hat validated model results to understand benchmark outcomes in production contexts and to see how your favorite models rank.
Interpret and understand results
- Interpreting ARC Results

Metric: Performance is measured by multiple-choice accuracy (correct answers out of 4 options).

Baseline: ~25% corresponds to random guessing.

Performance ranges:

  - 50–60%: typical for smaller or older models.
  - ARC-Easy: many modern models exceed 70–80%.
  - ARC-Challenge: 70–80% indicates strong reasoning; 80%+ is near state-of-the-art.

Split differences: ARC-Easy emphasizes simpler retrieval-based questions, while ARC-Challenge demands multi-step reasoning and integration of world knowledge.

Implications: Higher ARC accuracy reflects stronger scientific reasoning, knowledge application, and logical problem-solving capabilities.
- Check out the results

After the evaluation finishes (it took about 8.5 minutes on my cluster), you can take a look at the results. These are stored in the status.results field of the LMEvalJob resource:

oc get LMEvalJob arc-easy-eval-job -n vllm -o jsonpath='{.status.results}' | jq '.results'
returns:

{
  "arc_easy": {
    "alias": "arc_easy",
    "acc,none": 0.8186026936026936,
    "acc_stderr,none": 0.007907153952801706,
    "acc_norm,none": 0.7836700336700336,
    "acc_norm_stderr,none": 0.00844876352205705
  }
}
Explanation of results

acc,none: This stands for accuracy. The value 0.8186 means the model answered approximately 81.86% of the questions correctly, scored by selecting the answer choice to which the model assigns the highest likelihood.

acc_stderr,none: This is the standard error of the accuracy. The value 0.0079 represents the margin of error for the accuracy score. It indicates how much the result might vary if the test were run again. A smaller number means the result is more statistically reliable.

acc_norm,none: This is the length-normalized accuracy. The value 0.7837 means the model answered about 78.37% of the questions correctly when each answer choice’s likelihood is normalized by its length, so longer answer options are not unfairly penalized. For multiple-choice benchmarks like ARC, this normalized score is often the number reported.

acc_norm_stderr,none: This is the standard error for the normalized accuracy, indicating the margin of error for that specific score.
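If you want a rough 95% confidence interval around the raw accuracy, you can compute one from the stored results with jq; this is a convenience sketch that assumes the arc_easy results shown above are present in status.results:

oc get LMEvalJob arc-easy-eval-job -n vllm -o jsonpath='{.status.results}' \
  | jq '.results.arc_easy | {acc: .["acc,none"], ci_low: (.["acc,none"] - 1.96 * .["acc_stderr,none"]), ci_high: (.["acc,none"] + 1.96 * .["acc_stderr,none"])}'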
Now you’re free to play around with evaluations! You can see the full list of evaluations supported by lm-evaluation-harness here.
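If you installed the harness locally (as in the earlier CLI sketch), recent versions can also print the registered task names directly; this runs on your workstation, not in the cluster:

lm_eval --tasks list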
TrustyAI additional references
- Try MMLU industry-focused test

In some cases, you may want to check that a model has retained accuracy on a specific, domain-focused dataset.

Let’s try the mmlu_jurisprudence dataset to test the model’s knowledge of law. Update the base_url to your external inference endpoint, then apply the job:
oc apply -f workshop_code/evals/trusty/mmlu_jurisprudence.yaml -n vllm
This will only take a minute or so to process.
oc get LMEvalJob mmlu-jurisprudence-eval-job -n vllm -o template --template '{{.status.results}}' | jq .results
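As with the ARC run, you can follow progress in the evaluation pod’s logs while the job executes; the pod name below is an assumption based on the job name, mirroring the arc-easy-eval-job pattern:

oc logs -f mmlu-jurisprudence-eval-job -n vllm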
Bonus Exercise: MMLU-Pro Evaluation
If you have additional time, try running the more challenging MMLU-Pro evaluation.
MMLU-Pro is a reasoning-focused, multiple-choice benchmark that extends the original MMLU dataset with 10-option questions across diverse academic disciplines. It’s designed to test a model’s reasoning, factual recall, and elimination skills at increased difficulty.
Key differences from standard MMLU:
- 10-option multiple choice (vs. 4-option in standard MMLU)
- More challenging questions requiring deeper reasoning
- Covers advanced topics across academic disciplines
To run MMLU-Pro evaluation, you would need to create a custom LMEvalJob configuration file similar to the ARC and jurisprudence examples, but specifying the MMLU-Pro task.
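As a starting point, here is a sketch of what such a manifest could look like. It mirrors the modelArgs structure used by the ARC job, but the task name (mmlu_pro), served model name, and tokenizer value are assumptions; verify the field names against the arc_easy.yaml shipped with the workshop and check the lm-eval-harness task list before applying it.

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: mmlu-pro-eval-job
spec:
  model: local-completions              # OpenAI-compatible completions endpoint
  taskList:
    taskNames:
      - mmlu_pro                        # assumed lm-eval-harness task name for MMLU-Pro
  logSamples: true
  modelArgs:
    - name: model
      value: granite-8b                 # served model name (assumption; check your InferenceService)
    - name: base_url
      value: https://<YOUR_EXTERNAL_INFERENCE_ENDPOINT>/v1/completions
    - name: num_concurrent
      value: "1"
    - name: tokenizer
      value: microsoft/Phi-3-mini-4k-instruct   # tokenizer referenced earlier in this lab

Apply it the same way as the other jobs (oc apply -f <your-file>.yaml -n vllm) and read the results from status.results once the job completes.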
Expected Performance Ranges for MMLU-Pro:
- ~10% = random guessing baseline (10-option multiple choice)
- ~30-50% = typical for smaller or untuned models
- ~60-70%+ = strong reasoning capability
MMLU-Pro evaluations typically take longer due to the increased difficulty and dataset size.
Summary
What We Did:
- Set up TrustyAI operator - enabled model evaluation framework in OpenShift AI
- Configured internet access - allowed downloading of evaluation datasets from Hugging Face
- Connected to deployed model - linked evaluation job to the Granite 8B inference service
- Ran ARC Easy benchmark - tested model’s reasoning on grade-school science questions
- Analyzed results - achieved 81.8% accuracy, indicating strong reasoning performance

Key Outcome:
- Successfully evaluated deployed AI model accuracy using industry-standard benchmarks through TrustyAI + lm-eval-harness

Tools Used:
- TrustyAI: Enterprise evaluation operator
- lm-eval-harness: Standard benchmarking framework
- ARC Easy: Science reasoning benchmark
Bottom Line: Demonstrated how to measure and validate AI model accuracy in production using automated evaluation pipelines.