Evaluating Model Accuracy with TrustyAI lm-eval-harness service
While performance metrics like latency and throughput are critical for deploying efficient GenAI systems, task-level accuracy and reasoning quality are equally essential for selecting or fine-tuning a model. In this activity, we use the popular lm-eval-harness framework to evaluate how well a language model performs across established benchmarks, focusing on reasoning and subject matter understanding.
What is TrustyAI?
TrustyAI is an open-source AI explainability and trustworthiness platform designed to help developers and data scientists understand and monitor their machine learning models. It provides tools to analyze predictions, identify biases, and ensure that AI systems are fair, transparent, and reliable. As part of its comprehensive toolkit, TrustyAI integrates the popular lm-eval harness to specifically benchmark and evaluate the performance of large language models against standardized tests, allowing users not only to understand why a model makes a decision but also to quantitatively measure its accuracy and capabilities. This combination of explainability and performance evaluation enables organizations to build more responsible, ethical, and robust AI applications.
What is lm-eval-harness?
lm-eval-harness is a community-maintained benchmarking toolkit from EleutherAI. It enables consistent, reproducible evaluation of large language models (LLMs) across dozens of academic and real-world benchmarks, such as:
- MMLU (Massive Multitask Language Understanding)
- HellaSwag, ARC, and Winogrande
- Question answering, common sense reasoning, reading comprehension, and more
The framework supports both open-source models and OpenAI-compatible endpoints, and can be customized with additional tasks, prompt templates, and evaluation metrics.
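For context, the same harness can also be run directly from a workstation against any OpenAI-compatible endpoint. The sketch below is illustrative only: the endpoint URL, served model name, and tokenizer are placeholders/assumptions, and exact flags can vary between harness versions. In this lab we will instead launch evaluations in-cluster through TrustyAI’s LMEvalJob resource.

# Minimal sketch: evaluate an OpenAI-compatible completions endpoint with lm-eval-harness.
# The URL, model name, and tokenizer below are placeholders, not values from this lab.
pip install lm-eval
lm_eval --model local-completions \
  --model_args model=granite-8b,base_url=https://<YOUR_ENDPOINT>/v1/completions,tokenizer=microsoft/Phi-3-mini-4k-instruct,num_concurrent=1 \
  --tasks arc_easy \
  --limit 10   # score only a handful of examples as a quick smoke test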
ARC (AI2 Reasoning Challenge)
Today we will be running the ARC evaluation.
ARC (AI2 Reasoning Challenge) is a multiple-choice question answering dataset designed to evaluate advanced reasoning and knowledge application in AI systems. It contains science exam questions from grades 3 to 9, split into two subsets: ARC-Easy, with questions that can often be solved by retrieval or surface-level cues, and ARC-Challenge, which includes more difficult questions requiring reasoning, commonsense understanding, and integration of world knowledge. ARC serves as a benchmark for testing a model’s ability to move beyond factual lookup toward genuine scientific reasoning.
Today’s Activity
In this section of our lab we will:
- Set up the TrustyAI operator
- Create and run the lm-eval job
- Interpret and understand results
Set up TrustyAI
- First, we’ll change the trustyai managementState from "Removed" to "Managed" in the default DataScienceCluster:

oc patch datasciencecluster default-dsc -p '{"spec":{"components":{"trustyai":{"managementState":"Managed"}}}}' --type=merge
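To confirm the change took effect, you can read the field back; the command below assumes the DataScienceCluster is named default-dsc, as in the patch above, and should print "Managed":

oc get datasciencecluster default-dsc -o jsonpath='{.spec.components.trustyai.managementState}{"\n"}'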
- Configure TrustyAI to allow downloading remote datasets from Hugging Face

By default, TrustyAI prevents evaluation jobs from accessing the internet or running downloaded code. A typical evaluation job will download two items from Hugging Face:

  - The dataset of the evaluation task, and any dataset processing code
  - The tokenizer of your model

If you trust the source of your dataset and tokenizer, you can override TrustyAI’s default setting. In our case, we’ll be downloading allenai/ai2_arc and Phi-3-mini-4k-instruct’s tokenizer. To allow those two downloads, run:

oc patch configmap trustyai-service-operator-config -n redhat-ods-applications \
  --type merge -p '{"metadata": {"annotations": {"opendatahub.io/managed": "false"}}}'

oc patch configmap trustyai-service-operator-config -n redhat-ods-applications \
  --type merge -p '{"data":{"lmes-allow-online":"true","lmes-allow-code-execution":"true"}}'

oc rollout restart deployment trustyai-service-operator-controller-manager -n redhat-ods-applications
Wait for your trustyai-service-operator-controller-manager pod in the redhat-ods-applications namespace to restart, and then TrustyAI should be ready to go.
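If you prefer not to watch pods manually, you can wait for the rollout to finish and confirm the new settings are in the ConfigMap; this is just a convenience check using the same names as the patches above:

oc rollout status deployment/trustyai-service-operator-controller-manager -n redhat-ods-applications
oc get configmap trustyai-service-operator-config -n redhat-ods-applications -o yaml | grep lmes-allow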
- Ensure Granite model is configured

Add the external endpoint to the workshop_code/evals/trusty/arc_easy.yaml file in the following section:

  - name: base_url
    value: https://<YOUR_EXTERNAL_INFERENCE_ENDPOINT>/v1/completions # the location of your model's completions endpoint
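Before launching the evaluation, it can be worth confirming that the endpoint responds. The curl below is only a sketch: the served model name (granite-8b) is an assumption and may differ in your deployment, so adjust it to match your InferenceService.

# Sanity-check the completions endpoint (model name is an assumption).
curl -sk https://<YOUR_EXTERNAL_INFERENCE_ENDPOINT>/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "granite-8b", "prompt": "The capital of France is", "max_tokens": 5}'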
Create and run the lm-eval job
- Run the evaluation

To start an evaluation, apply an LMEvalJob custom resource as defined in the following file:

oc apply -f workshop_code/evals/trusty/arc_easy.yaml -n vllm

Check out the arc_easy.yaml file to learn more about the LMEvalJob specification.

If everything has worked, you should see a pod called arc-easy-eval-job running in your namespace. You can watch the progress of your evaluation job by running:

watch oc logs -f arc-easy-eval-job -n vllm
You will see progression in percentage points.
Alternatively, view the logs of the model pod:
oc logs -f granite-8b-predictor-<exact-pod-name> -n vllm
You will see the exact questions getting passed to the model endpoint.
This evaluation run will take approximately 10 minutes.
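You can also poll the job’s reported state while you wait; in the LMEvalJob status this is exposed as a state field that eventually reports completion, though exact values may vary by TrustyAI version:

oc get lmevaljob arc-easy-eval-job -n vllm -o jsonpath='{.status.state}{"\n"}'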
- While You Wait: Explore lm-eval-harness

  - Check out this overview notebook to explore extensibility and task definitions.
  - View real Red Hat validated model results to understand benchmark outcomes in production contexts and to see how your favorite models rank.
Interpret and understand results
- Interpreting ARC Results

Metric: Performance is measured by multiple-choice accuracy (correct answers out of 4 options).

Baseline: ~25% corresponds to random guessing.

Performance ranges:

  - 50–60%: typical for smaller or older models.
  - ARC-Easy: many modern models exceed 70–80%.
  - ARC-Challenge: 70–80% indicates strong reasoning; 80%+ is near state-of-the-art.

Split differences: ARC-Easy emphasizes simpler retrieval-based questions, while ARC-Challenge demands multi-step reasoning and integration of world knowledge.

Implications: Higher ARC accuracy reflects stronger scientific reasoning, knowledge application, and logical problem-solving capabilities.
- Check out the results

After the evaluation finishes (it took about 8.5 minutes on my cluster), you can take a look at the results. These are stored in the status.results field of the LMEvalJob resource:

oc get LMEvalJob arc-easy-eval-job -n vllm -o jsonpath='{.status.results}' | jq '.results'
returns:

{
  "arc_easy": {
    "alias": "arc_easy",
    "acc,none": 0.8186026936026936,
    "acc_stderr,none": 0.007907153952801706,
    "acc_norm,none": 0.7836700336700336,
    "acc_norm_stderr,none": 0.00844876352205705
  }
}
Explanation of results

acc,none: This stands for accuracy. The value 0.8186 means the model answered approximately 81.86% of the questions correctly, scored by selecting the answer choice to which the model assigns the highest likelihood.

acc_stderr,none: This is the standard error of the accuracy. The value 0.0079 represents the margin of error for the accuracy score. It indicates how much the result might vary if the test were run again. A smaller number means the result is more statistically reliable.

acc_norm,none: This is the length-normalized accuracy. The value 0.7837 means the model answered about 78.37% of the questions correctly when each answer choice’s likelihood is normalized by its length, so longer answer options are not unfairly penalized. For multiple-choice benchmarks like ARC, this normalized score is often the number reported.

acc_norm_stderr,none: This is the standard error for the normalized accuracy, indicating the margin of error for that specific score.
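If you want a rough 95% confidence interval around the raw accuracy, you can compute one from the stored results with jq; this is a convenience sketch that assumes the arc_easy results shown above are present in status.results:

oc get LMEvalJob arc-easy-eval-job -n vllm -o jsonpath='{.status.results}' \
  | jq '.results.arc_easy | {acc: .["acc,none"], ci_low: (.["acc,none"] - 1.96 * .["acc_stderr,none"]), ci_high: (.["acc,none"] + 1.96 * .["acc_stderr,none"])}'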
Now you’re free to play around with evaluations! You can see the full list of evaluations supported by lm-evaluation-harness here.
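If you installed the harness locally (as in the earlier CLI sketch), recent versions can also print the registered task names directly; this runs on your workstation, not in the cluster:

lm_eval --tasks list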
TrustyAI additional references
- Try MMLU industry-focused test

In some cases, you may want to check that a model has retained accuracy on a specific, domain-focused dataset.

Let’s try the mmlu_jurisprudence dataset to test the model’s knowledge of law. Update the base_url to your external inference endpoint, then apply the job:
oc apply -f workshop_code/evals/trusty/mmlu_jurisprudence.yaml -n vllm
This will only take a minute or so to process.
oc get LMEvalJob mmlu-jurisprudence-eval-job -n vllm -o template --template '{{.status.results}}' | jq .results
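As with the ARC run, you can follow progress in the evaluation pod’s logs while the job executes; the pod name below is an assumption based on the job name, mirroring the arc-easy-eval-job pattern:

oc logs -f mmlu-jurisprudence-eval-job -n vllm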
Bonus Exercise: MMLU-Pro Evaluation
If you have additional time, try running the more challenging MMLU-Pro evaluation.
MMLU-Pro is a reasoning-focused, multiple-choice benchmark that extends the original MMLU dataset with 10-option questions across diverse academic disciplines. It’s designed to test a model’s reasoning, factual recall, and elimination skills at increased difficulty.
Key differences from standard MMLU:
- 10-option multiple choice (vs. 4-option in standard MMLU)
- More challenging questions requiring deeper reasoning
- Covers advanced topics across academic disciplines
To run MMLU-Pro evaluation, you would need to create a custom LMEvalJob configuration file similar to the ARC and jurisprudence examples, but specifying the MMLU-Pro task.
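As a starting point, here is a sketch of what such a manifest could look like. It mirrors the modelArgs structure used by the ARC job, but the task name (mmlu_pro), served model name, and tokenizer value are assumptions; verify the field names against the arc_easy.yaml shipped with the workshop and check the lm-eval-harness task list before applying it.

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: mmlu-pro-eval-job
spec:
  model: local-completions              # OpenAI-compatible completions endpoint
  taskList:
    taskNames:
      - mmlu_pro                        # assumed lm-eval-harness task name for MMLU-Pro
  logSamples: true
  modelArgs:
    - name: model
      value: granite-8b                 # served model name (assumption; check your InferenceService)
    - name: base_url
      value: https://<YOUR_EXTERNAL_INFERENCE_ENDPOINT>/v1/completions
    - name: num_concurrent
      value: "1"
    - name: tokenizer
      value: microsoft/Phi-3-mini-4k-instruct   # tokenizer referenced earlier in this lab

Apply it the same way as the other jobs (oc apply -f <your-file>.yaml -n vllm) and read the results from status.results once the job completes.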
Expected Performance Ranges for MMLU-Pro:
- ~10% = random guessing baseline (10-option multiple choice)
- ~30-50% = typical for smaller or untuned models
- ~60-70%+ = strong reasoning capability
MMLU-Pro evaluations typically take longer due to the increased difficulty and dataset size.
Summary
What We Did:
- Set up TrustyAI operator - enabled model evaluation framework in OpenShift AI
- Configured internet access - allowed downloading of evaluation datasets from Hugging Face
- Connected to deployed model - linked evaluation job to the Granite 8B inference service
- Ran ARC Easy benchmark - tested model’s reasoning on grade-school science questions
- Analyzed results - achieved 81.8% accuracy, indicating strong reasoning performance

Key Outcome:
- Successfully evaluated deployed AI model accuracy using industry-standard benchmarks through TrustyAI + lm-eval-harness

Tools Used:
- TrustyAI: Enterprise evaluation operator
- lm-eval-harness: Standard benchmarking framework
- ARC Easy: Science reasoning benchmark
Bottom Line: Demonstrated how to measure and validate AI model accuracy in production using automated evaluation pipelines.