Model Evaluation Module

In This Workshop

In this hands-on, example-driven lab, you’ll move beyond leaderboard metrics to explore:

  • System-level benchmarking with GuideLLM (a sample invocation follows this list)

  • Task-level accuracy evaluation with lm-eval-harness
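
To give you a feel for the first tool, here is a minimal sketch of a GuideLLM benchmark run pointed at an OpenAI-compatible inference endpoint. The target URL and workload shape are assumptions for illustration only; in this lab the equivalent run is driven by a Tekton pipeline, and exact flags can vary between GuideLLM releases, so check `guidellm benchmark --help` for your version.

```bash
# Illustrative GuideLLM benchmark against a locally served model
# (endpoint URL and workload values are placeholders, not the lab's settings).
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type sweep \
  --max-seconds 120 \
  --data "prompt_tokens=256,output_tokens=128"
```

A sweep like this probes the endpoint at increasing request rates and reports throughput and latency statistics, which is the system-level view this workshop focuses on.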

By the end of this hands-on experience, you’ll know how to:

  • Set up and use GuideLLM via a Tekton pipeline to evaluate model performance

  • Set up and use lm-eval-harness via TrustyAI to evaluate model accuracy (see the sketch after this list)

  • Interpret results meaningfully across metrics such as accuracy, throughput, and latency

  • Adjust tooling parameters to align LLM behavior with production SLAs and expectations
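
To connect the accuracy objective to concrete tooling, the sketch below shows a direct lm-eval-harness invocation. In the lab you will launch the harness through TrustyAI on the cluster rather than from a terminal, so treat the model ID, tasks, and flags here as illustrative assumptions, not the lab's exact configuration.

```bash
# Illustrative lm-eval-harness run; substitute your own model and tasks.
# In this workshop, TrustyAI drives the harness against your deployed model instead.
lm_eval \
  --model hf \
  --model_args pretrained=facebook/opt-125m \
  --tasks arc_easy,hellaswag \
  --num_fewshot 0 \
  --batch_size 8
```

The harness reports per-task accuracy scores, which you will interpret alongside the throughput and latency numbers from GuideLLM when judging whether a model meets your production expectations.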