Model Evaluation Module

In This Workshop

In this hands-on, example-driven lab, you’ll move beyond leaderboard metrics to explore:

  • System-level benchmarking with GuideLLM (a sample invocation follows this list)

  • Task-level accuracy evaluation with lm-eval-harness
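
To give you a feel for the first tool, here is a minimal sketch of a GuideLLM benchmark run pointed at an OpenAI-compatible inference endpoint. The target URL and workload shape are assumptions for illustration only; in this lab the equivalent run is driven by a Tekton pipeline, and exact flags can vary between GuideLLM releases, so check `guidellm benchmark --help` for your version.

```bash
# Illustrative GuideLLM benchmark against a locally served model
# (endpoint URL and workload values are placeholders, not the lab's settings).
guidellm benchmark \
  --target "http://localhost:8000" \
  --rate-type sweep \
  --max-seconds 120 \
  --data "prompt_tokens=256,output_tokens=128"
```

A sweep like this probes the endpoint at increasing request rates and reports throughput and latency statistics, which is the system-level view this workshop focuses on.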

By the end of this hands-on experience, you’ll know how to:

  • Set up and use GuideLLM via a Tekton pipeline to evaluate model performance

  • Set up and use lm-eval-harness via TrustyAI to evaluate model accuracy (see the sketch after this list)

  • Interpret results meaningfully across metrics such as accuracy, throughput, and latency

  • Adjust tooling parameters to align LLM behavior with production SLAs and expectations
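
To connect the accuracy objective to concrete tooling, the sketch below shows a direct lm-eval-harness invocation. In the lab you will launch the harness through TrustyAI on the cluster rather than from a terminal, so treat the model ID, tasks, and flags here as illustrative assumptions, not the lab's exact configuration.

```bash
# Illustrative lm-eval-harness run; substitute your own model and tasks.
# In this workshop, TrustyAI drives the harness against your deployed model instead.
lm_eval \
  --model hf \
  --model_args pretrained=facebook/opt-125m \
  --tasks arc_easy,hellaswag \
  --num_fewshot 0 \
  --batch_size 8
```

The harness reports per-task accuracy scores, which you will interpret alongside the throughput and latency numbers from GuideLLM when judging whether a model meets your production expectations.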