Evaluation and Observability
Persona: AI Engineer (primary). Also relevant: SRE/Platform Engineer.
In this module
Learn a practical, high-level flow to combine agent evaluation with observability so you can measure, understand, and improve behavior over time.
Estimated time: 15-20 minutes
What you’ll do (including stretch goals)
- Plan evaluation scenarios and acceptance criteria
- Run evaluations and capture outputs, tool calls, and reasoning traces (see the sketch after this list)
- Instrument the service to emit logs/metrics/traces that align to evaluation steps
- Add guardrails/safety toggles and observe impact in both eval results and signals
- Establish dashboards/alerts to continuously monitor the same signals
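If you want something concrete to adapt, below is a minimal sketch of an evaluation harness in Python. The `agent` callable, the scenario fields, and the shape of the returned dict are assumptions for illustration; swap in your own agent interface and logging setup.

```python
import json
import logging
import time
from typing import Callable

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent_eval")


def run_scenario(agent: Callable[[str], dict], scenario_id: str, prompt: str) -> dict:
    """Run one evaluation scenario and emit a structured log record.

    `agent` is a hypothetical callable assumed to return a dict like:
    {"answer": str, "tool_calls": [...], "trace": [...]}.
    """
    start = time.perf_counter()
    result = agent(prompt)
    latency_ms = (time.perf_counter() - start) * 1000

    record = {
        "scenario_id": scenario_id,
        "prompt": prompt,
        "answer": result.get("answer"),
        "tool_calls": result.get("tool_calls", []),
        "trace": result.get("trace", []),
        "latency_ms": round(latency_ms, 1),
    }
    # One JSON line per run keeps eval output joinable with service logs and traces later.
    logger.info(json.dumps(record))
    return record
```

Emitting one structured record per run is what makes the later steps possible: the same fields feed both the pass/fail comparison and your dashboards.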
Flow overview
- Define what “good” looks like for your agent (correctness, stability, latency, tool success)
- Evaluate with representative prompts/tasks; record outputs and intermediate steps
- Observe service-level signals (logs, metrics, traces) during the same runs
- Compare results against criteria; identify where behavior or performance falls short (see the sketch after this list)
- Tweak prompts, tools, or guardrails; re-run to verify improvements
- Keep dashboards and alerts keyed to the same signals used in evaluation
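As a sketch of the “compare results against criteria” step, the helper below checks a batch of records (like those produced by `run_scenario` above) against latency and tool-success thresholds. The thresholds, field names, and the `status == "ok"` convention are assumptions, not prescribed values.

```python
def check_criteria(records: list[dict],
                   max_p50_latency_ms: float = 2000.0,
                   min_tool_success_rate: float = 0.95) -> dict:
    """Compare evaluation records against acceptance criteria.

    Thresholds are illustrative; use the targets your team agreed on.
    """
    # Median latency across the batch.
    latencies = sorted(r["latency_ms"] for r in records)
    p50_latency_ms = latencies[len(latencies) // 2]

    # Fraction of tool calls that completed successfully.
    tool_calls = [call for r in records for call in r.get("tool_calls", [])]
    successes = sum(1 for call in tool_calls if call.get("status") == "ok")
    tool_success_rate = successes / len(tool_calls) if tool_calls else 1.0

    return {
        "p50_latency_ms": p50_latency_ms,
        "latency_ok": p50_latency_ms <= max_p50_latency_ms,
        "tool_success_rate": round(tool_success_rate, 3),
        "tools_ok": tool_success_rate >= min_tool_success_rate,
    }
```

A failing check points you at which lever to adjust (prompts, tools, or guardrails) before re-running, and the same thresholds can back your dashboard alerts.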
Success criteria
- The agent produces consistent answers across N repeated runs for a fixed prompt (tolerance defined by your team)
- Tool-call failures/timeouts trend downward after guardrail/safety changes
- Latency and error rates meet targets; traces show fewer retries and faster resolutions
- Approximate MTTR improvement: measure the time from failure detection to GitHub issue creation before and after the final step (see the sketch after this list)
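Two of these criteria are straightforward to quantify. The sketch below assumes the same hypothetical `agent` callable as earlier, string answers, and ISO-8601 timestamps for the detection and issue-creation events.

```python
from collections import Counter
from datetime import datetime


def consistency_rate(agent, prompt: str, n: int = 5) -> float:
    """Repeat a fixed prompt n times; return the share of runs that
    match the most common answer (1.0 = fully consistent)."""
    answers = [agent(prompt).get("answer") for _ in range(n)]
    return Counter(answers).most_common(1)[0][1] / n


def mttr_minutes(events: list[tuple[str, str]]) -> float:
    """Approximate MTTR in minutes: average time from failure detection
    to GitHub issue creation, given (detected_at, issue_created_at) pairs."""
    deltas = [
        (datetime.fromisoformat(created) - datetime.fromisoformat(detected)).total_seconds() / 60
        for detected, created in events
    ]
    return sum(deltas) / len(deltas)
```

Compute `mttr_minutes` over incidents from before and after your change and compare the two averages against your team’s tolerance.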
Open up the following notebook in your workspace.