Lab Guide: Agentic AI Agents with Llama Stack Clients
1. Introduction
In the next steps we are going to create multiple agents using the Llama Stack client Python package. All agents will be defined within the same Jupyter notebook. The agents can use the models and tools defined within the Llama Stack config we deployed in the previous part of the lab.
2. Agent/Model configuration parameters
In the following parts of the lab, we will configure agents in different ways. Their configurations use the following options. Feel free to experiment with these parameters throughout the exercises (a short illustrative sketch follows the list):
- max_infer_iters (Agent setting) - Upper limit on how many inference/tool-execution cycles the agent can perform before it must terminate and return a response.
- type (Model setting) - Refers to the selected sampling strategy used by the model. Check this link to read more about sampling strategies.
- temperature (Model setting) - Sampling temperature used for generation. The higher the temperature, the more random the output of the model.
- top_p (Model setting) - Defines the cumulative probability of tokens that should be considered for each subsequent token during the model's output generation. By increasing or decreasing Top P, you can explore how repetitive or varied responses get, particularly in their vocabulary and phrasing.
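To make the relationship between these settings concrete, here is a minimal sketch of how they are typically passed when an agent is defined. The values are illustrative only; the exercises below use their own settings.
# Illustrative values only; follows the same pattern used in the exercises below.
sampling_params = {
    "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.9},
}
# max_infer_iters is passed next to sampling_params when creating the Agent, e.g.:
# Agent(client, model="...", instructions="...", max_infer_iters=5, sampling_params=sampling_params)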
3. Notebook Setup
- Access OpenShift AI and create a new workbench within a project of your choice using the Jupyter Image with Python 3.12 (you may need to scale the worker MachineSet within your OpenShift cluster to have enough resources available).
- Open the workbench and create a new notebook.
- Install the llama-stack and llama_stack_client libraries within a code block:
!pip install -qq llama-stack==0.2.23 llama_stack_client==0.2.23
- Import the necessary libraries:
import os
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
- Define the connection details for the Llama Stack server. By default, this will use the internal Kubernetes service name (a quick connectivity check is sketched after this list).
LLAMA_STACK_SERVER_HOST = os.getenv("LLAMA_STACK_SERVER_HOST", "llamastack-with-config-service.llama-stack.svc.cluster.local")
LLAMA_STACK_SERVER_PORT = os.getenv("LLAMA_STACK_SERVER_PORT", "8321")
- Your notebook should look like this now:
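Optionally, you can verify the connection before creating any agents. The snippet below is a minimal sketch that instantiates the client and prints the models registered in the Llama Stack config; it assumes the server is reachable from the workbench and that your llama_stack_client version exposes models.list().
client = LlamaStackClient(base_url=f"http://{LLAMA_STACK_SERVER_HOST}:{LLAMA_STACK_SERVER_PORT}")
# Print the identifiers of the models registered in the deployed Llama Stack config
for model in client.models.list():
    print(model.identifier)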
4. Agent 1 - Web Search
In this step we will explore the web search tool which enables an agent to fetch information from the web.
Instantiate the client and create an agent. This agent is configured to use the llama-4-scout-17b-16e-w4a16 model and has access to the web search tool.
client_websearch = LlamaStackClient(base_url=f"http://{LLAMA_STACK_SERVER_HOST}:{LLAMA_STACK_SERVER_PORT}")
agent_websearch = Agent(
client_websearch,
model="llama-4-scout-17b-16e-w4a16",
instructions="You are a helpful assistant.",
tools=[
"builtin::websearch",
],
max_infer_iters=5,
sampling_params={
"strategy": {"type": "top_p", "temperature": 0.1, "top_p": 0.95},
},
)
session_websearch = agent_websearch.create_session("monitored_session")
Now you can ask the agent questions. This first example uses the web search tool to find the current OpenShift release. The agent sends the task to the defined model. The model can decide whether to use any of the available tools or to solve the task with its own capabilities. If the model decides to use a tool, it responds to the agent with the selected tool call. The agent executes the tool and returns the result back to the LLM. A maximum number of steps is defined (max_infer_iters=5) that the model can take before it must send a final response back to the agent.
Note: If you see any Python import errors when executing a single cell, run all cells before the one displaying the error. You can execute all cells via the 'Run All Cells' option within the 'Run' menu.
response = agent_websearch.create_turn(
messages=[{"role": "user", "content": "Whats the current Red Hat OpenShift release?"}],
session_id=session_websearch,
)
for log in AgentEventLogger().log(response):
    log.print()
Note: If you see a wrong answer, you can rerun the command via the 'Run Selected Cell' option within the 'Run' menu.
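Because turns belong to a session, the agent keeps the context of previous turns. As a quick check, you can ask a follow-up question in the same session, reusing exactly the pattern from above (the question itself is just an example):
response = agent_websearch.create_turn(
    messages=[{"role": "user", "content": "When was that release made generally available?"}],
    session_id=session_websearch,
)
for log in AgentEventLogger().log(response):
    log.print()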
5. Agent 2 - OpenShift MCP Server
The second agent we create has access to the OpenShift MCP Server to retrieve OpenShift API information. The agent uses the 'llama-3-2-3b' model.
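Before creating the agent, you can optionally confirm that the mcp::openshift toolgroup is registered on the Llama Stack server. This is a small sketch that assumes your llama_stack_client version exposes toolgroups.list():
client_check = LlamaStackClient(base_url=f"http://{LLAMA_STACK_SERVER_HOST}:{LLAMA_STACK_SERVER_PORT}")
# "mcp::openshift" should appear among the registered toolgroups
for toolgroup in client_check.toolgroups.list():
    print(toolgroup.identifier)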
client_mcp_ocp = LlamaStackClient(base_url=f"http://{LLAMA_STACK_SERVER_HOST}:{LLAMA_STACK_SERVER_PORT}")
agent_mcp_ocp = Agent(
client_mcp_ocp,
model="llama-3-2-3b",
instructions="You are a helpful assistant",
tools=[
"mcp::openshift"
],
max_infer_iters=5,
sampling_params={
"strategy": {"type": "top_p", "temperature": 0.1, "top_p": 0.95},
"max_tokens": 8000,
},
)
session_mcp_ocp = agent_mcp_ocp.create_session("monitored_session")
It’s now possible to ask the agent questions about the OpenShift cluster; the agent is able to receive the data via the MCP server:
response = agent_mcp_ocp.create_turn(
messages=[{"role": "user", "content": "What pods are running in the llama-stack namespace?"}],
session_id=session_mcp_ocp,
)
for log in AgentEventLogger().log(response):
    log.print()
6. Agent 3 - Web Search & MCP
The third agent we create has access to the MCP server as well as the web search tool.
client_multi = LlamaStackClient(base_url=f"http://{LLAMA_STACK_SERVER_HOST}:{LLAMA_STACK_SERVER_PORT}")
agent_multi = Agent(
client_multi,
model="llama-4-scout-17b-16e-w4a16",
instructions="You are an assistant helping to debug OpenShift cluster issues.",
tools=[
"mcp::openshift",
"builtin::websearch"
],
max_infer_iters=5,
sampling_params={
"strategy": {"type": "top_p", "temperature": 0.1, "top_p": 0.95},
"max_tokens": 8000,
},
)
session_multi = agent_multi.create_session("monitored_session")
As an investigation target for the agent, let’s apply a deployment to our cluster that will fail if a specific environment variable is not set:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fail-crash-loop
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fail-crash-loop
  template:
    metadata:
      labels:
        app: fail-crash-loop
    spec:
      containers:
      - name: alpine
        image: alpine:3.19
        command:
        - sh
        - -c
        - |
          if [ -z "$IMPORTANT_MESSAGE" ]; then
            echo "ERROR: IMPORTANT_MESSAGE is not set. Exiting."
            exit 1
          else
            echo "IMPORTANT_MESSAGE is set to: $IMPORTANT_MESSAGE"
            sleep 3600
          fi
Save the manifest as broken-deployment.yaml and apply it to the cluster using oc apply -f broken-deployment.yaml.
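Before handing the investigation to the agent, you can confirm that the pod is actually failing, for example with oc get pods -n default. After a short while the pod should show a status such as Error or CrashLoopBackOff.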
It’s now possible to ask the agent questions about the OpenShift cluster; the agent is able to receive data via the MCP server:
messages=[{"role": "user", "content": "Search for pods having problems in the default namespace using the OpenShift mcp."},
{"role": "user", "content": "Investigate the failing resource and suggest a fix"},
{"role": "user", "content": "Look up relevant troubleshooting information from the web."}
]
for message in messages:
    print("\n"+"="*50)
    print(f"Processing user query: {message}")
    print("="*50)
    response = agent_multi.create_turn(
        messages=[message],
        session_id=session_multi,
    )
    for log in AgentEventLogger().log(response):
        log.print()
The agent will use the two tools to answer the user prompts:
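Once the agent has identified the missing environment variable, you can apply the fix yourself and watch the pod recover. One possible fix (the value is just an example) is to set the variable directly on the deployment:
oc set env deployment/fail-crash-loop IMPORTANT_MESSAGE="hello from the lab" -n default
oc get pods -n default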
7. Bonus Notebook - Llama Stack as OpenAI API drop-in replacement
Until now we have explored the built-in agent capabilities of Llama Stack via the Llama Stack Python client library. In this section we are going to use the Llama Stack server as an OpenAI API drop-in replacement. This functionality is important because it gives customers the choice to select their preferred agentic framework.
- Create a new notebook file within the workbench we have used so far for the Llama Stack agents (click on File → New → Notebook).
- Install the openai Python client:
!pip install -qq openai
- Import the client:
from openai import OpenAI
- Define the OpenAI API endpoint:
OPENAI_URL="http://llamastack-with-config-service.llama-stack.svc.cluster.local:8321/v1/openai/v1"
- Define the client:
client=OpenAI(base_url=OPENAI_URL, api_key="none")
- Test the connection by listing the available models:
models = client.models.list()
print(models)
- In the next step we can use the OpenAI Responses API endpoint (which is provided by the Llama Stack server):
response = client.responses.create(
input=[
{
"role": "system",
"content": (
"You are an assistant that can search the web when needed. "
"Always verify information from search results and summarize concisely."
)
},
{
"role": "user",
"content": "Find the most recent Openshift Release"
}
],
model="llama-4-scout-17b-16e-w4a16",
instructions="Search the web and summarize information you found",
store=True,
stream=False,
temperature=0.2,
text={
"format": {
"type": "text",
"name": "web_summary",
"schema": {},
"description": "Summarized result of the web search.",
"strict": True
}
},
tools=[
{
"type": "web_search",
"search_context_size": "medium"
}
],
)
# Inspect the response
print("Response ID:", response.id)
print("Output Type:", response.output[0].type if response.output else "None")
print("Output Text:", getattr(response, "output_text", None))
print("#"*80)
print("Full response object: ", response)
Since the Responses API has built-in support for tool calls, the model can leverage this capability. The mechanism works the same way as with the agents we’ve already seen: the model requests the tool execution, and the agent — in this case, the Responses API of the Llama Stack server — handles it.
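Besides the Responses API, the same OpenAI client can also talk to the Chat Completions endpoint served by Llama Stack. A minimal sketch, assuming the chat completions route is enabled in the deployed config and reusing the client defined above:
completion = client.chat.completions.create(
    model="llama-3-2-3b",
    messages=[{"role": "user", "content": "In one sentence, what is Llama Stack?"}],
    temperature=0.2,
)
# The assistant's reply is in the first choice
print(completion.choices[0].message.content)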
8. Next steps
If there is enough time within the session, you can adapt the available agent inputs (edit the content parts of the messages) and, for example, explore the following:
- Explore the web search. Ask for specific information about a recent event (sports, concerts, etc.).
- Explore the OpenShift MCP server with its tools.
- See the Llama Stack storage in action. Add a piece of information to the input array and ask for it in the next entry (see the sketch after this list).
- Adapt the different configuration settings of the agents.
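As a starting point for the storage exercise, the sketch below reuses the web search agent and its session from earlier in the notebook: the first turn states a fact, the second turn asks for it back. The fact itself is arbitrary.
first = agent_websearch.create_turn(
    messages=[{"role": "user", "content": "For later reference: my favorite OpenShift namespace is llama-stack."}],
    session_id=session_websearch,
)
for log in AgentEventLogger().log(first):
    log.print()

followup = agent_websearch.create_turn(
    messages=[{"role": "user", "content": "Which namespace did I say was my favorite?"}],
    session_id=session_websearch,
)
for log in AgentEventLogger().log(followup):
    log.print()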