Lab Setup and Prerequisites

Prerequisites

Before starting this lab, ensure you have the following:

Required Tools

  • GitHub Account: You’ll need access to fork repositories and collaborate

  • Git CLI: Installed and configured on your local machine

  • OpenShift CLI (oc): Download from OpenShift client downloads

  • HashiCorp Vault CLI: Download from Vault installation guide

  • Terminal/Command Line: Bash shell with basic utilities (openssl, curl)

  • Ansible Vault: Part of the Ansible toolkit, used for secret management

  • Web Browser: For accessing OpenShift console, MaaS portal, and documentation

Required Skills

  • Basic Git Knowledge: Cloning repositories, basic version control concepts

  • Command Line Basics: Navigating directories, running commands

  • Container Concepts: Understanding of containers and Kubernetes/OpenShift (helpful but not required)

Provided by Instructors

  • OpenShift Cluster Access: GPU-enabled cluster with admin credentials

  • Red Hat MaaS Access: Model-as-a-Service credentials for LLaMA models

  • Workshop Materials: All necessary configuration files and scripts

  • Environment Variables: Cluster-specific configuration values

  • Support: Technical assistance throughout the lab

Getting Started

Before diving into the agentic AI lab, we need to set up our development environment. This involves two key steps:

  1. Setting up GitOps: Configure automated deployment pipelines that manage our infrastructure and applications

  2. Configuring Secret Management: Set up secure handling of API keys and credentials using HashiCorp Vault

Why this matters: This approach keeps sensitive information (like API keys) separate from our code, following security best practices while enabling automated deployments.

For this lab: We’ll use the automated bootstrap scripts to handle setup quickly so we can focus on building AI agents.

For later exploration (after the lab):

  • Manual setup: The step-by-step instructions below show how each component works - perfect for understanding GitOps and secret management in detail

Team Setup

  1. You’ll be working in teams of 2 people per cluster

    Why teams of two?

    • Resource optimization: GPU-enabled OpenShift clusters are expensive - sharing clusters allows us to provide everyone with powerful hardware

    • Better learning: Pair programming increases knowledge sharing and helps troubleshoot issues faster

    • Real-world practice: Most production AI/ML teams work collaboratively on shared infrastructure and have a mixture of roles and expertise

    This setup mirrors how teams work with shared cloud resources in enterprise environments.

  2. Receive your cluster credentials 🔐

    Your instructor will provide OpenShift login credentials for your team’s shared cluster.

  3. Set up your shared repository (choose one team member to do this):

    1. Fork the etx-agentic-ai repository to your personal GitHub account

      GitHub Repo Fork
      Figure 1. GitHub Repo Fork
    2. Add your teammate as a collaborator with write access

      GitHub Repo Collaborators
      Figure 2. GitHub Repo Collaborators
    3. Ensure that you enable Issues for your fork under Settings > General > Features > Issues, as Issues are disabled by default on forked repos

      GitHub Repo Enable Issues
      Figure 3. GitHub Repo Enable Issues
  4. Both team members: Clone the forked repository locally

    git clone git@github.com:your-gh-user/etx-agentic-ai.git
    cd etx-agentic-ai
    GitHub Repo Clone
    Figure 4. GitHub Repo Clone

    Replace your-gh-user with the actual GitHub username of whoever forked the repository.

  5. Verify your setup (a quick command-line check is sketched after this list)

    You should now have:

    • Access to your team’s OpenShift cluster

    • A shared fork of the repository with both teammates as collaborators

    • Local copies of the code on both laptops
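A quick, optional way to confirm the shared fork is wired up correctly from the command line (run the git commands inside the cloned etx-agentic-ai directory):

    # The fetch/push URLs should point at your team's fork, not the upstream repository
    git remote -v
    # A fetch should complete without authentication or permission errors for both teammates
    git fetch origin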

Cluster Environment

Your team has access to a fully-featured OpenShift cluster designed for AI workloads. This cluster mimics many customer production environments. Here’s how the platform is architected:

Bootstrap Components

These foundational components are deployed first to establish the platform’s operational baseline:

  • Red Hat OpenShift: Enterprise Kubernetes platform providing container orchestration

  • Advanced Cluster Management (ACM): Multi-cluster governance and GitOps orchestration

  • ArgoCD: Declarative, Git-driven application deployments

  • HashiCorp Vault: Secure credential storage and automated secret injection

Security & Governance

Built on the bootstrap foundation, these components enforce enterprise policies:

Policy as Code

Everything is managed through automated policy enforcement:

  • Zero Configuration Drift: What’s in Git is exactly what runs in production

  • Automated Compliance: Policies are enforced automatically, not through manual reviews

  • Scalable Governance: Manage hundreds of clusters with the same effort as one

  • Declarative Security: Security policies are versioned, tested, and automatically applied

How this differs from standard GitOps: While traditional GitOps deploys applications, Policy as Code deploys and enforces the rules that govern how applications can behave, what resources they can access, and how they must be configured. The policies themselves are GitOps-managed, creating a "governance layer" above your applications.

Green from GO ✅: We start compliant from day one. Rather than building systems and retrofitting security and compliance later, our development environment mirrors production with all policies active from the beginning. This means teams learn to work within enterprise guardrails naturally.

This approach ensures software quality, security, and consistency at enterprise scale.

You can read more about Configuration Policies here.
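For a concrete sense of the shape these policies take, below is a minimal, illustrative ConfigurationPolicy (not one of the lab's actual policies) that enforces the existence of a namespace. The names in it are made up for illustration; the real policies governing your cluster live in Git and are applied by ACM:

    # Illustrative only - do not apply this; the lab's real policies are delivered from Git via ACM/ArgoCD
    cat <<'EOF' > /tmp/example-configuration-policy.yaml
    apiVersion: policy.open-cluster-management.io/v1
    kind: ConfigurationPolicy
    metadata:
      name: example-namespace-musthave
    spec:
      remediationAction: enforce       # fix drift automatically rather than only reporting it
      severity: low
      object-templates:
        - complianceType: musthave     # the object below must exist and match this definition
          objectDefinition:
            apiVersion: v1
            kind: Namespace
            metadata:
              name: example-governed-namespace
    EOF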

Policy as Code
Figure 5. Policy as Code using GitOps and ACM
  • Policy Enforcement: ACM automatically applies and monitors compliance across all workloads in all clusters (particularly useful for large-scale multi-cluster environments)

  • Observability Stack: Comprehensive monitoring, logging, and tracing for security insights

  • GPU Resource Management: Node Feature Discovery (NFD) for specialized compute allocation

Developer Platform Services

Self-service capabilities that enable development teams:

  • CI/CD Pipelines: Tekton for automated container builds, testing, and deployments

  • Source Control Integration: Git-based workflows with automated quality gates

  • Container Registry: Secure image storage with vulnerability scanning and promotion workflows

Tenant & Workload Services

Multi-tenant capabilities providing isolated, secure environments:

  • Namespace Management: Multi-tenant isolation with RBAC and resource quotas

  • Development Workbenches: Self-service Jupyter environments for data science teams

  • Service Mesh: Secure service-to-service communication and traffic management

AI/ML Platform Services

Specialized services for AI/ML workloads and agentic applications:

  • Red Hat OpenShift AI (RHOAI): Managed AI/ML platform with GPU acceleration

  • Model Serving Infrastructure: Scalable inference endpoints with model lifecycle management

  • Agentic AI Runtime: Environment for deploying AI agents with external service integrations

LLaMA Stack Integration: Our agentic AI workloads leverage LLaMA Stack, a composable framework that provides standardized APIs for model inference, safety guardrails, and tool integration. This allows our AI agents to seamlessly interact with large language models while maintaining consistent interfaces for memory management, tool calling, and safety controls across different model providers.

The Benefits:

  • ZERO configuration drift - what’s in git is real

  • Integrates into the Governance Dashboard in ACM for SRE

  • We start as we mean to go on - we are Green from GO, so our dev environment looks like prod, only smaller

  • All our clusters and environments are Kubernetes Native once bootstrapped

Required Applications

As a team, you need to complete each of these prerequisites.

  1. Choose a client to bootstrap from, e.g. your local machine with the required CLIs installed.

  2. Setup env vars and login to OpenShift

    export ADMIN_PASSWORD=password # replace with yours
    export CLUSTER_NAME=ocp.4ldrd # replace with yours
    export BASE_DOMAIN=sandbox2518.opentlc.com # replace with yours
    oc login --server=https://api.${CLUSTER_NAME}.${BASE_DOMAIN}:6443 -u admin -p ${ADMIN_PASSWORD}
  3. Done ✅

MaaS credentials

Gather your Model as a Service Credentials.

  1. Login to Models-as-a-service using your Red Hat credentials.

  2. Click on the See your Applications & their credentials button.

  3. Create 3 Applications for these three models

    • Llama-3.2-3B

    • Llama-4-Scout-17B-16E-W4A16

    • Nomic-Embed-Text-v1.5

      e.g. llama-4-scout-17b-16e-w4a16

      MaaS LLama4 Scout
      Figure 6. MaaS LLama4 Scout
  4. Setup env vars

    export MODEL_LLAMA3_API_KEY=e3...
    export MODEL_LLAMA3_ENDPOINT_URL=https://llama-3-2-3b-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443
    export MODEL_LLAMA3_NAME=llama-3-2-3b
    
    export MODEL_LLAMA4_API_KEY=ce...
    export MODEL_LLAMA4_ENDPOINT_URL=https://llama-4-scout-17b-16e-w4a16-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443
    export MODEL_LLAMA4_NAME=llama-4-scout-17b-16e-w4a16
    
    export MODEL_EMBED_API_KEY=95...
    export MODEL_EMBED_URL=https://nomic-embed-text-v1-5-maas-apicast-production.apps.prod.rhoai.rh-aiservices-bu.com:443
    export MODEL_EMBED_NAME=/mnt/models
  5. Done ✅
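Optionally, you can verify a key/endpoint pair before continuing. The check below assumes the MaaS endpoints expose an OpenAI-compatible API and accept the key as a bearer token - if the call fails, compare against the example request shown for your application in the MaaS portal instead:

    curl -s -H "Authorization: Bearer ${MODEL_LLAMA4_API_KEY}" \
      "${MODEL_LLAMA4_ENDPOINT_URL}/v1/models" | jq .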

Vault Setup for GitOps

We need to set up Vault for your environment.

  1. Initialize the vault. Make sure you record the UNSEAL_KEY and ROOT_TOKEN somewhere safe and export them as env vars.

    oc -n vault exec -ti vault-0 -- vault operator init -key-threshold=1 -key-shares=1 -tls-skip-verify
    export UNSEAL_KEY=EGbx...
    export ROOT_TOKEN=hvs.wnz...

After running the vault initialization command, you’ll see output containing the unseal key and root token. Copy these values and export them as environment variables as shown.

Vault initialization output showing unseal key and root token
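If you prefer to capture these values programmatically instead of copying them from the terminal, a minimal alternative sketch (assumes jq is installed, as used elsewhere in this setup; note that init can only be run once):

    INIT_JSON=$(oc -n vault exec vault-0 -- vault operator init -key-threshold=1 -key-shares=1 -format=json -tls-skip-verify)
    export UNSEAL_KEY=$(echo "${INIT_JSON}" | jq -r '.unseal_keys_b64[0]')
    export ROOT_TOKEN=$(echo "${INIT_JSON}" | jq -r '.root_token')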
  2. Unseal the Vault.

    oc -n vault exec -ti vault-0 -- vault operator unseal -tls-skip-verify $UNSEAL_KEY
  3. Setup secrets for gitops.

    (Optional Reading) You can see more details of this sort of setup here if you need more background.
  4. Setup env vars

    export VAULT_ROUTE=vault-vault.apps.${CLUSTER_NAME}.${BASE_DOMAIN}
    export VAULT_ADDR=https://${VAULT_ROUTE}
    export VAULT_SKIP_VERIFY=true
  5. Login to Vault.

    vault login token=${ROOT_TOKEN}

You should see the following output:

Vault Login

  6. Setup env vars

    export APP_NAME=vault
    export PROJECT_NAME=openshift-policy
    export CLUSTER_DOMAIN=apps.${CLUSTER_NAME}.${BASE_DOMAIN}
  7. Create the Vault Auth using Kubernetes auth

    vault auth enable -path=${CLUSTER_DOMAIN}-${PROJECT_NAME} kubernetes
    export MOUNT_ACCESSOR=$(vault auth list -format=json | jq -r ".\"$CLUSTER_DOMAIN-$PROJECT_NAME/\".accessor")
  8. Create an ACL Policy - ArgoCD will only be allowed to READ secret values for hydration into the cluster

    vault policy write $CLUSTER_DOMAIN-$PROJECT_NAME-kv-read -<< EOF
    path "kv/data/*" {
    capabilities=["read","list"]
    }
    EOF
  9. Enable kv2 to store our secrets

    vault secrets enable -path=kv/ -version=2 kv
  10. Bind the ACL policy to the auth role

    vault write auth/$CLUSTER_DOMAIN-$PROJECT_NAME/role/$APP_NAME \
    bound_service_account_names=$APP_NAME \
    bound_service_account_namespaces=$PROJECT_NAME \
    policies=$CLUSTER_DOMAIN-$PROJECT_NAME-kv-read \
    period=120s
  11. Grab the cluster CA certificate from the API endpoint

    CA_CRT=$(echo "Q" | openssl s_client -showcerts -connect api.${CLUSTER_NAME}.${BASE_DOMAIN}:6443 2>&1 | awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/ {print $0}')
  12. Add the cluster API server address and CA cert to the Vault Auth Config.

    vault write auth/${CLUSTER_DOMAIN}-${PROJECT_NAME}/config \
    kubernetes_host="$(oc whoami --show-server)" \
    kubernetes_ca_cert="$CA_CRT"
  13. Done ✅
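If you want to verify the auth method, policy, and role before moving on, these read-only checks (using the env vars exported above) should all succeed:

    vault auth list | grep "${CLUSTER_DOMAIN}-${PROJECT_NAME}"
    vault policy read ${CLUSTER_DOMAIN}-${PROJECT_NAME}-kv-read
    vault read auth/${CLUSTER_DOMAIN}-${PROJECT_NAME}/role/${APP_NAME}
    vault read auth/${CLUSTER_DOMAIN}-${PROJECT_NAME}/config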

Create a CronJob

In case the Vault pod restarts, or the node it runs on reboots, it is handy to automatically unseal the Vault.

cat infra/bootstrap/vault-unseal-cronjob.yaml | envsubst | oc apply -f-
Vault Cronjob Created

Done ✅
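You can confirm the CronJob object was created - its exact name and namespace come from the manifest in infra/bootstrap, so the grep below is just a convenience:

    oc get cronjobs -A | grep -i unseal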

Tavily search token

Gather your Tavily web search API Key.

  1. Set up a Tavily API key for web search. Login using the GitHub account of one of your team members.

    Create Tavily API Key
    Figure 7. Tavily API Key
  2. Done ✅

GitHub Token

Create a fine-grained GitHub Personal Access Token (PAT).

  1. Login to GitHub in a browser, then click on your user icon > Settings

  2. Select Developer Settings > Personal Access Tokens > Fine-grained personal access tokens

  3. Select the Generate new token button - give it a token name, e.g. etx-ai

  4. Set Repository access

    All repositories: allows access to all your repositories, including read-only access to public repos.

  5. Give it the following permissions:

    Commit statuses: Read-Only

    Contents: Read-Only

    Issues: Read and Write

    Metadata: Read-Only (this gets added automatically)

    Pull requests: Read-Only

    GitHub Repo Perms
    Figure 8. GitHub Repo Perms
  6. Generate the token.

    GitHub Repo Token
    Figure 9. GitHub Repo Token
  7. Done ✅
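To check that the token works before you store it anywhere, a quick read-only call to the GitHub API can help. GITHUB_PAT below is just a placeholder env var for the token you generated:

    export GITHUB_PAT=github_pat_...   # replace with your generated token
    curl -s -H "Authorization: Bearer ${GITHUB_PAT}" https://api.github.com/user | jq .login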

GitHub Webhook

Create a webhook that fires from your GitHub repo fork to ArgoCD on the OpenShift cluster. This ensures the applications are synced whenever you push a change to git (rather than waiting for the default 3-minute sync interval).

  1. Login to GitHub in a browser, go to your etx-agentic-ai fork > Settings

  2. Select Webhooks

  3. Select Add Webhook. Add the following details

    Payload URL: https://global-policy-server-openshift-policy.apps.${CLUSTER_NAME}.${BASE_DOMAIN}/api/webhook - You can get the correct URL by echoing this out on the command line:

    echo https://global-policy-server-openshift-policy.apps.${CLUSTER_NAME}.${BASE_DOMAIN}/api/webhook

    Content Type: application/json

    SSL Verification: Enable SSL Verification

    Which events: Send me everything

  4. Click Add Webhook

    GitHub Webhook
    Figure 10. GitHub Webhook
  5. Done ✅
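You can check that the webhook endpoint is reachable from your machine. A plain GET will not trigger a sync and may return an error status, which is fine - any HTTP response shows the route resolves and serves TLS:

    curl -k -s -o /dev/null -w "%{http_code}\n" https://global-policy-server-openshift-policy.apps.${CLUSTER_NAME}.${BASE_DOMAIN}/api/webhook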

The Secrets File

Why Do This

We need to be able to hydrate the vault from a single source of truth, which makes secret management very efficient. In the case of a disaster, we need to recover the vault environment quickly. We can check this file into git as an AES-256 encrypted file (until quantum cracks it ❈).

The secrets file is just a bash shell script that uses the vault cli.
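As a rough illustration of what such a script contains, it is essentially a series of vault kv put commands. The paths and field names below are placeholders - the real ones come from infra/secrets/vault-sno-example:

    # Illustrative sketch only - use the example file for the actual secret paths and keys
    vault kv put kv/example/maas \
      api_key="${MODEL_LLAMA4_API_KEY}" \
      endpoint="${MODEL_LLAMA4_ENDPOINT_URL}"
    vault kv put kv/example/github token="${GITHUB_PAT}"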

  1. Copy the example secrets file provided

    cp infra/secrets/vault-sno-example infra/secrets/vault-sno
    If the secrets file were encrypted, we could decrypt it as follows (the instructor will provide the key):
    ansible-vault decrypt infra/secrets/vault-sno
  2. Add the gathered API tokens as env vars to the secrets file and save it.

    Add API Tokens
    Figure 11. Add API Tokens
  3. Setup env vars

    export VAULT_ROUTE=vault-vault.apps.${CLUSTER_NAME}.${BASE_DOMAIN}
    export VAULT_ADDR=https://${VAULT_ROUTE}
    export VAULT_SKIP_VERIFY=true
  4. Login to Vault.

    vault login token=${ROOT_TOKEN}
  5. Hydrate the vault by running the secrets file as a script. When prompted to enter the root token, use the $ROOT_TOKEN you exported earlier.

    sh infra/secrets/vault-sno
  6. Encrypt the secrets file and check it back into your git fork. Generate a large secret key to encrypt the file with, and keep it safe.

    you can put the key in vault 🔑
    openssl rand -hex 32
  7. The ansible-vault encrypt command will prompt you for the key twice

    ansible-vault encrypt infra/secrets/vault-sno
  8. Add to git

    # It's not real unless it's in git
    git add infra/secrets/vault-sno; git commit -m "hydrated vault with apikeys"; git push
    Optional

    You can add a client-side pre-commit git hook so that you do not accidentally check in the secrets file unencrypted. Run this after cloning the forked repo to configure the git hook:

    chmod 755 infra/bootstrap/pre-commit
    cd .git/hooks
    ln -s ../../infra/bootstrap/pre-commit pre-commit
    cd ../../
  9. Lastly, create the secret used by ArgoCD to connect to Vault in our OpenShift cluster. Since the OpenShift token API is used, we only need to reference the service account details.

    cat <<EOF | oc apply -f-
    kind: Secret
    apiVersion: v1
    metadata:
      name: team-avp-credentials
      namespace: openshift-policy
    stringData:
      AVP_AUTH_TYPE: "k8s"
      AVP_K8S_MOUNT_PATH: "auth/${CLUSTER_DOMAIN}-${PROJECT_NAME}"
      AVP_K8S_ROLE: "vault"
      AVP_TYPE: "vault"
      VAULT_ADDR: "https://vault.vault.svc:8200"
      VAULT_SKIP_VERIFY: "true"
    type: Opaque
    EOF
  10. Your Agentic ArgoCD is now set up to read secrets from Vault and should be in a healthy state.

    Vault Health
  11. You can also login to the Vault UI (linked from the OpenShift web console) with the $ROOT_TOKEN to check out the configuration if it is unfamiliar.

    Login to Vault
    Figure 12. Login to Vault
  12. Done ✅
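To confirm the vault was hydrated, you can list the secret paths the script created (the exact paths depend on the contents of the secrets file):

    vault kv list kv/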

💥 Expert Mode 💥

Experts Only ⛷️

Only run this script if you skipped straight to here and you are already familiar with the HashiCorp Vault setup we just ran through. It is the all-in-one vault setup script.

export CLUSTER_NAME=cluster-4xglk.4xglk
export BASE_DOMAIN=sandbox2518.opentlc.com
export AWS_PROFILE=etx-ai
export ADMIN_PASSWORD=password
export ANSIBLE_VAULT_SECRET=94bbffb36de4285abcf95b5d650e0790c13939bc0e2f5214aaf58196456b8989

./infra/bootstrap/vault-setup.sh

Done ✅

Complete the Bootstrap

  1. The following OpenShift ConsoleLinks should already exist in your cluster:

    Console Links

    Red Hat Applications - these are cloud services provided by Red Hat for your cluster.

    GenAI - these are the GenAI applications that we will be using in the exercises. The Agentic ArgoCD should be running but is empty (no apps deployed yet) and is our GitOps application. The LlamaStack Playground is not deployed yet, but will be the link for the LlamaStack UI for integrating Tools and Agents. Vault is the app that stores our secrets - you initialized and unsealed it in the earlier steps.

    OpenShift GitOps - this is the cluster bootstrap ArgoCD GitOps. This has all of the setup to get started for our cluster. It does not include the Agentic applications that we cover in the exercises.

    RHOAI - the UI for Red Hat OpenShift AI. Login here to access your Data Science workbenches, models, pipelines and experiments.

  2. Bootstrap App-of-Apps

    # We need to update our ArgoCD Apps to point to your team fork
    export YOUR_GITHUB_USER=your-gh-user  # the Team member who forked the GitHub Repo
    cd etx-agentic-ai   # Navigate to root directory of code base if not already there
  3. Replace redhat-ai-services with your GitHub username throughout the etx-app-of-apps.yaml file.

    sed -i "s/redhat-ai-services/${YOUR_GITHUB_USER}/g" infra/app-of-apps/etx-app-of-apps.yaml
  4. Do the same for each of the app files under infra/app-of-apps/sno.

    for x in $(ls infra/app-of-apps/sno); do
        sed -i "s/redhat-ai-services/${YOUR_GITHUB_USER}/g" infra/app-of-apps/sno/$x
    done
  5. Now we can save, commit, and push the changes to your GitHub fork.

    # It's not real unless it's in git
    git add .; git commit -m "using my github fork"; git push
  6. Finally, we can bootstrap the apps into our cluster.

    # Bootstrap all our apps
    oc apply -f infra/app-of-apps/etx-app-of-apps.yaml

    This will install the tenant pipeline app and observability stack into our cluster. All the other GenAI apps are undeployed for now. You can check this in the infra/app-of-apps/<cluster-name> folder of your GitHub fork.

    bootstrap-initial
  7. Check the Install progress of the app-of-apps in ArgoCD

    bootstrap-begin
  8. You will need to wait for the individual apps to be installed, which may take a few minutes. Once they are ready, you should see the following output showing that the apps have been installed.

    bootstrap-complete

    Also, notice that the tenant-ai-agent-local-cluster app is constantly in a progressing state. This is something we will address later in this course.

  9. Done ✅
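If you prefer the command line to the ArgoCD UI, you can also list the Argo CD Applications and their sync and health status directly:

    oc get applications.argoproj.io -A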

Our Data Science Team Have A Request

It seems there are only a limited number of GPUs in the cluster - in this example, one GPU. We already have an LLM model deployed at bootstrap time using this GPU.

The Data Science team 🤓 have requested to use GPUs for their Data Science Workbenches, e.g. when they use a PyTorch, CUDA, or other stack image that can directly access an accelerator.

Given the cluster already has access to one GPU node, let's quickly set up this access for them. Note that your cluster may be configured with more GPU nodes.

In our case we have a single NVIDIA accelerator attached to our instance type.

  1. Check which GPU-enabled EC2 instance types we have running in our cluster

    oc get machines.machine.openshift.io -A
    NAMESPACE               NAME                                    PHASE     TYPE          REGION      ZONE         AGE
    openshift-machine-api   ocp-kt5tz-master-0                      Running   c6a.2xlarge   us-east-2   us-east-2a   24h
    openshift-machine-api   ocp-kt5tz-master-1                      Running   c6a.2xlarge   us-east-2   us-east-2b   24h
    openshift-machine-api   ocp-kt5tz-master-2                      Running   c6a.2xlarge   us-east-2   us-east-2c   24h
    openshift-machine-api   ocp-kt5tz-worker-gpu-us-east-2a-9vxzv   Running   g6e.2xlarge   us-east-2   us-east-2a   24h
    openshift-machine-api   ocp-kt5tz-worker-us-east-2a-fcbcg       Running   m6a.4xlarge   us-east-2   us-east-2a   24h
    openshift-machine-api   ocp-kt5tz-worker-us-east-2b-5zx84       Running   m6a.4xlarge   us-east-2   us-east-2b   24h
    openshift-machine-api   ocp-kt5tz-worker-us-east-2c-z9xzs       Running   m6a.4xlarge   us-east-2   us-east-2c   24h
  2. We can see in this case that we have a g6e.2xlarge instance. We can check how many GPUs we are able to allocate:

    oc get $(oc get node -o name -l beta.kubernetes.io/instance-type=g6e.2xlarge) -o=jsonpath={.status.allocatable} | python -m json.tool

    In this case - we have an output of 1 allocatable GPU:

    {
      "cpu": "7500m",
      "ephemeral-storage": "114345831029",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "63801456Ki",
      "nvidia.com/gpu": "1",
      "pods": "250"
    }
  3. Label the node with the device-plugin.config value that matches the GPU product, e.g. NVIDIA-L40S for this instance type.

    oc label --overwrite node \
        --selector=nvidia.com/gpu.product=NVIDIA-L40S \
        nvidia.com/device-plugin.config=NVIDIA-L40S
    If your instance type has different accelerators, you will need to adjust the label used here and the ConfigMap in the next step.
  4. Now apply the GPU ClusterPolicy and ConfigMap objects that set up Time Slicing - a method to share NVIDIA GPUs.

    oc apply -k infra/applications/gpu
  5. After approximately 30 seconds, check the number of allocatable GPUs

    oc get $(oc get node -o name -l beta.kubernetes.io/instance-type=g6e.2xlarge) -o=jsonpath={.status.allocatable} | python -m json.tool

    This should now give an output with 8 allocatable GPUs. Great - now our data science team can see and use eight GPUs even though we only have one physical GPU.

    {
      "cpu": "7500m",
      "ephemeral-storage": "114345831029",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "63801456Ki",
      "nvidia.com/gpu": "8",
      "pods": "250"
    }
  6. Done ✅
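If you would like to see a sliced GPU actually being scheduled, a minimal optional smoke test is sketched below. The namespace and image tag are assumptions - substitute any CUDA image that includes nvidia-smi:

    cat <<'EOF' | oc apply -n default -f-
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # assumption - any CUDA image with nvidia-smi works
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1                          # requests one of the time-sliced GPU replicas
    EOF
    oc -n default logs -f pod/gpu-smoke-test
    oc -n default delete pod gpu-smoke-test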

Technical Knowledge

☕ Buckle Up, Here we go …