MAAS Required Functionality

API Gateway

SSE Handling

vLLM and other model serving engines can send responses back to the caller as a stream of Server Sent Events. This requires the API gateway to support processing of SSE events for streaming response handling.

Request Payload Transformation

  • User ID headers need to be added to requests

  • Mandatory corporate custom HTTP headers

  • Outgoing requests need to be modified to add vLLM token tracking (vLLM only) The standard llm serving engine in RHOAI, vLLM, supports per-request/response token tracking. This needs to be turned on in the _ stream options_ parameter send to vLLM.

    stream_options={
                include_usage ==> Include usage on last response
                continuous_usage_stats  ==> Include usage on each SSE chunk
    }

Response Payload Parsing

The responses sent back from vLLM will contain custom fields containing the token counts. These fields need to be parsed to retrieve the token counts and forward these metrics to the customer metrics handling solution e.g. prometheus running on OpenShift.

Useful metrics to collect & report

  • Request/Response count

  • Request/Response Latency

  • Token counts per request

    • Token count for a prompt

    • Token count for a completion

    • Total token count

Prometheus Integration for other API gateways

API Gateway Optional Functionality

  • Rate Limiting - Prevent users from hammering LLMs

  • Request Prioritization - Ensuring prioritization for critical applications/users

  • Cost Management - Chargeback & Showback

  • Corporate Guardrails - Single point of entry for LLM requests

Security and validation

  • OIDC validation

  • Application/Developer Token validation

  • Integrating with corporate SSO

User Management

  • Developer/User key generation and tracking

  • User onboarding & management