MAAS Required Functionality
API Gateway
SSE Handling
vLLM and other model-serving engines can stream responses back to the caller as Server-Sent Events (SSE). The API gateway must therefore be able to process SSE streams in order to handle streaming responses.
SSE Specification
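As a rough illustration of what processing an SSE stream involves, the sketch below splits a stream of text lines into events and yields each event's data payload. It is a minimal sketch, not a full implementation of the specification: the function name `iter_sse_data` is hypothetical, only the `data` field is handled, and the `[DONE]` sentinel in the example is assumed from the OpenAI-compatible streaming format.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event.

    Hypothetical helper: handles only `data:` fields and comment lines.
    """
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive; ignore it.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event, id, retry) are ignored in this sketch.
    # An event without a terminating blank line is discarded.


stream = ['data: {"a": 1}\n', "\n", ": keep-alive\n", "data: [DONE]\n", "\n"]
payloads = list(iter_sse_data(stream))
```

A gateway would typically run logic like this incrementally on the response body while forwarding the unmodified bytes to the caller, so that streaming latency is not affected.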
Request Payload Transformation
- User ID headers need to be added to requests
- Mandatory corporate custom HTTP headers
- Outgoing requests need to be modified to enable vLLM token tracking (vLLM only). vLLM, the standard LLM serving engine in RHOAI, supports per-request/response token tracking. This needs to be turned on in the stream_options parameter sent to vLLM:
stream_options = {
    "include_usage": True,           # include usage on the last response
    "continuous_usage_stats": True,  # include usage on each SSE chunk
}
3Scale Request Payload transformation code
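The request transformation steps listed above can be sketched in a gateway-agnostic way as follows. This is only an illustration under stated assumptions: the function `transform_request`, the `X-User-Id` header name, and the `X-Corp-Trace` header in the usage lines are hypothetical, and the sketch assumes the request body is an OpenAI-style JSON payload. It sets `stream_options` only for streaming requests, since the option applies to streamed responses.

```python
import json
from typing import Mapping


def transform_request(
    headers: Mapping[str, str],
    body: bytes,
    user_id: str,
    corporate_headers: Mapping[str, str],
) -> tuple[dict[str, str], bytes]:
    """Return (headers, body) with identity headers and token tracking added."""
    # Drop any stale Content-Length; the body may change size below.
    new_headers = {k: v for k, v in headers.items() if k.lower() != "content-length"}
    new_headers["X-User-Id"] = user_id      # hypothetical header name
    new_headers.update(corporate_headers)   # mandatory corporate headers

    payload = json.loads(body)
    if payload.get("stream"):
        # Preserve caller-supplied options, then force usage reporting on.
        options = dict(payload.get("stream_options") or {})
        options["include_usage"] = True
        options["continuous_usage_stats"] = True
        payload["stream_options"] = options

    new_body = json.dumps(payload).encode("utf-8")
    new_headers["Content-Length"] = str(len(new_body))
    return new_headers, new_body


out_headers, out_body = transform_request(
    {"Content-Type": "application/json", "content-length": "30"},
    b'{"model": "m", "stream": true}',
    user_id="alice",
    corporate_headers={"X-Corp-Trace": "abc"},  # hypothetical corporate header
)
```

In a real gateway the same logic would live in that gateway's plugin or policy mechanism; the point here is only the order of operations: add headers, patch the JSON body, then recompute the content length.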
Response Payload Parsing
The responses sent back from vLLM contain additional fields holding the token counts. These fields need to be parsed to retrieve the token counts, which are then forwarded as metrics to the customer's metrics handling solution, e.g. Prometheus running on OpenShift.
3Scale Response payload metrics extraction code
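A minimal sketch of the parsing step is shown below. It assumes the token counts arrive in a `usage` object with `prompt_tokens`, `completion_tokens` and `total_tokens` fields, as in the OpenAI-compatible response format, and that the stream ends with a `[DONE]` payload. The names `TokenUsage` and `extract_usage` are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


def extract_usage(data_payloads: Iterable[str]) -> Optional[TokenUsage]:
    """Return the last usage block found in a sequence of SSE data payloads."""
    latest: Optional[TokenUsage] = None
    for payload in data_payloads:
        if payload.strip() == "[DONE]":
            break  # assumed end-of-stream sentinel
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON chunk
        usage = chunk.get("usage") if isinstance(chunk, dict) else None
        if isinstance(usage, dict):
            # Keep the most recent usage; the final chunk carries the totals.
            latest = TokenUsage(
                prompt_tokens=int(usage.get("prompt_tokens", 0)),
                completion_tokens=int(usage.get("completion_tokens", 0)),
                total_tokens=int(usage.get("total_tokens", 0)),
            )
    return latest


chunks = [
    '{"choices": [{"delta": {"content": "Hi"}}], "usage": null}',
    '{"choices": [], "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7}}',
    "[DONE]",
]
usage = extract_usage(chunks)
```

Keeping only the latest usage block works whether usage is reported once at the end of the stream or on every chunk.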
Useful metrics to collect & report
- Request/Response count
- Request/Response latency
- Token counts per request
  - Token count for a prompt
  - Token count for a completion
  - Total token count
- Prometheus integration for other API gateways
  - Kong example: https://developer.konghq.com/plugins/prometheus/
  - APISIX example: https://apisix.apache.org/docs/apisix/plugins/prometheus/
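To make the metrics listed above concrete, the sketch below aggregates them in memory and renders them in the Prometheus text exposition format. It is an illustration only: the class `GatewayMetrics` and the metric names prefixed with `maas_` are hypothetical, label values are assumed not to need escaping, and a real deployment would more likely rely on the gateway's own Prometheus plugin or a client library.

```python
from collections import defaultdict


class GatewayMetrics:
    """Hypothetical in-memory aggregator for per-model gateway metrics."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)            # model -> request count
        self.duration_seconds = defaultdict(float)  # model -> summed latency
        self.tokens = defaultdict(int)              # (model, kind) -> tokens

    def observe(self, model: str, latency_seconds: float,
                prompt_tokens: int, completion_tokens: int) -> None:
        self.requests[model] += 1
        self.duration_seconds[model] += latency_seconds
        self.tokens[(model, "prompt")] += prompt_tokens
        self.tokens[(model, "completion")] += completion_tokens

    def render(self) -> str:
        """Render all counters in the Prometheus text exposition format."""
        lines = ["# TYPE maas_requests_total counter"]
        for model, count in sorted(self.requests.items()):
            lines.append(f'maas_requests_total{{model="{model}"}} {count}')
        lines.append("# TYPE maas_request_duration_seconds_total counter")
        for model, total in sorted(self.duration_seconds.items()):
            lines.append(
                f'maas_request_duration_seconds_total{{model="{model}"}} {total}'
            )
        lines.append("# TYPE maas_tokens_total counter")
        for (model, kind), count in sorted(self.tokens.items()):
            lines.append(
                f'maas_tokens_total{{model="{model}",kind="{kind}"}} {count}'
            )
        return "\n".join(lines) + "\n"


metrics = GatewayMetrics()
metrics.observe("m", latency_seconds=0.25, prompt_tokens=5, completion_tokens=2)
metrics.observe("m", latency_seconds=0.5, prompt_tokens=3, completion_tokens=4)
text = metrics.render()
```

The total token count is not stored separately here because it can be derived by summing the prompt and completion series, and average latency can be derived from the duration and request counters.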