Advanced vLLM Configuration
vLLM provides a number of options that can be set for the vLLM server through arguments, either in the OpenShift AI Dashboard or directly on the InferenceService object.
To explore the available options, you can run vllm --help or refer to the upstream vLLM documentation.
vLLM is a rapidly evolving project, and customers may not be running the latest version of vLLM. When referencing the upstream docs, it is recommended that you look at the docs for the specific version of vLLM you are running, as the options may have changed.
Setting Arguments
Arguments can be set directly through the UI for single-node vLLM instances. Any configuration added in the UI is represented in the InferenceService object in the args field:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model-server
spec:
  predictor:
    model:
      args:
        - '--max-model-len=10000'
Common Config Options
Model Configs
--served-model-name
--served-model-name is set by default in the ServingRuntime to be equal to the name of the ServingRuntime object. If you wish to override this name, you can set the argument manually and specify the model name you wish to use.
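For example, to expose the model under a friendlier name, the argument can be added to the args field. The name used below is only an illustration:

args:
  - '--served-model-name=granite-3-8b-instruct'

Clients then use this value in the model field of their OpenAI-compatible requests.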
--max-model-len
--max-model-len is commonly used to help define the size of the KV Cache. When attempting to run a model on a GPU that does not have enough vRAM to support the model's default max context length, you will need to set this value to limit the size of the KV Cache.
--gpu-memory-utilization
--gpu-memory-utilization is set to 90% by default. You can increase this value to give the vLLM instance access to more vRAM. vLLM does need a small amount of vRAM left free for other processes, but you can generally increase this value a bit to expand the size of the KV Cache, especially when using larger GPUs such as an H100 with 80 GB of vRAM.
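As a rough sketch, the combination below raises the memory utilization slightly while capping the context length; the exact values are illustrative and depend on your model and GPU:

args:
  - '--gpu-memory-utilization=0.95'
  - '--max-model-len=8192'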
--enforce-eager
--enforce-eager disables the creation of the CUDA graphs that are used to improve the performance of processing requests. As a trade-off for potentially slower responses, vLLM is able to free up additional vRAM, which can be used to increase the size of the KV Cache.
This option can be useful for increasing the overall size of the KV Cache when response time is not as important as a larger context length.
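For example, assuming a deployment where a long context matters more than latency, the flag can be combined with a larger --max-model-len; the value below is illustrative:

args:
  - '--enforce-eager'
  - '--max-model-len=32768'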
Parallelism
--tensor-parallel-size
--tensor-parallel-size generally corresponds to the number of GPUs that are available to the vLLM instance; the model is split across those GPUs using tensor parallelism.
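For example, a sketch of a two-GPU deployment: the GPU count requested for the predictor (normally managed through the Dashboard's accelerator settings) should match the tensor parallel size.

spec:
  predictor:
    model:
      args:
        - '--tensor-parallel-size=2'
      resources:
        limits:
          nvidia.com/gpu: '2'
        requests:
          nvidia.com/gpu: '2'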
Tool Calling
Tool calling is a common capability that we may need to enable on a vLLM instance using the following arguments:
--enable-auto-tool-choice
--tool-call-parser=granite
--chat-template=/app/data/template/tool_chat_template_granite.jinja
--enable-auto-tool-choice turns on the ability to do tool calling. --tool-call-parser is used to define which parser is used when doing tool calling. In the example, we are using the granite parser, but other parsers exist that can be used with other models; see the vLLM docs for the available options.
--chat-template generally corresponds to a template that is aligned with the model. The templates can be found in the vLLM GitHub repository, and they should be pre-packaged in the vLLM container image in /app/data/template/.
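Putting the three arguments together, the args field for a Granite deployment might look like the sketch below; verify that the template path exists in the vLLM image version you are running:

args:
  - '--enable-auto-tool-choice'
  - '--tool-call-parser=granite'
  - '--chat-template=/app/data/template/tool_chat_template_granite.jinja'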