
vLLM

There are two modes of using vLLM: local and remote. Let's start with the former, which requires a CUDA environment available locally.

pip install vllm
or, if you want, you can compile it from source.
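If you do want to build from source, one common route is sketched below; it is only an illustration (adjust paths and versions to your setup, and note that building vLLM requires a CUDA toolchain):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .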

%pip install llama-index-llms-vllm
import os
os.environ["HF_HOME"] = "model/"  # store the Hugging Face cache (downloaded model weights) under ./model/
from llama_index.llms.vllm import Vllm, VllmServer
llm = Vllm(
    model="microsoft/Orca-2-7b",
    tensor_parallel_size=4,  # shard the model across 4 GPUs
    max_new_tokens=100,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},  # 1 GiB CPU swap space, use half of GPU memory
)
llm.complete("[INST]You are a helpful assistant[/INST] What is a black hole ?")
llm = Vllm(
    model="codellama/CodeLlama-7b-hf",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)
llm.complete("import socket\n\ndef ping_exponential_backoff(host: str):")
llm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)
llm.complete(" What is a black hole ?")

In this mode there is no need to install the vllm package, nor to have CUDA available locally. To set up the vLLM API server, you can follow the guide here. Note: the llama-index-llms-vllm module is a client for vllm.entrypoints.api_server, which is only a demo.
If the vLLM server is launched with vllm.entrypoints.openai.api_server as an OpenAI-compatible server, or via Docker, you need the OpenAILike class from the llama-index-llms-openai-like module instead (see the sketch at the end of this section).
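The example below talks to the demo server and assumes it is already running and reachable at http://localhost:8000/generate. One way to start it (the model name is only an illustration) is:

python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.1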

from llama_index.core.llms import ChatMessage
llm = VllmServer(
    api_url="http://localhost:8000/generate", max_new_tokens=100, temperature=0
)
llm.complete("what is a black hole ?")
message = [ChatMessage(content="hello", role="user")]
llm.chat(message)
list(llm.stream_complete("what is a black hole"))[-1]
message = [ChatMessage(content="what is a black hole", role="user")]
[x for x in llm.stream_chat(message)][-1]
import asyncio
await llm.acomplete("What is a black hole")
await llm.achat(message)
[x async for x in await llm.astream_complete("what is a black hole")][-1]
[x async for x in await llm.astream_chat(message)][-1]
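
Finally, here is the sketch mentioned above for the OpenAI-compatible mode. It is only an illustration: the model name, port, and placeholder API key are assumptions, and it presumes the server was started with vllm.entrypoints.openai.api_server serving the same model on port 8000.

from llama_index.llms.openai_like import OpenAILike

# Point the client at vLLM's OpenAI-compatible endpoint (assumed to be /v1 on port 8000).
llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # must match the model the server is running
    api_base="http://localhost:8000/v1",
    api_key="none",  # vLLM ignores the key unless the server was started with --api-key
    max_tokens=100,
)
print(llm.complete("What is a black hole?"))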