vLLM
There are two modes of using vLLM: local and remote. Let's start with the former, which requires a CUDA environment available locally.
Install vLLM
```bash
pip install vllm
```
or, if you want, you can compile it from source.
Orca-7b Completion Example
```python
%pip install llama-index-llms-vllm
```

```python
import os

os.environ["HF_HOME"] = "model/"

from llama_index.llms.vllm import Vllm, VllmServer

llm = Vllm(
    model="microsoft/Orca-2-7b",
    tensor_parallel_size=4,
    max_new_tokens=100,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
)

llm.complete("[INST]You are a helpful assistant[/INST] What is a black hole ?")
```

LLama-2-7b Completion Example
```python
llm = Vllm(
    model="codellama/CodeLlama-7b-hf",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)

llm.complete("import socket\n\ndef ping_exponential_backoff(host: str):")
```

Mistral chat 7b Completion Example
```python
llm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)

llm.complete(" What is a black hole ?")
```

Calling vLLM via HTTP
In this mode there is no need to install the vllm package or to have CUDA available locally. To set up the vLLM API server, you can follow the guide here.
Note: the llama-index-llms-vllm module is a client for vllm.entrypoints.api_server, which is only a demo.
If the vLLM server is launched with vllm.entrypoints.openai.api_server as an OpenAI-compatible server, or via Docker, you need the OpenAILike class from the llama-index-llms-openai-like module instead.
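In that case the client looks roughly as follows; this is a minimal sketch, assuming the server is reachable at http://localhost:8000/v1 and serves the model named below (adjust both to match your own deployment).

```python
# Minimal sketch: talking to a vLLM OpenAI-compatible server through OpenAILike.
# The model name, URL, and api_key below are assumptions; match them to how
# your server was actually launched.
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # must match the model served by vLLM
    api_base="http://localhost:8000/v1",  # OpenAI-compatible endpoint exposed by vLLM
    api_key="fake",  # vLLM does not check the key by default
    is_chat_model=True,  # route requests through the chat completions API
    max_tokens=100,
)

print(llm.complete("What is a black hole ?"))
```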
Completion Response
```python
from llama_index.core.llms import ChatMessage

llm = VllmServer(
    api_url="http://localhost:8000/generate", max_new_tokens=100, temperature=0
)

llm.complete("what is a black hole ?")

message = [ChatMessage(content="hello", role="user")]
llm.chat(message)
```
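The complete and chat calls return CompletionResponse and ChatResponse objects; a small sketch of reading the generated text back out, reusing the llm and message defined above:

```python
# Reading the results back out; `llm` and `message` come from the example above.
completion = llm.complete("what is a black hole ?")
print(completion.text)  # the generated completion string

chat_response = llm.chat(message)
print(chat_response.message.content)  # the assistant's reply
```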
Streaming Response

```python
list(llm.stream_complete("what is a black hole"))[-1]

message = [ChatMessage(content="what is a black hole", role="user")]
[x for x in llm.stream_chat(message)][-1]
```
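Rather than materializing the generator into a list, you can iterate over it to watch the generation progress; a minimal sketch, assuming (as with this demo-server client) that each yielded response holds the text generated so far:

```python
# Iterate over the stream instead of keeping only the last element; each yielded
# CompletionResponse carries the text generated so far, so the final one is the
# full answer (what the [-1] above selects).
for chunk in llm.stream_complete("what is a black hole"):
    print(chunk.text)
```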
Section titled “Async Response”import asyncio
await llm.acomplete("What is a black hole")await llm.achat(message)[x async for x in await llm.astream_complete("what is a black hole")][-1][x async for x in await llm.astream_chat(message)][-1]
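The await calls above assume an environment with top-level await, such as a notebook. In a plain script you would wrap them in a coroutine and drive it with asyncio.run; a minimal sketch, reusing the VllmServer client from above:

```python
import asyncio

from llama_index.core.llms import ChatMessage


async def main() -> None:
    # `llm` is the VllmServer client configured in the Completion Response example.
    completion = await llm.acomplete("What is a black hole")
    print(completion.text)

    message = [ChatMessage(content="what is a black hole", role="user")]
    chat_response = await llm.achat(message)
    print(chat_response.message.content)


asyncio.run(main())
```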