LiteLLM - An open-source library to simplify LLM completion + embedding calls.
Posted in Recipe on March 21, 2024 by Venkatesh S ‐ 4 min read
I have been working with many local LLMs. When these LLMs are run using services like Ollama, LM Studio and others, they tend to expose their APIs through wrappers that are not OpenAI compatible, while many frameworks are by default built to work with the OpenAI APIs.
LiteLLM is a proxy server that provides a unified wrapper and lets us call 100+ LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, etc.]. This enables us to build applications and services on top of LLMs as if we were building them for OpenAI. It also lets us switch between models without getting tied to one vendor's or provider's specifications or APIs.
What does it do?
LiteLLM manages:
- Translate inputs to the provider's completion, embedding, and image_generation endpoints
- Consistent output: text responses will always be available at ['choices'][0]['message']['content'] (see the sketch after this list)
- Retry/fallback logic across multiple deployments (e.g. Azure/OpenAI) - Router
- Set budgets & rate limits per project, API key, and model - OpenAI Proxy Server
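To make the first two points concrete, here is a minimal sketch using the LiteLLM Python SDK (pip install litellm). It is my own illustration rather than code taken from the LiteLLM docs, and it assumes Ollama is serving Llama2 locally on its default port and that OPENAI_API_KEY is set in the environment for the OpenAI call; only the model string changes between the two providers.

from litellm import completion

messages = [{"role": "user", "content": "what llm are you"}]

# Same call shape for a local Ollama model...
local_response = completion(
    model="ollama/llama2",
    messages=messages,
    api_base="http://localhost:11434",
)

# ...and for a hosted OpenAI model; only the model string changes.
openai_response = completion(model="gpt-3.5-turbo", messages=messages)

# Consistent output: the text is always at ['choices'][0]['message']['content']
print(local_response["choices"][0]["message"]["content"])
print(openai_response["choices"][0]["message"]["content"])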
How to use LiteLLM?
You can use LiteLLM through either:
- OpenAI Proxy Server - Server to call 100+ LLMs, load balance, cost tracking across projects
- LiteLLM Python SDK - Python client to call 100+ LLMs, load balance, cost tracking
Today we will focus on the OpenAI Proxy Server option.
Deploying LiteLLM using Docker
While there are various options to deploy LiteLLM to production, we will focus on deploying it on a developer desktop. Please note that we will be using Ollama with Llama2 as the model for this example. If you want to know how to set up Ollama and bring it up locally, refer to this article on Ollama: Get up and running with Large Language Models locally
- Step 1: Create a new config file called litellm_config.yaml with the following contents.
model_list:
- model_name: llama2
litellm_params:
model: ollama/llama2
api_base: http://localhost:11434
- Step 2: Run the LiteLLM Docker image
docker run \
-v $(pwd)/litellm_config.yaml:/app/config.yaml \
-p 4000:4000 \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml --detailed_debug
- Step 3: Send a test request to ensure your setup is working; a Python equivalent is sketched after the curl example.
Note that the model specified in this request ("llama2") matches the model_name defined in the config file in Step 1.
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}'
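Because the proxy speaks the OpenAI format, any OpenAI-compatible client can talk to it. The following is a minimal sketch of the same request using the official OpenAI Python SDK (openai>=1.0) pointed at the local proxy; the api_key value is a placeholder, since this local config does not require keys.

from openai import OpenAI

# Point the standard OpenAI client at the local LiteLLM proxy.
# The api_key is a placeholder; this local config does not enforce keys.
client = OpenAI(api_key="anything", base_url="http://0.0.0.0:4000")

response = client.chat.completions.create(
    model="llama2",  # matches model_name from litellm_config.yaml
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)

This is the same vendor-neutral code you would write against OpenAI itself; switching to a different backend only requires changing the model list in litellm_config.yaml.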
There are various other ways to deploy this, as described in this link; refer to it for more details.
What does the OpenAI Proxy provide?
The proxy provides:
- A unified, OpenAI-compatible interface to call 100+ LLMs
- Load balancing and retry/fallback logic across multiple deployments
- Cost tracking across projects
- Budgets & rate limits per project, API key, and model
The Swagger docs for all the APIs the proxy exposes are also listed.
Supported Providers (Docs)
As of this writing, the following providers are supported.
| Provider | Completion | Streaming | Async Completion | Async Streaming | Async Embedding | Async Image Generation |
|---|---|---|---|---|---|---|
| openai | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| azure | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| aws - sagemaker | ✅ | ✅ | ✅ | ✅ | ✅ | |
| aws - bedrock | ✅ | ✅ | ✅ | ✅ | ✅ | |
| google - vertex_ai [Gemini] | ✅ | ✅ | ✅ | ✅ | | |
| google - palm | ✅ | ✅ | ✅ | ✅ | | |
| google AI Studio - gemini | ✅ | ✅ | | | | |
| mistral ai api | ✅ | ✅ | ✅ | ✅ | ✅ | |
| cloudflare AI Workers | ✅ | ✅ | ✅ | ✅ | | |
| cohere | ✅ | ✅ | ✅ | ✅ | ✅ | |
| anthropic | ✅ | ✅ | ✅ | ✅ | | |
| huggingface | ✅ | ✅ | ✅ | ✅ | ✅ | |
| replicate | ✅ | ✅ | ✅ | ✅ | | |
| together_ai | ✅ | ✅ | ✅ | ✅ | | |
| openrouter | ✅ | ✅ | ✅ | ✅ | | |
| ai21 | ✅ | ✅ | ✅ | ✅ | | |
| baseten | ✅ | ✅ | ✅ | ✅ | | |
| vllm | ✅ | ✅ | ✅ | ✅ | | |
| nlp_cloud | ✅ | ✅ | ✅ | ✅ | | |
| aleph alpha | ✅ | ✅ | ✅ | ✅ | | |
| petals | ✅ | ✅ | ✅ | ✅ | | |
| ollama | ✅ | ✅ | ✅ | ✅ | | |
| deepinfra | ✅ | ✅ | ✅ | ✅ | | |
| perplexity-ai | ✅ | ✅ | ✅ | ✅ | | |
| Groq AI | ✅ | ✅ | ✅ | ✅ | | |
| anyscale | ✅ | ✅ | ✅ | ✅ | | |
| voyage ai | | | | | ✅ | |
| xinference [Xorbits Inference] | | | | | ✅ | |