LiteLLM - An open-source library to simplify LLM completion + embedding calls.

Posted in Recipe on March 21, 2024 by Venkatesh S ‐ 4 min read



I have been working with many local LLMs. When these LLMs are run using services like Ollama, LM Studio and others, they tend to expose their services through wrappers that are not OpenAI compatible. Many frameworks, however, are built by default to work with the OpenAI APIs.

LiteLLM is a proxy server that provides a unified wrapper and lets us call 100+ LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, etc.]. This enables us to build applications and services on top of LLMs as if we were building them for OpenAI. It also lets us switch between models without getting tied to one vendor's or provider's specifications or APIs.

What does it do?

LiteLLM manages:

  • Translates inputs to the provider's completion, embedding, and image_generation endpoints

  • Consistent output: text responses will always be available at ['choices'][0]['message']['content'] (see the sketch after this list)

  • Retry/fallback logic across multiple deployments (e.g. Azure/OpenAI) - Router

  • Set budgets & rate limits per project, API key, and model (OpenAI Proxy Server)
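To get a feel for that consistent interface, here is a minimal sketch using the LiteLLM Python SDK (introduced below); the Ollama model and api_base values are assumptions matching the local setup used later in this post.

# pip install litellm
from litellm import completion

# The call shape is the same for every provider; only the model string changes.
response = completion(
    model="ollama/llama2",              # assumes a local Ollama serving llama2
    api_base="http://localhost:11434",  # assumed default Ollama endpoint
    messages=[{"role": "user", "content": "what llm are you"}],
)

# Regardless of the provider, the text lands in the same place:
print(response["choices"][0]["message"]["content"])

# Switching vendors is just a different model string, e.g.:
# completion(model="gpt-3.5-turbo", messages=[...])  # requires OPENAI_API_KEY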

How to use LiteLLM?

You can use litellm through either:

  • OpenAI Proxy Server - Server to call 100+ LLMs, load balance, cost tracking across projects

  • LiteLLM python SDK - Python Client to call 100+ LLMs, load balance, cost tracking

Today we will focus on the OpenAI Proxy Server option.

Deploying LiteLLM using Docker

While there are various options for deploying LiteLLM to production, we will focus on running it on a developer desktop. Please note that we will be using Ollama with Llama2 as the model for this example. If you want to know how to set up Ollama and bring it up locally, refer to this article on Ollama : Get up and running with Large Language Models locally

  • Step 1 : Create a new config file called litellm_config.yaml with the following contents. (Note: since LiteLLM will run inside a Docker container while Ollama runs on the host, localhost inside the container may not reach Ollama; on Docker Desktop, http://host.docker.internal:11434 is the usual api_base to use instead.)
model_list:
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://localhost:11434
  • Step 2 : Run the LiteLLM Docker image
docker run \
    -v $(pwd)/litellm_config.yaml:/app/config.yaml \
    -p 4000:4000 \
    ghcr.io/berriai/litellm:main-latest \
    --config /app/config.yaml --detailed_debug
  • Step 3 : Send a test request to ensure your setup is working

Note that the model_name in the first step and the "model" in this request must match

curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
    "model": "llama2",
    "messages": [
        {
        "role": "user",
        "content": "what llm are you"
        }
    ]
}'
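Since the proxy speaks the OpenAI API, any OpenAI-compatible client can talk to it as well. Below is a minimal sketch using the official openai Python package; the api_key value is just a placeholder, since the config above does not set up authentication.

# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at the local LiteLLM proxy.
client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-placeholder")

response = client.chat.completions.create(
    model="llama2",  # must match model_name from litellm_config.yaml
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)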

There are various other ways to deploy this; refer to the LiteLLM deployment documentation for more details.

What does OpenAI Proxy provide?

The proxy ships with Swagger docs that list all of the APIs it provides.
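For example, besides /chat/completions, the proxy also answers the OpenAI-style model listing endpoint. Here is a small sketch, assuming the proxy from the previous section is still running on port 4000:

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-placeholder")

# /models returns whatever is configured in litellm_config.yaml.
for model in client.models.list():
    print(model.id)  # expected to include "llama2" for the config above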

Supported Providers (Docs)

As of this writing, the following providers are supported.

| Provider | Completion | Streaming | Async Completion | Async Streaming | Async Embedding | Async Image Generation |
|---|---|---|---|---|---|---|
| openai | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| azure | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| aws - sagemaker | ✅ | ✅ | ✅ | ✅ | ✅ | |
| aws - bedrock | ✅ | ✅ | ✅ | ✅ | ✅ | |
| google - vertex_ai [Gemini] | ✅ | ✅ | ✅ | ✅ | | |
| google - palm | ✅ | ✅ | ✅ | ✅ | | |
| google AI Studio - gemini | ✅ | ✅ | | | | |
| mistral ai api | ✅ | ✅ | ✅ | ✅ | ✅ | |
| cloudflare AI Workers | ✅ | ✅ | ✅ | ✅ | | |
| cohere | ✅ | ✅ | ✅ | ✅ | ✅ | |
| anthropic | ✅ | ✅ | ✅ | ✅ | | |
| huggingface | ✅ | ✅ | ✅ | ✅ | ✅ | |
| replicate | ✅ | ✅ | ✅ | ✅ | | |
| together_ai | ✅ | ✅ | ✅ | ✅ | | |
| openrouter | ✅ | ✅ | ✅ | ✅ | | |
| ai21 | ✅ | ✅ | ✅ | ✅ | | |
| baseten | ✅ | ✅ | ✅ | ✅ | | |
| vllm | ✅ | ✅ | ✅ | ✅ | | |
| nlp_cloud | ✅ | ✅ | ✅ | ✅ | | |
| aleph alpha | ✅ | ✅ | ✅ | ✅ | | |
| petals | ✅ | ✅ | ✅ | ✅ | | |
| ollama | ✅ | ✅ | ✅ | ✅ | | |
| deepinfra | ✅ | ✅ | ✅ | ✅ | | |
| perplexity-ai | ✅ | ✅ | ✅ | ✅ | | |
| Groq AI | ✅ | ✅ | ✅ | ✅ | | |
| anyscale | ✅ | ✅ | ✅ | ✅ | | |
| voyage ai | | | | | ✅ | |
| xinference [Xorbits Inference] | | | | | ✅ | |

References