HuggingFace login #168
-
I get the following error message with HuggingFace:
I don't know where I should input my HuggingFace credentials. This is my code:
-
To use the HF models with LiteLLM you need to set the HuggingFace token as an environment variable for the optillm proxy, for example:
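A minimal sketch, assuming the variable is named HF_TOKEN (check the LiteLLM/optillm docs for the exact name) and that it is set in the environment from which the proxy will be launched:

import os

# Assumption (not from the original reply): HF_TOKEN is the variable read for the
# HuggingFace token. Set it in the shell or launcher process that will start optillm.
os.environ["HF_TOKEN"] = "hf_xxx_your_token_here"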
and then run optillm. Or, you can use the inbuilt inference server in optillm directly. For that, set the environment variables as follows:
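Again only a sketch: OPTILLM_API_KEY set to "optillm" (matching the api_key used in the client code below) is what switches on the inbuilt inference server, and HF_TOKEN is assumed here to be the variable for the HuggingFace token:

import os

# OPTILLM_API_KEY enables the inbuilt inference server; HF_TOKEN is an assumed
# name for the HuggingFace token variable. Set both before starting optillm.
os.environ["OPTILLM_API_KEY"] = "optillm"
os.environ["HF_TOKEN"] = "hf_xxx_your_token_here"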
and then run optillm (setting OPTILLM_API_KEY tells the proxy to use the inbuilt inference server). The benefits of using the inbuilt inference server are that it is usually much faster and that it supports additional features of the standard OpenAI API, such as returning logprobs, structured outputs (with response_format), and reasoning_effort. For example:

import os
import time

from openai import OpenAI

OPENAI_BASE_URL = "http://localhost:8000/v1"
OPENAI_API_KEY = "optillm"

# Point the OpenAI client at the local optillm proxy
client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)

messages = [
    {
        "role": "user",
        "content": "How many rs are there in strawberry? Use code to solve the problem.",
    }
]

start_time = time.time()
response = client.chat.completions.create(
    model="huggingface/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    # model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # no need to include the huggingface/ prefix when using the inbuilt inference server
    messages=messages,
    temperature=0.6,
)
end_time = time.time()

completion_tokens = response.usage.completion_tokens
elapsed_time = end_time - start_time
throughput = completion_tokens / elapsed_time if elapsed_time > 0 else 0

print(f"Completion tokens: {completion_tokens}")
print(f"Elapsed time: {elapsed_time:.2f} seconds")
print(f"Throughput: {throughput:.2f} tokens/second")

With LiteLLM:
With optiLLM:
-
It works. Thank you very much for your great tool and your support! I have a few more questions:
I'm sorry to bother you, but I couldn't find any advanced documentation for optillm besides what's in the README.
-
Thank you for your answer. The end goal is to get the logits (or the probability distribution) for the next token for a simple completion. For instance, if I input the sentence "Roses are red, the sky is", I want to get the probability distribution of the next token. You said that was possible in this discussion about Ollama. Maybe I misunderstood?
-
Yes, I can implement it; meanwhile, you can try the snippet below:
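A rough sketch of the idea, using the standard OpenAI logprobs parameters mentioned above with max_tokens=1 so only the next token's distribution is returned (whether top_logprobs is passed through by the inbuilt inference server is an assumption here):

import math

from openai import OpenAI

client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

# Request a single completion token together with its top alternatives to
# approximate the next-token probability distribution for the prompt.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Roses are red, the sky is"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,  # the OpenAI API allows up to 20 alternatives per position
)

for entry in response.choices[0].logprobs.content[0].top_logprobs:
    print(f"{entry.token!r}: p={math.exp(entry.logprob):.4f}")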