HuggingFace login #168
-
I get the following error message with HuggingFace:
I don't know where I should input my HuggingFace credentials. This is my code:
-
To use the HF models with LiteLLM you need to set the HuggingFace token as an environment variable for the optillm proxy, for example:
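A minimal sketch, assuming the variable is named HF_TOKEN (check the LiteLLM/optillm docs for the exact name) and that it is set in the environment from which the proxy will be launched:

import os

# Assumption (not from the original reply): HF_TOKEN is the variable read for the
# HuggingFace token. Set it in the shell or launcher process that will start optillm.
os.environ["HF_TOKEN"] = "hf_xxx_your_token_here"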
and then run optillm. Or, you can use the inbuilt inference server in optillm directly. For that, set the environment variables as follows:
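Again only a sketch: OPTILLM_API_KEY set to "optillm" (matching the api_key used in the client code below) is what switches on the inbuilt inference server, and HF_TOKEN is assumed here to be the variable for the HuggingFace token:

import os

# OPTILLM_API_KEY enables the inbuilt inference server; HF_TOKEN is an assumed
# name for the HuggingFace token variable. Set both before starting optillm.
os.environ["OPTILLM_API_KEY"] = "optillm"
os.environ["HF_TOKEN"] = "hf_xxx_your_token_here"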
and then run optillm (setting OPTILLM_API_KEY tells the proxy to use the inbuilt inference server). The benefits of using the inbuilt inference server are that it is usually much faster and that it supports additional features of the standard OpenAI API, such as returning logprobs, structured outputs (with response_format), and reasoning_effort. For example:

import os
import time

from openai import OpenAI

OPENAI_BASE_URL = "http://localhost:8000/v1"
OPENAI_API_KEY = "optillm"

# Point the OpenAI client at the local optillm proxy
client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_BASE_URL)

messages = [
    {
        "role": "user",
        "content": "How many rs are there in strawberry? Use code to solve the problem.",
    }
]

start_time = time.time()
response = client.chat.completions.create(
    model="huggingface/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    # model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # no need to include the huggingface/ prefix when using the inbuilt inference server
    messages=messages,
    temperature=0.6,
)
end_time = time.time()

completion_tokens = response.usage.completion_tokens
elapsed_time = end_time - start_time
throughput = completion_tokens / elapsed_time if elapsed_time > 0 else 0

print(f"Completion tokens: {completion_tokens}")
print(f"Elapsed time: {elapsed_time:.2f} seconds")
print(f"Throughput: {throughput:.2f} tokens/second")

With LiteLLM:
With optiLLM:
-
It works. Thank you very much for your great tool and your support! I have a few more questions:
I'm sorry to bother you, but I couldn't find any advanced documentation for optillm besides what's in the README.
-
Thank you for your answer. The end goal is to get the logits (or the probability distribution) for the next token for a simple completion. For instance, if I input the sentence "Roses are red, the sky is", I want to get the probability distribution of the next token. You said that was possible in this discussion about Ollama. Maybe I misunderstood?
-
Yes, I can implement it; meanwhile, you can try the snippet below:
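A rough sketch of the idea, using the standard OpenAI logprobs parameters mentioned above with max_tokens=1 so only the next token's distribution is returned (whether top_logprobs is passed through by the inbuilt inference server is an assumption here):

import math

from openai import OpenAI

client = OpenAI(api_key="optillm", base_url="http://localhost:8000/v1")

# Request a single completion token together with its top alternatives to
# approximate the next-token probability distribution for the prompt.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Roses are red, the sky is"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,  # the OpenAI API allows up to 20 alternatives per position
)

for entry in response.choices[0].logprobs.content[0].top_logprobs:
    print(f"{entry.token!r}: p={math.exp(entry.logprob):.4f}")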