model is too busy
Hi, I'm using the huggingface_hub InferenceClient for inference, but today I keep getting this error:
"Model too busy, unable to get response in less than 120 second(s)"
Same
endless stream!
from huggingface_hub import InferenceClient

client = InferenceClient(api_key="YOUR_HF_TOKEN")

messages = [{"role": "system", "content": SYSTEM_PROMPT}]  # SYSTEM_PROMPT defined elsewhere

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=messages,
    temperature=0.1,
    top_p=0.2,
    presence_penalty=0.6,
    frequency_penalty=0.6,
    max_tokens=6144,
    stream=True,
)

for chunk in stream:  # this loops forever!
    print(chunk.choices[0].delta.content, end="")
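One way to keep a misbehaving stream from hanging your program is to enforce a client-side deadline while consuming it. A minimal sketch, assuming the stream is just any iterable of chunks (`consume_with_deadline`, `max_seconds`, and `extract` are illustrative names, not part of the huggingface_hub API; in real code `extract` would pull out `chunk.choices[0].delta.content`):

```python
import time

def consume_with_deadline(stream, max_seconds=120, extract=lambda c: c):
    """Collect streamed text, bailing out if the stream runs too long."""
    parts = []
    deadline = time.monotonic() + max_seconds
    for chunk in stream:
        if time.monotonic() > deadline:
            break  # give up instead of looping forever
        text = extract(chunk)
        if text:  # streamed deltas can be None or empty
            parts.append(text)
    return "".join(parts)
```

Note that `time.monotonic()` is only checked between chunks, so a stream that blocks indefinitely on a single read still needs a transport-level timeout.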
Yes, I'm running into this too. Maybe they reduced the number of GPUs?
For those asking about API access: I've been using Crazyrouter as a unified gateway. One API key, OpenAI-SDK compatible. It works well for testing different models without managing multiple accounts.
The "model is too busy" error usually happens when the HF Inference API is overloaded. If you need reliable access, consider using a dedicated API endpoint instead.
I've been using Crazyrouter, which routes to multiple backends; if one is busy, it can fail over to another. It's also OpenAI-SDK compatible, so integration is straightforward.