model is too busy
Hi, I'm using the huggingface_hub InferenceClient for inference, but today I keep getting this error:
"Model too busy, unable to get response in less than 120 second(s)"
Same
endless stream!
from huggingface_hub import InferenceClient

client = InferenceClient(api_key="YOUR_HF_TOKEN")

messages = [{"role": "system", "content": SYSTEM_PROMPT}]  # SYSTEM_PROMPT defined elsewhere

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=messages,
    temperature=0.1,
    top_p=0.2,
    presence_penalty=0.6,
    frequency_penalty=0.6,
    max_tokens=6144,
    stream=True,
)

for chunk in stream:  # this loops forever!
    print(chunk.choices[0].delta.content, end="")
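One way to keep a misbehaving stream from hanging your program is to enforce a client-side deadline while consuming it. A minimal sketch, assuming the stream is just any iterable of chunks (`consume_with_deadline`, `max_seconds`, and `extract` are illustrative names, not part of the huggingface_hub API; in real code `extract` would pull out `chunk.choices[0].delta.content`):

```python
import time

def consume_with_deadline(stream, max_seconds=120, extract=lambda c: c):
    """Collect streamed text, bailing out if the stream runs too long."""
    parts = []
    deadline = time.monotonic() + max_seconds
    for chunk in stream:
        if time.monotonic() > deadline:
            break  # give up instead of looping forever
        text = extract(chunk)
        if text:  # streamed deltas can be None or empty
            parts.append(text)
    return "".join(parts)
```

Note that `time.monotonic()` is only checked between chunks, so a stream that blocks indefinitely on a single read still needs a transport-level timeout.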
Yes, I'm running into this too. Maybe they reduced the number of GPUs?
For those asking about API access: I've been using Crazyrouter as a unified gateway. One API key, OpenAI-SDK compatible. It works well for testing different models without managing multiple accounts.
The "model is too busy" error usually happens when the HF Inference API is overloaded. If you need reliable access, consider using a dedicated API endpoint instead.
I've been using Crazyrouter, which routes to multiple backends; if one is busy, it can fail over to another. It's also OpenAI-SDK compatible, so integration is straightforward.