Tags: Image-Text-to-Text, Transformers, GGUF, text-generation-inference, unsloth, qwen3_5, reasoning, chain-of-thought, lora, sft, agent, tool-use, function-calling, coder, conversational
Jackrong committed
Commit 473573d · verified · 1 Parent(s): 3b8ca56

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +39 -0
README.md CHANGED
@@ -362,6 +362,45 @@ The training of this model integrates a phased learning pipeline of **Trace Inve

  ---

+ ## 🚀 Context Length and Long-Context Usage
+
+ During fine-tuning, this model was trained with a maximum sequence length of **32K tokens**. The training data mixture was also built from samples of up to **32K tokens**, so the "Context Length Distribution" shown in this model card reflects the fine-tuning data distribution rather than a hard architectural limit.
+
+ The model still inherits the native long-context capability of the Qwen3.6 base model. Longer context windows such as **128K** or **256K** may therefore be available in compatible inference runtimes, depending on the backend and configuration.
+
+ For practical long-context inference beyond 32K, especially with **llama.cpp / GGUF**, we recommend enabling **RoPE/YaRN scaling** rather than only increasing `n_ctx` / `--ctx-size`. Setting a larger context window without RoPE scaling may work in some cases, but it can be less stable and may fall short of the expected long-context performance.
+
+ This is consistent with Qwen community guidance for long-context GGUF usage: **128K context generally requires YaRN/RoPE scaling**, which is not necessarily enabled by default in llama.cpp. For example, Qwen maintainers have noted that "128K context length needs YaRN" and that it should be explicitly enabled when the runtime supports it.
+ Reference: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-GGUF/discussions/2
+
+ Community feedback also suggests that RoPE/YaRN scaling can improve long-context stability for this model family. One user reported that, on **HermesAgent-20**, `Qwopus3.6-35B-A3B-v1` scored **83** when extended from **32K to 128K via RoPE scaling**, versus **72** when a **128K context window** was set directly without scaling. This result may vary with the backend, quantization type, KV cache settings, hardware, and benchmark configuration, but it is consistent with the recommendation to use RoPE/YaRN scaling for contexts beyond 32K.
+
+ Example llama.cpp configuration for extending from 32K to 128K:
+
+ ```bash
+ ./llama-server \
+   -m model.gguf \
+   --ctx-size 131072 \
+   --rope-scaling yarn \
+   --rope-scale 4 \
+   --yarn-orig-ctx 32768
+ ```
+
+ For a 256K context, adjust the scaling factor accordingly (262144 / 32768 = 8) and validate the result on your own workload:
+
+ ```bash
+ ./llama-server \
+   -m model.gguf \
+   --ctx-size 262144 \
+   --rope-scaling yarn \
+   --rope-scale 8 \
+   --yarn-orig-ctx 32768
+ ```
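The scale factors in the two examples above follow from dividing the target context by the 32K fine-tuning length. The following is a minimal illustrative sketch of that arithmetic, not part of the commit or of llama.cpp: `rope_scale` is a hypothetical helper, and the restriction to exact multiples is an assumption made here to keep the shell arithmetic in integers.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not a llama.cpp command): derive the YaRN scale factor
# as target context / original context, using integer arithmetic only.
rope_scale() {
  local target_ctx=$1 orig_ctx=$2
  # Assumption: reject targets that are not an exact multiple of the original
  # context, so integer division never silently truncates.
  if (( target_ctx % orig_ctx != 0 )); then
    echo "target context must be a multiple of the original context" >&2
    return 1
  fi
  echo $(( target_ctx / orig_ctx ))
}

ORIG_CTX=32768      # fine-tuning sequence length stated in this section
TARGET_CTX=262144   # desired context window (256K)
SCALE=$(rope_scale "$TARGET_CTX" "$ORIG_CTX")

# Print the flags instead of launching the server, so they can be reviewed first.
echo "--ctx-size $TARGET_CTX --rope-scaling yarn --rope-scale $SCALE --yarn-orig-ctx $ORIG_CTX"
```

The printed flags can then be appended to a `./llama-server -m model.gguf` invocation like the ones shown above.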
+
+ Long-context behavior may vary depending on the inference backend, quantization type, KV cache settings, available memory, and task type. For best results, benchmark your own target workload when using contexts beyond 32K.
+
+ ---
+
  ## 🤝 Collaboration & Training Details

  This model is the result of continuous exploration in Agentic AI and reasoning capabilities.