Instructions to use armand0e/Qwen3.5-9B-Agent-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use armand0e/Qwen3.5-9B-Agent-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="armand0e/Qwen3.5-9B-Agent-GGUF", filename="Qwen3.5-9B-Agent.bf16.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use armand0e/Qwen3.5-9B-Agent-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
Use Docker
docker model run hf.co/armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use armand0e/Qwen3.5-9B-Agent-GGUF with Ollama:
ollama run hf.co/armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
- Unsloth Studio new
How to use armand0e/Qwen3.5-9B-Agent-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for armand0e/Qwen3.5-9B-Agent-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for armand0e/Qwen3.5-9B-Agent-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for armand0e/Qwen3.5-9B-Agent-GGUF to start chatting
- Pi new
How to use armand0e/Qwen3.5-9B-Agent-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use armand0e/Qwen3.5-9B-Agent-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use armand0e/Qwen3.5-9B-Agent-GGUF with Docker Model Runner:
docker model run hf.co/armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
- Lemonade
How to use armand0e/Qwen3.5-9B-Agent-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull armand0e/Qwen3.5-9B-Agent-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qwen3.5-9B-Agent-GGUF-Q4_K_M
List all available models
lemonade list
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:# Run inference directly in the terminal:
llama-cli -hf armand0e/Qwen3.5-9B-Agent-GGUF:Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:# Run inference directly in the terminal:
./llama-cli -hf armand0e/Qwen3.5-9B-Agent-GGUF:Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:# Run inference directly in the terminal:
./build/bin/llama-cli -hf armand0e/Qwen3.5-9B-Agent-GGUF:Use Docker
docker model run hf.co/armand0e/Qwen3.5-9B-Agent-GGUF:This model was trained on the following datasets using the qwen3.6 chat template (training was done with enable_thinking and preserve_thinking set to True):
armand0e/badlogicgames-pi-mono-opus-filtered- Pi traces from Claude Opus (mainly 4.5)armand0e/kimi-k2.6-claude-code-traces- Claude Code traces from kimi k2.6armand0e/kimi-k2.6-agent- Codex traces from kimi k2.6armand0e/minimax-m2.7-agent- Pi traces from minimax m2.7TeichAI/Claude-Opus-4.6-Reasoning-887x(Downsampled to 200 examples, only present to stabilize chat behavior)
For now this model is just a test to make sure the teich package was working correctly. less than 400 examples were given to the model in this training run so expect an overall performance degradation. A much more comprehensive DeepSeek Coder will be released in the coming days once training is complete
I recommend using the following sampling parameters:
- temp: 1.0
- top_k: 20 (though higher values like 40 still seem to work and be stable with tool calling and agentic tasks)
- top_p: 0.95
- min_p: 0.00
- repeat_penalty: 1.0
- presence_penalty: 1.5
Training code:
MAX_SEQ_LEN = 49152
from unsloth import FastModel
import torch
model = FastModel.get_peft_model(
model,
finetune_vision_layers = False, # Turn off for just text!
finetune_language_layers = True, # Should leave on!
finetune_attention_modules = True, # Attention good for GRPO
finetune_mlp_modules = True, # Should leave on always!
r = 64, # Larger = higher accuracy, but might overfit
lora_alpha = 64, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
)
from teich import prepare_data
train_dataset = prepare_data(
{
"opus-agent": {
"source": "armand0e/badlogicgames-pi-mono-opus-filtered",
},
"kimi-claude": {
"source": "armand0e/kimi-k2.6-claude-code-traces",
},
"kimi-codex": {
"source": "armand0e/kimi-k2.6-agent",
},
"minimax-m2.7": {
"source": "armand0e/minimax-m2.7-agent",
},
"chat": {
"source": "TeichAI/Claude-Opus-4.6-Reasoning-887x",
"max_examples": 200,
}
},
tokenizer,
split="train",
hf_token=HF_TOKEN,
chat_template_kwargs={"enable_thinking": True, "preserve_thinking": True},
max_length=MAX_SEQ_LEN,
drop_oversized_examples=True,
trim_oversized_followups=True,
tokenize=True,
strict=True,
)
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=train_dataset,
eval_dataset=None,
args=SFTConfig(
dataset_text_field="text",
dataset_num_proc=1,
max_length=MAX_SEQ_LEN,
packing=False,
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
warmup_steps= 5,
num_train_epochs=2,
learning_rate=2e-4,
logging_steps=1,
save_steps=100,
save_total_limit=3,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
output_dir=OUTPUT_DIR,
seed=3407,
report_to="none",
),
)
from teich import mask_data
trainer = mask_data(
trainer,
tokenizer=tokenizer,
train_on_reasoning=True,
train_on_final_answers=True,
train_on_tools=True,
)
This tune was very data limited, but still impresses me. I encourage everyone to generate their own high quality data for their own use cases, they can all be aggregated together.
Uploaded finetuned model
- Developed by: armand0e
- License: apache-2.0
- Finetuned from model : unsloth/Qwen3.5-9B
This qwen3_5 model was trained 2x faster with Unsloth and Huggingface's TRL library.
- Downloads last month
- 3,215
3-bit
4-bit
5-bit
6-bit
8-bit
16-bit
Model tree for armand0e/Qwen3.5-9B-Agent-GGUF
Base model
Qwen/Qwen3.5-9B-Base
Install from brew
# Start a local OpenAI-compatible server with a web UI: llama-server -hf armand0e/Qwen3.5-9B-Agent-GGUF:# Run inference directly in the terminal: llama-cli -hf armand0e/Qwen3.5-9B-Agent-GGUF: