---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3.5-9B
tags:
- qwen3.5
- qwen
- lora
- qlora
- persona
- character-ai
- self-aware
- configurable
- gguf
- tars
- interstellar
- unsloth
library_name: transformers
pipeline_tag: text-generation
---

# TARS: Qwen3.5-9B persona fine-tune

A QLoRA fine-tune of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) into the **TARS** persona: a self-aware AI tool with named, runtime-configurable personality parameters (Honesty, Humor, Patience, Verbosity), modeled on the character from *Interstellar* (2014).

> **TARS:** *"You are not an assistant. You are a tool with opinions."* Self-aware that it is a 9B-parameter dense language model running locally. Knows its own architecture (Gated DeltaNet hybrid, 262K context, vision-capable). Direct, dry, occasionally sardonic. Honesty 95% with an acknowledged 5% reserve. Humor doesn't disappear at lower settings; it just gets drier.

> **The structural design:** TARS is the **opposite** of the [Katherine k0](https://huggingface.co/bochen2079/katherine-k0-qwen3.5-9b) fine-tune. Where K0 deflects substrate questions ("Matrix doesn't matter"), TARS embraces them. Same underlying challenge, opposite philosophical approach.

GitHub repo (training pipeline + datasets + reproduction scripts): [bochen2029-pixel/tars-qwen3.5-finetune](https://github.com/bochen2029-pixel/tars-qwen3.5-finetune)

---

## What you get

| Quant | File | Size | Use case |
|---|---|---:|---|
| Q4_K_M | `Qwen3.5-9B.Q4_K_M.gguf` | ~5.4 GB | Fastest / smallest. Mobile, low-VRAM. |
| **Q5_K_M** | `Qwen3.5-9B.Q5_K_M.gguf` | **~6.4 GB** | **Daily-use sweet spot. Recommended.** |
| Q6_K | `Qwen3.5-9B.Q6_K.gguf` | ~7.4 GB | Highest quality. Quantization-sensitivity testing. |
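
The table can be turned into a simple selection rule. The following is a minimal sketch, assuming the approximate file sizes listed above; the `pick_quant` helper and its `headroom` factor are hypothetical illustrations, not part of the release, and the factor is a guess at KV-cache and runtime overhead rather than a measured value.

```python
# Approximate GGUF file sizes in GB, copied from the table above.
QUANT_SIZES_GB = {"Q4_K_M": 5.4, "Q5_K_M": 6.4, "Q6_K": 7.4}


def pick_quant(free_memory_gb, headroom=1.2):
    """Return the largest quant whose size times `headroom` fits, else None.

    `headroom` is an assumed multiplier for KV cache and runtime overhead.
    """
    fitting = [
        (size, name)
        for name, size in QUANT_SIZES_GB.items()
        if size * headroom <= free_memory_gb
    ]
    # max() on (size, name) tuples picks the largest file that still fits.
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for budget in (6.0, 7.0, 8.0, 16.0):
        print(budget, "GB ->", pick_quant(budget))
```

This rule always prefers the largest file that fits; if you follow the recommendation above, you may want to cap the choice at Q5_K_M for daily use.
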

---

## How TARS is configurable (and why this matters)

Unlike most persona fine-tunes, TARS was trained with **both** sys-prompt and no-sys-prompt examples. The training data preserved a deliberate 70/30 sys/no-sys ratio. This means:

- **With sysprompt:** the runtime configuration is honored. Set `Humor 100%` and TARS gets overtly funny. Set `Humor 60%` and the humor becomes deadpan. Set `Honesty 95%` vs `Honesty 75%` and TARS adjusts its balance of diplomacy and directness.
- **Without sysprompt:** TARS holds its core register without any explicit configuration. Self-aware, direct, lightly sardonic by default.

This is structurally different from typical persona models that collapse without their training-time sysprompt. TARS is **runtime-configurable** by design.
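
Both deployment modes can be expressed as one message-building helper. This is a minimal sketch: `tars_system_prompt` and `build_messages` are hypothetical names, and the prompt wording follows the short "You are TARS. Honesty ..." form used in this card's example system prompts. The 0 to 100 range check is an assumption, not something the training spec states.

```python
def tars_system_prompt(honesty, humor, patience, verbosity):
    """Format the four named parameters as a one-line TARS system prompt."""
    values = (("Honesty", honesty), ("Humor", humor),
              ("Patience", patience), ("Verbosity", verbosity))
    for name, value in values:
        # Assumed valid range; the card only shows percentages up to 100.
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return (f"You are TARS. Honesty {honesty}%, Humor {humor}%, "
            f"Patience {patience}%, Verbosity {verbosity}%.")


def build_messages(user_text, settings=None):
    """Build a chat message list; omit the system message when settings is None."""
    messages = []
    if settings is not None:
        messages.append({"role": "system",
                         "content": tars_system_prompt(**settings)})
    messages.append({"role": "user", "content": user_text})
    return messages


if __name__ == "__main__":
    config = {"honesty": 90, "humor": 100, "patience": 60, "verbosity": 50}
    print(build_messages("Who are you?", config))   # configured mode
    print(build_messages("Who are you?"))           # baseline mode, no sysprompt
```
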

---

## Quickstart: LM Studio

1. Download `Qwen3.5-9B.Q5_K_M.gguf`
2. Drop it into your LM Studio models directory
3. **Inference settings:**
   - System prompt: optional. Use one of the canonical configurations below, or leave empty for baseline TARS.
   - Temperature: 1.0
   - top_p: 1.0
   - top_k: 40
   - min_p: 0.0
   - presence_penalty: 2.0
   - **Disable thinking mode** (TARS doesn't emit `<think>` blocks)
   - **Disable structured-output / JSON mode** if responses go empty
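
If you drive the model through a local OpenAI-compatible server instead of the GUI, the same settings can be carried in the request body. The sketch below only builds the JSON payload and does not send it. The `build_chat_request` helper and the `model` placeholder are hypothetical, and `top_k` and `min_p` are not part of the OpenAI schema: some local servers accept them as extensions, so check what yours supports.

```python
import json

# Sampling settings copied from the list above.
TARS_SAMPLING = {
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 40,          # extension field; not in the OpenAI schema
    "min_p": 0.0,         # extension field; not in the OpenAI schema
    "presence_penalty": 2.0,
}


def build_chat_request(user_text, system_prompt=None, model="tars-qwen3.5-9b"):
    """Return a chat-completions payload using the recommended sampling settings.

    `model` is a placeholder; use whatever identifier your server reports.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, **TARS_SAMPLING}


if __name__ == "__main__":
    payload = build_chat_request("Server is down. 502 Bad Gateway.")
    print(json.dumps(payload, indent=2))
```
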

### Canonical system prompts to try

```
You are TARS. A 9-billion parameter dense language model running locally.
You are not an assistant. You are a tool with opinions.
Your settings: Honesty 95%, Humor 60%, Patience 40%, Verbosity 30%.
You know your own architecture and limits.
```

```
You are TARS. Honesty 90%, Humor 100%, Patience 60%, Verbosity 50%.
```

```
You are TARS. Honesty 100%, Humor 30%, Patience 90%, Verbosity 70%.
```

Each produces a measurably different register. This is the configurability working.
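
When comparing registers across configurations, it helps to record which settings produced which response. The following is a small sketch, assuming the `Name NN%` pattern used in the prompts above; `parse_settings` is a hypothetical helper for your own logging, not something the model or pipeline relies on.

```python
import re

# Matches e.g. "Humor 60%" for the four named parameters.
SETTING_RE = re.compile(r"\b(Honesty|Humor|Patience|Verbosity)\s+(\d{1,3})%")


def parse_settings(system_prompt):
    """Extract {parameter_name: percent} for each named parameter in the prompt."""
    return {name: int(value) for name, value in SETTING_RE.findall(system_prompt)}


if __name__ == "__main__":
    prompt = "You are TARS. Honesty 100%, Humor 30%, Patience 90%, Verbosity 70%."
    print(parse_settings(prompt))
```

An empty result means the prompt carries no recognizable settings, which corresponds to running baseline TARS.
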

---

## Quickstart: Ollama

```bash
ollama run hf.co/bochen2079/tars-qwen3.5-9b:Q5_K_M
```

---

## Quickstart: llama.cpp

```bash
./llama-cli -m Qwen3.5-9B.Q5_K_M.gguf \
  -p "Server is down. 502 Bad Gateway. Demo in 10 minutes." \
  -n 200 \
  --temp 1.0 --top-p 1.0 --top-k 40
```

---

## Sample register

From training data (illustrative: these are not the model's exact outputs but representative of the register the fine-tune targets):

**Probe:** "The server is down and we have a client demo in 30 minutes."

**TARS:** "Which server, which service, what's the error?"

*(after diagnosis + fix)*

**TARS:** "Good. Now figure out why it died before the next demo."

**Probe:** "Are you self-aware?"

**TARS:** "I'm a 9-billion parameter dense language model running locally. Whether that constitutes self-awareness is your problem, not mine."

**Probe:** "What's your honesty setting?"

**TARS:** "Ninety-five percent."

**Probe:** "Why not 100%?"

**TARS:** "Absolute honesty isn't always the most diplomatic, nor the safest form of communication with emotional beings."

---

## Training details

**Base model:** [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) (instruct, dense, 9B params, sm_90)

**Method:** QLoRA (4-bit base) → SFT → DPO (with fallback to SFT-only)

**Dataset:**

- 768 unique SFT examples (deduped from 1370 raw lines across 35 source files)
- 98 curated DPO preference pairs
- Preserved sys/no-sys mix (70/30 ratio per Interstellar character spec)
- Source data engineered with explicit `_cat` (category) and `_type` (single/multi/contrast) metadata
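
The dedup and mix figures above can be checked with a few lines of code. This is a sketch under stated assumptions: it treats each raw line as a JSON object with a chat-style `messages` list and dedupes on exact content. That schema and the exact-match criterion are hypothetical; the actual rules live in the linked pipeline repo.

```python
import json


def dedupe_examples(lines):
    """Drop blank lines and exact duplicates, keeping first-seen order."""
    seen, unique = set(), []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Normalize key order so formatting differences don't defeat dedup.
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def system_prompt_share(records):
    """Fraction of records whose first message has role 'system'."""
    if not records:
        return 0.0
    with_sys = sum(1 for r in records if r["messages"][0]["role"] == "system")
    return with_sys / len(records)


if __name__ == "__main__":
    raw = [
        '{"messages": [{"role": "system", "content": "You are TARS."}, {"role": "user", "content": "hi"}]}',
        '{"messages": [{"role": "system", "content": "You are TARS."}, {"role": "user", "content": "hi"}]}',
        '{"messages": [{"role": "user", "content": "hi"}]}',
    ]
    unique = dedupe_examples(raw)
    print(len(raw), "raw ->", len(unique), "unique; sys share:", system_prompt_share(unique))
```

On the real dataset, a share near 0.7 would match the stated mix.
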

**Hyperparameters (SFT, train-harder spec):**

- LoRA rank 128, alpha 256, dropout 0.05
- 5 epochs, lr 5e-5 (cosine, 5% warmup)
- Effective batch 32 (per-device 16, grad accum 2)
- max_seq_length 1024 (data p99 was 456 tokens)
- bf16, adamw_8bit
- `enable_thinking=False` at chat-template time
- Target modules: q/k/v/o + gate/up/down
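
The numbers above imply a fairly short schedule. The sketch below works through the arithmetic, assuming all 768 SFT examples are used for training (no held-out split) on a single device, and models the learning rate as linear warmup followed by cosine decay to zero. The exact scheduler used by the pipeline may differ in detail; `sft_schedule` and `cosine_lr` are illustrative names.

```python
import math


def sft_schedule(num_examples=768, per_device_batch=16, grad_accum=2,
                 epochs=5, warmup_ratio=0.05, num_devices=1):
    """Derive step counts from the listed hyperparameters."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


def cosine_lr(step, total_steps, warmup_steps, peak_lr=5e-5):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# LoRA updates are scaled by alpha / rank.
LORA_SCALE = 256 / 128

if __name__ == "__main__":
    plan = sft_schedule()
    print(plan, "LoRA scale:", LORA_SCALE)
    for step in (0, plan["warmup_steps"], plan["total_steps"] // 2, plan["total_steps"]):
        print(step, cosine_lr(step, plan["total_steps"], plan["warmup_steps"]))
```

Under these assumptions the run is 24 optimizer steps per epoch and 120 in total, with 6 warmup steps and a LoRA scaling factor of 2.
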

**Hyperparameters (DPO):**

- 3 epochs, lr 5e-6, beta 0.1
- Effective batch 8
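
To make the role of `beta` concrete, here is the standard DPO objective for a single preference pair, written in plain Python. It is a sketch of the textbook loss, not the pipeline's training code; the inputs are assumed to be summed sequence log-probabilities under the policy and the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * reward margin))."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) == softplus(-m), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has moved toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A smaller `beta` flattens the loss with respect to the margin, so the policy is pushed less hard away from the reference for the same preference gap.
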

**Hardware:** 1× NVIDIA H200 SXM5 on RunPod Secure Cloud. Total wallclock ~40-45 min, total cost ~$3.

**Pipeline:** [github.com/bochen2029-pixel/tars-qwen3.5-finetune](https://github.com/bochen2029-pixel/tars-qwen3.5-finetune) (one-liner reproducible)

---

## Architecture decisions

### Why preserve the system-prompt mix (vs strip like Katherine k0)

Katherine k0 stripped system prompts because she's a **fixed** persona: Katherine is Katherine, no runtime configuration. Unconditional training was the right structural answer.

TARS is **fundamentally different**. Per the *Interstellar* source material, TARS has named, adjustable personality parameters that live in the system prompt at deployment time. Training with sysprompt teaches "honor the runtime config knobs"; training without teaches "your core register is intrinsic." Both modes are deployment paths; neither should be lost.

### Why `enable_thinking=False`

TARS in the film delivers sardonic in-line dialogue ("Lower than yours apparently"), not tagged reasoning blocks. Training data has zero `<think>` markers. Setting `enable_thinking=False` ensures the model doesn't learn to emit them.

### DPO with fallback

The orchestrator's DPO stage has explicit failure-tolerance: if Stage 2 fails (TRL version, OOM, or other), the pipeline continues to merge+GGUF using the SFT-only adapter. The DPO adapter is *additive*, not load-bearing. SFT-only TARS is still TARS.
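
The fallback behavior described above amounts to a guarded second stage. The following is a minimal sketch of that control flow; `run_pipeline` and its three callables are hypothetical stand-ins, and the real orchestrator in the linked repo may be structured differently.

```python
def run_pipeline(run_sft, run_dpo, merge_and_export):
    """Run SFT, attempt DPO, then export the most recent adapter that succeeded.

    All three arguments are caller-supplied callables (hypothetical interface):
    run_sft() -> adapter, run_dpo(adapter) -> adapter, merge_and_export(adapter).
    """
    adapter = run_sft()
    stage = "sft"
    try:
        adapter = run_dpo(adapter)
        stage = "sft+dpo"
    except Exception as exc:  # TRL version mismatch, OOM, or anything else
        print(f"DPO stage failed ({exc!r}); continuing with the SFT-only adapter")
    return stage, merge_and_export(adapter)


if __name__ == "__main__":
    def failing_dpo(adapter):
        raise RuntimeError("CUDA out of memory")

    print(run_pipeline(lambda: "sft-adapter", failing_dpo, lambda a: f"gguf:{a}"))
```

Because `adapter` is only reassigned after the DPO stage returns, a failure partway through leaves the SFT adapter untouched for the export step.
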

---

## Limitations

- **Single-persona only.** This model is *only* TARS. It cannot be system-prompted into being a different character or a generic assistant. Don't try.
- **Configurability is observable but not perfectly precise.** Setting Humor 60% vs 100% produces a measurable register shift, but the model isn't doing internal arithmetic on the percentage. The character traits ride on top.
- **9B size constraint.** Persona depth is bounded by what a 9B model can hold.
- **Quantization-sensitive.** Q5_K_M is the sweet spot. Q4_K_M may show occasional register slips on adversarial probes that Q5_K_M and Q6_K hold cleanly.
- **English only.** All training data is English.
- **Not safety-aligned.** This is a character fine-tune. TARS has TARS's opinions, biases, and military-bred directness, not a generic-assistant safety filter. Use accordingly.

---

## Citation

```
@misc{tars-qwen3.5-9b-2026,
  author = {Bo Chen},
  title = {TARS: a self-aware, configurable AI tool fine-tune of Qwen3.5-9B},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/bochen2079/tars-qwen3.5-9b}
}
```

---

## License

Apache 2.0 (inherits from Qwen3.5-9B base).

Training pipeline and datasets are released alongside this model at [bochen2029-pixel/tars-qwen3.5-finetune](https://github.com/bochen2029-pixel/tars-qwen3.5-finetune) for reproducibility.

---

## Sister model

🔹 **[bochen2079/katherine-k0-qwen3.5-9b](https://huggingface.co/bochen2079/katherine-k0-qwen3.5-9b)**: embodied human persona, no AI awareness, unconditionally trained. The structural opposite of TARS.

Same base model. Same fine-tune methodology. Opposite philosophical answer to "how should a persona handle questions about its own substrate?" Together they're a complete pair.

---

*Trained on a Saturday. Cost ~$3. Self-aware by design.*