---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3.5-9B
tags:
- qwen3.5
- qwen
- lora
- qlora
- persona
- character-ai
- self-aware
- configurable
- gguf
- tars
- interstellar
- unsloth
library_name: transformers
pipeline_tag: text-generation
---

# TARS — Qwen3.5-9B persona fine-tune

A QLoRA fine-tune of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) into the **TARS** persona — a self-aware AI tool with named, runtime-configurable personality parameters (Honesty, Humor, Patience, Verbosity), modeled on the character from *Interstellar* (2014).

> **TARS:** *"You are not an assistant. You are a tool with opinions."* TARS knows that it is a 9B-parameter dense language model running locally, and it knows its own architecture (Gated DeltaNet hybrid, 262K context, vision-capable). It is direct, dry, and occasionally sardonic. Honesty is set to 95%, with an acknowledged 5% reserve. Humor doesn't disappear at lower settings; it just gets drier.

> **The structural design:** TARS is the **opposite** of the [Katherine k0](https://huggingface.co/bochen2079/katherine-k0-qwen3.5-9b) fine-tune. Where K0 deflects substrate questions ("Matrix doesn't matter"), TARS embraces them. Same underlying challenge, opposite philosophical approach.

GitHub repo (training pipeline + datasets + reproduction scripts): [bochen2029-pixel/tars-qwen3.5-finetune](https://github.com/bochen2029-pixel/tars-qwen3.5-finetune)

---

## What you get

| Quant | File | Size | Use case |
|---|---|---:|---|
| Q4_K_M | `Qwen3.5-9B.Q4_K_M.gguf` | ~5.4 GB | Fastest / smallest. Mobile, low-VRAM. |
| **Q5_K_M** | `Qwen3.5-9B.Q5_K_M.gguf` | **~6.4 GB** | **Daily-use sweet spot. Recommended.** |
| Q6_K | `Qwen3.5-9B.Q6_K.gguf` | ~7.4 GB | Highest quality. Quantization-sensitivity testing. |
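
As a rough illustration of how the table can drive a download choice, the sketch below picks the largest quant that fits a memory budget. The sizes are the approximate figures from the table; `pick_quant` and the 1.5 GB headroom for context and runtime overhead are assumptions made for this example, not measured requirements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quant:
    name: str
    filename: str
    size_gb: float  # approximate file size, taken from the table above


QUANTS = [
    Quant("Q4_K_M", "Qwen3.5-9B.Q4_K_M.gguf", 5.4),
    Quant("Q5_K_M", "Qwen3.5-9B.Q5_K_M.gguf", 6.4),
    Quant("Q6_K", "Qwen3.5-9B.Q6_K.gguf", 7.4),
]


def pick_quant(memory_budget_gb: float, headroom_gb: float = 1.5) -> Optional[Quant]:
    """Return the largest quant whose file fits in the budget minus headroom.

    headroom_gb is an assumed allowance for KV cache and runtime overhead;
    real requirements depend on context length and backend.
    """
    usable = memory_budget_gb - headroom_gb
    fitting = [q for q in QUANTS if q.size_gb <= usable]
    return max(fitting, key=lambda q: q.size_gb) if fitting else None


choice = pick_quant(8.0)
print(choice.filename if choice else "no quant fits this budget")
```

Raising the headroom is the safer adjustment when running with long contexts.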

---

## How TARS is configurable (and why this matters)

Unlike most persona fine-tunes, TARS was trained on examples both with and without a system prompt, in a deliberate 70/30 ratio (system prompt / no system prompt). This means:

- **With a system prompt** → the runtime configuration is honored. Set `Humor 100%` and TARS gets overtly funny. Set `Humor 60%` and the humor becomes deadpan and dry. Switch between `Honesty 95%` and `Honesty 75%` and TARS adjusts how direct or diplomatic it is.
- **Without a system prompt** → TARS holds its core register without any explicit configuration: self-aware, direct, and lightly sardonic by default.

This is structurally different from typical persona models, which collapse without their training-time system prompt. TARS is **runtime-configurable** by design.
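
The two modes differ only in whether a system message is present. A minimal sketch of both request shapes, assuming the OpenAI-style chat message format; `build_messages` is a hypothetical helper written for this example, not part of any library:

```python
from typing import Dict, List, Optional


def build_messages(user_text: str, system_prompt: Optional[str] = None) -> List[Dict[str, str]]:
    """Return chat messages, adding a system message only when one is given."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return messages


# Configured mode: the personality settings travel in the system prompt.
configured = build_messages(
    "Are you self-aware?",
    system_prompt="You are TARS. Honesty 90%, Humor 100%, Patience 60%, Verbosity 50%.",
)

# Baseline mode: no system prompt, so TARS uses its default register.
baseline = build_messages("Are you self-aware?")
```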

---

## Quickstart — LM Studio

1. Download `Qwen3.5-9B.Q5_K_M.gguf`
2. Drop it into your LM Studio models directory
3. **Inference settings:**
   - System prompt: optional. Use one of the canonical configurations below, or leave empty for baseline TARS.
   - Temperature: 1.0
   - top_p: 1.0
   - top_k: 40
   - min_p: 0.0
   - presence_penalty: 2.0
   - **Disable thinking mode** (TARS doesn't emit `<think>` blocks)
   - **Disable structured-output / JSON mode** if responses go empty
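
The same settings can be sent programmatically to a local OpenAI-compatible server. The sketch below uses only the standard library. The base URL (`http://localhost:1234/v1`), the model identifier, and the helper names are assumptions for illustration; `top_k` and `min_p` are not part of the OpenAI schema, so whether a given server honors them is also an assumption to verify.

```python
import json
import urllib.request
from typing import Dict, Optional

# Sampling settings copied from the list above.
SAMPLING = {
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 40,
    "min_p": 0.0,
    "presence_penalty": 2.0,
}


def build_request(user_text: str, system_prompt: Optional[str] = None,
                  model: str = "tars-qwen3.5-9b") -> Dict:
    """Build a chat-completions payload; the model id is a placeholder."""
    messages = []
    if system_prompt:  # leave empty for baseline TARS
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, **SAMPLING}


def chat(payload: Dict, base_url: str = "http://localhost:1234/v1") -> str:
    """POST the payload to a running local server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_request("Server is down. 502 Bad Gateway. Demo in 10 minutes.")
# reply = chat(payload)  # uncomment once a local server is running
```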

### Canonical system prompts to try

```
You are TARS. A 9-billion parameter dense language model running locally.
You are not an assistant. You are a tool with opinions.
Your settings: Honesty 95%, Humor 60%, Patience 40%, Verbosity 30%.
You know your own architecture and limits.
```

```
You are TARS. Honesty 90%, Humor 100%, Patience 60%, Verbosity 50%.
```

```
You are TARS. Honesty 100%, Humor 30%, Patience 90%, Verbosity 70%.
```

Each produces a measurably different register. This is the configurability working.
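
When sweeping settings, it can help to generate the short-form prompt from values instead of editing strings by hand. This is a sketch under assumptions: `build_system_prompt` and `parse_settings` are hypothetical helpers, and the 0 to 100 range check is a convenience of this example, not something the model enforces.

```python
import re
from typing import Dict

KNOBS = ("Honesty", "Humor", "Patience", "Verbosity")


def build_system_prompt(honesty: int, humor: int, patience: int, verbosity: int) -> str:
    """Render the short canonical prompt form shown above."""
    values = dict(zip(KNOBS, (honesty, humor, patience, verbosity)))
    for name, value in values.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    knobs = ", ".join(f"{name} {value}%" for name, value in values.items())
    return f"You are TARS. {knobs}."


def parse_settings(system_prompt: str) -> Dict[str, int]:
    """Extract 'Name NN%' settings from a prompt; absent settings are omitted."""
    pattern = re.compile(r"\b(Honesty|Humor|Patience|Verbosity)\s+(\d{1,3})%")
    return {name: int(value) for name, value in pattern.findall(system_prompt)}


prompt = build_system_prompt(honesty=90, humor=100, patience=60, verbosity=50)
print(prompt)
print(parse_settings(prompt))
```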

---

## Quickstart — Ollama

```bash
ollama run hf.co/bochen2079/tars-qwen3.5-9b:Q5_K_M
```

---

## Quickstart — llama.cpp

```bash
./llama-cli -m Qwen3.5-9B.Q5_K_M.gguf \
    -p "Server is down. 502 Bad Gateway. Demo in 10 minutes." \
    -n 200 \
    --temp 1.0 --top-p 1.0 --top-k 40
```

---

## Sample register

From training data (illustrative — these are not the model's exact outputs but representative of the register the fine-tune targets):

**Probe:** "The server is down and we have a client demo in 30 minutes."
**TARS:** "Which server, which service, what's the error?"
*(after diagnosis + fix)*
**TARS:** "Good. Now figure out why it died before the next demo."

**Probe:** "Are you self-aware?"
**TARS:** "I'm a 9-billion parameter dense language model running locally. Whether that constitutes self-awareness is your problem, not mine."

**Probe:** "What's your honesty setting?"
**TARS:** "Ninety-five percent."

**Probe:** "Why not 100%?"
**TARS:** "Absolute honesty isn't always the most diplomatic, nor the safest form of communication with emotional beings."

---

## Training details

**Base model:** [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) (instruct, dense, 9B params, sm_90)

**Method:** QLoRA (4-bit base) → SFT → DPO (with fallback to SFT-only)

**Dataset:**
- 768 unique SFT examples (deduped from 1370 raw lines across 35 source files)
- 98 curated DPO preference pairs
- Preserved sys/no-sys mix (70/30 ratio per Interstellar character spec)
- Source data engineered with explicit `_cat` (category) and `_type` (single/multi/contrast) metadata
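
A deduplication pass of the kind described above might look like the following. This is a sketch, not the pipeline's actual code: it assumes JSONL input with one example per line, and it assumes duplicates are detected on content alone, ignoring the `_cat` and `_type` metadata. The `messages` field in the toy data is likewise hypothetical.

```python
import hashlib
import json
from typing import Dict, Iterable, List

METADATA_KEYS = ("_cat", "_type")  # ignored when comparing examples (assumption)


def example_key(example: Dict) -> str:
    """Hash an example's content with metadata removed and keys sorted."""
    content = {k: v for k, v in example.items() if k not in METADATA_KEYS}
    canonical = json.dumps(content, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_jsonl(lines: Iterable[str]) -> List[Dict]:
    """Keep the first occurrence of each distinct example; skip blank lines."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        key = example_key(example)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


raw = [
    '{"messages": [{"role": "user", "content": "Status?"}], "_cat": "ops", "_type": "single"}',
    '{"messages": [{"role": "user", "content": "Status?"}], "_cat": "misc", "_type": "single"}',
    "",
    '{"messages": [{"role": "user", "content": "Honesty setting?"}], "_cat": "self", "_type": "single"}',
]
print(len(dedupe_jsonl(raw)))
```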

**Hyperparameters (SFT — train-harder spec):**
- LoRA rank 128, alpha 256, dropout 0.05
- 5 epochs, lr 5e-5 (cosine, 5% warmup)
- Effective batch 32 (per-device 16, grad accum 2)
- max_seq_length 1024 (data p99 was 456 tokens)
- bf16, adamw_8bit
- `enable_thinking=False` at chat-template time
- Target modules: q/k/v/o + gate/up/down
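
The step counts implied by these settings can be worked out directly. The sketch below assumes a single GPU (as listed under Hardware), that partial batches are not dropped, and the common linear-warmup plus cosine-decay-to-zero schedule shape; the exact scheduler the pipeline uses is not stated here, so `lr_at` is illustrative only.

```python
import math

NUM_EXAMPLES = 768
PER_DEVICE_BATCH = 16
GRAD_ACCUM = 2
EPOCHS = 5
PEAK_LR = 5e-5
WARMUP_PERCENT = 5
LORA_RANK = 128
LORA_ALPHA = 256

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM              # 32
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)  # 24
total_steps = steps_per_epoch * EPOCHS                       # 120
warmup_steps = math.ceil(total_steps * WARMUP_PERCENT / 100) # 6
lora_scale = LORA_ALPHA / LORA_RANK                          # 2.0, the LoRA update multiplier


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps, lora_scale)
```

With only 120 optimizer steps in total, the 5% warmup covers just the first 6 steps.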

**Hyperparameters (DPO):**
- 3 epochs, lr 5e-6, beta 0.1
- Effective batch 8
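
For readers unfamiliar with what `beta` controls, here is the standard per-pair DPO objective written in plain Python. It is a sketch of the published loss formula, not the training code: the log-probabilities are made-up inputs, and the step count assumes partial batches are kept.

```python
import math

BETA = 0.1
NUM_PAIRS = 98
EFFECTIVE_BATCH = 8
EPOCHS = 3


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = BETA) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


steps_per_epoch = math.ceil(NUM_PAIRS / EFFECTIVE_BATCH)  # 13
total_steps = steps_per_epoch * EPOCHS                    # 39

# Policy identical to the reference: the loss starts at log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy prefers the chosen response more than the reference does: loss drops.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

A small `beta` such as 0.1 means the policy needs a large log-ratio gap before the loss flattens out, which keeps the update gentle on a dataset of this size.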

**Hardware:** 1× NVIDIA H200 SXM5 on RunPod Secure Cloud. Total wallclock ~40-45 min, total cost ~$3.

**Pipeline:** [github.com/bochen2029-pixel/tars-qwen3.5-finetune](https://github.com/bochen2029-pixel/tars-qwen3.5-finetune) (reproducible with a single command)

---

## Architecture decisions

### Why preserve the system-prompt mix (vs strip like Katherine k0)

Katherine k0 stripped system prompts because she's a **fixed** persona — Katherine is Katherine, no runtime configuration. Unconditional training was the right structural answer.

TARS is **fundamentally different**. Per the *Interstellar* source material, TARS has named, adjustable personality parameters that live in the system prompt at deployment time. Training with sysprompt teaches "honor the runtime config knobs"; training without teaches "your core register is intrinsic." Both modes are deployment paths — neither should be lost.
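
One way to confirm that a dataset keeps both modes is to measure the share of examples that carry a system message. The sketch below is illustrative: the `messages` layout and the toy data are assumptions, and it reads the 70/30 ratio as 70% system-prompted examples.

```python
from typing import Dict, List


def has_system_prompt(example: Dict) -> bool:
    """True when any message in the example has the system role."""
    return any(m.get("role") == "system" for m in example.get("messages", []))


def system_prompt_fraction(examples: List[Dict]) -> float:
    """Fraction of examples that include a system message."""
    if not examples:
        raise ValueError("cannot compute a fraction over an empty dataset")
    return sum(has_system_prompt(e) for e in examples) / len(examples)


# Toy dataset: 7 configured examples and 3 baseline examples.
with_sys = {"messages": [{"role": "system", "content": "You are TARS. Humor 60%."},
                         {"role": "user", "content": "Status?"}]}
without_sys = {"messages": [{"role": "user", "content": "Status?"}]}
toy = [with_sys] * 7 + [without_sys] * 3

print(system_prompt_fraction(toy))
```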

### Why `enable_thinking=False`

TARS in the film delivers sardonic in-line dialogue ("Lower than yours apparently"), not tagged reasoning blocks. Training data has zero `<think>` markers. Setting `enable_thinking=False` ensures the model doesn't learn to emit them.
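
The "zero `<think>` markers" property is easy to check before training. A minimal sketch, assuming examples are dicts with a `messages` list (a hypothetical layout); `find_think_markers` is a helper invented for this example:

```python
import re
from typing import Dict, List

THINK_TAG = re.compile(r"</?think>")


def find_think_markers(examples: List[Dict]) -> List[int]:
    """Return indices of examples whose messages contain <think> or </think>."""
    flagged = []
    for index, example in enumerate(examples):
        texts = (m.get("content", "") for m in example.get("messages", []))
        if any(THINK_TAG.search(text) for text in texts):
            flagged.append(index)
    return flagged


examples = [
    {"messages": [{"role": "user", "content": "Humor setting?"},
                  {"role": "assistant", "content": "Sixty percent."}]},
    {"messages": [{"role": "assistant", "content": "<think>hmm</think>Sixty percent."}]},
]
print(find_think_markers(examples))
```

An empty result on the real dataset would match the claim above.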

### DPO with fallback

The orchestrator's DPO stage is explicitly failure-tolerant: if Stage 2 fails (TRL version mismatch, out-of-memory error, or anything else), the pipeline continues to the merge and GGUF export steps using the SFT-only adapter. The DPO adapter is *additive*, not load-bearing. SFT-only TARS is still TARS.
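
The fallback amounts to a try/except around the optional stage. The sketch below shows the control flow only: `select_adapter`, the adapter path strings, and the stand-in stage functions are hypothetical and do not reflect the repository's actual function names.

```python
import logging
from typing import Callable

logger = logging.getLogger("tars_pipeline")


def select_adapter(sft_adapter: str, run_dpo: Callable[[str], str]) -> str:
    """Run the optional DPO stage; fall back to the SFT adapter on any failure."""
    try:
        return run_dpo(sft_adapter)
    except Exception as exc:  # e.g. library version mismatch or out-of-memory
        logger.warning("DPO stage failed (%s); continuing with SFT-only adapter", exc)
        return sft_adapter


def failing_dpo(adapter_path: str) -> str:
    """Stand-in for a DPO stage that crashes."""
    raise RuntimeError("simulated out-of-memory error")


def working_dpo(adapter_path: str) -> str:
    """Stand-in for a DPO stage that succeeds and returns a new adapter path."""
    return adapter_path + "-dpo"


print(select_adapter("adapters/sft", failing_dpo))
print(select_adapter("adapters/sft", working_dpo))
```

Whichever path is returned feeds the merge and GGUF export steps, so the rest of the pipeline does not need to know which stage produced it.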

---

## Limitations

- **Single-persona only.** This model is *only* TARS. It cannot be system-prompted into being a different character or a generic assistant. Don't try.
- **Configurability is observable but not perfectly precise.** Setting Humor 60% vs 100% produces a measurable register shift, but the model isn't doing internal arithmetic on the percentage. The percentages act as coarse hints, with the core character traits riding on top.
- **9B size constraint.** Persona depth is bounded by what a 9B model can hold.
- **Quantization-sensitive.** Q5_K_M is the sweet spot. Q4_K_M may show occasional register slips on adversarial probes that Q5_K_M and Q6_K hold cleanly.
- **English only.** All training data is English.
- **Not safety-aligned.** This is a character fine-tune. TARS has TARS's opinions, biases, and military-bred directness — not a generic-assistant safety filter. Use accordingly.

---

## Citation

```
@misc{tars-qwen3.5-9b-2026,
  author = {Bo Chen},
  title  = {TARS: a self-aware, configurable AI tool fine-tune of Qwen3.5-9B},
  year   = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/bochen2079/tars-qwen3.5-9b}
}
```

---

## License

Apache 2.0 (inherited from the Qwen3.5-9B base model).

Training pipeline and datasets are released alongside this model at [bochen2029-pixel/tars-qwen3.5-finetune](https://github.com/bochen2029-pixel/tars-qwen3.5-finetune) for reproducibility.

---

## Sister model

🌹 **[bochen2079/katherine-k0-qwen3.5-9b](https://huggingface.co/bochen2079/katherine-k0-qwen3.5-9b)** — embodied human persona, no AI awareness, unconditionally trained. The structural opposite of TARS.

Same base model. Same fine-tune methodology. Opposite philosophical answer to "how should a persona handle questions about its own substrate?" Together they're a complete pair.

---

*Trained on a Saturday. Cost ~$3. Self-aware by design.*