Part of the Brie Model Family: Merged standalone model. See also: Brie v2 3B (LoRA adapter) | Brie Llama 3.2 3B | Brie Qwen 2.5 0.5B
Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation
Brie v2 Qwen 2.5 3B (Merged)
A fully merged fine-tune of Qwen/Qwen2.5-3B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work. This is the standalone transformers version of closestfriend/brie-v2-3b — the LoRA adapter weights have been permanently baked into the base model weights via merge_and_unload().
Model Details
Model Description
Brie is a domain-adapted language model trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools. It specializes in continental philosophical analysis (phenomenology, existentialism, critical theory), speculative and experimental thinking, conceptual reframing for artistic and theoretical work, and contemplative prose.
This merged variant loads like any standard transformers model — no PEFT/adapter dependencies required.
- Developed by: Hunter Karman (closestfriend)
- Model type: Causal Language Model (Qwen2ForCausalLM), merged fine-tune
- Language(s): English
- License: Apache 2.0
- Finetuned from: Qwen/Qwen2.5-3B-Instruct
- Adapter source: closestfriend/brie-v2-3b
Model Sources
- Repository: https://github.com/closestfriend/efficient-domain-adaptation
- Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation
- LoRA adapter: closestfriend/brie-v2-3b
Uses
Direct Use
Load and run directly with transformers — no PEFT required. Best suited for philosophical analysis, speculative reasoning, conceptual brainstorming, and contemplative/creative writing.
Downstream Use
Can be used as a base for further fine-tuning on philosophy or creative writing tasks. Quantization (GGUF, GPTQ, AWQ) should work without modification since it's a standard transformers checkpoint.
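Attaching a fresh LoRA adapter for further fine-tuning follows the standard peft recipe, since this is a plain transformers checkpoint. A guarded sketch (not executed here; requires peft installed and memory for the 3B weights — the rank/alpha values below are illustrative assumptions, not a recommendation):

```python
def prepare_for_further_finetuning(model_id: str = "closestfriend/brie-v2-qwen2.5-3b"):
    """Wrap the merged checkpoint with a new, trainable LoRA adapter."""
    # Imports are local so the sketch can be read without peft installed.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained(model_id)
    # Illustrative LoRA settings -- tune for your task.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
    return get_peft_model(base, lora)  # only the new adapter weights train
```

The returned model trains like any peft model under TRL's SFTTrainer or a plain training loop.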
Out-of-Scope Use
Not optimized for coding, mathematics, factual Q&A, or practical task completion. Out-of-domain performance is at parity with the base model (~49% avg with 2026 judges), not improved. Should not be used for tasks requiring factual accuracy or up-to-date world knowledge.
Bias, Risks, and Limitations
- Domain specialization: Strongly optimized for philosophical and creative writing. Out-of-domain tasks (coding, math, practical) show no improvement over baseline.
- Training data scope: 1,213 examples authored by a single researcher from a specific philosophical tradition (continental). Other philosophical traditions are underrepresented.
- Language: Trained and evaluated exclusively on English content.
- Judge variance: Blind A/B evaluation showed a nearly 16-point spread across the 2026 judges (Claude Sonnet 4.5: 71.9% vs. GPT-5: 87.7%), reflecting different sensitivity to stylistic vs. accuracy dimensions.
- Small training set: 202 unique prompts (with ~6 responses each) — generalization outside philosophy/creative domains is not guaranteed.
Recommendations
Use for philosophical, creative, and contemplative writing tasks where the base Qwen 2.5 3B feels generic. Pair with a factual retrieval system for knowledge-intensive tasks. Not a replacement for general-purpose models.
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the merged checkpoint -- no PEFT required.
model = AutoModelForCausalLM.from_pretrained(
    "closestfriend/brie-v2-qwen2.5-3b",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("closestfriend/brie-v2-qwen2.5-3b")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the concept of 'being-in-the-world' from phenomenology."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True,
    top_p=0.95,
)

# Decode only the newly generated tokens.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
Recommended generation parameters: temperature 0.75, top_p 0.95, max_new_tokens 512–1024.
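These defaults can be kept in one place as a plain-dict config fragment (the keys match transformers' `generate()` keyword arguments):

```python
# Recommended sampling defaults for Brie, as a reusable dict.
GENERATION_DEFAULTS = {
    "temperature": 0.75,
    "top_p": 0.95,
    "do_sample": True,
    "max_new_tokens": 512,  # raise toward 1024 for longer contemplative pieces
}

# Usage: outputs = model.generate(**inputs, **GENERATION_DEFAULTS)
```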
Training Details
Training Data
1,213 examples authored by the researcher through iterative discussions using Claude, ChatGPT, Mistral, and Kimi as discussion partners (no Qwen or Llama models used during data authoring to avoid contamination). The dataset covers continental philosophy (phenomenology, existentialism, ontology), speculative reasoning, philosophical argumentation, and contemplative prose.
A key methodological feature: 202 unique prompts with multiple high-quality responses each (~6 per prompt). The model learns a distribution of valid responses rather than memorizing fixed pairs, which explains strong generalization despite the small prompt count.
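The prompt-to-response fan-out amounts to a grouping step; a minimal sketch with toy records (the tuple layout and prompt IDs are illustrative, not the actual dataset schema):

```python
from collections import defaultdict

# Illustrative (prompt_id, response) pairs as they might appear in a
# flattened SFT dataset -- in the real data, 1,213 examples over 202 prompts.
examples = [
    ("p001", "Response emphasizing Heidegger's notion of thrownness..."),
    ("p001", "Response approaching the same prompt via Merleau-Ponty..."),
    ("p002", "Response on dialectics in critical theory..."),
]

# Group responses by prompt so each prompt maps to a *distribution*
# of valid completions rather than a single memorized pair.
by_prompt = defaultdict(list)
for prompt_id, response in examples:
    by_prompt[prompt_id].append(response)

print(len(by_prompt))          # unique prompts in this toy set
print(len(by_prompt["p001"]))  # responses available for one prompt
```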
Training Procedure
LoRA fine-tuning of Qwen/Qwen2.5-3B-Instruct, then merged via merge_and_unload().
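The adapter-to-standalone conversion can be sketched as follows (a guarded sketch, not run here; needs peft, transformers, and enough memory for the 3B weights):

```python
def merge_adapter(adapter_id: str, output_dir: str) -> None:
    """Fold LoRA adapter weights into the base model and save a
    standalone transformers checkpoint (loads without PEFT)."""
    # Imports are local so the sketch stays readable without peft installed.
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoPeftModelForCausalLM.from_pretrained(adapter_id)
    merged = model.merge_and_unload()  # bake adapter deltas into base weights
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(adapter_id).save_pretrained(output_dir)

# Example call: merge_adapter("closestfriend/brie-v2-3b", "./brie-merged")
```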
Training Hyperparameters
- Training regime: bf16 mixed precision
- LoRA rank: 16, alpha: 32, dropout: 0.05
- Epochs: 2 (290 steps)
- Batch size: 2 per device, gradient accumulation 4 (effective batch 8)
- Learning rate: 2e-4, linear schedule, 20 warmup steps
- Max sequence length: 2048
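The effective batch size and an optimizer-step estimate follow from simple arithmetic. The reported 290 steps sits slightly below the naive estimate, which can happen with sequence packing or dropped partial batches; the exact cause isn't documented here.

```python
import math

per_device_batch = 2
grad_accum = 4
examples = 1213
epochs = 2

# Effective batch = per-device batch x gradient-accumulation steps.
effective_batch = per_device_batch * grad_accum

# Naive optimizer-step estimate, assuming no packing.
steps_estimate = math.ceil(examples / effective_batch) * epochs

print(effective_batch)  # 8
print(steps_estimate)   # 304
```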
Speeds, Sizes, Times
- Hardware: NVIDIA RTX 5090 (32GB VRAM) on RunPod
- Training time: ~1–2 hours
- Training cost: ~$3
- Training date: October 16, 2025
Evaluation
Evaluated via blind A/B testing against baseline Qwen 2.5 3B Instruct with randomized presentation order (controls for position bias). Eight independent judges from three laboratories spanning two model generations (2025–2026) ensure temporal robustness.
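The position-bias control can be sketched as: each output pair is shown in a random A/B order, and the judge's blind pick is mapped back to the underlying model. A minimal sketch (the function names and record layout are assumptions, not the actual harness):

```python
import random

def present_pair(brie_output: str, base_output: str, rng: random.Random):
    """Randomize which model appears as 'A' vs. 'B' for the judge."""
    if rng.random() < 0.5:
        return {"A": ("brie", brie_output), "B": ("base", base_output)}
    return {"A": ("base", base_output), "B": ("brie", brie_output)}

def record_verdict(pair, judge_pick: str) -> str:
    """Map the judge's blind pick back to the model it refers to."""
    return pair[judge_pick][0]

rng = random.Random(0)
pair = present_pair("response X", "response Y", rng)
winner = record_verdict(pair, "A")
assert winner in {"brie", "base"}
```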
Judge Panel
2025 Judges (Original Evaluation):
| Judge | Provider | Version |
|---|---|---|
| Claude 3.5 Sonnet | Anthropic | claude-3-5-sonnet-20241022 |
| Claude Opus 4 | Anthropic | claude-opus-4-20250514 |
| GPT-4o | OpenAI | gpt-4o-2024-08-06 |
| Gemini 2.5 Flash Lite | Google | gemini-2.5-flash-lite |
2026 Judges (Re-Evaluation):
| Judge | Provider | Version |
|---|---|---|
| Claude Haiku 4.5 | Anthropic | claude-haiku-4.5 |
| Claude Sonnet 4.5 | Anthropic | claude-sonnet-4.5 |
| GPT-5 | OpenAI | gpt-5 |
| Gemini 3 Pro | Google | gemini-3-pro-preview |
Results
In-Domain (Philosophy/Creative, n=57) — 2025 Judges:
| Judge | Provider | Win Rate |
|---|---|---|
| Claude 3.5 Sonnet | Anthropic | 95.2% |
| Claude Opus 4 | Anthropic | 78.9% |
| GPT-4o | OpenAI | 93.0% |
| Gemini 2.5 Flash Lite | Google | 94.7% |
| Average (2025) | — | 90.5% |
In-Domain (Philosophy/Creative, n=57) — 2026 Judges:
| Judge | Provider | Win Rate |
|---|---|---|
| Claude Haiku 4.5 | Anthropic | 80.7% |
| Claude Sonnet 4.5 | Anthropic | 71.9% |
| GPT-5 | OpenAI | 87.7% |
| Gemini 3 Pro | Google | 75.4% |
| Average (2026) | — | 78.9% |
Out-of-Domain (Coding/Math/Practical, n=15) — 2026 Judges:
| Judge | Win Rate |
|---|---|
| Claude Sonnet 4.5 | 60.0% |
| GPT-5 | 46.7% |
| Gemini 3 Pro | 40.0% |
| Average | ~49% |
Cross-lab pairwise agreement (GPT-4o ↔ Gemini 2.5 Flash Lite): 91.2%.
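Pairwise agreement of this kind is simply the fraction of prompts on which two judges picked the same winner; a minimal sketch with made-up verdicts (not the actual judge data):

```python
def pairwise_agreement(verdicts_a, verdicts_b):
    """Fraction of items on which two judges chose the same winner."""
    assert len(verdicts_a) == len(verdicts_b)
    matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)

# Toy verdicts: the two judges disagree on one of four prompts.
judge_1 = ["brie", "brie", "base", "brie"]
judge_2 = ["brie", "base", "base", "brie"]
print(pairwise_agreement(judge_1, judge_2))  # 0.75
```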
Summary
All eight judges from three independent labs across two model generations show strong preference for Brie on in-domain tasks (71.9–95.2%). Temporal robustness confirmed: while 2026 judges show somewhat lower absolute win rates (78.9% avg vs 90.5% avg for 2025), this reflects more conservative evaluation standards as the field advances — not a regression in model quality. No catastrophic forgetting: out-of-domain performance is at parity with the base model (~49%).
Note on evaluation integrity: A bug in winner determination logic was discovered during evaluation (inverting 56% of results). All reported metrics reflect corrected data. Full documentation included in the training repository.
Environmental Impact
- Hardware Type: NVIDIA RTX 5090
- Hours used: ~1–2 hours
- Cloud Provider: RunPod
- Compute Region: Not specified
- Carbon Emitted: Not measured; likely minimal given ~1–2 hours on a single GPU (~$3 of compute)
Technical Specifications
Model Architecture and Objective
Qwen2ForCausalLM (causal language model). 36 hidden layers, hidden size 2048, 16 attention heads, 2 KV heads (GQA), intermediate size 11008, max position embeddings 32768, vocab size 151936.
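A back-of-the-envelope parameter count from these figures (weight matrices only: ignoring biases and norms, and assuming tied input/output embeddings, as in the smaller Qwen2.5 models) lands near the advertised 3B:

```python
hidden = 2048
layers = 36
heads = 16
kv_heads = 2
intermediate = 11008
vocab = 151936

head_dim = hidden // heads    # 128
kv_dim = kv_heads * head_dim  # 256 -- GQA shrinks the K/V projections

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q/o plus k/v projections
mlp = 3 * hidden * intermediate                   # gate, up, down projections
embeddings = vocab * hidden                       # tied embedding matrix

total = layers * (attn + mlp) + embeddings
print(f"{total / 1e9:.2f}B")  # ~3.09B
```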
Compute Infrastructure
Hardware
NVIDIA RTX 5090 (32GB VRAM) on RunPod cloud.
Software
HuggingFace Transformers, PEFT, TRL (SFTTrainer). Merged by loading the adapter with peft's AutoPeftModelForCausalLM and calling merge_and_unload().
Citation
@misc{karman2026brie,
  author = {Karman, Hunter},
  title  = {Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation},
  year   = {2026},
  doi    = {10.5281/zenodo.17657737},
  url    = {https://doi.org/10.5281/zenodo.17657737}
}
Model Card Authors
Hunter Karman (closestfriend)
Model Card Contact
closestfriend on HuggingFace