Part of the Brie Model Family: Merged standalone model. See also: Brie v2 3B (LoRA adapter) | Brie Llama 3.2 3B | Brie Qwen 2.5 0.5B

Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation

Brie v2 Qwen 2.5 3B (Merged)

A fully merged fine-tune of Qwen/Qwen2.5-3B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work. This is the standalone transformers version of closestfriend/brie-v2-3b — the LoRA adapter weights have been permanently baked into the base model weights via merge_and_unload().

Model Details

Model Description

Brie is a domain-adapted language model trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools. It specializes in continental philosophical analysis (phenomenology, existentialism, critical theory), speculative and experimental thinking, conceptual reframing for artistic and theoretical work, and contemplative prose.

This merged variant loads like any standard transformers model — no PEFT/adapter dependencies required.

  • Developed by: Hunter Karman (closestfriend)
  • Model type: Causal Language Model (Qwen2ForCausalLM), merged fine-tune
  • Language(s): English
  • License: Apache 2.0
  • Finetuned from: Qwen/Qwen2.5-3B-Instruct
  • Adapter source: closestfriend/brie-v2-3b

Uses

Direct Use

Load and run directly with transformers — no PEFT required. Best suited for philosophical analysis, speculative reasoning, conceptual brainstorming, and contemplative/creative writing.

Downstream Use

Can be used as a base for further fine-tuning on philosophy or creative writing tasks. Quantization (GGUF, GPTQ, AWQ) should work without modification since it's a standard transformers checkpoint.
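As one illustrative quantization route, the merged checkpoint can be loaded in 4-bit via bitsandbytes. This is a sketch with illustrative defaults, not part of the release; GGUF/GPTQ/AWQ converters likewise accept standard transformers checkpoints.

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch: a 4-bit load configuration (illustrative defaults,
# not settings from this model card).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
# Usage (not run here):
# AutoModelForCausalLM.from_pretrained(
#     "closestfriend/brie-v2-qwen2.5-3b", quantization_config=bnb_config)
```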

Out-of-Scope Use

Not optimized for coding, mathematics, factual Q&A, or practical task completion. Out-of-domain performance is at parity with the base model (~49% avg with 2026 judges), not improved. Should not be used for tasks requiring factual accuracy or up-to-date world knowledge.

Bias, Risks, and Limitations

  1. Domain specialization: Strongly optimized for philosophical and creative writing. Out-of-domain tasks (coding, math, practical) show no improvement over baseline.
  2. Training data scope: 1,213 examples authored by a single researcher from a specific philosophical tradition (continental). Other philosophical traditions are underrepresented.
  3. Language: Trained and evaluated exclusively on English content.
  4. Judge variance: Blind A/B evaluation showed up to a 23-point spread across judges (from 71.9% for Claude Sonnet 4.5 in 2026 to 95.2% for Claude 3.5 Sonnet in 2025), reflecting differing judge sensitivity to stylistic vs. accuracy dimensions.
  5. Small training set: 202 unique prompts (with ~6 responses each) — generalization outside philosophy/creative domains is not guaranteed.

Recommendations

Use for philosophical, creative, and contemplative writing tasks where the base Qwen 2.5 3B feels generic. Pair with a factual retrieval system for knowledge-intensive tasks. Not a replacement for general-purpose models.

How to Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "closestfriend/brie-v2-qwen2.5-3b",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("closestfriend/brie-v2-qwen2.5-3b")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the concept of 'being-in-the-world' from phenomenology."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True,
    top_p=0.95,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Recommended generation parameters: temperature 0.75, top_p 0.95, max_new_tokens 512–1024.
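The recommended sampling settings can also be bundled into a GenerationConfig and passed to model.generate via generation_config=... (a sketch; the keyword-argument form above works equally well).

```python
from transformers import GenerationConfig

# The recommended sampling parameters from this card, as a config object.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.75,
    top_p=0.95,
    max_new_tokens=512,
)
```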

Training Details

Training Data

1,213 examples authored by the researcher through iterative discussions using Claude, ChatGPT, Mistral, and Kimi as discussion partners (no Qwen or Llama models used during data authoring to avoid contamination). The dataset covers continental philosophy (phenomenology, existentialism, ontology), speculative reasoning, philosophical argumentation, and contemplative prose.

A key methodological feature: 202 unique prompts with multiple high-quality responses each (~6 per prompt). The model learns a distribution of valid responses rather than memorizing fixed pairs, which explains strong generalization despite the small prompt count.
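The shape of this dataset can be sketched with toy data (hypothetical prompt/response strings; the real dataset has 202 prompts and 1,213 examples, so the per-prompt count is approximately, not exactly, 6):

```python
# Toy illustration: few unique prompts, several independently authored
# responses each, flattened into one SFT example per (prompt, response) pair.
prompts = {f"prompt-{i}": [f"response-{i}-{j}" for j in range(6)]
           for i in range(202)}
examples = [{"prompt": p, "response": r}
            for p, responses in prompts.items()
            for r in responses]
len(examples)  # 202 prompts x 6 responses = 1212, close to the 1,213 reported
```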

Training Procedure

LoRA fine-tuning of Qwen/Qwen2.5-3B-Instruct, then merged via merge_and_unload().

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • LoRA rank: 16, alpha: 32, dropout: 0.05
  • Epochs: 2 (290 steps)
  • Batch size: 2 per device, gradient accumulation 4 (effective batch 8)
  • Learning rate: 2e-4, linear schedule, 20 warmup steps
  • Max sequence length: 2048
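The LoRA hyperparameters above can be expressed as a PEFT LoraConfig. This is a sketch: target_modules is an assumption (not stated in this card); the attention projections shown are a common choice for Qwen2-style models.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # LoRA rank, as listed above
    lora_alpha=32,        # alpha, as listed above
    lora_dropout=0.05,    # dropout, as listed above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # ASSUMPTION
    task_type="CAUSAL_LM",
)
```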

Speeds, Sizes, Times

  • Hardware: NVIDIA RTX 5090 (32GB VRAM) on RunPod
  • Training time: ~1–2 hours
  • Training cost: ~$3
  • Training date: October 16, 2025

Evaluation

Evaluated via blind A/B testing against the baseline Qwen 2.5 3B Instruct, with presentation order randomized to control for position bias. Eight independent judges from three laboratories, spanning two model generations (2025–2026), were used to assess temporal robustness.
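A minimal sketch of the randomized-order setup, with a hypothetical toy judge (the actual evaluation used LLM judges):

```python
import random

def blind_ab_trial(model_response, base_response, judge):
    """Show the judge two anonymized responses in random order,
    so which system produced which cannot be inferred from position."""
    pair = [("model", model_response), ("base", base_response)]
    random.shuffle(pair)
    winner = judge(pair[0][1], pair[1][1])  # judge returns index 0 or 1
    return pair[winner][0]

# Toy stand-in judge that always prefers the longer response.
def prefers_longer(a, b):
    return 0 if len(a) >= len(b) else 1

# Because the judge decides on content, the random ordering does not
# change the outcome; it only removes positional cues.
wins = sum(
    blind_ab_trial("a long, detailed answer", "short", prefers_longer) == "model"
    for _ in range(100)
)
```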

Judge Panel

2025 Judges (Original Evaluation):

┌───────────────────────┬───────────┬────────────────────────────┐
│         Judge         │ Provider  │ Version                    │
├───────────────────────┼───────────┼────────────────────────────┤
│ Claude 3.5 Sonnet     │ Anthropic │ claude-3-5-sonnet-20241022 │
├───────────────────────┼───────────┼────────────────────────────┤
│ Claude Opus 4         │ Anthropic │ claude-opus-4-20250514     │
├───────────────────────┼───────────┼────────────────────────────┤
│ GPT-4o                │ OpenAI    │ gpt-4o-2024-08-06          │
├───────────────────────┼───────────┼────────────────────────────┤
│ Gemini 2.5 Flash Lite │ Google    │ gemini-2.5-flash-lite      │
└───────────────────────┴───────────┴────────────────────────────┘

2026 Judges (Re-Evaluation):

┌───────────────────┬───────────┬──────────────────────┐
│       Judge       │ Provider  │ Version              │
├───────────────────┼───────────┼──────────────────────┤
│ Claude Haiku 4.5  │ Anthropic │ claude-haiku-4.5     │
├───────────────────┼───────────┼──────────────────────┤
│ Claude Sonnet 4.5 │ Anthropic │ claude-sonnet-4.5    │
├───────────────────┼───────────┼──────────────────────┤
│ GPT-5             │ OpenAI    │ gpt-5                │
├───────────────────┼───────────┼──────────────────────┤
│ Gemini 3 Pro      │ Google    │ gemini-3-pro-preview │
└───────────────────┴───────────┴──────────────────────┘

Results

In-Domain (Philosophy/Creative, n=57) — 2025 Judges:

┌───────────────────────┬───────────┬──────────┐
│         Judge         │ Provider  │ Win Rate │
├───────────────────────┼───────────┼──────────┤
│ Claude 3.5 Sonnet     │ Anthropic │ 95.2%    │
├───────────────────────┼───────────┼──────────┤
│ Claude Opus 4         │ Anthropic │ 78.9%    │
├───────────────────────┼───────────┼──────────┤
│ GPT-4o                │ OpenAI    │ 93.0%    │
├───────────────────────┼───────────┼──────────┤
│ Gemini 2.5 Flash Lite │ Google    │ 94.7%    │
├───────────────────────┼───────────┼──────────┤
│ Aggregate (2025)      │ —         │ 91.2%    │
└───────────────────────┴───────────┴──────────┘

In-Domain (Philosophy/Creative, n=57) — 2026 Judges:

┌───────────────────┬───────────┬──────────┐
│       Judge       │ Provider  │ Win Rate │
├───────────────────┼───────────┼──────────┤
│ Claude Haiku 4.5  │ Anthropic │ 80.7%    │
├───────────────────┼───────────┼──────────┤
│ Claude Sonnet 4.5 │ Anthropic │ 71.9%    │
├───────────────────┼───────────┼──────────┤
│ GPT-5             │ OpenAI    │ 87.7%    │
├───────────────────┼───────────┼──────────┤
│ Gemini 3 Pro      │ Google    │ 75.4%    │
├───────────────────┼───────────┼──────────┤
│ Average (2026)    │ —         │ 78.9%    │
└───────────────────┴───────────┴──────────┘

Out-of-Domain (Coding/Math/Practical, n=15) — 2026 Judges:

┌───────────────────┬──────────┐
│       Judge       │ Win Rate │
├───────────────────┼──────────┤
│ Claude Sonnet 4.5 │ 60.0%    │
├───────────────────┼──────────┤
│ GPT-5             │ 46.7%    │
├───────────────────┼──────────┤
│ Gemini 3 Pro      │ 40.0%    │
├───────────────────┼──────────┤
│ Average           │ ~49%     │
└───────────────────┴──────────┘

Cross-lab pairwise agreement (GPT-4o ↔ Gemini 2.5 Flash Lite): 91.2%.

Summary

All eight judges from three independent labs across two model generations show a strong preference for Brie on in-domain tasks (71.9–95.2%). Temporal robustness holds: 2026 judges report somewhat lower absolute win rates (78.9% avg vs. 90.5% avg for 2025), which likely reflects more conservative evaluation standards as the field advances rather than a regression in model quality. No catastrophic forgetting was observed: out-of-domain performance is at parity with the base model (~49%).
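The per-panel averages quoted above can be reproduced from the per-judge win rates in the tables:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Per-judge win rates, copied from the results tables above.
in_domain_2025 = [95.2, 78.9, 93.0, 94.7]  # 3.5 Sonnet, Opus 4, GPT-4o, Flash Lite
in_domain_2026 = [80.7, 71.9, 87.7, 75.4]  # Haiku 4.5, Sonnet 4.5, GPT-5, Gemini 3 Pro
out_of_domain  = [60.0, 46.7, 40.0]        # Sonnet 4.5, GPT-5, Gemini 3 Pro

mean(in_domain_2025)  # ≈ 90.5 (mean of per-judge rates, as quoted in the summary)
mean(in_domain_2026)  # ≈ 78.9
mean(out_of_domain)   # ≈ 48.9, i.e. ~49%: parity with baseline
```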

Note on evaluation integrity: A bug in winner determination logic was discovered during evaluation (inverting 56% of results). All reported metrics reflect corrected data. Full documentation included in the training repository.

Environmental Impact

  • Hardware Type: NVIDIA RTX 5090
  • Hours used: ~1–2 hours
  • Cloud Provider: RunPod
  • Compute Region: Not specified
  • Carbon Emitted: Not measured; expected to be minimal (~1–2 hours on a single GPU)

Technical Specifications

Model Architecture and Objective

Qwen2ForCausalLM (causal language model). 36 hidden layers, hidden size 2048, 16 attention heads, 2 KV heads (GQA), intermediate size 11008, max position embeddings 32768, vocab size 151936.
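Two per-head figures follow directly from the architecture numbers above (a quick sanity check, not additional configuration):

```python
# Architecture values from this card.
hidden_size = 2048
num_attention_heads = 16
num_kv_heads = 2

head_dim = hidden_size // num_attention_heads    # 128 per attention head
gqa_group = num_attention_heads // num_kv_heads  # 8 query heads share each KV head
```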

Compute Infrastructure

Hardware

NVIDIA RTX 5090 (32GB VRAM) on RunPod cloud.

Software

HuggingFace Transformers, PEFT, TRL (SFTTrainer). Merged with peft.AutoPeftModelForCausalLM.merge_and_unload().

Citation

@misc{karman2026brie,
  author    = {Karman, Hunter},
  title     = {Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation},
  year      = {2026},
  doi       = {10.5281/zenodo.17657737},
  url       = {https://doi.org/10.5281/zenodo.17657737}
}

Model Card Authors

Hunter Karman (closestfriend)

Model Card Contact

closestfriend on HuggingFace

