Part of the Brie Model Family: Merged standalone model. See also: Brie v2 3B (LoRA adapter) | Brie Llama 3.2 3B | Brie Qwen 2.5 0.5B

Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation

Brie v2 Qwen 2.5 3B (Merged)

A fully merged fine-tune of Qwen/Qwen2.5-3B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work. This is the standalone transformers version of closestfriend/brie-v2-3b — the LoRA adapter weights have been permanently baked into the base model weights via merge_and_unload().

Model Details

Model Description

Brie is a domain-adapted language model trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools. It specializes in continental philosophical analysis (phenomenology, existentialism, critical theory), speculative and experimental thinking, conceptual reframing for artistic and theoretical work, and contemplative prose.

This merged variant loads like any standard transformers model — no PEFT/adapter dependencies required.

  • Developed by: Hunter Karman (closestfriend)
  • Model type: Causal Language Model (Qwen2ForCausalLM), merged fine-tune
  • Language(s): English
  • License: Apache 2.0
  • Finetuned from: Qwen/Qwen2.5-3B-Instruct
  • Adapter source: closestfriend/brie-v2-3b

Uses

Direct Use

Load and run directly with transformers — no PEFT required. Best suited for philosophical analysis, speculative reasoning, conceptual brainstorming, and contemplative/creative writing.

Downstream Use

Can be used as a base for further fine-tuning on philosophy or creative writing tasks. Quantization (GGUF, GPTQ, AWQ) should work without modification since it's a standard transformers checkpoint.
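As one illustrative quantization route, the merged checkpoint can be loaded in 4-bit via bitsandbytes. This is a sketch with illustrative defaults, not part of the release; GGUF/GPTQ/AWQ converters likewise accept standard transformers checkpoints.

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch: a 4-bit load configuration (illustrative defaults,
# not settings from this model card).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
# Usage (not run here):
# AutoModelForCausalLM.from_pretrained(
#     "closestfriend/brie-v2-qwen2.5-3b", quantization_config=bnb_config)
```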

Out-of-Scope Use

Not optimized for coding, mathematics, factual Q&A, or practical task completion. Out-of-domain performance is at parity with the base model (~49% avg with 2026 judges), not improved. Should not be used for tasks requiring factual accuracy or up-to-date world knowledge.

Bias, Risks, and Limitations

  1. Domain specialization: Strongly optimized for philosophical and creative writing. Out-of-domain tasks (coding, math, practical) show no improvement over baseline.
  2. Training data scope: 1,213 examples authored by a single researcher from a specific philosophical tradition (continental). Other philosophical traditions are underrepresented.
  3. Language: Trained and evaluated exclusively on English content.
  4. Judge variance: Blind A/B evaluation showed up to a 23-point spread across judges (from 71.9% for Claude Sonnet 4.5 in 2026 to 95.2% for Claude 3.5 Sonnet in 2025), reflecting differing judge sensitivity to stylistic vs. accuracy dimensions.
  5. Small training set: 202 unique prompts (with ~6 responses each) — generalization outside philosophy/creative domains is not guaranteed.

Recommendations

Use for philosophical, creative, and contemplative writing tasks where the base Qwen 2.5 3B feels generic. Pair with a factual retrieval system for knowledge-intensive tasks. Not a replacement for general-purpose models.

How to Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "closestfriend/brie-v2-qwen2.5-3b",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("closestfriend/brie-v2-qwen2.5-3b")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the concept of 'being-in-the-world' from phenomenology."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True,
    top_p=0.95,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Recommended generation parameters: temperature 0.75, top_p 0.95, max_new_tokens 512–1024.
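The recommended sampling settings can also be bundled into a GenerationConfig and passed to model.generate via generation_config=... (a sketch; the keyword-argument form above works equally well).

```python
from transformers import GenerationConfig

# The recommended sampling parameters from this card, as a config object.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.75,
    top_p=0.95,
    max_new_tokens=512,
)
```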

Training Details

Training Data

1,213 examples authored by the researcher through iterative discussions using Claude, ChatGPT, Mistral, and Kimi as discussion partners (no Qwen or Llama models used during data authoring to avoid contamination). The dataset covers continental philosophy (phenomenology, existentialism, ontology), speculative reasoning, philosophical argumentation, and contemplative prose.

A key methodological feature: 202 unique prompts with multiple high-quality responses each (~6 per prompt). The model learns a distribution of valid responses rather than memorizing fixed pairs, which explains strong generalization despite the small prompt count.
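The shape of this dataset can be sketched with toy data (hypothetical prompt/response strings; the real dataset has 202 prompts and 1,213 examples, so the per-prompt count is approximately, not exactly, 6):

```python
# Toy illustration: few unique prompts, several independently authored
# responses each, flattened into one SFT example per (prompt, response) pair.
prompts = {f"prompt-{i}": [f"response-{i}-{j}" for j in range(6)]
           for i in range(202)}
examples = [{"prompt": p, "response": r}
            for p, responses in prompts.items()
            for r in responses]
len(examples)  # 202 prompts x 6 responses = 1212, close to the 1,213 reported
```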

Training Procedure

LoRA fine-tuning of Qwen/Qwen2.5-3B-Instruct, then merged via merge_and_unload().

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • LoRA rank: 16, alpha: 32, dropout: 0.05
  • Epochs: 2 (290 steps)
  • Batch size: 2 per device, gradient accumulation 4 (effective batch 8)
  • Learning rate: 2e-4, linear schedule, 20 warmup steps
  • Max sequence length: 2048
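The LoRA hyperparameters above can be expressed as a PEFT LoraConfig. This is a sketch: target_modules is an assumption (not stated in this card); the attention projections shown are a common choice for Qwen2-style models.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # LoRA rank, as listed above
    lora_alpha=32,        # alpha, as listed above
    lora_dropout=0.05,    # dropout, as listed above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # ASSUMPTION
    task_type="CAUSAL_LM",
)
```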

Speeds, Sizes, Times

  • Hardware: NVIDIA RTX 5090 (32GB VRAM) on RunPod
  • Training time: ~1–2 hours
  • Training cost: ~$3
  • Training date: October 16, 2025

Evaluation

Evaluated via blind A/B testing against the baseline Qwen 2.5 3B Instruct, with presentation order randomized to control for position bias. Eight independent judges from three laboratories, spanning two model generations (2025–2026), were used to assess temporal robustness.
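A minimal sketch of the randomized-order setup, with a hypothetical toy judge (the actual evaluation used LLM judges):

```python
import random

def blind_ab_trial(model_response, base_response, judge):
    """Show the judge two anonymized responses in random order,
    so which system produced which cannot be inferred from position."""
    pair = [("model", model_response), ("base", base_response)]
    random.shuffle(pair)
    winner = judge(pair[0][1], pair[1][1])  # judge returns index 0 or 1
    return pair[winner][0]

# Toy stand-in judge that always prefers the longer response.
def prefers_longer(a, b):
    return 0 if len(a) >= len(b) else 1

# Because the judge decides on content, the random ordering does not
# change the outcome; it only removes positional cues.
wins = sum(
    blind_ab_trial("a long, detailed answer", "short", prefers_longer) == "model"
    for _ in range(100)
)
```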

Judge Panel

2025 Judges (Original Evaluation):

┌───────────────────────┬───────────┬────────────────────────────┐
│         Judge         │ Provider  │ Version                    │
├───────────────────────┼───────────┼────────────────────────────┤
│ Claude 3.5 Sonnet     │ Anthropic │ claude-3-5-sonnet-20241022 │
├───────────────────────┼───────────┼────────────────────────────┤
│ Claude Opus 4         │ Anthropic │ claude-opus-4-20250514     │
├───────────────────────┼───────────┼────────────────────────────┤
│ GPT-4o                │ OpenAI    │ gpt-4o-2024-08-06          │
├───────────────────────┼───────────┼────────────────────────────┤
│ Gemini 2.5 Flash Lite │ Google    │ gemini-2.5-flash-lite      │
└───────────────────────┴───────────┴────────────────────────────┘

2026 Judges (Re-Evaluation):

┌───────────────────┬───────────┬──────────────────────┐
│       Judge       │ Provider  │ Version              │
├───────────────────┼───────────┼──────────────────────┤
│ Claude Haiku 4.5  │ Anthropic │ claude-haiku-4.5     │
├───────────────────┼───────────┼──────────────────────┤
│ Claude Sonnet 4.5 │ Anthropic │ claude-sonnet-4.5    │
├───────────────────┼───────────┼──────────────────────┤
│ GPT-5             │ OpenAI    │ gpt-5                │
├───────────────────┼───────────┼──────────────────────┤
│ Gemini 3 Pro      │ Google    │ gemini-3-pro-preview │
└───────────────────┴───────────┴──────────────────────┘

Results

In-Domain (Philosophy/Creative, n=57) — 2025 Judges:

┌───────────────────────┬───────────┬──────────┐
│         Judge         │ Provider  │ Win Rate │
├───────────────────────┼───────────┼──────────┤
│ Claude 3.5 Sonnet     │ Anthropic │ 95.2%    │
├───────────────────────┼───────────┼──────────┤
│ Claude Opus 4         │ Anthropic │ 78.9%    │
├───────────────────────┼───────────┼──────────┤
│ GPT-4o                │ OpenAI    │ 93.0%    │
├───────────────────────┼───────────┼──────────┤
│ Gemini 2.5 Flash Lite │ Google    │ 94.7%    │
├───────────────────────┼───────────┼──────────┤
│ Aggregate (2025)      │ —         │ 91.2%    │
└───────────────────────┴───────────┴──────────┘

In-Domain (Philosophy/Creative, n=57) — 2026 Judges:

┌───────────────────┬───────────┬──────────┐
│       Judge       │ Provider  │ Win Rate │
├───────────────────┼───────────┼──────────┤
│ Claude Haiku 4.5  │ Anthropic │ 80.7%    │
├───────────────────┼───────────┼──────────┤
│ Claude Sonnet 4.5 │ Anthropic │ 71.9%    │
├───────────────────┼───────────┼──────────┤
│ GPT-5             │ OpenAI    │ 87.7%    │
├───────────────────┼───────────┼──────────┤
│ Gemini 3 Pro      │ Google    │ 75.4%    │
├───────────────────┼───────────┼──────────┤
│ Average (2026)    │ —         │ 78.9%    │
└───────────────────┴───────────┴──────────┘

Out-of-Domain (Coding/Math/Practical, n=15) — 2026 Judges:

┌───────────────────┬──────────┐
│       Judge       │ Win Rate │
├───────────────────┼──────────┤
│ Claude Sonnet 4.5 │ 60.0%    │
├───────────────────┼──────────┤
│ GPT-5             │ 46.7%    │
├───────────────────┼──────────┤
│ Gemini 3 Pro      │ 40.0%    │
├───────────────────┼──────────┤
│ Average           │ ~49%     │
└───────────────────┴──────────┘

Cross-lab pairwise agreement (GPT-4o ↔ Gemini 2.5 Flash Lite): 91.2%.

Summary

All eight judges from three independent labs across two model generations show a strong preference for Brie on in-domain tasks (71.9–95.2%). Temporal robustness holds: 2026 judges report somewhat lower absolute win rates (78.9% avg vs. 90.5% avg for 2025), which likely reflects more conservative evaluation standards as the field advances rather than a regression in model quality. No catastrophic forgetting was observed: out-of-domain performance is at parity with the base model (~49%).
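The per-panel averages quoted above can be reproduced from the per-judge win rates in the tables:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Per-judge win rates, copied from the results tables above.
in_domain_2025 = [95.2, 78.9, 93.0, 94.7]  # 3.5 Sonnet, Opus 4, GPT-4o, Flash Lite
in_domain_2026 = [80.7, 71.9, 87.7, 75.4]  # Haiku 4.5, Sonnet 4.5, GPT-5, Gemini 3 Pro
out_of_domain  = [60.0, 46.7, 40.0]        # Sonnet 4.5, GPT-5, Gemini 3 Pro

mean(in_domain_2025)  # ≈ 90.5 (mean of per-judge rates, as quoted in the summary)
mean(in_domain_2026)  # ≈ 78.9
mean(out_of_domain)   # ≈ 48.9, i.e. ~49%: parity with baseline
```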

Note on evaluation integrity: A bug in winner determination logic was discovered during evaluation (inverting 56% of results). All reported metrics reflect corrected data. Full documentation included in the training repository.

Environmental Impact

  • Hardware Type: NVIDIA RTX 5090
  • Hours used: ~1–2 hours
  • Cloud Provider: RunPod
  • Compute Region: Not specified
  • Carbon Emitted: Not measured; expected to be minimal (~1–2 hours on a single GPU)

Technical Specifications

Model Architecture and Objective

Qwen2ForCausalLM (causal language model). 36 hidden layers, hidden size 2048, 16 attention heads, 2 KV heads (GQA), intermediate size 11008, max position embeddings 32768, vocab size 151936.
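Two per-head figures follow directly from the architecture numbers above (a quick sanity check, not additional configuration):

```python
# Architecture values from this card.
hidden_size = 2048
num_attention_heads = 16
num_kv_heads = 2

head_dim = hidden_size // num_attention_heads    # 128 per attention head
gqa_group = num_attention_heads // num_kv_heads  # 8 query heads share each KV head
```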

Compute Infrastructure

Hardware

NVIDIA RTX 5090 (32GB VRAM) on RunPod cloud.

Software

HuggingFace Transformers, PEFT, TRL (SFTTrainer). Merged with peft.AutoPeftModelForCausalLM.merge_and_unload().

Citation

@misc{karman2026brie,
  author    = {Karman, Hunter},
  title     = {Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation},
  year      = {2026},
  doi       = {10.5281/zenodo.17657737},
  url       = {https://doi.org/10.5281/zenodo.17657737}
}

Model Card Authors

Hunter Karman (closestfriend)

Model Card Contact

closestfriend on HuggingFace

