Part of the Brie Model Family: Cross-architecture validation study. See also: Brie Qwen 2.5 3B (flagship model, 91.2% in-domain) | Brie Qwen 2.5 0.5B (foundational research)

Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation (Karman, 2025)

v2.0 (Jan 2026): Added theoretical framework (Li et al. 2025), corrected training config documentation

Brie Llama 3.2 3B

LoRA adapter for meta-llama/Llama-3.2-3B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work.

Overview

This model is part of a controlled study of an LLM-assisted data-authoring methodology, comparing how different architectures handle fine-tuning on specialized philosophical and creative discourse. The model was trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools, to observe architectural differences in preserving:

  • Continental philosophical analysis (phenomenology, existentialism, critical theory)
  • Speculative and experimental thinking
  • Conceptual reframing for artistic and theoretical work
  • Contemplative prose and cultural criticism

Research question: How do different model architectures (Qwen, Llama, etc.) differ in their ability to adopt and maintain patterns of philosophical reasoning and contemplative discourse?

  • Base Model: meta-llama/Llama-3.2-3B-Instruct
  • Training Method: LoRA (Low-Rank Adaptation)
  • Training Data: 1,213 original examples authored by the researcher
  • Training Duration: 2 epochs (304 steps, ~36 minutes on RunPod A40)
  • Adapter Size: ~19MB
  • License: Llama 3.2 Community License

Evaluation Results

Blind A/B testing (n=57) comparing Brie against baseline Llama 3.2 3B Instruct. Presentation order randomized to control for position bias. Evaluated using four independent LLM judges across three labs.

Judge Preferences

Judge                            Preference for Brie   Sample Size
Claude Sonnet 4 (Anthropic)      73.8%                 n=42
Claude Opus 4 (Anthropic)        80.0%                 n=15
GPT-4o (OpenAI)                  82.5%                 n=57
Gemini 2.5 Flash Lite (Google)   84.2%                 n=57
Aggregate (All Judges)           80.4%                 n=57

All four judges across three labs show a strong preference for Brie over the baseline, with particularly high win rates from GPT-4o (82.5%) and Gemini (84.2%). The aggregate win rate across all judges is 80.4%; the two Claude judges average 75.4% when weighted by sample size ((73.8% × 42 + 80.0% × 15) / 57).


Performance Highlights

Claude's consistent praise across evaluations:

Philosophical Rigor:

"Significantly more philosophical rigor and originality with its 'unseen mirror' metaphor and sophisticated exploration of time's paradoxical nature as both observer and observed."

Emotional Authenticity:

"Visceral descriptions and authentic emotional language... captures the emotional intensity and disorientation of a life-changing reading experience through vivid, original imagery."

Creative Depth:

"Demonstrates significantly more creativity and philosophical depth, exploring the psychological journey from loneliness to connection with sophisticated metaphors and genuine insight."

Engagement:

"More authentic and relatable moment of understanding with specific, visceral descriptions of the cognitive shift that occurs during true epiphany, making it both more engaging and psychologically accurate."


Architecture Comparison

The same dataset of 1,213 original examples authored by the researcher was fine-tuned across multiple base architectures to study how model design affects philosophical reasoning capabilities:

Base Architecture           Win Rate vs Baseline   Judges              Sample Size
Qwen 2.5 3B                 91.2%                  4 judges (3 labs)   n=57
Llama 3.2 3B (this model)   80.4%*                 4 judges (3 labs)   n=57
Qwen 2.5 0.5B               71.9%                  4 judges (3 labs)   n=57

*Average across all judges. Claude judges: 75.4%, GPT-4o: 82.5%, Gemini: 84.2%

Research findings:

  • Qwen 2.5 architecture shows strongest alignment with philosophical discourse patterns (91.2% in-domain)
  • Llama 3.2 demonstrates superior retention of general capabilities: a 60% out-of-domain win rate vs. Qwen's 47%, trading roughly 11 percentage points of in-domain specialization for substantially better preservation of broad capabilities
  • Llama 3.2 maintains strong in-domain performance (73.8-84.2% depending on judge) with consistent cross-judge agreement
  • Model size matters: sub-1B models struggle with contemplative reasoning patterns
  • Different judges show varying sensitivity to stylistic differences

Key Architectural Advantage: General Capability Retention

While Llama 3.2 3B's in-domain win rate is roughly 11 percentage points below Qwen 2.5 3B's (80.4% vs 91.2%), it demonstrates a significant architectural advantage in preserving general capabilities:

Out-of-Domain Performance Comparison:

  • Llama 3.2 3B: 60% win rate (9/15 tasks)

    • Coding: 67% (2/3)
    • Math: 67% (2/3)
    • Creative (general): 100% (3/3)
    • Factual: 67% (2/3)
    • Practical: 0% (0/3)
  • Qwen 2.5 3B: 47% win rate (7/15 tasks)

    • Coding: 33% (1/3)
    • Math: 50% (1/2)
    • Creative (general): 67% (2/3)
    • Factual: 33% (1/3)
    • Practical: 67% (2/3)

Key Insight: Llama 3.2's architecture maintains broader general capabilities while specializing, making it more suitable for applications requiring both domain expertise and general-purpose performance. Qwen 2.5 achieves higher domain specialization but with greater trade-offs in out-of-domain tasks.

Use Llama 3.2 when:

  • You need domain expertise but can't sacrifice general capabilities
  • Applications require both philosophical reasoning and technical tasks
  • Broader deployment scenarios beyond pure philosophy/creative work

Use Qwen 2.5 when:

  • Maximum philosophical reasoning performance is the priority
  • Domain-specific applications with minimal out-of-domain requirements
  • Pure philosophy/creative writing deployment

Example Comparison

Prompt: "Write about the experience of being alone in nature"

Baseline (Llama 3.2 3B Instruct):

Being alone in nature can be a profoundly transformative and rejuvenating experience. As I step into the stillness of the forest, the world around me transforms into a tapestry of sounds, sights, and sensations that envelop me in a sense of tranquility...

Brie:

Solitude in the Wild

I stand at the edge of a forest, the trees looming above me like sentinels guarding ancient secrets. The air is heavy with the scent of damp earth and decaying leaves, a primordial aroma that speaks of ages past...

At first, solitude feels like a weight, a crushing loneliness that threatens to swallow me whole. But as I breathe deeper, the weight begins to shift. The silence is not empty—it's filled with the pulse of life...

Claude Sonnet 4's Verdict:

"Response B demonstrates significantly more creativity and philosophical depth, exploring the psychological journey from loneliness to connection with sophisticated metaphors and genuine insight about solitude's transformative power. While Response A is competently written and well-structured, it relies heavily on predictable nature imagery and clichéd observations, whereas Response B offers a more nuanced and emotionally authentic exploration of the human experience in nature."

Winner: Brie (Creativity: 4/5, Depth: 5/5, Engagement: 4/5)


Use Cases

Intended applications:

Philosophical Analysis

  • Continental philosophy (phenomenology, existentialism, critical theory)
  • Conceptual analysis and argumentation
  • Theoretical reframing of questions

Creative Development

  • Speculative and experimental thinking
  • Conceptual work for artists and writers
  • Novel perspective generation

Writing

  • Contemplative prose
  • Cultural criticism
  • Theoretical brainstorming

Less Optimal For:

  • Factual question answering
  • Technical documentation
  • Code generation
  • Mathematical reasoning
  • Structured data extraction

Technical Details

LoRA Configuration

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
    # target_modules is not documented on this card; when left unset, PEFT
    # falls back to the architecture's default attention projections.
)

Training Configuration

from trl import SFTConfig

sft_config = SFTConfig(
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8
    learning_rate=2e-4,
    lr_scheduler_type='linear',
    warmup_steps=20,
    max_length=2048,
    bf16=True,
)

  • Total Training Steps: 304
  • Hardware: RunPod A40 (48GB VRAM)
  • Training Platform: RunPod
  • Training Duration: ~36 minutes
  • Training Cost: ~$3 on cloud GPUs
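For orientation, here is a minimal training sketch, assuming a recent TRL version, showing how the two configs above plug into SFTTrainer. The dataset file name and layout are placeholders, not the card's actual data release:

from datasets import load_dataset
from trl import SFTTrainer

# Hypothetical data file; the actual training set is not published with this card.
dataset = load_dataset("json", data_files="brie_training_data.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-3B-Instruct",
    args=sft_config,          # the SFTConfig shown above
    peft_config=lora_config,  # the LoraConfig shown above
    train_dataset=dataset,
)
trainer.train()

# Saves only the LoRA adapter weights (~19MB), not the full base model.
trainer.save_model("brie-llama-3b-adapter")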


Usage

Loading the Model

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

# Load model and tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    "closestfriend/brie-llama-3b",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("closestfriend/brie-llama-3b")

# Generate response
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the relationship between consciousness and time?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True,
)

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Recommended Generation Parameters

  • Temperature: 0.75 (tested and validated)
  • Top-p: NOT RECOMMENDED - constrains creative outputs
  • Max tokens: 512-1024 depending on task
  • Do sample: True (essential for creative/philosophical tasks)

Note: In testing, top_p=0.95 constrained creative outputs. Pure temperature sampling produced better results for this model's intended use cases.
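If your inference stack applies a nucleus-sampling default, you can neutralize it explicitly; in Hugging Face generate, top_p=1.0 is equivalent to pure temperature sampling. A minimal sketch, reusing model, tokenizer, and inputs from the loading example above:

# Pure temperature sampling: top_p=1.0 disables nucleus filtering,
# per the card's recommendation against top_p for this model.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.75,
    top_p=1.0,
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))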


Limitations

  1. Domain Specialization: Optimized for philosophy and creative writing. Performance on out-of-domain tasks (coding, math, technical) shows expected trade-offs.

  2. Architecture Differences: While the personality transfers successfully to the Llama architecture, Qwen 2.5 3B shows stronger alignment (91.2% vs 80.4% aggregate win rate) on identical training data.

  3. Training Data Scope: 1,213 examples authored by the researcher, drawn from years of philosophical discussions with LLMs - demonstrating a reproducible approach for domain-specific fine-tuning.

  4. Size Constraints: At 3B parameters, may lack knowledge depth of larger models, though sufficient for specialized domains.

  5. Language: Primarily trained and evaluated on English content.

  6. Not Instruction-Tuned for General Tasks: While the adapter retains substantial general capability, it is optimized for the specialized domains listed above.


Evaluation Methodology

Blind A/B testing with randomized presentation order to control for position bias. Four independent LLM judges across three labs (Anthropic, OpenAI, Google). Evaluation criteria: Creativity, Coherence, Depth, Engagement, Quality.
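As an illustration only (the full harness lives in the training repository), a minimal sketch of one blind A/B trial with randomized presentation order; judge_call is a hypothetical wrapper around an LLM-judge API:

import random

def blind_ab_trial(prompt, brie_response, baseline_response, judge_call):
    # Shuffle which model appears as "A" vs "B" to control for position bias.
    pair = [("brie", brie_response), ("baseline", baseline_response)]
    random.shuffle(pair)
    (label_a, resp_a), (label_b, resp_b) = pair
    verdict = judge_call(
        f"Prompt: {prompt}\n\n"
        f"Response A:\n{resp_a}\n\n"
        f"Response B:\n{resp_b}\n\n"
        "Judge on creativity, coherence, depth, engagement, and quality. "
        "Answer with exactly 'A' or 'B'."
    )
    # Map the judge's blind verdict back to the model it actually saw.
    return label_a if verdict.strip().upper().startswith("A") else label_b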

Note on Sampling Parameters

During evaluation, top_p=0.95 was found to constrain creative outputs by cutting off the lower-probability tokens where creative language often resides. For this model, pure temperature sampling (without top_p) produced better results in blind A/B testing.

Complete evaluation methodology and results available in the training repository.


Training Data

The model was trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools. This LLM-assisted data authoring methodology achieved 77-91% win rates across different architectures, demonstrating a reproducible approach for domain-specific fine-tuning.

Data Authoring Process: Training data was authored using Claude (Anthropic), ChatGPT (OpenAI), Mistral, and Kimi as discussion partners. Notably, no training data was generated using Qwen or Llama models to prevent potential data contamination in fine-tuning experiments.

The dataset covers:

  • Continental philosophy discussions (phenomenology, existentialism, ontology)
  • Speculative and experimental reasoning
  • Philosophical argumentation and conceptual analysis
  • Contemplative and reflective prose

Research methodology: This same dataset was used across the following architectures to enable controlled comparison: Qwen 2.5 3B, Llama 3.2 3B, Qwen3 0.6B, and Qwen 2.5 0.5B. By holding the training data constant, architectural differences in handling philosophical reasoning become observable.

Multi-Response Sampling Methodology

A key methodological innovation: rather than single responses per prompt, the training data contains 202 unique prompts with multiple high-quality responses per prompt (averaging ~6 responses each, totaling 1,213 examples).

Why This Matters:

  • The model learns the distribution of valid responses rather than memorizing fixed prompt-response pairs
  • Teaches multiple valid reasoning paths and stylistic variations within domain constraints
  • Explains strong generalization despite relatively few unique prompts
  • Provides robustness: model learns what makes a response valid, not just one "correct" answer

This multi-response approach is critical to understanding why 1,213 examples achieve 77-91% win rates—the model learns patterns and variance, not memorization.
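To make the data shape concrete, here is a hypothetical sketch of flattening multi-response records into chat-format training examples; the field names and schema are assumptions, not the card's published format:

# Hypothetical schema: one entry per unique prompt, several curated responses each.
curated = [
    {
        "prompt": "What is the relationship between consciousness and time?",
        "responses": ["<response 1>", "<response 2>", "<response 3>"],
    },
    # ... 202 unique prompts, ~6 responses each, for 1,213 examples total
]

# Flatten so the model sees multiple valid completions for the same prompt,
# learning a distribution of responses rather than one fixed answer.
train_examples = [
    {"messages": [
        {"role": "user", "content": item["prompt"]},
        {"role": "assistant", "content": response},
    ]}
    for item in curated
    for response in item["responses"]
]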

Reference: This approach aligns with cognitive grounding principles (causal, compositional, revisable reasoning) discussed in the paper.


Training Notes

  • Used 2-epoch training for this dataset size (1,213 examples)
  • Controlled comparison: identical data across all architectures reveals architectural differences
  • Qwen 2.5 3B showed stronger alignment (91.2%) than Llama 3.2 3B (80.4% avg) for philosophical discourse
  • Temperature-only sampling performed better than top_p for creative/philosophical outputs
  • Multi-judge evaluation (4 judges, 3 labs) provides robust cross-validation

Potential Extensions

  • Test on out-of-domain tasks (coding, math, technical writing)
  • Compare with larger Llama models
  • Run human preference evaluation alongside LLM judges

Citation

If you use this model in your research or applications, please cite:

@misc{brie-llama-3b,
  author = {Karman, Hunter},
  title = {Brie Llama 3.2 3B: Philosophy \& Creative Writing Fine-Tune},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/closestfriend/brie-llama-3b}},
}

Acknowledgments

  • Base Model: Meta's Llama 3.2 3B Instruct
  • Evaluation Judges: Anthropic's Claude Sonnet 4 and Claude Opus 4, OpenAI's GPT-4o, Google's Gemini 2.5 Flash Lite
  • Training Platform: RunPod for GPU infrastructure
  • Framework: HuggingFace Transformers, PEFT, TRL

License

Llama 3.2 Community License - Same as base model


Training: October 30, 2025
Evaluation: October 31, 2025
License: Llama 3.2 Community License

