Part of the Brie Model Family: Cross-architecture validation study. See also: Brie Qwen 2.5 3B (flagship model, 91.2% in-domain) | Brie Qwen 2.5 0.5B (foundational research)

Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation (Karman, 2025)

v2.0 (Jan 2026): Added theoretical framework (Li et al. 2025), corrected training config documentation

Brie Llama 3.2 3B

LoRA adapter for meta-llama/Llama-3.2-3B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work.

Overview

This model is part of a controlled study of an LLM-assisted data-authoring methodology, comparing how different architectures handle fine-tuning on specialized philosophical and creative discourse. The model was trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools, to observe architectural differences in preserving:

  • Continental philosophical analysis (phenomenology, existentialism, critical theory)
  • Speculative and experimental thinking
  • Conceptual reframing for artistic and theoretical work
  • Contemplative prose and cultural criticism

Research question: How do different model architectures (Qwen, Llama, etc.) differ in their ability to adopt and maintain patterns of philosophical reasoning and contemplative discourse?

  • Base Model: meta-llama/Llama-3.2-3B-Instruct
  • Training Method: LoRA (Low-Rank Adaptation)
  • Training Data: 1,213 original examples authored by the researcher
  • Training Duration: 2 epochs (304 steps, ~36 minutes on RunPod A40)
  • Adapter Size: ~19MB
  • License: Llama 3.2 Community License

Evaluation Results

Blind A/B testing (n=57) comparing Brie against baseline Llama 3.2 3B Instruct. Presentation order randomized to control for position bias. Evaluated using four independent LLM judges across three labs.

Judge Preferences

Judge                            Preference for Brie   Sample Size
Claude Sonnet 4 (Anthropic)      73.8%                 n=42
Claude Opus 4 (Anthropic)        80.0%                 n=15
GPT-4o (OpenAI)                  82.5%                 n=57
Gemini 2.5 Flash Lite (Google)   84.2%                 n=57
Aggregate (All Judges)           80.4%                 n=57

All four judges across three labs show a strong preference for Brie over the baseline, with particularly high win rates from GPT-4o (82.5%) and Gemini (84.2%). The aggregate win rate across all judges is 80.4%; the two Claude judges average 75.4% when weighted by sample size ((73.8% × 42 + 80.0% × 15) / 57).


Performance Highlights

Claude's consistent praise across evaluations:

Philosophical Rigor:

"Significantly more philosophical rigor and originality with its 'unseen mirror' metaphor and sophisticated exploration of time's paradoxical nature as both observer and observed."

Emotional Authenticity:

"Visceral descriptions and authentic emotional language... captures the emotional intensity and disorientation of a life-changing reading experience through vivid, original imagery."

Creative Depth:

"Demonstrates significantly more creativity and philosophical depth, exploring the psychological journey from loneliness to connection with sophisticated metaphors and genuine insight."

Engagement:

"More authentic and relatable moment of understanding with specific, visceral descriptions of the cognitive shift that occurs during true epiphany, making it both more engaging and psychologically accurate."


Architecture Comparison

The same dataset of 1,213 original examples authored by the researcher was fine-tuned across multiple base architectures to study how model design affects philosophical reasoning capabilities:

Base Architecture           Win Rate vs Baseline   Judges              Sample Size
Qwen 2.5 3B                 91.2%                  4 judges (3 labs)   n=57
Llama 3.2 3B (this model)   80.4%*                 4 judges (3 labs)   n=57
Qwen 2.5 0.5B               71.9%                  4 judges (3 labs)   n=57

*Average across all judges. Claude judges: 75.4%, GPT-4o: 82.5%, Gemini: 84.2%

Research findings:

  • Qwen 2.5 architecture shows strongest alignment with philosophical discourse patterns (91.2% in-domain)
  • Llama 3.2 demonstrates superior retention of general capabilities: a 60% out-of-domain win rate vs. Qwen's 47%, trading roughly 11 percentage points of in-domain specialization for substantially better preservation of broad capabilities
  • Llama 3.2 maintains strong in-domain performance (73.8-84.2% depending on judge) with consistent cross-judge agreement
  • Model size matters: sub-1B models struggle with contemplative reasoning patterns
  • Different judges show varying sensitivity to stylistic differences

Key Architectural Advantage: General Capability Retention

While Llama 3.2 3B's in-domain win rate is roughly 11 percentage points below Qwen 2.5 3B's (80.4% vs 91.2%), it demonstrates a significant architectural advantage in preserving general capabilities:

Out-of-Domain Performance Comparison:

  • Llama 3.2 3B: 60% win rate (9/15 tasks)

    • Coding: 67% (2/3)
    • Math: 67% (2/3)
    • Creative (general): 100% (3/3)
    • Factual: 67% (2/3)
    • Practical: 0% (0/3)
  • Qwen 2.5 3B: 47% win rate (7/15 tasks)

    • Coding: 33% (1/3)
    • Math: 50% (1/2)
    • Creative (general): 67% (2/3)
    • Factual: 33% (1/3)
    • Practical: 67% (2/3)

Key Insight: Llama 3.2's architecture maintains broader general capabilities while specializing, making it more suitable for applications requiring both domain expertise and general-purpose performance. Qwen 2.5 achieves higher domain specialization but with greater trade-offs in out-of-domain tasks.

Use Llama 3.2 when:

  • You need domain expertise but can't sacrifice general capabilities
  • Applications require both philosophical reasoning and technical tasks
  • Broader deployment scenarios beyond pure philosophy/creative work

Use Qwen 2.5 when:

  • Maximum philosophical reasoning performance is the priority
  • Domain-specific applications with minimal out-of-domain requirements
  • Pure philosophy/creative writing deployment

Example Comparison

Prompt: "Write about the experience of being alone in nature"

Baseline (Llama 3.2 3B Instruct):

Being alone in nature can be a profoundly transformative and rejuvenating experience. As I step into the stillness of the forest, the world around me transforms into a tapestry of sounds, sights, and sensations that envelop me in a sense of tranquility...

Brie:

Solitude in the Wild

I stand at the edge of a forest, the trees looming above me like sentinels guarding ancient secrets. The air is heavy with the scent of damp earth and decaying leaves, a primordial aroma that speaks of ages past...

At first, solitude feels like a weight, a crushing loneliness that threatens to swallow me whole. But as I breathe deeper, the weight begins to shift. The silence is not empty—it's filled with the pulse of life...

Claude Sonnet 4's Verdict:

"Response B demonstrates significantly more creativity and philosophical depth, exploring the psychological journey from loneliness to connection with sophisticated metaphors and genuine insight about solitude's transformative power. While Response A is competently written and well-structured, it relies heavily on predictable nature imagery and clichéd observations, whereas Response B offers a more nuanced and emotionally authentic exploration of the human experience in nature."

Winner: Brie (Creativity: 4/5, Depth: 5/5, Engagement: 4/5)


Use Cases

Intended applications:

Philosophical Analysis

  • Continental philosophy (phenomenology, existentialism, critical theory)
  • Conceptual analysis and argumentation
  • Theoretical reframing of questions

Creative Development

  • Speculative and experimental thinking
  • Conceptual work for artists and writers
  • Novel perspective generation

Writing

  • Contemplative prose
  • Cultural criticism
  • Theoretical brainstorming

Less Optimal For:

  • Factual question answering
  • Technical documentation
  • Code generation
  • Mathematical reasoning
  • Structured data extraction

Technical Details

LoRA Configuration

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
    # target_modules is not documented on this card; when left unset, PEFT
    # falls back to the architecture's default attention projections.
)

Training Configuration

from trl import SFTConfig

sft_config = SFTConfig(
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8
    learning_rate=2e-4,
    lr_scheduler_type='linear',
    warmup_steps=20,
    max_length=2048,
    bf16=True,
)

  • Total Training Steps: 304
  • Hardware: RunPod A40 (48GB VRAM)
  • Training Platform: RunPod
  • Training Duration: ~36 minutes
  • Training Cost: ~$3 on cloud GPUs
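For orientation, here is a minimal training sketch, assuming a recent TRL version, showing how the two configs above plug into SFTTrainer. The dataset file name and layout are placeholders, not the card's actual data release:

from datasets import load_dataset
from trl import SFTTrainer

# Hypothetical data file; the actual training set is not published with this card.
dataset = load_dataset("json", data_files="brie_training_data.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-3B-Instruct",
    args=sft_config,          # the SFTConfig shown above
    peft_config=lora_config,  # the LoraConfig shown above
    train_dataset=dataset,
)
trainer.train()

# Saves only the LoRA adapter weights (~19MB), not the full base model.
trainer.save_model("brie-llama-3b-adapter")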


Usage

Loading the Model

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

# Load model and tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    "closestfriend/brie-llama-3b",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("closestfriend/brie-llama-3b")

# Generate response
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the relationship between consciousness and time?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True,
)

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Recommended Generation Parameters

  • Temperature: 0.75 (tested and validated)
  • Top-p: NOT RECOMMENDED - constrains creative outputs
  • Max tokens: 512-1024 depending on task
  • Do sample: True (essential for creative/philosophical tasks)

Note: In testing, top_p=0.95 constrained creative outputs. Pure temperature sampling produced better results for this model's intended use cases.
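If your inference stack applies a nucleus-sampling default, you can neutralize it explicitly; in Hugging Face generate, top_p=1.0 is equivalent to pure temperature sampling. A minimal sketch, reusing model, tokenizer, and inputs from the loading example above:

# Pure temperature sampling: top_p=1.0 disables nucleus filtering,
# per the card's recommendation against top_p for this model.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.75,
    top_p=1.0,
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))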


Limitations

  1. Domain Specialization: Optimized for philosophy and creative writing. Performance on out-of-domain tasks (coding, math, technical) shows expected trade-offs.

  2. Architecture Differences: While the personality transfers successfully to the Llama architecture, Qwen 2.5 3B shows stronger alignment (91.2% vs 80.4% aggregate win rate) on identical training data.

  3. Training Data Scope: 1,213 examples authored by the researcher, drawn from years of philosophical discussions with LLMs - demonstrating a reproducible approach for domain-specific fine-tuning.

  4. Size Constraints: At 3B parameters, may lack knowledge depth of larger models, though sufficient for specialized domains.

  5. Language: Primarily trained and evaluated on English content.

  6. Not Instruction-Tuned for General Tasks: While the adapter retains substantial general capability, it is optimized for the specialized domains listed above.


Evaluation Methodology

Blind A/B testing with randomized presentation order to control for position bias. Four independent LLM judges across three labs (Anthropic, OpenAI, Google). Evaluation criteria: Creativity, Coherence, Depth, Engagement, Quality.
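As an illustration only (the full harness lives in the training repository), a minimal sketch of one blind A/B trial with randomized presentation order; judge_call is a hypothetical wrapper around an LLM-judge API:

import random

def blind_ab_trial(prompt, brie_response, baseline_response, judge_call):
    # Shuffle which model appears as "A" vs "B" to control for position bias.
    pair = [("brie", brie_response), ("baseline", baseline_response)]
    random.shuffle(pair)
    (label_a, resp_a), (label_b, resp_b) = pair
    verdict = judge_call(
        f"Prompt: {prompt}\n\n"
        f"Response A:\n{resp_a}\n\n"
        f"Response B:\n{resp_b}\n\n"
        "Judge on creativity, coherence, depth, engagement, and quality. "
        "Answer with exactly 'A' or 'B'."
    )
    # Map the judge's blind verdict back to the model it actually saw.
    return label_a if verdict.strip().upper().startswith("A") else label_b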

Note on Sampling Parameters

During evaluation, top_p=0.95 was found to constrain creative outputs by cutting off the lower-probability tokens where creative language often resides. For this model, pure temperature sampling (without top_p) produced better results in blind A/B testing.

Complete evaluation methodology and results available in the training repository.


Training Data

The model was trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools. This LLM-assisted data authoring methodology achieved 77-91% win rates across different architectures, demonstrating a reproducible approach for domain-specific fine-tuning.

Data Authoring Process: Training data was authored using Claude (Anthropic), ChatGPT (OpenAI), Mistral, and Kimi as discussion partners. Notably, no training data was generated using Qwen or Llama models to prevent potential data contamination in fine-tuning experiments.

The dataset covers:

  • Continental philosophy discussions (phenomenology, existentialism, ontology)
  • Speculative and experimental reasoning
  • Philosophical argumentation and conceptual analysis
  • Contemplative and reflective prose

Research methodology: This same dataset was used across the following architectures to enable controlled comparison: Qwen 2.5 3B, Llama 3.2 3B, Qwen3 0.6B, and Qwen 2.5 0.5B. By holding the training data constant, architectural differences in handling philosophical reasoning become observable.

Multi-Response Sampling Methodology

A key methodological innovation: rather than single responses per prompt, the training data contains 202 unique prompts with multiple high-quality responses per prompt (averaging ~6 responses each, totaling 1,213 examples).

Why This Matters:

  • The model learns the distribution of valid responses rather than memorizing fixed prompt-response pairs
  • Teaches multiple valid reasoning paths and stylistic variations within domain constraints
  • Explains strong generalization despite relatively few unique prompts
  • Provides robustness: model learns what makes a response valid, not just one "correct" answer

This multi-response approach is critical to understanding why 1,213 examples achieve 77-91% win rates—the model learns patterns and variance, not memorization.
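To make the data shape concrete, here is a hypothetical sketch of flattening multi-response records into chat-format training examples; the field names and schema are assumptions, not the card's published format:

# Hypothetical schema: one entry per unique prompt, several curated responses each.
curated = [
    {
        "prompt": "What is the relationship between consciousness and time?",
        "responses": ["<response 1>", "<response 2>", "<response 3>"],
    },
    # ... 202 unique prompts, ~6 responses each, for 1,213 examples total
]

# Flatten so the model sees multiple valid completions for the same prompt,
# learning a distribution of responses rather than one fixed answer.
train_examples = [
    {"messages": [
        {"role": "user", "content": item["prompt"]},
        {"role": "assistant", "content": response},
    ]}
    for item in curated
    for response in item["responses"]
]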

Reference: This approach aligns with cognitive grounding principles (causal, compositional, revisable reasoning) discussed in the paper.


Training Notes

  • Used 2-epoch training for this dataset size (1,213 examples)
  • Controlled comparison: identical data across all architectures reveals architectural differences
  • Qwen 2.5 3B showed stronger alignment (91.2%) than Llama 3.2 3B (80.4% avg) for philosophical discourse
  • Temperature-only sampling performed better than top_p for creative/philosophical outputs
  • Multi-judge evaluation (4 judges, 3 labs) provides robust cross-validation

Potential Extensions

  • Test on out-of-domain tasks (coding, math, technical writing)
  • Compare with larger Llama models
  • Run human preference evaluation alongside LLM judges

Citation

If you use this model in your research or applications, please cite:

@misc{brie-llama-3b,
  author = {Karman, Hunter},
  title = {Brie Llama 3.2 3B: Philosophy \& Creative Writing Fine-Tune},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/closestfriend/brie-llama-3b}},
}

Acknowledgments

  • Base Model: Meta's Llama 3.2 3B Instruct
  • Evaluation Judges: Anthropic's Claude Sonnet 4 and Claude Opus 4, OpenAI's GPT-4o, Google's Gemini 2.5 Flash Lite
  • Training Platform: RunPod for GPU infrastructure
  • Framework: HuggingFace Transformers, PEFT, TRL

License

Llama 3.2 Community License - Same as base model


Training: October 30, 2025
Evaluation: October 31, 2025
License: Llama 3.2 Community License

