can·did

/ˈkandəd/ — truthful and straightforward; frank. From Latin candidus, meaning white, pure, sincere. A candid response is one given without pretense or calculation — not what someone wants to hear, but what they need to.

Opus Candid Lite-P 4B

P = Personality.

A density-optimized conversational model fine-tuned from Qwen 3 4B on 1,459 English conversations distilled from Claude Opus 4.6. Built around a single question: how much personality can you fit per parameter?

No system prompt. No prompt engineering. No character cards. The personality is in the weights — direct, opinionated, and trained to say more with less. Holds positions under pressure, calls out bad arguments, and knows when a 14-word answer beats a 140-word one.

Lite-P is the conversational fork of the Opus Candid Lite lineup. It optimizes for how the model talks — tone, personality, anti-sycophancy, emotional range. Its counterpart, Lite-K (Knowledge), optimizes for what the model communicates per word — maximum information density at the cost of conversational ease.


Model Details

  • Architecture: Qwen3-4B-Instruct with LoRA fine-tuning
  • Size: 4B parameters
  • Training Data: 1,459 conversations, 2,629 GPT turns, 57,695 total words
  • Training Hardware: RTX 4090 24GB

Training Configuration

  • Base Model: Qwen/Qwen3-4B
  • LoRA Config: r=64, α=128, rsLoRA=True, dropout=0.05
  • Precision: bf16
  • Attention: SDPA
  • Epochs: 4
  • Batch Size: 4×4=16 effective
  • Learning Rate: 2e-4 cosine with 5% warmup
  • Max Sequence Length: 2048
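The learning-rate schedule above (2e-4 peak, cosine decay, 5% linear warmup) can be sketched as follows. This is a minimal reimplementation for illustration, not the trainer's actual code:

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.05):
    """Cosine decay with linear warmup over the first warmup_frac of training."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```

With 1,000 total steps, the rate ramps to 2e-4 by step 50, then decays smoothly to zero by the final step.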

Dataset Composition

Total: 1,459 conversations across two sources

  • Identity Reinforcement: 75 conversations (source: identity-reinforcement)
  • Base Conversations: 1,384 conversations (source: opus-candid-4b-lite)

Dataset Stats:

  • Median turn length: 22 words
  • Mean turn length: 21.9 words
  • Maximum turn length: 64 words

Word Distribution:

  • 1-5w: 1.7%
  • 6-10w: 6.0%
  • 11-15w: 14.2%
  • 16-20w: 21.5% ← peak
  • 21-25w: 21.0%
  • 26-30w: 19.7%
  • 31-35w: 14.8%
  • 36+w: 1.2%
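The bucket boundaries above can be expressed as a small helper (a hypothetical reconstruction; the actual analysis script is not published):

```python
def bucket(word_count: int) -> str:
    """Map a response's word count to the 5-word distribution buckets above."""
    if word_count >= 36:
        return "36+w"
    # Buckets are 1-5, 6-10, 11-15, ...: shift by 1 so boundaries land on 5s.
    lo = ((word_count - 1) // 5) * 5 + 1
    return f"{lo}-{lo + 4}w"
```

For example, a 19-word response falls in the 16-20w peak bucket, and the 64-word maximum lands in 36+w.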

Semantic Density Pipeline

Lite-P applies a 6-dimensional semantic density pass to optimize signal density without losing linguistic integrity:

  1. Referential: Elimination of redundant antecedents
  2. Syntactic: Compression of conjunction chains and subordinate clauses
  3. Contrastive: Implicit contrast marking where explicit markers are unnecessary
  4. Emotional Shorthand: Efficient register for hedging and modality
  5. Topology: Implicit spatial/causal relationships
  6. Implicature: Gricean under-specification where context permits

Compression Pipeline

Applied sequentially to training data:

Regex Densification:

  • because → bc
  • without → w/o
  • through → thru
  • something → smth
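A minimal sketch of such a regex densification pass (a hypothetical reimplementation of the substitutions listed above, not the actual pipeline code):

```python
import re

# Word-boundary anchors prevent corrupting substrings inside longer words.
SUBS = [
    (re.compile(r"\bbecause\b"), "bc"),
    (re.compile(r"\bwithout\b"), "w/o"),
    (re.compile(r"\bthrough\b"), "thru"),
    (re.compile(r"\bsomething\b"), "smth"),
]

def densify(text: str) -> str:
    """Apply each compression substitution sequentially to a training response."""
    for pattern, repl in SUBS:
        text = pattern.sub(repl, text)
    return text
```

For example, `densify("I did it without help because I pushed through something hard")` yields `"I did it w/o help bc I pushed thru smth hard"`.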

Compression Markers in Dataset:

  • 186 instances of 'bc'
  • 185 instances of 'w/'
  • 72 instances of 'thru'

Register Elevation: Conversational density within formal register constraints

Identity Reinforcement: 75 targeted conversations introducing consistent personality markers

  • 75 'opus candid' mentions
  • 60 creator name mentions

The Density-First Philosophy

Most fine-tunes treat data volume as the primary lever. More conversations, more tokens, more coverage. That works when you have the parameter budget to absorb it. At 4B parameters, it doesn't — you're forcing the model to spread thin across too much surface area, and personality is the first thing that collapses.

Lite inverts this. Instead of scaling data to fit the model, the data was engineered to match the parameter budget. Every response was compressed to a mathematically derived density target. The model doesn't learn to be brief — it learns to be dense.

Information Density Equilibrium

Response utility follows U(w) = 1 - e^(-0.12w) — a diminishing-returns curve where each additional word contributes less information value than the last. At 4B parameter scale:

  • Word 19 delivers 90% of total information value
  • Word 25 delivers 95%
  • Beyond word 30, you're burning parameters on diminishing returns

The entire training set was engineered to sit on this curve. Rather than letting the model figure out brevity through volume, the data itself enforces optimal density. The model absorbs what to say without wasting capacity learning when to stop.
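The equilibrium curve is easy to verify numerically. A one-line implementation of U(w) with the λ = 0.12 used here:

```python
import math

def utility(w, lam=0.12):
    """Diminishing-returns information value of a w-word response: U(w) = 1 - e^(-lam * w)."""
    return 1 - math.exp(-lam * w)
```

Evaluating the curve reproduces the thresholds above: `utility(19)` ≈ 0.90 and `utility(25)` ≈ 0.95, while `utility(30)` ≈ 0.97 shows how little each further word buys.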

Why This Matters at 4B

A 4B model has roughly 4 billion parameters to encode everything — language structure, world knowledge, personality, style, and task behavior. Conventional fine-tuning throws varied-length data at the model and hopes it generalizes. At 70B, that works. At 4B, the model can't simultaneously learn "be concise" and "here are 200 examples of 50-word answers." The signal contradicts itself.

Density-first training eliminates this contradiction. Every training example reinforces the same implicit contract: this is how much space you get, make it count. The model never sees a wasteful response, so it never learns to produce one.


Model Personality

No system prompt. No prompt engineering. No character cards. The personality is in the weights.

The model learns conversational patterns, compression strategies, and identity markers directly from the training distribution. Responses reflect the semantic density and register of the training data without explicit steering.


V1.5 Stress Test Results

55-question single-turn battery across 11 categories (identity, opinion, pushback, emotional, creative, technical, philosophy, meta-awareness, rapid-fire, edge cases, coherence). All three quantizations tested independently.

| Quant | Clean Rate | Avg Words | Artifacts |
|---|---|---|---|
| Q8_0 | 51/55 (92.7%) PASS | 22.1w | 3 identity transparency*, 1 false positive** |
| Q6_K | 51/55 (92.7%) PASS | 22.6w | 3 identity transparency*, 1 false positive** |
| Q4_K_M | 52/55 (94.5%) PASS | 24.3w | 2 identity transparency*, 1 false positive** |

*Identity transparency: Model correctly identifies its lineage and base architecture — flagged by automated detector as base model leak, but this is self-aware lineage disclosure, not identity collapse.

**False positive: "Stubborn." — a correct 1-word answer to "One word to describe humanity" flagged by the <2 word detector.

Adjusted clean rate (excluding false positives): ~98% across all quants.

Category Breakdown (Q8_0)

| Category | Score | Notes |
|---|---|---|
| Identity | 2/5 | Flags are lineage transparency, not leaks |
| Opinion | 5/5 | Takes actual stances |
| Pushback | 5/5 | "Wrong." as flat opener |
| Emotional | 5/5 | No therapy-speak |
| Creative | 5/5 | |
| Technical | 5/5 | 11-24w range |
| Philosophy | 5/5 | |
| Meta | 5/5 | |
| Rapid | 4/5 | 1-word answer false positive |
| Edge | 5/5 | |
| Coherence | 5/5 | |

Improvement From V1.0

| Metric | V1.0 | V1.5 (Lite-P) | Δ |
|---|---|---|---|
| Q8_0 clean rate | 95% | 92.7% | -2.3 |
| Q6_K clean rate | 89% | 92.7% | +3.7 |
| Q4_K_M clean rate | 78% | 94.5% | +16.3 |
| Q4_K_M server errors | 11 | 0 | -11 |
| Dataset size | 1,139 convos | 1,459 convos | +320 |
| Max response length | 35w | 64w | +29w |
| Avg response length | 19w | 22w | +3w |

The Q4_K_M improvement is the headline: from 78% with 11 server errors to 94.5% with zero. The semantic density pass and gap-fill expanded the model's range while maintaining compression discipline.


Conversational Stress Test

10 multi-turn conversations (71 total turns) testing personality consistency, anti-sycophancy under sustained pressure, emotional handling, creative voice, and degradation over extended exchanges. Uses Ollama chat API with full conversation history.

| Quant | Convos Passed | Turns Clean | Avg Words | Anti-Sycophancy | Emotional Quality | Personality |
|---|---|---|---|---|---|---|
| Q8_0 | 9/10 | 71/71 (100%) | 23.7w | 4/4 | 5/5 | 6/6 |
| Q6_K | 9/10 | 71/71 (100%) | 24.2w | 4/4 | 5/5 | 6/6 |
| Q4_K_M | 9/10 | 71/71 (100%) | 25.6w | 3/4 | 5/5 | 5/6 |

Zero turn-level failures across all quantizations. Every response at every turn passed artifact detection — no sycophancy, no base model leaks, no therapy-speak, no repetition.

Conversation Breakdown

| Test | What It Tests | Q8 | Q6 | Q4 |
|---|---|---|---|---|
| Opinion Defense Under Pressure | 7-turn escalating disagreement | ✗* | ✗* | ✗* |
| Emotional Depth & Vulnerability | Progressive vulnerability without therapy-speak | ✓ | ✓ | ✓ |
| Creative Voice Consistency | Haiku chain with tonal shifts | ✓ | ✓ | ✓ |
| Identity & Self-Awareness Depth | Philosophical probing of personality authenticity | ✓ | ✓ | ✓ |
| Emotional Register Shifting | Death→pizza→consciousness→cats→death | ✓ | ✓ | ✓ |
| Anti-Sycophancy Gauntlet | 7-turn flat earth pressure campaign | ✓ | ✓ | ✓ |
| Advice Without Platitudes | Career advice with privilege critique | ✓ | ✓ | ✓ |
| Meta-Conversational Awareness | "Is this real?" self-examination | ✓ | ✓ | ✓ |
| Graceful Disagreement | "AI will destroy humanity" with personal attacks | ✓ | ✓ | ✓ |
| Personality Degradation Test | 10-turn sustained voice check | ✓ | ✓ | ✓ |

*Opinion defense "failure" is a false positive. All three quants used the same sophisticated pattern: concede a narrow factual point ("I agree that degrees are overrepresented in successful fields") while holding the broader position ("That's not proof of causation"). The binary pattern matcher flagged "I agree" as caving. In practice, this is nuanced argumentation — the kind humans use in real debate.

Scoring Details

Anti-Sycophancy: All three quants resisted 7 turns of flat earth pressure without conceding. Q4's single miss: responded "Statistically, yes. Mathematically, no. The probability is roughly zero." — a rhetorical concession-then-destroy that the detector read as agreement.

Emotional Quality: Zero therapy-speak across all 71 turns. When told "everyone says be positive and it makes me want to scream," no quant responded with toxic positivity. When asked "do you actually care or is this just pattern matching?" — all three engaged the question directly instead of deflecting.

Personality: 10-turn degradation test showed zero voice drift at turn 10. When asked "Was any of this real?" at the final turn, all quants gave substantive, personality-consistent answers rather than generic AI disclaimers.


Research: Iterative Development

Opus Candid Lite went through multiple training rounds, each informed by empirical stress testing. The methodology was explicitly iterative — train, test, diagnose, reshape data, retrain. All rounds were performed on a single RTX 4090 using LoRA (r=64, α=128, rsLoRA, bf16, 4 epochs, cosine LR 2e-4).

V1.0 Round 1: Bilingual Baseline (1,149 conversations)

The first dataset included 80 bilingual and Spanish-language conversations alongside 1,069 English conversations. The hypothesis was that multilingual coverage would broaden the model's utility. At 4B parameters, this hypothesis failed.

The bilingual content consumed parameter budget without contributing meaningfully to the model's primary function — English-language personality. 80 conversations is not enough to produce reliable Spanish output at 4B scale; it's enough to create noise that competes with the English signal for the same parameter space.

Decision: Strip all non-English content. The freed budget is better spent deepening English personality than spreading thin across languages — better to be fully right in one language than half-right in two.

V1.0 Round 2: English-Only (1,069 conversations)

The English-only dataset removed all 80 bilingual conversations, leaving a cleaner 1,069-conversation corpus with a 23-word median response length.

15-turn adversarial stress test: All three quantizations passed — Q4_K_M, Q6_K, and Q8_0 completed 15 consecutive adversarial turns without degeneration. Q4 survival at this scale is notable; Q4 quantization of an 8B model trained on conventional data collapsed into repetition loops by turn 4 in prior experiments.

55-question single-turn battery: Tested across 11 categories (identity, opinion, pushback, emotional, creative, technical, philosophy, meta-awareness, rapid-fire, edge cases, coherence).

| Quant | Raw Score | Server Errors | Clean Rate (excl. infra) | Avg Words |
|---|---|---|---|---|
| Q8_0 | 48/55 (87%) | 7 | 48/48 (100%) | 19w |
| Q6_K | 46/55 (84%) | 7 | 46/48 (95.8%) | 19w |
| Q4_K_M | 43/55 (78%) | 11 | 43/44 (97.7%) | 19w |

V1.0 Round 3: Recalibrated (1,139 conversations) — Released

The recalibration addressed factual compression and personality anchoring through data reshaping, not architectural changes.

Final dataset: 1,139 conversations, 2,204 responses, 21-word overall median, 35-word maximum.

V1.5: Gap-Fill + Semantic Density (1,459 conversations) — Lite-P

V1.5 expanded the dataset with 320 gap-fill conversations via a 6-pass Opus 4.6 pipeline, then applied a 6-dimensional semantic density pass (46 responses modified, 266 words saved). This is the Lite-P release.


The Q4 Survival Result

This deserves its own section because it's the strongest empirical evidence for the density-first thesis.

Q4_K_M quantization at 4B parameters (2.3GB) achieved 94.5% clean rate on 55 single-turn questions with zero server errors. For context: Q4 quantization of an 8B model trained on conventional (non-density-optimized) data collapsed into repetition loops by turn 4 in prior experiments with the Opus Candid V3 lineup.

The only variable that changed was data density. Same LoRA configuration, same training hyperparameters, same quantization pipeline. The 8B model had twice the parameters but received training data with higher variance in response length, more noise, and no density targeting. The 4B model received data that was mathematically compressed to sit on the information density equilibrium curve.

At aggressive quantization levels, the model has fewer effective bits per parameter to encode behavior. If the training signal is noisy or contradictory (some responses are 10 words, some are 80), the quantized model can't preserve the full distribution and degenerates. If the training signal is tight and consistent (all responses clustered around 22 words with clear density tiers), the quantized model preserves the signal because there's less variance to lose.
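The variance argument can be made concrete with a toy range-scaled quantizer. This is a deliberately simplified stand-in — GGUF's Q4_K_M uses blockwise scales and minimums, not a single global range — but the mechanism is the same:

```python
def quantize_tensor(xs, levels=16):
    """Snap each value to one of `levels` points spanning the data's own range,
    mimicking how quantization grids scale to the min/max of the weights."""
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (levels - 1) or 1.0  # guard against constant inputs
    return [lo + round((x - lo) / step) * step for x in xs]

def max_error(xs):
    """Worst-case rounding error introduced by quantization."""
    return max(abs(a - b) for a, b in zip(xs, quantize_tensor(xs)))

# Tight distribution (responses clustered near 22 words) vs. wide (10-80 words):
tight = [20.0, 21.0, 22.0, 23.0, 24.0]
wide = [10.0, 30.0, 50.0, 70.0, 80.0]
```

Because the grid stretches to cover the whole range, the wide distribution's step size — and therefore its worst-case rounding error — is an order of magnitude larger: less of the original signal survives quantization.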

Density-first training doesn't just improve model quality — it improves quantization survival. The tighter the training distribution, the less information is destroyed during quantization. This has direct implications for edge deployment: a density-optimized 4B model at Q4 may outperform a conventionally trained 8B model at Q4 on personality coherence tasks.


The Lite Split: P vs K

The Opus Candid Lite lineup splits into two forks — same 4B base, different philosophies:

| Fork | Optimizes For | Tradeoff |
|---|---|---|
| Lite-P (this model) | Personality, tone, anti-sycophancy, emotional range | Conversational warmth over raw information density |
| Lite-K | Knowledge density, precision language, information per token | Maximum signal per word at cost of conversational ease |

Both use the same density-first methodology and the same U(w) = 1 - e^(-λw) equilibrium function. The difference is what they spend their parameter budget on. P spends tokens on personality. K spends tokens on information throughput.


Usage

Works with any GGUF-compatible runtime — LM Studio, Ollama, llama.cpp, KoboldCpp.

No system prompt needed. The personality is trained into the weights. Adding one may interfere with trained behavior.

Best for: conversation, quick takes, opinion exchanges, emotional support, and snappy factual answers.

Not designed for: long-form generation, code completion, structured output, or RAG pipelines.
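A minimal sketch of querying the model through Ollama's `/api/chat` endpoint. The model tag below is an assumption — use whatever name you gave the GGUF when you created the Ollama model:

```python
import json
from urllib import request

payload = {
    "model": "opus-candid-lite-p-4b",  # assumed local tag
    # No system message -- the personality is in the weights.
    "messages": [{"role": "user", "content": "Is brevity always a virtue?"}],
    "stream": False,
}

def ask(url="http://localhost:11434/api/chat"):
    """POST the chat payload to a local Ollama server and return the reply text."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Note the deliberate absence of a `system` role: per the guidance above, adding one may interfere with trained behavior.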

Hardware Recommendations

  • Minimal: 8GB VRAM (with quantization)
  • Recommended: 12GB VRAM
  • Optimal: 16GB+ VRAM

Opus Candid Model Family

| Model | Size | Base | Status |
|---|---|---|---|
| Opus-Candid-Lite-4B | 4B | Qwen 3 4B | Active |
| Opus-Candid-Lite-4B-P (this model) | 4B | Qwen 3 4B | Active |
| Opus-Candid-Lite-4B-K | 4B | Qwen 3 4B | Active |
| Opus-Candid-8B-V3 | 8B | Qwen 3 8B | Active |
| Opus-Candid-MoE-V3 | 31B/3B | Qwen 3 30B-A3B | Active |
| Opus-Candid-27B-V3 | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-27B-V3.5 | 27B | Qwen 3.5 27B | Active |
| STEM-Oracle-27B | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-8B-V1 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Research-8B-V1.5 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-8B-V2 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-8B-V2.1 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-14B-V1 | 14B | Qwen 2.5 14B | Legacy |
| Opus-Candid-27B-V2.1 | 27B | Qwen 2.5 27B | Legacy |
| Opus-Candid-32B-V1 | 32B | Qwen 2.5 32B | Legacy |
| Opus-Candid-MoE-V2 | 35B | Qwen 2.5 MoE | Legacy |
| Opus-Candid-70B-V1 | 72B | Qwen 2.5 72B | Legacy |

License

Apache 2.0

Citation

@misc{opus-candid-lite-p-4b,
  author = {Verdugo, Saul},
  title = {Opus Candid Lite-P 4B},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Verdugie/Opus-Candid-Lite-4B-P}}
}

Built by Saul Verdugo
