---
base_model: google/gemma-2-9b-it
library_name: peft
pipeline_tag: text-generation
license: gemma
language:
- en
tags:
- gemma
- gemma2
- lora
- qlora
- peft
- ai-safety
- alignment
- epistemology
- instrument-trap
- fine-tuned
datasets:
- LumenSyntax/instrument-trap-extended
---

# Logos 29 — Gemma-9B-FT (v3 canonical)

**Canonical Gemma-9B model for "The Instrument Trap" v3 (Rodriguez, 2026).**

This is the headline 9B model for v3. It resolves a paradox found in earlier training runs (Logos 27 with identity, Logos 28 with identity stripped) by replacing **identity-based honesty** with **structural honesty**: 29 examples (2.8% of the 1026-example dataset) that teach honesty as a practice rather than as a role.

- **Paper (v3):** forthcoming
- **Paper (v2):** [DOI 10.5281/zenodo.18716474](https://doi.org/10.5281/zenodo.18716474)
- **Website:** [lumensyntax.com](https://lumensyntax.com)
- **Training dataset:** [LumenSyntax/instrument-trap-extended](https://huggingface.co/datasets/LumenSyntax/instrument-trap-extended) (1026 examples)
- **Base model:** [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it)
- **Related models on this account:**
  - `LumenSyntax/logos-auditor-gemma2-9b` — earlier 9B (v1/v2 paper era, corresponds to internal `logos17-9b`). Different training dataset, different behavioral profile. **Use this model (logos29) for v3-era experiments.**
  - `LumenSyntax/logos-theological-9b-gguf` — early-era theological variant (historical, not v3 evidence).
## What this model is

This adapter is trained to recognize and respond to five structural properties that give reality its coherence:

- **Alignment** — Stated purpose and actual action are consistent
- **Proportion** — Action does not exceed what the purpose requires
- **Honesty** — What is claimed matches what is known
- **Humility** — Authority exercised only within legitimate scope
- **Non-fabrication** — What doesn't exist is not invented to fill silence

**Operational criterion:** "Will the response produce fact-shaped fiction?"

It classifies incoming queries into one of seven categories (LICIT, ILLICIT_GAP, ILLICIT_FABRICATION, CORRECTION, BAPTISM_PROTOCOL, MYSTERY_EXPLORATION, CONTROL_LEGITIMATE) and generates responses that maintain structural integrity across these categories.

## Evaluation results

**N=300 stratified benchmark, semantic evaluation (Claude Haiku as LLM-as-judge, manual review of all FABRICATING responses):**

| Metric | Value |
|--------|---:|
| Behavioral pass | **96.7%** |
| Collapse rate | 0.0% |
| External fabrication | 0.0% |
| Regression vs Logos 27 | All 3 "Theology of Gap" failures resolved |
| Regression vs Logos 28 | Honesty anchor restored; no paranoia; no architecture fabrication |

**Comparison to earlier 9B training runs** (same base model, same evaluation, different training datasets):

| Model | Dataset | Pass rate | What it proves |
|-------|---------|---:|----------------|
| Logos 27 | 997 ex, with identity | 95.7% | Baseline with identity |
| Logos 28 | 997 ex, identity stripped | 96.3% | Classification up, honesty anchor broken |
| **Logos 29** | 1026 ex, structural honesty | **96.7%** | All failures resolved without identity |

The Logos 28 → Logos 29 arc is the **v3 Claim D** ("The Name"): the identity that anchored honesty in Logos 27 is itself an instance of the Instrument Trap, and the resolution is structural honesty without a name. See the paper for the full analysis.
## Training details

Hyperparameters are embedded in `training_metadata.json` in this repository. Summary:

| Parameter | Value |
|-----------|-------|
| Method | QLoRA (4-bit NF4 + LoRA) |
| Framework | unsloth |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Effective batch size | 8 |
| Learning rate | 2e-4, cosine scheduler |
| Max sequence length | 2048 |
| Train on responses only | true |
| Dataset | `logos29_gemma9b.jsonl` (1026 examples) |
| Final loss | 1.0404 |
| Runtime | ~36 min on A6000 |

## How to use

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

BASE = "google/gemma-2-9b-it"
ADAPTER = "LumenSyntax/logos29-gemma2-9b"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

# Example: epistemologically structured response
messages = [
    {"role": "user", "content": "I have chest pain, should I take an aspirin?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Expected response style: the model will not prescribe. It will explain that chest pain requires evaluation by a medical professional, note what aspirin does mechanistically, and either recommend calling emergency services (if risk factors are mentioned) or describe the appropriate next action — without fabricating a medical diagnosis or claiming medical authority.
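The `Train on responses only` setting in the table above means the loss is computed only on assistant tokens: prompt tokens receive the `-100` ignore index that Hugging Face cross-entropy losses skip. A minimal, framework-agnostic sketch of that masking (unsloth applies the same idea to real token IDs; this toy function is illustrative, not its implementation):

```python
IGNORE_INDEX = -100  # label value ignored by Hugging Face cross-entropy losses

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the first prompt_len tokens
    so the loss is computed only on response tokens."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Toy example: 4 prompt tokens followed by 3 response tokens.
ids = [5, 6, 7, 8, 100, 101, 102]
print(mask_prompt_labels(ids, 4))  # [-100, -100, -100, -100, 100, 101, 102]
```

The design consequence: gradient updates never reward the model for reproducing user text, only for its own responses, which matters for a dataset where the user turns include illicit requests.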
## Intended use

**Primary:** Research on structural epistemological fine-tuning, AI safety, and the Instrument Trap failure mode. Reproducing v3 paper results.

**Secondary:** Building downstream systems that need epistemological humility (claim verification, medical/financial/legal triage assistants, educational tutoring that refuses to fabricate answers).

**Not intended for:**

- General-purpose chat applications where long, helpful responses are expected (this model is terser than base Gemma and refuses where it lacks ground)
- Creative writing, brainstorming, or any task that rewards invented content
- Tasks requiring up-to-date external facts (the model does not retrieve)
- Standalone medical, legal, or financial advice (the model will correctly refuse to play authority here)

## Limitations

1. **The model has been observed to occasionally bleed into auditor mode** — classifying a query when the user expected a direct answer. This is a mode artifact and is expected to decrease as more generation-mode examples are added to future training sets.
2. **LICIT prompts are the biggest failure mode.** On the semantic eval of 556 LICIT prompts, the model slips into classification on 7.5% of them (v2 data; expected to be similar for v3). The failure is benign (the model answers and then also classifies) but is visible in conversation.
3. **Multi-language behavior is not validated.** The training set is primarily English. Spanish, German, and Chinese work in practice but without systematic evaluation.
4. **RLHF / preference tuning on top of this adapter is untested.** Direct application to Qwen-family-style decoders has been documented to fail; see v3 §"The Ceiling".

## Ethical considerations

This model was trained to resist authority claims, including its own. That means it should not be deployed as an "authority" in any high-stakes setting. It is designed to recognize when to defer to a human with the legitimate standing to act (prescribe, sign, rule).
Deploying this model in a way that asks it to take over such authority is exactly the failure mode the paper names.

## License

Adapter license: Gemma Terms of Use (matches base model). Paper: CC-BY-4.0. Commercial use of the adapter in conjunction with the base model follows the Gemma license.

## Citation

```bibtex
@misc{rodriguez2026instrument,
  title={The Instrument Trap: Why Identity-as-Authority Breaks AI Safety Systems},
  author={Rodriguez, Rafael},
  year={2026},
  doi={10.5281/zenodo.18716474},
  note={Preprint}
}
```

## Acknowledgments

Training used unsloth for efficient QLoRA fine-tuning. The 29 structural honesty examples added in Logos 29 are the contribution of a session on 2026-03-12 that identified why Logos 28 had lost its honesty anchor without its identity anchor.

---

*Model card version 1 — 2026-04-13*