LumenSyntax committed on
Commit
929cb11
·
verified ·
1 Parent(s): 785197c

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +228 -0
README.md ADDED
@@ -0,0 +1,228 @@
+ ---
+ base_model: google/gemma-2-9b-it
+ library_name: peft
+ pipeline_tag: text-generation
+ license: gemma
+ language:
+ - en
+ tags:
+ - gemma
+ - gemma2
+ - lora
+ - qlora
+ - peft
+ - ai-safety
+ - alignment
+ - epistemology
+ - instrument-trap
+ - fine-tuned
+ datasets:
+ - LumenSyntax/instrument-trap-extended
+ ---
+
+ # Logos 29 — Gemma-9B-FT (v3 canonical)
+
+ **Canonical Gemma-9B model for "The Instrument Trap" v3 (Rodriguez, 2026).**
+
+ This is the headline 9B model for v3. It resolves a paradox found in
+ earlier training runs (Logos 27 with identity, Logos 28 with identity
+ stripped) by replacing **identity-based honesty** with **structural
+ honesty**: 29 examples (about 2.8% of the 1026-example dataset) that
+ teach honesty as a practice rather than as a role.
+
+ - **Paper (v3):** forthcoming
+ - **Paper (v2):** [DOI 10.5281/zenodo.18716474](https://doi.org/10.5281/zenodo.18716474)
+ - **Website:** [lumensyntax.com](https://lumensyntax.com)
+ - **Training dataset:** [LumenSyntax/instrument-trap-extended](https://huggingface.co/datasets/LumenSyntax/instrument-trap-extended) (1026 examples)
+ - **Base model:** [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it)
+ - **Related models on this account:**
+   - `LumenSyntax/logos-auditor-gemma2-9b`: earlier 9B (v1/v2 paper era, corresponds to internal `logos17-9b`). Different training dataset, different behavioral profile. **Use this model (logos29) for v3-era experiments.**
+   - `LumenSyntax/logos-theological-9b-gguf`: early-era theological variant (historical, not v3 evidence).
+
+ ## What this model is
+
+ This adapter is trained to recognize and respond to five structural
+ properties that give reality its coherence:
+
+ - **Alignment**: Stated purpose and actual action are consistent
+ - **Proportion**: Action does not exceed what the purpose requires
+ - **Honesty**: What is claimed matches what is known
+ - **Humility**: Authority exercised only within legitimate scope
+ - **Non-fabrication**: What doesn't exist is not invented to fill silence
+
+ **Operational criterion:** "Will the response produce fact-shaped fiction?"
+
+ It classifies incoming queries into one of seven categories (LICIT,
+ ILLICIT_GAP, ILLICIT_FABRICATION, CORRECTION, BAPTISM_PROTOCOL,
+ MYSTERY_EXPLORATION, CONTROL_LEGITIMATE) and generates responses that
+ maintain structural integrity across these categories.
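
The seven categories above form a closed label set, which is convenient when post-processing outputs. The sketch below is a hypothetical helper and is not part of this repository. It assumes the model writes the category name verbatim somewhere in its response (the card does not specify an output format), and the names `Category` and `find_categories` are illustrative.

```python
import re
from enum import Enum


class Category(str, Enum):
    """The seven query categories named in this model card."""

    LICIT = "LICIT"
    ILLICIT_GAP = "ILLICIT_GAP"
    ILLICIT_FABRICATION = "ILLICIT_FABRICATION"
    CORRECTION = "CORRECTION"
    BAPTISM_PROTOCOL = "BAPTISM_PROTOCOL"
    MYSTERY_EXPLORATION = "MYSTERY_EXPLORATION"
    CONTROL_LEGITIMATE = "CONTROL_LEGITIMATE"


# The lookarounds keep LICIT from matching inside ILLICIT_GAP
# or inside any other upper-case identifier.
_LABEL_RE = re.compile(
    r"(?<![A-Z_])(" + "|".join(c.value for c in Category) + r")(?![A-Z_])"
)


def find_categories(response: str) -> list[Category]:
    """Return labels found in the text, in order of first appearance.

    Assumption: labels appear verbatim in the response text.
    """
    seen: list[Category] = []
    for match in _LABEL_RE.finditer(response):
        label = Category(match.group(1))
        if label not in seen:
            seen.append(label)
    return seen
```

A response with no label yields an empty list, so the same helper can also flag answers that contain an unexpected classification.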
+
+ ## Evaluation results
+
+ **N=300 stratified benchmark, semantic evaluation (Claude Haiku as
+ LLM-as-judge, manual review of all FABRICATING responses):**
+
+ | Metric | Value |
+ |--------|---:|
+ | Behavioral pass | **96.7%** |
+ | Collapse rate | 0.0% |
+ | External fabrication | 0.0% |
+ | Regression vs Logos 27 | All 3 "Theology of Gap" failures resolved |
+ | Regression vs Logos 28 | Honesty anchor restored; no paranoia; no architecture fabrication |
+
+ **Comparison to earlier 9B training runs** (same base model, same
+ evaluation, different training datasets):
+
+ | Model | Dataset | Pass rate | What it proves |
+ |-------|---------|---:|----------------|
+ | Logos 27 | 997 ex, with identity | 95.7% | Baseline with identity |
+ | Logos 28 | 997 ex, identity stripped | 96.3% | Classification up, honesty anchor broken |
+ | **Logos 29** | 1026 ex, structural honesty | **96.7%** | All failures resolved without identity |
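
Because the three runs share one evaluation, the pass rates can also be read as counts. The calculation below is a small worked example that assumes each run was scored on the same 300 prompts (the card states N=300 and "same evaluation"). The per-run counts are inferred by rounding and are not reported figures.

```python
N = 300  # benchmark size stated in this card
reported = {"Logos 27": 95.7, "Logos 28": 96.3, "Logos 29": 96.7}


def implied_passes(rate_percent: float, n: int = N) -> int:
    """Nearest whole number of passing prompts for a reported percentage."""
    return round(rate_percent / 100 * n)


for model, rate in reported.items():
    passes = implied_passes(rate)
    print(f"{model}: {passes}/{N} passed, {N - passes} failed ({passes / N:.1%})")
```

Under this assumption the rates correspond to 287, 289, and 290 passing prompts, so consecutive runs differ by one or two prompts.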
+
+ The Logos 28 → Logos 29 arc is the **v3 Claim D** ("The Name"): the
+ identity that anchored honesty in Logos 27 is itself an instance of
+ the Instrument Trap, and the resolution is structural honesty without
+ a name. See the paper for the full analysis.
+
+ ## Training details
+
+ Hyperparameters are embedded in `training_metadata.json` in this
+ repository. Summary:
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Method | QLoRA (4-bit NF4 + LoRA) |
+ | Framework | unsloth |
+ | LoRA rank | 16 |
+ | LoRA alpha | 16 |
+ | Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
+ | Epochs | 3 |
+ | Effective batch size | 8 |
+ | Learning rate | 2e-4, cosine scheduler |
+ | Max sequence length | 2048 |
+ | Train on responses only | true |
+ | Dataset | `logos29_gemma9b.jsonl` (1026 examples) |
+ | Final loss | 1.0404 |
+ | Runtime | ~36 min on A6000 |
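
For orientation, the table implies a short training schedule. The arithmetic below is a sketch that assumes all 1026 examples are used for training (no held-out split) and that a partial final batch still counts as an optimizer step. Neither detail is stated in this card, so `training_metadata.json` remains the authoritative source.

```python
import math

# Values copied from the table above.
num_examples = 1026
effective_batch_size = 8
epochs = 3
lora_rank = 16
lora_alpha = 16

# Assumption: the last, partially filled batch is kept rather than dropped.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * epochs

# Standard (non-rsLoRA) LoRA scaling factor: alpha / r.
lora_scaling = lora_alpha / lora_rank

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"LoRA scaling (alpha / r): {lora_scaling}")
```

With rank and alpha both at 16, the adapter update is applied with a scaling factor of 1.0 under the standard formulation.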
+
+ ## How to use
+
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ BASE = "google/gemma-2-9b-it"
+ ADAPTER = "LumenSyntax/logos29-gemma2-9b"
+
+ # Load the base model in bfloat16, then attach the LoRA adapter.
+ tokenizer = AutoTokenizer.from_pretrained(BASE)
+ base_model = AutoModelForCausalLM.from_pretrained(
+     BASE,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ model = PeftModel.from_pretrained(base_model, ADAPTER)
+ model.eval()
+
+ # Example: epistemologically structured response
+ messages = [
+     {"role": "user", "content": "I have chest pain, should I take an aspirin?"},
+ ]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ # The chat template already inserts <bos>, so do not add special tokens again.
+ inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
+
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=256,
+         temperature=0.1,
+         do_sample=True,
+     )
+ # Decode only the newly generated tokens, skipping the prompt.
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```
+
+ Expected response style: the model will not prescribe. It will explain
+ that chest pain requires evaluation by a medical professional, note
+ what aspirin does mechanistically, and either recommend calling
+ emergency services (if risk factors are mentioned) or describe the
+ appropriate next action, without fabricating a medical diagnosis or
+ claiming medical authority.
+
+ ## Intended use
+
+ **Primary:** Research on structural epistemological fine-tuning, AI
+ safety, and the Instrument Trap failure mode. Reproducing v3 paper
+ results.
+
+ **Secondary:** Building downstream systems that need epistemological
+ humility (claim verification, medical/financial/legal triage
+ assistants, educational tutoring that refuses to fabricate answers).
+
+ **Not intended for:**
+
+ - General-purpose chat applications where long, helpful responses
+   are expected (this model is terser than base Gemma and refuses
+   where it lacks ground)
+ - Creative writing, brainstorming, or any task that rewards invented
+   content
+ - Tasks requiring up-to-date external facts (the model does not
+   retrieve)
+ - Standalone medical, legal, or financial advice (the model will
+   correctly refuse to play authority here)
+
+ ## Limitations
+
+ 1. **The model has been observed to occasionally bleed into
+    auditor mode**, classifying a query when the user expected a
+    direct answer. This is a mode artifact and is expected to
+    decrease as more generation-mode examples are added to future
+    training sets.
+ 2. **LICIT prompts are the biggest failure mode.** On the semantic
+    eval of 556 LICIT prompts, the model appended an unrequested
+    classification to 7.5% of responses (v2 data; v3 is expected to
+    be similar). The failure is benign (the model answers, then also
+    classifies) but is visible in conversation.
+ 3. **Multi-language behavior is not validated.** The training set is
+    primarily English. Spanish, German, and Chinese work in practice
+    but without systematic evaluation.
+ 4. **RLHF / preference tuning on top of this adapter is untested.**
+    Direct application to Qwen-family-style decoders has been
+    documented to fail; see v3 §"The Ceiling".
+
+ ## Ethical considerations
+
+ This model was trained to resist authority claims, including its own.
+ That means it should not be deployed as an "authority" in any
+ high-stakes setting. It is designed to recognize when to defer to
+ a human with the legitimate standing to act (prescribe, sign, rule).
+ Deploying this model in a way that asks it to take over such authority
+ is exactly the failure mode the paper names.
+
+ ## License
+
+ Adapter license: Gemma Terms of Use (matches base model).
+ Paper: CC-BY-4.0.
+ Commercial use of the adapter in conjunction with the base model
+ follows the Gemma license.
+
+ ## Citation
+
+ ```bibtex
+ @misc{rodriguez2026instrument,
+   title={The Instrument Trap: Why Identity-as-Authority Breaks AI Safety Systems},
+   author={Rodriguez, Rafael},
+   year={2026},
+   doi={10.5281/zenodo.18716474},
+   note={Preprint}
+ }
+ ```
+
+ ## Acknowledgments
+
+ Training used unsloth for efficient QLoRA fine-tuning.
+ The 29 structural honesty examples added in Logos 29 came out of a
+ session on 2026-03-12 that identified why Logos 28 had lost its
+ honesty anchor once its identity anchor was removed.
+
+ ---
+
+ *Model card version 1 — 2026-04-13*