---
license: mit
tags:
- philosophy
- lesswrong
- LWT
- lesswrong-transformer
- small
- harley-ml
- tiny
- small-language-model
- text-generation
datasets:
- Harley-ml/lesswrong
---
# LWTDense
LWTDense is a six-million-parameter language model trained on 102M tokens of LessWrong content.
## Architecture
LWTDense uses the Qwen3 architecture.
| Parameter | Value |
|---|---|
| Hidden Layers | 4 |
| Hidden Size | 256 |
| Attention Heads | 4 |
| KV Heads | 4 |
| Vocab Size | 8,004 |
| Intermediate Size | 1,024 |
| RoPE Theta | 10,000 |
| Max Position Embeddings | 512 |
| Tie Word Embeddings | True |
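
As a rough sanity check on the "six-million" figure, the parameter count can be estimated from the table alone. This is a minimal sketch, not the author's accounting: the table does not list a head dimension, so `head_dim = hidden / heads` is an assumption, and the layer layout (bias-free projections, a gated MLP with three matrices, RMSNorm weights, and per-head Q/K norms) reflects the usual Qwen3 dense block rather than anything stated in this card.

```python
# Values copied from the architecture table above.
vocab_size = 8004
hidden = 256
layers = 4
heads = 4
kv_heads = 4
intermediate = 1024

# ASSUMPTION: head_dim is not listed in the table; hidden / heads is assumed.
head_dim = hidden // heads


def count_params() -> int:
    """Estimate trainable parameters for a Qwen3-style dense decoder."""
    # Token embeddings; the LM head reuses them because embeddings are tied.
    embed = vocab_size * hidden
    # Attention projections, assumed bias-free.
    attn = (
        hidden * heads * head_dim           # q_proj
        + 2 * hidden * kv_heads * head_dim  # k_proj and v_proj
        + heads * head_dim * hidden         # o_proj
    )
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden * intermediate
    # Two RMSNorm weights per layer plus the per-head Q and K norms.
    norms = 2 * hidden + 2 * head_dim
    # The trailing `hidden` term is the final RMSNorm before the LM head.
    return embed + layers * (attn + mlp + norms) + hidden


total = count_params()
print(f"{total:,} parameters")
```

Under these assumptions the estimate lands at roughly 6.2M, with about a third of the parameters in the tied embedding matrix.
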
## Training
We trained LWTDense on 102M tokens of LessWrong content from the [Harley-ml/lesswrong](https://huggingface.co/datasets/Harley-ml/lesswrong) dataset.
### Hardware
LWTDense was trained on one NVIDIA T4 GPU in Google Colaboratory with a batch size of 16.
### Training Results
| Step | Val Loss | PPL |
|---|---|---|
| 2500 | 4.6120 | 100.6 |
| 5000 | 4.1817 | 65.5 |
| 7500 | 3.9751 | 53.3 |
| 10000 | 3.8341 | 46.3 |
| 12500 | 3.7339 | 41.8 |
| 15000 | 3.6675 | 39.1 |
| 17500 | 3.6213 | 37.3 |
| 20000 | 3.5884 | 36.2 |
| 22500 | 3.5661 | 35.4 |
| 25000 | 3.5549 | 35.0 |
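
The PPL column follows from the validation loss by exponentiation. The sketch below assumes the loss is a mean per-token cross-entropy in nats, which the card does not state explicitly; the helper name `perplexity` is illustrative. It reproduces the column approximately, up to rounding of the reported losses.

```python
import math

# A few (step -> validation loss) pairs copied from the table above.
val_loss = {2500: 4.6120, 5000: 4.1817, 25000: 3.5549}


def perplexity(loss: float) -> float:
    """Perplexity is exp(loss) when loss is mean cross-entropy in nats."""
    return math.exp(loss)


for step, loss in val_loss.items():
    print(f"step {step:>5}: loss {loss:.4f} -> PPL {perplexity(loss):.1f}")
```
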
## Generations
Prompt: `The`
Output:
```
The Socrates in the Bay Area Done, Too Has Miller's Story
>
>  Ruin Lee Collapse (Wiki). [Are there anything I should be.](http://www.michaele_narkhanv.net/2022/07/21/the-means-of-your-three/)
>
> … Pause: Fixed new lights at a low speed rate and increased risk of being unfaithful but still much faster than you would have to do with that quality on your mouth.
>
> If you don't see why this is happening and how it works out there are people who get to work here now, then what will happen? What if we just did something like "it looks good for my kids?" or "Don't make me more sure about the fish", etc, so the question was often not actually correct. And yes, mostly because it could tell us everything we said before, which is a major concern! But the fact that someone's 'wearing a dying' has been pretty terrible, since everyone is doing things they can talk about using them as much as possible in their own way.)
>
> When I read this book *"It sounds really weird to say that when I started reading our blog posts I found myself thinking about this topic, let alone commentary on the "what was going to happen?" as opposed to the sort of questions I got from this article—and then I am very skeptical. It seems quite bad. In particular I think some people are interested in talking about it; especially those people were excited about all these topics I've seen most of it. Even while having no reason to know how to help him write down any important thoughts or ideas you're not trying to get into it, I didn't have enough time to learn from my experience. I haven't thought about it yet - and sometimes I'd probably want to try hardly again.
>
> **I'm currently looking for evidence about existential risks** ***[Crossposting**](https://docs.google.com/document/d/1qJY
```
Prompt: `#`
Output:
```
1, and it's also not true to me that the best thing is actually going on with this.
* In particular: [The Nicholas London](https://x.com/sama316472/status/1940068429048927808) (Garrabrant): The Kean Timothy Memory of Competition, Astrobrant, Elephantand—a Prisoner's Cultorite-Infrastructure AI Rule of Consciousness
* Society: "I think you can be wrong about how much time I have done in a world where there was something more like anthropomorphic AGI research organization than yours." [Ergers are worried.](https://www.lesswrong.com/posts/Lv9dXQoKDcNkHwPY7/britch-the-nostalgebraist?commentId=TJA5SMRFjCqhWtpz). And maybe they could look at whether or not people would care about anybody else who might want to know what sorts of things they haven't yet said before, then say anything useful. But we still don't see why these people did their job as well. They're trying out some sort of good stuff.
* Say I'm interested in getting around the same questions. You do find yourself using them in order to answer questions, especially if one person does it correctly.
* Following my writing into two different ways of thinking. Someone has lots of answers. This sounds pretty bad but very little. Mostly it seems kind of weird to try for me! It probably isn't just saying 'what's your response so far?"
* During the past three days ago, a single expert thought experiment was just something worth reading here. There were many interesting arguments from a similar study. We had no idea what other models of the world, which was basically unambiguous, and this was mostly right now. It seemed possible to really write down new ideas instead of doing something important when it comes up in the context window.
* The "other" community is just like being able to figure out how to get a lot of progress. If you're starting a couple months, you'd be able to tell
```
Prompt: `AI`
Output:
```
. The most important thing to do is it's being used in the real world, as well as some of its own problems in alignment and so forth—the other things that we have learned from this kind of problem will be used for doing something you want (or at least not) just because I think there are more than a few people who aren't even having good goals or maybe get bad motivations on their mind.
**Why can an AGI?**
-------------------
So far I'll start with **Basically**, then **if one person is actually going to try out $X_1(x)$, but **at all**. So, if two people feel freely into $Y$, then they're like: "I'm going to make sure that these people really don't have a lot of time now." But the point is that we don't need a very much self-interesting goal of making progress towards solving the problems that happen in our universe - which means they've been trying to solve your long term problems by default". And instead of asking whether the future of AGI has a large impact on what would happen, it might be a bit easier to learn about how many times better AIs could work before training.
If you'd expect them to go through various ways of thinking, why should we expect humans to be able to do that?
```
Prompt: `Through a`
Output:
```
to Book of the San Francisco and Green, we are in our head to explain what's happening. The '*base-causal reality*' is that it has no other properties and all you're trying to do so for them (which I'll be able to find out.) And if one wants us to talk about something like this:
* We want ourselves to have an AI that might not exist within your environment, then there would be more than two options to make up to it. This could work on those outcomes/values that can easily apply or use as much as possible; but it doesn't seem to require some sort of [little](https://www.lesswrong.com/posts/qsJBxpPCnFhgTDkRdS/internally-truthful).
* You can probably learn from "the world" where you need a deep learning algorithm to solve problems with "weirdly optimizing algorithms". There will still be many different agents which can get different concepts - so far they don't matter how well-turned humans tend to give to each other's values. If so, consider the agent at least as powerful as its own utility function. For example, suppose the universe takes the set of states and runs their value, and its expected reward functions. Since there are some worlds with a small amount of bits that you only know, it's just that the box has access to all of these conditions. So, maybe the simulation may lead to realizing things that you've seen by themselves in the first place. But we cannot see the same argument with such claims. In fact, this sounds pretty cool! It seems to me that, though... even if it works, it really means there's nothing left here, that'll actually go over time. As long as it happens, you should just choose between actions and actions when you think about the future, so that's fine? Doesn't any of my goals come away?"[^z0j6w5mv3i]
```
---
## Benchmark
| Model | Total Params | Activated Params | Score | Coherent | Mostly Coherent | Partially Coherent | Incoherent |
|---|---|---|---|---|---|---|---|
| [gpt2](https://huggingface.co/openai-community/gpt2) | 117M | 117M | 0.4227 | 0 | 51 | 78 | 171 |
| [LWTMoE-10M-A6M](https://huggingface.co/Harley-ml/LWTMoE-10M-A6M) | 10M | 6M | 0.3520 | 1 | 31 | 49 | 219 |
| **LWTDense-6M** (this) | 6M | 6M | 0.2962 | 2 | 13 | 32 | 253 |
| [Pythia-14m-deduped](https://huggingface.co/EleutherAI/pythia-14m-deduped) | 14M | 14M | 0.2846 | 0 | 23 | 24 | 253 |
To evaluate the coherence, factuality, and fluency of our models, we use [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) as a judge over three hundred generations per model from unconditional prompts.
LWTDense-6M scores higher than Pythia-14m-deduped (0.2962 vs. 0.2846) with less than half the active parameter count.
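
The verdict counts in the table are easier to compare as shares of the three hundred judged generations. The snippet below is a small sketch that only re-expresses the table; the card does not say how the Score column is derived, so it makes no attempt to reproduce it. The helper name `shares` is illustrative.

```python
# Judge verdict counts copied from the benchmark table above, in the order
# (coherent, mostly coherent, partially coherent, incoherent).
counts = {
    "gpt2": (0, 51, 78, 171),
    "LWTMoE-10M-A6M": (1, 31, 49, 219),
    "LWTDense-6M": (2, 13, 32, 253),
    "Pythia-14m-deduped": (0, 23, 24, 253),
}


def shares(row):
    """Convert raw verdict counts into fractions of the row total."""
    total = sum(row)
    return [n / total for n in row]


for name, row in counts.items():
    pct = ", ".join(f"{s:.1%}" for s in shares(row))
    print(f"{name:<20} n={sum(row)}  {pct}")
```

Each row sums to 300, which matches the stated number of generations per model.
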
## Limitations
1. Does not generalize to out-of-distribution prompts
2. No instruction-following
3. Can't reason
4. Mostly incoherent
## Use Cases
1. Educational Research
2. Deployment on constrained devices
3. Or, for fun.
## Inference
```python
import traceback

import torch
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast, AddedToken

# --- Config ---
MODEL_PATH = "harley-ml/LWTDense-6M"
TOKENIZER_PATH = MODEL_PATH
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.7
TOP_K = 50
TOP_P = 0.9
REP_PENALTY = 1.3

# --- Tokenizer ---
# from_pretrained accepts a Hub repo id or a local directory; the bare
# constructor's tokenizer_file argument would need a path to a tokenizer.json file.
tok = PreTrainedTokenizerFast.from_pretrained(TOKENIZER_PATH)

# Register any special tokens the tokenizer does not already define.
specials = {}
if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
if tok.pad_token is None: specials["pad_token"] = AddedToken("<|pad|>", special=True)
if specials:
    tok.add_special_tokens(specials)
tok.padding_side = "left"

# --- Model ---
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    # device_map="auto",
    # low_cpu_mem_usage=True,
)
# output_router_logits only matters for MoE checkpoints; setting it on this
# dense model is harmless.
model.config.output_router_logits = False
model.eval()
if not hasattr(model, "device_map") or model.device_map is None:
    model.to(DEVICE)

# --- Debug prints ---
print(f"tok.vocab_size: {tok.vocab_size}")
print(f"len(tok): {len(tok)}")
print(f"model vocab size: {model.config.vocab_size}")
print(f"bos_token: {tok.bos_token!r} id={tok.bos_token_id}")
print(f"eos_token: {tok.eos_token!r} id={tok.eos_token_id}")
print(f"pad_token: {tok.pad_token!r} id={tok.pad_token_id}")


def generate(prompt: str) -> str:
    """Sample a continuation of `prompt` and return only the new text."""
    full_prompt = tok.bos_token + prompt
    inputs = tok(full_prompt, return_tensors="pt")
    if hasattr(model, "device_map") and model.device_map is not None:
        inputs = inputs.to("cpu")
    else:
        inputs = inputs.to(DEVICE)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=True,
            temperature=TEMPERATURE,
            top_k=TOP_K,
            top_p=TOP_P,
            repetition_penalty=REP_PENALTY,
            eos_token_id=tok.eos_token_id,
            pad_token_id=tok.pad_token_id,
        )
    # Drop the prompt tokens so only the generated continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print("\nModel loaded successfully! Ready for inference.\n")
    while True:
        try:
            prompt = input("Prompt: ")
            if not prompt.strip():
                print("Empty prompt → exiting.")
                break
            print("Generating...")
            response = generate(prompt)
            print(response)
            print("-" * 80)
        except KeyboardInterrupt:
            print("\nInterrupted by user.")
            break
        except Exception as e:
            print(f"Error during generation: {e}")
            traceback.print_exc()
            break
```
## Related Models
1. [LWTMoE-10M-A6M](https://huggingface.co/Harley-ml/LWTMoE-10M-A6M)
## Citation
```bibtex
@misc{lwtdense-6m,
title = {LWTDense-6M: Narrow-Domain Training at a Tiny Scale},
author = {Harley-ml},
year = {2026},
url = {https://huggingface.co/Harley-ml/LWTDense-6M}
}
```