---
license: mit
tags:
- philosophy
- lesswrong
- LWT
- lesswrong-transformer
- small
- harley-ml
- tiny
- small-language-model
- text-generation
datasets:
- Harley-ml/lesswrong
---

# LWTDense

LWTDense is a six-million-parameter model trained on 102M tokens of LessWrong content.

## Architecture

LWTDense uses the Qwen3 architecture.

| Parameter | Value |
|---|---|
| Hidden Layers | 4 |
| Hidden Size | 256 |
| Attention Heads | 4 |
| KV Heads | 4 |
| Vocab Size | 8,004 |
| Intermediate Size | 1,024 |
| RoPE Theta | 10,000 |
| Max Position Embeddings | 512 |
| Tie Word Embeddings | True |
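
As a sanity check on the "six million" figure, the parameter count can be estimated from the table above. This is a rough sketch, not an official breakdown. It assumes a Qwen3-style layer layout with no bias terms, per-head query/key norms, and `head_dim = hidden_size / attention_heads`; none of these are stated in this card.

```python
# Rough parameter count for a Qwen3-style decoder, using the table's values.
# Assumptions (not stated in this card): head_dim = hidden / heads,
# no bias terms, RMSNorm weights only, per-head q/k norms.

def count_params(vocab=8004, hidden=256, layers=4, heads=4, kv_heads=4,
                 intermediate=1024, tie_embeddings=True):
    head_dim = hidden // heads                  # assumed, see note above
    embed = vocab * hidden                      # token embedding matrix
    q_proj = hidden * heads * head_dim
    kv_proj = 2 * hidden * kv_heads * head_dim  # key and value projections
    o_proj = heads * head_dim * hidden
    qk_norm = 2 * head_dim                      # q_norm and k_norm weights
    mlp = 3 * hidden * intermediate             # gate, up, and down projections
    layer_norms = 2 * hidden                    # pre-attention and pre-MLP norms
    per_layer = q_proj + kv_proj + o_proj + qk_norm + mlp + layer_norms
    total = embed + layers * per_layer + hidden  # + final norm
    if not tie_embeddings:
        total += vocab * hidden                 # separate output head
    return total

print(f"{count_params():,}")
```

Under these assumptions the estimate comes to roughly 6.2M parameters, about a third of which is the tied embedding matrix.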

## Training

We trained LWTDense on 102M tokens of LessWrong content. The dataset is available at [Harley-ml/lesswrong](https://huggingface.co/datasets/Harley-ml/lesswrong).

### Hardware

LWTDense was trained on one NVIDIA T4 GPU in Google Colaboratory with a batch size of 16.
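
For a rough sense of scale, the number of tokens processed can be estimated from the batch size. This is back-of-the-envelope arithmetic under assumptions not confirmed by this card: every step consumes one batch of 16 full 512-token sequences, there is no gradient accumulation, and training ran for the 25,000 steps shown in the results table.

```python
# Back-of-the-envelope token budget. SEQ_LEN and the lack of gradient
# accumulation are assumptions, not facts from this card.
BATCH_SIZE = 16               # from the Hardware section
SEQ_LEN = 512                 # assumed equal to Max Position Embeddings
STEPS = 25_000                # last step in the Training Results table
DATASET_TOKENS = 102_000_000  # dataset size stated above

tokens_seen = BATCH_SIZE * SEQ_LEN * STEPS
passes = tokens_seen / DATASET_TOKENS

print(f"tokens seen: {tokens_seen:,}")
print(f"approximate passes over the dataset: {passes:.2f}")
```

If those assumptions hold, the run corresponds to about two passes over the dataset.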

### Training Results

| Step | Val Loss | PPL |
|---|---|---|
| 2500 | 4.6120 | 100.6 |
| 5000 | 4.1817 | 65.5 |
| 7500 | 3.9751 | 53.3 |
| 10000 | 3.8341 | 46.3 |
| 12500 | 3.7339 | 41.8 |
| 15000 | 3.6675 | 39.1 |
| 17500 | 3.6213 | 37.3 |
| 20000 | 3.5884 | 36.2 |
| 22500 | 3.5661 | 35.4 |
| 25000 | 3.5549 | 35.0 |
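
The PPL column follows from the validation loss. A minimal sketch, assuming the loss is the mean per-token cross-entropy in nats, in which case perplexity is `exp(loss)`:

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(loss)

# A few (step, validation loss) pairs copied from the table above.
for step, loss in [(5000, 4.1817), (15000, 3.6675), (25000, 3.5549)]:
    print(f"step {step}: ppl = {perplexity(loss):.1f}")
```

Recomputed values can differ from the table in the last digit, since the logged losses are themselves rounded.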

## Generations

Prompt: `The`

Output:
```
 The Socrates in the Bay Area Done, Too Has Miller's Story
> 
> ![🤡](https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f9b8.png) Ruin Lee Collapse (Wiki). [Are there anything I should be.](http://www.michaele_narkhanv.net/2022/07/21/the-means-of-your-three/)
> 
> …   Pause: Fixed new lights at a low speed rate and increased risk of being unfaithful but still much faster than you would have to do with that quality on your mouth.
> 
> If you don't see why this is happening and how it works out there are people who get to work here now, then what will happen? What if we just did something like "it looks good for my kids?" or "Don't make me more sure about the fish", etc, so the question was often not actually correct. And yes, mostly because it could tell us everything we said before, which is a major concern! But the fact that someone's 'wearing a dying' has been pretty terrible, since everyone is doing things they can talk about using them as much as possible in their own way.)
> 
> When I read this book *"It sounds really weird to say that when I started reading our blog posts I found myself thinking about this topic, let alone commentary on the "what was going to happen?" as opposed to the sort of questions I got from this article—and then I am very skeptical. It seems quite bad. In particular I think some people are interested in talking about it; especially those people were excited about all these topics I've seen most of it. Even while having no reason to know how to help him write down any important thoughts or ideas you're not trying to get into it, I didn't have enough time to learn from my experience. I haven't thought about it yet - and sometimes I'd probably want to try hardly again. 
> 
> **I'm currently looking for evidence about existential risks** ***[Crossposting**](https://docs.google.com/document/d/1qJY
```

Prompt: `#`

Output:
```
1, and it's also not true to me that the best thing is actually going on with this.
*   In particular: [The Nicholas London](https://x.com/sama316472/status/1940068429048927808) (Garrabrant): The Kean Timothy Memory of Competition, Astrobrant, Elephantand—a Prisoner's Cultorite-Infrastructure AI Rule of Consciousness
    *   Society: "I think you can be wrong about how much time I have done in a world where there was something more like anthropomorphic AGI research organization than yours." [Ergers are worried.](https://www.lesswrong.com/posts/Lv9dXQoKDcNkHwPY7/britch-the-nostalgebraist?commentId=TJA5SMRFjCqhWtpz). And maybe they could look at whether or not people would care about anybody else who might want to know what sorts of things they haven't yet said before, then say anything useful. But we still don't see why these people did their job as well. They're trying out some sort of good stuff.
    
*   Say I'm interested in getting around the same questions. You do find yourself using them in order to answer questions, especially if one person does it correctly.
        
*   Following my writing into two different ways of thinking. Someone has lots of answers. This sounds pretty bad but very little. Mostly it seems kind of weird to try for me! It probably isn't just saying 'what's your response so far?"
    
*   During the past three days ago, a single expert thought experiment was just something worth reading here. There were many interesting arguments from a similar study. We had no idea what other models of the world, which was basically unambiguous, and this was mostly right now. It seemed possible to really write down new ideas instead of doing something important when it comes up in the context window.
    
*   The "other" community is just like being able to figure out how to get a lot of progress. If you're starting a couple months, you'd be able to tell
```

Prompt: `AI`

Output:
```
. The most important thing to do is it's being used in the real world, as well as some of its own problems in alignment and so forth—the other things that we have learned from this kind of problem will be used for doing something you want (or at least not) just because I think there are more than a few people who aren't even having good goals or maybe get bad motivations on their mind.

**Why can an AGI?**
-------------------

So far I'll start with **Basically**, then **if one person is actually going to try out $X_1(x)$, but **at all**. So, if two people feel freely into $Y$, then they're like: "I'm going to make sure that these people really don't have a lot of time now." But the point is that we don't need a very much self-interesting goal of making progress towards solving the problems that happen in our universe - which means they've been trying to solve your long term problems by default". And instead of asking whether the future of AGI has a large impact on what would happen, it might be a bit easier to learn about how many times better AIs could work before training.

If you'd expect them to go through various ways of thinking, why should we expect humans to be able to do that?
```

Prompt: `Through a`

Output:
```

to Book of the San Francisco and Green, we are in our head to explain what's happening. The '*base-causal reality*' is that it has no other properties and all you're trying to do so for them (which I'll be able to find out.) And if one wants us to talk about something like this:
    *   We want ourselves to have an AI that might not exist within your environment, then there would be more than two options to make up to it. This could work on those outcomes/values that can easily apply or use as much as possible; but it doesn't seem to require some sort of [little](https://www.lesswrong.com/posts/qsJBxpPCnFhgTDkRdS/internally-truthful).
        *   You can probably learn from "the world" where you need a deep learning algorithm to solve problems with "weirdly optimizing algorithms". There will still be many different agents which can get different concepts - so far they don't matter how well-turned humans tend to give to each other's values. If so, consider the agent at least as powerful as its own utility function. For example, suppose the universe takes the set of states and runs their value, and its expected reward functions. Since there are some worlds with a small amount of bits that you only know, it's just that the box has access to all of these conditions. So, maybe the simulation may lead to realizing things that you've seen by themselves in the first place. But we cannot see the same argument with such claims. In fact, this sounds pretty cool! It seems to me that, though... even if it works, it really means there's nothing left here, that'll actually go over time. As long as it happens, you should just choose between actions and actions when you think about the future, so that's fine? Doesn't any of my goals come away?"[^z0j6w5mv3i]
```
---

## Benchmark

| Model | Total Params | Activated Params | Score | Coherent | Mostly Coherent | Partially Coherent | Incoherent |
|---|---|---|---|---|---|---|---|
| [gpt2](https://huggingface.co/openai-community/gpt2) | 117M | 117M | 0.4227 | 0 | 51 | 78 | 171 |
| [LWTMoE-10M-A6M](https://huggingface.co/Harley-ml/LWTMoE-10M-A6M) | 10M | 6M | 0.3520 | 1 | 31 | 49 | 219 |
| **LWTDense-6M** (this) | 6M | 6M | 0.2962 | 2 | 13 | 32 | 253 |
| [Pythia-14m-deduped](https://huggingface.co/EleutherAI/pythia-14m-deduped) | 14M | 14M | 0.2846 | 0 | 23 | 24 | 253 |

To evaluate the coherence, factuality, and fluency of our models, we use [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) as a judge over three hundred generations per model from unconditional prompts.
LWTDense-6M scores slightly higher than Pythia-14m-deduped (0.2962 vs. 0.2846) with less than half the parameter count (6M vs. 14M).
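
The bucket counts in the table are easier to compare as fractions. The sketch below only re-expresses the counts already shown. It does not reproduce the Score column, whose formula is not given in this card.

```python
# Judge bucket counts copied from the benchmark table:
# (coherent, mostly coherent, partially coherent, incoherent)
COUNTS = {
    "gpt2": (0, 51, 78, 171),
    "LWTMoE-10M-A6M": (1, 31, 49, 219),
    "LWTDense-6M": (2, 13, 32, 253),
    "Pythia-14m-deduped": (0, 23, 24, 253),
}
LABELS = ("coherent", "mostly", "partially", "incoherent")

def shares(counts):
    """Convert raw bucket counts into fractions of all judged generations."""
    total = sum(counts)
    return {label: n / total for label, n in zip(LABELS, counts)}

for name, counts in COUNTS.items():
    s = shares(counts)
    print(f"{name:20s} incoherent: {s['incoherent']:.1%}  "
          f"at least partially coherent: {1 - s['incoherent']:.1%}")
```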

## Limitations

1. Does not generalize to out-of-distribution prompts
2. No instruction-following
3. Can't reason
4. Mostly incoherent

## Use Cases

1. Educational Research
2. Deployment on constrained devices
3. Fun

## Inference

```python
import torch
from transformers import AutoModelForCausalLM, PreTrainedTokenizerFast, AddedToken

# --- Config ---
MODEL_PATH = "Harley-ml/LWTDense-6M"
TOKENIZER_PATH = MODEL_PATH
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.7
TOP_K = 50
TOP_P = 0.9
REP_PENALTY = 1.3

# --- Tokenizer ---
# from_pretrained resolves a Hub repo id or a local directory;
# tokenizer_file= only accepts a path to a local tokenizer.json.
tok = PreTrainedTokenizerFast.from_pretrained(TOKENIZER_PATH)
specials = {}
if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
if tok.pad_token is None: specials["pad_token"] = AddedToken("<|pad|>", special=True)
if specials:
    tok.add_special_tokens(specials)
tok.padding_side = "left"

# --- Model ---
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    # fp16 on GPU; fall back to fp32 on CPU, where half precision is slow or unsupported
    torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
    trust_remote_code=True,
)
model.to(DEVICE)
model.eval()

# --- Debug prints ---
print(f"tok.vocab_size: {tok.vocab_size}")
print(f"len(tok): {len(tok)}")
print(f"model vocab size: {model.config.vocab_size}")
print(f"bos_token: {tok.bos_token!r} id={tok.bos_token_id}")
print(f"eos_token: {tok.eos_token!r} id={tok.eos_token_id}")
print(f"pad_token: {tok.pad_token!r} id={tok.pad_token_id}")

def generate(prompt: str) -> str:
    full_prompt = tok.bos_token + prompt
    # Skip token_type_ids: a causal LM does not use them.
    inputs = tok(full_prompt, return_tensors="pt", return_token_type_ids=False)
    # Move the inputs to whichever device the model ended up on.
    inputs = inputs.to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=True,
            temperature=TEMPERATURE,
            top_k=TOP_K,
            top_p=TOP_P,
            repetition_penalty=REP_PENALTY,
            eos_token_id=tok.eos_token_id,
            pad_token_id=tok.pad_token_id,
        )
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print("\nModel loaded successfully! Ready for inference.\n")
    while True:
        try:
            prompt = input("Prompt: ")
            if not prompt.strip():
                print("Empty prompt → exiting.")
                break
            print("Generating...")
            response = generate(prompt)
            print(response)
            print("-" * 80)
        except KeyboardInterrupt:
            print("\nInterrupted by user.")
            break
        except Exception as e:
            print(f"Error during generation: {e}")
            import traceback
            traceback.print_exc()
            break
```

## Related Models

1. [LWTMoE-10M-A6M](https://huggingface.co/Harley-ml/LWTMoE-10M-A6M)

## Citation

```bibtex
@misc{lwtdense-6m,
  title     = {LWTDense-6M: Narrow-Domain Training at a Tiny Scale},
  author    = {Harley-ml},
  year      = {2026},
  url       = {https://huggingface.co/Harley-ml/LWTDense-6M}
}
```