---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
pretty_name: Echo88 150M Base
tags:
- text-generation
- causal-lm
- base-model
- decoder-only
- autoregressive
- from-scratch
- llama
- retro
- 1980s
- usenet
- magazines
- books
- computer-history
- english
datasets:
- guus4324343/Echo88-Pretrain-1.17B
---
# Echo88-150M-Base
Echo88-150M-Base is a small English decoder-only causal language model trained from scratch on the Echo88 pretraining dataset.
Echo88 is designed as a retro language model that reflects the language, culture, and computing of the period up to the late 1980s, drawing on magazines, Usenet discussions, and older book text.
This is a **base model**, not an instruction-tuned chatbot. It is trained for next-token prediction and should be fine-tuned before being used as a helpful assistant.
## Model Details
- **Model name:** Echo88-150M-Base
- **Model type:** decoder-only causal language model
- **Architecture:** LLaMA-style transformer
- **Training type:** from scratch
- **Parameter count:** 163,606,272 parameters
- **Language:** English
- **Context length:** 2048 tokens
- **Tokenizer:** custom Echo88 byte-level BPE tokenizer
- **Vocabulary size:** 32,768
- **Training objective:** autoregressive next-token prediction
## Training Data
Echo88-150M-Base was trained on the Echo88 pretraining dataset.
The packed dataset is split into training and evaluation blocks:
- **Train tokens:** 1,470,629,888
- **Eval tokens:** 1,454,080
- **Train blocks:** 718,081 blocks
- **Eval blocks:** 710 blocks
- **Block size:** 2048 tokens
- **Packed dtype:** uint16
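
The figures above are internally consistent: the block counts multiplied by the block size give the token counts, and a 32,768-entry vocabulary fits in `uint16`. The sketch below checks this and shows one way such data could be read. It assumes the packed data is a flat binary file of `uint16` token ids, which this card does not state; `load_blocks` and `demo.bin` are hypothetical names used only for illustration.

```python
import os
import tempfile

import numpy as np

BLOCK_SIZE = 2048       # tokens per block, from the list above
VOCAB_SIZE = 32_768     # from Model Details
TRAIN_BLOCKS = 718_081
EVAL_BLOCKS = 710

# Every token id must fit in an unsigned 16-bit integer (max 65,535).
assert VOCAB_SIZE - 1 <= np.iinfo(np.uint16).max

# Block counts times block size reproduce the token counts listed above.
train_tokens = TRAIN_BLOCKS * BLOCK_SIZE
eval_tokens = EVAL_BLOCKS * BLOCK_SIZE


def load_blocks(path, block_size=BLOCK_SIZE):
    """Memory-map a flat uint16 token file and view it as (num_blocks, block_size).

    ASSUMPTION: the file is a raw, headerless array of uint16 token ids.
    """
    tokens = np.memmap(path, dtype=np.uint16, mode="r")
    num_blocks = len(tokens) // block_size
    return tokens[: num_blocks * block_size].reshape(num_blocks, block_size)


# Demo on a tiny synthetic file; "demo.bin" is a made-up name, not a dataset file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    rng = np.random.default_rng(0)
    fake_ids = rng.integers(0, VOCAB_SIZE, size=3 * BLOCK_SIZE, dtype=np.uint16)
    fake_ids.tofile(demo_path)
    demo_blocks = load_blocks(demo_path)
    demo_shape = demo_blocks.shape
    del demo_blocks  # release the memory map before the temp directory is removed
```

Memory-mapping keeps the full token array on disk, so only the blocks actually indexed are read into memory.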
The dataset includes a mixture of:
- public-domain book text
- Gutenberg-style older books
- UTZOO Usenet posts
- BYTE Magazine text
- PC Magazine text
- TIME Magazine text
- Internet Archive Magazine Rack OCR text
- computer and technology magazine text
- general historical magazine text
The dataset emphasizes the 1950s through the late 1980s, with a strong focus on early personal computing, printed magazines, Usenet, and older long-form writing.
Dataset used:
- `guus4324343/Echo88-Pretrain-1.17B`
## Intended Use
Echo88-150M-Base is intended for:
- causal language modeling
- retro / historical AI experiments
- small language model research
- continued pretraining
- instruction tuning
- 1980s-style assistant experiments
- computer-history language modeling
- training Echo88-150M-Instruct
Recommended flow:
```text
Echo88-150M-Base
→ supervised fine-tuning on Echo88-Instruct-173K
→ Echo88-150M-Instruct
```
## Not Instruction Tuned
This model is not instruction tuned.
It may not reliably follow commands, answer questions directly, or behave like a chat assistant. It is a base model that continues text.
Expected behavior:
* continues prompts
* completes paragraphs
* imitates old magazine/book/Usenet style
* may produce raw text instead of direct answers
* may hallucinate
* may repeat phrases
* may generate OCR-like artifacts
For chat behavior, fine-tune this model, or use an existing fine-tuned version, trained on:
* `guus4324343/Echo88-Instruct-173K`
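
Because the model continues text rather than answering it, prompts tend to work better when they look like the start of a document. The helpers below are a minimal sketch of that idea. The function names and header layouts are illustrative assumptions, not formats taken from the Echo88 training data, and may need adjusting.

```python
def frame_as_usenet_post(newsgroup: str, subject: str, opening: str = "") -> str:
    """Build a prompt shaped like the top of a Usenet article.

    ASSUMPTION: a two-line header is enough context; real articles carry
    more header fields, and the model may respond differently to them.
    """
    header = f"Newsgroups: {newsgroup}\nSubject: {subject}\n\n"
    return header + opening


def frame_as_magazine_article(title: str, opening: str = "") -> str:
    """Build a prompt shaped like a magazine article: title, blank line, body."""
    return f"{title.upper()}\n\n{opening}"


usenet_prompt = frame_as_usenet_post(
    newsgroup="net.micro",
    subject="Which modem should I buy?",
    opening="I am looking for a 1200 baud modem and",
)
magazine_prompt = frame_as_magazine_article(
    title="The Year of the Laptop",
    opening="When portable computers first appeared,",
)
```

Either string can be passed as `prompt` in the generation example under Example Usage.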
## Knowledge Boundary
Echo88 is built on a historical data mixture that ends in the late 1980s.
The model should not be expected to know modern topics such as:
* Google
* Wikipedia
* iPhone
* smartphones
* modern social media
* Windows 95 and later software
* COVID-19
* modern AI systems
* 2000s, 2010s, or 2020s events
Because this is a base model, it may still hallucinate if prompted about modern events. The later instruction-tuned model should be trained to respond more carefully to post-1988 topics.
## Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "guus4324343/Echo88-150M-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the `accelerate` package
)

# A base model continues text, so the prompt is the start of a passage.
prompt = "The personal computer revolution of the 1980s"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Fall back to EOS for padding in case the tokenizer defines no pad token.
pad_token_id = tokenizer.pad_token_id
if pad_token_id is None:
    pad_token_id = tokenizer.eos_token_id

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=160,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.05,
        pad_token_id=pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Training Configuration
Echo88-150M-Base was trained as a LLaMA-style decoder-only causal LM.
Main configuration:
```text
vocab_size: 32768
hidden_size: 768
intermediate_size: 2048
num_hidden_layers: 18
num_attention_heads: 12
num_key_value_heads: 4
max_position_embeddings: 2048
activation: SiLU / SwiGLU-style LLaMA MLP
normalization: RMSNorm
position encoding: RoPE
attention: grouped-query attention
```
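
The 163,606,272 parameter count listed under Model Details can be reproduced from these values. The sketch below does the arithmetic under two assumptions that this card does not state: the linear layers have no bias terms, and the output head is not tied to the input embeddings.

```python
# Values copied from the configuration above.
vocab_size = 32_768
hidden_size = 768
intermediate_size = 2_048
num_layers = 18
num_heads = 12
num_kv_heads = 4

head_dim = hidden_size // num_heads                # 64
kv_dim = num_kv_heads * head_dim                   # 256: K and V are narrower under GQA
queries_per_kv_head = num_heads // num_kv_heads    # 3 query heads share each K/V head

# Attention: q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
# SwiGLU-style MLP: gate, up, and down projections.
mlp = 3 * hidden_size * intermediate_size
# Two RMSNorm weight vectors per layer.
norms = 2 * hidden_size
per_layer = attention + mlp + norms

embeddings = vocab_size * hidden_size
lm_head = vocab_size * hidden_size                 # ASSUMPTION: untied output head
final_norm = hidden_size

total = embeddings + num_layers * per_layer + final_norm + lm_head
print(f"{total:,}")
```

With a tied output head the same arithmetic would give 138,440,448, so the listed count suggests separate input and output embedding matrices.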
Training setup:
```text
precision: bf16
sequence length: 2048
optimizer: AdamW
scheduler: cosine
weight decay: 0.1
gradient clipping: 1.0
max steps: 5610
training tokens: ~1.47B
```
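
The batch size is not listed above. If training made a single pass over the packed training set, which is an assumption and not something this card states, the step count implies roughly the following batch size:

```python
train_tokens = 1_470_629_888   # from Training Data
seq_len = 2_048                # sequence length above
max_steps = 5_610              # max steps above

# ASSUMPTION: exactly one pass over the packed training set.
tokens_per_step = train_tokens / max_steps
sequences_per_step = tokens_per_step / seq_len

print(f"{tokens_per_step:,.0f} tokens per step")
print(f"{sequences_per_step:.2f} sequences per step")
```

Under that assumption each optimizer step covers about 128 sequences of 2048 tokens, or about 262,144 tokens. Treat this as an inference from the listed numbers, not as a documented setting.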
## Limitations
Echo88-150M-Base is experimental and small.
Known limitations:
* not instruction tuned
* may hallucinate
* may repeat text
* may produce OCR-like artifacts
* may reflect outdated historical language or views
* may struggle with complex reasoning
* may not reliably refuse post-1988 topics
* may produce incomplete or strange continuations
* may mix unrelated historical/computer facts
The model is intended for research, experimentation, and creative retro AI work. It is not intended for high-stakes use.
## Bias and Historical Content
The training data includes historical books, magazines, and Usenet text. As a result, the model may reproduce outdated language, assumptions, stereotypes, or viewpoints present in older source material.
Users should review outputs carefully.
## Model Family
Planned Echo88 model family:
```text
Echo88-150M-Base
Echo88-150M-Instruct
Echo88-150M-Chat
```
## License
The model weights are released under the Apache 2.0 license.
The training dataset is mixed-source and is released separately under the `other` license tag. Users are responsible for checking dataset source rights, licensing, and suitability for their own use case.