metadata
license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-generation
pretty_name: Echo88 150M Base
tags:
  - text-generation
  - causal-lm
  - base-model
  - decoder-only
  - autoregressive
  - from-scratch
  - llama
  - retro
  - 1980s
  - usenet
  - magazines
  - books
  - computer-history
  - english
datasets:
  - guus4324343/Echo88-Pretrain-1.17B

Echo88-150M-Base

Echo88-150M-Base is a small English decoder-only causal language model trained from scratch on the Echo88 pretraining dataset.

Echo88 is designed as a retro language model that draws on the language, culture, computing press, magazines, Usenet discussions, and older book text available up to the late 1980s.

This is a base model, not an instruction-tuned chatbot. It is trained for next-token prediction and should be fine-tuned before being used as a helpful assistant.

Model Details

  • Model name: Echo88-150M-Base
  • Model type: decoder-only causal language model
  • Architecture: LLaMA-style transformer
  • Training type: from scratch
  • Parameter count: 163,606,272
  • Language: English
  • Context length: 2048 tokens
  • Tokenizer: custom Echo88 byte-level BPE tokenizer
  • Vocabulary size: 32,768
  • Training objective: autoregressive next-token prediction
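
The parameter count above can be reproduced from the dimensions listed under Training Configuration. The following is a minimal sketch, assuming a standard LLaMA-style layout: bias-free linear projections, a gated three-matrix MLP, two RMSNorm weight vectors per layer, and an output head that is not tied to the input embeddings. Those layout details are assumptions, not statements from the training code, although the total they produce matches the figure in this card.

```python
# Dimensions copied from the "Main configuration" list in this card.
vocab_size = 32768
hidden_size = 768
intermediate_size = 2048
num_hidden_layers = 18
num_attention_heads = 12
num_key_value_heads = 4

head_dim = hidden_size // num_attention_heads  # 64
kv_dim = num_key_value_heads * head_dim        # 256

# Attention: q and o are hidden x hidden, k and v are hidden x kv_dim.
# Assumption: no bias terms.
attn = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim

# Gated MLP: gate, up, and down projections.
mlp = 3 * hidden_size * intermediate_size

# Two RMSNorm weight vectors per layer.
norms = 2 * hidden_size

per_layer = attn + mlp + norms
embeddings = vocab_size * hidden_size
lm_head = vocab_size * hidden_size  # assumption: untied output head
final_norm = hidden_size

total_params = num_hidden_layers * per_layer + embeddings + lm_head + final_norm
print(f"{total_params:,}")  # 163,606,272
```

Under these assumptions the two embedding matrices account for roughly 50M of the total, which is typical for a small model with a 32,768-entry vocabulary.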

Training Data

Echo88-150M-Base was trained on the Echo88 pretraining dataset.

The packed dataset (train and eval splits) contains:

  • Train tokens: 1,470,629,888
  • Eval tokens: 1,454,080
  • Train blocks: 718,081
  • Eval blocks: 710
  • Block size: 2048 tokens
  • Packed dtype: uint16
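
The figures above are internally consistent: the block counts multiplied by the block size give the token counts, and a 32,768-entry vocabulary fits in uint16. The sketch below checks that arithmetic and shows one plausible way to read a packed file with NumPy. The file name and the flat on-disk layout are hypothetical; the card does not describe the actual file format, so the example writes and reads a small synthetic file instead.

```python
import os
import tempfile

import numpy as np

BLOCK_SIZE = 2048
VOCAB_SIZE = 32768

# Figures reported in this card: blocks times block size gives the token counts.
train_tokens = 718_081 * BLOCK_SIZE
eval_tokens = 710 * BLOCK_SIZE

# Every token ID is below 32,768, so it fits in an unsigned 16-bit integer.
fits_uint16 = VOCAB_SIZE - 1 <= np.iinfo(np.uint16).max

# Hypothetical stand-in for a packed file: 4 blocks of random token IDs.
rng = np.random.default_rng(0)
demo = rng.integers(0, VOCAB_SIZE, size=4 * BLOCK_SIZE, dtype=np.uint16)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_tokens.bin")  # hypothetical file name
    demo.tofile(path)
    # Memory-map the flat file and view it as (num_blocks, BLOCK_SIZE).
    packed = np.memmap(path, dtype=np.uint16, mode="r")
    blocks = np.array(packed.reshape(-1, BLOCK_SIZE))  # copy out before cleanup
    del packed  # release the mapping so the temp file can be removed

print(train_tokens, eval_tokens, blocks.shape)
```

Storing token IDs as uint16 halves the disk footprint relative to int32, which is why it is a common choice for vocabularies of 65,536 entries or fewer.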

The dataset includes a mixture of:

  • public-domain book text
  • Gutenberg-style older books
  • UTZOO Usenet posts
  • BYTE Magazine text
  • PC Magazine text
  • TIME Magazine text
  • Internet Archive Magazine Rack OCR text
  • computer and technology magazine text
  • general historical magazine text

The dataset emphasizes the 1950s through the late 1980s, with a strong focus on early personal computing, printed magazines, Usenet, and older long-form writing.

Dataset used:

  • guus4324343/Echo88-Pretrain-1.17B

Intended Use

Echo88-150M-Base is intended for:

  • causal language modeling
  • retro / historical AI experiments
  • small language model research
  • continued pretraining
  • instruction tuning
  • 1980s-style assistant experiments
  • computer-history language modeling
  • training Echo88-150M-Instruct

Recommended flow:

Echo88-150M-Base
→ supervised fine-tuning on Echo88-Instruct-173K
→ Echo88-150M-Instruct

Not Instruction Tuned

This model is not instruction tuned.

It may not reliably follow commands, answer questions directly, or behave like a chat assistant. It is a base model that continues text.

Expected behavior:

  • continues prompts
  • completes paragraphs
  • imitates old magazine/book/Usenet style
  • may produce raw text instead of direct answers
  • may hallucinate
  • may repeat phrases
  • may generate OCR-like artifacts

For chat behavior, use an instruction-tuned version, or create one by fine-tuning on:

  • guus4324343/Echo88-Instruct-173K

Knowledge Boundary

Echo88 is built on a historical data mixture that ends in the late 1980s.

The model should not be expected to know modern topics such as:

  • Google
  • Wikipedia
  • iPhone
  • smartphones
  • modern social media
  • Windows 95 and later software
  • COVID-19
  • modern AI systems
  • 2000s, 2010s, or 2020s events

Because this is a base model, it may still hallucinate if prompted about modern events. The later instruction-tuned model should be trained to respond more carefully to post-1988 topics.

Example Usage

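import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "guus4324343/Echo88-150M-Base"

# Load the tokenizer and the model weights in bf16.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Fall back to the EOS token for padding if the tokenizer defines no pad token.
pad_token_id = tokenizer.pad_token_id
if pad_token_id is None:
    pad_token_id = tokenizer.eos_token_id

# A base model continues text, so phrase the prompt as the start of a passage.
prompt = "The personal computer revolution of the 1980s"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=160,
        do_sample=True,           # sample instead of greedy decoding
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.05,  # mild penalty to reduce repeated phrases
        pad_token_id=pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# The decoded output includes the prompt followed by the continuation.
print(tokenizer.decode(output[0], skip_special_tokens=True))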

Training Configuration

Echo88-150M-Base was trained as a LLaMA-style decoder-only causal LM.

Main configuration:

vocab_size: 32768
hidden_size: 768
intermediate_size: 2048
num_hidden_layers: 18
num_attention_heads: 12
num_key_value_heads: 4
max_position_embeddings: 2048
activation: SiLU / SwiGLU-style LLaMA MLP
normalization: RMSNorm
position encoding: RoPE
attention: grouped-query attention
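
To illustrate what grouped-query attention means for this configuration, the sketch below works out how many query heads share each key/value head and estimates the key/value cache size for one sequence at the full 2048-token context. It assumes the cache is stored in bf16 (2 bytes per value), which the card does not state, and the comparison against a full multi-head cache is hypothetical, shown only for scale.

```python
# Dimensions copied from the configuration list above.
hidden_size = 768
num_hidden_layers = 18
num_attention_heads = 12
num_key_value_heads = 4
max_position_embeddings = 2048

head_dim = hidden_size // num_attention_heads                     # 64
queries_per_kv_head = num_attention_heads // num_key_value_heads  # 3

bytes_per_value = 2  # assumption: cache stored in bf16


def kv_cache_bytes(kv_heads: int, seq_len: int) -> int:
    """Bytes for keys plus values across all layers, for one sequence."""
    return 2 * num_hidden_layers * kv_heads * head_dim * seq_len * bytes_per_value


gqa_bytes = kv_cache_bytes(num_key_value_heads, max_position_embeddings)
# Hypothetical comparison: the same model with one KV head per query head.
mha_bytes = kv_cache_bytes(num_attention_heads, max_position_embeddings)

print(f"Query heads per KV head: {queries_per_kv_head}")
print(f"GQA cache: {gqa_bytes / 2**20:.0f} MiB")               # 36 MiB
print(f"Full multi-head cache: {mha_bytes / 2**20:.0f} MiB")   # 108 MiB
```

With 4 key/value heads instead of 12, the cache is one third the size it would be under full multi-head attention.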

Training setup:

precision: bf16
sequence length: 2048
optimizer: AdamW
scheduler: cosine
weight decay: 0.1
gradient clipping: 1.0
max steps: 5610
training tokens: ~1.47B
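
The step count and the token total above are consistent with roughly one pass over the packed training set. The sketch below derives the implied batch size, assuming each optimizer step consumes a fixed number of 2048-token blocks; the card does not state the batch size, so that figure is an inference. It also includes a plain cosine decay function for illustration. The peak learning rate, the minimum learning rate, and the absence of warmup are hypothetical, since the card gives none of them.

```python
import math

train_blocks = 718_081
block_size = 2048
max_steps = 5610

# Assumption: each optimizer step consumes a fixed number of blocks.
blocks_per_step, leftover_blocks = divmod(train_blocks, max_steps)
tokens_per_step = blocks_per_step * block_size
tokens_seen = max_steps * tokens_per_step

print(blocks_per_step, leftover_blocks)  # 128 1
print(f"{tokens_per_step:,} tokens per step")
print(f"{tokens_seen:,} tokens over {max_steps} steps")


def cosine_lr(step: int, peak_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr over max_steps.

    peak_lr and min_lr are hypothetical; the card does not report them.
    """
    progress = step / max_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under that assumption, 128 blocks per step is 262,144 tokens per step, and 5610 steps cover all but one block of the training set.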

Limitations

Echo88-150M-Base is experimental and small.

Known limitations:

  • not instruction tuned
  • may hallucinate
  • may repeat text
  • may produce OCR-like artifacts
  • may reflect outdated historical language or views
  • may struggle with complex reasoning
  • may not reliably refuse post-1988 topics
  • may produce incomplete or strange continuations
  • may mix unrelated historical/computer facts

The model is intended for research, experimentation, and creative retro AI work. It is not intended for high-stakes use.

Bias and Historical Content

The training data includes historical books, magazines, and Usenet text. As a result, the model may reproduce outdated language, assumptions, stereotypes, or viewpoints present in older source material.

Users should review outputs carefully.

Model Family

Planned Echo88 model family:

Echo88-150M-Base
Echo88-150M-Instruct
Echo88-150M-Chat

License

The model weights are released under the Apache 2.0 license.

The training dataset is mixed-source and is released separately, with its license listed as "other". Users are responsible for checking dataset source rights, licensing, and suitability for their own use case.