FlashLM v10 FSP

A 3.74M parameter language model with Future Sentence Prediction, trained entirely on free-tier CPU in 2 hours.

Key Results

| Metric | v10.2 (baseline) | v10 FSP |
|---|---|---|
| Val PPL | 25.08 | 10.24 |
| Training speed | ~2,000 tok/s | ~2,750 tok/s |
| Params | ~3.5M | 3.74M |
| Hardware | 4 vCPU | 4 vCPU |
| Time | 2h | 2h |

FSP achieves a ~2.45× validation-perplexity improvement (25.08 → 10.24) over standard token-level cross-entropy training at the same scale and compute budget.

What is FSP?

Future Sentence Prediction (FSP) adds a generative sentence-level auxiliary loss alongside the standard next-token cross-entropy loss. At subsampled positions, the model predicts a bag-of-words target over the next 64 tokens, which forces the backbone to encode future-planning information.
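Concretely, the BoW target at a position is a multi-hot vector over the vocabulary marking every token that occurs within the next τ = 64 positions; the sigmoid head is then trained against it with binary cross-entropy. A minimal sketch of the target construction (hypothetical helper, not the repo's exact code):

```python
# Sketch of the FSP bag-of-words target. `fsp_bow_target` is a
# hypothetical helper; vocab_size=4096 and tau=64 match the model card.
def fsp_bow_target(tokens, pos, vocab_size=4096, tau=64):
    """Multi-hot vector marking tokens that appear in the tau
    positions following `pos` (duplicates collapse to a single 1)."""
    target = [0.0] * vocab_size
    for tok in tokens[pos + 1 : pos + 1 + tau]:
        target[tok] = 1.0
    return target
```

The sigmoid output of the FSP head is compared against this vector element-wise, so repeated future tokens contribute the same target as a single occurrence.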

Architecture

```text
Embedding(4096, 256) + RoPE
  ├── Block ×4
  │     ├── RMSNorm → CausalSelfAttention(8 heads, d=256) → Residual
  │     └── RMSNorm → SwiGLU(d_ff=512) → Residual
  ├── RMSNorm → lm_head (weight-tied)
  └── FSP: Linear(256→256) → shared lm_head → sigmoid → BoW prediction
```
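The 3.74M figure can be checked with a quick back-of-the-envelope count from the dimensions above (biases and RMSNorm gains omitted since they add only a few thousand parameters; the lm_head adds nothing because it is weight-tied to the embedding):

```python
# Back-of-the-envelope parameter count for the architecture above.
d, d_ff, n_layers, vocab = 256, 512, 4, 4096

embedding = vocab * d        # 1,048,576 (lm_head is weight-tied)
attn      = 4 * d * d        # Q, K, V, and output projections
swiglu    = 3 * d * d_ff     # gate, up, and down projections
fsp_head  = d * d            # FSP Linear(256 -> 256)

total = embedding + n_layers * (attn + swiglu) + fsp_head
print(total)  # 3735552 -> ~3.74M
```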

Generation Samples

Prompt: "Once upon a time"

Once upon a time, there was a little girl named Sue. Sue was very sad because she could not find her toy. One day, she found a big box near her house.

Prompt: "A cat sat"

A cat sat on the bed. The cat saw the cat and wanted to help. The cat jumped on the bench and began to walk in the sky. The cat started to feel better and tried...

Prompt: "The little girl"

The little girl was scared and she wanted to see what was inside. She thought about what she had been in the door.

Limitations

  • Stories are grammatically correct but not logically coherent across sentences
  • Cross-sentence causal reasoning is still weak ("the cat walked in the sky")
  • Characters, dialogue, and sentence structure work well; causal chains do not
  • This is a research model demonstrating FSP training, not a production story generator

Usage

```python
import torch
from tokenizers import Tokenizer

# Load the BPE tokenizer and the best checkpoint on CPU
tokenizer = Tokenizer.from_file("tokenizer.json")
checkpoint = torch.load("best.pt", map_location="cpu")

# Build the model (see train_v10_fsp.py for the full architecture),
# then generate with temperature=0.8, top_p=0.9
```

See GitHub for full training code.
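The suggested decoding settings amount to standard temperature scaling plus nucleus (top-p) sampling. A self-contained sketch over raw logits, independent of the repo's generation code:

```python
import math, random

# Minimal temperature + top-p (nucleus) sampling step, using the
# suggested temperature=0.8, top_p=0.9 defaults. Pure-Python sketch.
def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep the smallest set of tokens whose cumulative prob >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept set and draw one token index.
    total = sum(probs[i] for i in kept)
    r = rng.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With a strongly peaked distribution the nucleus collapses to the top token, so low-entropy steps stay deterministic while uncertain steps remain diverse.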

Training Details

| Hyperparameter | Value |
|---|---|
| d_model | 256 |
| d_ff | 512 |
| n_heads | 8 |
| n_layers | 4 |
| seq_len | 256 |
| vocab | 4,096 (BPE) |
| LR | 5e-4 → 1e-5 (cosine) |
| Warmup | 200 steps |
| Batch | 4 × 8 (accum) |
| FSP tau | 64 tokens |
| FSP alpha | 0.1 |
| Weight decay | 0.1 |
| Dropout | 0.1 |
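The LR row corresponds to a linear warmup followed by cosine decay down to the floor. A sketch of that schedule, with `total_steps` as an assumed knob (the card does not state the total step count):

```python
import math

# Linear warmup then cosine decay, matching LR 5e-4 -> 1e-5 and
# 200 warmup steps from the table. `total_steps` is an assumption.
def lr_at(step, total_steps, peak=5e-4, floor=1e-5, warmup=200):
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```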

Citation

@misc{flashlm,
  author = {Cheng Chang},
  title = {FlashLM: CPU-Native Language Models},
  year = {2026},
  url = {https://github.com/changcheng967/FlashLM}
}

MIT License
