Attention Residuals 0.6B Full

This is the 0.6B checkpoint with full Attention Residuals from the attention-residuals-reproduction project: a Qwen3-style decoder-only Transformer trained from scratch on Chinese data.

Model Details

  • Mode: full
  • Architecture: Qwen3-style causal language model
  • Residual type: Full Attention Residuals
  • Hidden size: 1024
  • Layers: 28
  • Attention heads: 16
  • KV heads: 8
  • FFN intermediate size: 3072
  • Sequence length: 1024
  • Training steps: 20,000
  • Training data: opencsg/Fineweb-Edu-Chinese-V2.2

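For reference, the hyperparameters above map onto a Qwen3-style configuration roughly as sketched below. This is illustrative only: the field names follow the Hugging Face Qwen3Config convention, and values not listed in this card (such as vocabulary size) are omitted, so the checkpoint's own config.json remains authoritative.

from transformers import Qwen3Config  # field names assumed from the standard Qwen3 config

config = Qwen3Config(
    hidden_size=1024,               # Hidden size
    num_hidden_layers=28,           # Layers
    num_attention_heads=16,         # Attention heads
    num_key_value_heads=8,          # KV heads (grouped-query attention)
    intermediate_size=3072,         # FFN intermediate size
    max_position_embeddings=1024,   # Training sequence length
)
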
Intended Use

This checkpoint is mainly intended for research comparison with the baseline and other Attention Residuals variants. It is not instruction-tuned and should not be used as a chat model.

Evaluation

  • Chinese held-out perplexity (PPL): 57.34
  • C-Eval accuracy: 0.2926
  • CMMLU accuracy: 0.2188
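
Since this checkpoint is mainly meant for side-by-side comparison with the baseline and block variants, the held-out perplexity can be recomputed along the lines sketched below. This is a minimal sketch under stated assumptions, not the project's evaluation script: the held-out split, truncation at 1024 tokens, and per-document averaging are placeholders that would need to match the original setup to reproduce the number above.

import math
import torch
from transformers import AutoTokenizer
from modeling_attnres import Qwen3AttnResForCausalLM

repo_id = "Ethangou/attention-residuals-0.6B-full"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = Qwen3AttnResForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def heldout_ppl(texts, max_len=1024):
    # texts: a list of held-out Chinese documents (placeholder for the project's eval split)
    nll_sum, n_targets = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # loss = mean NLL over shifted targets
        n = enc["input_ids"].shape[1] - 1
        nll_sum += out.loss.item() * n
        n_targets += n
    return math.exp(nll_sum / n_targets)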

Notes

The full variant has substantially higher memory cost than the block variant at the 0.6B scale. In this project, the 0.6B full experiment is therefore better treated as a supplementary run under a shorter sequence-length setup (seq_len=1024 here) rather than a directly matched comparison against the seq_len=2048 baseline and block runs.

Usage

import torch
from transformers import AutoTokenizer
from modeling_attnres import Qwen3AttnResForCausalLM  # custom modeling file from this project

repo_id = "Ethangou/attention-residuals-0.6B-full"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = Qwen3AttnResForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

prompt = "人工智能的发展"  # "The development of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
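
Because the model class is imported directly from modeling_attnres, the script expects the project's modeling_attnres.py to be importable (for example, run it from a checkout of the reproduction repo or place that file next to the script). The sampling settings above are only illustrative; greedy decoding with do_sample=False works as well.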