# Attention Residuals 0.6B Baseline
This is the 0.6B baseline checkpoint for the attention-residuals-reproduction project. It is a Qwen3-style decoder-only Transformer with standard (unmodified) residual connections, trained from scratch on Chinese data.
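For context, "standard residual connection" here means the usual pre-norm additive residual wiring used by Qwen-family models. The sketch below is purely illustrative and is not this project's training code; the class and argument names are made up for the example, and `nn.RMSNorm` requires a recent PyTorch (substitute `nn.LayerNorm` otherwise).

```python
import torch
import torch.nn as nn

class StandardResidualBlock(nn.Module):
    """Illustrative pre-norm Transformer block with plain additive residuals."""

    def __init__(self, hidden_size: int, self_attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.input_norm = nn.RMSNorm(hidden_size)
        self.post_attn_norm = nn.RMSNorm(hidden_size)
        self.self_attn = self_attn  # any module mapping (B, T, H) -> (B, T, H)
        self.mlp = mlp              # any module mapping (B, T, H) -> (B, T, H)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard residual: each sublayer's output is simply added back to its input.
        x = x + self.self_attn(self.input_norm(x))
        x = x + self.mlp(self.post_attn_norm(x))
        return x
```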
## Model Details
- Mode: `baseline`
- Architecture: Qwen3-style causal language model
- Residual type: standard residual connection
- Hidden size: 1024
- Layers: 28
- Attention heads: 16
- KV heads: 8
- FFN intermediate size: 3072
- Sequence length: 2048
- Training steps: 20,000
- Training data: `opencsg/Fineweb-Edu-Chinese-V2.2`
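The hyperparameters listed above correspond roughly to the following `transformers` configuration. This is a sketch, assuming a `transformers` release that ships `Qwen3Config`; fields not given in the card (vocabulary size, RoPE settings, etc.) are left at their defaults here and may differ from the actual checkpoint.

```python
from transformers import Qwen3Config

# Approximate configuration for the 0.6B baseline described above.
# Unlisted fields (vocab size, rope_theta, ...) are left at library defaults.
config = Qwen3Config(
    hidden_size=1024,
    num_hidden_layers=28,
    num_attention_heads=16,
    num_key_value_heads=8,
    intermediate_size=3072,
    max_position_embeddings=2048,
)
```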
## Intended Use
This checkpoint is intended primarily as a research baseline for comparison with the Attention Residuals variants. It is not instruction-tuned and should not be used as a chat model.
## Evaluation
| Metric | Result |
|---|---|
| Chinese Held-out PPL | 41.83 |
| C-Eval Acc | 0.2533 |
| CMMLU Acc | 0.2656 |
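The evaluation pipeline itself is not included in this card. As a rough reference, held-out perplexity can be computed per-document as in the sketch below; the function name is hypothetical, truncation length is a placeholder, and the exact protocol (chunking, stride, token accounting) behind the number above may differ.

```python
import math
import torch

@torch.no_grad()
def heldout_perplexity(model, tokenizer, texts, max_length=2048):
    """Token-level perplexity over a list of held-out documents (simplified)."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=max_length).input_ids.to(model.device)
        if ids.size(1) < 2:
            continue
        # With labels == input_ids, the model shifts internally and returns mean NLL
        # over the predicted positions (seq_len - 1 tokens).
        out = model(input_ids=ids, labels=ids)
        n_predicted = ids.size(1) - 1
        total_nll += out.loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)
```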
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the actual repo id (the placeholder username is "your-username").
repo_id = "your-username/attention-residuals-0.6B-baseline"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base-model completion on a Chinese prompt ("The development of artificial intelligence").
prompt = "人工智能的发展"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```