Attention Residuals 0.6B Full

This is the 0.6B checkpoint with full Attention Residuals from the attention-residuals-reproduction project: a Qwen3-style decoder-only Transformer trained from scratch on Chinese data.

Model Details

  • Mode: full
  • Architecture: Qwen3-style causal language model
  • Residual type: Full Attention Residuals
  • Hidden size: 1024
  • Layers: 28
  • Attention heads: 16
  • KV heads: 8
  • FFN intermediate size: 3072
  • Sequence length: 1024
  • Training steps: 20,000
  • Training data: opencsg/Fineweb-Edu-Chinese-V2.2

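For reference, the hyperparameters above map onto a Qwen3-style configuration roughly as sketched below. This is illustrative only: the field names follow the Hugging Face Qwen3Config convention, and values not listed in this card (such as vocabulary size) are omitted, so the checkpoint's own config.json remains authoritative.

from transformers import Qwen3Config  # field names assumed from the standard Qwen3 config

config = Qwen3Config(
    hidden_size=1024,               # Hidden size
    num_hidden_layers=28,           # Layers
    num_attention_heads=16,         # Attention heads
    num_key_value_heads=8,          # KV heads (grouped-query attention)
    intermediate_size=3072,         # FFN intermediate size
    max_position_embeddings=1024,   # Training sequence length
)
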
Intended Use

This checkpoint is mainly intended for research comparison with the baseline and other Attention Residuals variants. It is not instruction-tuned and should not be used as a chat model.

Evaluation

  • Chinese held-out perplexity (PPL): 57.34
  • C-Eval accuracy: 0.2926
  • CMMLU accuracy: 0.2188
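
Since this checkpoint is mainly meant for side-by-side comparison with the baseline and block variants, the held-out perplexity can be recomputed along the lines sketched below. This is a minimal sketch under stated assumptions, not the project's evaluation script: the held-out split, truncation at 1024 tokens, and per-document averaging are placeholders that would need to match the original setup to reproduce the number above.

import math
import torch
from transformers import AutoTokenizer
from modeling_attnres import Qwen3AttnResForCausalLM

repo_id = "Ethangou/attention-residuals-0.6B-full"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = Qwen3AttnResForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def heldout_ppl(texts, max_len=1024):
    # texts: a list of held-out Chinese documents (placeholder for the project's eval split)
    nll_sum, n_targets = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # loss = mean NLL over shifted targets
        n = enc["input_ids"].shape[1] - 1
        nll_sum += out.loss.item() * n
        n_targets += n
    return math.exp(nll_sum / n_targets)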

Notes

The full variant has substantially higher memory cost than the block variant at the 0.6B scale. In this project, the 0.6B full experiment is therefore better treated as a supplementary run under a shorter sequence-length setup (seq_len=1024 here) rather than a directly matched comparison against the seq_len=2048 baseline and block runs.

Usage

import torch
from transformers import AutoTokenizer
from modeling_attnres import Qwen3AttnResForCausalLM  # custom modeling file from this project

repo_id = "Ethangou/attention-residuals-0.6B-full"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = Qwen3AttnResForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

prompt = "人工智能的发展"  # "The development of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
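
Because the model class is imported directly from modeling_attnres, the script expects the project's modeling_attnres.py to be importable (for example, run it from a checkout of the reproduction repo or place that file next to the script). The sampling settings above are only illustrative; greedy decoding with do_sample=False works as well.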