Qwen3-8b-HiCI-48k-500steps

Model Description

This is a LoRA adapter for Qwen3-8B with HiCI (Hierarchical Construction-Integration) memory architecture, trained for long-context understanding up to 48K tokens.

Paper: HiCI (arXiv 2603.20843)
Base: LongLoRA (ICLR 2024 Oral)

HiCI Architecture

Three-stage hierarchy per transformer layer:

  1. Local Construction — M learnable query slots attend to each segment via bottleneck cross-attention → local summary L_i
  2. Global Integration — multi-view statistics (mean/max/min/std/ℓ2-norm) → shared compression → attention-based selection → gated expansion → G
  3. Top-down Broadcast — per-segment attention with augmented KV=[G, L_i, segment tokens]; queries from segment tokens only
Example flow for the 48K configuration:

Input (48K tokens) → 8 segments × 6,144
  Stage 1: 8 local slots per segment → L_i
  Stage 2: multi-view stats → K=4 global slots G
  Stage 3: Q=[segment], KV=[G, L_i, segment] → Flash Attention
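Stage 1 can be sketched as a small cross-attention module: M learnable slots query one segment through a 512-dim bottleneck and return its local summary L_i. This is a hypothetical illustration with toy sizes and invented module/variable names, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class LocalConstruction(nn.Module):
    """Hypothetical sketch of HiCI Stage 1: M learnable query slots
    cross-attend to one segment through a low-rank bottleneck."""
    def __init__(self, hidden=4096, bottleneck=512, num_slots=8, num_heads=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, bottleneck) * 0.02)
        self.down = nn.Linear(hidden, bottleneck)      # segment tokens -> bottleneck
        self.attn = nn.MultiheadAttention(bottleneck, num_heads, batch_first=True)
        self.up = nn.Linear(bottleneck, hidden)        # back to model width

    def forward(self, segment):                        # segment: (B, S, hidden)
        B = segment.size(0)
        kv = self.down(segment)                        # (B, S, bottleneck)
        q = self.slots.unsqueeze(0).expand(B, -1, -1)  # (B, M, bottleneck)
        summary, _ = self.attn(q, kv, kv)              # slots attend to segment tokens
        return self.up(summary)                        # L_i: (B, M, hidden)

# toy check with small sizes
mod = LocalConstruction(hidden=64, bottleneck=32, num_slots=8, num_heads=4)
L_i = mod(torch.randn(2, 100, 64))
print(L_i.shape)  # torch.Size([2, 8, 64])
```

The defaults mirror the training config below (8 slots, 8 memory heads, bottleneck dim 512); Stage 2 would then pool the per-segment L_i into K=4 global slots G.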

Trainable Components

adapter_model.safetensors  (27 MB)
└── LoRA Adapters (r=8, alpha=16): q_proj, k_proj, v_proj, o_proj

trainable_params.bin  (~4 GB)
├── global_memory.*            — Local Construction modules (36 layers)
├── hierarchical_aggregator.*  — Global Integration modules (36 layers)
├── self_attn.q_norm / k_norm  — QK-Norm weights (Qwen3-specific, 36 layers)
├── input_layernorm / post_attention_layernorm — LayerNorm weights (36 layers)
├── model.embed_tokens.weight  — Token embeddings
└── model.norm.weight          — Final LayerNorm
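The split above amounts to routing each checkpoint tensor by key pattern: LoRA weights to adapter_model.safetensors, HiCI memory modules and norms to trainable_params.bin, everything else frozen. A sketch with the patterns taken from the listing (the `route` helper itself is hypothetical):

```python
# Hypothetical helper mirroring the listing above: decide which file
# stores a given checkpoint tensor, or whether it stays frozen.
HICI_PATTERNS = (
    "global_memory.", "hierarchical_aggregator.",
    "self_attn.q_norm", "self_attn.k_norm",
    "input_layernorm", "post_attention_layernorm",
    "model.embed_tokens.weight", "model.norm.weight",
)

def route(key: str) -> str:
    if "lora_" in key:
        return "adapter_model.safetensors"   # LoRA A/B matrices
    if any(p in key for p in HICI_PATTERNS):
        return "trainable_params.bin"        # HiCI memory, norms, embeddings
    return "frozen"                          # untouched base-model weight

print(route("model.layers.0.self_attn.q_proj.lora_A.weight"))  # adapter_model.safetensors
print(route("model.layers.0.global_memory.slots"))             # trainable_params.bin
print(route("model.layers.0.mlp.gate_proj.weight"))            # frozen
```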

Training Details

  • Base Model: Qwen/Qwen3-8B
  • Context Length: 49,152 tokens (48K)
  • Segments: 8 × 6,144 tokens
  • Local Memory Slots (M): 8 per segment
  • Global Memory Slots (K): 4
  • Memory Heads: 8, Bottleneck dim: 512
  • LoRA: r=8, alpha=16, target: q/k/v/o_proj
  • Checkpoint: step 500 / 1000
  • Batch: per_device=1, grad_accum=8 (effective batch=8)
  • LR: 2e-5 (LoRA), 2e-4 (memory modules), grad clip=0.3
  • Precision: bf16
  • Hardware: 8× H200 141GB, DeepSpeed Stage 2
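The two-LR setup and gradient handling above can be sketched as AdamW parameter groups with accumulation and clipping. Toy tensors stand in for the real modules; nothing here is the repo's training script.

```python
import torch

# Sketch of the schedule above: LoRA adapters at 2e-5, HiCI memory
# modules at 2e-4, 8 accumulation micro-steps, gradients clipped to 0.3.
lora_w = torch.nn.Parameter(torch.zeros(4, 4))
mem_w = torch.nn.Parameter(torch.zeros(4, 4))
optim = torch.optim.AdamW([
    {"params": [lora_w], "lr": 2e-5},   # LoRA adapters
    {"params": [mem_w],  "lr": 2e-4},   # memory modules
])

grad_accum = 8
for micro_step in range(grad_accum):
    # scale each micro-batch loss so the accumulated gradient averages correctly
    loss = ((lora_w - 1.0) ** 2 + (mem_w - 1.0) ** 2).mean() / grad_accum
    loss.backward()                      # gradients accumulate across micro-steps
torch.nn.utils.clip_grad_norm_([lora_w, mem_w], max_norm=0.3)
optim.step()
optim.zero_grad()
print([g["lr"] for g in optim.param_groups])  # [2e-05, 0.0002]
```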

Usage

Requires qwen3_attn_hici.py from this repo.

import torch
import transformers
from peft import PeftModel
# Download qwen3_attn_hici.py from this repo first
import qwen3_attn_hici as hici_attn

# 1. Replace attention with HiCI BEFORE loading model
hici_attn.MIXED_GROUP_TRAINING = False
hici_attn.replace_qwen3_attn(
    use_flash_attn=True, use_full=False, use_hierarchical_forward=True
)

# 2. Load base model
base_model = transformers.AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 3. Register HiCI modules (must match training config)
hici_attn.register_hici_to_model(
    base_model,
    num_memory_slots=8,
    global_slots=4,
    num_heads=8,
    bottleneck_dim=512,
)

# 4. Load LoRA adapter + trainable_params
model = PeftModel.from_pretrained(base_model, "ZengXiangyu/Qwen3-8b-HiCI-48k-500steps")

# Load HiCI params (embed_tokens, norms, global_memory, hierarchical_aggregator)
from huggingface_hub import hf_hub_download
trainable_params_path = hf_hub_download(
    "ZengXiangyu/Qwen3-8b-HiCI-48k-500steps", "trainable_params.bin"
)
state_dict = torch.load(trainable_params_path, map_location="cpu")
# strict=False: only the HiCI keys are overwritten; key names may need a
# "base_model.model." prefix depending on how PeftModel wraps the model
model.load_state_dict(state_dict, strict=False)

# 5. Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained("ZengXiangyu/Qwen3-8b-HiCI-48k-500steps")
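At inference, the attention replacement splits the context into fixed-size segments before the three HiCI stages run. A minimal sketch of that segmentation with the 48K training shape; the `segment` helper is hypothetical, the real splitting lives inside qwen3_attn_hici.

```python
def segment(ids, seg_len=6144):
    """Split a token-id sequence into fixed-size segments, as in the
    48K config (8 segments x 6,144 tokens). Hypothetical helper."""
    return [ids[i:i + seg_len] for i in range(0, len(ids), seg_len)]

chunks = segment(list(range(49152)))   # 49,152-token context
print(len(chunks), len(chunks[0]))     # 8 6144
```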

Citation

@article{zeng2026hici,
  title={HiCI: Hierarchical Construction-Integration for Long-Context Attention},
  author={Zeng, Xiangyu and Xu, Qi and Wang, Yunke and Xu, Chang},
  journal={arXiv preprint arXiv:2603.20843},
  year={2026}
}

License

Apache 2.0 (follows Qwen3 license)
