# Reading Steiner: Web Content Extraction (Qwen3.5-2B)
Fine-tuned Qwen3.5-2B for web content extraction via index prediction, inspired by IndexLM (arXiv:2512.06641).
## Overview
This repository contains the training code and setup for fine-tuning a lightweight 2.2B-parameter LLM (Qwen3.5-2B) to perform web content extraction by predicting content-block indices. Instead of generating the extracted text token by token, the model predicts intervals like:

```
[[4, 5], [7, 23], [25, 46]]
```
This approach:

- Runs ~20-50× faster than generative extraction baselines
- Consumes far fewer tokens at inference
- Remains highly accurate for both main-content and query-relevant extraction
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-2B (2.2B params) |
| Fine-tuning Method | LoRA (rank 16, QLoRA 4-bit base) via Unsloth |
| Trainable Params | ~10.9M (0.49% of base) |
| Training Framework | TRL SFTTrainer |
| Dataset | OmAlve/reading-steiner-data |
| Max Sequence Length | 4096 tokens |
| Task | Index-based web content extraction |
## Dataset Format
The dataset is in conversational (ChatML) format with three roles:
- `system`: task instruction with format rules
- `user`: URL, page title, and indexed HTML blocks (`[i] <tag>content</tag>`)
- `assistant`: predicted intervals, `[[start1, end1], [start2, end2], ...]`, or `NA`
```json
{
  "messages": [
    {"role": "system", "content": "You are Reading Steiner, a web content extraction model..."},
    {"role": "user", "content": "URL: https://example.com\nTitle: ...\nBlocks:\n[1] <div>...\n[2] <p>..."},
    {"role": "assistant", "content": "[[4, 5], [7, 23], [25, 46]]"}
  ]
}
```
- Train split: 51,697 samples (449 MB)
- Eval split: 1,500 samples (14 MB)
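To inspect the data yourself, here is a minimal sketch using the `datasets` library (split names other than `train` are not guaranteed; check the dataset card):

```python
from datasets import load_dataset

# Load the extraction dataset from the Hugging Face Hub
ds = load_dataset("OmAlve/reading-steiner-data")
print(ds)  # shows available splits and their sizes

# Peek at one conversational sample
sample = ds["train"][0]
for msg in sample["messages"]:
    print(f"{msg['role']}: {msg['content'][:80]}")
```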
## Requirements

```bash
pip install unsloth trl transformers datasets accelerate trackio
```

Note: Unsloth will auto-download CUDA kernels. For the fastest path on Linux, just `pip install unsloth`.
## Training

### Quick Start (1 GPU, 24 GB VRAM recommended)

```bash
python train_reading_steiner.py
```
This script:

- Loads the dataset and applies Qwen3's chat template
- Loads Qwen3.5-2B in 4-bit via Unsloth
- Attaches LoRA adapters (rank 16)
- Fine-tunes for 1 epoch with cosine LR scheduling
- Pushes checkpoints and the final model to `OmAlve/reading-steiner-qwen3.5-2b`
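For reference, here is a minimal sketch of the Unsloth + LoRA setup the script performs (argument names follow standard Unsloth usage; the actual script may differ):

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model through Unsloth (QLoRA-style)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-2B",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach rank-16 LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```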
### Key Hyperparameters

| Parameter | Value | Notes |
|---|---|---|
| `learning_rate` | 2e-4 | Standard for LoRA fine-tuning |
| `r` (LoRA rank) | 16 | Good balance of capacity vs. memory |
| `lora_alpha` | 16 | α = r |
| `gradient_accumulation_steps` | 4 | Effective batch size = 4 |
| `warmup_steps` | 650 | ~5% of one epoch |
| `num_train_epochs` | 1 | Can increase to 2-3 for better convergence |
| `max_length` | 4096 | Supports long HTML pages |
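Wired into TRL, the values above correspond roughly to a config like this (a sketch assuming a recent TRL version; `per_device_train_batch_size=1` is an assumption implied by the effective batch size of 4):

```python
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="reading-steiner-qwen3.5-2b",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=650,
    num_train_epochs=1,
    per_device_train_batch_size=1,  # assumed: effective batch = 1 * 4
    gradient_accumulation_steps=4,
    max_length=4096,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=ds["train"],
    processing_class=tokenizer,
)
trainer.train()
```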
### Hardware Requirements
| Setup | VRAM | Time (est.) |
|---|---|---|
| RTX 3090 / A10G (24GB) | ~18GB | ~8-10 hours for 1 epoch |
| A100 (80GB) | ~20GB | ~3-4 hours for 1 epoch |
| T4 (16GB) | OOM without smaller batch | Not recommended |
## Inference
After training, you can load the model and run inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "OmAlve/reading-steiner-qwen3.5-2b")
model = model.merge_and_unload()  # Optional: merge adapters for faster inference

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B", trust_remote_code=True)

# Prepare input
messages = [
    {"role": "system", "content": "You are Reading Steiner..."},
    {"role": "user", "content": "URL: https://example.com\nTitle: ...\nBlocks:\n[1] <div>...</div>\n[2] <p>Main content here</p>"},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, return_tensors="pt", add_generation_prompt=True
)
inputs = inputs.to(model.device)

# Generate (low temperature keeps the interval output near-deterministic)
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.1)

# Decode only the newly generated tokens, not the prompt
result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
```
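The decoded completion is a string such as `[[4, 5], [7, 23]]` or `NA`, so recovering the extracted text is a small parsing step. Here is a minimal sketch, assuming `blocks` holds the same block strings that were numbered in the prompt and that intervals are 1-based and inclusive:

```python
import ast

def intervals_to_text(prediction: str, blocks: list[str]) -> str:
    """Expand predicted [start, end] intervals into the selected blocks' text."""
    prediction = prediction.strip()
    if prediction == "NA":
        return ""
    intervals = ast.literal_eval(prediction)  # e.g. [[4, 5], [7, 23]]
    selected = []
    for start, end in intervals:
        # Prompt block indices start at [1]; inclusive end is assumed
        selected.extend(blocks[start - 1:end])
    return "\n".join(selected)
```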
## Architecture Notes
### Why Qwen3.5-2B?
The paper *An Index-based Approach for Efficient and Effective Web Content Extraction* (arXiv:2512.06641) shows that even 0.6B-parameter models achieve strong results on this task. Qwen3.5-2B offers:
- Best quality in the <4B range
- Fast inference due to small size
- Multilingual support (English, Chinese, etc.)
- Great tokenizer for long contexts
### Why LoRA + Unsloth?
- Unsloth patches the model for 2× faster training and 70% less VRAM
- LoRA trains only 0.49% of parameters, making fine-tuning feasible on consumer GPUs
- 4-bit quantization (QLoRA) fits a 2.2B model in ~14GB VRAM during training
### VLM → Text-Only Workaround
Qwen3.5-2B is technically a vision-language model. To avoid the vision processor trying to parse text as images, we use AutoTokenizer directly (instead of the Unsloth-returned processor) and apply the chat template manually with tokenize=False.
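In code, the workaround amounts to the following pattern (a sketch of the approach described above, not an excerpt from the training script):

```python
from transformers import AutoTokenizer

# Use the text tokenizer directly, bypassing the VLM processor
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are Reading Steiner..."},
    {"role": "user", "content": "URL: ...\nTitle: ...\nBlocks:\n[1] <p>...</p>"},
]

# Render the chat template to a plain string first...
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ...then tokenize the string as ordinary text
inputs = tokenizer(text, return_tensors="pt")
```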
## Evaluation
The paper evaluates on:
- Main content extraction (Common Crawl pages)
- Query-relevant extraction (HotpotQA, Natural Questions, MuSiQue, TriviaQA, MultiHopRAG)
Metrics: Precision, Recall, F1 on extracted content intervals.
To evaluate your fine-tuned model, convert predicted intervals back to text and compare with ground truth using standard F1 metrics.
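Here is a minimal sketch of interval-level scoring, treating each covered block index as a unit (the paper's exact metric definition may differ):

```python
def interval_f1(pred: list[list[int]], gold: list[list[int]]) -> float:
    """F1 over the sets of block indices covered by predicted vs. gold intervals."""
    expand = lambda ivs: {i for s, e in ivs for i in range(s, e + 1)}
    p, g = expand(pred), expand(gold)
    if not p or not g:
        return float(p == g)  # 1.0 only if both are empty
    precision = len(p & g) / len(p)
    recall = len(p & g) / len(g)
    return 2 * precision * recall / (precision + recall)

print(interval_f1([[4, 5], [7, 23]], [[4, 5], [7, 20]]))  # partial overlap
```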
## Citation

If you use this model or dataset, please cite:

```bibtex
@article{indexlm2025,
  title={An Index-based Approach for Efficient and Effective Web Content Extraction},
  author={IndexLM Authors},
  journal={arXiv preprint arXiv:2512.06641},
  year={2025}
}
```
## License
- Model: Apache-2.0 (Qwen3.5-2B)
- Code: MIT
- Dataset: Check dataset card for specific license
## Acknowledgments
- Built with Unsloth for fast fine-tuning
- Powered by TRL and Transformers
- Dataset curated by OmAlve