# Reading Steiner: Web Content Extraction (Qwen3.5-2B)
Fine-tuned Qwen3.5-2B for web content extraction via index prediction, inspired by IndexLM (arXiv:2512.06641).
## Overview
This repository contains the training code and setup for fine-tuning a lightweight 2.2B-parameter LLM (Qwen3.5-2B) to perform web content extraction by predicting content-block indices. Instead of generating the extracted text token by token, the model predicts intervals like:

```
[[4, 5], [7, 23], [25, 46]]
```
This approach:

- Runs ~20-50× faster than generative extraction baselines
- Consumes far fewer tokens at inference
- Remains highly accurate for both main-content and query-relevant extraction
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-2B (2.2B params) |
| Fine-tuning Method | LoRA (rank 16, QLoRA 4-bit base) via Unsloth |
| Trainable Params | ~10.9M (0.49% of base) |
| Training Framework | TRL SFTTrainer |
| Dataset | OmAlve/reading-steiner-data |
| Max Sequence Length | 4096 tokens |
| Task | Index-based web content extraction |
## Dataset Format
The dataset is in conversational (ChatML) format with three roles:
- `system`: task instruction with format rules
- `user`: URL, page title, and indexed HTML blocks (`[i] <tag>content</tag>`)
- `assistant`: predicted intervals, `[[start1, end1], [start2, end2], ...]`, or `NA`
```json
{
  "messages": [
    {"role": "system", "content": "You are Reading Steiner, a web content extraction model..."},
    {"role": "user", "content": "URL: https://example.com\nTitle: ...\nBlocks:\n[1] <div>...\n[2] <p>..."},
    {"role": "assistant", "content": "[[4, 5], [7, 23], [25, 46]]"}
  ]
}
```
- Train split: 51,697 samples (449 MB)
- Eval split: 1,500 samples (14 MB)
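To inspect the data yourself, here is a minimal sketch using the `datasets` library (split names other than `train` are not guaranteed; check the dataset card):

```python
from datasets import load_dataset

# Load the extraction dataset from the Hugging Face Hub
ds = load_dataset("OmAlve/reading-steiner-data")
print(ds)  # shows available splits and their sizes

# Peek at one conversational sample
sample = ds["train"][0]
for msg in sample["messages"]:
    print(f"{msg['role']}: {msg['content'][:80]}")
```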
## Requirements

```bash
pip install unsloth trl transformers datasets accelerate trackio
```

Note: Unsloth will auto-download CUDA kernels. For the fastest path on Linux, just `pip install unsloth`.
## Training

### Quick Start (1 GPU, 24 GB VRAM recommended)

```bash
python train_reading_steiner.py
```
This script:

- Loads the dataset and applies Qwen3's chat template
- Loads Qwen3.5-2B in 4-bit via Unsloth
- Attaches LoRA adapters (rank 16)
- Fine-tunes for 1 epoch with cosine LR scheduling
- Pushes checkpoints and the final model to `OmAlve/reading-steiner-qwen3.5-2b`
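For reference, here is a minimal sketch of the Unsloth + LoRA setup the script performs (argument names follow standard Unsloth usage; the actual script may differ):

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model through Unsloth (QLoRA-style)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-2B",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach rank-16 LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```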
### Key Hyperparameters

| Parameter | Value | Notes |
|---|---|---|
| `learning_rate` | 2e-4 | Standard for LoRA fine-tuning |
| `r` (LoRA rank) | 16 | Good balance of capacity vs. memory |
| `lora_alpha` | 16 | α = r |
| `gradient_accumulation_steps` | 4 | Effective batch size = 4 |
| `warmup_steps` | 650 | ~5% of one epoch |
| `num_train_epochs` | 1 | Can increase to 2-3 for better convergence |
| `max_length` | 4096 | Supports long HTML pages |
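Wired into TRL, the values above correspond roughly to a config like this (a sketch assuming a recent TRL version; `per_device_train_batch_size=1` is an assumption implied by the effective batch size of 4):

```python
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="reading-steiner-qwen3.5-2b",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=650,
    num_train_epochs=1,
    per_device_train_batch_size=1,  # assumed: effective batch = 1 * 4
    gradient_accumulation_steps=4,
    max_length=4096,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=ds["train"],
    processing_class=tokenizer,
)
trainer.train()
```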
### Hardware Requirements
| Setup | VRAM | Time (est.) |
|---|---|---|
| RTX 3090 / A10G (24GB) | ~18GB | ~8-10 hours for 1 epoch |
| A100 (80GB) | ~20GB | ~3-4 hours for 1 epoch |
| T4 (16GB) | OOM without smaller batch | Not recommended |
## Inference
After training, you can load the model and run inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "OmAlve/reading-steiner-qwen3.5-2b")
model = model.merge_and_unload()  # Optional: merge adapters for faster inference

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B", trust_remote_code=True)

# Prepare input
messages = [
    {"role": "system", "content": "You are Reading Steiner..."},
    {"role": "user", "content": "URL: https://example.com\nTitle: ...\nBlocks:\n[1] <div>...</div>\n[2] <p>Main content here</p>"},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, return_tensors="pt", add_generation_prompt=True
)
inputs = inputs.to(model.device)

# Generate (low temperature keeps the interval output near-deterministic)
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.1)

# Decode only the newly generated tokens, not the prompt
result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
```
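The decoded completion is a string such as `[[4, 5], [7, 23]]` or `NA`, so recovering the extracted text is a small parsing step. Here is a minimal sketch, assuming `blocks` holds the same block strings that were numbered in the prompt and that intervals are 1-based and inclusive:

```python
import ast

def intervals_to_text(prediction: str, blocks: list[str]) -> str:
    """Expand predicted [start, end] intervals into the selected blocks' text."""
    prediction = prediction.strip()
    if prediction == "NA":
        return ""
    intervals = ast.literal_eval(prediction)  # e.g. [[4, 5], [7, 23]]
    selected = []
    for start, end in intervals:
        # Prompt block indices start at [1]; inclusive end is assumed
        selected.extend(blocks[start - 1:end])
    return "\n".join(selected)
```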
## Architecture Notes
### Why Qwen3.5-2B?
The paper *An Index-based Approach for Efficient and Effective Web Content Extraction* (arXiv:2512.06641) shows that even 0.6B-parameter models achieve strong results on this task. Qwen3.5-2B offers:
- Best quality in the <4B range
- Fast inference due to small size
- Multilingual support (English, Chinese, etc.)
- Great tokenizer for long contexts
### Why LoRA + Unsloth?
- Unsloth patches the model for 2× faster training and 70% less VRAM
- LoRA trains only 0.49% of parameters, making fine-tuning feasible on consumer GPUs
- 4-bit quantization (QLoRA) fits a 2.2B model in ~14GB VRAM during training
### VLM → Text-Only Workaround
Qwen3.5-2B is technically a vision-language model. To avoid the vision processor trying to parse text as images, we use AutoTokenizer directly (instead of the Unsloth-returned processor) and apply the chat template manually with tokenize=False.
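In code, the workaround amounts to the following pattern (a sketch of the approach described above, not an excerpt from the training script):

```python
from transformers import AutoTokenizer

# Use the text tokenizer directly, bypassing the VLM processor
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are Reading Steiner..."},
    {"role": "user", "content": "URL: ...\nTitle: ...\nBlocks:\n[1] <p>...</p>"},
]

# Render the chat template to a plain string first...
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ...then tokenize the string as ordinary text
inputs = tokenizer(text, return_tensors="pt")
```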
## Evaluation
The paper evaluates on:
- Main content extraction (Common Crawl pages)
- Query-relevant extraction (HotpotQA, Natural Questions, MuSiQue, TriviaQA, MultiHopRAG)
Metrics: Precision, Recall, F1 on extracted content intervals.
To evaluate your fine-tuned model, convert predicted intervals back to text and compare with ground truth using standard F1 metrics.
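Here is a minimal sketch of interval-level scoring, treating each covered block index as a unit (the paper's exact metric definition may differ):

```python
def interval_f1(pred: list[list[int]], gold: list[list[int]]) -> float:
    """F1 over the sets of block indices covered by predicted vs. gold intervals."""
    expand = lambda ivs: {i for s, e in ivs for i in range(s, e + 1)}
    p, g = expand(pred), expand(gold)
    if not p or not g:
        return float(p == g)  # 1.0 only if both are empty
    precision = len(p & g) / len(p)
    recall = len(p & g) / len(g)
    return 2 * precision * recall / (precision + recall)

print(interval_f1([[4, 5], [7, 23]], [[4, 5], [7, 20]]))  # partial overlap
```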
## Citation

If you use this model or dataset, please cite:

```bibtex
@article{indexlm2025,
  title={An Index-based Approach for Efficient and Effective Web Content Extraction},
  author={IndexLM Authors},
  journal={arXiv preprint arXiv:2512.06641},
  year={2025}
}
```
## License
- Model: Apache-2.0 (Qwen3.5-2B)
- Code: MIT
- Dataset: Check dataset card for specific license
## Acknowledgments
- Built with Unsloth for fast fine-tuning
- Powered by TRL and Transformers
- Dataset curated by OmAlve