
Reading Steiner: Web Content Extraction (Qwen3.5-2B)

Fine-tuned Qwen3.5-2B for web content extraction via index prediction, inspired by IndexLM (arXiv:2512.06641).


Overview

This repository contains the training code and setup for fine-tuning a lightweight 2.2B-parameter LLM (Qwen3.5-2B) to perform web content extraction by predicting content-block indices. Instead of generating the extracted text token by token, the model predicts intervals like:

[[4, 5], [7, 23], [25, 46]]

Compared with generative extraction baselines, this approach:

  • Runs ~20-50× faster
  • Consumes far fewer tokens at inference
  • Stays highly accurate for both main-content and query-relevant extraction

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-2B (2.2B params) |
| Fine-tuning Method | LoRA (rank 16, QLoRA 4-bit base) via Unsloth |
| Trainable Params | ~10.9M (0.49% of base) |
| Training Framework | TRL SFTTrainer |
| Dataset | OmAlve/reading-steiner-data |
| Max Sequence Length | 4096 tokens |
| Task | Index-based web content extraction |

Dataset Format

The dataset is in conversational (ChatML) format with three roles:

  • system: Task instruction with format rules
  • user: URL, page title, and indexed HTML blocks ([i] <tag>content</tag>)
  • assistant: Predicted intervals [[start1, end1], [start2, end2], ...], or NA

{
  "messages": [
    {"role": "system", "content": "You are Reading Steiner, a web content extraction model..."},
    {"role": "user", "content": "URL: https://example.com\nTitle: ...\nBlocks:\n[1] <div>...\n[2] <p>..."},
    {"role": "assistant", "content": "[[4, 5], [7, 23], [25, 46]]"}
  ]
}
  • Train split: 51,697 samples (449 MB)
  • Eval split: 1,500 samples (14 MB)
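
To peek at a sample, you can load the dataset with the datasets library; a minimal sketch, assuming the messages column shown above:

from datasets import load_dataset

ds = load_dataset("OmAlve/reading-steiner-data", split="train")
sample = ds[0]["messages"]
print(sample[1]["content"][:200])  # user turn: URL, title, indexed blocks
print(sample[2]["content"])        # assistant turn: the interval list or NA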

Requirements

pip install unsloth trl transformers datasets accelerate trackio

Note: Unsloth will auto-download CUDA kernels. For the fastest path on Linux, just pip install unsloth.


Training

Quick Start (1 GPU, 24GB VRAM recommended)

python train_reading_steiner.py

This script (a condensed sketch follows this list):

  1. Loads the dataset and applies Qwen3's chat template
  2. Loads Qwen3.5-2B in 4-bit via Unsloth
  3. Attaches LoRA adapters (rank 16)
  4. Fine-tunes for 1 epoch with cosine LR scheduling
  5. Pushes checkpoints and final model to OmAlve/reading-steiner-qwen3.5-2b
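
The script itself is not reproduced here; the following is a minimal sketch of the same flow, assuming Unsloth's FastLanguageModel API and TRL's SFTTrainer. The target_modules list and the per-device batch size of 1 are assumptions, not values confirmed by this card:

from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Steps 1-2: load the 4-bit base model and tokenizer via Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-2B",
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA: 4-bit quantized base weights
)

# Step 3: attach LoRA adapters (rank 16, alpha = r)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed set
)

# Render each ChatML conversation into a single training string
dataset = load_dataset("OmAlve/reading-steiner-data", split="train")
dataset = dataset.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)
})

# Step 4: fine-tune for one epoch with cosine LR scheduling
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,  # assumed; effective batch = 4
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_steps=650,
        num_train_epochs=1,
        max_length=4096,
        output_dir="outputs",
    ),
)
trainer.train()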

Key Hyperparameters

| Parameter | Value | Notes |
|---|---|---|
| learning_rate | 2e-4 | Standard for LoRA fine-tuning |
| r (LoRA rank) | 16 | Good balance of capacity vs. memory |
| lora_alpha | 16 | α = r |
| gradient_accumulation_steps | 4 | Effective batch size = 4 |
| warmup_steps | 650 | 5% of one epoch |
| num_train_epochs | 1 | Can increase to 2-3 for better convergence |
| max_length | 4096 | Supports long HTML pages |

Hardware Requirements

| Setup | VRAM | Time (est.) |
|---|---|---|
| RTX 3090 / A10G (24GB) | ~18GB | ~8-10 hours for 1 epoch |
| A100 (80GB) | ~20GB | ~3-4 hours for 1 epoch |
| T4 (16GB) | OOM without a smaller batch | Not recommended |
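
If a 16GB card is all you have, settings along the following lines may let training fit by shrinking the batch and truncating long pages. These exact values are untested assumptions, not part of the released recipe:

from trl import SFTConfig

low_mem_args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # keeps the effective batch comparable
    max_length=2048,                 # truncates very long HTML pages
    gradient_checkpointing=True,     # trades compute for activation memory
    fp16=True,                       # T4 has no bfloat16 support
    output_dir="outputs",
)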

Inference

After training, you can load the model and run inference:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "OmAlve/reading-steiner-qwen3.5-2b")
model = model.merge_and_unload()  # Optional: merge for faster inference

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B", trust_remote_code=True)

# Prepare input
messages = [
    {"role": "system", "content": "You are Reading Steiner..."},
    {"role": "user", "content": "URL: https://example.com\nTitle: ...\nBlocks:\n[1] <div>...</div>\n[2] <p>Main content here</p>"},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

# Generate (temperature only takes effect with sampling enabled)
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.1)

# Decode only the newly generated tokens, not the echoed prompt
result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
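
If you called merge_and_unload(), the merged weights can be saved and reloaded later as a standalone checkpoint (the local path is just an example):

model.save_pretrained("reading-steiner-merged")
tokenizer.save_pretrained("reading-steiner-merged")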

Architecture Notes

Why Qwen3.5-2B?

The paper An Index-based Approach for Efficient and Effective Web Content Extraction (arXiv:2512.06641) shows that even 0.6B-parameter models achieve strong results on this task. Qwen3.5-2B offers:

  • Strong quality among open models under 4B parameters
  • Fast inference thanks to its small size
  • Multilingual support (English, Chinese, etc.)
  • An efficient tokenizer, which helps with long HTML contexts

Why LoRA + Unsloth?

  • Unsloth patches the model for 2× faster training and 70% less VRAM
  • LoRA trains only 0.49% of parameters (see the check after this list), making fine-tuning feasible on consumer GPUs
  • 4-bit quantization (QLoRA) fits a 2.2B model in ~14GB VRAM during training
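
The 0.49% figure is easy to verify once the LoRA-wrapped model is loaded; a minimal check in plain PyTorch (note that 4-bit weight packing can skew the total count on quantized models):

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")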

VLM → Text-Only Workaround

Qwen3.5-2B is technically a vision-language model. To avoid the vision processor trying to parse text as images, we use AutoTokenizer directly (instead of the Unsloth-returned processor) and apply the chat template manually with tokenize=False.
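
Concretely, the workaround amounts to something like this (a sketch, not the exact training code):

from transformers import AutoTokenizer

# Use the text tokenizer directly so the vision processor never runs
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B", trust_remote_code=True)

def render(example):
    # tokenize=False returns a plain string, so no image preprocessing is triggered
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}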


Evaluation

The paper evaluates on:

  • Main content extraction (Common Crawl pages)
  • Query-relevant extraction (HotpotQA, Natural Questions, MuSiQue, TriviaQA, MultiHopRAG)

Metrics: Precision, Recall, F1 on extracted content intervals.

To evaluate your fine-tuned model, convert predicted intervals back to text and compare with ground truth using standard F1 metrics.
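
A sketch of that conversion plus a token-level F1, assuming intervals are 1-indexed and inclusive (the exact convention is not spelled out here, so check the dataset card):

import json

def intervals_to_text(pred: str, blocks: list[str]) -> str:
    # blocks[i-1] holds the content of indexed block [i]; "NA" means no relevant content
    if pred.strip() == "NA":
        return ""
    spans = json.loads(pred)                  # e.g. [[4, 5], [7, 23]]
    picked = []
    for start, end in spans:
        picked.extend(blocks[start - 1:end])  # assumed inclusive, 1-indexed
    return "\n".join(picked)

def token_f1(pred_text: str, gold_text: str) -> float:
    pred, gold = pred_text.split(), gold_text.split()
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)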


Citation

If you use this model or dataset, please cite:

@article{indexlm2025,
  title={An Index-based Approach for Efficient and Effective Web Content Extraction},
  author={IndexLM Authors},
  journal={arXiv preprint arXiv:2512.06641},
  year={2025}
}

License

  • Model: Apache-2.0 (Qwen3.5-2B)
  • Code: MIT
  • Dataset: Check dataset card for specific license
