FinSenti-Qwen3.5-4B

FinSenti-Qwen3.5-4B is a 4.0B-parameter model fine-tuned to read short financial text (headlines, earnings snippets, market commentary) and explain its reasoning before settling on a positive, negative, or neutral label. It is the 4B model on the newer Qwen3.5 backbone; its output style is close to Qwen3-4B, with a slightly different feel owing to the updated pretraining data.

The model is part of the FinSenti collection, a scaling study of small models trained on the same data with the same recipe.

What it's good at

  • Classifying short financial text (1-3 sentences) into positive / negative / neutral
  • Producing a short reasoning chain you can read or log
  • Following a strict <reasoning>...</reasoning><answer>...</answer> output format that's easy to parse downstream

It was trained on news-style headlines and earnings snippets in English, so that's where it shines. Outside that domain you'll see the format hold up but the labels get noisier.
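Because the output format is strict, downstream parsing can be a couple of regexes. A minimal sketch (the function and pattern names here are illustrative, not part of the model release):

```python
import re

# Sketch of a downstream parser for the fixed
# <reasoning>...</reasoning><answer>...</answer> output format.
ANSWER_RE = re.compile(r"<answer>\s*(positive|negative|neutral)\s*</answer>", re.IGNORECASE)
REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)

def parse_output(text: str):
    """Return (label, reasoning), or (None, None) if no valid label is found."""
    answer = ANSWER_RE.search(text)
    if not answer:
        return None, None
    reasoning = REASONING_RE.search(text)
    return answer.group(1).lower(), reasoning.group(1).strip() if reasoning else None
```

Treat a `(None, None)` result as a format failure and fall back to a default label or a retry.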

How it was trained

Two-stage recipe, same across the whole FinSenti family:

  1. SFT on the SFT train slice from the FinSenti dataset (~15.2K balanced training samples, drawn from a 50.8K-sample pool with held-out val/test splits, chain-of-thought targets generated by a teacher model and filtered for label agreement). This stage took about 5.0 hours on a single A100 80GB for this model.
  2. GRPO with four reward functions (sentiment correctness, format compliance, reasoning quality, output consistency), each weighted equally for a maximum reward of 4.0. The training budget was 3000 steps with early stopping; the best checkpoint landed near step ~480 with a mean reward of approximately 3.50 / 4.0 on the validation slice.
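The reward shaping can be sketched roughly as below. These are hypothetical reimplementations of two of the four terms (the training code is not published here); each term contributes up to 1.0, which is how four equally weighted terms give the 4.0 maximum:

```python
import re

# Hypothetical sketches of the format-compliance and sentiment-correctness
# reward terms. Patterns and thresholds are illustrative guesses.
FORMAT_RE = re.compile(
    r"^\s*<reasoning>.+?</reasoning>\s*<answer>(positive|negative|neutral)</answer>\s*$",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """1.0 if the completion matches the strict tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0

def correctness_reward(completion: str, gold_label: str) -> float:
    """1.0 if the <answer> label equals the reference label, else 0.0."""
    m = re.search(r"<answer>(\w+)</answer>", completion)
    return 1.0 if m and m.group(1) == gold_label else 0.0
```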

Trainer stack: Unsloth + TRL, using Unsloth's pre-quantized mirror unsloth/Qwen3.5-4B as the loading shortcut for the upstream Qwen/Qwen3.5-4B weights. LoRA adapters (r=32, alpha=64) were trained on the attention and MLP projection layers, then merged into the base weights before export, so this repo is a self-contained model and doesn't need PEFT to load.
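Numerically, "merging" a LoRA adapter means folding the low-rank update, scaled by alpha / r, into the frozen base weight, after which the adapter files are no longer needed at load time. A toy illustration (shapes and values are examples, not the model's real weights):

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               r: int = 32, alpha: int = 64) -> np.ndarray:
    """Fold a LoRA update into a base weight: W + (alpha / r) * B @ A."""
    return W + (alpha / r) * (B @ A)

# Toy weights: a 64x64 projection with a rank-32 adapter.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))   # frozen base weight
A = rng.normal(size=(32, 64))   # LoRA down-projection
B = rng.normal(size=(64, 32))   # LoRA up-projection
W_merged = merge_lora(W, A, B)
```

With alpha=64 and r=32, the update is applied at a scale of 2.0; the merged matrix produces the same outputs as running base and adapter side by side.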

Quick start

Standard transformers usage:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Ayansk11/FinSenti-Qwen3.5-4B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

system = (
    "You are a financial sentiment analyst. For each headline you receive, "
    "write a short reasoning chain inside <reasoning>...</reasoning> tags, "
    "then give a single label inside <answer>...</answer> tags. The label "
    "must be exactly one of: positive, negative, neutral."
)
user = "Apple beats Q4 estimates as iPhone sales jump 12% year over year."

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Expected output (your reasoning text will vary; the label should match):

<reasoning>
Beating estimates is a positive earnings surprise. A 12% YoY iPhone sales jump in the company's biggest product line points to demand strength. Both signals push the read positive.
</reasoning>
<answer>positive</answer>

Prompt format

The model expects the system prompt shown above; using it verbatim works best. The user turn is the headline or short snippet you want classified. Output is two XML-ish blocks in this order: <reasoning>...</reasoning> then <answer>...</answer>. The <answer> content is one of positive, negative, or neutral (lowercase, no punctuation).

If you want labels only and don't care about the reasoning, you can stop generation as soon as you see </answer> to save tokens.
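Recent transformers versions let you do this during decoding by passing stop_strings=["</answer>"] together with tokenizer=tok to model.generate. If you'd rather not depend on that, a post-hoc truncation helper (name is illustrative) keeps logs tidy, though it doesn't save the generated tokens:

```python
def truncate_at_answer(text: str, stop: str = "</answer>") -> str:
    """Cut everything after the first </answer> tag, keeping the tag itself."""
    idx = text.find(stop)
    return text if idx == -1 else text[: idx + len(stop)]
```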

Performance notes

The training reward (max 4.0) hit 3.50 on the held-out validation slice. That breaks down across the four reward functions roughly as:

  • Sentiment correctness: dominant contributor; the model gets the label right on the validation split most of the time
  • Format compliance: near-saturated by the end of GRPO; the model almost always produces well-formed <reasoning> and <answer> tags
  • Reasoning quality: judged on length and presence of finance-relevant signal words; this one's the noisiest of the four
  • Consistency: rewards stable labels across paraphrases of the same headline
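The reasoning-quality heuristic above (length plus finance-relevant signal words) might look something like this. The word list and thresholds are illustrative guesses, not the training values:

```python
# Hypothetical reasoning-quality heuristic: half credit for length in a
# reasonable band, half for containing at least one finance signal word.
SIGNAL_WORDS = {"beat", "miss", "estimates", "guidance", "revenue", "growth",
                "decline", "margin", "demand", "outlook"}

def reasoning_quality(reasoning: str, min_words: int = 10, max_words: int = 120) -> float:
    words = reasoning.lower().split()
    length_ok = min_words <= len(words) <= max_words
    has_signal = any(w.strip(".,") in SIGNAL_WORDS for w in words)
    return 0.5 * length_ok + 0.5 * has_signal
```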

Numbers on standard finance benchmarks (FPB, FiQA, Twitter Financial News) are forthcoming and will be added once the eval pipeline lands.

Hardware

bf16 weights are about 8 GB and need ~10 GB of VRAM for inference (a 12 GB card will do it with headroom). For CPU-only or 8 GB GPUs, grab the Q4_K_M GGUF.

Limitations

A few things this model isn't built for:

  • Long documents. Training context was capped at 2048 tokens. Anything much longer than a few paragraphs is out of distribution.
  • Multi-asset reasoning. It classifies the sentiment of a single piece of text. It won't aggregate across multiple headlines or weigh sources.
  • Numerical reasoning. It can read "beats by 12%" and call that positive, but it isn't doing math. Don't ask it to forecast.
  • Languages other than English. Training data was English only.
  • Background knowledge. If the headline needs you to know what a company does, the model only has whatever was in its base pretraining. It can't look anything up.
  • Three labels, hard cutoffs. The output space is positive / negative / neutral. If you need a 5-class scale or a continuous score, you'll need to retrain or post-process.
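One hedged way to get a continuous score from the three-class output without retraining is to map labels to {-1, 0, +1} and average over several classifications of related headlines downstream (function and mapping names here are illustrative):

```python
# Post-processing sketch: turn discrete labels into a rough [-1, 1] score.
LABEL_TO_SCORE = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

def aggregate_sentiment(labels: list[str]) -> float:
    """Mean of mapped label scores; 0.0 for an empty list."""
    if not labels:
        return 0.0
    return sum(LABEL_TO_SCORE[label] for label in labels) / len(labels)
```

This is a coarse proxy, not a calibrated score; the model itself never emits anything between the three classes.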

Training details

Upstream base model: Qwen/Qwen3.5-4B
Loading mirror: unsloth/Qwen3.5-4B (Unsloth's pre-quantized copy)
Dataset: Ayansk11/FinSenti-Dataset (~15.2K train per stage, 50.8K total across splits)
SFT length: ~5.0 hours on A100 80GB
GRPO budget: 3000 steps with early stopping (best near step ~480)
Best GRPO reward: ~3.50 / 4.0
Adapter: LoRA (r=32, alpha=64) on q/k/v/o/gate/up/down projections
Sequence length: 2048
Optimizer: AdamW (8-bit), cosine LR schedule
Hardware: NVIDIA A100 80GB (Indiana University BigRed200 cluster)
Frameworks: Unsloth + TRL

Related FinSenti models

Other sizes and bases in the FinSenti collection were trained with the same recipe.

There's a GGUF build of this same model at Ayansk11/FinSenti-Qwen3.5-4B-GGUF for Ollama and llama.cpp, and the dataset itself is at Ayansk11/FinSenti-Dataset.

If you're picking a size, a rough guide:

  • Need it on a phone or browser? Look at the smallest model in the group (Qwen3-0.6B) or its GGUF.
  • Laptop with no GPU? Any model up to ~2B as Q4_K_M GGUF works.
  • Single 8-12 GB GPU? The 1.5B-4B sizes are the sweet spot.
  • Server or workstation? The 8B / 9B variants give the best reasoning but need the memory.

Citation

If you use this model in research, please cite:

@misc{shaikh2026finsenti,
  title  = {FinSenti: Small Language Models for Financial Sentiment with Chain-of-Thought Reasoning},
  author = {Shaikh, Ayan},
  year   = {2026},
  url    = {https://huggingface.co/collections/Ayansk11/finsenti},
  note   = {Indiana University}
}

License

Apache 2.0, same as the base model.

Acknowledgements

Trained on the Indiana University BigRed200 cluster (account r01510). Thanks to the Unsloth and TRL teams for the trainer stack, and to the Qwen / DeepSeek teams for the base models.
