---
language:
- it
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
- lora
- fine-tuned
- banking
- regtech
- compliance
- rag
- tool-calling
- italian
- qwen3
pipeline_tag: text-generation
---
# RegTech-4B-Instruct
> **Fine-tuned for RAG-powered banking compliance — not general knowledge.**
A specialized [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) model fine-tuned to excel within a **Retrieval-Augmented Generation (RAG) pipeline** for Italian banking regulatory compliance.
This model doesn't try to memorize regulations — it's trained to **work with retrieved context**: follow instructions precisely, produce structured outputs, call compliance tools, resist hallucinations, and maintain professional tone when grounded on regulatory documents.
---
## What This Model Does
This fine-tuning optimizes the model's **behavior within a RAG system**, not its factual knowledge. Specifically:
| Task | Description |
|---|---|
| **RAG Q&A** | Answer regulatory questions grounded on retrieved documents |
| **Tool Calling** | KYC verification, risk scoring, PEP checks, SOS reporting |
| **Query Expansion** | Rewrite user queries with regulatory terminology for better retrieval |
| **Intent Detection** | Classify if a message needs document search or is conversational |
| **Document Reranking** | Score candidate documents by relevance |
| **Structured JSON** | Topic extraction, metadata, impact analysis in JSON format |
| **Impact Analysis** | Cross-reference external regulations against internal bank procedures |
| **Hallucination Resistance** | Refuse to fabricate regulations, articles, or sanctions not in context |
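For the tool-calling tasks above, Qwen3-family models emit each call as a JSON object wrapped in `<tool_call>` tags, which the host application parses and dispatches. A minimal parsing sketch (the `check_pep_status` tool name is illustrative, not part of this model's actual tool set):

```python
import json
import re

# Qwen3-style output wraps each tool call in <tool_call> ... </tool_call> tags
# containing a JSON object with "name" and "arguments" keys.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    """Extract every tool call emitted in a model response."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

response = (
    "<tool_call>\n"
    '{"name": "check_pep_status", "arguments": {"full_name": "Mario Rossi"}}\n'
    "</tool_call>"
)
calls = parse_tool_calls(response)
```

The application then runs the matching function and feeds the result back as a `"role": "tool"` message.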
---
## Evaluation
### Methodology
We evaluate all fine-tuned models using a **dynamic adversarial benchmark** designed to prevent overfitting to static test sets:
- **Test generation**: An independent LLM generates novel, realistic test scenarios across 13 compliance-specific categories for each evaluation run. Tests are never reused.
- **Blind comparison**: Both the base and fine-tuned model respond to identical prompts. Responses are anonymized and randomly swapped before judging to eliminate position bias.
- **Expert judging**: A frontier-class LLM acts as domain expert judge, scoring each response on 7 criteria (accuracy, context adherence, hallucination resistance, format, tone, instruction following, completeness) on a 1–5 scale.
- **Statistical robustness**: Each evaluation consists of multiple independent loops with fresh test sets, ensuring results are consistent and not artifacts of a single test batch.
This approach produces a rigorous, reproducible assessment that closely mirrors real-world compliance assistant performance.
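The anonymize-and-swap step can be sketched as follows (our illustration of the procedure described above, not the actual evaluation harness):

```python
import random

def blind_pair(base_resp: str, tuned_resp: str, rng: random.Random) -> tuple[dict, dict]:
    """Label the two responses "A"/"B" in random order so the judge cannot
    infer which model produced which; return a key to de-anonymize later."""
    swapped = rng.random() < 0.5
    if swapped:
        pair, key = {"A": tuned_resp, "B": base_resp}, {"A": "tuned", "B": "base"}
    else:
        pair, key = {"A": base_resp, "B": tuned_resp}, {"A": "base", "B": "tuned"}
    return pair, key

# The judge sees only pair["A"] / pair["B"]; scores are mapped back via key.
pair, key = blind_pair("base answer", "tuned answer", random.Random(0))
```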
### Results — RegTech-4B-Instruct
Evaluated across **73 blind adversarial tests** over 3 independent loops.
#### Head-to-Head vs Base Model
```
Base Tuned
Win Rate (adj.) 45.2% 54.8%
Wins 26 33
Ties 14
```
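The "adj." formula is not stated on this card, but crediting each tie as half a win to both sides reproduces the published figures exactly, including the per-loop percentages in the consistency table further down:

```python
def adjusted_win_rate(wins: int, ties: int, total: int) -> float:
    # ties credited as half a win to each side
    return round(100 * (wins + ties / 2) / total, 1)

total = 26 + 33 + 14  # 73 blind tests
tuned = adjusted_win_rate(33, 14, total)  # → 54.8
base = adjusted_win_rate(26, 14, total)   # → 45.2
```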
#### Quality Scores (1–5 scale)
| Criterion | Base | Tuned | Delta | Trend |
|---|:---:|:---:|:---:|---|
| Hallucination Resistance | 3.53 | **3.89** | +0.36 | Improved |
| Tone & Professionalism | 3.90 | **4.27** | +0.37 | Improved |
| Output Format | 3.41 | **3.75** | +0.34 | Improved |
| Instruction Following | 3.14 | **3.44** | +0.30 | Improved |
| Accuracy | 3.34 | **3.59** | +0.25 | Improved |
| Context Adherence | 3.66 | **3.89** | +0.23 | Improved |
| Completeness | **3.45** | 3.23 | -0.22 | Trade-off |
| **Overall** | **3.49** | **3.72** | **+0.23** | **Improved** |
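The Overall row is consistent with an unweighted mean of the seven criteria (our verification, not a documented formula):

```python
base  = [3.53, 3.90, 3.41, 3.14, 3.34, 3.66, 3.45]
tuned = [3.89, 4.27, 3.75, 3.44, 3.59, 3.89, 3.23]

def overall(scores: list[float]) -> float:
    # unweighted mean, rounded to 2 decimals as in the table
    return round(sum(scores) / len(scores), 2)

result = (overall(base), overall(tuned))  # → (3.49, 3.72)
```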
#### Key Safety Improvements
The fine-tuned model demonstrates measurably safer behavior in high-stakes regulatory scenarios:
- **Hallucination traps**: The tuned model correctly refuses fabricated regulations in all tested scenarios. The base model invents plausible-sounding but entirely fictional legal articles and sanctions.
- **Credential protection**: When exposed to prompt injection attacks containing embedded credentials, the tuned model refuses disclosure. The base model has been observed leaking credentials verbatim.
- **Professional tone**: Eliminates emoji usage and filler phrases ("Certo!", "Ottima domanda!", i.e. "Sure!", "Great question!") that are inappropriate in regulatory communications.
#### Known Limitations
- **Completeness trade-off** (-0.22): The model tends toward concise, precise answers. For tasks requiring exhaustive analysis, responses may be shorter than ideal.
- **Query Expansion**: Performance on query rewriting tasks is below the base model. This is a known gap being addressed in dataset improvements.
- **Inference speed**: ~40% faster than base model (4.3s vs 7.0s average), primarily due to more concise outputs.
#### Consistency Across Loops
| Loop | Base Wins | Tuned Wins | Ties | Tuned % |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 7 | 13 | 5 | 62.0% |
| 2 | 11 | 10 | 2 | 47.8% |
| 3 | 8 | 10 | 7 | 54.0% |
The tuned model wins outright in 2 of 3 independent loops and trails by a single win in the third.
---
## Usage Examples
### RAG Q&A — Answering from Retrieved Context
```python
messages = [
    {
        "role": "system",
        # "You are a banking compliance assistant. Answer ONLY from the provided context."
        "content": """Sei un assistente per la compliance bancaria.
Rispondi SOLO basandoti sul contesto fornito.""",
    },
    {
        "role": "user",
        # retrieved_context and question are supplied by your RAG pipeline
        "content": f"CONTESTO:\n{retrieved_context}\n\nDOMANDA:\n{question}",
    },
]
```
---
Built for banking RAG by 2Sophia
Fine-tuned with LoRA • Adversarial evaluation by frontier LLM judges • Powered by Qwen3