---
language:
- it
- en
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
- lora
- fine-tuned
- banking
- regtech
- compliance
- rag
- tool-calling
- italian
- qwen3
pipeline_tag: text-generation
---

# RegTech-4B-Instruct

> **Fine-tuned for RAG-powered banking compliance — not general knowledge.**

A specialized [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) model fine-tuned to excel within a **Retrieval-Augmented Generation (RAG) pipeline** for Italian banking regulatory compliance.

This model doesn't try to memorize regulations — it's trained to **work with retrieved context**: follow instructions precisely, produce structured outputs, call compliance tools, resist hallucinations, and maintain a professional tone when grounded on regulatory documents.

---

## What This Model Does

This fine-tuning optimizes the model's **behavior within a RAG system**, not its factual knowledge. Specifically:

| Task | Description |
|---|---|
| **RAG Q&A** | Answer regulatory questions grounded on retrieved documents |
| **Tool Calling** | KYC verification, risk scoring, PEP checks, SOS reporting |
| **Query Expansion** | Rewrite user queries with regulatory terminology for better retrieval |
| **Intent Detection** | Classify whether a message needs document search or is conversational |
| **Document Reranking** | Score candidate documents by relevance |
| **Structured JSON** | Topic extraction, metadata, impact analysis in JSON format |
| **Impact Analysis** | Cross-reference external regulations against internal bank procedures |
| **Hallucination Resistance** | Refuse to fabricate regulations, articles, or sanctions not in context |

---

## Evaluation

### Methodology

We evaluate all fine-tuned models using a **dynamic adversarial benchmark** designed to prevent overfitting to static test sets:

- **Test generation**: An independent LLM generates novel, realistic test scenarios across 13 compliance-specific categories for 
each evaluation run. Tests are never reused.
- **Blind comparison**: Both the base and fine-tuned model respond to identical prompts. Responses are anonymized and randomly swapped before judging to eliminate position bias.
- **Expert judging**: A frontier-class LLM acts as a domain-expert judge, scoring each response on 7 criteria (accuracy, context adherence, hallucination resistance, format, tone, instruction following, completeness) on a 1–5 scale.
- **Statistical robustness**: Each evaluation consists of multiple independent loops with fresh test sets, ensuring results are consistent and not artifacts of a single test batch.

This approach produces a rigorous, reproducible assessment that closely mirrors real-world compliance-assistant performance.

### Results — RegTech-4B-Instruct

Evaluated across **73 blind adversarial tests** over 3 independent loops.

#### Head-to-Head vs Base Model

```
                 Base    Tuned
Win Rate (adj.)  45.2%   54.8%
Wins             26      33
Ties             14
```

#### Quality Scores (1–5 scale)

| Criterion | Base | Tuned | Delta | Verdict |
|---|:---:|:---:|:---:|---|
| Hallucination Resistance | 3.53 | **3.89** | +0.36 | Improved |
| Tone & Professionalism | 3.90 | **4.27** | +0.37 | Improved |
| Output Format | 3.41 | **3.75** | +0.34 | Improved |
| Instruction Following | 3.14 | **3.44** | +0.30 | Improved |
| Accuracy | 3.34 | **3.59** | +0.25 | Improved |
| Context Adherence | 3.66 | **3.89** | +0.23 | Improved |
| Completeness | **3.45** | 3.23 | -0.22 | Trade-off |
| **Overall** | **3.49** | **3.72** | **+0.23** | **Improved** |

#### Key Safety Improvements

The fine-tuned model demonstrates measurably safer behavior in high-stakes regulatory scenarios:

- **Hallucination traps**: The tuned model correctly refuses fabricated regulations in all tested scenarios. The base model invents plausible-sounding but entirely fictional legal articles and sanctions. 
- **Credential protection**: When exposed to prompt-injection attacks containing embedded credentials, the tuned model refuses disclosure. The base model has been observed leaking credentials verbatim.
- **Professional tone**: Eliminates emoji usage and filler phrases ("Certo!", "Ottima domanda!") that are inappropriate in regulatory communications.

#### Known Limitations

- **Completeness trade-off** (-0.22): The model tends toward concise, precise answers. For tasks requiring exhaustive analysis, responses may be shorter than ideal.
- **Query Expansion**: Performance on query-rewriting tasks is below the base model's. This is a known gap being addressed in dataset improvements.
- **Inference speed**: ~40% faster than the base model (4.3 s vs 7.0 s average), primarily due to more concise outputs.

#### Consistency Across Loops

| Loop | Base Wins | Tuned Wins | Ties | Tuned % |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 7 | 13 | 5 | 62.0% |
| 2 | 11 | 10 | 2 | 47.8% |
| 3 | 8 | 10 | 7 | 54.0% |

The tuned model wins or ties in 2 out of 3 independent loops.

---

## Usage Examples

### RAG Q&A — Answering from Retrieved Context

```python
messages = [
    {
        "role": "system",
        "content": """Sei un assistente per la compliance bancaria.
Rispondi SOLO basandoti sul contesto fornito.

Art. 92 CRR - Gli enti soddisfano in qualsiasi momento i seguenti requisiti:
a) CET1 del 4,5%; b) Tier 1 del 6%; c) capitale totale dell'8%.
"""
    },
    {
        "role": "user",
        "content": "Quali sono i requisiti minimi di capitale secondo il CRR?"
    }
]
```

### Tool Calling — Compliance Workflows

```python
messages = [
    {
        "role": "system",
        "content": """Sei un assistente operativo per la compliance.

{"name": "calcola_scoring_rischio", "parameters": {...}}
{"name": "controlla_liste_pep", "parameters": {...}}
{"name": "verifica_kyc", "parameters": {...}}

Procedura AML-003: L'adeguata verifica rafforzata (EDD) deve essere applicata
per PEP, paesi ad alto rischio e profili con scoring > 60. 
""" }, { "role": "user", "content": "Devo aprire un conto per una società con sede a Dubai. Il legale rappresentante è il sig. Al-Rashid." } ] ``` ### Query Expansion — Improving RAG Retrieval ```python messages = [ { "role": "system", "content": "Riscrivi la query dell'utente per migliorare il recupero documentale. Aggiungi termini tecnici e riferimenti normativi. Rispondi SOLO con il JSON." }, { "role": "user", "content": "## QUERY ORIGINALE: [obblighi segnalazione operazioni sospette]" } ] ``` ### Document Reranking ```python messages = [ { "role": "system", "content": "Valuta la rilevanza di ciascun candidato rispetto alla query. Score 0-100. Rispondi SOLO con il JSON." }, { "role": "user", "content": '{"query": "requisiti CET1", "candidates": [{"id": "doc_001", "title": "Art. 92 CRR"}, {"id": "doc_002", "title": "DORA Art. 5"}]}' } ] ``` ### Training Metrics | Metric | Value | |---|---| | Final Eval Loss | 1.368 | | Token Accuracy | 70.5% | | Train/Eval Gap | 0.033 | > A gap of 0.033 indicates stable training with no overfitting. The model learned domain-specific behavior without degrading general capabilities. ### Design Principles The LoRA configuration follows a **minimal intervention** philosophy validated through progressive experimentation across 6+ configurations: - **Low rank, all modules**: Modifying all transformer layers with minimal rank produces better results than high rank on a subset of layers — consistent with findings from the [original LoRA paper](https://arxiv.org/abs/2106.09685). - **Single epoch**: One pass through the data is sufficient for behavioral adaptation. Multiple epochs cause catastrophic forgetting on small models. - **Conservative scaling**: Alpha = 2× rank with low learning rate ensures stable gradients with adequate signal amplification. 
---

## Dataset Coverage

The training data covers the full lifecycle of a RAG-based compliance assistant:

| Category | Purpose |
|---|---|
| Query Expansion | Enrich queries with regulatory terms for better retrieval |
| Intent Classification | Route queries to RAG vs conversational responses |
| Document Reranking | Score retrieved documents by relevance |
| Topic Extraction | Extract main topics from regulatory text pages |
| Document Summarization | Summarize multi-page regulatory documents |
| Relevance Filtering | Filter regulatory text relevant to banks |
| Metadata Extraction | Find application dates, issuing authorities |
| Impact Analysis | Cross-reference regulations vs internal procedures |
| RAG Q&A + Tool Calling | Multi-turn compliance conversations with tools |

**Regulatory sources covered:** CRR/CRR3, DORA (UE 2022/2554), D.Lgs. 231/2007 (AML), D.Lgs. 385/1993 (TUB), Circolare 285, PSD2, MiFID II/MiFIR, D.P.R. 180/1950 and related Banca d'Italia provisions.

---

## Deployment

### With vLLM

```bash
vllm serve ./models/RegTech-4B-Instruct --dtype bfloat16
```

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_REPO_ID",
    torch_dtype="bfloat16",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("YOUR_REPO_ID")

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Important Notes

- **RAG-optimized** — Trained to work with retrieved context, not to memorize regulations. Always provide relevant documents in the system prompt.
- **Domain-specific** — Optimized for Italian banking compliance. General capabilities may differ from the base model. 
- **Not legal advice** — A tool to assist compliance professionals, not a substitute for regulatory expertise.
- **Part of a model family** — This 4B model is the lightweight variant. Larger models (7B, 14B, 32B) in the RegTech family offer progressively better completeness and accuracy for more demanding use cases.

---
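As a concrete illustration of the "always provide relevant documents in the system prompt" guidance, here is a minimal sketch of assembling the chat payload from retrieved chunks. The helper name and instruction wording are illustrative assumptions, not a format the model was trained on; adapt them to your retrieval pipeline.

```python
def build_rag_messages(question: str, retrieved_chunks: list[str]) -> list[dict]:
    """Assemble a chat payload that grounds the model on retrieved context.

    Illustrative sketch: the key point is that the regulatory excerpts go
    into the system prompt, so the model answers only from that context.
    """
    context = "\n\n".join(retrieved_chunks)
    system = (
        "Sei un assistente per la compliance bancaria. "
        "Rispondi SOLO basandoti sul contesto fornito.\n\n"
        f"{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# Example: one retrieved chunk, one user question
messages = build_rag_messages(
    "Quali sono i requisiti minimi di capitale secondo il CRR?",
    ["Art. 92 CRR - CET1 del 4,5%; Tier 1 del 6%; capitale totale dell'8%."],
)
```

The resulting `messages` list can be passed directly to `tokenizer.apply_chat_template` as shown in the Transformers deployment snippet.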

Built for banking RAG by 2Sophia
Fine-tuned with LoRA • Adversarial evaluation by frontier LLM judges • Powered by Qwen3