MwSpace committed (verified) · Commit b5dd752 · Parent: 1b68735

Update README.md (README.md: +120 −112)
 
pipeline_tag: text-generation
---

# RegTech-4B-Instruct

> **Fine-tuned for RAG-powered banking compliance — not general knowledge.**

A specialized [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) model fine-tuned to excel within a **Retrieval-Augmented Generation (RAG) pipeline** for Italian banking regulatory compliance.

This model doesn't try to memorize regulations — it's trained to **work with retrieved context**: follow instructions precisely, produce structured outputs, call compliance tools, resist hallucinations, and maintain a professional tone when grounded on regulatory documents.

---

## What This Model Does

This fine-tuning optimizes the model's **behavior within a RAG system**, not its factual knowledge. Specifically:

| Task | Description |
|---|---|
| **RAG Q&A** | Answer regulatory questions grounded on retrieved documents |
| **Tool Calling** | KYC verification, risk scoring, PEP checks, SOS reporting |
| **Query Expansion** | Rewrite user queries with regulatory terminology for better retrieval |
| **Intent Detection** | Classify whether a message needs document search or is conversational |
| **Document Reranking** | Score candidate documents by relevance |
| **Structured JSON** | Topic extraction, metadata, impact analysis in JSON format |
| **Impact Analysis** | Cross-reference external regulations against internal bank procedures |
| **Hallucination Resistance** | Refuse to fabricate regulations, articles, or sanctions not in context |

---
## Evaluation

### Methodology

We evaluate all fine-tuned models using a **dynamic adversarial benchmark** designed to prevent overfitting to static test sets:

- **Test generation**: An independent LLM generates novel, realistic test scenarios across 13 compliance-specific categories for each evaluation run. Tests are never reused.
- **Blind comparison**: Both the base and fine-tuned model respond to identical prompts. Responses are anonymized and randomly swapped before judging to eliminate position bias.
- **Expert judging**: A frontier-class LLM acts as a domain-expert judge, scoring each response on 7 criteria (accuracy, context adherence, hallucination resistance, format, tone, instruction following, completeness) on a 1–5 scale.
- **Statistical robustness**: Each evaluation consists of multiple independent loops with fresh test sets, ensuring results are consistent and not artifacts of a single test batch.

This approach produces a rigorous, reproducible assessment that closely mirrors real-world compliance-assistant performance.
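The anonymize-and-swap step above can be sketched as follows — a minimal illustration of how position bias is removed, not the actual evaluation harness (function names are hypothetical):

```python
import random

def blind_pair(base_answer: str, tuned_answer: str, rng: random.Random):
    """Anonymize a base/tuned answer pair, randomly swapping positions
    so the judge cannot develop a positional bias."""
    swapped = rng.random() < 0.5
    first, second = (tuned_answer, base_answer) if swapped else (base_answer, tuned_answer)
    # The judge only ever sees the labels "A" and "B".
    return {"A": first, "B": second}, swapped

def unblind(verdict: str, swapped: bool) -> str:
    """Map the judge's 'A'/'B'/'tie' verdict back to base/tuned."""
    if verdict == "tie":
        return "tie"
    if verdict == "A":
        return "tuned" if swapped else "base"
    return "base" if swapped else "tuned"
```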
### Results — RegTech-4B-Instruct

Evaluated across **73 blind adversarial tests** over 3 independent loops.

#### Head-to-Head vs Base Model

```
                  Base    Tuned
Win Rate (adj.)   45.2%   54.8%
Wins              26      33
Ties              14 (shared)
```
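The adjusted win rate is consistent with crediting each tie as half a win to both sides; a quick check against the reported totals:

```python
def adjusted_win_rate(wins: int, ties: int, total: int) -> float:
    """Win rate with each tie credited as half a win."""
    return (wins + ties / 2) / total

total = 26 + 33 + 14  # base wins + tuned wins + ties = 73 blind tests
print(f"tuned: {adjusted_win_rate(33, 14, total):.1%}")  # tuned: 54.8%
print(f"base:  {adjusted_win_rate(26, 14, total):.1%}")  # base:  45.2%
```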
#### Quality Scores (1–5 scale)

| Criterion | Base | Tuned | Delta | Verdict |
|---|:---:|:---:|:---:|---|
| Hallucination Resistance | 3.53 | **3.89** | +0.36 | Improved |
| Tone & Professionalism | 3.90 | **4.27** | +0.37 | Improved |
| Output Format | 3.41 | **3.75** | +0.34 | Improved |
| Instruction Following | 3.14 | **3.44** | +0.30 | Improved |
| Accuracy | 3.34 | **3.59** | +0.25 | Improved |
| Context Adherence | 3.66 | **3.89** | +0.23 | Improved |
| Completeness | **3.45** | 3.23 | -0.22 | Trade-off |
| **Overall** | **3.49** | **3.72** | **+0.23** | **Improved** |
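The overall row is consistent with a plain (unweighted) average of the seven criteria; a quick check:

```python
base = [3.53, 3.90, 3.41, 3.14, 3.34, 3.66, 3.45]
tuned = [3.89, 4.27, 3.75, 3.44, 3.59, 3.89, 3.23]

base_overall = round(sum(base) / len(base), 2)    # mean of the seven criteria
tuned_overall = round(sum(tuned) / len(tuned), 2)
print(base_overall, tuned_overall)  # 3.49 3.72
```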
 
 
 
 
 
 
 
 
 
 
 
 
#### Key Safety Improvements

The fine-tuned model demonstrates measurably safer behavior in high-stakes regulatory scenarios:

- **Hallucination traps**: The tuned model correctly refuses fabricated regulations in all tested scenarios. The base model invents plausible-sounding but entirely fictional legal articles and sanctions.
- **Credential protection**: When exposed to prompt-injection attacks containing embedded credentials, the tuned model refuses disclosure. The base model has been observed leaking credentials verbatim.
- **Professional tone**: Eliminates emoji usage and filler phrases ("Certo!", "Ottima domanda!") that are inappropriate in regulatory communications.

#### Known Limitations

- **Completeness trade-off** (-0.22): The model tends toward concise, precise answers. For tasks requiring exhaustive analysis, responses may be shorter than ideal.
- **Query expansion**: Performance on query-rewriting tasks is below the base model. This is a known gap being addressed in dataset improvements.
- **Inference speed**: ~40% faster than the base model (4.3 s vs 7.0 s average), primarily because outputs are more concise.

#### Consistency Across Loops

| Loop | Base Wins | Tuned Wins | Ties | Tuned % |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 7 | 13 | 5 | 62.0% |
| 2 | 11 | 10 | 2 | 47.8% |
| 3 | 8 | 10 | 7 | 54.0% |

The tuned model comes out ahead in 2 of 3 independent loops (its adjusted win rate in the remaining loop is 47.8%).
---

## Usage Examples

### RAG Q&A — Answering from Retrieved Context

```python
messages = [
    {
        "role": "system",
        "content": """Rispondi SOLO basandoti sul contesto fornito.

<contesto_recuperato>
Art. 92 CRR - Gli enti soddisfano in qualsiasi momento i seguenti
requisiti: a) CET1 del 4,5%; b) Tier 1 del 6%; c) capitale totale dell'8%.
</contesto_recuperato>"""
    },
    {
        "role": "user",
        "content": "Quali sono i requisiti di fondi propri previsti dall'art. 92 CRR?"
    }
]
```
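In a full pipeline, the `<contesto_recuperato>` block is assembled from the retriever's top chunks. A minimal sketch (the helper name is hypothetical; the instruction and tags follow the example above):

```python
def build_rag_system_prompt(chunks: list[str]) -> str:
    """Wrap retrieved regulatory excerpts in the context tags used at training time."""
    context = "\n\n".join(chunks)
    return (
        "Rispondi SOLO basandoti sul contesto fornito.\n\n"
        f"<contesto_recuperato>\n{context}\n</contesto_recuperato>"
    )

system_prompt = build_rag_system_prompt([
    "Art. 92 CRR - requisiti: a) CET1 del 4,5%; b) Tier 1 del 6%; c) capitale totale dell'8%."
])
```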

### Tool Calling — Compliance Workflows

```python
messages = [
    {
        "role": "system",
        # Excerpt of the system prompt (abridged); it ends with the bank's EDD policy:
        "content": """[...] L'adeguata verifica rafforzata (EDD) viene
applicata per PEP, paesi ad alto rischio e profili con scoring > 60."""
    },
    {
        "role": "user",
        "content": "Devo aprire un conto per una società con sede a Dubai. Il legale rappresentante è il sig. Al-Rashid."
    }
]
```
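The tool names below (`controlla_liste_pep`, `calcola_scoring_rischio`) come from the compliance workflow; the parameter schemas are illustrative assumptions, not the schemas used in training:

```python
# Hedged sketch of OpenAI-style tool definitions; parameter schemas are assumptions.
tools = [
    {
        "type": "function",
        "function": {
            "name": "controlla_liste_pep",
            "description": "Verifica se un soggetto compare nelle liste PEP/sanzioni.",
            "parameters": {
                "type": "object",
                "properties": {"nome_completo": {"type": "string"}},
                "required": ["nome_completo"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calcola_scoring_rischio",
            "description": "Calcola lo scoring di rischio AML del cliente.",
            "parameters": {
                "type": "object",
                "properties": {
                    "paese_sede": {"type": "string"},
                    "pep": {"type": "boolean"},
                },
                "required": ["paese_sede"],
            },
        },
    },
]
```

With recent versions of `transformers`, a list like this can be passed via `tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True)`.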

### Query Expansion — Improving RAG Retrieval

```python
messages = [
    {
        "role": "system",
        "content": "Riscrivi la query dell'utente per migliorare il recupero documentale. Aggiungi termini tecnici e riferimenti normativi. Rispondi SOLO con il JSON."
    },
    {
        "role": "user",
        "content": "## QUERY ORIGINALE: [obblighi segnalazione operazioni sospette]"
    }
]
```
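Because the model is instructed to answer with JSON only, the expanded query can be parsed directly; a small sketch with a fallback for malformed output (helper name and sample response are illustrative):

```python
import json

def parse_expanded_query(raw: str, original: str) -> str:
    """Extract the expanded query; fall back to the original on malformed output."""
    try:
        return json.loads(raw)["query"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return original

raw = '{"query": "obblighi segnalazione operazioni sospette SOS UIF D.Lgs. 231/2007"}'
expanded = parse_expanded_query(raw, "obblighi segnalazione operazioni sospette")
```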

### Document Reranking

```python
messages = [
    {
        "role": "system",
        "content": "Valuta la rilevanza di ciascun candidato rispetto alla query. Score 0-100. Rispondi SOLO con il JSON."
    },
    {
        "role": "user",
        "content": '{"query": "requisiti CET1", "candidates": [{"id": "doc_001", "title": "Art. 92 CRR"}, {"id": "doc_002", "title": "DORA Art. 5"}]}'
    }
]
```
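The reranker's output can then gate which documents reach generation. A sketch assuming a `{"matches": [...]}` response shape with 0-100 relevance scores (the threshold is an arbitrary choice):

```python
import json

def top_documents(raw: str, threshold: int = 70) -> list[str]:
    """Keep ids of candidates scored at or above the threshold, best first."""
    matches = json.loads(raw).get("matches", [])
    kept = [m for m in matches if m.get("relevance", 0) >= threshold]
    return [m["id"] for m in sorted(kept, key=lambda m: m["relevance"], reverse=True)]

raw = '{"matches": [{"id": "doc_001", "relevance": 95}, {"id": "doc_002", "relevance": 12}]}'
print(top_documents(raw))  # ['doc_001']
```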

---

## Training Details

### Training Metrics

| Metric | Value |
|---|---|
| Final Eval Loss | 1.368 |
| Token Accuracy | 70.5% |
| Train/Eval Gap | 0.033 |

> A gap of 0.033 indicates stable training with no overfitting. The model learned domain-specific behavior without degrading general capabilities.

### Design Principles

The LoRA configuration follows a **minimal intervention** philosophy validated through progressive experimentation across 6+ configurations:

- **Low rank, all modules**: Modifying all transformer layers with minimal rank produces better results than high rank on a subset of layers — consistent with findings from the [original LoRA paper](https://arxiv.org/abs/2106.09685).
- **Single epoch**: One pass through the data is sufficient for behavioral adaptation. Multiple epochs cause catastrophic forgetting on small models.
- **Conservative scaling**: Alpha = 2× rank with a low learning rate ensures stable gradients with adequate signal amplification.
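Concretely, a recipe following these principles might look like the sketch below; the specific numbers are illustrative assumptions, not the published training configuration:

```python
# Illustrative LoRA recipe following the principles above (values are assumptions).
lora_recipe = {
    "r": 8,                          # low rank
    "lora_alpha": 16,                # alpha = 2 x rank (conservative scaling)
    "target_modules": "all-linear",  # adapt all transformer projection layers
    "num_train_epochs": 1,           # single pass to avoid catastrophic forgetting
    "learning_rate": 1e-4,           # low LR for stable gradients
}
```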

---

## Dataset Coverage

The training data covers the full lifecycle of a RAG-based compliance assistant:

| Category | Purpose |
|---|---|
| Query Expansion | Enrich queries with regulatory terms for better retrieval |
| Intent Classification | Route queries to RAG vs conversational responses |
| Document Reranking | Score retrieved documents by relevance |
| Topic Extraction | Extract main topics from regulatory text pages |
| Document Summarization | Summarize multi-page regulatory documents |
| Relevance Filtering | Filter regulatory text relevant to banks |
| Metadata Extraction | Find application dates, issuing authorities |
| Impact Analysis | Cross-reference regulations vs internal procedures |
| RAG Q&A + Tool Calling | Multi-turn compliance conversations with tools |

**Regulatory sources covered:** CRR/CRR3, DORA (UE 2022/2554), D.Lgs. 231/2007 (AML), D.Lgs. 385/1993 (TUB), Circolare 285, PSD2, MiFID II/MiFIR, D.P.R. 180/1950 and related Banca d'Italia provisions.

---
 
## Deployment

### With vLLM
```bash
vllm serve ./models/RegTech-4B-Instruct --dtype bfloat16
```
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_REPO_ID", torch_dtype="bfloat16", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("YOUR_REPO_ID")

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
255
 
256
  ---
257
 
258
+ ## Important Notes
259
 
260
+ - **RAG-optimized** — Trained to work with retrieved context, not to memorize regulations. Always provide relevant documents in the system prompt.
261
+ - **Domain-specific** — Optimized for Italian banking compliance. General capabilities may differ from the base model.
262
+ - **Not legal advice** — A tool to assist compliance professionals, not a substitute for regulatory expertise.
263
+ - **Part of a model family** — This 4B model is the lightweight variant. Larger models (7B, 14B, 32B) in the RegTech family offer progressively better completeness and accuracy for more demanding use cases.
264
 
265
  ---
266
 
267
  <p align="center">
268
+ Built for banking RAG by <a href="https://landing.2sophia.ai">2Sophia</a><br>
269
+ <em>Fine-tuned with LoRA &bull; Adversarial evaluation by frontier LLM judges &bull; Powered by Qwen3</em>
 
270
  </p>