MwSpace committed (verified) · Commit b5dd752 · Parent: 1b68735

Update README.md (README.md: +120 −112)
 
pipeline_tag: text-generation
---

# RegTech-4B-Instruct

> **Fine-tuned for RAG-powered banking compliance — not general knowledge.**

A specialized [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) model fine-tuned to excel within a **Retrieval-Augmented Generation (RAG) pipeline** for Italian banking regulatory compliance.

This model doesn't try to memorize regulations — it's trained to **work with retrieved context**: follow instructions precisely, produce structured outputs, call compliance tools, resist hallucinations, and maintain a professional tone when grounded on regulatory documents.

---

## What This Model Does

This fine-tuning optimizes the model's **behavior within a RAG system**, not its factual knowledge. Specifically:

| Task | Description |
|---|---|
| **RAG Q&A** | Answer regulatory questions grounded on retrieved documents |
| **Tool Calling** | KYC verification, risk scoring, PEP checks, SOS reporting |
| **Query Expansion** | Rewrite user queries with regulatory terminology for better retrieval |
| **Intent Detection** | Classify whether a message needs document search or is conversational |
| **Document Reranking** | Score candidate documents by relevance |
| **Structured JSON** | Topic extraction, metadata, impact analysis in JSON format |
| **Impact Analysis** | Cross-reference external regulations against internal bank procedures |
| **Hallucination Resistance** | Refuse to fabricate regulations, articles, or sanctions not in context |

---
## Evaluation

### Methodology

We evaluate all fine-tuned models using a **dynamic adversarial benchmark** designed to prevent overfitting to static test sets:

- **Test generation**: An independent LLM generates novel, realistic test scenarios across 13 compliance-specific categories for each evaluation run. Tests are never reused.
- **Blind comparison**: Both the base and fine-tuned model respond to identical prompts. Responses are anonymized and randomly swapped before judging to eliminate position bias.
- **Expert judging**: A frontier-class LLM acts as a domain-expert judge, scoring each response on 7 criteria (accuracy, context adherence, hallucination resistance, format, tone, instruction following, completeness) on a 1–5 scale.
- **Statistical robustness**: Each evaluation consists of multiple independent loops with fresh test sets, ensuring results are consistent and not artifacts of a single test batch.

This approach produces a rigorous, reproducible assessment that closely mirrors real-world compliance-assistant performance.
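The anonymize-and-swap step above can be sketched as follows — a minimal illustration of how position bias is removed, not the actual evaluation harness (function names are hypothetical):

```python
import random

def blind_pair(base_answer: str, tuned_answer: str, rng: random.Random):
    """Anonymize a base/tuned answer pair, randomly swapping positions
    so the judge cannot develop a positional bias."""
    swapped = rng.random() < 0.5
    first, second = (tuned_answer, base_answer) if swapped else (base_answer, tuned_answer)
    # The judge only ever sees the labels "A" and "B".
    return {"A": first, "B": second}, swapped

def unblind(verdict: str, swapped: bool) -> str:
    """Map the judge's 'A'/'B'/'tie' verdict back to base/tuned."""
    if verdict == "tie":
        return "tie"
    if verdict == "A":
        return "tuned" if swapped else "base"
    return "base" if swapped else "tuned"
```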
### Results — RegTech-4B-Instruct

Evaluated across **73 blind adversarial tests** over 3 independent loops.

#### Head-to-Head vs Base Model

```
                  Base    Tuned
Win Rate (adj.)   45.2%   54.8%
Wins              26      33
Ties              14 (shared)
```
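The adjusted win rate is consistent with crediting each tie as half a win to both sides; a quick check against the reported totals:

```python
def adjusted_win_rate(wins: int, ties: int, total: int) -> float:
    """Win rate with each tie credited as half a win."""
    return (wins + ties / 2) / total

total = 26 + 33 + 14  # base wins + tuned wins + ties = 73 blind tests
print(f"tuned: {adjusted_win_rate(33, 14, total):.1%}")  # tuned: 54.8%
print(f"base:  {adjusted_win_rate(26, 14, total):.1%}")  # base:  45.2%
```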
#### Quality Scores (1–5 scale)

| Criterion | Base | Tuned | Delta | Verdict |
|---|:---:|:---:|:---:|---|
| Hallucination Resistance | 3.53 | **3.89** | +0.36 | Improved |
| Tone & Professionalism | 3.90 | **4.27** | +0.37 | Improved |
| Output Format | 3.41 | **3.75** | +0.34 | Improved |
| Instruction Following | 3.14 | **3.44** | +0.30 | Improved |
| Accuracy | 3.34 | **3.59** | +0.25 | Improved |
| Context Adherence | 3.66 | **3.89** | +0.23 | Improved |
| Completeness | **3.45** | 3.23 | -0.22 | Trade-off |
| **Overall** | **3.49** | **3.72** | **+0.23** | **Improved** |
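The overall row is consistent with a plain (unweighted) average of the seven criteria; a quick check:

```python
base = [3.53, 3.90, 3.41, 3.14, 3.34, 3.66, 3.45]
tuned = [3.89, 4.27, 3.75, 3.44, 3.59, 3.89, 3.23]

base_overall = round(sum(base) / len(base), 2)    # mean of the seven criteria
tuned_overall = round(sum(tuned) / len(tuned), 2)
print(base_overall, tuned_overall)  # 3.49 3.72
```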
 
 
 
 
 
 
 
 
 
 
 
 
#### Key Safety Improvements

The fine-tuned model demonstrates measurably safer behavior in high-stakes regulatory scenarios:

- **Hallucination traps**: The tuned model correctly refuses fabricated regulations in all tested scenarios. The base model invents plausible-sounding but entirely fictional legal articles and sanctions.
- **Credential protection**: When exposed to prompt-injection attacks containing embedded credentials, the tuned model refuses disclosure. The base model has been observed leaking credentials verbatim.
- **Professional tone**: Eliminates emoji usage and filler phrases ("Certo!", "Ottima domanda!") that are inappropriate in regulatory communications.

#### Known Limitations

- **Completeness trade-off** (-0.22): The model tends toward concise, precise answers. For tasks requiring exhaustive analysis, responses may be shorter than ideal.
- **Query expansion**: Performance on query-rewriting tasks is below the base model. This is a known gap being addressed in dataset improvements.
- **Inference speed**: ~40% faster than the base model (4.3 s vs 7.0 s average), primarily because outputs are more concise.

#### Consistency Across Loops

| Loop | Base Wins | Tuned Wins | Ties | Tuned % |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 7 | 13 | 5 | 62.0% |
| 2 | 11 | 10 | 2 | 47.8% |
| 3 | 8 | 10 | 7 | 54.0% |

The tuned model comes out ahead in 2 of 3 independent loops (its adjusted win rate in the remaining loop is 47.8%).
---

## Usage Examples

### RAG Q&A — Answering from Retrieved Context

```python
messages = [
    {
        "role": "system",
        "content": """Rispondi SOLO basandoti sul contesto fornito.

<contesto_recuperato>
Art. 92 CRR - Gli enti soddisfano in qualsiasi momento i seguenti
requisiti: a) CET1 del 4,5%; b) Tier 1 del 6%; c) capitale totale dell'8%.
</contesto_recuperato>"""
    },
    {
        "role": "user",
        "content": "Quali sono i requisiti di fondi propri previsti dall'art. 92 CRR?"
    }
]
```
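In a full pipeline, the `<contesto_recuperato>` block is assembled from the retriever's top chunks. A minimal sketch (the helper name is hypothetical; the instruction and tags follow the example above):

```python
def build_rag_system_prompt(chunks: list[str]) -> str:
    """Wrap retrieved regulatory excerpts in the context tags used at training time."""
    context = "\n\n".join(chunks)
    return (
        "Rispondi SOLO basandoti sul contesto fornito.\n\n"
        f"<contesto_recuperato>\n{context}\n</contesto_recuperato>"
    )

system_prompt = build_rag_system_prompt([
    "Art. 92 CRR - requisiti: a) CET1 del 4,5%; b) Tier 1 del 6%; c) capitale totale dell'8%."
])
```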

### Tool Calling — Compliance Workflows

```python
messages = [
    {
        "role": "system",
        # Excerpt of the system prompt (abridged); it ends with the bank's EDD policy:
        "content": """[...] L'adeguata verifica rafforzata (EDD) viene
applicata per PEP, paesi ad alto rischio e profili con scoring > 60."""
    },
    {
        "role": "user",
        "content": "Devo aprire un conto per una società con sede a Dubai. Il legale rappresentante è il sig. Al-Rashid."
    }
]
```
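The tool names below (`controlla_liste_pep`, `calcola_scoring_rischio`) come from the compliance workflow; the parameter schemas are illustrative assumptions, not the schemas used in training:

```python
# Hedged sketch of OpenAI-style tool definitions; parameter schemas are assumptions.
tools = [
    {
        "type": "function",
        "function": {
            "name": "controlla_liste_pep",
            "description": "Verifica se un soggetto compare nelle liste PEP/sanzioni.",
            "parameters": {
                "type": "object",
                "properties": {"nome_completo": {"type": "string"}},
                "required": ["nome_completo"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calcola_scoring_rischio",
            "description": "Calcola lo scoring di rischio AML del cliente.",
            "parameters": {
                "type": "object",
                "properties": {
                    "paese_sede": {"type": "string"},
                    "pep": {"type": "boolean"},
                },
                "required": ["paese_sede"],
            },
        },
    },
]
```

With recent versions of `transformers`, a list like this can be passed via `tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True)`.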

### Query Expansion — Improving RAG Retrieval

```python
messages = [
    {
        "role": "system",
        "content": "Riscrivi la query dell'utente per migliorare il recupero documentale. Aggiungi termini tecnici e riferimenti normativi. Rispondi SOLO con il JSON."
    },
    {
        "role": "user",
        "content": "## QUERY ORIGINALE: [obblighi segnalazione operazioni sospette]"
    }
]
```
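Because the model is instructed to answer with JSON only, the expanded query can be parsed directly; a small sketch with a fallback for malformed output (helper name and sample response are illustrative):

```python
import json

def parse_expanded_query(raw: str, original: str) -> str:
    """Extract the expanded query; fall back to the original on malformed output."""
    try:
        return json.loads(raw)["query"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return original

raw = '{"query": "obblighi segnalazione operazioni sospette SOS UIF D.Lgs. 231/2007"}'
expanded = parse_expanded_query(raw, "obblighi segnalazione operazioni sospette")
```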

### Document Reranking

```python
messages = [
    {
        "role": "system",
        "content": "Valuta la rilevanza di ciascun candidato rispetto alla query. Score 0-100. Rispondi SOLO con il JSON."
    },
    {
        "role": "user",
        "content": '{"query": "requisiti CET1", "candidates": [{"id": "doc_001", "title": "Art. 92 CRR"}, {"id": "doc_002", "title": "DORA Art. 5"}]}'
    }
]
```
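The reranker's output can then gate which documents reach generation. A sketch assuming a `{"matches": [...]}` response shape with 0-100 relevance scores (the threshold is an arbitrary choice):

```python
import json

def top_documents(raw: str, threshold: int = 70) -> list[str]:
    """Keep ids of candidates scored at or above the threshold, best first."""
    matches = json.loads(raw).get("matches", [])
    kept = [m for m in matches if m.get("relevance", 0) >= threshold]
    return [m["id"] for m in sorted(kept, key=lambda m: m["relevance"], reverse=True)]

raw = '{"matches": [{"id": "doc_001", "relevance": 95}, {"id": "doc_002", "relevance": 12}]}'
print(top_documents(raw))  # ['doc_001']
```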

---

## Training Details

### Training Metrics

| Metric | Value |
|---|---|
| Final Eval Loss | 1.368 |
| Token Accuracy | 70.5% |
| Train/Eval Gap | 0.033 |

> A gap of 0.033 indicates stable training with no overfitting. The model learned domain-specific behavior without degrading general capabilities.

### Design Principles

The LoRA configuration follows a **minimal intervention** philosophy validated through progressive experimentation across 6+ configurations:

- **Low rank, all modules**: Modifying all transformer layers with minimal rank produces better results than high rank on a subset of layers — consistent with findings from the [original LoRA paper](https://arxiv.org/abs/2106.09685).
- **Single epoch**: One pass through the data is sufficient for behavioral adaptation. Multiple epochs cause catastrophic forgetting on small models.
- **Conservative scaling**: Alpha = 2× rank with a low learning rate ensures stable gradients with adequate signal amplification.
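Concretely, a recipe following these principles might look like the sketch below; the specific numbers are illustrative assumptions, not the published training configuration:

```python
# Illustrative LoRA recipe following the principles above (values are assumptions).
lora_recipe = {
    "r": 8,                          # low rank
    "lora_alpha": 16,                # alpha = 2 x rank (conservative scaling)
    "target_modules": "all-linear",  # adapt all transformer projection layers
    "num_train_epochs": 1,           # single pass to avoid catastrophic forgetting
    "learning_rate": 1e-4,           # low LR for stable gradients
}
```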

---

## Dataset Coverage

The training data covers the full lifecycle of a RAG-based compliance assistant:

| Category | Purpose |
|---|---|
| Query Expansion | Enrich queries with regulatory terms for better retrieval |
| Intent Classification | Route queries to RAG vs conversational responses |
| Document Reranking | Score retrieved documents by relevance |
| Topic Extraction | Extract main topics from regulatory text pages |
| Document Summarization | Summarize multi-page regulatory documents |
| Relevance Filtering | Filter regulatory text relevant to banks |
| Metadata Extraction | Find application dates, issuing authorities |
| Impact Analysis | Cross-reference regulations vs internal procedures |
| RAG Q&A + Tool Calling | Multi-turn compliance conversations with tools |

**Regulatory sources covered:** CRR/CRR3, DORA (UE 2022/2554), D.Lgs. 231/2007 (AML), D.Lgs. 385/1993 (TUB), Circolare 285, PSD2, MiFID II/MiFIR, D.P.R. 180/1950 and related Banca d'Italia provisions.

---
 
## Deployment

### With vLLM
```bash
vllm serve ./models/RegTech-4B-Instruct --dtype bfloat16
```
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_REPO_ID", torch_dtype="bfloat16", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("YOUR_REPO_ID")

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
255
 
256
  ---
257
 
258
+ ## Important Notes
259
 
260
+ - **RAG-optimized** — Trained to work with retrieved context, not to memorize regulations. Always provide relevant documents in the system prompt.
261
+ - **Domain-specific** — Optimized for Italian banking compliance. General capabilities may differ from the base model.
262
+ - **Not legal advice** — A tool to assist compliance professionals, not a substitute for regulatory expertise.
263
+ - **Part of a model family** — This 4B model is the lightweight variant. Larger models (7B, 14B, 32B) in the RegTech family offer progressively better completeness and accuracy for more demanding use cases.
264
 
265
  ---
266
 
267
  <p align="center">
268
+ Built for banking RAG by <a href="https://landing.2sophia.ai">2Sophia</a><br>
269
+ <em>Fine-tuned with LoRA &bull; Adversarial evaluation by frontier LLM judges &bull; Powered by Qwen3</em>
 
270
  </p>