pawlaszc committed
Commit 4b4ad8b · verified · 1 Parent(s): b399fd0

Update MODEL_CARD.md

Files changed (1):
  1. MODEL_CARD.md +14 -23

MODEL_CARD.md CHANGED
@@ -22,11 +22,11 @@ model-index:
       name: Text-to-SQL Generation
       dataset:
         type: mobile-forensics
-        name: SQLiteDS — Mobile Forensics SQL Dataset (corrected)
+        name: Mobile Forensics SQL Dataset
       metrics:
       - type: accuracy
-        value: 91.0
-        name: Overall Accuracy (without app name)
+        value: 93.0
+        name: Overall Accuracy
       - type: accuracy
         value: 95.1
         name: Easy Queries Accuracy
@@ -45,7 +45,7 @@ model-index:
 **ForSQLiteLM** (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized
 for generating SQLite queries from natural language requests against mobile forensic databases.
 The model converts investigative questions into executable SQL queries across a wide range of
-forensic artifact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat,
+forensic artefact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat,
 Instagram, blockchain wallets, and many more.

 This model was developed as part of a research project and accompanying journal paper
@@ -53,7 +53,7 @@ investigating LLM fine-tuning for forensic database analysis, and is integrated
 [FQLite](https://github.com/pawlaszczyk/fqlite), an established open-source forensic
 analysis tool.

-> **Key result:** 91.0% execution accuracy on a 100-example held-out test set — within
-> 4 percentage points of GPT-4o (95.0%) evaluated under identical conditions
+> **Key result:** 93.0% execution accuracy on a 100-example held-out test set — within
+> 2 percentage points of GPT-4o (95.0%) evaluated under identical conditions
 > (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally
 > with no internet connectivity required.
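The McNemar comparison above operates on discordant pairs, i.e. test items where exactly one of the two models is correct. The card does not report those counts, so the numbers in the example are purely hypothetical; a minimal sketch of the exact two-sided test:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value.
    b = items only model A got right, c = items only model B got right."""
    n = b + c
    # Under H0 the b "wins" among discordant pairs follow Binomial(n, 0.5);
    # double the smaller tail and cap at 1.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(2 * tail, 1.0)

# Hypothetical discordant counts, for illustration only
print(mcnemar_exact(2, 4))  # 0.6875
```

With such small discordant counts the test has little power, which is consistent with the card's "not significant at α = 0.05" conclusion.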
@@ -76,8 +76,8 @@ analysis tool.

 | Metric | Value |
 |---|---|
-| **Overall Accuracy** | **91.0%** (91/100) |
-| 95% CI (Wilson) | [83.8%, 95.2%] |
+| **Overall Accuracy** | **93.0%** (93/100) |
+| 95% CI (Wilson) | [86.3%, 96.6%] |
 | Executable Queries | 92/100 |
-| GPT-4o Accuracy | 95.0% (gap: 4 pp, p ≈ 0.39) |
+| GPT-4o Accuracy | 95.0% (gap: 2 pp, p ≈ 0.39) |
 | Base Model (no fine-tuning) | 35.0% |
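The Wilson 95% intervals quoted in these tables can be reproduced directly from the counts; a minimal sketch of the standard Wilson score interval (z = 1.96), which matches the [86.3%, 96.6%] bound for 93/100:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(93, 100)
print(f"[{lo:.1%}, {hi:.1%}]")  # [86.3%, 96.6%]
```

The same function reproduces the per-domain intervals, e.g. `wilson_ci(17, 18)` gives roughly [74.2%, 99.0%].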
@@ -99,16 +99,16 @@ is concentrated on Hard queries (complex CTEs, window functions, multi-table joins)
 | Domain | Accuracy | n | 95% CI |
 |---|---|---|---|
 | Messaging & Social | **100.0%** | 28/28 | [87.9%, 100.0%] |
 | Android Artifacts | **94.4%** | 17/18 | [74.2%, 99.0%] |
 | Productivity & Other | **88.9%** | 16/18 | [67.2%, 96.9%] |
-| iOS CoreData | **84.0%** | 21/25 | [65.3%, 93.6%] |
+| iOS CoreData | **92.0%** | 23/25 | [75.0%, 97.8%] |
 | Finance & Crypto | **81.8%** | 9/11 | [52.3%, 94.9%] |

 ### Prompt Configuration Ablation

 | Configuration | Overall | Easy | Medium | Hard | iOS |
 |---|---|---|---|---|---|
-| **WITHOUT App Name** ★ | **91.0%** | **95.1%** | 87.5% | **88.9%** | 84.0% |
-| WITH App Name | 88.0% | 92.7% | 87.5% | 81.5% | **88.0%** |
+| **WITHOUT App Name** ★ | **93.0%** | **95.1%** | 87.5% | **88.9%** | **92.0%** |
+| WITH App Name | 88.0% | 92.7% | 87.5% | 81.5% | 88.0% |

 ★ Primary configuration — omitting the application name from the prompt yields
@@ -131,10 +131,10 @@ configuration without app name is recommended for general use.
 | Base model (no fine-tuning) | — | 35.0% | — |
 | Fine-tuned, no augmentation | — | 68.0% | +33 pp |
 | + Data augmentation (3.4×) | — | 74.0% | +6 pp |
 | + Extended training (7 epochs) | 0.3617 | 84.0% | +10 pp |
 | + Post-processing pipeline | 0.3617 | 87.0% | +3 pp |
 | + Execution feedback | 0.3617 | 90.0% | +3 pp |
-| + Corrected training dataset (v5) | **0.3043** | **91.0%** | +1 pp |
+| + Corrected training dataset (v5) | **0.3043** | **93.0%** | +3 pp |

 ## Intended Use

@@ -148,13 +148,12 @@ configuration without app name is recommended for general use.

 > **ForSQLiteLM is not a replacement for SQL expertise.** It generates candidate queries
 > that require review by a practitioner with sufficient SQL knowledge before any reliance
-> is placed on their results. The 91.0% accuracy means approximately **1 in 11 queries
+> is placed on their results. The 93.0% accuracy means approximately **1 in 14 queries
 > contains an error**. In court-admissible or case-critical work, all outputs must be
 > independently validated.

 ### Out-of-Scope Use
 - Autonomous forensic decision-making without human review
-- Production systems requiring >95% guaranteed accuracy
 - General-purpose SQL generation outside the forensic domain
 - Non-SQLite databases (PostgreSQL, MySQL, etc.)

@@ -325,19 +324,11 @@ ollama run forensic-sql
 | Training time | ~17.6 hours |
 | Best val loss | 0.3043 (epoch 7) |

-### Key Training Insight: Sequence Length
-
-Early training runs with `max_seq_length=512` truncated 92% of examples, causing
-the model to learn schema generation (CREATE TABLE) instead of queries — resulting
-in only ~50% accuracy. Setting `max_seq_length=2048` eliminated truncation and
-improved accuracy from 50% to 68% before augmentation, and to 91% after all
-training components were applied.
-
 ## Limitations

 ### Known Issues

-1. **iOS CoreData Schemas (84.0%):** The Z-prefix column naming convention
+1. **iOS CoreData Schemas (92.0%):** The Z-prefix column naming convention
    (e.g., `ZISFROMME`, `ZTIMESTAMP`) provides no semantic signal from column
    names alone, making these schemas harder to reason about.
 2. **Hard Queries — 3.7 pp gap to GPT-4o:** Complex CTEs, recursive queries,
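The Z-prefix difficulty can be made concrete. This sketch uses a hypothetical `ZMESSAGE` table (the schema is invented for illustration, not taken from a real artifact database) and the standard Core Data epoch offset of 978307200 seconds between 2001-01-01 and the Unix epoch:

```python
import sqlite3

# Hypothetical CoreData-style table: Z-prefixed columns, timestamps stored
# as seconds since the Core Data epoch (2001-01-01 00:00:00 UTC).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ZMESSAGE (
    Z_PK INTEGER PRIMARY KEY,
    ZISFROMME INTEGER,
    ZTIMESTAMP REAL,
    ZTEXT TEXT
)""")
con.execute("INSERT INTO ZMESSAGE VALUES (1, 1, 726796800.0, 'On my way')")

# The Core Data epoch is 978307200 seconds after the Unix epoch, so add the
# offset before converting with SQLite's datetime().
row = con.execute("""
    SELECT ZTEXT, datetime(ZTIMESTAMP + 978307200, 'unixepoch') AS sent_utc
    FROM ZMESSAGE
    WHERE ZISFROMME = 1
""").fetchone()
print(row)  # ('On my way', '2024-01-13 00:00:00')
```

Nothing in the column names says "sent by the device owner" or "seconds since 2001", which is why such schemas give the model (and human analysts) less to work with.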
@@ -400,4 +391,4 @@ Apache 2.0 — following the base Llama 3.2 license terms.
 **Disclaimer:** ForSQLiteLM is intended for research and forensic practitioner use.
 All generated SQL queries must be reviewed by a qualified practitioner before
 execution in live forensic investigations. The authors accept no liability for
-incorrect conclusions drawn from unvalidated model outputs.
+incorrect conclusions drawn from unvalidated model outputs.