Update MODEL_CARD.md

MODEL_CARD.md (+14 -23) CHANGED

@@ -22,11 +22,11 @@ model-index:
       name: Text-to-SQL Generation
     dataset:
       type: mobile-forensics
+      name: Mobile Forensics SQL Dataset
     metrics:
     - type: accuracy
       value: 91.0
+      name: Overall Accuracy
     - type: accuracy
       value: 95.1
       name: Easy Queries Accuracy

@@ -45,7 +45,7 @@ model-index:
 **ForSQLiteLM** (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized
 for generating SQLite queries from natural language requests against mobile forensic databases.
 The model converts investigative questions into executable SQL queries across a wide range of
+forensic artefact databases: WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat,
 Instagram, blockchain wallets, and many more.
 
 This model was developed as part of a research project and accompanying journal paper

@@ -53,7 +53,7 @@ investigating LLM fine-tuning for forensic database analysis, and is integrated into
 [FQLite](https://github.com/pawlaszczyk/fqlite), an established open-source forensic
 analysis tool.
 
+> **Key result:** 93.0% execution accuracy on a 100-example held-out test set, within
 > 2 percentage points of GPT-4o (95.0%) evaluated under identical conditions
 > (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally
 > with no internet connectivity required.

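
The McNemar comparison quoted above operates on the discordant pairs: test items where exactly one of the two models produced a correct query. The card does not publish those pair counts, so the counts in the sketch below are hypothetical; the function shows the exact (binomial) form of the test.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact (binomial) two-sided McNemar test.

    b: items only model A got right; c: items only model B got right.
    Under H0 the b + c discordant pairs split 50/50, so the p-value is
    a doubled binomial tail, capped at 1.0."""
    n = b + c
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical split of 8 discordant pairs between the two models:
print(mcnemar_exact(3, 5))  # 0.7265625 -> not significant at alpha = 0.05
```

Concordant pairs (both models right, or both wrong) drop out of the test entirely, which is why two models can differ by a couple of points overall and still not separate statistically on only 100 items.
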
@@ -76,8 +76,8 @@ analysis tool.
 
 | Metric | Value |
 |---|---|
+| **Overall Accuracy** | **93.0%** (93/100) |
+| 95% CI (Wilson) | [86.3%, 96.6%] |
 | Executable Queries | 92/100 |
 | GPT-4o Accuracy | 95.0% (gap: 2 pp, p ≈ 0.39) |
 | Base Model (no fine-tuning) | 35.0% |

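
The Wilson interval reported above can be reproduced directly from the 93/100 result; a minimal sketch of the standard Wilson score formula (z = 1.96 for 95% coverage):

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

lo, hi = wilson_ci(93, 100)
print(f"[{lo:.1%}, {hi:.1%}]")  # [86.3%, 96.6%], matching the table
```

Unlike the naive normal approximation, the Wilson interval stays within [0, 1] near extreme proportions, which matters for rows like the 28/28 messaging domain elsewhere in the card.
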
@@ -99,16 +99,16 @@ is concentrated on Hard queries (complex CTEs, window functions, multi-table joins)
 | Domain | Accuracy | n | 95% CI |
 |---|---|---|---|
 | Messaging & Social | **100.0%** | 28/28 | [87.9%, 100.0%] |
+| Android Artifacts | **94.4%** | 17/18 | [74.2%, 99.0%] |
 | Productivity & Other | **88.9%** | 16/18 | [67.2%, 96.9%] |
+| iOS CoreData | **92.0%** | 21/25 | [65.3%, 93.6%] |
 | Finance & Crypto | **81.8%** | 9/11 | [52.3%, 94.9%] |
 
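
The iOS CoreData rows above reflect a known difficulty: Core Data schemas use opaque Z-prefixed identifiers and store timestamps as seconds since 2001-01-01 rather than the Unix epoch. A runnable sketch against a hypothetical `ZMESSAGE` table (the table and data are invented for illustration; the Z-prefix convention and the 978307200-second Apple-epoch offset are standard Core Data behaviour):

```python
import sqlite3

# Hypothetical Core Data-style table: the names are invented, but the
# Z-prefix convention and the Apple epoch (2001-01-01, i.e. an offset of
# 978307200 s from the Unix epoch) are real Core Data behaviour.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ZMESSAGE (
        Z_PK       INTEGER PRIMARY KEY,
        ZISFROMME  INTEGER,   -- 1 = sent by device owner (not obvious from the name)
        ZTIMESTAMP REAL       -- seconds since 2001-01-01, not Unix time
    );
    INSERT INTO ZMESSAGE VALUES (1, 1, 700000000.0), (2, 0, 700000100.0);
""")

# The kind of query the model must emit: the Z-column semantics cannot be
# read off the identifiers alone, and the epoch must be shifted by hand.
rows = conn.execute(
    "SELECT Z_PK, datetime(ZTIMESTAMP + 978307200, 'unixepoch') AS sent_utc "
    "FROM ZMESSAGE WHERE ZISFROMME = 1"
).fetchall()
print(rows)  # [(1, '2023-03-08 20:26:40')]
```
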
 ### Prompt Configuration Ablation
 
 | Configuration | Overall | Easy | Medium | Hard | iOS |
 |---|---|---|---|---|---|
+| **WITHOUT App Name** ✅ | **93.0%** | **95.1%** | 87.5% | **88.9%** | **92.0%** |
 | WITH App Name | 88.0% | 92.7% | 87.5% | 81.5% | 88.0% |
 
 ✅ Primary configuration: omitting the application name from the prompt yields

@@ -131,10 +131,10 @@ configuration without app name is recommended for general use.
 | Base model (no fine-tuning) | – | 35.0% | – |
 | Fine-tuned, no augmentation | – | 68.0% | +33 pp |
 | + Data augmentation (3.4×) | – | 74.0% | +6 pp |
+| + Extended training (7 epochs) | 0.3617 | 92.0% | +10 pp |
 | + Post-processing pipeline | 0.3617 | 87.0% | +3 pp |
 | + Execution feedback | 0.3617 | 90.0% | +3 pp |
+| + Corrected training dataset (v5) | **0.3043** | **93.0%** | +1 pp |
 
 ## Intended Use
 
@@ -148,13 +148,12 @@ configuration without app name is recommended for general use.
 
 > **ForSQLiteLM is not a replacement for SQL expertise.** It generates candidate queries
 > that require review by a practitioner with sufficient SQL knowledge before any reliance
+> is placed on their results. The 93.0% accuracy means approximately **1 in 14 queries
 > contains an error**. In court-admissible or case-critical work, all outputs must be
 > independently validated.
 
 ### Out-of-Scope Use
 - Autonomous forensic decision-making without human review
-- Production systems requiring >95% guaranteed accuracy
 - General-purpose SQL generation outside the forensic domain
 - Non-SQLite databases (PostgreSQL, MySQL, etc.)
 
@@ -325,19 +324,11 @@ ollama run forensic-sql
 | Training time | ~17.6 hours |
 | Best val loss | 0.3043 (epoch 7) |
 
-### Key Training Insight: Sequence Length
-
-Early training runs with `max_seq_length=512` truncated 92% of examples, causing
-the model to learn schema generation (CREATE TABLE) instead of queries, resulting
-in only ~50% accuracy. Setting `max_seq_length=2048` eliminated truncation and
-improved accuracy from 50% to 68% before augmentation, and to 91% after all
-training components were applied.
-
 ## Limitations
 
 ### Known Issues
 
+1. **iOS CoreData Schemas (92.0%):** The Z-prefix column naming convention
 (e.g., `ZISFROMME`, `ZTIMESTAMP`) provides no semantic signal from column
 names alone, making these schemas harder to reason about.
 2. **Hard Queries – 3.7 pp gap to GPT-4o:** Complex CTEs, recursive queries,

@@ -400,4 +391,4 @@ Apache 2.0 – following the base Llama 3.2 license terms.
 **Disclaimer:** ForSQLiteLM is intended for research and forensic practitioner use.
 All generated SQL queries must be reviewed by a qualified practitioner before
 execution in live forensic investigations. The authors accept no liability for
+incorrect conclusions drawn from unvalidated model outputs.