Update model card with Spider benchmark results
README.md

**Removed** (previous README: an auto-generated `transformers` Trainer card; only fragments are recoverable from the diff):

- Frontmatter: `library_name: transformers`, `license: other`, `base_model: Qwen/Qwen3-4B-Base`
- Evaluation-set result: `Loss: 0.2490`
- Training hyperparameters: `eval_batch_size: 1`, `seed: 42`, `gradient_accumulation_steps: 8`, `total_train_batch_size: 8`, optimizer `ADAMW_TORCH_FUSED` with `betas=(0.9,0.999)` and `epsilon=1e-08`, `lr_scheduler_type: cosine`, `lr_scheduler_warmup_steps: 300`, `num_epochs: 4.0`
- Training log (column labels assume the standard Trainer-card layout):

  | Training Loss | Epoch | Step | Validation Loss |
  | ---: | ---: | ---: | ---: |
  | 0.0830 | 1.5218 | 2500 | 0.2772 |
  | 0.0843 | 1.8262 | 3000 | 0.2774 |
  | 0.0443 | 2.1303 | 3500 | 0.3124 |
  | 0.0402 | 2.4347 | 4000 | 0.3158 |
  | 0.0414 | 2.7391 | 4500 | 0.3392 |
  | 0.0178 | 3.0432 | 5000 | 0.3912 |
  | 0.0128 | 3.3476 | 5500 | 0.4340 |
  | 0.0110 | 3.6521 | 6000 | 0.4406 |
  | 0.0109 | 3.9565 | 6500 | 0.4410 |

- Framework versions, including `Tokenizers 0.22.2`

**Added** (new README):
---
base_model: Qwen/Qwen3-4B-Base
library_name: transformers
pipeline_tag: text-generation
tags:
- text-to-sql
- sql
- qwen3
- llamafactory
- spider
- spider-test-suite
---

# qwen_4b_sql

`qwen_4b_sql` is a `Qwen3-4B-Base` model finetuned for text-to-SQL generation with full SFT on a cleaned split of `PipableAI/pip-txt-to-sql-spider-bird-dataset`.

This repository tracks the stronger 4B checkpoint from our single-GPU NVIDIA H20 training runs. In our internal comparison, this checkpoint outperformed the corresponding `Qwen3-1.7B-Base` baseline on Spider execution accuracy.

## Model Details

- Base model: [`Qwen/Qwen3-4B-Base`](https://huggingface.co/Qwen/Qwen3-4B-Base)
- Finetuning framework: `LLaMA-Factory`
- Training mode: `Full SFT`
- Task: `schema + question -> SQL only`

## Training Data

- Primary dataset: [`PipableAI/pip-txt-to-sql-spider-bird-dataset`](https://huggingface.co/datasets/PipableAI/pip-txt-to-sql-spider-bird-dataset)
- We used a cleaned local split derived from that dataset for training and validation

## Training Setup

- Hardware: single `NVIDIA H20 96GB`
- Precision: `bf16`
- Context length: `2048`
- Per-device train batch size: `1`
- Gradient accumulation steps: `8`
- Effective batch size: `8`
- Learning rate: `5e-6`
- Scheduler: `cosine`
- Warmup steps: `300`
- Epochs: `4.0`
- Template: `qwen3_nothink`
- Best-checkpoint selection: `load_best_model_at_end = true`
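
The hyperparameters above map directly onto a LLaMA-Factory config file. A sketch of what such a config could look like (key names follow common LLaMA-Factory examples; the dataset entry is a placeholder, not our actual file):

```yaml
### model
model_name_or_path: Qwen/Qwen3-4B-Base

### method
stage: sft
do_train: true
finetuning_type: full

### dataset (placeholder name for our cleaned local split)
dataset: spider_bird_cleaned
template: qwen3_nothink
cutoff_len: 2048

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 4.0
lr_scheduler_type: cosine
warmup_steps: 300
bf16: true

### eval / checkpointing
per_device_eval_batch_size: 1
load_best_model_at_end: true
```

Such a file is typically launched with `llamafactory-cli train <config>.yaml`.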

## Spider Benchmark

The following numbers are from the Spider dev set, using the official evaluation tooling:

- Official `match` evaluation from `test-suite-sql-eval`
- Official Spider `Test Suite` execution evaluation
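
Both evaluators read predictions as one SQL query per line, so raw generations have to be flattened before scoring. A minimal sketch of that normalization (the helper name and cleanup rules are ours, not part of the official tooling):

```python
import re

def to_pred_line(generation: str) -> str:
    """Collapse a raw model generation into the one-query-per-line format
    expected by the test-suite-sql-eval prediction file."""
    sql = generation.strip()
    # Strip a markdown code fence if the model emitted one.
    fence = re.match(r"```(?:sql)?\s*(.*?)\s*```", sql, flags=re.DOTALL)
    if fence:
        sql = fence.group(1)
    # Collapse internal newlines and runs of whitespace to single spaces.
    return re.sub(r"\s+", " ", sql).strip()
```

One line per dev example is then written to the prediction file passed to the evaluator.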

### Main Results

| Metric | Score |
| --- | ---: |
| Spider official exact match | 35.0% |
| Spider Test Suite execution accuracy | 67.6% |

### Difficulty Breakdown

| Difficulty | Exact Match | Test Suite Exec |
| --- | ---: | ---: |
| Easy | 64.9% | 87.5% |
| Medium | 37.4% | 72.9% |
| Hard | 16.1% | 50.0% |
| Extra | 3.6% | 42.2% |

## Notes

- This model is stronger under execution-based Spider evaluation than our best `Qwen3-1.7B-Base` run.
- In our experiments, exact-match metrics were often stricter than execution-based metrics because semantically valid SQL rewrites do not always match the Spider gold form exactly.
- A later 4B rerun with altered training settings underperformed this checkpoint on Spider and is not the checkpoint published here.
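
The gap between the two metrics comes from rewrites that are textually different but behaviorally identical. A toy SQLite illustration (not the Spider tooling; the table and queries are invented for this example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    "CREATE TABLE twitter (TweetID INTEGER, Lang TEXT);"
    "INSERT INTO twitter VALUES (1, 'en'), (2, 'fr'), (3, 'en');"
)

gold = "SELECT COUNT(*) FROM twitter WHERE Lang = 'en'"
pred = "SELECT COUNT(TweetID) FROM twitter WHERE Lang = 'en'"

# Different strings -> exact match fails ...
assert gold != pred
# ... but identical result sets -> execution-based evaluation passes.
assert con.execute(gold).fetchall() == con.execute(pred).fetchall()
```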

## Intended Use

This model is intended for:

- text-to-SQL research baselines
- schema-conditioned SQL generation experiments
- single-turn SQL generation from natural language plus schema text

It is not validated for:

- production-grade database access control
- unrestricted execution over arbitrary enterprise schemas
- multi-turn agent workflows without extra prompting / tooling

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bsq1989/qwen_4b_sql"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

prompt = """Generate SQL from the given schema and question. Output SQL only.

Schema:
CREATE TABLE twitter (TweetID INTEGER, UserID INTEGER, LocationID INTEGER, Lang TEXT, ...);
CREATE TABLE location (LocationID INTEGER, Country TEXT, City TEXT, ...);

Question:
How many tweets are in English?
"""

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Limitations

- Performance drops on SQL benchmarks that are more open-ended and heterogeneous than Spider.
- The model can still produce invalid column references on out-of-distribution schemas.
- Benchmark numbers here reflect our current internal setup and should be reproduced with the same evaluation pipeline for strict comparison.
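
One cheap guard against the invalid-reference failure mode is to compile generated SQL against an empty copy of the schema before using it. A sketch using SQLite's `EXPLAIN` (the helper name is ours; a real deployment would need to match the target database dialect):

```python
import sqlite3

def compiles_against_schema(schema_sql: str, query: str) -> bool:
    """Return True if `query` parses and resolves against `schema_sql` in SQLite."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(schema_sql)    # build an empty copy of the schema
        con.execute("EXPLAIN " + query)  # compiles the query without reading data
        return True
    except sqlite3.Error:
        return False
    finally:
        con.close()

schema = "CREATE TABLE twitter (TweetID INTEGER, Lang TEXT);"
compiles_against_schema(schema, "SELECT COUNT(*) FROM twitter WHERE Lang = 'en'")  # True
compiles_against_schema(schema, "SELECT Country FROM twitter")  # False: no such column
```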