bsq1989 committed on
Commit 21bbd2e · verified · 1 Parent(s): eaecfc9

Update model card with Spider benchmark results

Files changed (1):
  1. README.md +110 -54

README.md CHANGED
@@ -1,75 +1,131 @@
  ---
- library_name: transformers
- license: other
  base_model: Qwen/Qwen3-4B-Base
  tags:
- - llama-factory
- - full
- - generated_from_trainer
- model-index:
- - name: train_run_06_qwen3_4b_formal
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
- # train_run_06_qwen3_4b_formal

- This model is a fine-tuned version of [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) on the sql_train dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.2490

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 5e-06
- - train_batch_size: 1
- - eval_batch_size: 1
- - seed: 42
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 8
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 300
- - num_epochs: 4.0

- ### Training results

- | Training Loss | Epoch  | Step | Validation Loss |
- |:-------------:|:------:|:----:|:---------------:|
- | 0.1851        | 0.3044 | 500  | 0.2562          |
- | 0.1331        | 0.6088 | 1000 | 0.2490          |
- | 0.1293        | 0.9132 | 1500 | 0.2554          |
- | 0.0885        | 1.2174 | 2000 | 0.2661          |
- | 0.0830        | 1.5218 | 2500 | 0.2772          |
- | 0.0843        | 1.8262 | 3000 | 0.2774          |
- | 0.0443        | 2.1303 | 3500 | 0.3124          |
- | 0.0402        | 2.4347 | 4000 | 0.3158          |
- | 0.0414        | 2.7391 | 4500 | 0.3392          |
- | 0.0178        | 3.0432 | 5000 | 0.3912          |
- | 0.0128        | 3.3476 | 5500 | 0.4340          |
- | 0.0110        | 3.6521 | 6000 | 0.4406          |
- | 0.0109        | 3.9565 | 6500 | 0.4410          |

- ### Framework versions

- - Transformers 5.2.0
- - Pytorch 2.9.1+cu130
- - Datasets 4.0.0
- - Tokenizers 0.22.2
  ---
  base_model: Qwen/Qwen3-4B-Base
+ library_name: transformers
+ pipeline_tag: text-generation
  tags:
+ - text-to-sql
+ - sql
+ - qwen3
+ - llamafactory
+ - spider
+ - spider-test-suite
  ---

+ # qwen_4b_sql
+
+ `qwen_4b_sql` is a `Qwen3-4B-Base` model finetuned for text-to-SQL generation with full SFT on a cleaned split of `PipableAI/pip-txt-to-sql-spider-bird-dataset`.
+
+ This repository tracks the stronger 4B checkpoint from our single-GPU H20 training runs. In our internal comparison, this checkpoint outperformed the corresponding `Qwen3-1.7B-Base` baseline on Spider execution accuracy.
+
+ ## Base Model
+
+ - Base model: [`Qwen/Qwen3-4B-Base`](https://huggingface.co/Qwen/Qwen3-4B-Base)
+ - Finetuning framework: `LLaMA-Factory`
+ - Training mode: `Full SFT`
+ - Task: `schema + question -> SQL only`
+
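The `schema + question -> SQL only` task layout can be sketched as a small prompt builder. This is illustrative only: the `build_prompt` helper and its exact instruction wording are assumptions, not part of the training pipeline, though they mirror the prompt shown in the Example Usage section of the card.

```python
def build_prompt(schema: str, question: str) -> str:
    """Assemble a single-turn text-to-SQL prompt: instruction, schema, question."""
    return (
        "Generate SQL from the given schema and question. Output SQL only.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Question:\n{question}\n"
    )

prompt = build_prompt(
    "CREATE TABLE twitter (TweetID INTEGER, Lang TEXT);",
    "How many tweets are in English?",
)
print(prompt)
```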
+ ## Training Data
+
+ - Primary dataset: [`PipableAI/pip-txt-to-sql-spider-bird-dataset`](https://huggingface.co/datasets/PipableAI/pip-txt-to-sql-spider-bird-dataset)
+ - We used a cleaned local split derived from that dataset for train/validation
+
+ ## Training Setup
+
+ - Hardware: single `NVIDIA H20 96GB`
+ - Precision: `bf16`
+ - Context length: `2048`
+ - Per-device train batch size: `1`
+ - Gradient accumulation steps: `8`
+ - Effective batch size: `8`
+ - Learning rate: `5e-6`
+ - Scheduler: `cosine`
+ - Warmup steps: `300`
+ - Epochs: `4.0`
+ - Template: `qwen3_nothink`
+ - Best-checkpoint selection: `load_best_model_at_end = true`
+
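As a sanity check on the setup above, the effective batch size follows from per-device batch size × gradient accumulation steps × GPU count (a minimal sketch; the variable names are illustrative):

```python
per_device_batch = 1   # per-device train batch size
grad_accum_steps = 8   # gradient accumulation steps
num_gpus = 1           # single NVIDIA H20

# Optimizer updates see this many samples per step.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # → 8, matching the reported effective batch size
```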
+ ## Spider Benchmark
+
+ The numbers below are from the Spider dev set, using the official evaluation tooling:
+
+ - Official `match` evaluation from `test-suite-sql-eval`
+ - Official Spider `Test Suite` execution evaluation
+
+ ### Main Results
+
+ | Metric | Score |
+ | --- | ---: |
+ | Spider official exact match | 35.0% |
+ | Spider Test Suite execution accuracy | 67.6% |
+
+ ### Difficulty Breakdown
+
+ | Difficulty | Exact Match | Test Suite Exec |
+ | --- | ---: | ---: |
+ | Easy | 64.9% | 87.5% |
+ | Medium | 37.4% | 72.9% |
+ | Hard | 16.1% | 50.0% |
+ | Extra | 3.6% | 42.2% |
+
+ ## Notes
+
+ - This model is stronger under execution-based Spider evaluation than our best `Qwen3-1.7B-Base` run.
+ - In our experiments, exact-match metrics were often stricter than execution-based metrics, because semantically valid SQL rewrites do not always match the Spider gold form exactly.
+ - A later 4B rerun with altered training settings underperformed this checkpoint on Spider and is not the checkpoint published here.
+
+ ## Intended Use
+
+ This model is intended for:
+
+ - text-to-SQL research baselines
+ - schema-conditioned SQL generation experiments
+ - single-turn SQL generation from natural language plus schema text
+
+ It is not validated for:
+
+ - production-grade database access control
+ - unrestricted execution over arbitrary enterprise schemas
+ - multi-turn agent workflows without extra prompting / tooling
+
+ ## Example Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "bsq1989/qwen_4b_sql"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype="auto",
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ prompt = """Generate SQL from the given schema and question. Output SQL only.
+
+ Schema:
+ CREATE TABLE twitter (TweetID INTEGER, UserID INTEGER, LocationID INTEGER, Lang TEXT, ...);
+ CREATE TABLE location (LocationID INTEGER, Country TEXT, City TEXT, ...);
+
+ Question:
+ How many tweets are in English?
+ """
+
+ messages = [{"role": "user", "content": prompt}]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=256)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
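Decoded output from a generation loop like the one above may wrap the SQL in a code fence or append trailing text. A small post-processing sketch can normalize it; the `extract_sql` helper below is an assumption for illustration, not part of this repository.

```python
import re

def extract_sql(generated: str) -> str:
    """Pull a single SQL statement out of raw model output.

    Handles an optional ```sql fence, keeps only the first statement,
    and collapses whitespace. Illustrative helper, not from this repo.
    """
    # Prefer the contents of a fenced code block if one is present.
    fence = re.search(r"```(?:sql)?\s*(.*?)```", generated,
                      flags=re.DOTALL | re.IGNORECASE)
    text = fence.group(1) if fence else generated
    # Keep only the first statement and normalize whitespace.
    first = text.split(";")[0].strip()
    return " ".join(first.split())

raw = "```sql\nSELECT COUNT(*)\nFROM twitter\nWHERE Lang = 'en';\n```"
print(extract_sql(raw))  # → SELECT COUNT(*) FROM twitter WHERE Lang = 'en'
```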
+ ## Limitations
+
+ - Performance drops on SQL benchmarks that are more open-ended and heterogeneous than Spider.
+ - The model can still produce invalid column references on out-of-distribution schemas.
+ - The benchmark numbers here reflect our internal setup; reproduce them with the same evaluation pipeline for a strict comparison.