Update model card with Spider benchmark results
README.md

**Removed** (previous README: an auto-generated `transformers` Trainer card; only fragments are recoverable from the diff):

- Frontmatter: `library_name: transformers`, `license: other`, `base_model: Qwen/Qwen3-4B-Base`
- Evaluation-set result: `Loss: 0.2490`
- Training hyperparameters: `eval_batch_size: 1`, `seed: 42`, `gradient_accumulation_steps: 8`, `total_train_batch_size: 8`, optimizer `ADAMW_TORCH_FUSED` with `betas=(0.9,0.999)` and `epsilon=1e-08`, `lr_scheduler_type: cosine`, `lr_scheduler_warmup_steps: 300`, `num_epochs: 4.0`
- Training log (column labels assume the standard Trainer-card layout):

  | Training Loss | Epoch | Step | Validation Loss |
  | ---: | ---: | ---: | ---: |
  | 0.0830 | 1.5218 | 2500 | 0.2772 |
  | 0.0843 | 1.8262 | 3000 | 0.2774 |
  | 0.0443 | 2.1303 | 3500 | 0.3124 |
  | 0.0402 | 2.4347 | 4000 | 0.3158 |
  | 0.0414 | 2.7391 | 4500 | 0.3392 |
  | 0.0178 | 3.0432 | 5000 | 0.3912 |
  | 0.0128 | 3.3476 | 5500 | 0.4340 |
  | 0.0110 | 3.6521 | 6000 | 0.4406 |
  | 0.0109 | 3.9565 | 6500 | 0.4410 |

- Framework versions, including `Tokenizers 0.22.2`

**Added** (new README):
---
base_model: Qwen/Qwen3-4B-Base
library_name: transformers
pipeline_tag: text-generation
tags:
- text-to-sql
- sql
- qwen3
- llamafactory
- spider
- spider-test-suite
---

# qwen_4b_sql

`qwen_4b_sql` is a `Qwen3-4B-Base` model finetuned for text-to-SQL generation with full SFT on a cleaned split of `PipableAI/pip-txt-to-sql-spider-bird-dataset`.

This repository tracks the stronger 4B checkpoint from our single-GPU NVIDIA H20 training runs. In our internal comparison, this checkpoint outperformed the corresponding `Qwen3-1.7B-Base` baseline on Spider execution accuracy.

## Model Details

- Base model: [`Qwen/Qwen3-4B-Base`](https://huggingface.co/Qwen/Qwen3-4B-Base)
- Finetuning framework: `LLaMA-Factory`
- Training mode: `Full SFT`
- Task: `schema + question -> SQL only`

## Training Data

- Primary dataset: [`PipableAI/pip-txt-to-sql-spider-bird-dataset`](https://huggingface.co/datasets/PipableAI/pip-txt-to-sql-spider-bird-dataset)
- We used a cleaned local split derived from that dataset for training and validation

## Training Setup

- Hardware: single `NVIDIA H20 96GB`
- Precision: `bf16`
- Context length: `2048`
- Per-device train batch size: `1`
- Gradient accumulation steps: `8`
- Effective batch size: `8`
- Learning rate: `5e-6`
- Scheduler: `cosine`
- Warmup steps: `300`
- Epochs: `4.0`
- Template: `qwen3_nothink`
- Best-checkpoint selection: `load_best_model_at_end = true`
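
The hyperparameters above map directly onto a LLaMA-Factory config file. A sketch of what such a config could look like (key names follow common LLaMA-Factory examples; the dataset entry is a placeholder, not our actual file):

```yaml
### model
model_name_or_path: Qwen/Qwen3-4B-Base

### method
stage: sft
do_train: true
finetuning_type: full

### dataset (placeholder name for our cleaned local split)
dataset: spider_bird_cleaned
template: qwen3_nothink
cutoff_len: 2048

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 4.0
lr_scheduler_type: cosine
warmup_steps: 300
bf16: true

### eval / checkpointing
per_device_eval_batch_size: 1
load_best_model_at_end: true
```

Such a file is typically launched with `llamafactory-cli train <config>.yaml`.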

## Spider Benchmark

The following numbers are from the Spider dev set, using the official evaluation tooling:

- Official `match` evaluation from `test-suite-sql-eval`
- Official Spider `Test Suite` execution evaluation
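
Both evaluators read predictions as one SQL query per line, so raw generations have to be flattened before scoring. A minimal sketch of that normalization (the helper name and cleanup rules are ours, not part of the official tooling):

```python
import re

def to_pred_line(generation: str) -> str:
    """Collapse a raw model generation into the one-query-per-line format
    expected by the test-suite-sql-eval prediction file."""
    sql = generation.strip()
    # Strip a markdown code fence if the model emitted one.
    fence = re.match(r"```(?:sql)?\s*(.*?)\s*```", sql, flags=re.DOTALL)
    if fence:
        sql = fence.group(1)
    # Collapse internal newlines and runs of whitespace to single spaces.
    return re.sub(r"\s+", " ", sql).strip()
```

One line per dev example is then written to the prediction file passed to the evaluator.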

### Main Results

| Metric | Score |
| --- | ---: |
| Spider official exact match | 35.0% |
| Spider Test Suite execution accuracy | 67.6% |

### Difficulty Breakdown

| Difficulty | Exact Match | Test Suite Exec |
| --- | ---: | ---: |
| Easy | 64.9% | 87.5% |
| Medium | 37.4% | 72.9% |
| Hard | 16.1% | 50.0% |
| Extra | 3.6% | 42.2% |

## Notes

- This model is stronger under execution-based Spider evaluation than our best `Qwen3-1.7B-Base` run.
- In our experiments, exact-match metrics were often stricter than execution-based metrics because semantically valid SQL rewrites do not always match the Spider gold form exactly.
- A later 4B rerun with altered training settings underperformed this checkpoint on Spider and is not the checkpoint published here.
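
The gap between the two metrics comes from rewrites that are textually different but behaviorally identical. A toy SQLite illustration (not the Spider tooling; the table and queries are invented for this example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    "CREATE TABLE twitter (TweetID INTEGER, Lang TEXT);"
    "INSERT INTO twitter VALUES (1, 'en'), (2, 'fr'), (3, 'en');"
)

gold = "SELECT COUNT(*) FROM twitter WHERE Lang = 'en'"
pred = "SELECT COUNT(TweetID) FROM twitter WHERE Lang = 'en'"

# Different strings -> exact match fails ...
assert gold != pred
# ... but identical result sets -> execution-based evaluation passes.
assert con.execute(gold).fetchall() == con.execute(pred).fetchall()
```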

## Intended Use

This model is intended for:

- text-to-SQL research baselines
- schema-conditioned SQL generation experiments
- single-turn SQL generation from natural language plus schema text

It is not validated for:

- production-grade database access control
- unrestricted execution over arbitrary enterprise schemas
- multi-turn agent workflows without extra prompting / tooling

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bsq1989/qwen_4b_sql"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

prompt = """Generate SQL from the given schema and question. Output SQL only.

Schema:
CREATE TABLE twitter (TweetID INTEGER, UserID INTEGER, LocationID INTEGER, Lang TEXT, ...);
CREATE TABLE location (LocationID INTEGER, Country TEXT, City TEXT, ...);

Question:
How many tweets are in English?
"""

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Limitations

- Performance drops on SQL benchmarks that are more open-ended and heterogeneous than Spider.
- The model can still produce invalid column references on out-of-distribution schemas.
- Benchmark numbers here reflect our current internal setup and should be reproduced with the same evaluation pipeline for strict comparison.
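
One cheap guard against the invalid-reference failure mode is to compile generated SQL against an empty copy of the schema before using it. A sketch using SQLite's `EXPLAIN` (the helper name is ours; a real deployment would need to match the target database dialect):

```python
import sqlite3

def compiles_against_schema(schema_sql: str, query: str) -> bool:
    """Return True if `query` parses and resolves against `schema_sql` in SQLite."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(schema_sql)    # build an empty copy of the schema
        con.execute("EXPLAIN " + query)  # compiles the query without reading data
        return True
    except sqlite3.Error:
        return False
    finally:
        con.close()

schema = "CREATE TABLE twitter (TweetID INTEGER, Lang TEXT);"
compiles_against_schema(schema, "SELECT COUNT(*) FROM twitter WHERE Lang = 'en'")  # True
compiles_against_schema(schema, "SELECT Country FROM twitter")  # False: no such column
```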