---
base_model: Qwen/Qwen3-Reranker-4B
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-classification
tags:
- text-to-sql
- text2sql
- nl2sql
- sql
- sql-generation
- template-matching
- template-selection
- constrained-decoding
- database
- nli
- paraphrase
- reranker
- cross-encoder
- qwen3
language:
- en
license: apache-2.0
---
# TeCoD SQL Template Matcher
Fine-tune of [`Qwen/Qwen3-Reranker-4B`](https://huggingface.co/Qwen/Qwen3-Reranker-4B) used by [TeCoD](https://github.com/SSLab-CSE-IITB/tecod), a template-guided constrained decoding system for text-to-SQL.
This model is the TeCoD template-matching reranker. It scores whether a user question matches a retrieved masked question/template, helping TeCoD select recurring SQL templates before generation.
- Project page: <https://sslab-cse-iitb.github.io/tecod/>
- Source repository: <https://github.com/SSLab-CSE-IITB/tecod>
- Base model: <https://huggingface.co/Qwen/Qwen3-Reranker-4B>
- Training data source: [BIRD](https://bird-bench.github.io/) train split.
## Intended Use
This model is intended as an internal component of TeCoD and related template-based text-to-SQL systems. It is not a standalone SQL generator. In TeCoD, it is used after vector retrieval and before SQL generation to rerank candidate SQL templates.
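The reranking step described above can be sketched as follows. This is a minimal illustration, not TeCoD's implementation: `rerank_templates` and `toy_score` are hypothetical names, and `toy_score` is a word-overlap placeholder standing in for a call to this model so that the sketch runs without loading any weights.

```python
from typing import Callable, Sequence


def rerank_templates(
    user_question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    top_k: int = 1,
) -> list[tuple[str, float]]:
    """Score each retrieved masked question against the user question.

    `score(premise, hypothesis)` stands in for a call to this model; the
    masked candidate is passed first, the raw user question second.
    """
    scored = [(cand, score(cand, user_question)) for cand in candidates]
    # Highest score first; the top entries are the templates to keep.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer (word overlap) so the sketch runs without the model."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(p | h), 1)


# Candidates would normally come from vector retrieval; these are made up.
candidates = [
    "Show movies released in _ sorted by popularity desc",
    "List customers from _ with more than _ orders",
]
best = rerank_templates("What movies were released in 2010?", candidates, toy_score)
print(best)
```

In a real pipeline the `score` callable would wrap a forward pass of the reranker; how the selected template is then passed to SQL generation is outside the scope of this sketch.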
## Input Format
The model is used as a cross-encoder over a question pair. Order matters: the first sequence (the premise) should be the masked candidate/template question, and the second sequence (the hypothesis) should be the raw user question.
```text
Premise: "Show movies released in _ sorted by popularity desc"
Hypothesis: "What are the top films from 2010 by viewer count?"
```
Entity values in the candidate question are masked with a space-padded underscore `_`. The same mask token is used for strings, numbers, dates, and other literal values. Swapping the input order or changing the masking convention can degrade reranking quality.
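A minimal sketch of building such a pair is shown below. It assumes the caller already knows which literal values to mask; `mask_literals` and `build_pair` are hypothetical helpers, and TeCoD's own entity detection and masking code may differ.

```python
import re

MASK = "_"  # mask token described above; padded with spaces in the text


def mask_literals(question: str, literals: list[str]) -> str:
    """Replace each literal value with a space-padded underscore.

    `literals` is a hypothetical input: the caller supplies the values
    to mask, since entity detection is not covered here.
    """
    masked = question
    # Mask longer literals first so a short value cannot split a longer one.
    for value in sorted(literals, key=len, reverse=True):
        if value:  # skip empty strings, which would match everywhere
            masked = masked.replace(value, f" {MASK} ")
    # Collapse the extra whitespace introduced by the padding.
    return re.sub(r"\s+", " ", masked).strip()


def build_pair(
    template_question: str, literals: list[str], user_question: str
) -> tuple[str, str]:
    """Return (premise, hypothesis): masked candidate first, raw question second."""
    return mask_literals(template_question, literals), user_question


premise, hypothesis = build_pair(
    "Show movies released in 2010 sorted by popularity desc",
    ["2010"],
    "What are the top films from 2010 by viewer count?",
)
print(premise)
print(hypothesis)
```

Only the candidate question is masked; the user question is passed through unchanged, matching the pair shown above.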
## Training Summary
- Base model: `Qwen/Qwen3-Reranker-4B`
- Architecture: `Qwen3ForSequenceClassification`
- Data: approximately 1.48M NLI pairs derived from BIRD questions.
- Positive pairs: template-paired questions, self paraphrases, and partner paraphrases that preserve the SQL template.
- Negative pairs: hard negatives mined using nearest-neighbor retrieval over masked questions, with both masked and unmasked query variants used during pair construction.
- Labels: `entailment`, `neutral`, `contradiction`.
- The `neutral` label is retained for compatibility with a 3-class NLI head but was not used as a training target.
## Limitations
- Specialized for masked text-to-SQL question/template matching.
- Not intended for general NLI, semantic similarity, or SQL generation.
- Assumes the same masking convention and candidate-template construction used by TeCoD.
- The `neutral` label is untrained, so its logit carries no signal; inference should compare entailment against contradiction only, or renormalize over labels `{0, 2}` (the two trained labels).
- Very long question pairs and non-English inputs are not validated.
- The reranking score is one signal in a larger text-to-SQL pipeline; it does not guarantee final SQL correctness.
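Renormalizing over the two trained labels, as suggested in the list above, amounts to a softmax over two logits. The sketch below assumes the label order `entailment = 0`, `neutral = 1`, `contradiction = 2`, following the order in which the labels are listed under Training Summary; this mapping is an assumption and should be checked against the checkpoint's configuration. `match_probability` is a hypothetical helper name.

```python
import math

# Assumed label order, following the list in Training Summary; verify
# against the checkpoint's configuration before relying on it.
ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2


def match_probability(logits: list[float]) -> float:
    """Softmax restricted to entailment and contradiction.

    Returns P(entailment) after dropping the untrained neutral logit.
    """
    e, c = logits[ENTAILMENT], logits[CONTRADICTION]
    m = max(e, c)  # subtract the max for numerical stability
    exp_e, exp_c = math.exp(e - m), math.exp(c - m)
    return exp_e / (exp_e + exp_c)


# Example logits are made up for illustration.
print(match_probability([2.0, 5.0, 0.0]))
```

Because the neutral logit is ignored, a large value in that position does not affect the result; the score depends only on the difference between the entailment and contradiction logits.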
## References
- TeCoD project page: <https://sslab-cse-iitb.github.io/tecod/>
- TeCoD source repo: <https://github.com/SSLab-CSE-IITB/tecod>
- Base model: <https://huggingface.co/Qwen/Qwen3-Reranker-4B>
- Training data (BIRD train split): <https://bird-bench.github.io/>
If you use this model as part of TeCoD, please cite:
```bibtex
@article{10.1145/3769822,
author = {Jivani, Smit and Maheshwari, Saravam and Sarawagi, Sunita},
title = {Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding},
journal = {Proceedings of the ACM on Management of Data},
volume = {3},
number = {6},
pages = {1--26},
year = {2025},
month = dec,
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3769822},
url = {https://doi.org/10.1145/3769822}
}
```
## License
Apache 2.0