---
base_model: Qwen/Qwen3-Reranker-4B
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-classification
tags:
- text-to-sql
- text2sql
- nl2sql
- sql
- sql-generation
- template-matching
- template-selection
- constrained-decoding
- database
- nli
- paraphrase
- reranker
- cross-encoder
- qwen3
language:
- en
license: apache-2.0
---

# TeCoD SQL Template Matcher

Fine-tune of [`Qwen/Qwen3-Reranker-4B`](https://huggingface.co/Qwen/Qwen3-Reranker-4B) used by [TeCoD](https://github.com/SSLab-CSE-IITB/tecod), a template-guided constrained decoding system for text-to-SQL.

This model is the TeCoD template-matching reranker. It scores whether a user question matches a retrieved masked question/template, helping TeCoD select recurring SQL templates before generation.

- Project page:
- Source repository:
- Base model:
- Training data source: [BIRD](https://bird-bench.github.io/) train split.

## Intended Use

This model is intended as an internal component of TeCoD and related template-based text-to-SQL systems. It is not a standalone SQL generator. In TeCoD, it is used after vector retrieval and before SQL generation to rerank candidate SQL templates.

## Input Format

The model is used as a cross-encoder over a question pair. Order matters: the first sequence should be the masked candidate/template question, and the second sequence should be the raw user question.

```text
Premise: "Show movies released in _ sorted by popularity desc"
Hypothesis: "What are the top films from 2010 by viewer count?"
```

Entity values in the candidate question are masked with a space-padded underscore `_`. The same mask token is used for strings, numbers, dates, and other literal values. Swapping the input order or changing the masking convention can degrade reranking quality.

## Training Summary

- Base model: `Qwen/Qwen3-Reranker-4B`
- Architecture: `Qwen3ForSequenceClassification`
- Data: approximately 1.48M NLI pairs derived from BIRD questions.
- Positive pairs: template-paired questions, self paraphrases, and partner paraphrases that preserve the SQL template.
- Negative pairs: hard negatives mined using nearest-neighbor retrieval over masked questions, with both masked and unmasked query variants used during pair construction.
- Labels: `entailment`, `neutral`, `contradiction`.
- The `neutral` label is retained for compatibility with a 3-class NLI head but was not used as a training target.

## Limitations

- Specialized for masked text-to-SQL question/template matching.
- Not intended for general NLI, semantic similarity, or SQL generation.
- Assumes the same masking convention and candidate-template construction used by TeCoD.
- The `neutral` label is untrained; inference should use entailment vs. contradiction or renormalize over labels `{0, 2}`.
- Very long question pairs and non-English inputs are not validated.
- The reranking score is one signal in a larger text-to-SQL pipeline; it does not guarantee final SQL correctness.

## References

- TeCoD project page:
- TeCoD source repo:
- Base model:
- Training Data - BIRD Train Set:

If you use this model as part of TeCoD, please cite:

```bibtex
@article{10.1145/3769822,
  author    = {Jivani, Smit and Maheshwari, Saravam and Sarawagi, Sunita},
  title     = {Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding},
  journal   = {Proceedings of the ACM on Management of Data},
  volume    = {3},
  number    = {6},
  pages     = {1--26},
  year      = {2025},
  month     = dec,
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3769822},
  url       = {https://doi.org/10.1145/3769822}
}
```

## License

Apache 2.0
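
## Example: Scoring Without the Neutral Label

The Limitations section recommends renormalizing over labels `{0, 2}` because the `neutral` class is untrained. The sketch below shows one way to do that on raw logits, assuming the classification head emits three logits in the order `entailment`, `neutral`, `contradiction`, as the label list and the `{0, 2}` note suggest. The helper names `match_probability` and `rerank`, the second candidate template, and the logit values are all hypothetical and serve only as illustration; this is not TeCoD's own inference code.

```python
import math
from typing import List, Sequence, Tuple

# Assumed label order: entailment, neutral, contradiction.
ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2


def match_probability(logits: Sequence[float]) -> float:
    """Return P(entailment) from a softmax over labels {0, 2} only.

    The neutral logit (index 1) is ignored because it was never a training target.
    """
    e, c = logits[ENTAILMENT], logits[CONTRADICTION]
    m = max(e, c)  # subtract the max for numerical stability
    exp_e, exp_c = math.exp(e - m), math.exp(c - m)
    return exp_e / (exp_e + exp_c)


def rerank(
    candidates: Sequence[str], logits_per_candidate: Sequence[Sequence[float]]
) -> List[Tuple[float, str]]:
    """Pair each candidate template with its score and sort best first."""
    scored = [
        (match_probability(logits), candidate)
        for candidate, logits in zip(candidates, logits_per_candidate)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Masked candidate templates (premises); the second one is made up.
    templates = [
        "Show movies released in _ sorted by popularity desc",
        "List actors born in _",
    ]
    # Made-up logits standing in for the model's output on each
    # (template, user question) pair.
    logits = [[2.0, 5.0, -2.0], [-1.0, 3.0, 1.0]]
    for score, template in rerank(templates, logits):
        print(f"{score:.3f}  {template}")
```

Note that a large `neutral` logit does not affect the score here, which is the point of restricting the softmax to the two trained labels.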