---
base_model: Qwen/Qwen3-Reranker-4B
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-classification
tags:
  - text-to-sql
  - text2sql
  - nl2sql
  - sql
  - sql-generation
  - template-matching
  - template-selection
  - constrained-decoding
  - database
  - nli
  - paraphrase
  - reranker
  - cross-encoder
  - qwen3
language:
  - en
license: apache-2.0
---

# TeCoD SQL Template Matcher

Fine-tune of [`Qwen/Qwen3-Reranker-4B`](https://huggingface.co/Qwen/Qwen3-Reranker-4B) used by [TeCoD](https://github.com/SSLab-CSE-IITB/tecod), a template-guided constrained decoding system for text-to-SQL.

This model is the TeCoD template-matching reranker. It scores whether a user question matches a retrieved masked question/template, helping TeCoD select recurring SQL templates before generation.

- Project page: <https://sslab-cse-iitb.github.io/tecod/>
- Source repository: <https://github.com/SSLab-CSE-IITB/tecod>
- Base model: <https://huggingface.co/Qwen/Qwen3-Reranker-4B>
- Training data source: [BIRD](https://bird-bench.github.io/) train split.

## Intended Use

This model is intended as an internal component of TeCoD and related template-based text-to-SQL systems. It is not a standalone SQL generator. In TeCoD, it is used after vector retrieval and before SQL generation to rerank candidate SQL templates.
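The position of the reranker in the pipeline can be sketched as follows. This is a minimal illustration, not TeCoD's implementation: `rerank_templates` and `toy_score` are hypothetical names, and `toy_score` is a token-overlap stand-in for the real model so that the example runs without downloading weights.

```python
from typing import Callable, List, Tuple


def rerank_templates(
    user_question: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Score each retrieved masked template against the user question.

    `score_pair` takes (masked_candidate, user_question), in that order,
    and returns a higher value for a better match.
    """
    scored = [(cand, score_pair(cand, user_question)) for cand in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_score(masked_candidate: str, user_question: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase tokens."""
    a = set(masked_candidate.lower().split())
    b = set(user_question.lower().split())
    return len(a & b) / max(len(a | b), 1)


# Candidates as they might come back from vector retrieval (made-up examples).
candidates = [
    "Show movies released in _ sorted by popularity desc",
    "List customers from _ with more than _ orders",
]
best = rerank_templates(
    "Show movies released in 2010 sorted by popularity",
    candidates,
    toy_score,
)
print(best[0][0])
```

In a real deployment `score_pair` would wrap a forward pass of this model; the selected template would then be handed to the SQL generation stage.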

## Input Format

The model is used as a cross-encoder over a question pair. Order matters: the first sequence (the NLI premise) should be the masked candidate/template question, and the second sequence (the NLI hypothesis) should be the raw user question.

```text
Premise:    "Show movies released in _ sorted by popularity desc"
Hypothesis: "What are the top films from 2010 by viewer count?"
```

Entity values in the candidate question are masked with a space-padded underscore `_`. The same mask token is used for strings, numbers, dates, and other literal values. Swapping the input order or changing the masking convention can degrade reranking quality.
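The masking convention and input order can be illustrated with a short sketch. It assumes the literal values to mask are already known (how TeCoD identifies them is not described here), and `mask_literals` and `build_pair` are hypothetical helper names rather than part of any released API.

```python
import re
from typing import Iterable, Tuple

MASK = "_"


def mask_literals(question: str, literals: Iterable[str]) -> str:
    """Replace each known literal value with a space-padded underscore."""
    masked = question
    # Longest first, so a literal contained in another is not masked piecemeal.
    for value in sorted(literals, key=len, reverse=True):
        masked = re.sub(re.escape(value), f" {MASK} ", masked)
    # Collapse the extra whitespace introduced by the padding.
    return re.sub(r"\s+", " ", masked).strip()


def build_pair(masked_candidate: str, user_question: str) -> Tuple[str, str]:
    """Return the pair in the order described above: masked candidate first."""
    return (masked_candidate, user_question)


masked = mask_literals(
    "Show movies released in 2010 sorted by popularity desc", ["2010"]
)
pair = build_pair(masked, "What are the top films from 2010 by viewer count?")
print(pair)
```

Note that this simple substring replacement would also mask a literal that happens to appear inside a longer word; a production masker would need token-aware matching.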

## Training Summary

- Base model: `Qwen/Qwen3-Reranker-4B`
- Architecture: `Qwen3ForSequenceClassification`
- Data: approximately 1.48M NLI pairs derived from BIRD questions.
- Positive pairs: template-paired questions, self paraphrases, and partner paraphrases that preserve the SQL template.
- Negative pairs: hard negatives mined using nearest-neighbor retrieval over masked questions, with both masked and unmasked query variants used during pair construction.
- Labels: `entailment`, `neutral`, `contradiction`.
- The `neutral` label is retained for compatibility with a 3-class NLI head but was not used as a training target.

## Limitations

- Specialized for masked text-to-SQL question/template matching.
- Not intended for general NLI, semantic similarity, or SQL generation.
- Assumes the same masking convention and candidate-template construction used by TeCoD.
- The `neutral` label is untrained, so its logit is not meaningful; inference should compare only entailment and contradiction, for example by renormalizing the softmax over label indices `{0, 2}`.
- Very long question pairs and non-English inputs are not validated.
- The reranking score is one signal in a larger text-to-SQL pipeline; it does not guarantee final SQL correctness.
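Renormalizing over the two trained labels amounts to a two-way softmax that ignores the `neutral` logit. The sketch below assumes the label indices follow the order listed in the training summary (entailment 0, neutral 1, contradiction 2); check the model's `id2label` mapping before relying on it. `match_probability` is a hypothetical helper name.

```python
import math
from typing import Sequence

# Assumed index mapping; verify against the model config's id2label.
ENTAILMENT, NEUTRAL, CONTRADICTION = 0, 1, 2


def match_probability(logits: Sequence[float]) -> float:
    """Softmax restricted to entailment and contradiction.

    The neutral logit is ignored because that class was never a
    training target.
    """
    e = logits[ENTAILMENT]
    c = logits[CONTRADICTION]
    m = max(e, c)  # subtract the max for numerical stability
    exp_e = math.exp(e - m)
    exp_c = math.exp(c - m)
    return exp_e / (exp_e + exp_c)


# The large neutral logit has no effect on the result.
p = match_probability([2.0, 5.0, 0.0])
print(round(p, 4))
```

The result equals a sigmoid of the entailment logit minus the contradiction logit, so it can be thresholded or used directly as a ranking score.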

## References

- TeCoD project page: <https://sslab-cse-iitb.github.io/tecod/>
- TeCoD source repo: <https://github.com/SSLab-CSE-IITB/tecod>
- Base model: <https://huggingface.co/Qwen/Qwen3-Reranker-4B>
- Training data (BIRD train split): <https://bird-bench.github.io/>

If you use this model as part of TeCoD, please cite:

```bibtex
@article{10.1145/3769822,
  author = {Jivani, Smit and Maheshwari, Saravam and Sarawagi, Sunita},
  title = {Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding},
  journal = {Proceedings of the ACM on Management of Data},
  volume = {3},
  number = {6},
  pages = {1--26},
  year = {2025},
  month = dec,
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  doi = {10.1145/3769822},
  url = {https://doi.org/10.1145/3769822}
}
```


## License

Apache 2.0