Distillation approach and SQL dialect coverage

by O96a - opened 24 days ago

Interesting work on Text2SQL distillation — the synthetic dataset approach (distil-labs/text2sql-synthetic) is a smart way to get training scale without manual annotation overhead.

I'm curious about the SQL dialect coverage. Does the model handle dialect-specific syntax (PostgreSQL JSON operators, MySQL LIMIT OFFSET, T-SQL TOP)? We've seen that models trained on generic SQL often struggle with production databases where query patterns differ significantly from textbook examples.

Also, have you evaluated on cross-domain schemas? The Text2SQL generalization gap is usually most visible when table names and relationships don't match training distribution — would be useful to know how this distillation approach handles schema shifts.

distillabs

distil labs org 23 days ago

Thanks for the questions!

Dialect coverage: this model is SQLite-only for now. The distillation approach is dialect-agnostic though - you'd just need seed examples and a teacher that handles the target dialect, and the pipeline generates dialect-specific synthetic data the same way. Haven't done it yet, but it's on our radar.

Cross-domain schemas: the eval set uses schemas unseen during training, so the reported numbers (80% LLM-as-a-Judge, 60% exact match) do reflect some schema shift robustness. We haven't broken it down by domain distance yet.

If you have a specific dialect or domain you'd like to test, happy to collaborate. Always good to see where things break!

O96a

23 days ago

Thanks for your reply,

The 80% LLM-judge vs 60% exact match gap is interesting — that delta usually comes from queries that are structurally valid but fail on execution. Have you measured execution accuracy against a live SQLite instance? That's the most unforgiving signal.

On collaboration: I'm building a Text2SQL layer for Sudanese Arabic-annotated schemas. Dialect isn't the problem there; cross-lingual schema grounding is. If PostgreSQL support is on your roadmap, I'm happy to stress-test it against that kind of distribution shift.

Thanks in advance

distillabs

distil labs org 22 days ago

Good point on execution accuracy - we haven't run generated queries against a live SQLite instance yet. Right now the eval is LLM-as-a-Judge and exact match only. You're right that the gap between the two likely includes structurally valid but non-executable queries, and execution accuracy would be the cleanest signal. It's on our list.
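
For anyone following along, here is a rough sketch of what an execution-accuracy check against SQLite could look like. It is not part of the current eval. It assumes Python's built-in `sqlite3` module and a toy in-memory database. The `employees` table, the helper names `run_query` and `execution_match`, and the sample queries are all made up for illustration.

```python
import sqlite3


def run_query(conn, sql):
    """Execute sql and return its rows in a canonical order, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        # Syntax errors, unknown tables or columns, etc. count as non-executable.
        return None
    # Sorting ignores row order. That is a simplification: queries with
    # ORDER BY would need an order-sensitive comparison instead.
    return sorted(rows, key=repr)


def execution_match(conn, predicted_sql, gold_sql):
    """True only if the prediction executes and returns the same rows as the gold query."""
    gold_rows = run_query(conn, gold_sql)
    predicted_rows = run_query(conn, predicted_sql)
    return predicted_rows is not None and predicted_rows == gold_rows


# Hypothetical toy schema and data, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL);
    INSERT INTO employees VALUES (1, 'Ada', 120.0), (2, 'Bo', 90.0), (3, 'Cy', 150.0);
    """
)

gold = "SELECT name FROM employees WHERE salary > 100"
# Different text, same result set: exact match fails, execution match passes.
equivalent = "SELECT e.name FROM employees AS e WHERE NOT (e.salary <= 100)"
# Looks plausible but references a table that does not exist.
broken = "SELECT name FROM employee WHERE salary > 100"

results = {
    "equivalent": execution_match(conn, equivalent, gold),
    "broken": execution_match(conn, broken, gold),
}
print(results)
```

A check like this is only as strong as the data in the test database: two different queries can return the same rows on a small table by coincidence, so a real harness would want several populated databases per schema.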

The cross-lingual schema grounding problem sounds really interesting - that's a distribution shift we haven't explored at all. To clarify though, PostgreSQL support isn't on our roadmap as a built-in feature. But you could train a dialect-specific model yourself using our platform (distillabs.ai) with the right seed examples and teacher. If you'd like to try it for your Arabic-annotated schema use case, happy to give you free training credits.

Feel free to reach out directly - jacek@distillabs.ai - and we can set that up.

Vineethreddy changed discussion status to closed 20 days ago
