diff --git "a/data/chunks/2603.10697_semantic.json" "b/data/chunks/2603.10697_semantic.json"
new file mode 100644
--- /dev/null
+++ "b/data/chunks/2603.10697_semantic.json"
@@ -0,0 +1,899 @@
+[
+  {
+    "chunk_id": "28c6607a-7e36-4774-82c7-0d939e034eb0",
+    "text": "Tianshu Zhang Kun Qian Siddhartha Sahai\nThe Ohio State University Adobe Inc.\nColumbus, OH Seattle, WA Seattle, WA\nzhang.11535@osu.edu kunq@adobe.com siddharthas@adobe.com\nYuan Tian Shaddy Garg Huan Sun\nPurdue University Adobe Inc. The Ohio State University\nWest Lafayette, IN Bangalore Columbus, OH\ntian211@purdue.edu shadgarg@adobe.com sun.397@osu.edu\nYunyao Li\nAdobe Inc.\nSan Jose, CA\nyunyaol@adobe.com",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 1,
+    "total_chunks": 39,
+    "char_count": 404,
+    "word_count": 52,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "bbc88da4-c415-487a-8b2d-3f0fa11d98a8",
+    "text": "3668, 2025.11 ABSTRACTNeural text-to-SQL models, which translate natural language ques- ROBUSTNESS AGAINST SCHEMA EVOLUTION. PVLDB, 18(10): 3655 -\ntions (NLQs) into SQL queries given a database schema, have doi:10.14778/3748191.3748222\nachieved remarkable performance. However, database schemas PVLDB Artifact Availability:\nfrequently evolve to meet new requirements. Such schema evo- The source code, data, and/or other artifacts have been made available at\nlution often leads to performance degradation for models trained https://github.com/zhangtianshu/EvoSchema.\non static schemas. Existing work either mainly focuses on simply\nparaphrasing some syntactic or semantic mappings among NLQ, 1 INTRODUCTION[cs.DB]\nDB and SQL, or lacks a comprehensive and controllable way to\nText-to-SQL parsing aims to translate natural language questions\ninvestigate the model robustness issue under the schema evolution,\n(NLQs) into SQL queries given a database schema, enabling the\nwhich is insufficient when facing the increasingly complex and rich\ndevelopment of natural language interfaces that allow users to\ndatabase schema changes in reality, especially in the LLM era.\nquery data and invoke services without requiring programming\nTo address the challenges posed by schema evolution, we present\nskills [18, 27, 29, 32, 33, 36]. Existing neural text-to-SQL models\nEvoSchema, a comprehensive benchmark designed to assess and\nhave achieved remarkable performance on existing benchmarks\nenhance the robustness of text-to-SQL systems under real-world\n[18, 32], which play an important role in empowering different\nschema changes. 
EvoSchema introduces a novel schema evolution\nplatforms such as business and marketing platforms [26, 34] and\ntaxonomy, encompassing ten perturbation types across columnbeing integrated into virtual assistants to enable real-time data\nlevel and table-level modifications, systematically simulating the\nquery and analysis [4].\ndynamic nature of database schemas. Through EvoSchema, we conHowever, database schemas are not static; they frequently evolve\nduct an in-depth evaluation spanning different open-source and\nto accommodate new use cases and improve efficiency [3, 11]. For\nclosed-source LLMs, revealing that table-level perturbations have\ninstance, depending on the scenario, a large patient table might be\na significantly greater impact on model performance compared\nmerged from or split into two tables: a patient information table\nto column-level changes. Furthermore, EvoSchema inspires the\nand a patient diagnosis table (Figure 1-c), to reduce redundancy, endevelopment of more resilient text-to-SQL systems, in terms ofarXiv:2603.10697v1 hance data integrity, and optimize performance [14]. Such schema\nboth model training and database design. The models trained on\nevolution occurs frequently, which often leads to distribution shifts\nEvoSchema's diverse schema designs can force the model to dis-\n[13, 24] such as nomenclature shifts, data granularity shifts, table\ntinguish the schema difference for the same questions to avoid\nand column relation shifts and schema complexity shifts. These\nlearning spurious patterns, which demonstrate remarkable robustdistribution shifts can cause significant performance degradation\nness compared to those trained on unperturbed data on average.\nwhen the model trained on old database schema is adapting to new\nThis benchmark offers valuable insights into model behavior and a\nschema designs.\npath forward for designing systems capable of thriving in dynamic,\nreal-world environments.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 2,
+    "total_chunks": 39,
+    "char_count": 3555,
+    "word_count": 487,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "a823e19e-f6fe-49fe-941f-5c6bce56323e",
+    "text": "This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.\nProceedings of the VLDB Endowment, Vol. 18, No. 10 ISSN 2150-8097. doi:10.14778/3748191.3748222\nPVLDB Reference Format:\nTianshu Zhang, Kun Qian, Siddhartha Sahai, Yuan Tian, Shaddy Garg, Huan Sun, and Yunyao Li. EVOSCHEMA: TOWARDS TEXT-TO-SQL\nFigure 1: The left (a) is the overview of the framework to collect EvoSchema dataset. The top right (b) is a column-level schema evolution example; the bottom right (c) is a table-level schema evolution example.\nThis challenge highlights a critical issue in model robustness: how well can a text-to-SQL model adapt to changes in the database schema? Recent studies introduce evaluation benchmarks designed to expose robustness issues by perturbing NLQs, databases or SQL queries [2, 6, 20, 23]. However, these studies have at least one of the following limitations: 1) mainly focus on the syntactic paraphrasing or simple semantic mappings among NLQ, DB and SQL\nMoreover, the training set in EvoSchema can be used to enhance models' robustness. The models can be trained with the same questions but coupled with different schema designs to generate the corresponding SQL queries. This training procedure forces the model to distinguish the schema difference which can help models gain a stronger ability to recognize the correct table and column relation and map them to the questions.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 3,
+    "total_chunks": 39,
+    "char_count": 1690,
+    "word_count": 253,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "01b8925a-1f0c-496a-8cbb-e55536a61751",
+    "text": "Our experimental results\n[2, 6]; (2) lack a taxonomy of comprehensive schema evolution demonstrate that the perturbation training data in EvoSchema can\ntypes [23]; (3) only focus on schema evolution that does not lead help train better text-to-SQL models, which are more robust to difto SQL changes [20]. These efforts are insufficient in the face of ferent schema evolution types on average, especially on table-level\nincreasingly complex and rich database schema changes found in perturbations.\nreality. Meanwhile, while it is natural to consider collecting new In summary, our main contributions are as follows:\ndata after schema evolution for retraining a model, repeating the\nentire model training life cycle frequently can be costly in terms of • We formulate a critical schema evolution adaptive textboth time and resources. to-SQL problem and present a new dataset, EvoSchema to\nUnder this background, we seek to answer the following two study this problem. We introduce a comprehensive taxonquestions: (1) How sensitive are existing text-to-SQL models to omy of the schema evolution types and build the datasets\nvarious types of database schema changes? (2) How can we train based on the taxonomy to get realistic schema designs by\na more robust text-to-SQL model that not only performs well on column-level and table-level perturbations on BIRD.\nexisting database schemas but also adapts effectively to schema • We conduct thorough and comprehensive assessment of\nchanges? Towards this end, we introduce EvoSchema, a new dataset model robustness against various schema perturbations\nthat covers a wide range of realistic schema design changes by spanning different open-source and closed-source LLMs\nperturbations on BIRD [18]. 
As illustrated in Figure 1 and Figure on our evaluation benchmark, and find that table-level per-\n2, EvoSchema builds upon our newly defined taxonomy, which turbations have a significantly greater impact on model\nencompasses a total of ten types of perturbations over schema, performance compared to column-level changes. Besides,\ncovering both column-level and table-level changes. Column-level we introduce two evaluation metrics: Table Match F1 and\nperturbations include adding, removing, renaming, splitting and Column Match F1, to rigorously evaluate the performance\nmerging columns, while table-level perturbations involve adding, of text-to-SQL models under schema evolution scenarios\nremoving, renaming, splitting, and merging tables. We keep the and provide fine-grained insights into model robustness. NLQs fixed and examine the robustness of a model under different • Our constructed training set inspires a new training paragranularities of schema evolution, and show that existing models digm: augmenting the existing training data with different\nare more easily affected by table-level perturbations than column- schema designs, which not only increase the data diverlevel perturbations. sity, but also force the model to distinguish the schema Figure 2: An overview of different perturbation types of EvoSchema. The top is an unperturbed example in BIRD [18]; the middle\nis the column-level perturbation; the bottom is the table-level perturbation. \"Remove Col in SQL\": remove columns that appear\nin gold SQL; \"Remove Tables\": the relevant tables appear in gold SQL are removed. Thus there is no gold SQL for these two cases. Note we don't illustrate \"Merge Columns\" in the figure as this example is not suitable for applying merging column changes. difference during training.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 4,
+    "total_chunks": 39,
+    "char_count": 3530,
+    "word_count": 526,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "6f1d20f3-b8c5-4733-855e-2d5ee999840b",
+    "text": "Our approach yields better text-to-SQL models that achieve up to 33 points gain on different types of schema perturbation evaluation data, compared to models trained on unperturbed, original training data.\n2 RELATED WORK\nRobustness in Text-to-SQL: Existing research on text-to-SQL robustness is mainly two-fold: robustness evaluation and robustness training.\nparaphrasing or simple semantic mappings, such as different representations of numbers or name abbreviations across NLQ, DB, and SQL [2, 6]. While some work analyzes schema changes, they mainly focus on irrelevant column modifications that do not affect SQL [20] or with limited perturbation types [23]. These efforts are insufficient in the face of increasingly complex and rich database schemas found in modern datasets. Though FootballDB [8] tackles a similar schema design problem for better SQL written, they focus on reducing multiple foreign key mappings among tables and reducing the JOIN paths in the SQL. Different from theirs, we tackle",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 5,
+    "total_chunks": 39,
+    "char_count": 1006,
+    "word_count": 149,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "b621025e-8cd0-43a9-9bb4-8f136e1993e5",
+    "text": "Recent studies introduce evaluation benchmarks designed to expose robustness issues by perturbing NLQs, databases or SQL queries. However, these studies tend to focus on syntactic\nthe schema evolution problem, which is not only for the schema design on the existing data, but also needs to consider how new data and information will change the schema design.\napproach it through a different angle, where our scheme design contains 10 column-level and table-level changes. And our provided schema evolution framework allows us to try different schema design on multiple databases to get more generalizable findings, while FootballDB [8] can only support the exploration on a single database.\nand combine it with the NLQ as input. This input is then used to prompt the model to generate the corresponding SQL query.\n3.2 Rationale for Schema Evolution Types\nWhen a database schema evolves, it can induce distribution shifts",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 6,
+    "total_chunks": 39,
+    "char_count": 920,
+    "word_count": 144,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "e150257d-5fd6-4983-abd6-c568e3fe3f41",
+    "text": "Moreover, the advent of LLMs has mitigated many linguistic\nin the data that may impact model performance. We categorize\nchallenges, further emphasizing the need for robust adaptation to\npotential distribution shifts into four types: nomenclature shifts,\nstructural changes in database schemas. For robust training, exdata granularity shifts, table and column relation shifts, and schema\nisting methods employ strategies like decomposing tasks so that\ncomplexity shifts. (1) Nomenclature shifts occur when tables and\nmodels generate each sub-clause individually before merging them\ncolumns are renamed, which may alter the convention of the es-\n[9], or using execution-guided decoding to eliminate incorrect subtablished terminology within the schema. For example, tables origclauses [30]. While these approaches focus on enhancing various\ninally named \"Products\", \"Customers\", and \"Orders\" might be reaspects of text-to-SQL robustness, our work specifically addresses\nnamed to \"Items\", \"Clients\", and \"Purchases\", respectively. Such\nthe challenge of schema evolution.\nchanges often reflect updates in business terminology or compliLLMs for Text-to-SQL: Most recently, the LLM-based approaches\nance with new standards. A desired model should handle those\nfor text-to-SQL are mainly two-fold: in-context learning [10, 15, 16,\nnomenclature shifts to adapt to the new terminology. (2) Data gran-\n27, 35] and finetuning [15–17, 38]. 
The former prompts proprietary\n1 2 ularity shifts arise from adding or removing columns or tables,LLMs such as GPT series and Claude for SQL generation withwhich changes the level of detailedness captured in the database.\nout additional model training, while the latter involves adapting\nFor instance, an \"Employee\" table with a single \"ContactNumber\"\nopen-source LLMs to text-to-SQL datasets, tailoring these models\nfield might involve another two separate \"WorkContact\" and \"Perdirectly to the task through supervised learning. These models are\nsonalContact\" fields later. This increases the data granularity to\ndesigned for question understanding, schema comprehension and\nmeet new requirements, necessitating models to adapt to more comSQL generation, which have achieved remarkable performance on\nplex and detailed semantics. (3) Table and column relation shifts\nthe existing open benchmarks [18, 32]. Liu et al. [19] provides a\nand schema complexity shifts mainly result from restructuring tacomprehensive review of the NL2SQL lifecycle, covering models,\nbles through splitting or merging. This process can highly affect\nbenchmarks, data synthesis, evaluation, and error analysis. While it\nhow each table is related to other tables by which column.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 7,
+    "total_chunks": 39,
+    "char_count": 2682,
+    "word_count": 378,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "ed744798-3f67-4c61-ad97-a4e947430d18",
+    "text": "Both\nidentifies schema variation as a challenge, it does not explore it in\nthe primary keys and foreign keys may change along with the tadepth. Our work focuses specifically on schema evolution robustble restructure. Besides, the schema complexity may change when\nness by evaluating recent and powerful LLMs (e.g., Code Llama,\nmultiple tables merge from or split into one table. A desired model\nMistral, SQLCoder, LLaMA 3, GPT-3.5, GPT-4) without preprocessis expected to be robust to such changes. By categorizing the distriing or postprocessing. We introduce EvoSchema, a benchmark with\nbution shifts caused by schema evolution, we can more effectively\ncontrolled schema perturbations that guides both evaluation and\nunderstand and evaluate a model's capacity to adapt to changes in\nstructured training data synthesis. In addition to standard execution\nthe underlying database schema.\naccuracy and human evaluation, we propose two fine-grained metrics: Table Match F1 and Column Match F1 that directly reflect our\ntable-level and column-level perturbation taxonomy. Li et al. [15] 3.3 Schema Evolution Synthesis Framework\nevaluates LLMs on unperturbed Spider and BIRD datasets and also Our study aims to cover comprehensive potential schema evolution\nexperiments on natural language variation but keep schema and types, which can foster the robustness evaluation of the existing\nSQL fixed; in contrast, our work systematically varies the schema text-to-SQL models and inspire robust model training. We synthewhile keeping the natural language fixed. size all the schema evolution types through hybrid strategies, which\nwill leverage both the heuristic rules to guarantee the data quality\n3 EVOSCHEMA DATASET and LLMs to ensure diversity. 
Broad Coverage of Different Schema Evolution Types: We aim\n3.1 Background\nto encapsulate a broad range of schema evolution types, recognizing\nIn the dynamic landscape of databases, schemas frequently evolve their prevalence and impact in real-world scenarios. Specifically, our\nto meet new demands, introducing significant challenges for text- schema evolution taxonomy includes both column-level and tableto-SQL models [3, 5]. These schema changes can vary widely, from level perturbations, which are categorized into ten distinct types.\nminor modifications to complete restructuring, and can significantly Column-level perturbations comprise five types: adding, removing,\nimpact the performance of models trained on static schemas. In renaming, splitting and merging columns, where modifications\nrealistic scenarios, a database can often contain a large number of are restricted to the columns within existing tables. Table-level\ntables, and only several related tables are responsible for a natural perturbations encompass five types: adding, removing, renaming,\nlanguage question (NLQ). In our experiment, we represent the splitting, and merging tables.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 8,
+    "total_chunks": 39,
+    "char_count": 2897,
+    "word_count": 415,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "84639890-3b87-437b-b5a6-c1560083236b",
+    "text": "relevant database schema using Data Definition Language (DDL) 3\nThese perturbations occur frequently in practice, underscoring the need for text-to-SQL models that can robustly handle such changes.\nHybrid Data Synthesis Strategies: To ensure both diversity and quality in the generation of schema perturbations, we employ a combination of heuristics and GPT models to synthesize various\n1https://platform.openai.com/docs/models\n2https://www.anthropic.com/news/claude-3-family\n3DDL defines the structure and properties of a database, providing detailed information necessary for database creation, including column types and primary/foreign keys.\nFigure 3: This figure shows two examples of our data collection procedure of EvoSchema. The top (a) is a \"rename columns\" data collection procedure; the bottom (b) is a \"split tables\" data collection procedure.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 9,
+    "total_chunks": 39,
+    "char_count": 856,
+    "word_count": 113,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "428fc9e6-2e2e-4997-9a47-788b086660f0",
+    "text": "The blue box indicates prompting GPT models\nfor the generation. \"</>\" means programmatically processing the data. For each given seed instance in BIRD [18], 3.5 Data Generation\nconsisting of a <NLQ, relevant schema, SQL> triple, we maintain We design a framework to simulate different types of schema perthe natural language question (NLQ) fixed across all perturbation turbations in a configurable way. For adding or renaming columns,\ntypes, while only modifying the relevant schema. The correspond- both the modified column size and the column position in the tables\ning SQL query is adjusted as necessary to remain consistent with are set randomly, and we set the original column size in the table\nthe changes in the database schema. as the maximum number of columns to be changed. For removing columns, we can randomly remove important or unimportant\ncolumns from the existing relevant tables. The important columns\n3.4 Seed Dataset Selection are the columns that appear in the gold SQL, which will inevitably\naffect the prediction. For adding, removing, or renaming tables, we\nFor building Evoschema benchmark, we utilize the BIRD [18] dataset\nrandomly add, remove or rename one or multiple tables.\nas the seed data, which is specifically designed for the text-to-SQL\nSchema Change: To ensure the diversity and reasonability of the\ntask. Compared to Spider [32], which is commonly used to study\nsynthesized schema, we leverage the capabilities of GPT-3.5 and\ntext-to-SQL robustness, BIRD features more intricate, realistic, and\nGPT-4 to synthesize realistic and contextually appropriate columns\nextensive databases, as well as more complex SQL queries that inor tables, which help effectively produce high-quality synthetic data\nclude keywords often missing in Spider. BIRD consists of NLQs,\nthat meets our requirements. 
For adding or renaming columns and\ncorresponding database schemas, and gold SQL queries and entables, we input the existing relevant tables to GPT-3.5, and let the\ncompasses a wide range of real-world database scenarios, which\nmodel generate the potential tables or columns that fit the context.\nprovides a robust foundation for evaluating the performance of\nFor splitting tables or merging tables, since they are more complex\nmodels in translating NLQs into SQLs.\nthan other perturbations, we use GPT-4 to choose the tables that\nSchema Perturbations: To evaluate the robustness of the text-tocan be split or merged and then use the modified tables to replace\nSQL models, EvoSchema not only includes the BIRD dataset in their\nthe original ones. For adding or renaming columns and tables, we\noriginal form but also augmented it with various column-level and\napply heuristics to filter out the repeated ones in the synthesized\ntable-level schema perturbations. We ensure that the NLQs remain\ntables or columns. Besides, to ensure the correct relationship among\nfixed, while the schema and SQL queries are adjusted as necessary\ndifferent tables after modifying the schema, we apply heuristics to\nto reflect the changes introduced by our perturbations. We follow\nensure all the foreign keys change along with their referenced table\nthe standard train/dev split provided with BIRD, and apply all the\nnames and column names. When removing columns or tables, any\nperturbations on both training data and evaluation data.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 10,
+    "total_chunks": 39,
+    "char_count": 3338,
+    "word_count": 517,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "2f25fe24-eb7d-487a-8fa7-66f901fbdbab",
+    "text": "The data\nforeign keys in other tables that reference the removed columns or\nstatistics of EvoSchema are in Table 2 and the examples of different\ntables will be removed as well.\nperturbation types are in Figure 2. SQL Change: To ensure the consistency of the <NLQ, relevant prompt GPT-3.5 to generate similar, context-appropriate names.\nschema, SQL>, after we change the relevant table schema, we re- These synthesized names replace the original column names. In\nvise the gold SQL accordingly. Since the NLQs are the same for addition, in order to maintain the correctness of the relationship\nadding or removing columns and tables, and the schema evolution among the tables, If the column in one table has been renamed, we\nhere doesn't affect answering the questions, we keep the gold SQL will also rename the foreign keys in other tables if those columns\nunchanged for these perturbation types. For renaming columns reference the renamed one. We also revise gold SQL accordingly to\nor tables, we revise gold SQL if they appear in the gold SQL. For ensure that the revised schema and gold SQL remain aligned with\ntable splitting or merging, due to the complexity and variation in the unchanged NLQ.\nthe required SQL changes, we use GPT-4 to revise the gold SQL. Split columns: Since columns such as name, address, and date are\nThis revision is based on the mappings from the original to the often stored in more fine-grained formats in real-world databases\nnew tables and columns, as well as the necessary adjustments to (e.g., a full name split into first and last name; a date split into year,\nthe JOIN paths. We manually check the edited gold SQL for the month and day; an address split into state, city and street, etc),\nevaluation benchmark to make sure they are correct. we identify examples in BIRD dev set that involve these attributes\nand manually split the corresponding columns into finer-grained\ncolumns for evaluation. 
As these changes affect the structure of the\n3.6 Data Collection of Each Perturbation Type gold SQL queries, we manually revise the gold SQL to reflect the\nWe first define heuristics for different perturbation types, then updated schema. For the training set, we similarly select examples in\ncombine both GPT models' generation ability and programming BIRD train set involving name, address, or date, and use Claude 3.5\nto collect the data. Finally, we incorporate a human verification to synthesize the corresponding fine-grained columns and update\nstage to control the data quality.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 11,
+    "total_chunks": 39,
+    "char_count": 2515,
+    "word_count": 415,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "d5f15874-14a3-44ab-8a0a-f804157149b5",
+    "text": "Here are some general heuristics we should consider to maintain consistency and avoid conflicts when manipulating data: 1) Preserve Meaning: For renaming, the new column or table name should reflect the same meaning as the original name to avoid semantic confusion. 2) Avoid Conflicts: Ensure that the new column or table name does not conflict with existing column or table names within the same or other tables in the database. 3) Update References: Update all references to the new column or tables in foreign keys in other tables. 4) Revise SQL: Update all SQL queries referencing the new columns or tables to work correctly after the renaming.\nthe gold SQL accordingly.\nMerge columns: As the reverse of column splitting, we simulate more abstract column representations commonly seen in real-world databases (e.g., combining first and last name into full name; year, month, and day into date; state, city, and street as address). We identify relevant examples in the BIRD dev set and manually merge fine-grained columns, updating the gold SQL accordingly. For training, we apply the same strategy to the BIRD train set and use Claude 3.5 to synthesize the merged schema and update the gold SQL.\nAdd tables: We randomly add irrelevant tables to each question,",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 12,
+    "total_chunks": 39,
+    "char_count": 1263,
+    "word_count": 206,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "f5f2876c-9f7b-4333-9c80-b83c7278a4e5",
+    "text": "and these tables are still in the same database as the relevant tables in BIRD. The original BIRD datasets guarantee that no different tables in their database can lead to alternative correct SQL answers. The tables added don't affect the NLQ and the gold SQL.\nRemove tables: In this scenario, we randomly remove tables that are referenced in the gold SQL query from the relevant schema. As a result, the gold SQL becomes invalid. Instead, we use the response \"The given table information is insufficient to generate an SQL query to answer the question\" as the ground truth.\nRename tables: we input both the table name and all of its column names and data types to GPT-3.5.\nThese heuristics aim to ensure that those perturbations are performed systematically, maintaining the database's integrity and compatibility with SQL queries. The details for each perturbation type are as follows:\nAdd columns: we input both the table name and all of its column names and data types to GPT-3.5 and prompt it to generate multiple column names and their corresponding data types that are reasonable and consistent with common sense in the current scenario. We also prompt GPT-3.5 not to generate column names that have a similar meaning to the existing input column names. Then we add a heuristic guarantee to filter out the",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 13,
+    "total_chunks": 39,
+    "char_count": 1332,
+    "word_count": 223,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "43b560b6-5e16-4fd6-abb2-3cb943635ac9",
+    "text": "redundant columns if the generated column names are repeated. These synthesized columns are then randomly inserted into the relevant tables. Notably, both the NLQ and the gold SQL remain unchanged during this process.\nRemove columns: We randomly eliminate columns from the given schema, ensuring that the removed columns do not appear in the gold SQL query. Again, the NLQ and the gold SQL are kept fixed during this operation.\nRemove columns in gold SQL: In this scenario, we randomly remove columns from the schema, specifically targeting those referenced in the gold SQL query. As a result, the gold SQL becomes invalid. Instead, we use the response \"The given column information is insufficient to generate an SQL query to answer the question\" as the ground truth.\nWe randomly select one or multiple table names and prompt GPT-3.5 to generate similar, context-appropriate names. These synthesized names replace the original table names. In addition, in order to maintain the correctness of the relationship among the tables, we will also rename the foreign keys in other tables if they reference the renamed table. Finally, the table names in the gold SQL will also be renamed.\nSplit tables: as Figure 3 (b) shows, we input both the table name and all of its column names and data types to GPT-4. We prompt GPT-4 to identify tables that can be logically divided into two or more smaller tables. Using GPT-4, we generate new table names and distribute the columns of the original table among the new tables in a contextually appropriate manner. The primary key in the original table will be copied into all the new tables after splitting. The gold",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 14,
+    "total_chunks": 39,
+    "char_count": 1654,
+    "word_count": 277,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "0a21cc12-6492-4ce2-bcf1-24918c6227c7",
+    "text": "SQL is revised by GPT-4 to reference the newly created tables, ensuring consistency across all components. We also manually check the new gold SQL to make sure it's correct.\nRename columns: as Figure 3 (a) shows, we input both the table name and all of its column names and data types to GPT-3.5. We randomly select multiple column names and their data types and\nTable 1: Statistics of EvoSchema compared with existing benchmarks. \"Tab\": tables; \"DB\": database; \"Col\": columns; \"PK\": primary keys; \"FK\": foreign keys.\nColumn groups: Schema Evolution (Column-level, Table-level); Features of Seed Data, Average (Tab/DB, Col/DB, Col/Tab, PK/DB, FK/DB).\nPerturbation Data | Column-level | Table-level | Affects SQL | Multiple DB | Seed Data | Tab/DB | Col/DB | Col/Tab | PK/DB | FK/DB\nFootballDB [8] | - | reduce PK/FK references, reduce JOIN paths | ✓ | ✗ | FIFA World Cup [1] | 15 | 107 | 7.1 | - | 16\nDr.Spider [2] | Rename | ✗ | ✓ | ✓ | Spider [2] | 5.1 | 22.1 | 5.4 | 3.7 | 3.2\nADVETA [23] | Add; Rename | ✗ | ✓ | ✓ | Spider [2] | 5.1 | 22.1 | 5.4 | 3.7 | 3.2\nMT-TEQL [20] | Add; Remove; Shuffle; Rename | Split; Merge; Shuffle | ✗ | ✓ | Spider [2] | 5.1 | 22.1 | 5.4 | 3.7 | 3.2\nEvoSchema (Ours) | Add; Remove; Rename; Split; Merge | Split; Merge; Rename; Add; Remove | ✓ | ✓ | BIRD [18] | 7.3 | 72.5 | 10.6 | 6.5 | 9.3\nalso indicates that LLM-generated split and merge tables include around 30% low-quality data, underscoring the need for careful human validation for these two types.\n3.7 Comparison with Existing Benchmarks\nEvoSchema, as presented in Table 1, introduces a comprehensive and unique taxonomy for evaluating models' behavior under the impact of schema evolution on SQL queries, distinguishing itself from other benchmarks like Dr.Spider [2], ADVETA [23], MT-TEQL [20] and FootballDB [8]. Unlike Dr.Spider and ADVETA, which focus on limited perturbations such as column renaming and additions, EvoSchema encompasses a broader range of transformations, including adding, removing, renaming, splitting and merging at both the column level and table level. This diversity allows for testing systems under realistic and dynamic schema evolution scenarios. Furthermore, while MT-TEQL includes a variety of perturbations, it only modifies the columns not mentioned in the SQL, so it does not consider the impact of schema evolution on SQL directly. EvoSchema uniquely integrates schema evolution with its effects on SQL queries, enabling evaluation of models in environments that closely mimic real-world database evolution challenges. Different from FootballDB [8], which mainly restructures the schema to reduce foreign key mappings among tables and reduce JOIN paths for SQL, we define a more configurable, systematic and structured schema evolution taxonomy.\nMerge Tables: We select two or more related tables and combine them into a single table. GPT-4 is used to generate a suitable name for the merged table, and the columns from the original tables are consolidated under this new table. More concretely, GPT-4 is prompted to 1) copy all the primary key columns of the original tables to the new table after merging, but only keep one of them as the primary key of the new table, and make the others regular columns. 2) if the primary key columns in these two original tables are the same, then just keep one in the new table after merging. 3) when merging tables, if there are two columns that are not primary key columns but have the same name in the original tables, revise their column names accordingly to make them different when merging them into the new table. Finally, the gold SQL is updated by GPT-4 accordingly. We also manually check the new gold SQL to make sure it's correct.\nQuality Control: To ensure high-quality data in EvoSchema, we leverage advanced language models and rigorous human validation. Specifically, we use GPT-3.5 to generate synthesized column and table names and data types (only for columns) when adding or renaming is required. We randomly choose 200 generated examples for manual review and find that GPT-3.5 demonstrates a strong understanding of the input context, effectively generating names that meet our requirement. For more complex operations, such as splitting or merging tables, we utilize the capabilities of the more powerful GPT-4 to handle both schema changes and corresponding",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 15,
+    "total_chunks": 39,
+    "char_count": 4129,
+    "word_count": 646,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "5cd3fd47-dbd6-4cfb-ae23-38ec65652170",
+    "text": "SQL modifications with high accuracy. To complement these automated processes, we engaged five annotators with substantial SQL expertise to carefully review cases involving complex schema transformations. Annotators validated and, where necessary, manually corrected the generated gold SQL queries to ensure correctness and alignment with the modified schemas.\nBesides, our provided schema evolution and synthesis framework allows us to explore the schema change on multiple databases easily, while FootballDB is only limited to one database. Finally, for the seed data selection, compared to Spider, which is commonly used to study text-to-SQL robustness, BIRD features more intricate, realistic, and extensive databases, as well as more complex SQL queries that include keywords often missing in",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 16,
+    "total_chunks": 39,
+    "char_count": 791,
+    "word_count": 106,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "1564a09b-604e-41d9-8c0a-512c474a19e4",
+    "text": "Spider. These distinctions make EvoSchema particularly well-suited for studying how systems adapt to evolving schemas, advancing beyond the simpler or less holistic setups of prior benchmarks.\nTo further enhance reliability, we implemented cross-validation by assigning complex cases to multiple annotators and resolving discrepancies through discussion or consensus. This combination of advanced AI tools and meticulous human review ensures that EvoSchema maintains a robust and accurate benchmark, faithfully reflecting real-world schema evolution scenarios.\nCost Analysis: We have 1.5K split-table examples and 1.1K merge-table examples requiring human verification. Among the split examples, 1.1K are relatively simple and take approximately 3 minutes each to verify, while the remaining 0.4K are more complex and require about 7 minutes each, totaling roughly 100 hours. For the merge-table examples, 0.8K are simple (3 minutes each) and 0.3K are complex (7 minutes each), amounting to approximately 75 hours. Note this manual effort was for curating the evaluation data, not the training data. Our training data is generated entirely automatically without any human annotation or manual verification. Our analysis\nTable 2: Data statistics of EvoSchema. \"Original\" refers to the original BIRD dataset; \"Column Manipulation\" refers to applying the column-level operations on the columns of the original BIRD data; \"Table Manipulation\" refers to applying the table-level operations on the tables of the original BIRD data. \"*\": the evaluation data for calculating execution accuracy. We synthesize values to reconstruct the database after schema evolution, and filter out those not executable by gold SQL, which results in the smaller size of the evaluation data for calculating execution accuracy.\nColumns: Perturbation Type; Data Statistics (Train, Eval, Eval*); Manipulated Items/Table (Min, Mean, Median, Max); Manipulated Items/Query (Min, Mean, Median, Max)\nOriginal 9426 1534 1068 - - - - - - - -\nColumn Manipulation\nAdd Columns 9219 1506 846 1 5.7 3 83 1 5.9 4 43\nRemove Columns 9426 1534 1076 1 6.2 2 87 1 6.9 3 70\nRemove Col in SQL 9424 1534 - 1 2.5 2 8 1 2.5 2.5 6\nRename Columns 9385 1533 947 1 4.3 3 46 1 4.4 3 46\nSplit Columns 140 37 37 1 2 2 4 1 2 2 4\nMerge Columns 148 44 44 2 3 3 4 2 3 3 4\nTable Manipulation\nAdd Tables 9387 1530 1014 - - - - 1 2 2 3\nRemove Tables 7212 1171 - - - - - 1 1 1 1\nRename Tables 9392 1534 1063 - - - - 1 1.5 1 4\nSplit Tables 9254 1515 824 - - - - 1 2.6 3 5\nMerge Tables 6930 1139 569 - - - - 2 2 2 2\nduring training. 2) with perturbation types: the model is trained by merging both the original training data and the perturbation training data. For closed-source models, we only use them for evaluation.\nEvaluation Setting: For all the closed-source models and the finetuned open-sourced models, we evaluate them under two settings: 1) without perturbation types: this setting uses the standard, unaltered original evaluation data to evaluate the model performance. 2) with perturbation types: the models are evaluated on data where different perturbations are introduced. By comparing the model performance under these two settings, we can assess how resilient the finetuned models and GPT models are to schema evolution in NL2SQL. This setup provides a comprehensive evaluation of model performance in both standard and perturbed environments, allowing for detailed analysis of robustness and adaptability across different models and schema evolution types.\n5.2 Evaluation Metrics\n1) Table Match F1: this score is a metric to measure how well the model correctly identifies the relevant tables required to generate a valid SQL query. The F1 score is a harmonic mean of precision and recall, where the precision is the percentage of tables correctly predicted out of all tables predicted by the model and the recall is the percentage of tables correctly predicted out of all the actual tables that should have been selected.\n3.8 Data Statistics\nTable 2 provides an overview of the data statistics in EvoSchema, showcasing the various perturbation types applied to the original BIRD dataset. \"Column Manipulation\" refers to applying the column-level operations on the columns of the original BIRD data; \"Table Manipulation\" refers to applying the table-level operations on the tables of the original BIRD data. All the perturbed data are obtained by applying column manipulation or table manipulation on the original BIRD dataset. \"Manipulated Items\" shows the size of the altered columns or the tables. \"Manipulated Items/Query\" refers to the number of columns or tables modified in the schema for each SQL query, specifically targeting the tables relevant to gen-",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 17,
+    "total_chunks": 39,
+    "char_count": 4689,
+    "word_count": 767,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "391accef-19f8-43e5-b7ea-d903c0e9315a",
+    "text": "erating that query. For \"Split Tables,\" \"Manipulated Items/Query\" represents the number of tables each original table is split into. For \"Merge Tables\", \"Manipulated Items/Query\" indicates the number of tables combined into a single table.\n4 TRAINING PARADIGM\nIn our work, we propose a new training paradigm to enhance the model's robustness against different types of schema evolution. For each <NLQ, relevant schema, SQL> triple, we fix the NLQ in the training data, and augment each triple with different schema designs, which may or may not lead to SQL change. Consequently, we obtain multiple triples that can be derived from each of the original triples. We train the model on the mappings between each original question and its multiple schema designs and SQL queries, which can improve the model's ability to identify the correct relationships among different tables and columns to the question, and can better distinguish the difference among different schema designs. Through this procedure, the model can better avoid learning spurious patterns and therefore enhance the robustness against different schema evolution types.\nThe Table Match F1 score combines these two metrics to provide a balanced evaluation, which can assess the ability of text-to-SQL models to correctly identify the required tables from the database schema to form accurate queries. A higher Table Match F1 indicates better performance in selecting the correct tables for the SQL query.\n2) Column Match F1: this score is to evaluate how accurately the model identifies the relevant columns required to generate a valid SQL query from a natural language input. Like the Table Match F1, it measures the balance between precision and recall but is applied specifically to the columns of the database. A higher Column Match F1 score indicates better performance in selecting the right columns for the SQL query.\n3) Execution Accuracy: this metric measures whether the predicted SQL query can return the correct results as the gold SQL when executing against a database.\n5.3 Training and Evaluation Details",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 18,
+    "total_chunks": 39,
+    "char_count": 2060,
+    "word_count": 322,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "370370f7-2cf2-4e1c-a14c-ba43cb8cc324",
+    "text": "We choose Code Llama-7B [25], Mistral-7B [12], Llama 3-8B [7] and SQLCoder-7B (footnote 4) as our open-source base models. We fine-tune these models with Huggingface transformers library [31]. For the perturbation training, we merge all the perturbation data and randomly shuffle them as our final training data. We use a learning rate of 2e-5 for training Code Llama, Llama 3 and SQLCoder, and 5e-6 for training Mistral. We train all the models on 4 A100 80GB GPUs and use a cosine scheduler with a 0.03 warm-up period for 6 epochs.\n5 EXPERIMENT SETUP\n5.1 Training and Evaluation Settings\nTraining Setting: We choose four open-source models: Code Llama-7B [25], Mistral-7B [12], Llama 3-8B [7] and SQLCoder-7B (footnote 4) and two closed-source models: GPT-3.5 (footnote 5) and GPT-4 [22] for our experiments. For these four open-source models, we explore two",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 19,
+    "total_chunks": 39,
+    "char_count": 829,
+    "word_count": 136,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "f4f83d04-3332-4c8a-abbb-ed7dae4be747",
+    "text": "settings: 1) without perturbation types: the model is trained on the original training data without any perturbation types introduced\nWe employ FSDP [37] to efficiently train the model. We set the max input length of training as 1024 and the max output length of inference as 500. For inference, we use vLLM [31] for batch evaluation, and we set the batch size as 16. We do the inference on an 80G A100 GPU. For closed-source LLMs, we use\nFootnote 4: https://huggingface.co/defog/sqlcoder-7b-2\nFootnote 5: https://openai.com/chatgpt/\nTable 3: Evaluation on EvoSchema. \"w/\": the model is trained by merging the original data and all the perturbation training types together; \"w/o\": the model is only trained on the original training data. The best performance for each model is in bold, and red shows a larger gain. \"-\": some of the relevant tables are removed so there should be no gold SQL used to calculate the metrics here.\nPerturbation Type | Code Llama w/o | Code Llama w/ | Mistral w/o | Mistral w/ | Llama 3 w/o | Llama 3 w/ | SQLCoder w/o | SQLCoder w/ | GPT-3.5 | GPT-4",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 20,
+    "total_chunks": 39,
+    "char_count": 1000,
+    "word_count": 167,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "4ef5ccf1-52a8-4e4b-bf5a-507c4c96229c",
+    "text": "Original 89.77 90.42 89.58 90.62 89.96 89.53 89.69 90.64 87.28 88.98\nAdd Columns 89.73 90.27 89.65 90.03 89.08 89.70 89.30 90.52 86.35 88.12\nRemove Columns 89.82 90.24 89.89 90.66 90.09 89.82 89.81 90.54 87.18 88.87\nRename Columns 85.28 85.07 84.32 84.27 83.74 82.92 85.32 84.93 81.73 83.20\nSplit Columns 83.78 89.19 83.78 88.29 81.08 85.14 86.49 88.29 81.44 86.31\nMerge Columns 88.65 87.23 87.23 89.72 88.65 86.17 87.23 87.23 83.17 89.36\nAdd Tables 57.88 89.50 57.67 89.30 55.11 88.51 57.44 89.38 83.54 85.79\nRemove Tables - - - - - - - - - -\nRename Tables 88.84 90.32 89.40 90.56 87.18 89.14 89.40 90.48 87.02 88.45\nSplit Tables 71.99 81.55 66.12 80.87 71.08 80.12 72.52 81.92 77.52 80.68\nMerge Tables 85.29 87.03 83.39 86.91 81.68 86.48 84.80 86.35 83.04 86.99\nMacroAvg 83.10 88.08 82.10 88.12 81.77 86.75 83.20 88.03 83.83 86.68\nOriginal 80.66 81.64 81.10 82.36 79.13 78.72 81.52 81.97 78.28 80.78\nAdd Columns 78.26 80.27 79.16 80.18 75.79 76.87 79.09 80.46 75.03 78.58\nRemove Columns 82.67 82.75 83.09 84.00 81.56 80.69 83.20 83.18 80.33 82.55\nRename Columns 76.50 76.94 76.35 76.73 72.24 71.07 76.84 77.38 73.40 75.90\nSplit Columns 71.22 81.81 70.24 80.41 67.29 75.04 74.50 79.92 73.59 77.92\nMerge Columns 83.19 83.30 82.75 83.41 82.72 83.68 82.64 83.31 78.13 88.56\nAdd Tables 63.81 81.14 65.39 81.09 59.36 77.96 62.91 81.23 76.45 79.32\nRemove Tables - - - - - - - - - -\nRename Tables 79.60 80.91 80.32 81.29 77.49 77.46 80.77 81.79 77.78 80.04\nSplit Tables 75.30 78.45 73.87 78.11 73.81 73.95 75.83 78.59 74.89 77.41\nMerge Tables 65.56 67.09 64.12 67.46 63.50 64.40 65.57 67.29 63.23 68.13\nMacroAvg 75.68 79.43 75.64 79.50 73.29 75.98 76.29 79.51 75.11 78.92",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 21,
+    "total_chunks": 39,
+    "char_count": 1665,
+    "word_count": 284,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "6793ce80-93e6-471e-bc09-2f5c585b8f78",
+    "text": "We use the 2023-12-01-preview version for GPT-4, and 2023-07-01-preview version for GPT-3.5.\n5.4 Baselines\nWe add in-context learning [10] and a more advanced method, CHESS [28], as the baselines for comprehensive comparison. In order to test whether in-context learning can help address the schema evolution issue, we randomly select three examples (each example is an <NLQ, database schema after evolution, gold SQL after schema evolution> triple) as the demonstration in the prompt to help the models understand the schema after evolution (Table 4). We also include CHESS, an advanced method for NL2SQL, as a baseline. We apply the schema selection (SS) and candidate generation (CG) components developed in their work.\ncomparison with our primary fine-tuning approach, we use a fine-tuned Code Llama model trained without any schema perturbation data as the SQL generation model. This setup allows us to isolate and evaluate the effectiveness of a schema selection and pruning component in addressing schema evolution. The results are shown in Table 4.\n6 RESULTS AND ANALYSIS\n6.1 Main Results\nAs Table 3 and Table 5 show, we train Code Llama, Mistral, Llama 3 and SQLCoder on the original BIRD training data with and without different perturbation types, and evaluate the model on the original BIRD evaluation data and different perturbation types. We observe:",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 22,
+    "total_chunks": 39,
+    "char_count": 1359,
+    "word_count": 209,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "c91be887-9a58-4b34-95a2-056ed0e8a546",
+    "text": "The models trained on different perturbation types are more robust to the schema variation on average, and demonstrate high robustness on the table-level schema evolution. While adding the perturbation data during training leads to a slight Exec Acc (EX) drop on the original non-evolved evaluation data and on the add, remove and rename column types, it achieves significantly better results on splitting columns and table-level perturbation types.\nFor schema selection, we use the advanced gpt-4o model to prune the database schema and remove the irrelevant tables and the irrelevant columns in the selected tables, ensuring only the most relevant tables and columns are passed into the model for SQL generation. To ensure a fair\nFootnote 6: https://learn.microsoft.com/en-us/azure/ai-services/openai/reference\nTable 4: Human Evaluation on EvoSchema. \"ZS\" refers to zero-shot, which prompts models without any examples. \"ICL\" refers to in-context learning, which prompts models with three demonstration examples. \"w/o\" means fine-tuning model without perturbation training data; \"w/\" means fine-tuning model with perturbation training data.\nBy comparing these four models' performance with and without the perturbation data, we observe that for splitting columns, the model trained with perturbation data can achieve up to 5.4 points gain for table match F1, 10.6 points gain for column match F1 and 24 points gain for EX; for adding tables, the model trained with perturbation data can achieve up to 33 points gain for table match F1, 18 points gain for column match F1 and 19 points for EX; for splitting tables, the model trained with perturbation data can achieve up to 14 points gain for table match F1, 4.2 points gain for column match F1 and 12 points for EX; for merging tables, the model trained on perturbation data can achieve up to 4.8 points gain on table match F1 and 3 points gain for column match F1. We hypothesize that this is because the perturbation augmented data is particularly beneficial for handling substantial schema changes, but may introduce minor noise in simpler schema changes where the model trained without perturbation data has already maximally learned the patterns. To better understand the slight performance gap under simpler column-level perturbations, we conducted error analysis and case studies to compare models trained with and without perturbed data. We observed two types of errors that lead to this phenomenon: (1) Spurious or missing conditions in the WHERE clause. For instance, given the question \"What is the element with the atom ID of TR004_7 in molecule that is not carcinogenic?\", the model trained with perturbation (\"w/\") misses the condition T2.label = '-' in the WHERE clause, while the \"w/o\" model includes it correctly. However, in another case, 'How many transactions were paid in CZK on the morning of 2012/8/26?', the \"w/\" model introduces an unnecessary WHERE condition: T1.TransactionID BETWEEN 1 AND 1000, which is not part of the gold SQL. (2) Incorrect column selection in SELECT or WHERE clauses. For example, for the question \"Among the patients followed at the outpatient clinic, how many of them have a normal level of alkaliphophatase?\", the \"w/\" model predicts T1.Description instead of T1.Admission in the WHERE clause, while the \"w/o\" model selects the correct column. Similarly, in the question \"Which group does superhero A-Bomb belong to?\", the \"w/\" model",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 23,
+    "total_chunks": 39,
+    "char_count": 3409,
+    "word_count": 528,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "83e75612-f1ca-40d1-9a55-84fc77cd23de",
+    "text": "selects T2.team_affiliation instead of the correct T2.race. These examples suggest that while training with perturbed data can improve general robustness, especially for handling substantial schema changes, it may also introduce minor noise that misleads the model in condition or column selection under simpler perturbations.\nBold color indicates the best performance in each row.\nHuman Evaluation on EvoSchema\nPerturbation Type | GPT-4 ZS | GPT-4 ICL | Code Llama w/o | Code Llama w/ | CHESS (SS+CG)\nOriginal 62 58 65 64 63\nAdd Columns 59 55 62 61 66\nRemove Columns 65 61 66 63 64\nRename Columns 57 56 57 57 62\nSplit Columns 46 59 41 62 49\nMerge Columns 68 66 70 70 66\nAdd Tables 56 55 46 62 57\nRemove Tables - - - - -\nRename Tables 58 60 64 61 61\nSplit Tables 57 53 48 60 53\nMerge Tables 55 57 54 58 53\nMacroAvg 58 58 57 62 59\nTable 5: Execution Accuracy on EvoSchema. \"w/\": the model is trained with all the perturbation types; \"w/o\": the model is only trained on the original training data.\nExec Acc on EvoSchema\nPerturbation Type | Code Llama w/o | Code Llama w/ | Mistral w/o | Mistral w/ | Llama 3 w/o | Llama 3 w/ | SQLCoder w/o | SQLCoder w/ | GPT-3.5 | GPT-4\nOriginal 58 57 59 58 55 51 58 58 44 47\nAdd Columns 57 55 56 56 52 49 55 57 43 46\nRemove Columns 59 57 60 58 56 53 60 58 45 47\nRename Columns 54 52 55 54 49 47 56 55 43 45\nSplit Columns 41 62 35 54 38 49 43 67 41 46\nMerge Columns 70 70 70 70 73 73 66 82 61 68\nAdd Tables 40 58 39 58 37 52 40 57 44 48\nRemove Tables - - - - - - - - - -\nRename Tables 56 55 55 56 52 50 56 55 43 47\nSplit Tables 38 46 36 48 40 41 43 49 40 47\nMerge Tables 43 45 45 46 42 44 47 46 37 45\nMacroAvg 52 56 51 56 49 51 52 58 44 49\nClosed-source models are robust to different schema evolution types in general. As Tables 3 and 5 show, we compare the model performance on GPT models and four open-source models trained with and without perturbation types. We observe that the GPT models' performance is relatively stable across different perturbation types compared to the original non-evolved test set. In contrast, fine-tuned open-source models without perturbation training data exhibit significant performance drops, particularly on split columns, add tables, split tables, and merge tables, which introduce larger schema changes. We hypothesize that the stability and robustness of closed-source models stem from broader pretraining exposure and stronger internal schema reasoning capabilities, while the open-source models trained without perturbation types are more sensitive due to limited training on diverse schema variations. This motivates the need to fine-tune open-source models with perturbation training data to improve their generalization under schema evolution. Comparing the open-source LLMs with the closed-source LLMs, we notice that the models trained with perturbation data perform better than GPT models on both column-level and table-level perturbation evaluation data. This indicates that our models trained with perturbation data are more robust than GPT models.\nTable-level perturbation has a larger impact than column-level perturbation on the model performance. As Tables 3 and 5 show, compared with the performance on the original evaluation data, adding tables and splitting tables lead to a significant table match F1 drop; adding tables, splitting tables and merging tables lead to a significant column match F1 drop. This phenomenon indicates that adding tables or splitting tables easily confuses the models in choosing the correct tables to generate the SQL query. For merging tables, even though the model can correctly choose tables, it's a bit hard for the model to pick up the correct columns when\nTable 6: Perturbation type ablation on EvoSchema.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 24,
+    "total_chunks": 39,
+    "char_count": 3688,
+    "word_count": 644,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "5fa0376b-78c5-4373-912b-43e6a42fb4a2",
+    "text": "The base model is Code Llama. \"both\": the model is trained with both column-level perturbation and table-level perturbation types; \"w/o table-p\": the model is trained without table-level perturbation types; \"w/o column-p\": the model is trained without column-level perturbation types.\n\nPerturbation Type Ablation\nColumns: Perturbation Type; Table Match F1 (both, w/o table-p, w/o column-p); Column Match F1 (both, w/o table-p, w/o column-p)\nOriginal 90.73 90.80 (+0.07) 90.04 (-0.69) 81.09 82.15 (+1.06) 80.49 (-0.60)\nAdd Columns 90.86 90.80 (-0.06) 89.75 (-1.11) 79.63 80.81 (+1.18) 77.29 (-2.34)\nRemove Columns 90.72 90.83 (+0.11) 90.48 (-0.24) 83.28 83.85 (+0.57) 82.61 (-0.67)\nRename Columns 85.35 85.38 (+0.03) 84.57 (-0.78) 76.49 77.53 (+1.04) 75.17 (-1.32)\nAdd Tables 88.95 58.94 (-30.01) 88.57 (-0.38) 79.87 64.11 (-15.76) 79.33 (-0.54)\nRemove Tables - - - - - -\nRename Tables 90.54 90.77 (+0.23) 89.29 (-1.25) 81.13 81.51 (+0.38) 79.33 (-1.80)\nSplit Tables 80.71 73.28 (-7.43) 79.05 (-1.66) 77.41 75.95 (-1.46) 76.30 (-1.11)\nMerge Tables 88.72 87.87 (-0.85) 86.83 (-1.89) 68.40 68.26 (-0.14) 67.08 (-1.32)\n\nTable 7: Out of Scope Effect on EvoSchema. The base model is Code Llama. \"w/o\": the model is trained without perturbation types; \"w/\": the model is trained on the original data and all the perturbation types; \"+ OOS\": the model is trained on the original data, perturbation types and two out-of-scope (OOS) perturbation types; \"+ OOS FP\": the model trained with two OOS perturbation types makes an incorrect prediction on the original data and in-scope perturbation data; \"+ OOS TP\": the model trained with two OOS perturbation types makes the correct prediction on the two OOS perturbation data; \"Tab\": the model refuses to predict SQL due to the lack of table information; \"Col\": the model refuses to predict SQL due to the lack of column information.\n\nOut of Scope Effect\nColumns: Perturbation Type; Table Match F1 (w/o, w/, + OOS); Column Match F1 (w/o, w/, + OOS); + OOS FP (Tab, Col); + OOS TP (Tab, Col)\nOriginal 89.77 90.42 82.98 (-7.44) 80.66 81.64 75.43 (-6.21) 7.11 0.65 - -\nAdd Columns 89.73 90.27 86.07 (-4.20) 78.26 80.27 77.00 (-3.27) 4.25 0.40 - -\nRemove Columns 89.82 90.24 82.24 (-8.00) 82.67 82.75 75.90 (-6.85) 7.56 0.72 - -\nRemove Col in SQL - - - - - - 5.02 - - 84.03\nRename Columns 85.28 85.07 80.20 (-4.87) 76.50 76.94 73.04 (-3.90) 4.44 0.20 - -\nAdd Tables 57.88 89.50 88.78 (-0.72) 63.81 81.14 80.71 (-0.37) 0.33 0.07 - -\nRemove Tables - - - - - - - - 1.62 83.86 -\nRename Tables 88.84 90.32 86.36 (-3.96) 79.60 80.91 78.06 (-2.85) 3.52 0.39 - -\nSplit Tables 71.99 81.55 81.07 (-0.48) 75.30 78.45 78.02 (-0.43) 0.26 0.07 - -\nMerge Tables 85.29 87.03 82.18 (-5.15) 65.56 67.09 63.59 (-3.50) 4.65 0.35 - -\n\nthe columns from different tables go into the same table. For the column-level performance, there are limited differences from the performance on the original data, except for splitting columns.\n\nReducing table schema complexity is beneficial for model performance. Comparing the model performance on column-level perturbation evaluation data with the original evaluation data, adding columns results in a decrease in column match F1, whereas removing columns leads to an increase in column match F1. This indicates that a simpler table schema helps models select columns, as removing columns simplifies the table schema while adding columns makes the table schema more complex.\n\n6.2 Comparison of Different Baselines\nBecause the EvoSchema test set is large and we need to call the GPT-4 and GPT-4o APIs for in-context learning and CHESS respectively, to save cost we randomly select 200 examples from the raw BIRD test set and from each perturbation type to compare different baselines. We compare GPT-4 zero-shot prompting, GPT-4 3-shot in-context learning, CodeLlama trained with and without perturbation training data, and CHESS (with schema selection (SS) and candidate generation (CG)) on our downsampled test set. We found that Exec Acc can still be misleading, either when different SQL queries produce the same results even though they do not align with the NLQ, or when both the gold SQL and a wrong predicted SQL return empty results, so we use human evaluation here for more precise evaluation. As Table 4 shows, compared to GPT-4 zero-shot (ZS), in-context learning (ICL) shows a significant advantage only on the split columns perturbation, while performing slightly better or worse on other types. This suggests that ICL is not consistently effective for handling schema evolution. We hypothesize this is because the demonstration examples in ICL cannot cover the full range of schema and SQL changes; thus, for examples that differ significantly from the demonstrations, ICL offers limited benefit. However, for split columns, where changes commonly involve patterns like name, address, or date splits, the demonstrations generalize better, making ICL more effective in this case. For CHESS, we use GPT-4o, a powerful closed-source model, for schema selection and pruning, and Code Llama without perturbation training (CodeLlama w/o) as the SQL generation model. CHESS achieves the best performance on add columns and rename columns, and significantly outperforms CodeLlama w/o on split columns, add\neach baseline.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 25,
+    "total_chunks": 39,
+    "char_count": 5239,
+    "word_count": 819,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "26a6f1c0-4e85-47b2-b32e-a810de735dc3",
+    "text": "tables, and on average. This highlights the importance of accurate schema selection and pruning in improving SQL generation. However, we also observe that errors at the pruning stage can propagate, leading to degraded performance. Specifically, in merge columns and merge tables cases, CHESS tends to over-prune, omitting relevant schema information and resulting in worse performance than [...] Finally, we found that fine-tuning CodeLlama with perturbation training data is still needed, since this method gets the best performance among all the baselines on average across all types of evaluation data, and performs significantly better than others on 'split columns', 'add tables', 'split tables' and 'merge tables' types. We applied McNemar's Test [21] to measure the statistical significance of performance differences between our method and each baseline. We computed p-values using the statsmodels package, considering differences statistically significant when p < 0.05, which indicates that the improvement is unlikely due to random chance. Using this test, we observed our method achieved statistically significant improvements over three key baselines: GPT-4 in-context learning, fine-tuning without perturbed data, and CHESS (all with p < 0.05).\n\n6.3 Influence of Perturbation Types\nWe explore the effect of the column-level perturbation types and table-level perturbation types. As Table 6 shows, we train the model with both column-level and table-level perturbation types, and compare it with the model trained without column-level perturbation types and without table-level perturbation types. [...] experiments, we found that without training on table-level perturbations, the model can perform slightly better on column-level perturbation types than the model trained with both column-level and table-level perturbation types, but suffers a significant performance drop on the table-level perturbation types. This indicates that the table-level perturbation data has a limited effect on the column-level perturbation types while having a huge impact on the table-level perturbation types. When looking at the model trained only on table-level perturbation types, we found that the model performance on both column-level and table-level perturbation types dropped. This indicates that the column-level perturbation types can still benefit the training.\n\nthat models trained without perturbation types tend to predict SQL queries that join all available tables, even when some tables are irrelevant to the NLQs and SQLs. We hypothesize that this occurs because during training without perturbations, the model only sees relevant table schemas, causing it to learn spurious patterns that always try to join all the input tables.\nTo explore whether simply adding irrelevant tables could yield similar performance to models trained with perturbation data, we conducted an experiment where we trained CodeLlama on BIRD. As shown in Table 8, adding irrelevant tables led to similar performance on the \"Add Tables\" perturbation type, but it caused a performance drop on other perturbation types. This suggests that combining all perturbation data is necessary to train a more robust model.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 26,
+    "total_chunks": 39,
+    "char_count": 3188,
+    "word_count": 464,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "0fd37ae7-9822-417d-bac2-12c7cb5503a2",
+    "text": "Table 8: Irrelevant tables effect. \"w/\": the model is trained with all the perturbation types; \"w/o\": the model is only trained on the original training data; \"w/o+\": the model is only trained on the original training data, but for the input table schema, we also add irrelevant tables.\n\nAdd Irrelevant Tables Effect\nColumns: Perturbation Type; Table Match F1 (w/o, w/o+, w/); Column Match F1 (w/o, w/o+, w/)\nOriginal 89.77 87.65 90.42 80.66 79.24 81.64\nAdd Columns 89.73 86.35 90.27 78.26 75.31 80.27\nRemove Columns 89.82 87.30 90.24 82.67 80.74 82.75\nRename Columns 85.28 81.90 85.07 76.50 73.28 76.94\nAdd Tables 57.88 88.01 89.50 63.81 79.51 81.14\nRemove Tables - - - - - -\nRename Tables 88.84 86.84 90.32 79.60 78.47 80.91\nSplit Tables 71.99 67.27 81.55 75.30 70.39 78.45\nMerge Tables 85.29 83.56 87.03 65.56 63.59 67.09\n\nTable 9: Intra-database Effect. This experiment emphasizes that the training and evaluation occur within the same database, instead of across databases.\n\nIntra-database Effect\nColumns: Perturbation Type; Table Match F1 (w/o, w/); Column Match F1 (w/o, w/)\nOriginal 87.24 87.43 79.54 80.89\nAdd Columns 87.14 87.43 76.36 78.92\nRemove Columns 87.29 87.27 81.14 81.29\nRename Columns 85.71 86.43 77.45 79.09\nAdd Tables 61.13 83.95 66.11 78.57\nRemove Tables - - - -\nRename Tables 86.33 86.67 79.44 79.96\nSplit Tables 71.82 78.52 75.09 77.42\nMerge Tables 85.11 87.44 71.43 74.72\n\n6.4 Influence of Out-of-scope Types\nWe evaluate both in-scope and out-of-scope scenarios. In in-scope settings, schema changes may or may not alter the gold SQL. Out-of-scope cases involve two special perturbations: (1) removing columns used in the gold SQL, and (2) removing tables used in the gold SQL. In both cases, the schema lacks critical information, and the model is expected to abstain from generating a query.\nTo assess their impact, we train a model on a combined dataset that includes both out-of-scope and in-scope perturbation types, along with the original training data. We compare this model to others trained only on the original or in-scope data.\n\n6.6 Influence of Intra-DB and Cross-DB\nWe hypothesize that a model trained on the same databases may not only learn schema evolution patterns but also become familiar with specific table and column names. To test this, we split the BIRD training data into train/test sets to ensure that each database in the test set also appears in the training set. We use Code Llama as the base model. The results in Table 9 show that, for most perturbation types, the model's performance improves more compared to the cross-database scenario in Section 6.1, which verifies our hypothesis.\n\n7 CONCLUSION",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 27,
+    "total_chunks": 39,
+    "char_count": 2615,
+    "word_count": 431,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "f3782994-c9ba-4d90-a7e2-7619f09e0469",
+    "text": "As shown in Table 7, incorporating out-of-scope types results in performance degradation across both original and in-scope evaluation sets. Error analysis reveals that the model trained with out-of-scope data tends to make more conservative predictions, sometimes abstaining even when the gold SQL is valid. Further analysis shows that the false positive (FP) rate closely matches the performance drop between models with and without out-of-scope training, confirming that increased conservatism is the main cause. Additionally, for the out-of-scope perturbations, the TP is only around 84%, which indicates that the model still has a 16% chance of making a prediction even when no SQL should be produced.\n\n6.5 Influence of Irrelevant Tables\nWe observed that the model trained with perturbation types demonstrates significant robustness to table-level perturbations, such as adding and splitting tables. Upon analyzing the errors, we found\n\nIn conclusion, we formulate the critical challenge of schema evolution in adaptive text-to-SQL systems and introduce EvoSchema, a comprehensive, diverse and unique benchmark designed specifically to study and address this problem. We developed a structured taxonomy of schema evolution types, enabling the synthesis of realistic schema designs through column-level and table-level perturbations. Using this taxonomy, we construct an evaluation benchmark to rigorously assess model robustness under schema changes and also introduce a novel training paradigm that augments existing <NLQ, relevant schema, SQL> triples with diverse schema designs for training to improve robustness against schema evolution.\n\nACKNOWLEDGMENTS\nThe authors would like to thank colleagues from the OSU NLP group for their insightful discussions and constructive suggestions and all anonymous reviewers for their thoughtful comments.\n\nREFERENCES\n[1] Andre Becklas. 2018. FIFA World Cup: All the results from World Cups. Kaggle\n[20] Pingchuan Ma and Shuai Wang. 2021. MT-teql: evaluating and augmenting neural NLIDB on real-world linguistic and schema variations.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 28,
+    "total_chunks": 39,
+    "char_count": 2076,
+    "word_count": 296,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "54062a3f-f063-4b96-a104-3fbcec4ca4d2",
+    "text": "VLDB Endow. 15, 3 (nov 2021), 569–582. https://doi.org/10.14778/3494124.3494139\n(2018). https://www.kaggle.com/datasets/abecklas/fifa-world-cup\n[21] Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 2 (1947), 153–157.\n[2] Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, Steve",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 29,
+    "total_chunks": 39,
+    "char_count": 447,
+    "word_count": 56,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "db459022-0f54-4d26-941c-a611e72258c1",
+    "text": "Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, and [...] Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=Wc5bmZZU9cy\n[22] OpenAI. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] https://arxiv.org/abs/2303.08774\n[23] Xinyu Pi, Bing Wang, Yan Gao, Jiaqi Guo, Zhoujun Li, and Jian-Guang Lou. 2022. Towards Robustness of Text-to-SQL Models Against Natural and Realistic Adversarial Table Perturbation. In Proceedings of the 60th Annual Meeting of the\n[3] Anthony Cleve, Maxime Gobert, Loup Meurice, Jerome Maes, and Jens Weber. 2015.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 30,
+    "total_chunks": 39,
+    "char_count": 685,
+    "word_count": 88,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "114bacfa-be9c-4d88-a5b1-f3e017b8afcf",
+    "text": "Understanding database schema evolution: A case study. Science of Computer Programming 97 (2015), 113–121.\nAssociation for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 2007–2022. https://doi.org/10.18653/v1/2022.acl-long.142\n[4] Daiga Deksne and Raivis Skadiņš. 2022. Virtual Assistant for Querying Databases in Natural Language. In Proceedings of the Future Technologies Conference. Springer, 555–564.\n[5] Julien Delplanque, Anne Etien, Nicolas Anquetil, and Olivier Auverlot. 2018. Relational Database Schema Evolution: An Industrial Case Study. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). 635–644. https://doi.org/10.1109/ICSME.2018.00073\n[24] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. [...] Dataset Shift in Machine Learning. The MIT Press.\n[6] Xiang Deng, Ahmed Hassan Awadallah, Christopher Meek, Oleksandr Polozov, Huan Sun, and Matthew Richardson. 2021. Structure-Grounded Pretraining for Text-to-SQL. In Proceedings of the 2021 Conference of the North American\n[25] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2024.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 31,
+    "total_chunks": 39,
+    "char_count": 1593,
+    "word_count": 203,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "81925b0b-5680-4410-9def-c7c3564912d0",
+    "text": "Code Llama: Open Foundation Models for Code. arXiv:2308.12950 [cs.CL] https://arxiv.org/abs/2308.12950\nChapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.105\n[7] Abhimanyu Dubey et al. 2024. The Llama 3 Herd of Models.\n[26] Yewei Song, Saad Ezzini, Xunzhu Tang, Cedric Lothritz, Jacques Klein, Tegawendé Bissyandé, Andrey Boytsov, Ulrick Ble, and Anne Goujon. 2024. Enhancing",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 32,
+    "total_chunks": 39,
+    "char_count": 503,
+    "word_count": 61,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "41603bb4-a56f-43bb-99f4-60c1a0a562f4",
+    "text": "Text-to-SQL Translation for Financial System Design. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 252–262.\narXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783\n[8] Jonathan Fürst, Catherine Kosten, Farhad Nooralahzadeh, Yi Zhang, and Kurt [...] Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries. In EDBT. 158–170. https://doi.org/10.48786/edbt.2025.13\n[9] Chang Gao, Bowen Li, Wenxuan Zhang, Wai Lam, Binhua Li, Fei Huang, Luo Si, and Yongbin Li. 2022. Towards Generalizable and Robust Text-to-SQL Parsing. In Findings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2113–2125. https://doi.org/10.18653/v1/2022.findings-emnlp.155\n[27] Chang-Yu Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. 2023. Exploring Chain of Thought Style Prompting for Text-to-SQL. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 5376–5393. https://doi.org/10.18653/v1/2023.emnlp-main.327\n[28] Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and [...] CHESS: Contextual Harnessing for Efficient SQL Synthesis. arXiv:2405.16755 [cs.LG] https://arxiv.org/abs/2405.16755\n[29] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers.\n[10] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. Proceedings of the VLDB Endowment 17, 5 (2024), 1132–",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 33,
+    "total_chunks": 39,
+    "char_count": 1864,
+    "word_count": 233,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "134a5566-ffd2-45c2-8130-dd2efd9d350b",
+    "text": "1145.\n[11] Andrea Hillenbrand and Uta Störl. 2021.\nIn Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 34,
+    "total_chunks": 39,
+    "char_count": 176,
+    "word_count": 26,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "fef706d5-6653-45d1-891d-1c53c1944e44",
+    "text": "Managing Schema Migration in NoSQL Databases: Advisor Heuristics vs. Self-adaptive Schema Migration Strategies. In International Conference on Model-Driven Engineering and Software Development.\nSchluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 7567–7578. https://doi.org/10.18653/v1/2020.acl-main.677\n[30] Chenglong Wang, Kedar Tatwawadi, Marc Brockschmidt, Po-Sen Huang, Yi Mao, Oleksandr Polozov, and Rishabh Singh. 2018.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 35,
+    "total_chunks": 39,
+    "char_count": 463,
+    "word_count": 52,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "0280d25e-7b2c-4ab0-8e79-3390728ed220",
+    "text": "Robust Text-to-SQL Generation with Execution-Guided Decoding. arXiv:1807.03100 [cs.CL] https://arxiv.org/abs/1807.03100\n[12] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL] https://arxiv.org/abs/2310.06825\n[13] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. 2021. Wilds: A benchmark of in-the-wild distribution shifts. In International conference on machine learning. PMLR, 5637–5664.\n[31] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. [...] HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs.CL] https://arxiv.org/abs/1910.03771\n[14] Kunal Kumar and S. [...] Database normalization design pattern.\n[32] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li,",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 36,
+    "total_chunks": 39,
+    "char_count": 1431,
+    "word_count": 185,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "19995af7-e0f1-456b-b389-1b28adf0785c",
+    "text": "James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 3911–3921. https://doi.org/10.18653/v1/D18-1425\nIn 2017 4th IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics (UPCON). 318–322. https://doi.org/10.1109/UPCON.2017.8251067\n[15] Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The Dawn of Natural Language to SQL: Are We Fully Ready? arXiv preprint arXiv:2406.01265 (2024).\n[16] Guoliang Li, Xuanhe Zhou, and Xinyang Zhao. 2024. LLM for Data Management. Proceedings of the VLDB Endowment 17, 12 (2024), 4213–4216.\n[33] Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Sun Yang, Chi Harold Liu, Rui Zhao, Ziyue Li, and Hangyu Mao. 2024. Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation.\n[17] Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 37,
+    "total_chunks": 39,
+    "char_count": 1314,
+    "word_count": 187,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "f6fc7b44-710a-4f76-b783-173d8960e19f",
+    "text": "Codes: Towards building open-source language models for text-to-sql. Proceedings of the ACM on Management of Data 2, 3 (2024), 1–28.\narXiv:2403.02951 [cs.CL] https://arxiv.org/abs/2403.02951\n[34] Chao Zhang, Yuren Mao, Yijiang Fan, Yu Mi, Yunjun Gao, Lu Chen, Dongfang Lou, and Jinshu Lin. 2024.",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 38,
+    "total_chunks": 39,
+    "char_count": 297,
+    "word_count": 43,
+    "chunking_strategy": "semantic"
+  },
+  {
+    "chunk_id": "235adc62-626b-426e-8290-93b0193fce03",
+    "text": "FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis. arXiv preprint arXiv:2401.10506 (2024).\n[18] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems 36 (2024).\n[19] Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. 2024. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? arXiv preprint arXiv:2408.05109 (2024).\n[35] Hanchong Zhang, Ruisheng Cao, Lu Chen, Hongshen Xu, and Kai Yu. 2023. ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought. In The 2023 Conference on Empirical Methods in Natural Language Processing. https://openreview.net/forum?id=oeZiXoCHgq\n[36] Tianshu Zhang, Changchang Liu, Wei-Han Lee, Yu Su, and Huan Sun. 2023. Federated Learning for Semantic Parsing: Task Formulation, Evaluation Setup, New Algorithms. arXiv:2305.17221 [cs.CL] https://arxiv.org/abs/2305.17221\n[37] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv:2304.11277 [cs.DC] https://arxiv.org/abs/2304.11277\n[38] Alex Zhuang, Ge Zhang, Tianyu Zheng, Xinrun Du, Junjie Wang, Weiming Ren, Stephen W Huang, Jie Fu, Xiang Yue, and Wenhu Chen. 2024. StructLM: Towards Building Generalist Models for Structured Knowledge Grounding. arXiv preprint",
+    "paper_id": "2603.10697",
+    "title": "EvoSchema: Towards Text-to-SQL Robustness Against Schema Evolution",
+    "authors": [
+      "Tianshu Zhang",
+      "Kun Qian",
+      "Siddhartha Sahai",
+      "Yuan Tian",
+      "Shaddy Garg",
+      "Huan Sun",
+      "Yunyao Li"
+    ],
+    "published_date": "2026-03-11",
+    "primary_category": "",
+    "arxiv_url": "http://arxiv.org/abs/2603.10697v1",
+    "chunk_index": 39,
+    "total_chunks": 39,
+    "char_count": 1779,
+    "word_count": 248,
+    "chunking_strategy": "semantic"
+  }
+]
\ No newline at end of file