
SParC: Cross-Domain Semantic Parsing in Context

Tao Yu† Rui Zhang† Michihiro Yasunaga† Yi Chern Tan† Xi Victoria Lin¶ Suyi Li† Heyang Er† Irene Li† Bo Pang† Tao Chen† Emily Ji†
Shreya Dixit† David Proctor† Sungrok Shim† Jonathan Kraft† Vincent Zhang† Caiming Xiong¶ Richard Socher¶ Dragomir Radev†

†Department of Computer Science, Yale University
¶Salesforce Research

{tao.yu, r.zhang, michihiro.yasunaga, dragomir.radev}@yale.edu
{xilin, cxiong, rsocher}@salesforce.com

Abstract

We present SParC, a dataset for cross-domain Semantic Parsing in Context. It consists of 4,298 coherent question sequences (12k+ individual questions annotated with SQL queries), obtained from controlled user interactions with 200 complex databases over 138 domains. We provide an in-depth analysis of SParC and show that it introduces new challenges compared to existing datasets. SParC (1) demonstrates complex contextual dependencies, (2) has greater semantic diversity, and (3) requires generalization to new domains due to its cross-domain nature and the unseen databases at test time. We experiment with two state-of-the-art text-to-SQL models adapted to the context-dependent, cross-domain setup. The best model obtains an exact set match accuracy of 20.2% over all questions and less than 10% over all interaction sequences, indicating that the cross-domain setting and the contextual phenomena of the dataset present significant challenges for future research. The dataset, baselines, and leaderboard are released at https://yale-lily.github.io/sparc.

1 Introduction

Querying a relational database is often challenging and a natural language interface has long been regarded by many as the most powerful database interface (Popescu et al., 2003; Bertomeu et al., 2006; Li and Jagadish, 2014). The problem of mapping a natural language utterance into executable SQL queries (text-to-SQL) has attracted increasing attention from the semantic parsing community by virtue of a continuous effort of dataset creation (Zelle and Mooney, 1996; Iyyer et al., 2017; Zhong et al., 2017; Finegan-Dollak et al., 2018; Yu et al., 2018a) and the modeling innovation that follows it (Xu et al., 2017; Wang et al., 2018; Yu et al., 2018b; Shi et al., 2018).

D1: Database about student dormitory containing 5 tables.
C1: Find the first and last names of the students who are living in the dorms that have a TV Lounge as an amenity.

Q1: How many dorms have a TV Lounge?
S1: SELECT COUNT(*) FROM dorm AS T1 JOIN has_amenity AS T2 ON T1.dormid = T2.dormid JOIN dorm_amenity AS T3 ON T2.amenid = T3.amenid WHERE T3.amenity_name = 'TV Lounge'

Q2: What is the total capacity of these dorms?
S2: SELECT SUM(T1.student_capacity) FROM dorm AS T1 JOIN has_amenity AS T2 ON T1.dormid = T2.dormid JOIN dorm_amenity AS T3 ON T2.amenid = T3.amenid WHERE T3.amenity_name = 'TV Lounge'

Q3: How many students are living there?
S3: SELECT COUNT(*) FROM student AS T1 JOIN lives_in AS T2 ON T1.stuid = T2.stuid WHERE T2.dormid IN (SELECT T3.dormid FROM has_amenity AS T3 JOIN dorm_amenity AS T4 ON T3.amenid = T4.amenid WHERE T4.amenity_name = 'TV Lounge')

Q4: Please show their first and last names.
S4: SELECT T1.fname, T1.lname FROM student AS T1 JOIN lives_in AS T2 ON T1.stuid = T2.stuid WHERE T2.dormid IN (SELECT T3.dormid FROM has_amenity AS T3 JOIN dorm_amenity AS T4 ON T3.amenid = T4.amenid WHERE T4.amenity_name = 'TV Lounge')

--------------------------------------

D2: Database about shipping company containing 13 tables.
C2: Find the names of the first 5 customers.

Q1: What is the customer id of the most recent customer?
S1: SELECT customer_id FROM customers ORDER BY date_became_customer DESC LIMIT 1

Q2: What is their name?
S2: SELECT customer_name FROM customers ORDER BY date_became_customer DESC LIMIT 1

Q3: How about for the first 5 customers?
S3: SELECT customer_name FROM customers ORDER BY date_became_customer LIMIT 5

Figure 1: Two question sequences from the SParC dataset. Questions (Qi) in each sequence query a database (Di), obtaining information sufficient to complete the interaction goal (Ci). Each question is annotated with a corresponding SQL query (Si). SQL token sequences from the interaction context are underlined.

While most of these works focus on precisely mapping stand-alone utterances to SQL queries, generating SQL queries in a context-dependent scenario (Miller et al., 1996; Zettlemoyer and Collins, 2009; Suhr et al., 2018) has been studied less often. The most prominent context-dependent text-to-SQL benchmark is ATIS1, which is set in the flight-booking domain and contains only one database (Hemphill et al., 1990; Dahl et al., 1994).

In a real-world setting, users tend to ask a sequence of thematically related questions to learn about a particular topic or to achieve a complex goal. Previous studies have shown that by allowing questions to be constructed sequentially, users can explore the data in a more flexible manner, which reduces their cognitive burden (Hale, 2006; Levy, 2008; Frank, 2013; Iyyer et al., 2017) and increases their involvement when interacting with the system. The phrasing of such questions depends heavily on the interaction history (Kato et al., 2004; Chai and Jin, 2004; Bertomeu et al., 2006). The users may explicitly refer to or omit previously mentioned entities and constraints, and may introduce refinements, additions or substitutions to what has already been said (Figure 1). This requires a practical text-to-SQL system to effectively process context information to synthesize the correct SQL logic.

To enable modeling advances in context-dependent semantic parsing, we introduce SParC (cross-domain Semantic Parsing in Context), an expert-labeled dataset which contains 4,298 coherent question sequences (12k+ questions paired with SQL queries) querying 200 complex databases in 138 different domains. The dataset is built on top of Spider2, the largest cross-domain context-independent text-to-SQL dataset available in the field (Yu et al., 2018c). The large number of domains provides rich contextual phenomena and thematic relations between the questions, which general-purpose natural language interfaces to databases have to address. In addition, it enables us to test the generalization of the trained systems to unseen databases and domains.

We asked 15 college students with SQL experience to come up with question sequences over the Spider databases (§ 3). Questions in the original Spider dataset were used as guidance to the students for constructing meaningful interactions: each sequence is based on a question in Spider and the student has to ask inter-related questions to obtain information that answers the Spider question. At the same time, the students are encouraged to come up with related questions which do not directly contribute to the Spider question so as to increase data diversity. The questions were subsequently translated to complex SQL queries by the same student. Similar to Spider, the SQL queries in SParC cover complex syntactic structures and most common SQL keywords.

1 A subset of ATIS is also frequently used in context-independent semantic parsing research (Zettlemoyer and Collins, 2007; Dong and Lapata, 2016).

2 The data is available at https://yale-lily.github.io/spider.

We split the dataset such that a database appears in only one of the train, development and test sets. We provide detailed data analysis to show the richness of SParC in terms of semantics, contextual phenomena and thematic relations (§ 4). We also experiment with two competitive baseline models to assess the difficulty of SParC (§ 5). The best model achieves only 20.2% exact set matching accuracy3 on all questions, and demonstrates a decrease in exact set matching accuracy from 38.6% for questions in turn 1 to 1.1% for questions in turns 4 and higher (§ 6). This suggests that there is plenty of room for advancement in modeling and learning on the SParC dataset.

2 Related Work

Context-independent semantic parsing Early studies in semantic parsing (Zettlemoyer and Collins, 2005; Artzi and Zettlemoyer, 2013; Berant and Liang, 2014; Li and Jagadish, 2014; Pasupat and Liang, 2015; Dong and Lapata, 2016; Iyer et al., 2017) were based on small and single-domain datasets such as ATIS (Hemphill et al., 1990; Dahl et al., 1994) and GeoQuery (Zelle and Mooney, 1996). Recently, an increasing number of neural approaches (Zhong et al., 2017; Xu et al., 2017; Yu et al., 2018a; Dong and Lapata, 2018; Yu et al., 2018b) have started to use large and cross-domain text-to-SQL datasets such as WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018c). Most of them focus on converting stand-alone natural language questions to executable queries. Table 1 compares SParC with other semantic parsing datasets.

Context-dependent semantic parsing with SQL labels Only a few datasets have been constructed for the purpose of mapping context-dependent questions to structured queries.

3 Exact string match ignores ordering discrepancies of SQL components whose order does not matter. Exact set matching is able to consider ordering issues in SQL evaluation. See more evaluation details in section 6.1.


Dataset                                          Context  Resource     Annotation  Cross-domain
SParC                                            ✓        database     SQL         ✓
ATIS (Hemphill et al., 1990; Dahl et al., 1994)  ✓        database     SQL         ✗
Spider (Yu et al., 2018c)                        ✗        database     SQL         ✓
WikiSQL (Zhong et al., 2017)                     ✗        table        SQL         ✓
GeoQuery (Zelle and Mooney, 1996)                ✗        database     SQL         ✗
SequentialQA (Iyyer et al., 2017)                ✓        table        denotation  ✓
SCONE (Long et al., 2016)                        ✓        environment  denotation  ✗

Table 1: Comparison of SParC with existing semantic parsing datasets.

Hemphill et al. (1990); Dahl et al. (1994) collected the contextualized version of ATIS that includes series of questions from users interacting with a flight database. Adopted by several works later on (Miller et al., 1996; Zettlemoyer and Collins, 2009; Suhr et al., 2018), ATIS has only a single domain for flight planning, which limits the possible SQL logic it contains. In contrast to ATIS, SParC consists of a large number of complex SQL queries (with most SQL syntax components) inquiring 200 databases in 138 different domains, which contributes to its diversity in query semantics and contextual dependencies. Similar to Spider, the databases in the train, development and test sets of SParC do not overlap.

Context-dependent semantic parsing with denotations Some datasets used in recovering context-dependent meaning (including SCONE (Long et al., 2016) and SequentialQA (Iyyer et al., 2017)) contain no logical form annotations but only denotations (Berant and Liang, 2014) instead. SCONE (Long et al., 2016) contains some instructions in limited domains such as chemistry experiments. The formal representations in the dataset are world states representing state changes after each instruction instead of programs or logical forms. SequentialQA (Iyyer et al., 2017) was created by asking crowd workers to decompose some complicated questions in WikiTableQuestions (Pasupat and Liang, 2015) into sequences of inter-related simple questions. As shown in Table 1, neither of the two datasets was annotated with query labels. Thus, to make the tasks feasible, SCONE (Long et al., 2016) and SequentialQA (Iyyer et al., 2017) exclude many questions with rich semantic and contextual types. For example, Iyyer et al. (2017) require that the answers to the questions in SequentialQA must appear in the table, and most of them can be solved by simple SQL queries with SELECT and WHERE clauses. Such direct mapping without formal query labels becomes unfeasible for complex questions. Furthermore, SequentialQA contains questions based only on a single Wikipedia table at a time. In contrast, SParC contains 200 significantly larger databases, and complex query labels with all common SQL key components. This requires a system developed for SParC to handle information needed over larger databases in different domains.

Conversational QA and dialogue system Language understanding in context is also studied for dialogue and question answering systems. The development in dialogue (Henderson et al., 2014; Mrksic et al., 2017; Zhong et al., 2018) uses predefined ontology and slot-value pairs with limited natural language meaning representation, whereas we focus on general SQL queries that enable more powerful semantic meaning representation. Recently, some conversational question answering datasets have been introduced, such as QuAC (Choi et al., 2018) and CoQA (Reddy et al., 2018). They differ from SParC in that the answers are free-form text instead of SQL queries. On the other hand, Kato et al. (2004); Chai and Jin (2004); Bertomeu et al. (2006) conduct early studies of the contextual phenomena and thematic relations in database dialogue/QA systems, which we use as references when constructing SParC.

3 Data Collection

We create the SParC dataset in four stages: selecting interaction goals, creating questions, annotating SQL representations, and reviewing.

Interaction goal selection To ensure thematic relevance within each question sequence, we use questions in the original Spider dataset as the thematic guidance for constructing meaningful query interactions, i.e. the interaction goal.


Refinement (constraint refinement), 33.8%
Description: The current question asks for the same type of entity as a previous question with a different constraint.
Example:
  Prev Q: Which major has the fewest students?
  Cur Q: What is the most popular one?

Theme-entity (topic exploration), 48.4%
Description: The current question asks for other properties about the same entity as a previous question.
Example:
  Prev Q: What is the capacity of Anonymous Donor Hall?
  Cur Q: List all of the amenities which it has.

Theme-property (participant shift), 9.7%
Description: The current question asks for the same property about another entity.
Example:
  Prev Q: Tell me the rating of the episode named "Double Down".
  Cur Q: How about for "Keepers"?

Answer refinement/theme (answer exploration), 8.1%
Description: The current question asks for a subset of the entities given in a previous answer or asks about a specific entity introduced in a previous answer.
Example:
  Prev Q: Please list all the different department names.
  Cur Q: What is the average salary of all instructors in the Statistics department?

Table 2: Thematic relations between questions in a database QA system defined by Bertomeu et al. (2006). The first three relations hold between a question and a previous question and the last relation holds between a question and a previous answer. We manually classified 102 examples in SParC into one or more of them and show the distribution. The entities (bold), properties (italics) and constraints (underlined) are highlighted in each question.

Each sequence is based on a question in Spider and the annotator has to ask inter-related questions to obtain the information demanded by the interaction goal (detailed in the next section). All questions in Spider were stand-alone questions written by 11 college students with SQL background after they had explored the database content, and the question intent conveyed is likely to naturally arise in real-life query scenarios. We selected all Spider examples classified as medium, hard, and extra hard, as it is in general hard to establish context for easy questions. In order to study more diverse information needs, we also included some easy examples (ending up using 12.9% of the easy examples in Spider). As a result, 4,437 questions were selected as the interaction goals for 200 databases.

Question creation 15 college students with SQL experience were asked to come up with sequences of inter-related questions to obtain the information demanded by the interaction goals4. Previous work (Bertomeu et al., 2006) has characterized different thematic relations between the utterances in a database QA system: refinement, theme-entity, theme-property, and answer refinement/theme5, as shown in Table 2. We show these definitions to the students prior to question creation to help them come up with context-dependent questions. We also encourage the formulation of questions that are thematically related to but do not directly contribute to answering the goal question (e.g. Q2 in the first example and Q1 in the second example in Figure 1; see more examples in the Appendix as well). The students do not simply decompose the complex query. Instead, they often explore the data content first and even change their querying focus. Therefore, the information acquired over an interaction in SParC cannot be obtained by a single complex SQL query.

4 The students were asked to spend time exploring the database using a database visualization tool powered by Sqlite Web (https://github.com/coleifer/sqlite-web) so as to create a diverse set of thematic relations between the questions.

5 We group answer refinement and answer theme, the two thematic relations holding between a question and a previous answer as defined in Bertomeu et al. (2006), into a single answer refinement/theme type.

We divide the goals evenly among the students and each interaction goal is annotated by one student6. We enforce each question sequence to contain at least two questions, and the interaction terminates when the student has obtained enough information to answer the goal question.

SQL annotation After creating the questions, each annotator was asked to translate their own questions to SQL queries. All SQL queries were executed on Sqlite Web to ensure correctness. To make our evaluation more robust, the same annotation protocol as Spider (Yu et al., 2018c) was adopted such that all annotators chose the same SQL query pattern when multiple equivalent queries were possible.
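This executability requirement is straightforward to check programmatically. Below is a minimal sketch assuming one local SQLite file per database and a JSON file of annotated interactions with database_id and per-turn query fields; the paths and field names are illustrative assumptions, and the annotators themselves worked through the Sqlite Web interface rather than a script like this.

import json
import sqlite3
from pathlib import Path

def is_executable(db_path: str, sql: str) -> bool:
    """Return True if the annotated SQL runs without error on the SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Hypothetical data layout; adjust paths and field names to the actual release.
interactions = json.loads(Path("sparc/train.json").read_text())
for example in interactions:
    db_id = example["database_id"]
    db_file = Path("sparc/database") / db_id / f"{db_id}.sqlite"
    for turn in example["interaction"]:
        if not is_executable(str(db_file), turn["query"]):
            print("Not executable:", db_id, turn["query"])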

Data review and post-process We asked students who are native English speakers to review the annotated data. Each example was reviewed at least once. The students corrected any grammar errors and rephrased the question in a more natural way if necessary. They also checked if the questions in each sequence were related and the SQL answers matched the semantic meaning of the question. After that, another group of students ran all annotated SQL queries to make sure they were executable. Furthermore, they used the SQL parser7 from Spider to parse all the SQL labels to make sure all queries follow the annotation protocol. Finally, the most experienced annotator conducted a final review on all question-SQL pairs. 139 question sequences were discarded in this final step due to poor question quality or wrong SQL annotations.

6 The most productive student annotated 13.1% of the goals and the least productive student annotated close to 2%.

             SParC    ATIS
Sequence #   4298     1658
Question #   12,726   11,653
Database #   200      1
Table #      1020     27
Avg. Q len   8.1      10.2
Vocab #      3794     1582
Avg. turn #  3.0      7.0

Table 3: Comparison of the statistics of context-dependent text-to-SQL datasets.

4 Data Statistics and Analysis

We compute the statistics of SParC and conduct a thorough data analysis focusing on its contextual dependencies, semantic coverage and cross-domain property. Throughout this section, we compare SParC to ATIS (Hemphill et al., 1990; Dahl et al., 1994), the most widely used context-dependent text-to-SQL dataset in the field. In comparison, SParC is significantly different as it (1) contains more complex contextual dependencies, (2) has greater semantic coverage, and (3) adopts a cross-domain task setting, which makes it a new and challenging cross-domain context-dependent text-to-SQL dataset.

Data statistics Table 3 summarizes the statistics of SParC and ATIS. SParC contains 4,298 unique question sequences, 200 complex databases in 138 different domains, with 12k+ questions annotated with SQL queries. The number of sequences in ATIS is significantly smaller, but it contains a comparable number of individual questions since it has a higher number of turns per sequence8.

7 https://github.com/taoyds/spider/blob/master/process_sql.py

8 The ATIS dataset is collected under the Wizard-of-Oz setting (Bertomeu et al., 2006) (like a task-oriented sequential question answering task). Each user interaction is guided by an abstract, high-level goal such as "plan a trip from city A to city B, stop in another city on the way". The domain by its nature requires the user to express multiple constraints in separate utterances and the user is intrinsically motivated to interact with the system until the booking is successful. In contrast, the interaction goals formed by Spider questions are for open-domain and general-purpose database querying, which tend to be more specific and can often be stated in a smaller number of turns. We believe these differences contribute to the shorter average question sequence length of SParC compared to that of ATIS.

[Figure 2 shown here: a 6x6 heatmap over question turn numbers, with cell values ranging from 0.0 to 1.0.]

Figure 2: The heatmap shows the percentage of SQL token overlap between questions in different turns. Token overlap is greater between questions that are closer to each other and the degree of overlap increases as interaction proceeds. Most questions have dependencies that span 3 or fewer turns.

On the other hand, SParC has overcome the domain limitation of ATIS by covering 200 different databases and has a significantly larger natural language vocabulary.

Contextual dependencies of questions We visualize the percentage of token overlap between the SQL queries (formal semantic representation of the question) at different positions of a question sequence. The heatmap shows that more information is shared between two questions that are closer to each other. This sharing increases among questions in later turns, where users tend to narrow down their questions to very specific needs. This also indicates that resolving context references in our task is important.
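As a rough illustration of how such an overlap statistic can be computed, the sketch below measures pairwise SQL token overlap between the turns of one interaction. The whitespace tokenization and the normalization by the smaller token set are simplifying assumptions on our part, not the exact procedure behind Figure 2.

from itertools import combinations

def sql_tokens(sql: str) -> set:
    """Crude SQL tokenization: lowercase and split on whitespace and basic punctuation."""
    for ch in "(),":
        sql = sql.replace(ch, " ")
    return set(sql.lower().split())

def turn_overlap(queries: list) -> dict:
    """Token overlap between every pair of turns (i, j) with i < j in one interaction."""
    overlaps = {}
    for (i, q_i), (j, q_j) in combinations(enumerate(queries, start=1), 2):
        t_i, t_j = sql_tokens(q_i), sql_tokens(q_j)
        overlaps[(i, j)] = len(t_i & t_j) / max(min(len(t_i), len(t_j)), 1)
    return overlaps

queries = [
    "SELECT COUNT(*) FROM dorm",
    "SELECT SUM(student_capacity) FROM dorm",
]
print(turn_overlap(queries))  # {(1, 2): 0.6}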

Furthermore, the lighter color of the lower left 4 squares in the heatmap of Figure 2 shows that most questions in an interaction have contextual dependencies that span within 3 turns. Reddy et al. (2018) similarly report that the majority of context dependencies on the CoQA conversational question answering dataset are within 2 questions, beyond which coreferences from the current question are likely to be ambiguous with little inherited information.


[Figure 3 shown here: two plots, one for ATIS and one for SParC; the x-axis is the question turn number (1-5) and the y-axis is the occurrence frequency (0.0-1.0) of the SQL keywords WHERE, AGG, JOIN, GROUP, ORDER, and Nested.]

Figure 3: Percentage of question sequences that contain a particular SQL keyword at a given turn. The complexity of questions increases as interaction proceeds on SParC as more SQL keywords are triggered. The same trend was not observed on ATIS.

This suggests that 3 turns on average are sufficient to capture most of the contextual dependencies between questions in SParC.

We also plot the trend of common SQL keywords occurring in different question turns for both SParC and ATIS (Figure 3)9. We show the percentage of question sequences that contain a particular SQL keyword at each turn. The upper figure in Figure 3 shows that the occurrences of all SQL keywords do not change much as the question turn increases in ATIS, which indicates that the SQL query logic in different turns is very similar. We examined the data and find that most interactions in ATIS involve changes in the WHERE condition between question turns. This is likely caused by the fact that the questions in ATIS only involve flight booking, which typically triggers the use of the refinement thematic relation. Example user utterances from ATIS are "on american airlines" or "which ones arrive at 7pm" (Suhr et al., 2018), which only involve changes to the WHERE condition.

9 Since the formatting used for the SQL queries is different in SParC and ATIS, the actual percentages of WHERE, JOIN and Nested for ATIS are lower (e.g. the original version of ATIS may be highly nested, but queries could be reformatted to flatten them). Other SQL keywords are directly comparable.

In contrast, the lower figure demonstrates a clear trend that in SParC, the occurrences of nearly all SQL components increase as the question turn increases. This suggests that questions in subsequent turns tend to change the logical structures more significantly, which makes our task more interesting and challenging.

Contextual linguistic phenomena We manually inspected 102 randomly chosen examples from our development set to study the thematic relations between questions. Table 2 shows the relation distribution.

We find that the most frequently occurring relation is theme-entity, in which the current question focuses on the same entity (set) as the previous question but requests some other property. Consider Q1 and Q2 of the first example shown in Figure 1. Their corresponding SQL representations (S1 and S2) have the same FROM and WHERE clauses, which harvest the same set of entities: "dorms with a TV lounge". But their SELECT clauses return different properties of the target entity set (number of the dorms in S1 versus total capacity of the dorms in S2). Q3 and Q4 in this example also have the same relation. The refinement relation is also very common. For example, Q2 and Q3 in the second example ask about the same entity set, customers of a shipping company. But Q3 switches the search constraint from "the most recent" in Q2 to "the first 5".

Fewer questions refer to previous questions by replacing with another entity ("Double Down" versus "Keepers" in Table 2) but asking for the same property (theme-property). Even less frequently, some questions ask about the answers of previous questions (answer refinement/theme). As in the last example of Table 2, the current question asks about the "Statistics department", which is one of the answers returned in the previous turn. More examples with different thematic relations are provided in Figure 5 in the Appendix.

Interestingly, as the examples in Table 2 have shown, many thematic relations are present without explicit linguistic coreference markers. This indicates information tends to implicitly propagate through the interaction. Moreover, in some cases where the natural language question shares information with the previous question (e.g. Q2 and Q3 in the first example of Figure 1 form a theme-entity relation), the corresponding SQL representations (S2 and S3) can be very different.


SQL components   SParC   ATIS
# WHERE          42.8%   99.7%
# AGG            39.8%   16.6%
# GROUP          20.1%   0.3%
# ORDER          17.0%   0.0%
# HAVING         4.7%    0.0%
# SET            3.5%    0.0%
# JOIN           35.5%   99.9%
# Nested         5.7%    99.9%

Table 4: Distribution of SQL components in SQL queries. SQL queries in SParC cover all SQL components, whereas some important SQL components like ORDER are missing from ATIS.

One scenario in which this happens is when the property/constraint specification makes reference to additional entities described by separate tables in the database schema.

Semantic coverage As shown in Table 3, SParC is larger in terms of number of unique SQL templates, vocabulary size and number of domains compared to ATIS. The smaller number of unique SQL templates and vocabulary size of ATIS is likely due to the domain constraint and presence of many similar questions.

Table 4 further compares the formal semantic representations in these two datasets in terms of SQL syntax components. While almost all questions in ATIS contain joins and nested subqueries, some commonly used SQL components are either absent (ORDER BY, HAVING, SET) or occur very rarely (GROUP BY and AGG). We examined the data and find that many questions in ATIS have complicated syntactic structures mainly because the database schema requires joined tables and nested sub-queries, and the semantic diversity among the questions is in fact smaller.

Cross domain As shown in Table 1, SParC contains questions over 200 databases (1,020 tables) in 138 different domains. In comparison, ATIS contains only one database in the flight booking domain, which makes it unsuitable for developing models that generalize across domains. Interactions querying different databases are shown in Figure 1 (also see more examples in Figure 4 in the Appendix). As in Spider, we split SParC such that each database appears in only one of the train, development and test sets. Splitting by database requires the models to generalize well not only to new SQL queries, but also to new databases and new domains.

                 Train   Dev    Test
# Q sequences    3034    422    842
# Q-SQL pairs    9025    1203   2498
# Databases      140     20     40

Table 5: Dataset Split Statistics
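A database-level split like the one above can be reproduced by grouping examples by their database id before partitioning, so that no database leaks across splits. The sketch below is a generic illustration under an assumed database_id field, not the official preprocessing script, and the split fractions are arbitrary.

import random
from collections import defaultdict

def split_by_database(interactions, dev_frac=0.1, test_frac=0.2, seed=42):
    """Partition interactions so that each database appears in exactly one split."""
    by_db = defaultdict(list)
    for example in interactions:
        by_db[example["database_id"]].append(example)  # assumed field name

    db_ids = sorted(by_db)
    random.Random(seed).shuffle(db_ids)

    n_dev = int(len(db_ids) * dev_frac)
    n_test = int(len(db_ids) * test_frac)
    dev_dbs = set(db_ids[:n_dev])
    test_dbs = set(db_ids[n_dev:n_dev + n_test])

    splits = {"train": [], "dev": [], "test": []}
    for db_id, examples in by_db.items():
        key = "dev" if db_id in dev_dbs else "test" if db_id in test_dbs else "train"
        splits[key].extend(examples)
    return splits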

5 Methods

We extend two state-of-the-art semantic parsing models to the cross-domain, context-dependent setup of SParC and benchmark their performance. At each interaction turn i, given the current question x_i = ⟨x_{i,1}, ..., x_{i,|x_i|}⟩, the previously asked questions I[:i−1] = {x_1, ..., x_{i−1}}, and the database schema C, the model generates the SQL query y_i.

5.1 Seq2Seq with turn-level history encoder (CD-Seq2Seq)

This is a cross-domain Seq2Seq based text-to-SQL model extended with the turn-level history encoder proposed in Suhr et al. (2018).

Turn-level history encoder Following Suhr et al. (2018), at turn i, we encode each user question x_t ∈ I[:i−1] ∪ {x_i} using an utterance-level bi-LSTM, LSTM^E. The final hidden state of LSTM^E, h^E_{t,|x_t|}, is used as the input to the turn-level encoder, LSTM^I, a uni-directional LSTM, to generate the discourse state h^I_t. The input to LSTM^E at turn t is the question word embedding concatenated with the discourse state at turn t−1 ([x_{t,j}, h^I_{t−1}]), which enables the propagation of contextual information.
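A minimal PyTorch sketch of this turn-level scheme is shown below. The hidden sizes, the embedding layer, and the single-interaction batching are our own simplifications for illustration; this is not the authors' released implementation.

import torch
import torch.nn as nn

class TurnLevelHistoryEncoder(nn.Module):
    """Utterance-level bi-LSTM plus a turn-level LSTM that carries the discourse state."""

    def __init__(self, vocab_size, emb_dim=300, utt_hidden=150, turn_hidden=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Input to the utterance encoder: word embedding concatenated with the
        # previous discourse state h^I_{t-1}.
        self.utt_lstm = nn.LSTM(emb_dim + turn_hidden, utt_hidden,
                                bidirectional=True, batch_first=True)
        self.turn_cell = nn.LSTMCell(2 * utt_hidden, turn_hidden)

    def forward(self, interaction):
        """interaction: list of 1-D LongTensors, one per turn (token ids of that question)."""
        turn_hidden = self.turn_cell.hidden_size
        h_I = torch.zeros(1, turn_hidden)   # discourse state h^I_0
        c_I = torch.zeros(1, turn_hidden)
        token_states = []                   # h^E_{t,j}, kept for decoder attention

        for tokens in interaction:
            emb = self.embed(tokens).unsqueeze(0)                  # (1, T, emb_dim)
            ctx = h_I.unsqueeze(1).expand(-1, emb.size(1), -1)     # broadcast h^I_{t-1}
            out, _ = self.utt_lstm(torch.cat([emb, ctx], dim=-1))  # (1, T, 2*utt_hidden)
            token_states.append(out.squeeze(0))
            # The final utterance state updates the discourse state h^I_t.
            h_I, c_I = self.turn_cell(out[:, -1, :], (h_I, c_I))

        return token_states, h_I

The returned per-token states are what the decoder later attends over, while the discourse state is fed back into the encoding of the next utterance.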

Database schema encoding For each column header in the database schema, we concatenate its corresponding table name and column name separated by a special dot token (i.e., table_name.column_name), and use the average word embedding10 of tokens in this sequence as the column header embedding h^C.
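For concreteness, the column representation can be sketched as below; representing the GloVe table as a plain dictionary, splitting names on underscores, and falling back to zero vectors for out-of-vocabulary words are all assumptions made for this illustration.

import numpy as np

def column_header_embedding(table_name: str, column_name: str, glove: dict, dim: int = 300):
    """Average the word vectors of the tokens in 'table_name . column_name'."""
    tokens = table_name.lower().split("_") + ["."] + column_name.lower().split("_")
    vectors = [glove.get(tok, np.zeros(dim)) for tok in tokens]
    return np.mean(vectors, axis=0)

# e.g. h_C = column_header_embedding("dorm_amenity", "amenity_name", glove_vectors)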

Decoder The decoder is implemented with another LSTM (LSTM^D) with attention to the LSTM^E representations of the questions in η previous turns. At each decoding step, the decoder chooses to generate either a SQL keyword (e.g., select, where, group by) or a column header. To achieve this, we use separate layers to score SQL keywords and column headers, and finally use the softmax operation to generate the output probability distribution over both categories.

10 We use the 300-dimensional GloVe (Pennington et al., 2014) pretrained word embeddings.

5.2 SyntaxSQLNet with history input (SyntaxSQL-con)

SyntaxSQLNet is a syntax tree based neural model for the complex and cross-domain context-independent text-to-SQL task introduced by Yu et al. (2018b). The model consists of a table-aware column attention encoder and a SQL-specific syntax tree-based decoder. The decoder adopts a set of inter-connected neural modules to generate different SQL syntax components.

We extend this model by providing the decoder with the encoding of the previous question (x_{i−1}) as additional contextual information. Both x_i and x_{i−1} are encoded using bi-LSTMs (of different parameters) with the column attention mechanism proposed by Yu et al. (2018b). We use the same math formulation to inject the representations of x_i and x_{i−1} into each syntax module of the decoder.

More details of each baseline model can be found in the Appendix. We open-source their implementations for reproducibility.

6 Experiments

6.1 Evaluation Metrics

Following Yu et al. (2018c), we use the exact set match metric to compute the accuracy between gold and predicted SQL answers. Instead of simply employing string match, Yu et al. (2018c) decompose predicted queries into different SQL clauses such as SELECT, WHERE, GROUP BY, and ORDER BY and compute scores for each clause using set matching separately11. We report the following two metrics: question match, the exact set matching score over all questions, and interaction match, the exact set matching score over all interactions. The exact set matching score is 1 for each question only if all predicted SQL clauses are correct, and 1 for each interaction only if there is an exact set match for every question in the interaction.
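To make the clause-level idea concrete, here is a deliberately simplified sketch that compares the SELECT and WHERE clauses of two queries as unordered sets. It is only an illustration of the principle; the official evaluation script (footnote 11) handles many more clauses and SQL constructs.

import re

def clause_items(sql: str, clause: str,
                 stop_words=("FROM", "WHERE", "GROUP", "ORDER", "LIMIT")):
    """Extract the items of one clause as a set (rough, for illustration only)."""
    pattern = clause + r"\s+(.*?)(?:\s+(?:" + "|".join(stop_words) + r")\b|$)"
    match = re.search(pattern, sql, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return set()
    separator = "," if clause.upper() == "SELECT" else r"\bAND\b"
    return {part.strip().lower()
            for part in re.split(separator, match.group(1), flags=re.IGNORECASE)}

def exact_set_match(pred: str, gold: str, clauses=("SELECT", "WHERE")) -> bool:
    """Score 1 only if every compared clause matches as a set, ignoring item order."""
    return all(clause_items(pred, c) == clause_items(gold, c) for c in clauses)

print(exact_set_match("SELECT fname, lname FROM student",
                      "SELECT lname, fname FROM student"))  # True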

6.2 Results

We summarize the overall results of CD-Seq2Seq and SyntaxSQLNet on the development and test data in Table 6. The context-aware models (CD-Seq2Seq and SyntaxSQL-con) significantly outperform the standalone SyntaxSQLNet (SyntaxSQL-sta). The last two rows form a controlled ablation study: without access to the previous context history, the test set performance of SyntaxSQLNet decreases from 20.2% to 16.9% on question match and from 5.2% to 1.1% on interaction match, which indicates that context is a crucial aspect of the problem.

11 Details of the evaluation metrics can be found at https://github.com/taoyds/spider/tree/master/evaluation_examples

Model           Question Match     Interaction Match
                Dev     Test       Dev     Test
CD-Seq2Seq      17.1    18.3       6.7     6.4
SyntaxSQL-con   18.5    20.2       4.3     5.2
SyntaxSQL-sta   15.2    16.9       0.7     1.1

Table 6: Performance of various methods over all questions (question match) and all interactions (interaction match).

We note that SyntaxSQL-con scores higher in question match but lower in interaction match compared to CD-Seq2Seq. A closer examination shows that SyntaxSQL-con predicts more questions correctly in the early turns of an interaction (Table 7), which results in its overall higher question match accuracy. A possible reason for this is that SyntaxSQL-con adopts a stronger context-agnostic text-to-SQL module (SyntaxSQLNet vs. Seq2Seq adopted by CD-Seq2Seq). The higher performance of CD-Seq2Seq on interaction match can be attributed to better incorporation of information flow between questions by using turn-level encoders (Suhr et al., 2018), which can encode the history of all previous questions, compared to only the single previous question used by SyntaxSQL-con. Overall, the lower performance of the two extended context-dependent models shows the difficulty of SParC and that there is ample room for improvement.

Turn #     CD-Seq2Seq   SyntaxSQL-con
1 (422)    31.4         38.6
2 (422)    12.1         11.6
3 (270)    7.8          3.7
≥ 4 (89)   2.2          1.1

Table 7: Performance stratified by question turn on the development set. The performance of the two models decreases as the interaction continues.

Performance stratified by question position To gain more insight into how question position affects the performance of the two models, we report their performance on questions in different positions in Table 7. Questions in later turns of an interaction in general have greater dependency on previous questions and also greater risk of error propagation. The results show that both CD-Seq2Seq and SyntaxSQL-con consistently perform worse as the question turn increases, suggesting that both models struggle to deal with information flow from previous questions and accumulate errors from previous predictions. Moreover, SyntaxSQLNet significantly outperforms CD-Seq2Seq on questions in the first turn, but the advantage disappears in later turns (starting from the second turn), which is expected because the context encoding mechanism of SyntaxSQL-con is less powerful than the turn-level encoders adopted by CD-Seq2Seq.

Goal Difficulty     CD-Seq2Seq   SyntaxSQL-con
Easy (483)          35.1         38.9
Medium (441)        7.0          7.3
Hard (145)          2.8          1.4
Extra hard (134)    0.8          0.7

Table 8: Performance stratified by question difficulty on the development set. The performance of both models decreases as the questions become more difficult.

Performance stratified by SQL difficulty We group individual questions in SParC into different difficulty levels based on the complexity of their corresponding SQL representations, using the criteria proposed in Yu et al. (2018c). As shown in Figure 3, the questions tend to get harder as the interaction proceeds: more questions with hard and extra hard difficulties appear in late turns. Table 8 shows the performance of the two models across each difficulty level. As expected, the models perform better when the user request is easy. Both models fail on most hard and extra hard questions. Considering that the size and question types of SParC are very close to Spider, the relatively lower performance of SyntaxSQLNet on medium, hard and extra hard questions in Table 8 compared to its performance on Spider (17.6%, 16.3%, and 4.9% respectively) indicates that SParC introduces an additional challenge through its context dependencies, which are absent from Spider.

Performance stratified by thematic relation Finally, we report the model performances across thematic relations, computed over the 102 examples summarized in Table 2. The results (Table 9) show that the models, in particular SyntaxSQL-con, perform best on the answer refinement/theme relation. A possible reason for this is that questions in the answer theme category can often be interpreted without reference to previous questions, since the user tends to state the theme entity explicitly. Consider the example in the bottom row of Table 2: the user explicitly said "Statistics department" in their question, which belongs to the answer set of the previous question12. The overall low performance for all thematic relations (refinement and theme-property in particular) indicates that the two models still struggle to properly interpret the question history.

Thematic relation          CD-Seq2Seq   SyntaxSQL-con
Refinement                 8.4          6.5
Theme-entity               13.5         10.2
Theme-property             9.0          7.8
Answer refinement/theme    12.3         20.4

Table 9: Performance stratified by thematic relations. The models perform best on the answer refinement/theme relation, but do poorly on the refinement and theme-property relations.

7 Conclusion

In this paper, we introduced SParC, a large-scale dataset of context-dependent questions over a number of databases in different domains, annotated with the corresponding SQL representations. The dataset features wide semantic coverage and a diverse set of contextual dependencies between questions. It also introduces a unique challenge in mapping context-dependent questions to SQL queries in unseen domains. We experimented with two competitive context-dependent semantic parsing approaches on SParC. The model accuracy is far from satisfactory, and stratifying the performance by question position shows that both models degenerate in later turns of interaction, suggesting the importance of better context modeling. The dataset, baseline implementations and leaderboard are publicly available at https://yale-lily.github.io/sparc.

Acknowledgements

We thank Tianze Shi for his valuable discussions and feedback. We thank the anonymous reviewers for their thoughtful detailed comments.

12 As pointed out by one of the anonymous reviewers, there are fewer than 10 examples of the answer refinement/theme relation in the 102 analyzed examples. We need to see more examples before concluding that this phenomenon is general.


References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Baltimore, Maryland. Association for Computational Linguistics.

Nuria Bertomeu, Hans Uszkoreit, Anette Frank, Hans-Ulrich Krieger, and Brigitte Jorg. 2006. Contextual phenomena and thematic relations in database QA dialogues: results from a wizard-of-oz experiment. In Proceedings of the Interactive Question Answering Workshop at HLT-NAACL 2006, pages 1–8. Association for Computational Linguistics.

Joyce Y. Chai and Rong Jin. 2004. Discourse structure for context question answering. In HLT-NAACL 2004 Workshop on Pragmatics in Question Answering.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184. Association for Computational Linguistics.

Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, HLT '94, Stroudsburg, PA, USA. Association for Computational Linguistics.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 731–742. Association for Computational Linguistics.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan Dhanalakshmi Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In ACL 2018. Association for Computational Linguistics.

Stefan L. Frank. 2013. Uncertainty reduction as a measure of cognitive load in sentence comprehension. Topics in Cognitive Science, 5(3):475–94.

John Hale. 2006. Uncertainty about the rest of the sentence. Cognitive Science, 30(4):643–72.

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In SIGDIAL Conference.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. CoRR, abs/1704.08760.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1821–1831. Association for Computational Linguistics.

Tsuneaki Kato, Jun'ichi Fukumoto, Fumito Masui, and Noriko Kando. 2004. Handling information access dialogue through QA technologies - a novel challenge for open-domain question answering. In Proceedings of the Workshop on Pragmatics of Question Answering at HLT-NAACL 2004.

Roger P. Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106:1126–1177.

Fei Li and HV Jagadish. 2014. Constructing an interactive natural language interface for relational databases. VLDB.

Reginald Long, Panupong Pasupat, and Percy Liang. 2016. Simpler context-dependent logical forms via model projections. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1456–1465. Association for Computational Linguistics.

Scott Miller, David Stallard, Robert J. Bobrow, and Richard M. Schwartz. 1996. A fully statistical approach to natural language interfaces. In ACL.

Nikola Mrksic, Diarmuid O Seaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788. Association for Computational Linguistics.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1470–1480.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543. ACL.

Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces, pages 149–157. ACM.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. CoRR, abs/1808.07042.

Tianze Shi, Kedar Tatwawadi, Kaushik Chakrabarti, Yi Mao, Oleksandr Polozov, and Weizhu Chen. 2018. IncSQL: Training incremental text-to-SQL parsers with non-deterministic oracles. arXiv preprint arXiv:1809.05054.

Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018. Learning to map context-dependent sentences to executable formal queries. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2238–2249. Association for Computational Linguistics.

Chenglong Wang, Po-Sen Huang, Alex Polozov, Marc Brockschmidt, and Rishabh Singh. 2018. Execution-guided neural program decoding. In ICML Workshop on Neural Abstract Machines and Program Induction v2 (NAMPI).

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of NAACL. Association for Computational Linguistics.

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018b. SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task. In Proceedings of EMNLP. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018c. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In EMNLP.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI/IAAI, pages 1050–1055, Portland, OR. AAAI Press/MIT Press.

Luke Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. UAI.

Luke S. Zettlemoyer and Michael Collins. 2009. Learning context-dependent mappings from sentences to logical form. In ACL/IJCNLP.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Global-locally self-attentive encoder for dialogue state tracking. In ACL.

A Appendices

A.1 Additional Baseline Model Details

CD-Seq2Seq We use a bi-LSTM, LSTM^E, to encode the user utterance at each turn. At each step j of the utterance, LSTM^E takes as input the word embedding and the discourse state h^I_{i−1} updated for the previous turn i−1:

h^E_{i,j} = \mathrm{LSTM}^E([x_{i,j}; h^I_{i-1}], h^E_{i,j-1})

where i is the index of the turn and j is the index of the utterance token. The final hidden state of LSTM^E is used as the input of a uni-directional LSTM, LSTM^I, which is the interaction-level encoder:

h^I_i = \mathrm{LSTM}^I(h^E_{i,|x_i|}, h^I_{i-1}).

For each column header, we concatenate its table name and its column name separated by a special dot token (i.e., table_name.column_name), and the column header embedding h^C is the average of the word embeddings.

The decoder is implemented as another LSTM with hidden state h^D. We use the dot-product based attention mechanism to compute the context vector. At each decoding step k, we compute attention scores for all tokens in η previous turns (we use η = 5) and normalize them using softmax. Suppose the current turn is i, and consider the turns of 0, ..., η−1 distance from turn i. We use a learned position embedding φ^I(i−t) when computing the attention scores. The context vector is the weighted sum of the concatenation of the token embedding and the position embedding:

s_k(t, j) = [h^E_{t,j}; \phi^I(i-t)] W_{att} h^D_k

\alpha_k = \mathrm{softmax}(s_k)

c_k = \sum_{t=i-\eta}^{i} \sum_{j=1}^{|x_t|} \alpha_k(t, j) [h^E_{t,j}; \phi^I(i-t)]

At each decoding step, the sequential decoder chooses to generate a SQL keyword (e.g., select, where, group by, order by) or a column header. To achieve this, we use separate layers to score SQL keywords and column headers, and finally use the softmax operation to generate the output probability distribution:

o_k = \tanh([h^D_k; c_k] W_o)

m_{SQL} = o_k W_{SQL} + b_{SQL}

m_{column} = o_k W_{column} h^C

P(y_k) = \mathrm{softmax}([m_{SQL}; m_{column}])
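To make these equations concrete, here is a small PyTorch sketch of the attention and output layer. The tensor shapes, the absence of batching and masking, and the module boundaries are our simplifications for illustration, not the reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CDSeq2SeqOutputLayer(nn.Module):
    """Dot-product context vector plus separate scoring of SQL keywords and columns."""

    def __init__(self, dec_dim, ctx_dim, col_dim, num_keywords, out_dim=300):
        super().__init__()
        self.W_att = nn.Linear(ctx_dim, dec_dim, bias=False)  # s_k(t,j) = e_{t,j} W_att h^D_k
        self.W_o = nn.Linear(dec_dim + ctx_dim, out_dim)      # o_k = tanh([h^D_k; c_k] W_o)
        self.W_sql = nn.Linear(out_dim, num_keywords)         # m_SQL = o_k W_SQL + b_SQL
        self.W_col = nn.Linear(out_dim, col_dim, bias=False)  # m_column = o_k W_column h^C

    def forward(self, h_dec, history_tokens, col_emb):
        # h_dec:          (dec_dim,)             decoder state h^D_k
        # history_tokens: (N, ctx_dim)           [h^E_{t,j}; phi^I(i-t)] from the eta previous turns
        # col_emb:        (num_columns, col_dim) column header embeddings h^C
        scores = self.W_att(history_tokens) @ h_dec         # (N,)
        alpha = F.softmax(scores, dim=0)                    # attention weights alpha_k
        c_k = alpha @ history_tokens                        # context vector c_k: (ctx_dim,)

        o_k = torch.tanh(self.W_o(torch.cat([h_dec, c_k])))
        m_sql = self.W_sql(o_k)                             # scores for SQL keywords
        m_col = col_emb @ self.W_col(o_k)                   # scores for column headers
        return F.softmax(torch.cat([m_sql, m_col]), dim=0)  # P(y_k) over both categories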

It's worth mentioning that we experimented with a SQL segment copying model similar to the one proposed in Suhr et al. (2018). We implement our own segment extraction procedure by extracting SELECT, FROM, GROUP BY, ORDER BY clauses as well as different conditions in WHERE clauses. In this way, we can extract 3.9 segments per SQL on average. However, we found that adding segment copying does not significantly improve the performance because of error propagation. Better leveraging previously generated SQL queries remains an interesting future direction for this task.

SyntaxSQL-con As in Yu et al. (2018b), the following is defined to compute the conditional embedding H_{1/2} of an embedding H_1 given another embedding H_2:

H_{1/2} = \mathrm{softmax}(H_1 W H_2^\top) H_1.

Here W is a trainable parameter. In addition, a probability distribution from a given score matrix U is computed by

P(U) = \mathrm{softmax}(V \tanh(U)),

where V is a trainable parameter. To incorporate the context history, we encode the question right before the current question and add it to each module as an input. For example, the COL module of SyntaxSQLNet is extended as follows. H_{PQ} denotes the hidden states of the LSTM over the embeddings of the previous question, and the W^{num}_3 (H^{num}_{PQ/COL})^\top and W^{val}_4 (H^{val}_{PQ/COL})^\top terms add history information to the prediction of the column number and column value respectively.

P^{num}_{COL} = P(W^{num}_1 (H^{num}_{Q/COL})^\top + W^{num}_2 (H^{num}_{HS/COL})^\top + W^{num}_3 (H^{num}_{PQ/COL})^\top)

P^{val}_{COL} = P(W^{val}_1 (H^{val}_{Q/COL})^\top + W^{val}_2 (H^{val}_{HS/COL})^\top + W^{val}_3 H_{COL}^\top + W^{val}_4 (H^{val}_{PQ/COL})^\top)
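The sketch below illustrates these two formulas in PyTorch: a conditional embedding helper and a column-count scorer that includes the extra previous-question term. The encoders producing H_Q, H_HS, H_PQ and H_COL are omitted, the softmax axis reflects our reading of the formula, and the output shape is simplified, so this is an illustration of the idea rather than the released SyntaxSQLNet code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conditional_embedding(H1, H2, W):
    # H1: (n1, d), H2: (n2, d), W: (d, d). Scores H1 W H2^T have shape (n1, n2);
    # we normalize over the H1 positions and return one weighted sum of H1 rows
    # per H2 row, i.e. a (n2, d) matrix (our reading of H_{1/2}).
    scores = H1 @ W @ H2.t()
    attn = F.softmax(scores, dim=0)
    return attn.t() @ H1

class ColNumScorer(nn.Module):
    """P^num_COL with the added previous-question term W^num_3 H^num_{PQ/COL}."""

    def __init__(self, d, max_cols=4):
        super().__init__()
        self.W_cond = nn.Parameter(torch.randn(d, d) * 0.01)
        self.W1 = nn.Linear(d, d, bias=False)  # current-question term
        self.W2 = nn.Linear(d, d, bias=False)  # decoding-history term
        self.W3 = nn.Linear(d, d, bias=False)  # previous-question term (the extension)
        self.V = nn.Linear(d, max_cols)        # P(U) = softmax(V tanh(U))

    def forward(self, H_Q, H_HS, H_PQ, H_COL):
        U = (self.W1(conditional_embedding(H_Q, H_COL, self.W_cond))
             + self.W2(conditional_embedding(H_HS, H_COL, self.W_cond))
             + self.W3(conditional_embedding(H_PQ, H_COL, self.W_cond)))
        # Simplified output: a per-column distribution over possible column counts.
        return F.softmax(self.V(torch.tanh(U)), dim=-1)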

A.2 Additional Data Examples

We provide additional SParC examples in Figure 4 and examples with different thematic relations in Figure 5.


D3: Database about wine.
C4: Find the county where produces the most number of wines with score higher than 90.

Q1: How many different counties are all wine appellations from?
S1: SELECT COUNT(DISTINCT county) FROM appellations

Q2: How many wines does each county produce?
S2: SELECT T1.county, COUNT(*) FROM appellations AS T1 JOIN wine AS T2 ON T1.appellation = T2.appellation GROUP BY T1.county

Q3: Only show the counts of wines that score higher than 90?
S3: SELECT T1.county, COUNT(*) FROM appellations AS T1 JOIN wine AS T2 ON T1.appellation = T2.appellation WHERE T2.score > 90 GROUP BY T1.county

Q4: Which county produced the greatest number of these wines?
S4: SELECT T1.county FROM appellations AS T1 JOIN wine AS T2 ON T1.appellation = T2.appellation WHERE T2.score > 90 GROUP BY T1.county ORDER BY COUNT(*) DESC LIMIT 1

--------------------------------------

D5: Database about districts.
C5: Find the names and populations of the districts whose area is greater than the average area.

Q1: What is the total district area?
S1: SELECT sum(area_km) FROM district

Q2: Show the names and populations of all the districts.
S2: SELECT name, population FROM district

Q3: Excluding those whose area is smaller than or equals to the average area.
S3: SELECT name, population FROM district WHERE area_km > (SELECT avg(area_km) FROM district)

--------------------------------------

D6: Database about books.
C6: Find the title, author name, and publisher name for the top 3 best sales books.

Q1: Find the titles of the top 3 highest sales books.
S1: SELECT title FROM book ORDER BY sale_amount DESC LIMIT 3

Q2: Who are their authors?
S2: SELECT t1.name FROM author AS t1 JOIN book AS t2 ON t1.author_id = t2.author_id ORDER BY t2.sale_amount DESC LIMIT 3

Q3: Also show the names of their publishers.
S3: SELECT t1.name, t3.name FROM author AS t1 JOIN book AS t2 ON t1.author_id = t2.author_id JOIN press AS t3 ON t2.press_id = t3.press_id ORDER BY t2.sale_amount DESC LIMIT 3

Figure 4: More examples in SParC.

D7: Database about school departments.
C7: What are the names and budgets of the departments with average instructor salary greater than the overall average?

Q1: Please list all different department names.
S1: SELECT DISTINCT dept_name FROM department

Q2: Show me the budget of the Statistics department. (Theme/refinement-answer)
S2: SELECT budget FROM department WHERE dept_name = "Statistics"

Q3: What is the average salary of instructors in that department? (Theme-entity)
S3: SELECT AVG(T1.salary) FROM instructor as T1 JOIN department as T2 ON T1.department_id = T2.id WHERE T2.dept_name = "Statistics"

Q4: How about for all the instructors? (Refinement)
S4: SELECT AVG(salary) FROM instructor

Q5: Could you please find the names of the departments with average instructor salary less than that? (Theme/refinement-answer)
S5: SELECT T2.dept_name FROM instructor as T1 JOIN department as T2 ON T1.department_id = T2.id GROUP BY T1.department_id HAVING AVG(T1.salary) < (SELECT AVG(salary) FROM instructor)

Q6: Ok, how about those above the overall average? (Refinement)
S6: SELECT T2.dept_name FROM instructor as T1 JOIN department as T2 ON T1.department_id = T2.id GROUP BY T1.department_id HAVING AVG(T1.salary) > (SELECT AVG(salary) FROM instructor)

Q7: Please show their budgets as well. (Theme-entity)
S7: SELECT T2.dept_name, T2.budget FROM instructor as T1 JOIN department as T2 ON T1.department_id = T2.id GROUP BY T1.department_id HAVING AVG(T1.salary) > (SELECT AVG(salary) FROM instructor)

Figure 5: Additional example in SParC annotated with different thematic relations. Entities (purple), properties (magenta), constraints (red), and answers (orange) are colored.