Spreadsheet As a Relational Database Engine

[Extended Abstract]

ABSTRACT

Without any doubt, spreadsheets are the most commonly used applications for data management and analysis. Perhaps they are even among the most widely used computer applications of all kinds. However, the spreadsheet paradigm of computation still lacks sufficient theoretical analysis.

In this paper we consider the relationship of spreadsheets to database systems. We demonstrate that a spreadsheet can play the role of a relational database engine, without any use of macros or built-in programming languages, merely by using spreadsheet formulas. We achieve that by implementing all operators of relational algebra by means of spreadsheet functions. Given a definition of a database (say in SQL), it is possible to construct a spreadsheet workbook with empty worksheets for data tables and worksheets filled with formulas for queries. From then on, when the user enters, alters or deletes data in the data worksheets, the formulas in query worksheets automatically compute the actual results of the queries. Thus, the spreadsheet serves as data storage and executes SQL queries, and therefore acts as a relational database engine.

Syntactically and semantically, the paper is based on Microsoft Excel (TM) 2003, because so far there is no formal model of spreadsheets that might be used for that purpose. However, the presented constructions work in other spreadsheet systems, too.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems - relational databases; H.4.1 [Information Systems Applications]: Office Automation - spreadsheets; K.8.1 [Personal Computing]: Application Packages - spreadsheets

General Terms

spreadsheets, relational databases

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SIGMOD 2010

Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$10.00.

Keywords

spreadsheets, relational databases, relational algebra, SQL, end-user computing

1. INTRODUCTION

Spreadsheets are the end-user computing counterpart of databases and OLAP in enterprise-scale computing. They serve basically the same purpose, data management and analysis, but at the opposite extreme of the data quantity scale.

At the same time spreadsheets are extremely popular. Their users range from every(wo)men who manage their home budgets to business professionals and researchers who create and examine extremely sophisticated models and data. For example, the journal Science writes in its instructions for authors [4] as follows.

In general, Science will accept the following nine categories of supporting online material:

[. . . ]

8. Databases – In certain cases, Science will consider linked database presentations more complex than a flat text file or table; these can include, for example, tables hyperlinked to public sequence, array, or protein databases, or collections of hypertext tables or Excel files linked to explanatory image files or tables. Such presentations may require special treatment, and should be discussed in advance with the online editor.

Submission of databases such as those described above will generally only be appropriate when the data in question can not be accommodated by an established public repository such as Genbank or PDB.

In practice, Excel files are quite common as a form of supporting online material in Science. The same journal provides an example of a scientific controversy [6, 3] which finally turned out to be related to the design of a spreadsheet used for data analysis.

Despite that, and surprisingly enough, spreadsheets, and the spreadsheet paradigm in general, lack sufficient theoretical analysis. There is not even a formal model of spreadsheets which might be the base of such analysis. There are only a few papers considering spreadsheets from the point of view of the functional programming paradigm [1, 2, 7, 9, 11], while we think that spreadsheets constitute a paradigm by themselves. There is also a vast literature devoted to the practice of using spreadsheets, and even the European Spreadsheet Risks Interest Group (EuSpRIG, http://www.eusprig.org) with its annual conference.

It seems therefore surprising that the computer science community has not studied this extremely popular, important and successful type of application any further.

In this paper we do not attempt to create a formal model of spreadsheets. Instead, we aim at providing strong evidence that spreadsheets are a very interesting type of software system and deserve more research. Specifically, we consider the relation of spreadsheets to database systems. It is a natural comparison, because spreadsheets indeed often play the role of small databases at the end-user level of computing.

We demonstrate that virtually any spreadsheet system is a relational database engine. We do so by implementing all operators of relational algebra using spreadsheet functions. For each query in SQL, we construct a spreadsheet workbook with empty worksheets for data tables and worksheets filled with formulas for queries. As the user enters, alters or deletes data tuples in the data worksheets, the formulas in query worksheets automatically compute the actual results of the queries. Thus, the spreadsheet serves as data storage and executes SQL queries. It is therefore a relational database engine. Consequently, any specification of a database, written in SQL in the form of table and view definitions, can be compiled into a spreadsheet workbook which has exactly the same functionality as if the database were implemented in a classical RDBMS. Crucially, this is achieved without any use of macros written in an external programming language, like Visual Basic or the like. One might also consider our construction as an implementation of a relational database on a completely new type of (virtual) hardware.

As a model of spreadsheet syntax and semantics we take Microsoft Excel (TM) (the general reference is [8]), but our constructions work in other similar systems, like OpenOffice Calc, gnumeric or Google docs, too.

2. TECHNICALITIES

The paper is written assuming Microsoft Excel (TM) 2003 as the target system. The newest (at the time of this writing) Excel 2007 provides a couple of new functions, which simplify some of the tasks but are not present in other spreadsheet systems. Therefore we chose the older version.

2.1 R1C1 notation

We assume the reader to be familiar with spreadsheets. The choice of Excel is due to its popularity and the fact that it accepts the row-column R1C1-style addressing of cells and ranges, as opposed to, e.g., OpenOffice Calc, Google docs and similar tools. This notation is easier to handle in a formal description, although in everyday practice the equivalent A1 notation dominates. The key advantage of the R1C1 notation is that the meaning of a formula is independent of the cell in which it is located.

In the R1C1 notation, both rows and columns of worksheets are numbered by integers from 1 onward (so that an Excel spreadsheet set to R1C1 notation can be easily distinguished from one in the classical A1 notation). For arbitrary nonzero integers i and j and nonzero natural numbers m, n the following expressions are cell references in the R1C1 notation: RmCn, R[i]Cm, RmC[j], R[i]C[j], RCm, RC[i], RmC, R[i]C.

The number after 'R' refers to the row number and the number after 'C' to the column number. If the number is missing, it means "same row (column)" as the cell in which this expression is used. If the number is written in square brackets, it is a relative reference, and the cell to which this expression points is determined by adding the number in brackets to the row (column) number of the present cell. Numbers without brackets are absolute references and refer to a cell whose row (column) number is equal to that number. For example, R[-1]C7 denotes a cell which is in the row directly above the present one in column 7, while RC[3] denotes a cell in the same row as the present one and 3 columns to the right. If the R part or the C part is itself omitted, the resulting expression denotes the whole column or row (respectively), e.g., C7 is the (whole) column number 7. For the purpose of data validation or for referencing cells in other worksheets, RC may also be used; it refers to the cell whose row and column numbers are equal to those of the cell in which this expression is located. Ranges are generally composed of two cell references separated by a colon, and denote the rectangular area spanned by the two cells.

max stands for the maximal number of rows permitted in a worksheet. This number may be imposed by the spreadsheet system that is used, or by the user who decides to limit the quantity of data that can be stored, in exchange for better performance.
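The addressing rules above can be illustrated outside the spreadsheet by a minimal Python sketch. The function name `resolve` and the regular expression are our own illustrative choices, not part of any spreadsheet system, and the sketch covers only single-cell references (whole rows, whole columns and ranges are left out).

```python
import re

# A single-cell R1C1 reference such as R3C7, R[-1]C7, RC[3] or RC.
# Groups 2/3 hold the relative/absolute row part, groups 5/6 the column part.
_REF = re.compile(r"^R(\[(-?\d+)\]|(\d+))?C(\[(-?\d+)\]|(\d+))?$")


def resolve(ref, row, col):
    """Return the absolute (row, column) denoted by `ref` when the
    reference is written in the cell located at (row, col)."""
    m = _REF.match(ref)
    if m is None:
        raise ValueError(f"not a single-cell R1C1 reference: {ref!r}")
    _, rel_r, abs_r, _, rel_c, abs_c = m.groups()

    def pick(rel, absolute, current):
        if rel is not None:        # bracketed number: offset from the present cell
            return current + int(rel)
        if absolute is not None:   # plain number: absolute position
            return int(absolute)
        return current             # number missing: same row (column)

    return pick(rel_r, abs_r, row), pick(rel_c, abs_c, col)


# The two examples from the text, evaluated in the cell at row 5, column 2.
above_in_col_7 = resolve("R[-1]C7", 5, 2)
three_to_the_right = resolve("RC[3]", 5, 2)
```

The same reference text resolves to different cells depending on where it is written, which is exactly why a formula in R1C1 notation can be copied unchanged down a whole column.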

2.2 IF function

IF is a conditional function in spreadsheets. Its syntax is IF(condition,true_branch,false_branch). What makes it unusual is that its evaluation is lazy, i.e., after the condition is evaluated and yields either TRUE or FALSE, only one of the branches is evaluated. This makes IF very useful. It can be used to protect functions from being applied to arguments of wrong types, to trap errors, and, last but not least, to speed up execution of queries by avoiding lengthy computations in certain cases.

2.3 SUMPRODUCT function

We will often use a special function called SUMPRODUCT. It is one of the few functions which can operate on lists of data elements rather than on single ones. Its uses will generally be modifications of the following two examples.

Example 1.

=SUMPRODUCT((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4))

is calculated as follows:

1. each cell in the range R1C1:R5C1 is compared with R1C3, and this yields a sequence of five booleans;

2. each cell in the range R1C2:R5C2 is compared with R1C4, and this yields another sequence of five booleans;

3. the two sequences from the previous items are multiplied coordinate-wise, which results in automatic data type conversion from booleans to integers (with 1 corresponding to TRUE and 0 to FALSE), and then normal multiplication;

4. SUMPRODUCT then adds the five numbers up and produces a single number as a result.

Figure 1: The idea of a database implementation in a spreadsheet

Consequently, the final result is the number of rows in which the columns C1 and C2 contain the same pair of numbers as in R1C3:R1C4.

Example 2.

=SUMPRODUCT((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4)*R1C5:R5C5)

is calculated as follows:

1. the first three steps of evaluation are the same as before;

2. the sequence of 0s and 1s from the previous item and the range R1C5:R5C5 are multiplied again coordinate-wise, which results in a sequence of five numbers;

3. again the sum of the above five numbers is returned.

Consequently, the final result is the sum of the values in C5, calculated over those rows in which the columns C1 and C2 contain the same pair of numbers as in R1C3:R1C4.

These two examples generalize to sum-multiplication of more than two or three arrays.

The behavior of SUMPRODUCT very much resembles the way array formulas are evaluated. In fact, the formulas

{=SUM((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4))}

and

{=SUM((R1C1:R5C1=R1C3)*(R1C2:R5C2=R1C4)*R1C5:R5C5)}

are exactly equivalent to our two examples.
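To make the evaluation steps of the two examples concrete, here is a minimal Python sketch. The helper `sumproduct` and the sample data are hypothetical and only mimic the behavior described above (booleans counted as 1 and 0, coordinate-wise multiplication, then summation); it is not the spreadsheet's own implementation.

```python
def sumproduct(*arrays):
    """Multiply the given equal-length arrays coordinate-wise and add up
    the products. Booleans behave as 1 (TRUE) and 0 (FALSE)."""
    total = 0
    for values in zip(*arrays):
        product = 1
        for v in values:
            product *= v
        total += product
    return total


# Hypothetical contents of R1C1:R5C1, R1C2:R5C2 and R1C5:R5C5.
c1 = [1, 2, 1, 1, 3]
c2 = [7, 7, 8, 7, 7]
c5 = [10, 20, 30, 40, 50]
key1, key2 = 1, 7  # stand-ins for the values held in R1C3 and R1C4

# Example 1: number of rows whose (C1, C2) pair equals (key1, key2).
matching_rows = sumproduct([x == key1 for x in c1], [y == key2 for y in c2])

# Example 2: sum of the C5 values over exactly those rows.
matching_sum = sumproduct([x == key1 for x in c1], [y == key2 for y in c2], c5)
```

With this sample data, rows 1 and 4 match, so the first call counts two rows and the second adds 10 and 40.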

3. ARCHITECTURE OF A DATABASE IMPLEMENTED IN A SPREADSHEET

3.1 Overview

In this paper, we disregard a number of minor issues arising in practical implementation of database operations in a spreadsheet. First of all, there is the obvious limitation on the number and sizes of relations, views and their intermediate results, imposed by the maximal available number of worksheets, columns and rows in the spreadsheet system at hand. Next, the size of the data values (integers, strings, etc.) is also limited. The variety of data types in spreadsheets is also restricted when compared to database systems.

The overall architecture of a relational database implemented in a spreadsheet is as follows.

Given a specification of the database, an implementation of the database is created by an external program (which plays the role of a query compiler), in the form of an .xls, .xlsx, .ods, etc., file.

The whole resulting database is a workbook, consisting of one worksheet per data table and one worksheet per view in the database.

The data table worksheets are where the data is entered, updated and deleted. In the case of the (more theoretical in flavor) implementation of the relational algebra, the data table sheets do not contain any formulas and are simply the place to enter tuples into relations. In the case of the SQL implementation, the cells are equipped with data validation formulas, which perform data type verification and enforce PRIMARY KEY, FOREIGN KEY and other integrity constraints included in the CREATE TABLE statements.

The query (view) worksheets are not supposed to be edited by the user. They contain columns filled with formulas, which calculate the consecutive values of the result of the query. Besides the result columns of the query, the view worksheets can also contain a number of hidden columns, which calculate and store intermediate results emerging during query evaluation. It is important that the formulas are completely uniform in each column of the database workbook, and that they do not depend on the data which will be stored in the application. Initially all formulas compute the empty string "" value, representing unused space. When the user manually enters data into the tables, the automatic recomputation of the spreadsheet causes the results of queries to be computed and appear in the view worksheets.

4. THEORETICAL LEVEL: RELATIONAL ALGEBRA

We assume the semantics over a fixed domain of (the spreadsheet's implementation of) integers, so that a relation is a set or multiset of tuples over the integers that are implemented in the spreadsheet software.

4.1 Compositionality

We assume the unnamed syntax for the relational algebra: relations and queries have columns, which are numbered and do not have any names. Sometimes we consider the expressions C1, C2, etc., as the names of the worksheet columns, as well as the names of the columns in relations.

The representation of a relation r of arity n is a group of n consecutive columns in a worksheet, whose rows contain the tuples in the relation. The rows in which there are no tuples of r are assumed to be filled with the empty string formula ="", evaluating to the empty string value "", which the user can replace by the new tuples of the relation. The empty string is never a component of a tuple in a relation or query. Therefore either all cells in a row contain the empty string, or none does. The rows of tables and queries evaluating to empty strings are called null rows henceforth.

The assumption that ="" formulas fill the empty rows of data tables is only for uniformity of presentation. A formula in a cell cannot evaluate to "empty cell" (because the formula occupies that cell anyway), only to the empty string. Therefore, if blank cells were used in empty rows, formulas expressing queries would have to be adapted to accept unused space in two different forms: empty cells in data tables, and empty strings in results of other queries. Moreover, blank cells are interpreted as 0 by many Excel functions, which makes formulas prepared for blank cells even more complicated.

The representation of a relational algebra query Q of arity m is a group of l + m consecutive columns in a worksheet. All its rows from 1 to max are filled with formulas (identical in all cells of each column), which calculate the tuples in Q. We assume that the formulas in the last m columns should return either (a component of) a tuple in the result of Q, or the empty string value "". The additional l columns are also filled with identical formulas, which calculate intermediate results. A worksheet of this kind can be created by entering the formulas in the first row, and then filling them downward to fill the first max rows. This uniformity assumption means in particular that the formulas are completely independent of the data they will work on. However, we would like to stress that there is no reason to reject nonuniform implementations, should they appear to be more effective or permit expressing queries inexpressible in a uniform way.

In the following we will consider both set and bag (multiset) semantics of the relational algebra. In the first case, duplicate rows are not permitted in the relations and queries; in the latter they are permitted. However, even in the set semantics a spreadsheet representation of a relation may contain many null rows.

Furthermore, the representation may be loose if null rows are interspersed with the tuples, or standard if all the tuples come first, followed by the null rows.

Consequently, we have loose-set, loose-bag, standard-set and standard-bag semantics.

No matter which of the above semantics we have in mind, the result of the query appears exactly as if it were a table, and can be used as such. Now the only thing necessary to compose queries is to locate their implementations side by side in a single worksheet and change the input column numbers in the formulas computing the outermost query to agree with the column numbers of the outputs of the argument queries (and then the output columns of the argument queries become the intermediate results columns of the composition).

Therefore, queries represented in this way are compositional.

Now it suffices to demonstrate that each of the following relational algebra operators from [5] can be implemented in a spreadsheet:

• Two operations peculiar to spreadsheets, absent in [5]:

– Error trapping.

– Standardization.

• Sorting.

• Duplicate removal δ(r).

• Selection σ_θ(r).

• Projection π_{i,j,...}(r).

• Union r ∪ s.

• Difference r \ s.

• Cartesian product r × s.

• Grouping with aggregation γ_L(r):

– Grouping with SUM.

– Grouping with COUNT.

– Grouping with AVG.

– Grouping with MAX and MIN.

Note that in the Google docs spreadsheet there are special built-in operators for sorting and duplicate removal. Sorting is of course present in Excel and other spreadsheet systems (duplicate removal is additionally present in Excel 2007), but it cannot be invoked by a formula and requires a sequence of clicks by the user. This cannot be accepted, as we want the queries to compute automatically.

Generally, the sets of functions present in spreadsheets are highly redundant, so the same computation can be achieved in many different ways. In this theoretical section we choose solutions which are common to most (or even all) spreadsheet systems. In this way we believe we address the spreadsheet paradigm itself, even though its definition has not yet been formulated in the literature.

Figure 2: The query SELECT lastname, AVG(income) FROM incomes GROUP BY lastname HAVING COUNT(*)>3, computing average family income, implemented in a spreadsheet. (Errors appearing in the worksheet are intended.)

4.2 Notation

We use the following convention for presenting queries implemented in a spreadsheet:

COLUMNS < =FORMULA

means that =FORMULA is entered into COLUMNS, which may be specified to be either a single column (e.g. C5), a range of a few columns (e.g. C5:C7), or a single cell (e.g. R1C5), and in each case belongs to the columns with intermediate values. The notation

COLUMNS << =FORMULA

indicates that the formulas located in COLUMNS calculate the output of the query.

In all cases, we fill the first max rows of the indicated columns.

Sometimes the output columns are not specified, and then it is always indicated that the output is computed by applying another, already defined operation to some of the columns with intermediate results. In any case, it is assumed that the first max rows of COLUMNS are filled with formulas, except when it is a single cell. In the following, max always stands for a concrete integer, which is written directly into the formulas.

Generally, we assume the arguments of the algebra operators to be binary or ternary relations or queries; the generalization to higher arities is straightforward.

Except for standardization and sorting, in all other cases we assume the input to be in standard form, i.e., with null rows at the bottom.

4.3 Error trapping and standardization

In this section we describe two special-purpose operators, which perform very common and useful tasks specific to our spreadsheet environment.

4.3.1 Error trapping

If we replace a formula =F, which may produce an error, by the formula =IF(ISERROR(F),"",F), any error produced by =F is replaced by the empty string, and otherwise the value is the same as the value of =F.
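The effect of this wrapper can be mimicked in Python by a small sketch. The name `trap` is hypothetical, and Python exceptions stand in for spreadsheet error values; note also that the spreadsheet formula mentions F twice, whereas the sketch evaluates it once.

```python
NULL = ""  # the empty string value used for null rows


def trap(formula):
    """Evaluate `formula` (a zero-argument callable standing in for =F)
    and return NULL instead of propagating any error."""
    try:
        return formula()
    except Exception:
        return NULL


# A lookup that fails, as MATCH does when no match exists, becomes NULL;
# a computation that succeeds keeps its value.
missing = trap(lambda: [1, 2, 3].index(5))
present = trap(lambda: [1, 2, 3].index(2))
```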

4.3.2 Standardization

This operation converts a relation from loose to standard form, moving null rows to the bottom. The relative order of non-null rows is preserved. We assume that columns C1 and C2 contain the source data.

C3 < =SUMPRODUCT((R1C1:RC1<>"")*1)

counts the non-null rows above the present row, including the present one. For a non-null row, this number is the row number to which it should be relocated. Note that multiplication by 1 enforces boolean to integer conversion.

C4 < =MATCH(ROW(),R1C3:RmaxC3,0)

The function MATCH(ROW(),R1C3:RmaxC3,0) searches C3 for the number of the present row (computed by ROW()) and returns the row number of the first exact match found. If no match is found (i.e., we are in a row whose number is higher than the total number of non-null rows), an error is returned.

C5:C6 << =IF(ISERROR(RC4),"",INDEX(R1C[-4]:RmaxC[-4],RC4))

Errors are trapped, and when there is no error, INDEX returns the data from the suitable row of C[-4]. Thus the values from C1:C2 get relocated to their positions calculated in C3.
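The three columns of the standardization construction can be traced in a minimal Python sketch. The function `standardize` and the representation of a worksheet as a list of tuples are our own illustrative assumptions; each step is annotated with the column it imitates.

```python
NULL = ""  # the empty string marks a null row


def standardize(rows):
    """Move null rows to the bottom while keeping the relative order
    of the non-null rows. `rows` is a list of equal-length tuples."""
    width = len(rows[0]) if rows else 0

    # C3: running count of non-null rows from the top down to each row.
    c3, count = [], 0
    for row in rows:
        if row[0] != NULL:
            count += 1
        c3.append(count)

    out = []
    for row_no in range(1, len(rows) + 1):
        if row_no in c3:
            # C4: first row whose C3 value equals the present row number
            # (MATCH with exact matching); the output columns then fetch it.
            out.append(rows[c3.index(row_no)])
        else:
            # No match: the trapped error yields a null row.
            out.append((NULL,) * width)
    return out


loose = [(NULL, NULL), (5, 6), (NULL, NULL), (1, 2), (NULL, NULL)]
standard = standardize(loose)
```

Taking the first exact match matters: a null row shares its C3 value with the last non-null row above it, and that non-null row always occurs earlier in the column.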

4.4 Sorting

Now we describe an implementation of sorting, which is a generalization of standardization. We assume that columns C1 and C2 contain the source data and we sort in ascending order by the values in C1.

C3 < =SUMPRODUCT((R1C1:RmaxC1<=RC1)*1)

This puts in RiC3 the number of entries in column C1 which are smaller than or equal to RiC1. When compared by <=, "" is larger than any number, so null rows do not give any errors, and in the following are treated as the largest entries.

C4 < =RC3-SUMPRODUCT((RC1:RmaxC1=RC1)*1)+1

Now RiC4 holds the number of entries in column C1 which are either smaller than RiC1, or equal to it and located in the same row or above it. This is the number of the row into which RiC1 should be relocated during the sort.

C5:C6 << =INDEX(R1C[-4]:RmaxC[-4],MATCH(ROW(),R1C4:RmaxC4,0))

This part is very similar to the standardization solution, except that there are no errors to be trapped and we combine two formulas into one.

Sorting in descending order is done by reversing the signs of the numbers by the formula =IF(RC[-1]="","",-RC[-1]), sorting in ascending order, and then reversing the signs again. This leaves the null rows at the bottom.

An important property of this operation is that rows with the empty string in the column on which the sort is performed are moved to the bottom. Consequently, sorting brings any query or relation to standard form. In particular, if sorting is necessary, there is no need to standardize first. Moreover, this form of sorting does not affect the relative order of tuples which have identical values in the column on which they are sorted.
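The rank computation behind this sorting construction can be followed in a minimal Python sketch. The names `leq` and `sort_rows` are hypothetical; `leq` encodes the convention stated above that the empty string compares as larger than any number, which Python does not provide by itself.

```python
NULL = ""  # the empty string marks a null row


def leq(a, b):
    """a <= b, with the empty string treated as larger than any number."""
    if a == NULL:
        return b == NULL
    if b == NULL:
        return True
    return a <= b


def sort_rows(rows):
    """Sort `rows` (a list of tuples) in ascending order of their first
    component by computing the target row of every source row."""
    keys = [row[0] for row in rows]
    target = []
    for i, key in enumerate(keys):
        c3 = sum(1 for x in keys if leq(x, key))          # entries <= key
        equal_below = sum(1 for x in keys[i:] if x == key)  # equal, at row i or below
        target.append(c3 - equal_below + 1)               # C4, counted from 1
    # Output columns: row number n fetches the source row whose target is n.
    return [rows[target.index(n)] for n in range(1, len(rows) + 1)]


unsorted_rows = [(3, 30), (NULL, NULL), (1, 10), (3, 31), (2, 20)]
sorted_rows = sort_rows(unsorted_rows)
```

The two rows with key 3 keep their original relative order, and the null row ends up last, in line with the properties claimed for the construction.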

4.5 Duplicate removal

Next we describe the implementation of duplicate removal, which, among other things, converts its input data from bag to set semantics. For the purpose of illustration, we assume the table to contain two columns C1:C2.

C3 < =SUMPRODUCT((R1C1:RC1=RC1)*(R1C2:RC2=RC2))

This causes RiC3 to contain the number of tuples from C1:C2 which are equal to RiC1:RiC2 and are located at the same level or above it. This number is 1 iff the row contains the first occurrence of this tuple.

C4:C5 << =IF(RC3=1,RC[-3],"")

Now the first occurrences of tuples are copied into C4:C5, and the others are replaced by null rows. Standardization can be used to bring the result to the standard form, if desired.
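A minimal Python sketch of the same idea follows; `remove_duplicates` is a hypothetical name, and rows are modelled as tuples in a list.

```python
NULL = ""  # the empty string marks a null row


def remove_duplicates(rows):
    """Keep the first occurrence of every tuple and replace later
    occurrences by null rows; the result is in loose form."""
    out = []
    for i, row in enumerate(rows):
        # C3: occurrences of this tuple in the present row or above it.
        seen = sum(1 for earlier in rows[: i + 1] if earlier == row)
        out.append(row if seen == 1 else (NULL,) * len(row))
    return out


bag = [(1, 2), (3, 4), (1, 2), (3, 4), (5, 6)]
as_set = remove_duplicates(bag)
```

As in the spreadsheet, every row is compared with all rows above it, so the work grows quadratically with the number of rows.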

4.6 Selection

Assume that we are given a relation r located in C1:C2 and we want to compute σ_θ(r), where θ is a boolean combination of equalities and inequalities concerning the values of columns of r and constants. Then we use a spreadsheet formula expressing θ to substitute "" for the rows which do not satisfy θ. This is best explained by an example: if θ is (C1 ≤ 100 ∧ C2 > C1) ∨ C2 ≠ 175, then the selection is implemented by

C3:C4 << =IF(OR(AND(RC1<=100,RC2>RC1),RC2<>175),RC[-2],"")

It leaves the result of the selection in a loose form (set or bag, inherited from the input), which, as always, can be brought to the standard form.
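The selection example can be restated as a minimal Python sketch. The function `select` and the way θ is passed as a callable are our own assumptions; null rows are skipped explicitly here, because Python, unlike the spreadsheet, refuses to compare the empty string with a number.

```python
NULL = ""  # the empty string marks a null row


def select(rows, theta):
    """Replace by null rows all rows of `rows` that do not satisfy the
    condition `theta`, a callable taking the column values of one row."""
    out = []
    for row in rows:
        if row[0] != NULL and theta(*row):
            out.append(row)
        else:
            out.append((NULL,) * len(row))
    return out


def theta(c1, c2):
    # The condition of the example: (C1 <= 100 and C2 > C1) or C2 != 175.
    return (c1 <= 100 and c2 > c1) or c2 != 175


relation = [(50, 60), (200, 175), (80, 175), (NULL, NULL), (300, 10)]
selected = select(relation, theta)
```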

4.7 Projection

The case of projection is quite easy: it amounts to omitting some columns from the input relation/query.

4.8 Union

Assume that we are given two relations located in C1:C2 and C3:C4, respectively, and that the sum of their cardinalities does not exceed max.

Then use the following formulas to calculate their union in standard bag form, which can subsequently be brought to loose set form by duplicate removal and then to standard set form by standardization.

R1C5 < =COUNT(C1)

This is the number of non-null rows of C1.

C6:C7 << =IF(ROW()<=R1C5,RC[-5],INDEX(R1C[-3]:RmaxC[-3],ROW()-R1C5))

If the present row number is less than or equal to R1C5 then we take the same row from C1:C2, otherwise we take rows from C3:C4 whose numbers are suitably shifted. Note that this works only when the inputs are standard (set or bag). Therefore, if the input relations are loose, they should be brought to the standard form before taking the union.
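The row shifting used for union can be seen in a minimal Python sketch. The name `bag_union` is hypothetical, and the sketch assumes, as the text does, that both inputs are in standard form, are padded with null rows to `max_rows` rows, and that their cardinalities sum to at most `max_rows`.

```python
NULL = ""  # the empty string marks a null row


def bag_union(r, s, max_rows):
    """Bag union of two relations in standard form, each given as a list
    of exactly `max_rows` tuples padded with null rows."""
    # R1C5: number of non-null rows of the first relation.
    n_r = sum(1 for row in r if row[0] != NULL)
    out = []
    for row_no in range(1, max_rows + 1):
        if row_no <= n_r:
            out.append(r[row_no - 1])        # same row of the first relation
        else:
            out.append(s[row_no - n_r - 1])  # row of the second relation, shifted
    return out


null_row = (NULL, NULL)
r = [(1, 2), (3, 4), null_row, null_row, null_row]
s = [(5, 6), null_row, null_row, null_row, null_row]
union = bag_union(r, s, 5)
```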

4.9 DifferenceAssume that we are given two relations located in C1:C2

and C3:C4, respectively. Then use the following formulas tocalculate their set difference.

C5 < =SUMPRODUCT((R1C3:RmaxC3=RC1)*

(R1C4:RmaxC4=RC2))

This calculates in RiC5 the number of times a tuple equalto RiC1:RiC2 appears in C3:C4.

C6:C7 << =IF(RC5=0,RC[-3],"")

Now if RiC5 is 0, we copy the row RiC1:RiC2 to the output,otherwise we replace it by a null row.

The set form of the result is inherited from the inputs, but it certainly may contain null rows and is therefore loose. However, this construction does not work for the bag format, since in this case we should count the copies of identical rows in both relations and put in the output a suitable number of such rows.

The more complicated construction which does work is as follows:

C5 < =SUMPRODUCT((R1C3:RmaxC3=RC1)*

(R1C4:RmaxC4=RC2))

This, exactly as before, calculates in RiC5 the number of times a tuple equal to RiC1:RiC2 appears in C3:C4.

C6 < =SUMPRODUCT((R1C1:RC1=RC1)*(R1C2:RC2=RC2))

Now we calculate in RiC6 the number of times a tuple equal to RiC1:RiC2 appears in C1:C2 in row i or above it.

C7:C8 << =IF(RC5>=RC6,"",RC[-6])

Now we replace by null rows the first RiC5 occurrences of tuple RiC1:RiC2, and leave unaffected the remaining ones, which gives the desired bag difference. The resulting relation is loose.
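The counting argument behind the bag difference can be checked with a small Python sketch. It is an illustration only: relations are assumed to be lists of pairs, a null row is a pair of empty strings, and the name bag_difference is hypothetical.

```python
NULL_ROW = ("", "")  # assumed representation of a null row


def bag_difference(r, s):
    """Bag difference r - s; the result is in loose form."""
    out = []
    for i, row in enumerate(r):
        if row == NULL_ROW:
            out.append(NULL_ROW)
            continue
        # Column C5: copies of this tuple in the second relation
        in_s = sum(1 for t in s if t == row)
        # Column C6: copies of this tuple in r in this row or above it
        seen = sum(1 for t in r[: i + 1] if t == row)
        # Columns C7:C8: the first in_s occurrences are replaced by null rows
        out.append(NULL_ROW if in_s >= seen else row)
    return out
```

For a tuple occurring k times in the first relation and m times in the second, exactly max(k - m, 0) copies survive, which is the usual bag semantics of difference.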

4.10 Cartesian product

Assume that we are given two relations located in C1:C2 and C3:C4, respectively, and that the product of their cardinalities does not exceed max.

Then use the following formulas to calculate their Cartesian product. The construction below works only for relations in standard form, so if the inputs are loose, standardization is necessary first.

R1C5 < =COUNT(R1C1:RmaxC1)

R2C5 < =COUNT(R1C3:RmaxC3)


We calculate the numbers of non-null rows in C1:C2 and C3:C4.

C6:C7 << =IF(ROW()<=R1C5*R2C5,

INDEX(R1C[-5]:RmaxC[-5],

INT((ROW()-1)/R2C5)+1),"")

This creates R1C5 blocks, the i-th block being R2C5 copies of RiC1:RiC2.

C8:C9 << =IF(ROW()<=R1C5*R2C5,

INDEX(R1C[-5]:RmaxC[-5],

MOD(ROW()-1,R2C5)+1),"")

This repeats in circular fashion the consecutive rows of C3:C4 a total of R1C5 rounds.

Note that in this case, the set or bag form of the initial relations is inherited by their product.
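The index arithmetic of the product (integer quotient for the left factor, remainder for the right factor) can be sketched in Python as follows. This is an illustration under assumptions: both inputs are lists of pairs in standard form, a null row is a pair of empty strings, and the name cartesian_product is hypothetical.

```python
NULL_ROW = ("", "")  # assumed representation of a null row


def cartesian_product(r, s, max_rows):
    """Cartesian product of two relations in standard form."""
    n_r = sum(1 for row in r if row != NULL_ROW)   # R1C5
    n_s = sum(1 for row in s if row != NULL_ROW)   # R2C5
    out = []
    for i in range(1, max_rows + 1):               # i is the 1-based ROW()
        if i <= n_r * n_s:
            left = r[(i - 1) // n_s]               # INT((ROW()-1)/R2C5)+1
            right = s[(i - 1) % n_s]               # MOD(ROW()-1,R2C5)+1
            out.append(left + right)
        else:
            out.append(NULL_ROW + NULL_ROW)        # four empty fields
    return out
```

The left factor changes once every n_s rows while the right factor cycles, so every pair of input rows appears exactly once among the first n_r * n_s output rows.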

4.11 Grouping with aggregation

In the following, we always assume the relation to be located in C1:C3, grouping to be done over C1:C2 and aggregation over C3.

4.11.1 GROUP BY with SUM

C4 < =SUMPRODUCT((R1C1:RmaxC1=RC1)*

(R1C2:RmaxC2=RC2)*R1C3:RmaxC3)

This array formula computes in RiC4 the sum of all RjC3 over all j such that RjC1:RjC2 is equal to RiC1:RiC2.

Now we do duplicate elimination over C1:C2 and C4 and that is the desired result.
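The two steps of GROUP BY with SUM (a per-row sum over matching rows, followed by duplicate elimination) can be sketched in Python as follows. The sketch is illustrative only: rows are assumed to be triples (C1, C2, C3), a null row is a triple of empty strings, duplicate elimination is assumed to keep the first occurrence of each tuple, and the name group_by_sum is hypothetical.

```python
NULL_ROW = ("", "", "")  # assumed representation of a null row


def group_by_sum(rows):
    """GROUP BY (C1, C2) with SUM(C3); the result is in loose form."""
    # Step 1, column C4: every row sums C3 over all rows of its group
    with_sum = []
    for row in rows:
        if row == NULL_ROW:
            with_sum.append(NULL_ROW)
            continue
        total = sum(t[2] for t in rows
                    if t != NULL_ROW and t[:2] == row[:2])
        with_sum.append((row[0], row[1], total))
    # Step 2: duplicate elimination over C1:C2 and C4, keeping the first copy
    seen = set()
    out = []
    for row in with_sum:
        if row == NULL_ROW or row in seen:
            out.append(NULL_ROW)
        else:
            seen.add(row)
            out.append(row)
    return out
```

Step 1 rescans the whole relation for every row, which mirrors the quadratic cost of placing a SUMPRODUCT formula in every row of the query worksheet.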

4.11.2 GROUP BY with COUNT

C4 < =SUMPRODUCT((R1C1:RmaxC1=RC1)*

(R1C2:RmaxC2=RC2))

This is quite similar to the previous case, except that C4 computes counts of rows rather than sums.

4.11.3 GROUP BY with AVG

One has to compute GROUP BY with SUM and GROUP BY with COUNT side-by-side and return the copy of C1:C2 plus the sum column divided by the count column.

4.11.4 GROUP BY with MAX and MIN

Let us consider MAX, the other being handled symmetrically. First, the whole relation is sorted into descending order by C3. On the result, elimination of duplicates is performed, which however considers two rows identical already when they agree on C1:C2. Our implementation of this operation eliminates all occurrences of a tuple except the very first one. In this case, the one left is accompanied by the maximal value of C3, as desired.
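The sort-then-deduplicate idea for MAX can be sketched in Python as follows. This is an illustration under assumptions: rows are triples (C1, C2, C3) without null rows, the result is returned in compact form, and the name group_by_max is hypothetical.

```python
def group_by_max(rows):
    """GROUP BY (C1, C2) with MAX(C3) via sorting and duplicate elimination."""
    # Sort the whole relation into descending order by C3
    ordered = sorted(rows, key=lambda t: t[2], reverse=True)
    seen = set()
    out = []
    for c1, c2, c3 in ordered:
        # Two rows count as duplicates as soon as they agree on (C1, C2);
        # only the first occurrence, carrying the maximal C3, is kept
        if (c1, c2) not in seen:
            seen.add((c1, c2))
            out.append((c1, c2, c3))
    return out
```

Sorting into ascending order instead gives MIN by the same argument.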

4.12 Summary

At this point we have already achieved the main goal of this paper. We have demonstrated that spreadsheets can implement and execute all relational algebra queries.

5. PRACTICAL LEVEL: SQL

This part is devoted to the discussion of the implementation issues of SQL-92. It is less detailed than the previous section, and is more dependent on the particular properties of Excel.

Of the three parts of SQL (DDL, DML and DCL), the last one is irrelevant, since we construct a database for a single user.

5.1 NULL values

NULLs can be represented simply by the string NULL and handled as such. This is not difficult, but rather tedious, since all the formulas, whether implementing DDL or DML statements, must be adjusted to handle NULLs by introducing conditional IFs which test if the argument is a NULL and invoke either a special treatment of NULL or the standard formula for non-NULLs.

5.2 DDL

Let us discuss DDL, i.e., mainly CREATE TABLE statements. We adopt the option to distinguish the data table from its input area. So for each CREATE TABLE statement we create a separate data table and a separate input table. The latter is indeed a query table (see below for details), which detects tuples that do not satisfy the integrity constraints included in the DDL statement and displays a warning message for the user. The former then fetches the rows which satisfy the integrity constraints (by merely checking whether there is a warning message or not), and does standardization.

We assume the user to enter data elements adding them at the bottom. If elements are removed (simply using the DEL key), no new elements are added at their positions. Updates are performed by removing the old version of the tuple and immediately adding the new one at the bottom.

Function TYPE allows one to distinguish text, booleans and numbers, which, together with the length function LEN for strings and inequalities for numbers, allows one to enforce data type declarations. There are a few limitations to this rule, e.g., the empty string "" plays a special role in our implementation of relational algebra operators, and so does the string "NULL", which imposes a (mild) restriction on what kind of strings can be used. For the DATE type, however, one has to use formatting instead, which forces numbers to be interpreted and displayed as dates.

UNIQUE and PRIMARY KEY are enforced by the duplicate elimination operator described above, which rejects tuples which have already appeared before.

FOREIGN KEY statements are enforced by a semijoin query, which can be constructed using the already described algebra operators. It does not seem that there is an easy method to implement policies concerning the behavior of the database when one deletes a foreign key for a tuple, except the CASCADE option. This one is completely automatic: when the foreign key disappears, the tuples which reference it become illegal and disappear from the data table (because the formulas which transfer them to the data table return "" in the absence of the foreign key), even though they remain in the input table (where a warning appears).

Concerning INDEX, indexes cannot be created in the usual sense, but there is a simple method which helps in some situations in which an index does. It amounts to creating a copy of the relation sorted by the column with the INDEX. Experiments show that searching with the MATCH function is faster on sorted columns, which already speeds up queries. Furthermore, one can create a separate table with the unique values from this column, along with the numbers of their occurrences in the original sorted table. This can considerably speed up, e.g., the computation of equijoins.
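The idea of a sorted copy accompanied by a table of unique values and their counts can be illustrated by the following Python sketch. It is not the spreadsheet construction itself: the names build_index and lookup are hypothetical, and storing the start position of each value next to its count is an assumption made here for convenience (it is the running sum of the preceding counts).

```python
from bisect import bisect_left
from itertools import groupby


def build_index(rows, col):
    """Return a copy of rows sorted by column col and a value directory.

    The directory lists (value, start position, count) for every unique
    value of the indexed column in the sorted copy.
    """
    sorted_rows = sorted(rows, key=lambda t: t[col])
    directory = []
    pos = 0
    for value, group in groupby(sorted_rows, key=lambda t: t[col]):
        count = len(list(group))
        directory.append((value, pos, count))
        pos += count
    return sorted_rows, directory


def lookup(sorted_rows, directory, value):
    """All rows whose indexed column equals value, found by binary search."""
    keys = [entry[0] for entry in directory]
    k = bisect_left(keys, value)
    if k == len(keys) or keys[k] != value:
        return []
    _, start, count = directory[k]
    return sorted_rows[start:start + count]
```

In an equijoin, each tuple of one relation can then fetch its matching block from the other relation in one lookup instead of scanning the whole table.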

Example 3. Let us consider the following DDL statement:

CREATE TABLE Orders(

Id INT UNSIGNED NOT NULL PRIMARY KEY,

ModelID INT NOT NULL REFERENCES Models (ModelID),

Version SMALLINT,

ModelDescrip VARCHAR(40));

We assume the following:

• the input worksheet for the above table is OrdersInput, and the worksheet of the data table is OrdersData;

• worksheet ModelsData keeps in column C1 the primary key referenced above;

• size limits for INT and SMALLINT are M and N, respectively;

• the limit for the number of rows in data tables is max.

The following formula is placed in column C5:

=IF(AND(RC1="",RC2="",RC3="",RC4=""),"",

IF(OR(RC1="",RC2="",RC3="",RC4=""),"Invalid data",

IF(

AND(

IF(TYPE(RC1)=1,INT(RC1)=RC1,FALSE),

RC1>=0,RC1<=M,

COUNTIF(R1C1:RC1,RC1)=1,

COUNTIF(ModelsData!R1C1:RmaxC1,RC2)=1,

IF(TYPE(RC3)=1,

AND(INT(RC3)=RC3,ABS(RC3)<=N),

RC3="NULL"),

TYPE(RC4)=2,LEN(RC4)<=40),

"","Invalid data")))

The formula is a big IF, which behaves as follows (numbers in parentheses refer to the lines of the formula above): it returns an empty string when the row is a null row (1); otherwise it returns an error message if at least one (but not all) of its fields is "" (2); and otherwise it again returns "" if all of the following conditions hold, and an error message if any of them fails:

(5-6) The first column contains a number whose integer part is equal to itself, which is nonnegative and does not exceed M (note that we used IF, which has lazy evaluation in Excel, hence the function INT is never applied to non-numbers and does not give any error message).

(7) RC1 appears for the first time in its column (COUNTIF is a single-column equivalent of the SUMPRODUCT formulas we have used elsewhere).

(8) RC2 appears exactly once in the table ModelsData in the first column (assumed to contain the primary key of that relation) and in this branch of the initial IFs it is not "" (in an extremely rare case the foreign key column might contain exactly one "" value). Note that it is not necessary to verify that RC2 is a nonnegative integer in the specified range, because it is enforced in the foreign key table, so the count takes care of it.

(9-11) If RC3 is a number then it must be an integer whose absolute value does not exceed N, and if it is not a number, then it must be the string "NULL".

(12) RC4 is a text whose length is in the specified range. Function LEN accepts numbers as inputs, so there is no need to protect its uses by IF.

Then the OrdersData worksheet contains formulas

=IF(OrdersInput!RC5="Invalid data","",OrdersInput!RC)

in all its four columns.
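The logic of the validation formula and of the data table can be restated as a Python sketch. It is an illustration under assumptions, not the spreadsheet formula itself: a row is a 4-tuple, ids_above stands for the values of column C1 in the rows above the present one, model_keys stands for the first column of ModelsData, and the names check_row and data_row are hypothetical.

```python
INVALID = "Invalid data"


def is_number(x):
    # Rough analogue of TYPE(x)=1; booleans are a separate type in Excel
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def is_integer(x):
    # Analogue of the lazily guarded test INT(x)=x
    return is_number(x) and (isinstance(x, int) or x.is_integer())


def check_row(row, ids_above, model_keys, M, N):
    """Return "" for a null row or a valid row, and a warning otherwise."""
    id_, model_id, version, descr = row
    if all(f == "" for f in row):
        return ""                                   # null row
    if any(f == "" for f in row):
        return INVALID                              # partially filled row
    ok = (
        is_integer(id_) and 0 <= id_ <= M           # INT UNSIGNED NOT NULL
        and id_ not in ids_above                    # PRIMARY KEY
        and model_keys.count(model_id) == 1         # REFERENCES Models
        and (
            (is_integer(version) and abs(version) <= N)
            if is_number(version) else version == "NULL"
        )                                           # SMALLINT or NULL
        and isinstance(descr, str) and len(descr) <= 40   # VARCHAR(40)
    )
    return "" if ok else INVALID


def data_row(row, message):
    """Counterpart of the OrdersData formula: hide rows with a warning."""
    return ("", "", "", "") if message == INVALID else row
```

As in the spreadsheet, a rejected tuple stays visible in the input table together with its warning and is merely absent from the data table.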

As can be seen from the example, the actual translation of the CREATE TABLE statements can be quite complicated. The main reason is that now we have many data types, some of the functions must be prevented from being applied to arguments of the wrong type, NULLs may show up, etc.

5.3 DML

As we have already demonstrated, spreadsheets have the full power of executing relational algebra queries, i.e., all SELECT queries of SQL-92 can be evaluated. Note that Example 3 really gives an example of a (rather simple) query, too. Apart from data type verification, that query does a semijoin to check the FOREIGN KEY statement and duplicate elimination to satisfy the PRIMARY KEY declaration.

As the user communicates with the database directly by its input tables, there is no need to implement INSERT, DELETE and UPDATE statements, although individual inserts can easily be combined with the DDL declarations and executed by the compiler when creating the spreadsheet implementation of the database.

6. PERFORMANCE

Unfortunately, Excel and other spreadsheets have not been designed to serve as database engines, so we cannot expect very good performance of our implementations. Array and aggregation formulas generally perform linear scans of their arguments, and they are used in a linear number of cells. Recomputation of cells, whether invoked automatically or manually, always applies to all of them, so they produce quadratic algorithms. Of course, there are still possibilities to get some improvement (at least of the constants), by using dynamic algorithms, which compute the values in cells accessing only a few neighboring cells, or by exploiting the lazy evaluation of IF statements to prune the computation trees significantly. This area is largely unexplored, as is the whole problem of optimizing queries to be executed in a spreadsheet.

Apart from reducing the cost of operations, the other important possibility is to reduce recomputation. Namely, some of the systems (including Excel again) permit references not only to other worksheets, but to other workbooks (i.e., files), too. This gives the possibility to locate each query in a different file and open it only when it is necessary. It is then recomputed, but other queries are not. In particular, when working with data tables no queries need to be open. This kind of architecture is shown in Figure 1 at the beginning of the paper. A similar solution is to work with tables and queries located in worksheets of one workbook, but with automatic recomputation turned off. Instead, as an act of computing a query, the user manually orders recomputation only of the currently active worksheet (Shift-F9 in Excel).

Below we present a few performance tests. All of them were conducted using Excel 2003 running on one core of an Intel(R) Core(TM)2 Duo CPU at 2.40GHz, in a laptop with 2 GB RAM and Windows XP Professional SP3. We do not claim that the findings of this section carry over to other spreadsheet systems. All charts appear at the end of the paper.

6.1 Impact of optimization

We discuss only two simple optimization techniques, which do not go beyond improving particular operations, and do not consider at all the choice of better logical query plans. The first of them is avoiding computations on null rows by utilizing the lazy evaluation of IF. It is assumed to be used in all experiments below. Other experiments, whose results are not shown here, indicate that it reduces time cost significantly when there are many null rows in the tables, and does not create any significant overhead when there are only few of them.

The other possible optimization is to choose between array formulas, built-in aggregating functions and dynamic algorithms. We illustrate this on the example of the standardization operator.

The implementation described in Section 4 is as follows:

C3 < =SUMPRODUCT((R1C1:RC1<>"")*1)

C4 < =MATCH(ROW(),R1C3:RmaxC3,0)

C5:C6 << =IF(ISERROR(RC4),"",

INDEX(R1C[-4]:RmaxC[-4],RC4))

However, we have at least three other options to count the number of non-null rows above or at the level of the present row, used in C3.

The array-formula solution is to use

C3 < {=SUM((R1C1:RC1<>"")*1)}

The dynamic programming solution is

R1C3 < =IF(R1C1="",0,1)

R2C3:RmaxC3 < =IF(RC1="",R[-1]C,R[-1]C+1)

The aggregate solution is

C3 < =ROW()-COUNTIF(R1C1:RC1,"")

It is very instructive to compare their performance in query tables with max ranging between 50 and 5000, full of data in each case (other experiments indicate that the cost of this query does not depend significantly on the number of null rows in the table); see Figure 3. Remember that the complete implementation we test contains three other aggregating functions in each row, which remain there in all cases.

The results suggest that SUMPRODUCT is just an alias for an array formula, since their performances are precisely identical, at least in this context.
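The difference between rescanning a prefix in every row and reusing the value of the cell above can be made concrete with a Python sketch. It is an illustration only and says nothing about the internals of Excel: a null row is assumed to be recognized by an empty string in its first field, and the names counts_rescan, counts_running and standardize are hypothetical.

```python
def counts_rescan(col):
    """Count non-null cells at or above each row by rescanning the prefix.

    Mirrors SUMPRODUCT((R1C1:RC1<>"")*1) placed in every row: quadratic work.
    """
    return [sum(1 for v in col[: i + 1] if v != "") for i in range(len(col))]


def counts_running(col):
    """The dynamic programming variant: each row looks only at the row above.

    Mirrors IF(RC1="",R[-1]C,R[-1]C+1): linear work in total.
    """
    out = []
    prev = 0
    for v in col:
        prev = prev if v == "" else prev + 1
        out.append(prev)
    return out


def standardize(rows):
    """Move the non-null rows to the top, preserving their order."""
    if not rows:
        return []
    null_row = ("",) * len(rows[0])
    counts = counts_running([row[0] for row in rows])
    out = []
    for i in range(1, len(rows) + 1):
        try:
            j = counts.index(i)      # MATCH(ROW(), C3, 0): first exact match
            out.append(rows[j])      # INDEX(..., RC4)
        except ValueError:
            out.append(null_row)     # ISERROR branch
    return out
```

Both counting functions return the same list; only the amount of work differs, which is the kind of difference the comparison of the four solutions measures.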

6.2 Performance tests

6.2.1 Insertions

All tests assume relations in standard form, and the implementations were optimized by using IF to test whether the rows are null, to avoid high computation costs on them.

The first test was conducted on the table from Example 3. As we can see in Figure 4, a user of an average computer who is willing to wait 1 second for the result of an action can store more than 2500 tuples in a table with integrity constraints of medium complexity. The cost depends on the size of the table with the foreign keys.

6.2.2 Sorting

Sorting is more time-consuming, and the cost depends on how large the table is and how many tuples are already stored in it (we assume no integrity constraints on the table), as illustrated in Figure 5. However, the value of max does not influence the performance of the operation very significantly.

Remember that a table with max means a table filled with max rows of formulas designed for sorting up to max rows, which are recalculated no matter how many tuples are in the relation at the moment. Again, the 1 second limit is located at about 2500 tuples.

6.2.3 Average family income query

The average family income query from Figure 2 is the last example. Besides the IF optimization, the implementation also uses the fastest standardization from Section 6.1, based on dynamic programming. Still, the query proves to be more time-consuming than the previous operations. Again, the value of max does not change the cost very much.

6.3 Summary

The general observation is that all the costs are indeed O(n²), but the good news is that the constants are rather small and the running times remain reasonable for a few thousand tuples. Moreover, it seems that the main factor is the quantity of data, rather than the size of the initial table.

7. CONCLUSIONS, FURTHER RESEARCH

We have demonstrated that relational algebra can be naturally expressed in a spreadsheet, thus showing the power of the spreadsheet paradigm, which subsumes on the theoretical level the paradigm of relational databases. This can be understood as an implementation of a relational database on a completely new type of (virtual) hardware. Of course, in practice the effectiveness of this database is low.

This immediately raises a number of new questions and problems.

• Can a small database be practically implemented in a spreadsheet, yielding a really useful application? Our performance tests suggest that it might be possible for storing a few thousand tuples of data.

• Can a database project written in SQL and compiled to a spreadsheet serve (and be useful) as a rapidly created prototype of that database? The advantage of this solution is that the spreadsheet would then provide an instant, friendly user interface for experiments and demonstrations.

• Develop a methodology to optimize SQL queries executed in a spreadsheet.

• Can spreadsheets execute queries not expressible in SQL-92? In particular, can spreadsheets execute recursive queries, like those WITH ... SELECT in SQL-99, or those in Datalog? It seems that the answer is negative, but a proof of impossibility requires a formal model of spreadsheets, which does not exist so far.

• Our implementations of SQL-92 queries use uniform spreadsheets, in which all rows of a query table are identical. Could nonuniformity help expressing more queries?

• Can spreadsheets naturally implement other models of databases, like semi-structural or object-relational ones?

• What is the ultimate limit of the spreadsheet paradigm of computation?

8. ACKNOWLEDGMENTS

I would like to thank several people, however in this blind version all I can say is "Wovon man nicht sprechen darf, darüber muß man schweigen" (compare [10, Proposition 7]).

9. REFERENCES

[1] R. Abraham and M. Erwig. Type inference for spreadsheets. In PPDP '06: Proceedings of the 8th ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming, pages 73–84, New York, NY, USA, 2006. ACM.

[2] M. M. Burnett, J. W. Atwood, R. W. Djang, J. Reichwein, H. J. Gottfried, and S. Yang. Forms/3: A first-order visual language to explore the boundaries of the spreadsheet paradigm. J. Funct. Program., 11(2):155–206, 2001.

[3] E. J. Chesler, S. L. Rodriguez-Zas, J. S. Mogil, A. Darvasi, J. Usuka, A. Grupe, S. Germer, D. Aud, J. K. Belknap, R. F. Klein, M. K. Ahluwalia, R. Higuchi, and G. Peltz. In silico mapping of mouse quantitative trait loci. Science, 294(5551):2423, 2001. In Technical Comments.

[4] Science. Preparing Your Supporting Online Material. http://www.sciencemag.org/about/authors/prep/prep_online.dtl, accessed 20/10/2009.

[5] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database System Implementation. Prentice-Hall, 2000.

[6] A. Grupe, S. Germer, J. Usuka, D. Aud, J. K. Belknap, R. F. Klein, M. K. Ahluwalia, R. Higuchi, and G. Peltz. In silico mapping of complex disease-related traits in mice. Science, 292(5523):1915–1918, 2001.

[7] S. P. Jones, A. Blackwell, and M. Burnett. A user-centred approach to functions in Excel. In ICFP '03: Proceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming, pages 165–176, New York, NY, USA, 2003. ACM.

[8] Microsoft Corporation. Excel Home Page - Microsoft Office Online. http://office.microsoft.com/en-us/excel/default.aspx, accessed 20/10/2009.

[9] D. Wakeling. Spreadsheet functional programming. J. Funct. Program., 17(1):131–143, 2007.

[10] L. Wittgenstein. Logisch-philosophische Abhandlung, Tractatus logico-philosophicus. Suhrkamp, Frankfurt am Main, 1998. Kritische Edition.

[11] A. G. Yoder and D. L. Cohn. Real spreadsheets for real programmers. In H. E. Bal, editor, Proceedings of the IEEE Computer Society 1994 International Conference on Computer Languages, May 16-19, 1994, Toulouse, France, pages 20–30, 1994.


Figure 3: Costs of four solutions of standardization, for table with max between 50 and 5000

Figure 4: Cost of an insertion, for table from Example 3 with max equal to 2500, and for foreign key table with 500 (triangles) and 1000 values (squares), respectively


Figure 5: Cost of sorting, for tables with max equal 2000 (triangles) and 5000 (squares)

Figure 6: Cost of computing query from Figure 2, for tables with max equal to 1000 (diamonds), 1500 (triangles) and 2000 (squares)