Demonstrating AI-enabled SQL Queries over Relational Data ... › kdd2018 › files › deep-learning... · entries from a relational database are converted to text tokens rep-resenting

Demonstrating AI-enabled SQLQueries over Relational Datausing a Cognitive Database

Joseacute Luis NevesIBM Systems

Poughkeepsie NYjnevesusibmcom

Rajesh BordawekarIBM T J Watson Research Center

Yorktown Heights NYbordawusibmcom

ABSTRACTThis paper demonstrates key capabilities of Cognitive Database anovel AI-enabled relational database system which uses an unsu-pervised neural network model to facilitate semantic queries overrelational data The neural network model called word embeddingoperates on an unstructured view of the database and builds a vec-tor model that captures latent semantic context of database entitiesof different types The vector model is then seamlessly integratedinto the SQL infrastructure and exposed to the users via a new classof SQL-based analytics queries known as cognitive intelligence(CI) queries The cognitive capabilities enable complex queries overmulti-modal data such as semantic matching inductive reasoningqueries such as analogies and predictive queries using entities notpresent in a database We plan to demonstrate the end-to-end exe-cution flow of the cognitive database using a Spark based prototypeFurthermore we demonstrate the use of CI queries using a publi-caly available enterprise financial dataset (with text and numericvalues) A Jupyter Notebook python based implementation will alsobe presented

CCS CONCEPTSbull Information systems rarr Structured Query Language On-line analytical processing Online analytical processing engines

KEYWORDSArtificial Intelligence Relational Databases SQL Word Embedding

1 INTRODUCTIONRelational Databases store information based on a user definedschema that describes the data types keys and functional depen-dencies Knowing the schema allows someone to extract relevantinformation For example given a table with a column containingfinancial transactions described by individual amounts one caneasily calculate the total amount Likewise if there is a date asso-ciated with the transaction one can report the financial data bymonth quarter or year Database languages like SQL allow a userto make these and more complex queries However the semanticrelationships represented by the data is mostly left to the user inter-pretation as queries are executed and data is re-organized Further

Permission to make digital or hard copies of part or all of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page Copyrights for third-party components of this work must be honoredFor all other uses contact the ownerauthor(s)KDDrsquo18 Deep Learning Day August 2018 London UKcopy 2018 Copyright held by the ownerauthor(s)ACM ISBN 978-x-xxxx-xxxx-xYYMMhttpsdoiorg101145nnnnnnnnnnnnnn

traditional SQL queries rely mainly on value-based predicates to de-tect patterns For example the aggregate of all transactions withina timeframe or the sorting of transaction amounts by decreasingorder of value The meaningful relationship and interpretation be-tween the data of multiple columns is left to the user during thewriting of the SQL query Thus the traditional SQL queries lacka holistic view of the underlying relations and thus are unableto extract and exploit semantic relationships that are collectivelygenerated by tokens in a database relation

This paper discusses Cognitive Database [3 5] a novel relationaldatabase system which uses an unsupervised neural network basedapproach from Natural Language Processing called word embed-ding to extract latent knowledge from a database table The gen-erated word-embedding model captures inter- and intra-columnsemantic relationships between database tokens of different typesFor each database token the model includes a vector that encodescontextual semantic relationships The cognitive database seam-lessly integrates the model into the existing SQL query processinginfrastructure and uses it to enable a new class of SQL-based ana-lytics queries called Cognitive Intelligence (CI) queries CI queriesuse the model vectors to enable complex queries such as semanticmatching inductive reasoning queries such as analogies or seman-tic clustering and predictive queries using entities not present in adatabase In this paper we demostrate unique capabilities of Cog-nitive Databases using an use-case where SQL-based CI queriesin conjuction with traditional SQL queries are used to analyze amulti-modal relational database containing text and numeric valuesWe evaluate this use-case using a Spark-based cognitive databaseprototype

The rest of the paper is organized as follows In Section 2 we firstsummarize key design aspects of cognitive database and then dis-cuss architecture of the Spark-based prototype Section 3 describesin detail the different types of Cognitive CI queries and how theywork with UDFs and the word embedding model Section 4 outlineskey features being demonstrated data pre-processing to build theword-embedding model the word-embedding model design of theSpark-based CI queries over multi-modal data different examplesof CI queries including analysis explaining the results obtainedfor each type of CI query and Python-based interfaces for the CIqueries

2 BACKGROUND AND SYSTEMARCHITECTURE

Figure 1 presents the three key phases in the execution flow of acognitive database The first only executed when a new model iscreated or needs to be updated training phase takes place when the

KDDrsquo18 Deep Learning Day August 2018 London UK J Neves et al

(1) Optional Training Phase

VectorDomain

Domain

Relational

Domain

Relational

Tables

External Text sources

Learned Vectors

Relational

System Tables

External

Learned Vectors UDFs

CI Queries Relations

Tokenized Relations

(3) Query Execution Phase(2) Vector Storage Phase

Figure 1 End-to-end execution flow of a cognitive relationaldatabase

database is used to train the model Our training approach is char-acterized by two unique aspects (1) Usingmeaningful unstructuredtext representation of the structured relational data as input to thetraining process (ie irrespective of the associated SQL types allentries from a relational database are converted to text tokens rep-resenting them) and (2) Using the unsupervised word embeddingtechnique to generate meaning vectors from the input text corpusWe have modified the classical word embedding approach [9] toaddress various subtletiles with the relational data model (eg sup-porting primary keys) Every unique token from the input corpusis associated with a vector of dimension d In our scenario a texttoken in a training set can represent either text numeric or im-age data Thus the model builds a joint latent representation thatintegrates information across different modalities using untypeduniform feature (or meaning) vectors Alternatively this phase canalso use pre-trained vectors built from an external data repository(eg Wikipedia)

Following vector training the resultant vectors are stored in arelational system table (phase 2) At runtime the SQL query execu-tion engine uses various user-defined functions (UDFs) that fetchthe trained vectors from the system table as needed and answerCI queries (phase 3) These UDFs take typed relational values asinput and compute semantic relationships between them usinguniformly untyped meaning vectors This enables the relationaldatabase system to seamlessly analyze data of different types (egtext numeric values and images) using the same SQL CI queryOur current implementation is built on the Apache Spark 220infrastructure [2] using Scala It runs on a variety of processorsand operating systems The system implementation follows thecognitive database execution flow as presented in Figure 1 Thesystem first initializes in-memory Spark Dataframes from externaldata sources (eg relational database or CSV files) loads the associ-ated word embedding model into another Spark Dataframe (whichcan be created offline from either the database being queried orexternal knowledge bases such as Wikipedia) and then invokes theCognitive Intelligence queries using Spark SQL The SQL queriesinvoke Scala-based cognitive UDFs to enable computations on themeaning vectors A Python based implementation (Figure 2) is alsoprovided thus allowing the system to be used by both the databaseand data science communities

The distinguishing aspect of cognitive intelligence queries con-textual semantic comparison between relational variables is imple-mented using user-defined functions (UDFs) Thus the CI queriescan support both the traditional value-based as well as the new

semantic contextual computations in the same query Each CIquery uses the UDFs to measure semantic similarity between apair of sets (or sequences) of tokens associated with the inputrelational parameters The core computational operation of a cog-nitive UDF is to calculate similarity between a pair of tokens bycomputing the cosine distance between the corresponding vec-tors For two vectors v1 and v2 the cosine distance is computedas cos(v1v2) = cos(v1v2) =

v1 middotv2∥v1 ∥ ∥v2 ∥

The cosine distance valuevaries from 10 (very similar) to -10 (very dissimilar) Each CI queryuses the UDFs to execute nearest neighbor computations using thevectors from the current word-embedding model Thus CI queriesprovide approximate answers that reflect a given model

Figure 2 Spark Implementation of a cognitive relationaldatabase

Our current implementation supports four types of CI SQLqueries similarity based classification inductive reasoning pre-diction and cognitive OLAP [3] These queries can be executedover databases with multiple datatypes we currently support textnumeric and image data The similarity queries compare two re-lational variables based on similarity or dissimilarity between theinput variables Each relational variable can be either set or se-quence of tokens In case of sequences computation of the finalsimilarity value takes the ordering of tokens into account The sim-ilarity value is then used to classify and group related data Theinductive reasoning queries exploit latent semantic information inthe database to reason from part to whole or from particular to gen-eral We support five different types of inductive reasoning queriesanalogies semantic clustering analogy sequences clustered analo-gies and odd-man-out Given an item from an external data corpus(which is not present in a database) the predictive CI query canidentify items from the database that are similar or dissimilar tothe external item by using the externally trained model Finallycognitive OLAP allows SQL aggregation functions such as MAX()MIN() or AVG() over a set that is identified by contextual similaritycomputations on the relational variables

3 COGNITIVE INTELLIGENCE QUERIESThe basic UDF and its extensions are invoked by the SQL CI queriesto enable semantic operations on relational variables Each CI queryuses the UDFs to execute nearest neighbor computations using thevectors from the current word-embedding model Thus CI queriesprovide approximate answers that reflect a given model For the pur-poses of demonstrating the various CI queries a publicly available

Demonstrating AI-enabled SQLQueries over Relational Data using a Cognitive DatabaseKDDrsquo18 Deep Learning Day August 2018 London UK

dataset is used as presented in Section 4 However lets briefly de-scribe the data as it helps explain the different types of CI queries de-tailed in this section The dataset contains all the expenditure trans-actions for the state of Virginia for the year of 2016 Each transactionis characterized by several fields namely VENDOR_NAME AGENCYFUND_DETAIL OBJECTIVE SUB_PROGRAM AMOUNT and VOUCHER_DATEThrough feature engineering two more fields were added to thedataset namely QUARTER and COUNTY the purpose explained at alater time The VENDOR_NAME field names the institution the statehad an expense with This institution can be a state county or localagency a physical person a local state or national business etcThe CI queries can be broadly classified into four categories asfollows

31 Similarity QueriesThe basic UDF that compares two sets of relational variables can beintegrated into an existing SQL query to form a similarity CI queryIn a traditional SQL environment to determine similarity of trans-action expenses of a given customer against all the other customersone would have to determine the terms of comparison meaningwhich columns to compare followed by gathering statistics for allthe transactions of the given customer This process would have tobe repeated for all the other customers in the table Finally each cus-tomer would be compared against the given customer and a scorecalculated that represents similarity Note that the terms that definesimilarity would have to be defined prior to all this process mean-ing the rules amongst the columns that define similarity Also notethat the aforementioned process becomes more complex and timeconsuming as the number of features describing each transaction in-creases The alternative is a SQL Similarity CI query as illustrated inFigure 3 that identifies similar transactions to all the transactions bya given VENDOR_NAME in this case the County of Arlington Assumethat Expenses is a table that contains all transaction expenses forthe state of Virginia whose ExpensesVENDOR_NAME column con-tains each individual entity or customer the state had an expensewith To identify which transactions have a similar transaction pat-tern to a given customer one would use a SQL query with a UDFin this case proximityCust_NameUDF() that computes similarityscore between two sets of vectors that correspond to the fieldsdescribing each VENDOR_NAME

SELECT VENDOR_NAME proximityCust_NameUDF(VENDOR_NAMErsquo$aVendorrsquo) AS proximityValueFROM ExpensesWHERE proximityValue gt 05ORDER BY proximityValue DESC

Figure 3 Example of a SQL CI similarity query find similarstate customers based on expense transaction patterns

The query shown in Figure 3 uses the similarity score to se-lect rows with related Vendors and returns an ordered set of sim-ilar Vendors sorted in descending order of their similarity scoreThe similarity score is computed by calculating the cosine dis-tance between vectors one being the vector of the customer ofinterest County of Arlington and the other the vector associated

with ExpensesVENDOR_NAME for all the VENDOR_NAMEs in the tableExpenses The similarity score is sorted in descending order andthe outcome presented in a table as illustrated in Figure 4

Figure 4 Most similar vendors to County of Arlington

Section 4 presents in detail analysis techniques that show whythe similarity results for County of Arlington are actually othercounties and not some other random Vendor Note that the SQLcommand to get such results did not filter any data before or afterthe SELECT statement with respect to the 190950 unique Vendorsthat had one or more transactions with the state of Virginia

32 Dissimilar QueriesDissimilarity is a special case of similarity where the query willfirst choose rows whose Vendors have lower similarity (eg lt 03)to a given Vendor and the results ordered in an ascending formusing the SQL ASC keyword This variation returns Vendors thatare highly dissimilar to a given Vendor (ie the transaction withthe state is with completely different agencies objectives fundsandor programs) If the results are ordered in the descending orderusing the SQL DESC keyword the CI query will return Vendors thatare somewhat dissimilar to a given Vendor

If one is interested to find the counties that are most dissimilarto a given county a dissimilar SQL CI query can still be used toreduce the number of dissimilar Vendors and extract from thatsubset any row which VENDOR_NAME contains the keyword COUNTYOF However the local governments within the State of Virginia arecomposed of Counties and Independent Cities As such the filter stepcan be enhanced to include Independent Cities or a new featurecan be added to the database that correctly identifies each Vendortransaction if it is a local government transaction or not Engineer-ing this feature into the dataset is also useful when comparing thetransactionsexpenditures of the state with its local governmentswith respect to other characteristics not present in the expendi-ture database For example the correlation of money spent by thestate in K-12 and higher education with the amount of studentswithin each county and independent city that finish high schooland university

Similarity and dissimilarity queries can be customized to restricttransactions to a particular time period eg a specific quarter ora month The query would use vector additions over vectors tocompute new vectors (eg create a vector for transaction patternsof a Vendor Vendor_A in quarter Q3 by adding vectors for Vendor_Aand quarter_Q3) and use the modified vectors to find the targetcustomers

The patterns observed in these queries can be applied to otherdomains as well eg identifying patients that are taking similardrugs but with different brand names or identifying food items

SELECT XcustID similarityUDF(XItems lsquolisteriarsquo) ASsimilarityFROM sales XWHERE similarityUDF(XItems lsquolisteriarsquo) gt 03ORDER BY similarity DESC

Figure 5 Example of a prediction query find customers thathave purchased items affected by a listeria recall

with similar ingredients or recommending mutual funds with sim-ilar investment strategies As we will see in the next section thesimilarity query can be applied to other data types such as images

Another use case provides an illustration of a predictive CI querywhich uses a model that is externally trained using an unstructureddata source or another database (Figure 5) For this scenario twocompletely different datasets are used The first dataset is a salesdataset that describes customer purchasing patterns in terms ofproducts The second dataset describes the products in terms ofpotential infections andor deceases Consider a scenario of a recallof various fresh fruit types due to possible listeria infection Thisexample assumes that we have built a word embedding model usingthe recall notices as an external source Assume that recall documentlists all fruits impacted by the possible listeria infection eg ApplesPeaches Plums Nectarines The model will create vectors for allthese words and the vector for the word listeria will be closer to thevectors ofApples Peaches Plums etc Nowwe can import thismodeland use it to query the sales database to find out which customershave bought items that may be affected by this recall as defined bythe external source As Figure 5 shows the similarityUDF() UDFis used to identify those purchases that contain items similar tolisteria such as Apples This example demonstrates a very powerfulability of CI queries that enables users to query a database usinga token not present in the database (eg listeria) This capabilitycan be applied to different scenarios in which recent updatableinformation can be used to query historical data For example amodel built using FDA recall notices could be used to identify thosecustomers who have purchased medicines similar to the recalledmedicines

33 Cognitive OLAP QueriesFigure 6 presents a simple example of using semantic similarities inthe context of a traditional SQL aggregation query This CI queryaims to determine the maximum amount a State Agency paid toin the Expenses table for each Vendor that is similar to a specifiedVendor Vendor_Y The result is collated using the values of theVendor the Agency and ordered by the total expense paid Asillustrated earlier the UDF proximityCust_NameUDF defined forsimilarity queries is also used in this scenario The UDF can useeither an externally trained or locally trained model This query canbe easily adapted to support other SQL aggregation functions suchas MAX() MIN() and AVG() This query can be further extended tosupport ROLLUP operations over the aggregated values [6]

We are also exploring integration of cognitive capabilities intoadditional SQL operators eg IN and BETWEEN For example one orboth of the value ranges for the BETWEEN operator can be computedusing a similarity CI query For an IN query the associated set of

SELECT VENDOR_NAME AGY_AGENCY_NAME SUM(AMOUNT) asmax_valueFROM Expenses INNER JOIN Agencies ON ExpensesAGY_AGENCY_KEY= AgenciesAGY_AGENCY_KEY WHEREproximityCust_NameUDF(VENDOR_NAMErsquo$Vendor_Yrsquo) gt $proximityGROUP BY VENDOR_NAME AGY_AGENCY_NAME ORDER BY max_value DESC

Figure 6 Example of a cognitive OLAP (aggregation) query

choices can be generated by a similarity or inductive reasoningqueries

34 Inductive Reasoning QueriesA unique feature of word-embedding vectors is their capabilityto answer inductive reasoning queries that enable an individual toreason from part to whole or from particular to general [11 13]Solutions to inductive reasoning queries exploit latent semanticstructure in the trained model via algebraic operations on the corre-sponding vectors We encapsulate these operations in UDFs to sup-port following five types of inductive reasoning queries analogiessemantic clustering and analogy sequences clustered analogiesand odd-man-out [11] We discuss key inductive reasoning queriesbelow

bull Analogies Wikipedia defines analogy as a process of trans-ferring information or meaning from one subject to anotherA common way of expressing an analogy is to use relation-ship between a pair of entities source_1 and target_1 toreason about a possible target entity target_2 associatedwith another known source entity source_2 An exampleof an analogy query is Lawyer Client Doctor whoseanswer is Patient To solve an analogy problem of the form(X Y Q ) one needs to find a tokenW whose mean-ing vector Vw is closest to the ideal response vector VR where VR = (VQ + VY minus Vx ) [11] Recently several solu-tions have been proposed to solve this formulation of theanalogy query [7 8 10] We have implemented the 3COS-MUL approach [7] which uses both the absolute distanceand direction for identifying the vector VW as

argmaxW isinC

cos(VW VQ )cos(VW VY )

cos(VW VX ) + ϵ(1)

where ϵ = 0001 is used to avoid the denominator becoming0 Also 3COSMUL converts the cosine similarity value ofc to (c+1)

2 to ensure that the value being maximized is non-negativeFigure 7 illustrates a CI query that performs an analogy com-putation on the relational variables using the analogyUDF()This query aims to find a Vendor from the Expenditurestable (Figure 10) whose relationship to the category Q3or Third Quarter is similar to what Fairfax CountyPublic Schools has with the category Q1 or FirstQuarter (ie if Fairfax County Public Schools is themost prolific public school system in terms of expendituresduring the first quarter find such Vendors who are the mostprolific spenders in the third quarter excluding FairfaxCounty Public Schools) The analogyUDF() UDF fetches

SELECT VENDOR_NAMEAGY_AGENCY_NAME QUARTER SUM(AMOUNT) asmax_valueFROM Expenses INNER JOIN Agencies ON ExpensesAGY_AGENCY_KEY= AgenciesAGY_AGENCY_KEYWHERE analogyUDF(rsquo$aVendor1rsquo rsquo$aVendor2rsquo rsquo$aVendor3rsquoVENDOR_NAME) gt $proximity AND QUARTER == rsquo$aVendor3rsquoGROUP BY VENDOR_NAME AGY_AGENCY_NAME QUARTER ORDER BYmax_value DESC

Figure 7 Example of an analogy query

SELECT VENDOR_NAME semClustUDF(rsquo$vendorArsquorsquo$vendorBrsquorsquo$vendorCrsquoVENDOR_NAME)AS proximityValue FROM ExpensesHAVING proximityValue gt $proximityORDER BY proximityValue DESC

Figure 8 Example of a semantic clustering query

vectors for the input variables and using the 3COSMULapproach returns the analogy score between a vector corre-sponding to the input token and the computed response vec-tor Those rows whose variables (eg VENDOR_NAME) haveanalogy score greater than a specified bound (05) are se-lected To facilitate the analysis the results are filtered by thequarter of interest (Q3) and sorted in descending order by thetotal sum of expenditures and also listing the agency suchVendor(s) had the most transaction with Since analogy oper-ation is implemented using untyped vectors analogyUDF()UDF can be used to capture relationships between variablesof different types eg images and text

bull SemanticClustering Given a set of input entities X Y Z the semantic clustering process identifies a set of entitiesW that share the most dominant trait with the inputdata The semantic clustering operation has a wide set of ap-plications including customer segmentation recommenda-tion etc Figure 8 presents a CI query which uses a semanticclustering UDF semclustUDF() to identify verndors thathave the most common attributes with the input set of ven-dors eg vendorA vendorB and vendorC For solving a se-mantic clustering query of the form (X Y Z ) one needsto find a set of tokens Sw = W1W2 Wi whose mean-ing vectors Vwi are most similar to the centroid of vectorsVX VY and VZ (the centroid vectors captures the dominantfeatures of the input entities)

Another intriguing extension involves using contextual similar-ities to choose members of the schema dimension hierarchy foraggregation operations like ROLLUP or CUBE For example insteadof aggregating over all quarters for all years one can use only thosequarters that are semantically similar to a specified quarter

35 Cognitive Extensions to the RelationalData Model

There are powerful extensions to SQL that are enabled by wordvectors For this we need the ability to refer to constituent tokens

(extracted during textification) in columns of rows in whole rowsand in whole relations The extension is via a declaration in theFROM clause of the form Token e1 that states that variable e1 refersto a token To locate a token we use in the WHERE clause predicatesof the form contains(E e1) where E can be a column in a row(eg EMPAddress) a whole row (eg EMP) or a whole relation(eg EMP) With this extension we can easily express queries suchas asking for an employee whose Address contains a token whichis very close to a token in a row in the DEPT relation (Figure 9)Furthermore we can also extend SQL with relational variables sayof the form $R and column variables say X whose names are notspecified at query writing time they are bound at runtime Wecan then use these variables in queries in conjunction with Tokenvariables This enables database querying without explicit schemaknowledgewhich is useful for exploring a database Interestingly thenotation $RX is basically syntactic sugar A software translationtool can substitute for $RX an actual table name and an actualcolumn Then perform the query for each such substitution andreturn the union of the results [4]

SELECT EMPName EMPSalary DEPTNameFROM EMP DEPT Token e1 e2WHERE contains(EMPAddress e1) ANDcontains(DEPT e2) ANDcosineDistance(e1 e2) gt 075

Figure 9 Example of an SQL query with entities

Lastly one may wonder how numeric bounds on UDFs (egcosineDistance(e1 e2) gt 075 in Figure 9) are determinedThe short answer is that these bounds are application dependentmuch like hyperparameters in machine learning One learns theseby exploring the underlying database and running queries In the fu-ture we envision a tool that can guide users to select an appropriatenumeric bound for a particular CI query

4 DEMONSTRATION OVERVIEWFor demonstration purposes we will be using a Linux-based SparkScala implementation of the cognitive database system We planto demonstrate both the end-to-end features of the Spark basedCognitive database implementation as well as novel capabilitiesof CI queries using realistic enterprise-class database workloadsThe audience will be able to step through various stages of thecognitive database life cycle namely data pre-processing trainingmodel management and querying Participants will also be able tointeract with the Scala (via command-line) or Python-based (viaJupyter Notebook) interfaces of the cognitive database system andmodify the CI queries to experience their capabilities The rest ofthe section provides a glimpse of the demo by illustrating a realdatabase usecase

41 Experiencing the Cognitive DatabaseTo illustrate the demonstration flow we use a publically availableexpenditure dataset from the State of Virginia[12] It is a fairlylarge dataset (at least 53 Million records) that describes state-wide

expenditure for a year (eg 2016) listing details of every trans-action such as vendor name corresponding state agency whichgovernment fund was used etc by 190950 unique customers orVendors Note that Vendors can be individuals andor private andpublic institutions such as county agencies school districts banksbusinesses etc The data was initially organized as separate fileseach belonging to a quarter A single file was created and two otherfeatures engineered and added to the dataset mainly Quarter andCounty identifying which quarter the transaction happened andif the transaction is associated with one of the 133 counties andindependent cities in the state

AGY_AGENCY_KEYAMOUNTFNDDTL_FUND_DETAIL_KEYOBJ_OBJECT_KEYSPRG_SUB_PROGRAM_KEYVENDOR_NAMEVOUCHER_DATEQUARTERrdquo

23815246114002472711BURDEN COST09182015Q323828374914002472711BURDEN COST09182015Q32385547514002472711BURDEN COST09182015Q32389183114002472711BURDEN COST09182015Q3238113614002472711BURDEN COST09182015Q3

AGY_AGENCY_KEY_238 FNDDTL_FUND_DETAIL_KEY_1400 OBJ_OBJECT_KEY_247 SPRG_SUB_PROGRAM_KEY_2711 VENDOR_NAME_BURDEN_COST VOUCHER_DATE_09_24_2015 QUARTER_Q3 AMOUNT_0

VENDOR_NAME_BURDEN_COST 0139960 1511725 1413500 -0252524 0560768 -0748131 0094970 -0308933 0642339 0545144 -0433508 0745739 -0027321 0145872 -0414213 0949778 -1091795 -1086165 -0913118 0126721 0101270 1008395 -1058285 -0218987 0516127 -0524986 -0373136 -0479130 0903843 -0170414 0796281 0110054 -0034227 -0415399 -0214878 0329212 0934236 -0391224 0026569 0368643 -0316187 -0672016 -0021213 0692628 -0064971 -0287692 0025981 1589496 0487230 0581495 0546757 0060010 0602709 0573443 01311691801001 0571099 -0806877 -0040376 -0451411 -0347368 -0104940 -1601405 0383498 -1191644 -0194865 0212549 -1146270 0167206 -0127002 -0241348 0055484 -0085913 0187448 0491312 -1332215 0062639 -0099263 -0193136 0966303 0226783 1207567 0483473 0311355 -0283833 -0187986 0786420 -0882176 0580108 -0526009

(A)(A) (B)

Figure 10 Pre-processing the Virginia Expenditure Dataset

Figure 10(A) illustrates a portion of the CSV file that representsthe original database which contains both text and numeric valuesThe first step in the pre-processing phase is to convert the inputdatabase into a meaningful text corpus (Figure 10(B)) This textifi-cation process is implemented using Python scripts that convertsa relational row into a sentence Any original text entity is con-verted to an equivalent token eg a vendor name BURDEN_COSTis converted to a string token VENDOR_NAME_BURDEN_COST For nu-meric values eg amount of 152461 the preprocessing stage firstclusters the values using the K-Means algorithm and then replacesthe numeric value by a string that represents the associated cluster(eg AMOUNT_0 for value 152461) The resultant text document isthen used as input to build the word embedding model The wordembedding model generates a d dimensional feature (meaning) vec-tor for each unique string token in the vocabulary For exampleFigure 10(C) presents a vector of dimension 300 for the string tokenVENDOR_NAME_BURDEN_COST that corresponds to the value BURDENCOST in the original database Our demonstration will go over vari-ous pre-processing stages in detail to explain the text conversionand model building scripts

Once the model is built the user program can load the vectorsand use them in the SQL queries Figure 3 presents an exampleof a SQL CI similarity query The goal of the query is to iden-tify vendors that have overall similar transactional behavior toan input vendor over the entire dataset (ie transacted with thesame agencies with similar amounts etc) The SQL query uses anUDF proximityCust_NameUDF() which first converts the inputrelational variables into corresponding text tokens fetches the

corresponding vectors from the loaded model and computes asimilarity value by performing nearest neighbor computations onthose vectors Figure 11(A) presents the results of this query forthe vendor COUNTY OF ARLINGTON as already shown in Figure 4In addition Figure 11(B) explains why the Vendors listed are similarbased on their transaction history Since the similarity is basedon the commonality of data the table details how common eachVendor is to the County of Arlington For example the Countyof Giles had a total of 78 transactions with the state They dealtwith three state agencies of which two are the same agencies theCounty of Arlington dealt with Of the 78 transactions 74 werewith common agencies for both vendors Similar analysis is per-formed for the other fields describing a transaction (eg FundsObjectives Programs etc) For simplicity not all the fields charac-terizing a transaction are included in the table Missing are the dateof transaction which quarter it happened and if it is a county typetransaction or not

SELECT VENDOR_NAME proximityCust_NameUDF(VENDOR_NAMErsquo$aVendorrsquo) AS proximityValue FROM ExpensesWHERE proximityValue gt 05ORDER BY proximityValue DESC

Figure 11 Most similar vendors to COUNTY OF ARLING-TON

As described in Section 32 a similarity CI query can be eas-ily modified to implement a dissimilarity query by changing thecomparison term and ordering in ascending order Since the orig-inal dataset contains all types of transactions and Vendors thisquery would not be very useful For example a single transactionperformed by an individual would be very different from the 87transactions of the County of Arlington This is expected andwould not provide any useful knowledge However if the dissimilarresults are filtered down to focus on a particular group of trans-actions insight can be obtained that leads to other analysis Theoutcome of such query is shown in Figure 13

For the OLAP and Analogy examples we use another set of Ven-dors from the Expenditures dataset In this case we are lookingat the money spent by public school districts per quarter to un-derstand how much money the districts spend for a given quarterwhen it is compared to a given district Such district is the FairfaxCounty Public Schools since it is the largest district with 2x to 3xmore students than the other closest school districts as illustrated

val result2_df = sparksql(sSELECT VENDOR_NAMEproximityCust_NameUDF(VENDOR_NAMErsquo$aVendorrsquo)AS proximityValue FROM ExpensesHAVING (proximityValue lt $proximity AND proximityValue gt -1)ORDER BY proximityValue ASC)result2_dffilter(result2_df(VENDOR_NAME)contains(COUNTYOF))show(100false)

Figure 12 Example of a CI dissimilar query

Figure 13 Most dissimilar vendors to COUNTY OF ARLING-TON

in Figure 14 where the table is ordered in descending order bypopulation and it includes the K-12 student population

Figure 14 Largest Counties in the State of Virginia and cor-responding K-12 student population

Figure 6 presents a simple cognitive OLAP query that identifiesVendors and State Agencies that are similar to a given Vendor inthis case the Fairfax County Public Schools The query iden-tifies the most common Agency amongst all the Vendors that aresimilar to the target Vendor adds up the transaction value associatedwith each Vendor and presents the results in descending order asshown in Figure 15 The first observation is the same observed withsimilarity CI queries where the transactions of the different Vendorshave many fields and values in common The second observationis that the query automatically picks other school districts indicat-ing that these school districts work with the same agencies fundsobjectives and programs as Fairfax County Public Schools forthe most part The third observation is that the sorted order of the

CI query result is very similar to the order the counties are sortedfrom a population count standpoint as shown in Figure 14 It doesnot follow the sorted order from a school population standpoint

Figure 15 OLAP CI query results for Fairfax County PublicSchools

The immediate benefit of this query is that when combined withdata like the one described in Figure 14 it gives an idea how muchthe Department of Education is spending on each student percounty If there are other State Agencies responsible for handlingK-12 education expenses one can get a very good picture of thetotal amount per county and per student within the county Withexternal data such as successful graduation rate and cost of oper-ation (Buildings Materials Teachers Special Education etc) theState is better positioned to determine if the money allocated percounty is producing the desired results Note that the gathering ofinformation per county can be obtained directly with multiple filterand aggregation SQL queries particularly after the County andIndependent City identifier has been engineered into the Expen-diture dataset However the SQL CI query significantly simplifiesthe access of such data and without requiring that extra features areadded to the dataset One can easily modify this query and performthe same type of analysis to gather the amount spent per county fora given program or a set of programs The outcome of such queryis shown in Figure 16 and it shows that amongst the top schooldistricts the money spend in K-12 education comes from the sameprogram Standards of Quality for Public Education(SOQ) directlyaddressing the article in the State Constitution that requires theBoard of Education to prescribe standards of quality for the publicschools It is worth mentioning that the first 100 entries obtainedwith the OLAP CI query are all but one directly related to the publicschools and the money invested in education Furthermore theyshow that the money comes from the SOQ program or from federalassistance programs designed to help local education The entriesalso show that the budgets for public education are not necessar-ily paid directly to the school districts In some cases the countyor the county treasurer are involved in the administration of thefunds even though they are being used for education The semantic

relationships between the transactions contain such informationas obtained by the query

SELECT VENDOR_NAME SPRG_SUB_PROGRAM_NAME SUM(AMOUNT) asmax_valueFROM Expenses INNER JOIN Programs ONExpensesSPRG_SUB_PROGRAM_KEY =ProgramsSPRG_SUB_PROGRAM_KEY WHEREproximityCust_NameUDF(VENDOR_NAMErsquo$Vendor_Yrsquo) gt $proximityGROUP BY VENDOR_NAME SPRG_SUB_PROGRAM_NAME ORDER BYmax_value DESC

Figure 16 OLAP CI query with focus on Programs insteadof Agencies

To demonstrate an analogy query we use the CI query in Fig-ure 7 As described we are looking for Vendors that are similar tothe transactions of Fairfax County Public Schools in a givenquarter in terms of transactions with a State Agency and anotherquarter In a normal SQL query one would collect all the transac-tions the Vendor had with any State Agency in a given quarterRepeat the process for other quarters Afterwards one would com-pare both sets of data to determine the Vendors and State agenciesthat showed a similar behavior for a given quarter collect all thetransactions associated with the Vendor and State Agency and listthe results in descending order by by the aggregated amount valueConversely one can use a single analogy CI query and get a similarlist without the extensive comparisons Once again by combiningthe vectors of vendors and quarters the CI query captures vendorsand state agencies describing transactions in the area of countypublic education see Figure 17 This demonstrates that applyingEquation 1 still results in a vector that contains enough embeddedinformation to extract Vendors and State Agencies Without anyadditional filtering the resulting list shows other public school sys-tems dealing with similar State Agencies Like the previous examplethis CI query can be easily modified to analyze another field forexample a State Program

42 Using Python InterfacesIn this section we describe the Python implementation of CognitiveDatabases using two different approaches for creating CI queriesOne approach uses Pandas a library used for data analysis andmanipulation and sqlite3 a module that provides access to thelightweight database SQLite The other approach uses PySpark theSpark Python API in a case where big data processing is requiredIn both cases we will use Jupyter Notebook [1] a web-based appli-cation for executing and sharing code to implement the CI queriesfor interacting with the Cognitive Database During demonstration

Figure 17 Example of Analogy CI query comparing FairfaxCounty in Q1 with Vendors in Q3

the audience will be able to interact with the Jupyter notebook andmodify the CI queries

Pandas provides a rich API for data analysis and manipulationWe use Pandas in conjunction with sqlite3 to implement the CIqueries Similarly as in the Scala approach the Cognitive Databaseis initialized with the passed in model vectors and data files the userintends to use for analysis During the initialization process thedata is converted from a Pandas dataframe to a SQLite in-memorydatabase Then sqlite3rsquos create_function method is used to createthe cognitive UDFs From the userrsquos perspective using pandas SQLmethods and the internal Cognitive Database connection theycan perform CI Queries to expose powerful semantic relationships(Figure 18)

Figure 18 Example CI Queries in Jupyter Notebook usingPandas and sqlite3

The PySpark approach is useful when the user needs to performbig data processing In the PySpark approach we call Scala UDFsfrom PySpark by packaging and building the Scala UDF code intoa jar file and specifying the jar in the spark-defaultsconf config-uration file Using the SparkContextrsquos internal JVM we are ableto access and register Scala UDFs and thus the user can use themin CI queries within Python We chose this approach instead ofthe Python APIrsquos support for UDFs because of the overhead whenserializing the data between the Python interpreter and the JVMWhen creating UDFs directly in Scala it is easily accessible by theexecutor JVM and wonrsquot be a big performance hit The CI queries

in Python using PySpark look similarly as they do in the Scalaimplementation (Figure 19)

Figure 19 Example CI Queries in Jupyter Notebook usingPySpark

43 SummaryWe plan to demonstrate capabilities of the Spark-based CognitiveDatabase using both Scala and Python interfaces We will be demon-strating various types of CI queries over enterprise-class multi-modal databases such as the State of Virginia Expenditure dataThe demonstration will allow the participants to interact with thepre-processing tools as well as with different CI queries

REFERENCES[1] 2017 Jupyter Notebook Open-source web application for creating and sharing

documents (2017) httpjupyterorg[2] Apache Foundation 2017 Apache Spark A fast and general engine for large-scale

data processing httpsparkapacheorg (2017) Release 22[3] Rajesh Bordawekar Bortik Bandyopadhyay and Oded Shmueli 2017 Cog-

nitive Database A Step towards Endowing Relational Databases with Artifi-cial Intelligence Capabilities CoRR abs171207199 (December 2017) httparxivorgabs171207199

[4] Rajesh Bordawekar and Oded Shmueli 2016 Enabling Cognitive IntelligenceQueries in Relational Databases using Low-dimensional Word Embeddings CoRRabs160307185 (March 2016) httparxivorgabs160307185

[5] Rajesh Bordawekar and Oded Shmueli 2017 Using Word Embedding to EnableSemantic Queries in Relational Databases In Proceedings of the 1st Workshop onData Management for End-to-End Machine Learning (DEEMrsquo17) ACM New YorkNY USA Article 5 4 pages

[6] Ching-Tien Ho Rakesh Agrawal Nimrod Megiddo and Ramakrishnan Srikant1997 Range Queries in OLAP Data Cubes In Proceedings of the 1997 ACMSIGMOD International Conference on Management of Data 73ndash88

[7] Omer Levy and Yoav Goldberg 2014 Linguistic Regularities in Sparse andExplicit Word Representations In Proceedings of the Eighteenth Conference onComputational Natural Language Learning CoNLL 2014 171ndash180 httpaclweborganthologyWW14W14-1618pdf

[8] Tal Linzen 2016 Issues in evaluating semantic spaces using word analogiesarXiv preprint arXiv160607736 (2016)

[9] Tomas Mikolov 2013 word2vec Tool for computing continuous distributedrepresentations of words (2013) githubcomtmikolovword2vec

[10] Tomas Mikolov Ilya Sutskever Kai Chen Gregory S Corrado and Jef-frey Dean 2013 Distributed Representations of Words and Phrasesand their Compositionality In 27th Annual Conference on Neural Infor-mation Processing Systems 2013 3111ndash3119 httppapersnipsccpaper5021-distributed-representations-of-words-and-phrases-and-their-compositionality

[11] David E Rumelhart and Adele A Abrahamson 1973 A model for analogicalreasoning Cognitive Psychology 5 1 (1973) 1 ndash 28 DOIhttpsdoiorg1010160010-0285(73)90023-6

[12] State of Virginia 2016 State of Virginia 2016 Expenditureshttpwwwdatapointapavirginiagov (2016)

[13] Robert J Sternberg and Michael K Gardner 1979 Unities in Inductive ReasoningTechnical Report Technical rept no 18 1 Jul-30 Sep 79 Yale University httpwwwdticmildocscitationsADA079701

Abstract

1 Introduction

2 Background and System Architecture

3 Cognitive Intelligence Queries

31 Similarity Queries

32 Dissimilar Queries

33 Cognitive OLAP Queries

34 Inductive Reasoning Queries

35 Cognitive Extensions to the Relational Data Model

4 Demonstration Overview

41 Experiencing the Cognitive Database

42 Using Python Interfaces

43 Summary

References

(1) Optional Training Phase

VectorDomain

Domain

Relational

Domain

Relational

Tables

External Text sources

Learned Vectors

Relational

System Tables

External

Learned Vectors UDFs

CI Queries Relations

Tokenized Relations

(3) Query Execution Phase(2) Vector Storage Phase

Figure 1 End-to-end execution flow of a cognitive relationaldatabase

database is used to train the model Our training approach is char-acterized by two unique aspects (1) Usingmeaningful unstructuredtext representation of the structured relational data as input to thetraining process (ie irrespective of the associated SQL types allentries from a relational database are converted to text tokens rep-resenting them) and (2) Using the unsupervised word embeddingtechnique to generate meaning vectors from the input text corpusWe have modified the classical word embedding approach [9] toaddress various subtletiles with the relational data model (eg sup-porting primary keys) Every unique token from the input corpusis associated with a vector of dimension d In our scenario a texttoken in a training set can represent either text numeric or im-age data Thus the model builds a joint latent representation thatintegrates information across different modalities using untypeduniform feature (or meaning) vectors Alternatively this phase canalso use pre-trained vectors built from an external data repository(eg Wikipedia)

Following vector training the resultant vectors are stored in arelational system table (phase 2) At runtime the SQL query execu-tion engine uses various user-defined functions (UDFs) that fetchthe trained vectors from the system table as needed and answerCI queries (phase 3) These UDFs take typed relational values asinput and compute semantic relationships between them usinguniformly untyped meaning vectors This enables the relationaldatabase system to seamlessly analyze data of different types (egtext numeric values and images) using the same SQL CI queryOur current implementation is built on the Apache Spark 220infrastructure [2] using Scala It runs on a variety of processorsand operating systems The system implementation follows thecognitive database execution flow as presented in Figure 1 Thesystem first initializes in-memory Spark Dataframes from externaldata sources (eg relational database or CSV files) loads the associ-ated word embedding model into another Spark Dataframe (whichcan be created offline from either the database being queried orexternal knowledge bases such as Wikipedia) and then invokes theCognitive Intelligence queries using Spark SQL The SQL queriesinvoke Scala-based cognitive UDFs to enable computations on themeaning vectors A Python based implementation (Figure 2) is alsoprovided thus allowing the system to be used by both the databaseand data science communities

The distinguishing aspect of cognitive intelligence queries con-textual semantic comparison between relational variables is imple-mented using user-defined functions (UDFs) Thus the CI queriescan support both the traditional value-based as well as the new

semantic contextual computations in the same query Each CIquery uses the UDFs to measure semantic similarity between apair of sets (or sequences) of tokens associated with the inputrelational parameters The core computational operation of a cog-nitive UDF is to calculate similarity between a pair of tokens bycomputing the cosine distance between the corresponding vec-tors For two vectors v1 and v2 the cosine distance is computedas cos(v1v2) = cos(v1v2) =

v1 middotv2∥v1 ∥ ∥v2 ∥

The cosine distance valuevaries from 10 (very similar) to -10 (very dissimilar) Each CI queryuses the UDFs to execute nearest neighbor computations using thevectors from the current word-embedding model Thus CI queriesprovide approximate answers that reflect a given model

Figure 2 Spark Implementation of a cognitive relationaldatabase

Our current implementation supports four types of CI SQLqueries similarity based classification inductive reasoning pre-diction and cognitive OLAP [3] These queries can be executedover databases with multiple datatypes we currently support textnumeric and image data The similarity queries compare two re-lational variables based on similarity or dissimilarity between theinput variables Each relational variable can be either set or se-quence of tokens In case of sequences computation of the finalsimilarity value takes the ordering of tokens into account The sim-ilarity value is then used to classify and group related data Theinductive reasoning queries exploit latent semantic information inthe database to reason from part to whole or from particular to gen-eral We support five different types of inductive reasoning queriesanalogies semantic clustering analogy sequences clustered analo-gies and odd-man-out Given an item from an external data corpus(which is not present in a database) the predictive CI query canidentify items from the database that are similar or dissimilar tothe external item by using the externally trained model Finallycognitive OLAP allows SQL aggregation functions such as MAX()MIN() or AVG() over a set that is identified by contextual similaritycomputations on the relational variables

3 COGNITIVE INTELLIGENCE QUERIESThe basic UDF and its extensions are invoked by the SQL CI queriesto enable semantic operations on relational variables Each CI queryuses the UDFs to execute nearest neighbor computations using thevectors from the current word-embedding model Thus CI queriesprovide approximate answers that reflect a given model For the pur-poses of demonstrating the various CI queries a publicly available

argmaxW isinC

cos(VW VX ) + ϵ(1)

(A)(A) (B)

Abstract

1 Introduction