
AVATAR: Using text analytics to bridge the structured–unstructured divide

Huaiyu Zhu, Sriram Raghavan, Shivakumar Vaithyanathan, Jayram S. Thathachar, Rajasekar Krishnamurthy, Prasad Deshpande, Rahul Gupta, Krishna P. Chitrapura

IBM Almaden Research Center
650 Harry Road

San Jose, CA 95120, USA

IBM India Research Lab
Block I, Indian Institute of Technology

Hauz Khas, New Delhi - 110016, INDIA

Abstract

There is a growing need in enterprise applications to query and analyze seamlessly across structured and unstructured data. We propose an information system in which text analytics bridges the structured–unstructured divide. Annotations extracted by text analytic engines, with associated uncertainty, are automatically ingested into a structured data store. We propose an interface that is capable of supporting rich queries over this hybrid data. Uncertainty associated with the extracted information is addressed by building statistical models. We show that different classes of statistical models can be built to address issues such as ranking and OLAP-style reporting. We are currently building a prototype system called AVATAR that utilizes an existing commercial relational DBMS as the underlying storage engine. We present the architecture of AVATAR and identify several research challenges arising out of our prototyping effort.

1 Introduction

While traditional enterprise applications such as HR, payroll, etc., operate primarily off structured (relationally mapped) data, there is a growing class of enterprise applications in the areas of customer relationship management, marketing, collaboration, and e-mail that can benefit enormously from information present in unstructured (text) data. Consequently, the need for enterprise-class infrastructure to support integrated queries over structured and unstructured data has never been greater. To motivate the need for such an infrastructure, let us consider the following example.

1.1 Auto manufacturer CRM application

Consider the scenario of an auto manufacturer employing an enterprise customer relationship management (CRM) application to track and manage service requests across

Figure 1: CRM Application: A table showing customer service reports

its worldwide dealer operations. Using the CRM application, the individual dealers file “customer service reports”. Each report includes structured attributes such as “day”, “customer ID”, “make”, “model”, “dealer name”, “vehicle identification number (VIN)”, etc. In addition, there is a “comments” field where the service associate in charge of handling a service request can record additional information about the precise nature of the problem and how the issue was addressed. Figure 1 shows a simplified version of a “service reports” table and also highlights the text associated with one of the reports.

Using standard relational query interfaces, service reports can be queried based on the available structured attributes. Similarly, once an appropriate text-index has been constructed, reports can also be retrieved efficiently by running keyword queries over the “comments” field. However, neither of these approaches can support the following two queries:

Query 1. Return all the organization names starting with “Fire” that occur in service reports filed within the last 6 months related to brake problems concerning Buick vehicles.

Figure 2: Bridging the structured–unstructured divide

Query 2. What is the likelihood of brake problems in New York for Chevy vehicles whose service records contain the name “Kevin Jackson”?

Even though all of the information required to answer these queries is likely available in the table shown in Figure 1, neither of these queries can be meaningfully answered without combining information stored in the structured attributes with semantic information embedded in text.

1.2 Proposed solution

In this paper, we propose the use of text analytics as a mechanism to bridge this structured–unstructured divide. Text analytics is concerned with the identification and extraction of structured information from text. Recent advances in text analytics (see Section 1.3) have produced sophisticated techniques for analyzing free-form text to extract precisely the type of information required to answer the queries described earlier. However, for these techniques to be deployed in large enterprise-scale applications, there is a need for the data management community to play a significant role.

To this end, we envision an information system in which text analytics is used to extract structured information from unstructured text, the extracted information is automatically ingested into a structured data store, and an integrated query interface supports queries over existing structured data and the “new” extracted data. Figure 2 describes our vision through a simple schematic diagram. We assume that there is an existing information system that contains a set of text documents Dt, along with associated structured information Ds. At users’ discretion, one or more text analysis engines (TAEs) are used to extract structured information from the contents of Dt.

Each of these TAEs produces structured objects, called annotations. The set of all these annotations constitutes De. There may be additional structured information Dr that is somehow related to the extracted information (e.g., if De includes names of persons extracted from customer complaint documents, Dr can be the company’s customer database). Our goal is to build a new information system that can seamlessly support queries over Ds, De, and Dr.

Figure 3: An example showing four annotations on a piece of text (see Section 5.1 for an explanation of the probability column)

Revisiting the CRM Example

Let us now revisit the CRM example presented earlier to understand how text analytics can be applied in this context. Using the terminology introduced above, Dt represents the set of “comments” filed by all dealer locations. Ds would correspond to attributes such as “day”, “model”, etc., that are associated with each report.

To extract information from the “comments”, suppose the following four text analysis engines (TAEs) are executed: (i) a named-entity PERSON TAE trained to identify person names, (ii) a named-entity ORGANIZATION TAE trained to identify organizations, (iii) a MILEAGE TAE that identifies mileage, and (iv) a topic TAE trained to classify customer service reports based on the type of problem (“brake”, “transmission”, “engine”, “suspension”, etc.). Figure 3 shows the same service request as in Figure 1 with the corresponding set of annotations generated by the four TAEs (the probability column will be explained in Section 5.1). De corresponds to the set of all such annotations produced by the four TAEs for the entire comments column. Information related to these annotations present in the customer database, dealer database, service manual database, etc., constitutes Dr. Our goal is to build a system in which the information in Ds, De, and Dr can be seamlessly queried and analyzed using the types of queries (Query 1 and Query 2) described earlier.
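To make the shape of such a combined query concrete, the sketch below evaluates a Query-1-style request over in-memory stand-ins for Ds (structured report attributes) and De (extracted annotations). All report contents, annotation values, and dates are illustrative, not taken from the paper's data.

```python
# Hypothetical sketch: querying structured attributes (Ds) together with
# extracted annotations (De). All values below are illustrative.
from datetime import date

# Ds: structured attributes of service reports
reports = [
    {"report_id": 1, "day": date(2004, 3, 10), "make": "Buick"},
    {"report_id": 2, "day": date(2004, 4, 2), "make": "Chevy"},
]

# De: annotations extracted from the "comments" field by the TAEs
annotations = [
    {"report_id": 1, "type": "ORGANIZATION", "text": "Firestone"},
    {"report_id": 1, "type": "TOPIC", "text": "brake"},
    {"report_id": 2, "type": "PERSON", "text": "Kevin Jackson"},
]

def query1(reports, annotations, since):
    """Organization names starting with 'Fire' in Buick brake reports filed since `since`."""
    buick_ids = {r["report_id"] for r in reports
                 if r["make"] == "Buick" and r["day"] >= since}
    brake_ids = {a["report_id"] for a in annotations
                 if a["type"] == "TOPIC" and a["text"] == "brake"}
    return [a["text"] for a in annotations
            if a["report_id"] in buick_ids & brake_ids
            and a["type"] == "ORGANIZATION" and a["text"].startswith("Fire")]

print(query1(reports, annotations, date(2004, 1, 1)))  # ['Firestone']
```

Neither predicate alone suffices: the "make" filter lives in Ds, while the organization and topic filters live in De, which is exactly the combination that a plain relational or keyword interface cannot express.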

The rest of this section is organized as follows. In Section 1.3, we provide some background on the state of the art in text analytics, to substantiate our claim that the time is ripe to exploit those advances in large-scale applications. In Section 1.4, we distinguish our approach to structured–unstructured integration from earlier efforts in that direction.

1.3 State-of-the-art in text analytics

Text analytics is a mature area of research, concerned with the problem of automatically analyzing text to extract information. Involving researchers ranging from natural language processing (NLP) to machine learning, this area has


seen tremendous growth in recent years. In particular, algorithms have been developed to perform tasks such as entity identification (e.g., identifying persons, locations, organizations, etc.) [10], relationship detection (e.g., person X works in company Y) [34, 44], and co-reference resolution (identifying different variants of the same entity either in the same document or different documents) [32, 36]. The popular Message Understanding Conference (MUC) is aimed at evaluating the performance of algorithms for entity detection and co-reference resolution. The best performers for entity detection were typically in the 90% precision and recall range [31].

Besides specific information extraction tasks, such as those described above, there is also significant value in classifying entire documents or large portions of a document into topic(s). NIST holds a yearly Text REtrieval Conference (TREC) that evaluates the performance of text classification algorithms. Besides TREC, papers describing text classification algorithms appear regularly in several major conferences in information retrieval, machine learning, etc. [28, 45]. Furthermore, in recognition of its significant practical potential, there are regular workshops on Operational Text Classification Systems [38]. A practical drawback of such classifiers is that they often require large amounts of labeled training data. In large, dynamic environments, obtaining large amounts of labeled data might be difficult. This bottleneck has motivated recent research in text classification to concentrate on learning from a small number of labeled examples [37].

A more recent trend in NLP is detecting affect in documents, and AAAI recently held a workshop to discuss the current state of the art [1]. In particular, automatic detection of opinions or sentiment (whether positive or negative) is of importance in the context of business intelligence [39]. Relevant customer reactions to a new product and new product features are available in call-center records. Extracting this information can provide valuable feedback for deciding new product features, product discontinuance, etc. Consequently, results of 88% in detecting sentiment polarity from product reviews [12] are very encouraging and relevant in the context of this paper.

Despite impressive results, bar a few niche applications, advanced text analytics has not yet made a mark on mainstream enterprise software applications. We believe that it is the integration of text analytics with data management that will address this situation.

1.4 The structure–precision plane

To position our approach in the context of other efforts in structured–unstructured integration, it is instructive to consider the structure–precision plane shown in Figure 4. As indicated in the figure, we view different classes of information systems as points in this plane. The horizontal axis of this plane represents a continuum from completely structured data (e.g., tables in a relational database) to completely unstructured data (e.g., text documents without any structured fields). The vertical axis is a continuum representing different levels of query precision. The bottom-left and top-right corners of this plane are occupied by traditional structured database and information retrieval systems respectively. A variety of other systems, both full-fledged products and research prototypes, can be slotted at various points in this plane. For instance, systems that incorporate unstructured text into the relational processing framework, using standards such as SQL/MM, can be placed in the (unstructured, precise) portion of the plane. Similarly, research prototypes such as EKSO, BANKS, DataSpot, DBXplorer, etc., [2, 27] that attempt to provide keyword or “fuzzy” queries over structured databases, are in the (structured, imprecise) part of the plane.

Figure 4: Structure–Precision plane

As illustrated by the arrows in the figure, any attempt to move down along the precision axis by converting imprecise queries to precise queries results in query uncertainty. Similarly, attempts to move from right to left along the structure axis by converting unstructured data to structured data result in data uncertainty. The presence of either data or query uncertainty (or both) leads to result uncertainty. In other words, precise queries over uncertain data as well as imprecise queries over certain data produce uncertain results.

While the need for unifying the separate disciplines of information retrieval (IR) and databases (DB) was recognized several years ago and engendered several research efforts in that direction (see Section 9), the vision of a fully integrated “IR–DB” system is yet to be realized. These earlier efforts can be viewed as attempts to develop conceptual models that simultaneously handle both kinds of uncertainty.

In our approach, by restricting ourselves to movement along the structure axis, we decompose these two uncertainties and focus only on data uncertainty. We further argue that such an approach is ideally suited to meet the demands of enterprises, where text is usually associated with a significant amount of well-organized structured information (as in our CRM example). In such environments, the ability to seamlessly query information extracted from text


Figure 5: Complex annotation types (a diagram showing the PersonName type, with attributes name, begin, end, prob, and doc, and the Topic type, with attributes topic, prob, and doc; both annotation types belong to De and reference a ServiceRequest document)

in conjunction with an existing body of structured information is significantly more useful than supporting an IR-style imprecise query model.

At IBM Research, we are building AVATAR, a prototype system based on this approach. AVATAR needs to interface with a system for processing, instantiating, composing, executing, and capturing the output of the individual TAEs. The NLP community has developed several software architectures, such as GATE [11, 7], ATLAS [6, 29], and UIMA [17], for this purpose. In AVATAR, we have implemented an interface to IBM’s UIMA architecture, which allows us to leverage the large base of UIMA-compliant text analytic engines. The precise details of the interface are not relevant to this paper.

In the rest of this paper, we provide an overview of the architecture and design of AVATAR and enumerate several fundamental research questions arising out of our prototyping effort.

2 Challenges

Since the annotations produced by TAEs are structured, it might appear straightforward to simply store and query the annotations (De) using a structured data store. However, there are several characteristics of De that preclude such a straightforward approach. We enumerate those characteristics below:

Complex types. Annotations produced by advanced TAEs are not merely simple atomic values, such as strings or integers, but often have a complex internal structure. However, this structure is the same for all annotations produced by a given TAE. Figure 5 shows the structure of two annotation types: the PersonName type produced by a named-entity PERSON TAE and a Topic type produced by a topic TAE. The underlined strings are attribute names and the type of a particular attribute is indicated below the name of the attribute.
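The two annotation types of Figure 5 can be sketched as Python dataclasses. Field names follow the figure; modeling the ServiceRequest document reference as a plain string id, and the sample values, are simplifying assumptions made here for illustration.

```python
# Sketch of the two annotation types in Figure 5. The doc field stands in
# for a reference to the originating ServiceRequest document.
from dataclasses import dataclass

@dataclass
class PersonName:
    name: str      # the extracted person name
    begin: int     # character offset where the mention starts
    end: int       # character offset where the mention ends
    prob: float    # TAE-reported extraction probability
    doc: str       # reference to the originating ServiceRequest

@dataclass
class Topic:
    topic: str     # e.g., "brake", "transmission"
    prob: float    # TAE-reported classification probability
    doc: str

# Illustrative values only
p = PersonName(name="Kevin Jackson", begin=12, end=25, prob=0.95, doc="sr-1")
t = Topic(topic="brake", prob=0.7, doc="sr-1")
print(p.name, t.topic)
```

The point of the complex-types characteristic is visible here: every PersonName shares one fixed internal structure, but that structure differs from Topic's, so no single flat record layout fits both.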

Inheritance. Software frameworks for natural language processing (such as the UIMA framework described earlier) typically require TAEs to fit their annotation types within a type hierarchy. Such a hierarchy allows one TAE to specialize and add to the information extracted by another. Figure 6 shows an example of a set of annotation types arranged into a hierarchy. There is a root type Annotation that contains attributes common to all annotations. There is an annotation type Person that represents all annotated person names. A subtype SalutationPerson represents annotation objects produced by TAEs that recognize person salutations (Mr., Mrs., Hon., etc.), and this subtype is further specialized to represent annotations of government officials (e.g., “Rep. Senator Paul Smith”) and military officials (e.g., “Lt. Colonel John Brown”).
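A minimal sketch of the Figure 6 hierarchy using Python classes; querying for a supertype then reduces to an isinstance() check that also retrieves all subtypes. Class names follow the figure; the offsets and document ids are illustrative.

```python
# Sketch of the annotation type hierarchy of Figure 6.
class Annotation:
    def __init__(self, begin, end, doc):
        self.begin, self.end, self.doc = begin, end, doc

class Person(Annotation):
    pass

class SalutationPerson(Person):              # "Mr.", "Mrs.", "Hon.", ...
    pass

class GovernmentOfficial(SalutationPerson):  # e.g., "Rep. Senator Paul Smith"
    pass

class MilitaryOfficial(SalutationPerson):    # e.g., "Lt. Colonel John Brown"
    pass

store = [Person(0, 10, "d1"), MilitaryOfficial(20, 37, "d2")]

# Querying for Person retrieves instances of all of its subtypes as well.
persons = [a for a in store if isinstance(a, Person)]
print(len(persons))  # 2
```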

Variability. A TAE typically produces one or more annotations per text document. However, given the unstructured nature of text, the number of annotations per document can vary dramatically, and the variation tends to be quite specific to a given TAE. For instance, a named-entity PERSON TAE is likely to produce more annotation objects from a document that reports on a company’s award function than from a document that represents a section of a technical product manual.

Dynamism. It is unreasonable to expect that all “useful” semantic interpretations of a collection of text documents will be available a priori. As users and applications use the system, new TAEs will be constantly executed to extract more information from text. Furthermore, many TAEs act upon annotations produced by other TAEs to infer more complex relationships. For instance, in the CRM example, given the original documents and the annotations produced by the four TAEs listed earlier, a new TAE can infer the worksFor relationship that associates a specific person with the company that the person works for (e.g., associate Kevin Jackson with Firestone in Figure 3). As a result of all these factors, we expect that there can be anywhere from a few tens to several hundreds and possibly thousands of TAEs operating upon a given document collection.

Data uncertainty. As alluded to earlier, the characteristics of natural language are such that there is uncertainty associated with the information extracted by TAEs. There are two sources for this uncertainty.

Algorithmic uncertainty comes about because the particular algorithm underlying a TAE is limited in its understanding of text. For instance, given a document containing two person names, while an average human being may be able to identify both names with complete certainty, a named-entity person TAE, given the same task, may (i) identify both names correctly but only with 90% certainty, (ii) identify one name with 90% certainty and the other with 95% certainty, (iii) identify a piece of text incorrectly as a name with 20% certainty, and so on.

Inherent uncertainty is purely a consequence of the imprecise nature of natural languages. For example, given a document and asked to rate how relevant the document is to a specific topic, different individuals may respond with different ratings. Therefore, in addition to algorithmic uncertainty, a topic annotator that


Figure 6: An example annotation type hierarchy

is asked to perform the same task will have to deal with the inherent uncertainty in the text of the document.

Given these characteristics, we are faced with the following challenges in designing a system to store and query annotations:

Storage challenge. Given that annotation objects have complex types, exhibit inheritance, and have highly variable statistics, the task of automatically designing efficient storage schemes for such objects is a hard problem. For instance, naive schemes for mapping annotation objects into a relational data store will result in extremely sparse and inefficient tables (i.e., tables with a significant percentage of NULLs). In addition, because of dynamism, there will be a continuous infusion of new types and objects into the system. Designing techniques to seamlessly accommodate these new types, with minimal or no user involvement, is a significant challenge.
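One direction the storage challenge suggests, sketched here under our own simplifying assumptions (this is not AVATAR's actual scheme): instead of one universal table full of NULLs, generate a separate table per annotation type from that type's attribute signature, so new types arriving at runtime map to new narrow tables. The type map and attribute lists are illustrative.

```python
# Hypothetical per-type schema generation: one narrow table per annotation
# type, derived from the type's attribute signature.
SQL_TYPES = {int: "INTEGER", float: "DOUBLE", str: "VARCHAR(255)"}

def table_for(type_name, attributes):
    """Emit a CREATE TABLE statement for one annotation type."""
    cols = ", ".join(f"{name} {SQL_TYPES[py_type]}"
                     for name, py_type in attributes.items())
    return f"CREATE TABLE {type_name} (doc_id INTEGER, {cols});"

ddl = table_for("PersonName",
                {"name": str, "begin": int, "end": int, "prob": float})
print(ddl)
```

A real system would also have to quote identifiers (begin and end are reserved words in some SQL dialects), handle inheritance between types, and evolve schemas as types change; the sketch only shows why per-type tables avoid the NULL-sparsity of a universal table.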

Query challenge. Due to the presence of data uncertainty, queries involving annotations can produce uncertain result sets, even if the queries themselves are precise. Based on their underlying algorithm, many TAEs automatically provide some numerical measure of this uncertainty for each annotation that they produce. The query challenge is to mathematically represent this data uncertainty and develop a statistical model that prescribes how to compute result uncertainty for a given query. The precise nature of this statistical model is application dependent. In Sections 5 and 6, we describe some possible models for point retrieval and aggregate (OLAP-style) queries respectively.
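As a toy illustration of computing result uncertainty for a precise conjunctive query: if one assumes the TAE-reported probabilities are independent (an assumption made here purely for illustration; the paper does not commit to it), the probability that a report truly satisfies all predicates is the product of the annotation probabilities. The numeric values are hypothetical.

```python
# Result uncertainty for a conjunctive query, under an illustrative
# independence assumption over TAE-reported annotation probabilities.
import math

def conjunction_prob(annotation_probs):
    """P(all predicates hold) = product of per-annotation probabilities."""
    return math.prod(annotation_probs)

# e.g., an ORGANIZATION annotation with prob 0.8 and a "brake" topic
# annotation with prob 0.7 on the same report:
print(round(conjunction_prob([0.8, 0.7]), 2))  # 0.56
```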

Many of the unique characteristics of annotations, such as a large number of complex types, highly variable statistics, and dynamism, fit well within the semistructured framework. Therefore, the XML data model may seem a natural fit for storing annotations. However, there appear to be pros and cons to such an approach.

On the one hand, given the data characteristics, it appears likely that using an XML database as the back-end will alleviate some of the storage challenges. On the other hand, the fact that the schema for the annotations is well-defined implies that we are dealing with precisely structured data, albeit hierarchical. Therefore, the underlying

Figure 7: AVATAR application stack

database system needs to be sophisticated enough to use schema information during query optimization. Currently, we are not aware of any commercial or prototype XML database system that achieves this goal in a scalable fashion for complex XML schemas. Moreover, a large portion of structured data currently resides in relational systems. By using a relational data store as the underlying storage system for AVATAR, we avoid the problem of either having to migrate the structured data to an XML database or relying on support for querying across XML and relational data. However, since commercial database vendors are adding extensive support for XML, we intend to explore using an XML data store for AVATAR.

3 Architecture of AVATAR

In this section, we identify the infrastructural components for building enterprise applications in the hybrid world of structured and extracted data. Figure 7 shows the core components of our architecture, along with the sections in which those components are discussed in greater detail.

Storage The raw storage lies at the lowest level of the application stack. The storage component is responsible for automatically designing and populating schemes as new complex types and objects are received from the TAEs. Depending on the underlying data store, the storage component will generate either new relational schemes, new XML schemes, or a mix of both. In our current prototype, we are using a commercial relational database as our back-end data store. However, in the interests of space, we do not present further details of the storage component in this paper.

Object Model The object model is an abstraction layer residing above the storage layer. Dynamism results in a back-end schema that is ever evolving. The purpose of the object model is to hide the messy details of the underlying storage, such as table and column names, while providing user-centric abstractions such as document types, annotation types, and objects. The description of the object model is agnostic to the actual back-end storage.

Statistical Model To address the query challenge posed by uncertain annotations, we propose the building of statistical models. The statistical model can then be used to answer queries. In this work we assume that the uncertainties are available in the form of probabilities¹. Furthermore, AVATAR will treat the TAEs as black boxes. Consequently, all annotation probabilities will be treated as given, and the AVATAR statistical models will not attempt to capture the mechanics of the TAEs.

Certainly, the nature of the statistical models will depend on the queries and therefore the applications. We discuss below two classes of enterprise applications, roughly corresponding to OLTP and OLAP. Figure 7 provides a pictorial distinction between these applications.

Retrieval Broadly, retrieval applications involve the return of individual objects such as documents, annotations, or more complex types. These applications can be further divided into two categories as described below.

Simple Retrieval Simple retrieval accesses the annotations directly (see Figure 7). Described differently, this class of applications imposes no interpretation on the uncertainty associated with the annotations. Instead, the probabilities are treated identically to other attributes, and therefore predicates can be defined on these probabilities. A simple retrieval query based on Query 1 is given in Query 3.

Query 3. Return all the organization names starting with “Fire” (probability > 0.7) that occur in service reports filed within the last 6 months related to brake problems (probability > 0.6) concerning Buick vehicles.

An alternative paradigm would be to simply threshold the probabilities and retain only those annotations with a probability greater than a threshold. These annotations can then be treated like any other attributes in predicates. In this paradigm, the query shown in Query 1 would return the set of annotations that exceeded the thresholds.
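The two simple-retrieval paradigms can be sketched side by side: probabilities used directly in predicates (as in Query 3) versus thresholding first and then querying the surviving annotations. The annotations, organization names, and probability values below are illustrative.

```python
# Hypothetical annotations with TAE-reported probabilities.
annotations = [
    {"type": "ORGANIZATION", "text": "Firestone",       "prob": 0.85},
    {"type": "ORGANIZATION", "text": "Fireside Motors", "prob": 0.55},
    {"type": "TOPIC",        "text": "brake",           "prob": 0.70},
]

# Paradigm 1: probabilities are ordinary attributes usable in predicates
# (as in Query 3: organization probability > 0.7).
q3 = [a["text"] for a in annotations
      if a["type"] == "ORGANIZATION"
      and a["text"].startswith("Fire") and a["prob"] > 0.7]

# Paradigm 2: threshold first, then query the surviving annotations as usual.
THRESHOLD = 0.6
kept = [a for a in annotations if a["prob"] > THRESHOLD]
q3_thresholded = [a["text"] for a in kept
                  if a["type"] == "ORGANIZATION" and a["text"].startswith("Fire")]

print(q3, q3_thresholded)  # ['Firestone'] ['Firestone']
```

With this data the two paradigms agree, but they need not: a per-query predicate can demand a stricter bound than the global threshold, whereas thresholding discards low-probability annotations once and for all.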

Ranked Retrieval In ranked retrieval, the probabilities associated with the individual annotations are used to build appropriate statistical models. The ordering of the objects in the query result (documents or other return types) will be governed by the mechanics of the statistical model. For instance, for the query shown in example 2, the result will be an ordered list of all person names that match the structured predicates (i.e., the car is a Buick and the service report was filed in the last 6 months). The ordering will be governed by the probabilities associated with the annotations.
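In the simplest instance of ranked retrieval, the statistical model degenerates to ordering matches by their annotation probability. The sketch below uses that degenerate model with hypothetical names and probabilities; a real model could combine several annotation probabilities per result.

```python
# Minimal ranked-retrieval sketch: order matching person-name annotations
# by their TAE-reported extraction probability. Values are illustrative.
matches = [
    ("Kevin Jackson", 0.95),
    ("John Brown", 0.60),
    ("Paul Smith", 0.80),
]

ranked = sorted(matches, key=lambda m: m[1], reverse=True)
print([name for name, _ in ranked])
# ['Kevin Jackson', 'Paul Smith', 'John Brown']
```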

¹ Uncertainty expressed in ways other than probabilities can be converted using appropriate probability models.

Business Intelligence The belief that aggregate information (at different levels of granularity) is important for business decisions motivated OLAP models. Specialized infrastructural support in terms of schemas and join algorithms has enabled the development of large-scale reporting applications. The counterpart in the structured–unstructured world is more involved. There is a need for statistical models that appropriately capture the uncertainties and the relationships with dimensional hierarchies. Realizing this in a query-intensive environment while retaining the scalability of conventional OLAP is a significant challenge.
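One very simple way such a model might surface uncertainty in an aggregate report (a sketch of our own, not the paper's model): treat each report's topic probability as a fractional count, so the expected number of brake problems per region is the sum of the probabilities in that region. Regions and values are hypothetical.

```python
# Expected counts per dimension value: each report contributes its
# P(topic = "brake") rather than a hard 0/1 count. Values are illustrative.
from collections import defaultdict

rows = [
    ("New York", 0.9),    # (region, P(topic = "brake") for one report)
    ("New York", 0.4),
    ("California", 0.7),
]

expected = defaultdict(float)
for region, p in rows:
    expected[region] += p

print({region: round(v, 2) for region, v in expected.items()})
# {'New York': 1.3, 'California': 0.7}
```

Note that thresholding the same data at 0.5 would report one brake problem in each region, which is exactly the kind of distortion a statistical model over the dimensional hierarchy is meant to avoid.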

4 The AVATAR object model

As seen in Figure 7, the object model is a conceptual interface between the storage layer and the statistical model layer, providing mechanisms for representing the structured information extracted by text analytics. In Section 4.1, we list a set of properties that are required for representing extracted information and describe an object model that satisfies these requirements. In Section 4.2, we provide a bird’s-eye view of these properties using an example. A formal description of the object model is presented in Appendix A.

We would like to clarify that our object model is likely equivalent to a few other known models, such as the nested relational model (with a type system) or the object-relational model, and probably subsumed by other models such as the XML Schema abstract data model. The primary reason for describing and using yet another data model is the fact that our object model is tailored for our application domain. Further, the presence of uncertainty in the annotated data implies that we require some theoretical analysis for dealing with this uncertainty. By using a simple data model that only has features required for our application domain, we hope that the theoretical analysis becomes easier.

4.1 Properties of the object model

As a conceptual abstraction of the underlying data, our design goals for the object model can be summarized as follows:

Path expressions Able to express hierarchical structures through path expressions without explicit management of foreign key references. As discussed above, such ability to shield users from unnecessary storage details is essential in a dynamic system where new types of annotations arrive at arbitrary times.

Type constructors Allow the construction of new types through queries, and automatically manage their storage. The results of user queries may be used to construct future queries, in at least two cases. In the first case, a user doing exploratory analysis might run queries on the system interactively, building new queries on previous ones that have promising results. In the second case, an application might be programmed to construct queries in multiple steps.

Document reference Automatically keep a reference from annotations to their original documents. This allows integrated query across the structured and extracted information.

Subtype query Able to use subtype relations in queries. Annotations are often organized in type hierarchies. For example, an annotator that finds place names may optionally recognize the granularity, such as country or city. A query for place names should therefore be able to retrieve annotations that are explicitly places, as well as those that are countries and cities.

The object model can be formalized as a system of types and objects. In the example in Section 1.1, each row in the CRM table is regarded as a document. The collection of documents forms a type D, each individual document d in the collection being an object of that type. Running a TAE on a document d produces zero or more annotations a. The collection of such annotations defines a new type A, each annotation a being an object of that type.² Apart from attributes having the usual semantics, each annotation a also has an attribute a.d referring to the original document d from which it was obtained. This makes it possible to query structured, unstructured, and extracted information simultaneously.

One distinguishing feature of the object model is that each attribute of an object is automatically an object, and each attribute of a type is automatically a type. This allows chained attribute references in queries. For example, the object model permits expressions such as

car.owner.name like 'John%'   (1)

Another important characteristic of our object model is that the subtype relation is part of the intrinsic semantics. This means that, for example, if Car is a subtype of Vehicle, then a query for Vehicle would retrieve all objects of Car as well as of other subtypes of Vehicle.
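The properties above can be illustrated with a small sketch. The following Python fragment is purely illustrative (AVATAR itself sits on a relational DBMS, and the Place/City/Country types are hypothetical); it shows how a document back-reference, chained attribute access, and subtype-aware queries behave.

```python
# Illustrative sketch, not AVATAR's implementation: annotations keep a
# reference to their source document, attribute access chains naturally,
# and subtype queries fall out of ordinary subclass checks.
from dataclasses import dataclass

@dataclass
class Document:
    comments: str

@dataclass
class Annotation:
    doc: Document          # automatic back-reference to the original document

@dataclass
class Place(Annotation):
    name: str = ""

@dataclass
class City(Place):         # City is a subtype of Place
    pass

@dataclass
class Country(Place):      # so is Country
    pass

d = Document(comments="Brake inspection done in New Delhi, India.")
annotations = [City(doc=d, name="New Delhi"), Country(doc=d, name="India")]

# A query for Place retrieves City and Country objects as well.
places = [a for a in annotations if isinstance(a, Place)]
assert len(places) == 2
# Chained attribute reference, analogous to expression (1) above.
assert places[0].doc.comments.startswith("Brake")
```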

4.2 Example queries in the object model

In this subsection we illustrate various concepts in the object model with an example. The example may look somewhat contrived, due to our desire to fit all relevant concepts into the same example. However, in real-world applications, any combination of the issues discussed here can appear.

Let us revisit the CRM example in Figure 1. The table shown in the figure can be viewed as the definition of a type, ServiceRequest. Each type in our theory is defined by its schema and its set of objects. The schema of a type is a mapping from its set of attribute names to attribute types, as shown in Figure 8, where the type of an attribute is written under the name of the attribute. For example, city is an attribute of type City. The set of objects for this type is defined by the rows in the table shown in Figure 1.

²A TAE can be written to produce objects of more than one type. However, this detail does not affect what is being discussed here.

A user runs the topic TAE on the comments attribute of ServiceRequest, producing an annotation type Topic (Figure 8). It contains the attributes topic (name of the topic), prob (probability of the text being on topic), and doc, which is a reference to the original document. Such references are automatically maintained by the system.

As shown in the same figure, the user can also run a named-entity PERSON TAE on the comments attribute of ServiceRequest to produce an annotation type PersonName. The begin and end attributes indicate the location of the named entity in the text, given in terms of character offsets.

A simple retrieval query over the combined data Ds + De could be:

for all  ServiceRequest r, Topic t, PersonName p,
where    r = t.doc = p.doc and
         r.make like 'Chev%' and
         t.topic = 'brake' and
         t.prob > 0.5,
return   PersonTopic
         doc r, topic t, person p

This query is quite similar to a SQL query. However, since attributes of types are again types, the query is allowed to return a list of complete objects. This can be viewed as defining a new type PersonTopic, as shown in Figure 8. The schema of this type is defined by the return statement of the query. The object set of the type consists of all the tuples (r, t, p) that satisfy the query. Once defined, such a type can be used by the system just like any other type.
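Read operationally, the query above is a three-way join on the document reference followed by selections. A minimal Python sketch over hypothetical in-memory data:

```python
# A plain-Python reading of the retrieval query (hypothetical sample data):
# join ServiceRequest, Topic, and PersonName on their document reference,
# apply the make/topic/probability filters, and collect PersonTopic tuples.
requests = [{"id": 1, "make": "Chevrolet"}, {"id": 2, "make": "Ford"}]
topics = [{"doc": 1, "topic": "brake", "prob": 0.8},
          {"doc": 2, "topic": "brake", "prob": 0.9}]
persons = [{"doc": 1, "name": "Kevin Jackson"}]

person_topic = [
    {"doc": r, "topic": t, "person": p}
    for r in requests
    for t in topics if t["doc"] == r["id"]
    for p in persons if p["doc"] == r["id"]
    if r["make"].startswith("Chev") and t["topic"] == "brake" and t["prob"] > 0.5
]
assert len(person_topic) == 1
assert person_topic[0]["person"]["name"] == "Kevin Jackson"
```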

In applications, it is often useful to link the extracted information to related information contained in an existing database. In our current example, the person name extracted from the comments field is often the name of a CRM representative. These names can be correlated to names in the Employee database. A same-person TAE can be run on the PersonName type and the Employee type, producing a new type SamePerson, which contains two attributes pointing to the extracted PersonName and Employee types (Figure 8). The system handles subtype relations transparently: the same-person TAE is written for a pair of Person types, yet it can be run on the pair of PersonName and Employee types, which are subtypes of type Person.
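How a same-person TAE decides that two names match is outside the scope of this paper (we treat TAEs as black boxes). Purely to illustrate the shape of the SamePerson type, the sketch below uses exact matching on normalized names, with an assumed fixed confidence score:

```python
# Minimal sketch of linking extracted PersonName annotations to an Employee
# table. The real same-person TAE is a black box; here we match exactly on
# normalized names and attach an assumed confidence of 0.95.
person_names = [{"name": "kevin jackson "}, {"name": "J. Smith"}]
employees = [{"serial": 42, "name": "Kevin Jackson", "dept": "Service"}]

def normalize(name: str) -> str:
    # Lower-case and collapse whitespace before comparison.
    return " ".join(name.lower().split())

same_person = [
    {"person1": p, "person2": e, "prob": 0.95}   # assumed confidence
    for p in person_names
    for e in employees
    if normalize(p["name"]) == normalize(e["name"])
]
assert len(same_person) == 1
assert same_person[0]["person2"]["dept"] == "Service"
```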

These newly defined types allow subsequent queries to be handled much more simply, both in terms of user input and in terms of performance, since they can be based on the types constructed in previous queries. For example, the query

for all  PersonTopic pt, SamePerson sp,
where    pt.person = sp.person1 and
         sp.prob > 0.9,
return   pt.doc.model, sp.person2.dept

would return all those (model, dept) pairs where a person who likely (with more than 90% probability) belongs to that department has handled a service request that is probably (with more than 50% probability) about a brake problem on the specific Chevy model. This example demonstrates the convenience of using path expressions in queries without explicit foreign key references.

[Figure 8: Example schema of documents and annotations - a schema diagram of the types ServiceRequest, Topic, PersonName, PersonTopic, Employee, and SamePerson with their attributes.]

It is worth pointing out that neither the query syntax nor the back-end storage of the object model is tied to the relational view. In Appendix A.1, the query is expressed in a more abstract form that can be translated into any specific query language. In Appendix A.3, a simple mapping to a relational back-end is outlined, as currently being implemented.

5 Ranking

As shown in Figure 7, in ranked retrieval, a statistical model prescribes how to compute result uncertainty from data uncertainty. Once an uncertainty measure has been associated with each object in the result set, the result set can be sorted in decreasing order of this measure to produce a ranked/ordered list of result objects. In this section, we describe a framework for building such statistical models for ranked retrieval. We will begin by developing a probabilistic interpretation for the uncertainty associated with an annotation.

5.1 Interpreting annotation uncertainties

As mentioned earlier in Section 4, for the purposes of developing a statistical model, we treat TAEs as black boxes and make no attempt to precisely model how a TAE computes its uncertainty measure. In particular, we treat the numerical uncertainty measures provided by TAEs as probabilities, irrespective of whether these measures were indeed generated by a probability model.

We will assume that each annotation contains a single probability. Without loss of generality, assume that this probability is always stored in the special attribute prob. When a TAE produces an annotation object, it produces both a type assignment (i.e., a statement that an object belongs to a particular type) and a set of attribute values. Therefore, the uncertainty in an annotation can either be in the type assignment (i.e., whether the annotation truly belongs to the specified type) or in one of the attribute values. To illustrate, consider the Topic annotation type shown in Figure 8. We know that when a TAE produces objects of this type, the uncertainty is only in the value of the attribute Topic.topic. On the other hand, when a named-entity PERSON TAE produces an object a of type PersonName, the uncertainty is in whether that object is indeed a person name, i.e., in whether a belongs to type PersonName.

For the case when the uncertainty is in an attribute, we posit the following interpretation:

Definition 1 (Accuracy statement). Associated with every annotation type A, we have an attribute xA that is uncertain. Let r(a, xA) denote a random variable that represents the "true" value of a.xA for every annotation a of type A. Then, the statement (r(a, xA) = a.xA) is called the accuracy statement associated with annotation a. We will use ma to denote the accuracy statement associated with a.

Definition 2 (Annotation probabilities). If a is an annotation of type A whose uncertainty value is stored in a.prob, we have

P(ma|d) = a.prob   (2)

P(NOT ma|d) = 1 − a.prob   (3)

P(r(a, xA) = v|d) = 0, ∀v ≠ a.xA   (4)

In other words, we interpret a.prob as the probability that the value of a.xA is accurate for annotation a. Furthermore, we assume that the probability that the true value of a.xA is anything else is zero.³

As an example, if A is the Topic annotation shown in Figure 8, we have xA = topic. Thus, given a document d and a topic annotation t with t.prob = 0.6 and t.topic = "Brake", we have

P(r(t, topic) = Brake|d) = 0.6   (5)

P(r(t, topic) = v|d) = 0, ∀v ≠ Brake   (6)

Uncertainty in the type assignment can always be modeled in terms of uncertainty in an attribute by introducing a special attribute named type. For instance, when A = PersonName, we can set xA = type and implicitly set r(a, type) = PersonName ∀a ∈ PersonName. Thus, when a named-entity PERSON TAE produces a PersonName object a with a.prob = 0.7, the interpretation is that with 70% probability, a represents a person name (and hence is of type PersonName).

For convenience, in the rest of this section, we will use p(a|d) to denote the probability associated with annotation a. Thus,

p(a|d) = P (ma|d) = a.prob

5.2 Probability model

Using the above interpretation of annotation uncertainties, we will now develop a simple probability model for documents and annotations and present an associated ranking scheme. To make our task tractable, we make the following assumptions:

Assumption 1. We will only focus on document ranking in this section. In other words, we will assume that every query returns a collection of documents. The task of generating ranked sets for queries that return annotations and other complex types poses greater challenges that we discuss in Section 8.

³The third equation in Definition 2 is introduced for a specific reason in the subsequent development of the ranking model. Details are provided in [16].

Assumption 2. We will only consider document types and annotation types produced by TAEs that directly operate on documents. Annotations produced by TAEs that act upon other annotations, or types produced through user-defined queries, will not be modeled in this section. Thus, for the example shown in Figure 8, we will only deal with the types ServiceRequest, PersonName, and Topic.

Assumption 3. We will assume that there is only one annotation object per document for every annotation type. While this condition is true for the Topic annotation type, it is not in general true for PersonName, since there may be several person names in the same document. However, probabilistically modeling set-valued annotation types is a complex task that is reserved for future research.

Given these assumptions, we associate with a document type D a set of structured attributes S(D) = {s1, s2, ..., sk} and a set of extracted attributes E(D) = {e1, e2, ..., er}. Each extracted attribute represents an annotation object. For instance, in Figure 8, for the document type D = ServiceRequest, we have S(D) = {model, category, city, day, region} and E(D) = {topic, personName}.

5.3 Probabilistic ranking model

Given a query q(D), where D is a document type and q is a predicate involving the attributes of D, the result of q(D) is the set of documents of type D that satisfy q. In the result, a document is ranked based on the probability that it satisfies the query predicate. In other words, the rank of document d is given by

rank(d) = P(q(d)|d)   (7)

Let Xq denote the attributes of d that are relevant to q, and let Mq denote the accuracy statements of the extracted attributes of d that are relevant to q. For instance, if the query is "return all documents where model = Buick and topic = Brake", then Xq = (model, topic) and Mq = {md.topic}. We can rewrite (7) as

rank(d) = P(q(d)|Xq, Mq, d) P(Xq, Mq|d)   (8)

The first term on the right-hand side of the above equation represents the probability that a particular document will satisfy the query predicate, given the values of all the relevant attributes of the document. Clearly, this term corresponds to a precise database query with a deterministic 1/0 answer. In particular, if we restrict ourselves to documents which actually satisfy the query predicate q(d), the first term resolves to unity and rank(d) reduces to P(Xq, Mq|d). Since Xq is a deterministic variable representing the attributes of d, we get:

rank(d) = P(Mq|d)   (9)


Thus, we have reduced the problem of assigning a rank to document d to the problem of estimating the probability P(Mq|d).

A simple approach to estimating the probability in (9) is to assume that every annotation on a document is independent of all other annotations. We can therefore decompose Mq into individual terms involving each annotation. Thus, (9) becomes:

rank(d) = ∏_{e ∈ Xq ∩ E(D)} P(me|d)

Finally, using the interpretation of annotation probabilities from Section 5.1, we get

rank(d) = ∏_{e ∈ Xq ∩ E(D)} p(e|d)
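Under Assumptions 1-3 and the independence assumption, the ranking computation is therefore a simple product over the query-relevant extracted attributes. A sketch with hypothetical annotation probabilities:

```python
# Sketch of rank(d) as the product of p(e|d) over the extracted attributes e
# relevant to the query, for documents already known to satisfy the precise
# part of the predicate. The probabilities below are hypothetical.
from math import prod

docs = [
    {"id": 1, "model": "Buick", "probs": {"topic": 0.6, "personName": 0.7}},
    {"id": 2, "model": "Buick", "probs": {"topic": 0.9, "personName": 0.2}},
]
relevant = ["topic", "personName"]   # extracted attributes mentioned in the query

def rank(d):
    # Independence assumption: multiply the per-annotation probabilities.
    return prod(d["probs"][e] for e in relevant)

ranked = sorted(docs, key=rank, reverse=True)
assert rank(docs[0]) == 0.6 * 0.7
assert [d["id"] for d in ranked] == [1, 2]
```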

6 Business Intelligence

As with business intelligence over structured data, the goal in the structured-unstructured world is to enable slicing and dicing of measures with respect to dimensions. Measures and dimensions in the world of Ds + De are similar to those of the structured world. As an example, consider Query 2 in Section 1.1. Certainly, the answer to this query is the aggregate Brake information of all service records containing the Person name Kevin Jackson reported from New York for Chevy vehicles. Complications arise due to the fact that the measure (Brake information) is a probability distribution and one of the dimensions (the Person annotation) has a probability associated with it. To enable a systematic tackling of the issues, we consider the following three cases.

Case 1: Dimensions from De + Ds and measure from De

Consider Query 2. This is the most general of the cases we consider. De contributes both to the measure and the dimensions of the cube from which this query is answered.

Case 2: Dimensions from De + Ds and measure from Ds

Consider, instead, the query given below. Person is an "extracted" dimension, while the measure is simply the count of service records.

Query 4. How many service records contain the name "Kevin Jackson" and are from New York?

Case 3: Dimensions from Ds and measure from De

This is the simplest of the three cases, where Location and Automobile are the dimensions and the measure is the extracted "Brake topic".

Query 5. What is the likelihood of brake problems in New York for Chevy vehicles?

The cases described above are listed in decreasing order of complexity. Cases 1 and 2 need to deal with the probabilities as well as with representational issues; i.e., we need the equivalent of a star schema for dimensions extracted from De. The representational issues arise because the data in De may not be regular (e.g., the mentions of Kevin Jackson may be highly variable across documents). Fitting such irregular data into a star schema is challenging. A related issue is the definition of hierarchies for dimensions extracted from De. These could be defined using the annotation type hierarchy (Figure 6) or possibly from Dr (see Section 1).

Case 3, on the other hand, needs only to deal with probabilistic measures. It is precisely for this case that we have proposed a solution, described in our recent submission [8]. Probabilistic OLAP (PrOLAP), based on the theoretical development described in [20], proposes a statistical model-based solution. Further, the query in Case 3 above is interpreted as P(Topic = Brake | City = NY, Category = Chevy). The goal is to answer all such queries (slice and dice), preferably using existing SQL support for OLAP applications. A detailed explanation of PrOLAP is provided in our recent submission [8], but an overview is provided in Appendix B.
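As an illustration only (the actual PrOLAP model is developed in [8, 20], not here), one naive way to estimate P(Topic = Brake | City = NY, Category = Chevy) is to average the per-record brake-topic probabilities within the selected cell of the cube:

```python
# Illustrative estimate of a probabilistic measure for one cube cell:
# select the records matching the City and Category dimensions, then
# average their per-record brake-topic probabilities (hypothetical data;
# the real PrOLAP model is described in [8]).
records = [
    {"city": "NY", "category": "Chevy", "p_brake": 0.8},
    {"city": "NY", "category": "Chevy", "p_brake": 0.4},
    {"city": "SF", "category": "Chevy", "p_brake": 0.9},
]
cell = [r for r in records if r["city"] == "NY" and r["category"] == "Chevy"]
p_brake_given_cell = sum(r["p_brake"] for r in cell) / len(cell)
assert abs(p_brake_given_cell - 0.6) < 1e-9
```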

7 Application Examples

The commercial success of relational databases is primarily due to large-scale applications such as enterprise resource planning (ERP), human resources (HR), payroll, OLAP reporting, etc. Similarly, the success of a system such as AVATAR hinges largely on the identification of killer applications that can leverage the combined structured-unstructured data. The first important step is to identify applications that contain data sources where text is associated with important structured data. Besides CRM, other such hybrid sources are e-mail and documents in collaboration applications such as Lotus Notes. We provide below a few scenarios that drive home the value of AVATAR.

CRM Revisiting our CRM example, let us consider a sales promotion application. Such a promotion could have one or more of several possible goals: customer retention, increased goodwill, reduced inventory, etc. The application designer would translate this broad goal into appropriate queries on AVATAR. Existing systems score customers while striking a delicate balance between recall (not missing customers who are likely to buy) and precision (not mailing rebates to customers who might treat them as spam). A system such as AVATAR incorporates a new dimension and therefore can potentially increase both the timeliness and the precision. Consider, for example, an application that sends out customer rebate coupons. The following query helps to target such rebates towards customers who drive older Buicks and have had recent engine problems. The results of the query could be processed automatically to send out the offers.

Query 6. Return a list of customers who drive Buicks manufactured before 1998 and have complained of engine problems in the last 6 months.


Alternatively, consider an application trying to track problem types that required the attention of "Service Managers" before the record was closed. This requirement can be translated to the following query.

Query 7. Return all problems from service records that mention a Service Manager and have been closed.

E-Mail Another important domain for seamless querying across structured and unstructured data is e-mail. Consider a user who receives regular emails from John Smith on various topics. The search problem is one where the user is looking for the name and phone number of a particular database expert that John had referred to him in an e-mail. This search need is captured, somewhat, in the query given below. In the absence of annotations, the only recourse is to perform some sort of keyword search on emails from John Smith and then read every email in the result set to obtain the information.

Query 8. Return all phone numbers and names of Persons mentioned in e-mails from John Smith that discuss database research.

The above query assumes that the only annotations available are named entities (Persons) and topics. Suppose, however, that a relationship annotator were available and had indeed identified Persons as database researchers; then the query would be somewhat different.

Query 9. Return all names and phone numbers of Persons who are database experts mentioned in John Smith's emails.

Note that Query 9 is a much stronger query than Query 8, but it requires a significantly more powerful relationship identification TAE. However, as shown by Query 8, even the simple TAEs available today can be used to form very powerful queries.

8 Issues for Future Research

The vision and architecture that we have presented in this paper raise several issues that warrant significant research. Drawing from our experience in building the AVATAR prototype, we enumerate some of these issues below.

Time-dependent object model

The object model prescribes a set of rules for a consistent system of types and objects. In practice, the type system is ever expanding. New types may be added due to annotations or queries. New subtype relations may be introduced. The set of objects for a type may change due to new annotations. Since these types and objects are automatically managed by the system, it is important to develop rules by which the consistency of the system is maintained at each step. Since each attribute is a reference, the distinction between copy and reference semantics becomes an issue for new types defined by user queries. In essence, it is necessary to develop a theory of time-dependent type systems.

Multiple annotators for the same type

In a real-world situation, it is often the case that several versions of TAEs for essentially the same semantic information are available. They may be developed independently, using similar or different algorithms. The types of annotations they produce may be different, with different but often overlapping attributes. A user can apply several such TAEs to the same documents, producing slightly different annotations. The user might want to consider all of them as subtypes of a supertype in some queries, yet as separate types in other queries. The user might even want the system to discover possible subtype relations automatically. Correct treatment of these issues requires careful investigation.

Extended ranking models for retrieval

In Section 5, we made several assumptions about the interpretation of annotation uncertainties. We assumed that query return types are documents, that queries only involve direct annotations on documents (rather than annotations on annotations or annotations defined by user queries), that each document contains exactly one annotation of a given type, and that the probability for each annotation is independent of other annotations. Removing these restrictions would enlarge the set of query semantics for which ranking of the results can be defined. However, removing any one of these assumptions will remove independence among the probability distributions for different annotations. For example, if an annotation p points to another annotation b with some associated probability, there will be statistical dependence between these two annotations.

Extended models for business intelligence

Our current effort in OLAP has been restricted to the simple case where only the measure is derived from De. Extending the OLAP data model to incorporate extracted dimensions is a direction for future research. The present theoretical basis for PrOLAP is restricted to obtaining average aggregate behavior over distributions [20]. Extending the analysis to facilitate obtaining extreme and other aggregate behaviors is another area of future research. Data paucity, where observations are available for only some cells of a cube, is another important issue that needs to be addressed. Unlike conventional OLAP, the probabilistic model in PrOLAP provides a predictive capability that enables a systematic solution.

9 Related work

As we mentioned in Section 1, there is a long history of work in the area of bridging text retrieval systems and structured databases [14, 22, 13, 23, 43, 9, 30, 19, 15, 40].


A number of research prototypes have been developed to address the problem of supporting keyword queries over structured data (the structured, imprecise part of the plane in Figure 4). In DBXplorer [2] and DISCOVER [27], techniques for supporting keyword queries over relational databases are presented. In [26], algorithms for returning the top-k results in this context are presented. In [5, 21], algorithms for evaluating keyword queries over graph-structured data are presented.

Recently, in [3, 4, 18, 41, 42], query languages that integrate information-retrieval features such as ranking and relevance-oriented search into XML queries have been proposed. Techniques to evaluate these ranked queries are also proposed in [3, 4, 41, 42]. In [33], the problem of ranking SGML documents using term occurrences is considered.

Several proposals have been made for ranked search over a corpus of document databases combining keyword and structure components [25, 35]. In [25], structure is imposed on text documents by partitioning the text into multi-paragraph units that correspond to subtopics. The structure part of the query can restrict the keywords to a particular subtopic. In [35], a more general structure model is presented, where the structure of the text document is organized as a set of independent hierarchies.

In [24], an approach is presented for adding structure to the (unstructured) HTML web pages on the World Wide Web. The authors argue that creating structured data typically requires technical expertise and substantial up-front effort; hence people usually prefer to create unstructured data instead. One of the core ideas of the system is therefore to make structured content creation, sharing, and maintenance easy and rewarding, and the authors introduce a set of mechanisms to achieve this goal. Notice how this approach differs from ours, where we use text analytics to add structure to unstructured data. Our focus is on enterprise applications, whose characteristics are quite different from web data authoring. First, domain knowledge enables us to use a variety of text analytic tools; in contrast, the authors of [24] mention that they do not use information extraction techniques in their system, as doing so would require domain knowledge that may not be available in their scenario. Second, the cost of manually adding structure to unstructured data is likely to be high in an enterprise application.

10 Conclusion

In this paper, we have laid out a case for text analytics as a mechanism for bridging the structured-unstructured divide in the enterprise. We presented an overview of recent advances in text analytics and argued that the data management community has a significant role to play in bringing these advances to the enterprise. Towards this end, we laid out a vision and architecture for how these two communities can come together. Based on our prototyping experience, we identified several challenges as well as open issues that warrant further investigation. We believe that the vision that we have laid out in this paper will prove to be a fertile area of research, with contributions from the machine learning, data management, and NLP/IR communities.

References

[1] AAAI 2004. http://www.clairvoyancecorp.com/Research/Workshops/AAAI-EAAT-2004/home.html.

[2] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer:A system for keyword-based search over relationaldatabases. In Proc. of ICDE, 2002.

[3] S. Al-Khalifa, C. Yu, and H. V. Jagadish. Queryingstructured text in an xml database. In SIGMOD, 2003.

[4] S. Amer-Yahia, S. Cho, and D. Srivastava. Tree pat-tern relaxation. In EDBT, 2002.

[5] G. Bhalotia et al. Keyword searching and browsing indatabases using BANKS. In Proc. of ICDE, 2002.

[6] S. Bird, D. Day, J. Garofolo, J. Henderson, C. Laprun,and M. Liberman. ATLAS: A flexible and extensi-ble architecture for linguistic annotation. In Proc. ofthe Second Intl. Language Resources and EvaluationConf. (LREC), 2000.

[7] K. Bontcheva, V. Tablan, D. Maynard, and H. Cun-ningham. Evolving GATE to meet new challenges inlanguage engineering. Natural Language Engineer-ing, June 2004.

[8] Doug Burdick, Prasad Deshpande, T.S. Jayram, andShivakumar Vaithyanathan. Prolap: Probabilisticolap, 2004.

[9] W. Bruce Croft, Lisa Ann Smith, and Howard R. Tur-tle. A loosely-coupled integration of a text retrievalsystem and an object-oriented database system. InProc. of the 15th Annual Intl. ACM SIGIR Conf. onResearch and Development in Information Retrieval,pages 223–232, June 1992.

[10] H. Cunningham. Information extraction - a userguide. Technical Report CS-97-02, University ofSheffield, 1997.

[11] H. Cunningham, D. Maynard, K. Bontcheva, andV. Tablan. GATE: A framework and graphical devel-opment environment for robust nlp tools and applica-tions. In Proc. of the 40th Anniversary Meeting of theAssociation for Computational Linguistics (ACL02),2002.

[12] D. Dave and S. Lawrence. Mining the peanutgallery: opinion extraction and semantic classifica-tion of product reviews. In Proc. of the Twelfth Intl.World Wide Web Conference (WWW2003), 2003.

12

Page 13: AVATAR: Using text analytics to bridge the structured ... India Research Lab Block I, Indian Institute of Technology Hauz Khas, New Delhi - 110016, INDIA Abstract There is a growing

[13] Arjen P. de Vries and Annita N. Wilschut. On theintegration of IR and databases. In Proc. of the 8thIFIP 2.6 Working Conferene on Database Semantics,January 1999.

[14] Samuel DeFazio, Amjad M. Daoud, Lisa Ann Smith,Jagannathan Srinivasan, W. Bruce Croft, and James P.Callan. Integrating IR and RDBMS using cooperativeindexing. In Proc. of the 18th Annual Intl. ACM SI-GIR Conf. on Research and Development in Informa-tion Retrieval, pages 84–92, July 1995.

[15] Stefan Deßloch and Nelson Mendonca Mattos. In-tegrating SQL databases with content-specific searchengines. In Proc. of 23rd Intl. Conf. on Very LargeData Bases, pages 528–537, August 1997.

[16] Huaiyu Zhu et. al. Probabilistic ranking modelsfor annotated data. Technical report, IBM ResearchTechnical Report, 2004.

[17] D. Ferrucci and A. Lally. UIMA: An architectural ap-proach to unstructured information processing in thecorporate research environment. Natural LanguageEngineering, June 2004.

[18] N. Fuhr and K. Grobjohann. XIRQL: A language forinformation retrieval in XML documents. In Proc. ofSIGIR, 2001.

[19] Norbert Fuhr and Thomas Rolleke. A probabilisticrelational algebra for the integration of informationretrieval and database systems. ACM Transactions onInformation Systems, 15(1):32–66, 1997.

[20] Ashutosh Garg, Jayram Thathachar, ShivakumarVaithyanathan, and Huaiyu Zhu. Generalized opin-ion pooling, 2004.

[21] R. Goldman et al. Proximity search in databases. In Proc. of VLDB, 1998.

[22] David A. Grossman, Ophir Frieder, David O. Holmes, and David C. Roberts. Integrating structured data and text: A relational approach. Journal of the American Society for Information Sciences, 48(2):122–132, 1997.

[23] J. Gu, U. Thiel, and J. Zhao. Efficient retrieval of complex objects: Query processing in a hybrid DB and IR system. In Proc. of the 1st German National Conf. on Information Retrieval, 1993.

[24] Alon Y. Halevy, Oren Etzioni, AnHai Doan, Zachary G. Ives, Jayant Madhavan, Luke McDowell, and Igor Tatarinov. Crossing the structure chasm. In CIDR, 2003.

[25] M. Hearst and C. Plaunt. Subtopic structuring for full-length document access. In Proc. of SIGIR, 1993.

[26] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over relational databases. In VLDB, 2003.

[27] V. Hristidis and Y. Papakonstantinou. DISCOVER: keyword search in relational databases. In VLDB, 2002.

[28] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Proc. of the 10th European Conf. on Machine Learning (ECML98), 1998.

[29] C. Laprun, J. Fiscus, J. Garofolo, and P. Sylvain. A practical introduction to ATLAS. In Proc. of the Third Intl. Conf. on Language Resources and Evaluation (LREC), 2001.

[30] Clifford A. Lynch and Michael Stonebraker. Extended user-defined indexing with application to textual databases. In Proc. of the Fourteenth Intl. Conf. on Very Large Data Bases, pages 306–317, August 1988.

[31] E. Marsh and D. Perzanowski. Overview of results of the MUC-7 evaluation. In Proc. of the Seventh Message Understanding Conf. (MUC-7), pages 13–31, 1998.

[32] Joseph F. McCarthy and Wendy G. Lehnert. Using decision trees for coreference resolution. In IJCAI, pages 1050–1055, 1995.

[33] S. Myaeng et al. A flexible model for retrieval of SGML documents. In SIGIR, 1998.

[34] Nanda Kambhatla. Combining lexical, syntactic and semantic features with maximum entropy models for extracting relations. In Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL04), 2004.

[35] G. Navarro and R. Baeza-Yates. Proximal nodes: A model to query document databases by content and structure. ACM Transactions on Information Systems, 15(4), 1997.

[36] Vincent Ng and Claire Cardie. Improving machine learning approaches to coreference resolution. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, pages 104–111, 2002.

[37] Kamal Nigam, Andrew K. McCallum, Sebastian Thrun, and Tom M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.

[38] OTC 2001. http://www.daviddlewis.com/events/otc2001/index.html.

[39] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proc. of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, 2002.

[40] Ron Sacks-Davis, Alan J. Kent, Kotagiri Ramamohanarao, James A. Thom, and Justin Zobel. Atlas: A nested relational database system for text applications. Transactions on Knowledge and Data Engineering, 7(3):454–470, 1995.

[41] T. Schlieder and H. Meuss. Result ranking for structured queries against XML documents. In DELOS Workshop on Information Seeking, Searching and Querying in Digital Libraries, 2000.

[42] A. Theobald and G. Weikum. The index-based XXL search engine for querying XML data with relevance ranking. In EDBT, 2002.

[43] Marc Volz, Karl Aberer, and Klemens Böhm. An OODBMS-IR coupling for structured documents. Bulletin of the Technical Committee on Data Engineering, 19(1):34–42, 1996.

[44] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. Kernel methods for relation extraction. Journal of Machine Learning Research, 3:1083–1106, 2003.

[45] Tong Zhang and Frank J. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4(1):5–31, 2001.


Figure 9: Type attribute paths as a tree (the root node is the type A, with edges labeled x, y, z leading to the child nodes A.x, A.y, A.z)

A Outline of Object Model

A.1 The type system of the object model

Here we give an outline of the object model. This subsection does not deal with documents and annotations, which are treated in the next subsection. The basic concepts of concern here are objects a, b, c, . . . , types A, B, C, . . . , and attribute names x, y, z, . . . . A type A can be a subtype of B, written as A ⪯ B. The subtype relation forms a partial order among the types.

Every object a belongs to a specific type type(a), and every type A is associated with a set of objects objs(A), consisting of all the objects whose type is a subtype of A. That is,

a ∈ objs(A) ⇐⇒ type(a) ⪯ A. (10)

As a consequence of � being a partial order, we have

A ⪯ B =⇒ objs(A) ⊆ objs(B). (11)

Every type A has a set of named attributes. The schema of A maps each attribute name x to the corresponding attribute type, A.x:

sch(A) : x → A.x. (12)

If A ⪯ B, then A.x ⪯ B.x for every attribute x of B. Similarly, every object a has a set of named attributes. Each attribute name x maps to an object a.x. If type(a) ⪯ A, then type(a.x) ⪯ A.x for every attribute x of A.

Since A.x is again a type whenever it is defined, we can chain attributes to denote types such as A.x1.x2. . . . . We call x = x1.x2. · · · .xn a path. The set of all valid paths for type A is denoted C(A).

As shown in Figure 9, a type A can be associated with a rooted edge-labeled tree. The root of the tree is the type A. Each label x on an edge leaving A is an attribute, leading to the child node A.x. An edge path x1, x2, · · · , xm starting from the root A corresponds to a path x1.x2. · · · .xm ∈ C(A).
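To make the type system concrete, here is a minimal Python sketch — our own illustration, not AVATAR code — of types with schemas, the subtype partial order ⪯, and path resolution into C(A). The class name Type, the method names, and the example attributes are all hypothetical.

```python
# A minimal sketch (not the paper's implementation) of the type system:
# types with named attributes, a subtype partial order, and attribute paths.

class Type:
    def __init__(self, name, schema=None, parent=None):
        self.name = name
        self.schema = schema or {}   # sch(A): attribute name -> type A.x
        self.parent = parent         # immediate supertype, if any

    def is_subtype_of(self, other):
        # A <= B iff B is reachable along the parent chain (a partial order).
        t = self
        while t is not None:
            if t is other:
                return True
            t = t.parent
        return False

    def resolve(self, path):
        # Resolve a dotted path "x1.x2.....xn" to the type A.x1.x2.....xn,
        # i.e. a member of C(A) when every step is defined.
        t = self
        for step in path.split("."):
            t = t.schema[step]
        return t

# Example: type A with attributes x, y, z, where A.x itself has attribute w.
INT = Type("int")
AX = Type("A.x", {"w": INT})
A = Type("A", {"x": AX, "y": INT, "z": INT})
assert A.resolve("x.w") is INT       # "x.w" is a valid path in C(A)
assert AX.is_subtype_of(AX)          # reflexivity of the partial order
```

A deeper subtype hierarchy would simply chain `parent` links; `objs(A)` would then collect objects over all subtypes, as in Equation 10.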

As in relational algebra, where queries are expressed as table constructors, in our object model queries are expressed as type constructors. These type constructors are


similar to the corresponding operators in relational algebra. The main differences are: (1) instead of dealing with a single attribute (a column in a table), we deal with an entire path of attributes; (2) subtype relations are taken into consideration.

Cartesian product. The Cartesian product type constructor T forms a new type A by stringing together several existing types Ai using attribute names xi. Given a set of names xi and corresponding types Ai, the type A is defined such that A.xi = Ai, and such that objs(A) consists of objects a with a.xi = ai and ai ∈ objs(Ai). We write

A = T S, where S : xi → Ai. (13)

Projection. The projection operator forms a new type B by stringing together a subset of the paths yi of a type A using attribute names xi. Given a type A, a set of names xi, and corresponding paths yi ∈ C(A), the type B is defined such that B.xi = A.yi, and such that objs(B) consists of objects b with b.xi = a.yi for some a ∈ objs(A). We write

B = πsA, where s : xi → yi. (14)

Selection. The selection operator forms a new type B by selecting a subset of the objects of a type A according to a predicate p defined on A. A predicate p defined on a type A is a map from objects of A to truth values. Given a type A and a predicate p defined on A, the type B is defined such that it has the same schema as A, and such that objs(B) consists of the objects a ∈ objs(A) that satisfy the predicate p(a). We write

B = σpA. (15)

Note that the predicate p may also involve path expressions.

General type constructor. A general type constructor can be formed using the Cartesian product, projection, and selection operators. Given a finite map f from names to types, a finite map s from names to paths in C(T f), and a predicate p defined on the type T f, a new type can be defined as

πsσp T f. (16)

This is a general form for specifying queries in the object model. It is written in a form similar to relational algebra. To bring out the conceptual linkage between the object model and existing data models, it is instructive to write it in syntaxes similar to those other formalisms. Thus we have, in the style of relational calculus or set-comprehension notation,

{a.s : a ∈ f, p(a)}, (17)

in the style of SQL,

SELECT s (18)

FROM f (19)

WHERE p (20)

and finally, in the style of XQuery,

for a in f (21)

where p(a) (22)

return a/s (23)

The syntactic details of these query forms are of course different, depending on how the mappings f and s and the predicate p are spelled out.
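The general constructor πsσp T f can be mimicked over plain Python dictionaries. The following toy evaluator — a hedged sketch with invented names, not the AVATAR query engine — composes Cartesian product, selection, and projection over paths, in the spirit of the SELECT s FROM f WHERE p form above.

```python
from itertools import product

# A toy evaluation (assumed semantics, not AVATAR's) of the general type
# constructor pi_s sigma_p T f, with objects modeled as nested dicts.

def cartesian(f):
    # T f: string together objects drawn from each named extent objs(A_i).
    names = list(f)
    return [dict(zip(names, combo)) for combo in product(*(f[n] for n in names))]

def select(objects, p):
    # sigma_p: keep the objects satisfying predicate p.
    return [a for a in objects if p(a)]

def get_path(a, path):
    # Follow a dotted attribute path a.x1.x2.....xn.
    for step in path.split("."):
        a = a[step]
    return a

def project(objects, s):
    # pi_s: build new objects with b.x_i = a.y_i for each name -> path in s.
    return [{x: get_path(a, y) for x, y in s.items()} for a in objects]

# Example pi_s sigma_p T f, in the style of SELECT s FROM f WHERE p.
f = {"r": [{"name": "Fire Inc", "loc": {"state": "NY"}},
           {"name": "Acme",     "loc": {"state": "CA"}}]}
p = lambda a: a["r"]["loc"]["state"] == "NY"
s = {"who": "r.name"}
result = project(select(cartesian(f), p), s)
# result == [{"who": "Fire Inc"}]
```

Note how both the predicate and the projection operate on whole paths ("r.loc.state", "r.name") rather than single attributes, which is difference (1) noted above.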

A.2 Documents and Annotations

The theory outlined in the previous subsection addresses the issues of path expressions, type construction, and subtype queries. It does not address the issue of document references, which is addressed in this subsection. We introduce a mapping from some types to pairs of types and attribute names:

doc : A → (D, d), (24)

where A and D are types, d is an attribute name of A, and A.d = D. Intuitively, D is considered a document type. A is considered an annotation type, obtained by performing annotation on a text attribute of the document; d is the reference from the annotation back to the original document. Exactly which attribute t of D is regarded as text is of little concern to us here. Note that annotations produced by the same TAE but on different text fields of the same document type D are considered different types.

Annotation types are the primary mechanism for combining structured and unstructured data. For example, given a set of documents D, we might perform multiple annotations A1, A2, . . . on D, resulting in

doc(A1) = (D, d1), (25)

doc(A2) = (D, d2), (26)

. . . (27)

A query might ask for all the annotations a1 ∈ objs(A1), a2 ∈ objs(A2), . . . , such that a1.d1 = a2.d2, . . . , in addition to whatever predicates these objects must satisfy. Such queries, where all the annotations are joined by common documents, occur quite often in practice. As a convenience to the user, we define a new operator A, which works in the following way: given a mapping f from some names to annotation types

f : xi → Ai, where doc(Ai) = (Di, di), (28)

a new annotation type A = A f is defined similarly to T f, except that it has an additional attribute d such that A.d = Di for all i. Therefore it is only defined when all the annotations Ai share the same document type, which also becomes its own document type. Furthermore, for any object a ∈ objs(A), the attribute a.d = a.xi.di for all i. That is, an object a in A is made up of objects ai in Ai, such that all the ai are annotations on the same document, plus an attribute a.d that equals that document.


Clearly, the type A thus defined is also an annotation type, with doc(A) = (A.d, d). Similarly to the operator T, we can use the operator A to form general queries of the form πsσp A f.
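As an illustrative sketch of the A operator — function names, attribute names, and data are all invented — the following Python joins annotation objects that share a common document reference, enforcing a.d = a.xi.di for all i.

```python
# A hedged sketch of the document-join operator A: combine annotation
# objects that reference the same document. Structure is illustrative.

def doc_join(f, doc_attrs):
    """f: name -> list of annotation objects; doc_attrs: name -> the
    attribute d_i holding each annotation's document reference."""
    names = list(f)
    first, rest = names[0], names[1:]
    out = []
    for a0 in f[first]:
        # The combined object carries a.d, the shared document reference.
        combo = {first: a0, "d": a0[doc_attrs[first]]}
        _extend(f, doc_attrs, rest, combo, out)
    return out

def _extend(f, doc_attrs, names, combo, out):
    if not names:
        out.append(dict(combo))
        return
    n, tail = names[0], names[1:]
    for a in f[n]:
        if a[doc_attrs[n]] == combo["d"]:      # enforce a.x_i.d_i == a.d
            _extend(f, doc_attrs, tail, {**combo, n: a}, out)

persons = [{"span": "John", "d1": "doc7"}, {"span": "IBM", "d1": "doc9"}]
orgs    = [{"span": "IBM",  "d2": "doc7"}]
joined = doc_join({"p": persons, "o": orgs}, {"p": "d1", "o": "d2"})
# joined pairs only annotations over the same document (here: doc7)
```

A πsσp over the result of `doc_join` then gives queries of the form πsσp A f.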

A.3 Simple relational backend

The object model as described above is independent of the actual storage scheme. Due to its similarity to relational models, it is relatively easy to translate object model concepts into a relational model for storage. We describe here the most straightforward translation; other translations are possible, with different performance and optimization trade-offs. It is also possible to translate the object model into XML data models. When such backend storage becomes practically available, we intend to make corresponding translations available as well; this will not be discussed any further here.

In the simple backend scheme, the object model is translated into the relational model according to the following rules:

• A type A is translated into a table t = tab(A).

• An attribute x of A is translated into a column c = col(A, x). An additional column i = id(A) acts as the primary key of the table t.

• An object a of type A is translated into a row in tab(A).

• If x is an atomic attribute, then the value a.x is stored in the column c of table t.

• If x is a non-atomic attribute, then the value a.x is stored in the foreign table t1 = tab(A.x), whose id column is i1 = id(A.x). The column c in table t is a foreign key reference to the column i1 of t1.

With these rules, one additional type introduced into the object model corresponds to one or several additional tables in the backend relational model. One additional object corresponds to one extra row in the tables corresponding to the object's type and to the non-atomic types along all its paths. Atomic attributes are stored in place in the tables, while non-atomic attributes are stored in foreign tables.

This translation scheme, together with the type aspects of the object model such as subtype relations and schemas, can itself be described in terms of relational tables, called the metadata tables. A query to the object model can be uniquely translated into a SQL query to the backend relational model according to the information in the metadata tables. Consider a query of the form

πsσp T f, (29)

where s is a map from names to paths, p is a predicate, and f is a mapping from names to types. The translation involves the following main steps:

• Resolving path expressions appearing in either s or p. A path of the form x1.x2. . . . introduces an additional table alias for each link.

• Constructing join conditions associated with path expressions. These join conditions are produced by the foreign key references for each link in the path.

• Constructing join conditions associated with document references. If the query is based on the operator A instead of T, additional join conditions for the common document references are introduced.
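The first two steps — one table alias per path link and one foreign-key join condition per link — can be sketched as follows. This is a hypothetical illustration of a metadata-driven translation; the metadata layout, the table and column names, and the `id` convention are our assumptions, not AVATAR's actual catalog.

```python
# Illustrative sketch: resolving a path expression into table aliases
# and foreign-key join conditions, as in the simple relational backend.

META = {
    # type -> (table, {attribute: (column, attribute_type or None)})
    "Report": ("t_report", {"author": ("c_author", "Person")}),
    "Person": ("t_person", {"name":   ("c_name",   None)}),
}

def translate_path(root_type, path):
    """Return (select column, FROM aliases, WHERE join conditions) for a
    dotted path such as 'author.name' rooted at root_type."""
    aliases, joins = [], []
    t, alias = root_type, "a0"
    aliases.append(f"{META[t][0]} {alias}")
    for i, step in enumerate(path.split(".")):
        col, next_t = META[t][1][step]
        if next_t is None:                         # atomic: stored in place
            return f"{alias}.{col}", aliases, joins
        nxt = f"a{i + 1}"                          # one new alias per link
        aliases.append(f"{META[next_t][0]} {nxt}")
        joins.append(f"{alias}.{col} = {nxt}.id")  # FK reference to id column
        t, alias = next_t, nxt
    return f"{alias}.id", aliases, joins

sel, frm, whr = translate_path("Report", "author.name")
sql = f"SELECT {sel} FROM {', '.join(frm)} WHERE {' AND '.join(whr)}"
# e.g. SELECT a1.c_name FROM t_report a0, t_person a1 WHERE a0.c_author = a1.id
```

The third step (document references) would add one more equality condition per annotation type, joining each alias's d_i column to the shared document id.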

B Probabilistic OLAP

In this section we outline the mechanics of our solution for Case 3 from Section 6. Let A1, A2, . . . , Ak denote the attributes over all the dimensions. The data is given as a table of records, where each record contains an assignment of values to A1, A2, . . . , Ak and a probability distribution over a single uncertain measure, called the opinion4 and denoted by O. This table is constructed by obtaining A1, A2, . . . , Ak

from Ds and O from De (cf. Figure 2). The statistical model is formalized by considering

the joint probability distribution P(O, A1, A2, . . . , Ak), which factors into the product of P(O|A1, . . . , Ak) and P(A1, . . . , Ak). The first term, P(O|A1, . . . , Ak), models the uncertainty in the opinion with respect to the attribute assignment. The other term in the product can be viewed as the weight associated with that opinion. The joint distribution P(O, A1, . . . , Ak) is obtained by optimizing a KL-divergence objective function, as shown in [20, 8]. In this setting, each record in the data is interpreted as an empirical conditional probability distribution over the opinion. Specifically, let a1, a2, . . . , ak denote the assignment of values to A1, A2, . . . , Ak, respectively. Then, the probability associated with opinion value o is denoted by p(O = o|A1 = a1, . . . , Ak = ak), or simply, p(o|a1, . . . , ak).

The approach is illustrated using our running example, where Topic = Brake is the measure. The associated attributes are MODEL, CATEGORY, STATE, and REGION. The attributes can be grouped into two hierarchies: MODEL determines CATEGORY, and STATE determines REGION. Consider the query "What is the chance that brake problems occur in New York for Chevy?". This corresponds to the query P(BRAKE|STATE = 'NY', CATEGORY = 'Chevy') on the statistical model. Note that this is equivalent to P(BRAKE|STATE = 'NY', REGION = 'East', CATEGORY = 'Chevy') by the hierarchical constraint. Now consider the query "What is the probability of brake related problems according to the category of the automobile?" This is equivalent to a set of queries, P(BRAKE|CATEGORY), one for each possible category. The answer is a set of two distributions, one for trucks and another for sedans.

Queries on the statistical model can be answered from the joint distribution P(O, A1, . . . , Ak). Let us consider a query on the statistical model P(O|a), where a is

4 The term opinion originates from the fact that the approach in PrOLAP has been motivated by opinion pooling, a well-known statistical method for obtaining consensus distributions.


an assignment to the attributes A1, A2, . . . , Ak. The assignment a could be either a complete or a partial assignment, depending on the level of the query. Now, for any o,

P(o|a) = P(o, a) / P(a)

       = ∑_{a′} P(o, a, a′) / ∑_{a′} P(a, a′)   (30)

       = ∑_{a′} P(o, a′) / ∑_{a′} P(a′)   (31)

       = ∑_{a′} P(o, a′) / ∑_{o′} ∑_{a′} P(o′, a′)   (32)

In the above equations, a′ ranges over all complete assignments consistent with a, and o′ ranges over all possible measure values. The derivation from Equation 30 to Equation 31 follows from the fact that a′ determines a, so P(a, a′) = P(a′) and, likewise, P(o, a, a′) = P(o, a′). All terms on the right hand side are known from the joint distribution, so P(o|a) can be computed. In essence, we are marginalizing over all complete assignments by assigning the missing attributes all possible values. A similar equation can be derived to answer range queries, where the query explicitly specifies a range along each dimension to be aggregated.
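Equation 32 can be checked numerically on a toy joint distribution. The following sketch uses made-up probabilities and two illustrative attributes (MODEL, STATE); it marginalizes over the complete assignments a′ consistent with a partial assignment a.

```python
# A small numeric check (toy joint distribution, not the paper's data) of
# Equation 32: P(o|a) = sum_{a'} P(o,a') / sum_{o'} sum_{a'} P(o',a').

# Joint P(O, MODEL, STATE) over complete assignments; key = (o, model, state).
joint = {
    ("yes", "F150",  "NY"): 0.10, ("no", "F150",  "NY"): 0.20,
    ("yes", "F150",  "CA"): 0.05, ("no", "F150",  "CA"): 0.15,
    ("yes", "Civic", "NY"): 0.08, ("no", "Civic", "NY"): 0.22,
    ("yes", "Civic", "CA"): 0.02, ("no", "Civic", "CA"): 0.18,
}

def p_o_given(partial):
    """partial: {tuple position: value}, a possibly incomplete assignment a.
    Marginalize the unassigned attributes over all consistent a'."""
    num = {}
    for key, p in joint.items():
        if all(key[i] == v for i, v in partial.items()):
            num[key[0]] = num.get(key[0], 0.0) + p   # numerator, per o
    denom = sum(num.values())                        # also sum over o'
    return {o: p / denom for o, p in num.items()}

dist = p_o_given({2: "NY"})   # P(O | STATE = 'NY'), marginalizing MODEL
# dist["yes"] == (0.10 + 0.08) / 0.60 == 0.3
```

The denominator is exactly the double sum of Equation 32: all opinion values over all consistent complete assignments.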

Mapping to SQL

We use a star schema to store the probability distribution P(O, A1, A2, . . . , Ak). This enables us to compute the aggregates described in Equation 32 as a star join query. A star schema consists of a fact table and a set of dimension tables. In our case, the fact table has n + m attributes, where n denotes the number of dimensions and m denotes the number of possible values of the opinion measure. The n dimension columns correspond to the leaf-level attributes of the dimensions. Each leaf attribute completely determines the other attributes in the corresponding hierarchy. The m measure columns store the probabilities corresponding to each opinion value. Thus, the measure column corresponding to the opinion value o stores P(o, a1, a2, . . . , ak) for each row that has the leaf attributes among the ai in the corresponding dimension columns. There are n dimension tables, with each table having the attributes corresponding to that dimension. Tuples in a dimension table encode the hierarchy in that dimension. Each dimension table joins with the fact table through the leaf-level attribute of that dimension. For example, Figure 10 shows the star schema for the Brake data. Since the measure can take two values, there are two corresponding columns, BRAKE_YES and BRAKE_NO, to store the probabilities.

Now consider Equation 32, used to compute a point query. It can be seen that the set a′ of tuples consistent with a is in fact equivalent to the star join of the fact table with the dimension tables, with constraints on the attributes of the dimension tables set to the values of those attributes that are assigned in a. This implies that the numerator of Equation 32 can be obtained by a simple Sum

Figure 10: Star Schema (the fact table BRAKE(DAY, CITY, MODEL, SUB_AREA, BRAKE_YES, BRAKE_NO) joins to the dimension tables TIME(DAY, WEEK, QUARTER), LOCATION(CITY, STATE, REGION), AUTOMOBILE(MODEL, CATEGORY), and AREA(SUB_AREA, AREA))

aggregation on the star join query. Since o′ ranges over all opinion values, the denominator will have the Sum aggregate over all the measure columns. With this mapping, Equation 32 can be written as a SQL query over the star schema. The query can be optimized to eliminate from the star join any dimension tables that do not contribute constraints. We elaborate with a few queries on the example star schema.

Query 10. What is the probability of brake related problems in New York for Chevys?

This query is equivalent to: P(BRAKE = 'YES'|STATE = 'NY', CATEGORY = 'Chevy')5

SELECT SUM(BRAKE_YES) / (SUM(BRAKE_YES) + SUM(BRAKE_NO))
FROM BRAKE, LOCATION, AUTOMOBILE
WHERE LOCATION.CITY = BRAKE.CITY AND
      AUTOMOBILE.MODEL = BRAKE.MODEL AND
      LOCATION.STATE = 'NY' AND
      AUTOMOBILE.CATEGORY = 'Chevy'

Note that in this case, the dimension tables TIME and AREA have been dropped from the join, as there is no constraint on their attributes.

Query 11. What is the probability of brake related problems according to the category of the automobile?
This is equivalent to a set of point queries, P(BRAKE = 'YES'|CATEGORY), one for each possible category.

SELECT CATEGORY,
       SUM(BRAKE_YES) / (SUM(BRAKE_YES) + SUM(BRAKE_NO))
FROM BRAKE, AUTOMOBILE
WHERE AUTOMOBILE.MODEL = BRAKE.MODEL
GROUP BY CATEGORY

In this case, a GROUP BY clause is added to group the tuples corresponding to each category.
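This GROUP BY mapping can be exercised end to end with an in-memory SQLite database. The data below is fabricated for illustration, and the schema is a cut-down version of the star schema of Figure 10 (only the AUTOMOBILE dimension and two BRAKE columns).

```python
import sqlite3

# Executable sketch of a Query 11-style aggregation over a toy star schema;
# the probabilities are invented. It mirrors the mapping of Equation 32 onto
# SUM aggregates over the measure columns plus a GROUP BY on the dimension.

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE AUTOMOBILE (MODEL TEXT PRIMARY KEY, CATEGORY TEXT);
CREATE TABLE BRAKE (MODEL TEXT, CITY TEXT, BRAKE_YES REAL, BRAKE_NO REAL);
INSERT INTO AUTOMOBILE VALUES ('F150','Truck'), ('Civic','Sedan');
INSERT INTO BRAKE VALUES ('F150','NY', 0.10, 0.20),
                         ('F150','SF', 0.05, 0.15),
                         ('Civic','NY', 0.08, 0.22);
""")
rows = db.execute("""
    SELECT CATEGORY,
           SUM(BRAKE_YES) / (SUM(BRAKE_YES) + SUM(BRAKE_NO))
    FROM BRAKE, AUTOMOBILE
    WHERE AUTOMOBILE.MODEL = BRAKE.MODEL
    GROUP BY CATEGORY
""").fetchall()
# one probability per category, e.g. Truck: 0.15 / 0.50 = 0.3
```

Dropping the TIME and AREA dimensions here is the same optimization applied to Query 10: unconstrained dimension tables never need to enter the star join.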

5 We have used P(BRAKE = 'YES'|STATE = 'NY', CATEGORY = 'Chevy') instead of P(TOPIC = Brake|STATE = 'NY', CATEGORY = 'Chevy') for readability.


As shown in these examples, the opinion aggregation operators can be mapped to existing SQL operators without defining any new operators. This is quite powerful, since it makes it possible to use existing OLAP infrastructure and existing OLAP optimizations such as precomputation and specialized query evaluation algorithms.
