
Database Driven Discovery of Structure from Partially Structured Data

Dean Williams

School of Computer Science and Information Systems, Birkbeck College, University of London

[email protected]

Abstract. Many database applications store important information in the form of unstructured free text alongside structured data. Conventional database technologies provide only limited facilities for exploiting text, e.g. keyword searching. This research enhances database technology by making use of the related structured data and schema to drive the extraction of structure from the text using Natural Language Processing techniques. The design of a demonstrator system for this approach is given.

1 Motivation for the Research

The original motivation for this research came from a research report by P.J.H. King and A. Poulovassilis [1], in which they define a distinct category of data: partially structured data (PSD). Many database applications rely on storing significant amounts of data in the form of free text. Recent developments in database technology have improved the facilities available for storing large amounts of text. However, the provision for making use of this data largely relies on searching the text for keywords. For a broad class of applications the information to be stored consists partly of some structured data conforming to a schema, with the remainder left as free text; we consider this data to be partially structured. This idea of PSD is distinct from semi-structured data, which is generally taken to mean data that is 'self-describing'. In semi-structured data there may not be a schema defined, but the data itself contains some structural information, e.g. XML tags.

An example of an application based around the use of PSD is operational intelligence gathering, which is used in serious crime investigations. The data collected in this application area takes the form of a report that contains some structured data, such as the name of the police officer making the report, the time and the location, combined with the actual report of the sighting or information received, which is captured as text. The data stored as free text can be essential information and is not just stored in this form as a result of its secondary importance. Often for this class of application, while the basic information to be captured can be anticipated in advance and a schema developed, the detail cannot, due to the unpredictability of the subject content.
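To make the idea concrete, such a report can be pictured as a record with schema-conforming fields plus an untyped text remainder. The sketch below is illustrative only; the field names are invented for exposition and are not taken from any real intelligence system:

```python
# A hypothetical operational intelligence report: the first four fields
# conform to a schema, while "text" holds the unpredictable detail.
report = {
    "officer":  "PC 4721",
    "time":     "2003-02-14 23:10",
    "location": "Camden High Street",
    "source":   "informant",
    "text":     ("Subject seen leaving premises in a dark blue "
                 "Ford Mondeo, registration X188 CLD."),
}

# Only the structured part supports conventional schema-based queries;
# the free text part supports little more than keyword search.
structured = {k: v for k, v in report.items() if k != "text"}
print(sorted(structured))
```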


2 Research Questions

The research will address a number of questions:

Integrating NLP with Database Technology. This research is focused on enhancing database support for text-based data rather than on developing new natural language processing (NLP) techniques. Reviewing the current state of NLP technology and identifying appropriate techniques and technologies will be followed by finding ways to make use of these in a database setting.

Using the Database Schema to Drive the Discovery of Structure. In PSD applications a database schema exists in addition to the free text. Making use of this database schema to drive the extraction of structure from the text is a novel approach, and techniques need to be developed to do this. In existing NLP systems structured information is extracted and then, sometimes, stored in a database as a separate step. In the approach envisaged, the database schema and already existing structured data would be used to direct the discovery of the structure.

Making Use of Discovered Data and Ontologies to Support Queries Against Text Data. The discovered data will be used to support queries against the text data. The information discovered will be a collection of many small fragments of structure, i.e. instance + schema, which will be referred to as the collection of 'facts' discovered. Ontologies are increasingly being used to provide a common reference for terms; linking general and domain-specific ontologies to the collection of extracted facts will enable concept-level queries to be run against the text rather than just looking for keywords. Designing techniques for supporting queries against the text, and evaluating the effectiveness of this approach compared to keyword searching, will be one research question to be answered.
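The intended difference between keyword and concept-level querying can be sketched as follows. The two-entry ontology and the search functions are invented simplifications; a real system would draw its term-to-concept links from WordNet or a domain ontology:

```python
# Toy term-to-concept links; a real system would derive these from
# WordNet or a domain-specific ontology.
ontology = {
    "car": "VEHICLE", "van": "VEHICLE", "mondeo": "VEHICLE",
    "pistol": "WEAPON", "knife": "WEAPON",
}

reports = [
    "suspect drove off in a white van",
    "a knife was recovered at the scene",
]

def keyword_search(term, docs):
    """Plain substring search: all a conventional database offers."""
    return [d for d in docs if term in d]

def concept_search(concept, docs):
    """Find documents mentioning any term linked to the concept."""
    terms = {t for t, c in ontology.items() if c == concept}
    return [d for d in docs if any(t in d.split() for t in terms)]

print(keyword_search("car", reports))       # [] -- misses the van sighting
print(concept_search("VEHICLE", reports))   # finds it via the ontology
```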

Using Discovered Facts to Extend and Enhance Instance Data. Some of the discovered facts may either refer to existing instances in the database or represent new instances of entity types not already contained in the structured data. For example, if a fact representing a "blue ford mondeo registration X188 CLD" is discovered, then this could be added as an instance of the vehicle entity type if it is a new vehicle not previously in the database. Alternatively, the colour could be added as a previously unknown attribute of an already known instance. Methods to decide which of the facts discovered in the text relate to existing entity types, and how to enhance existing instances and add new ones, will be investigated.

Using Discovered Structure to Extend the Database Schema. A further objective to be looked at is the possibility of using the discovered structure to go further and extend not only the instance information but also the schema.

Developing a Workbench for Database Driven Structure Discovery. As far as possible, the discovery of structure and the integration of the discovered data into the database will be automated. However, human intervention will be required in the process, and a workbench will be built to facilitate the process of extending the database in the ways described above.

3 Related Work

This work relates to a number of areas, in particular:

Information Extraction. Information Extraction (IE) [2] is a branch of natural language processing that is concerned with extracting pre-defined entities from text and filling a template with the extracted detail. While Information Retrieval works by identifying and retrieving documents that are of interest, IE extracts data from within the document. Much of the original work on IE was based on participation in the TIPSTER text program, sponsored by the US Defense Advanced Research Projects Agency, which ran from 1991 to 1998. A key feature of the TIPSTER program was the series of Message Understanding Conferences (MUC). At these conferences a number of IE systems competed to solve the same IE tasks, e.g. extraction of people and organisations from newswire reports. The MUC tasks were: named entity recognition, i.e. identifying and classifying entities in text, e.g. telephone numbers, people and dates; coreference resolution, i.e. identifying multiple references to the same entity, including anaphoric references; and template construction, i.e. filling in fixed-format templates describing entities and their attributes (template elements) and identifying relations between template elements (scenario templates).

The degree of success achieved in IE is impressive, with systems reaching levels of over 90% for the named entity recognition tasks, over 60% for coreference, over 80% for template elements and around 50% for scenario templates at MUC-7 [3]. As well as the systems developed by universities and research organisations (e.g. FASTUS [4], LOLITA [5], GATE [6]) to compete in these tests, there is significant interest in industry. IE products available include Quenza [7], for which one of the target markets is crime investigation.
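As a caricature of the template-filling task, a single hand-written pattern can populate a fixed-format template. The pattern and template below are invented for illustration and bear no relation to any actual MUC system, which combined many patterns with parsing, gazetteers and coreference resolution:

```python
import re

# A fixed-format template for a hypothetical "person joins organisation"
# event, with slots to be filled by extraction.
template = {"person": None, "organisation": None}

# One hand-written pattern; capitalised word sequences stand in for
# proper named entity recognition.
pattern = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) joined "
    r"(?P<organisation>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

sentence = "John Smith joined Acme Corp last week."
m = pattern.search(sentence)
filled = {**template, **m.groupdict()} if m else template
print(filled)   # {'person': 'John Smith', 'organisation': 'Acme Corp'}
```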

Graph Based Databases. Current industry standard databases are generally based on the relational model or some form of object data model where the schema must be determined in advance of populating the database. These databases are essentially record based. A graph based data model offers finer semantic granularity and greater flexibility, and allows limitations caused by the inflexibility of the record based models to be overcome.

The Automed [8] project maps heterogeneous data sources onto the HDM graph based data model, which is used as a common data model. A series of reversible schema transformations provides a mediated schema offering both-as-view, which captures more semantic information than either local-as-view or global-as-view. The proposed research can be thought of as a series of schema transformations and mediations over new data sources and will make use of the facilities Automed provides.


The Tristarp project [9] uses a graph based data model and has three layers: a triple store to store the relations, a database programming language layer (functional and logic based languages are used) and an application interface layer where advanced visual interfaces are implemented. Recent work in the Tristarp group has resulted in advanced visualisation tools for graph-based databases becoming available [10] that may be of assistance in the proposed user workbench. This research interest is also reflected in recent products developed in industry: the Sentences [11] DBMS from Lazysoft is based on a quadruple store and sets out to challenge the dominance of the relational model.

Semantic Web. The semantic web [12] is concerned with finding ways to make the information on the World Wide Web usable by computers. A significant amount of the current research and application development effort is directed at the development of ontologies, which provide a commonly accepted terminology amongst a community. There are general, natural language ontologies such as WordNet [13], as well as domain-specific ontologies for a range of application areas. The ability to link words to concepts is important to the techniques envisaged for the proposed research.

Structure Extraction. Database research into ways of extracting structure from text includes the NoDoSE [14] system, which uses a GUI to allow the user to highlight structure in a document, after which the tool builds a grammar for the document. Research into extracting structure from the web has focused on overcoming the need for wrapper programs for each data source by making the discovery of structure as automated as possible. The DIPRE system [15] discovers relations on the web by taking an initial sample of the target relation, finding occurrences of that relation, using the set of occurrences to generate patterns, and finding new occurrences using those patterns. The process is repeated until a large enough set of occurrences is found. The patterns are based on the words found before, in between and following each item in the relation. Snowball [16] takes the DIPRE methodology but uses commercial named entity recognition software to enable patterns to include entity types, e.g. the pattern "<PERSON> was born in <COUNTRY>" becomes possible. The LSD system [17] attempts to extract data from similar sites with different data schemas (e.g. a collection of estate agent websites) and uses a variety of learners to attempt to match the schemas and provide a mediated schema.
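The DIPRE bootstrapping loop can be sketched as below for a made-up (author, book) relation. Real DIPRE patterns also record URL prefixes and the text before and after the pair, and the loop iterates until enough occurrences are found; this sketch performs a single iteration from one seed:

```python
import re

# Invented mini-corpus and a single seed pair for (author, book).
corpus = [
    "As everyone knows, Herman Melville wrote Moby Dick in 1851.",
    "Critics agree that Jane Austen wrote Emma with great wit.",
    "It is said that Leo Tolstoy wrote Anna Karenina during the 1870s.",
]
seeds = {("Herman Melville", "Moby Dick")}

# Capitalised word sequences stand in for entity recognition.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

def learn_middles(pairs, docs):
    """Collect the text found between the items of each known pair."""
    middles = set()
    for a, b in pairs:
        for d in docs:
            m = re.search(re.escape(a) + r"(.+?)" + re.escape(b), d)
            if m:
                middles.add(m.group(1))
    return middles

def extract(middles, docs):
    """Find new pairs joined by any learned middle pattern."""
    found = set()
    for mid in middles:
        pat = re.compile("(" + NAME + ")" + re.escape(mid) + "(" + NAME + ")")
        for d in docs:
            for m in pat.finditer(d):
                found.add((m.group(1), m.group(2)))
    return found

middles = learn_middles(seeds, corpus)   # learns " wrote "
pairs = seeds | extract(middles, corpus)
print(sorted(pairs))                     # three (author, book) pairs
```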

4 Research Approach and Methodology

In order to progress the research we have decided to proceed in the following stages:

Review of Related Work. There are a number of related areas of research to be reviewed, including databases, natural language processing and the semantic web. Our first step was to review these areas and become familiar with current research activity in these fields.


Develop Initial Ideas and Techniques / Build Demonstrator System. Following the review of the related work, we developed some initial ideas on ways to develop database technology to discover structure and integrate the discovered data into a PSD database. After initial techniques and approaches had been identified, the next step was to design and implement a demonstrator system that can be used as a testbed to develop these ideas.

Conduct Experiments Using Demonstrator Testbed. The demonstrator will be used with two sample application domains: Police Operational Intelligence reports and Road Traffic Accident reports. For each, a set of IE components will be assembled and customised, and experiments conducted to review the effectiveness of both the information extraction and database enhancement tasks. A facility in the GATE system allows a set of manually annotated documents to be compared with automatically derived annotations to measure the effectiveness of the IE performed; these tests will be combined with tests to gauge the effectiveness of the database enhancement.

Design Techniques and Develop Prototype. When the results of the demonstrator experiments are obtained, we will describe the techniques we intend to develop, implement them in a prototype system and test their effectiveness in sample application domains.

5 Preliminary Ideas and Results Achieved So Far

Selection of NLP Technology. In reviewing available NLP techniques, similarities appear to exist between the IE approach and database systems. Templates are used to define the entities that will be searched for and their attributes. These templates can be thought of as fragments of some database schema which is to be populated by the IE process. For this reason it was decided to investigate the suitability of IE techniques in the proposed research. Current IE systems have been reviewed against relevant criteria including: the ability to integrate the IE components into a standalone system, the range of IE components available as standard, and the stability of the software. From this analysis the GATE 2 system [18] was selected as the best base for the IE functionality. GATE has been developed at the University of Sheffield and includes a framework for developing language processing systems, especially IE systems. Language processing components can be developed as Java beans implementing interfaces and used within the system. A set of language engineering components is provided, including the ANNIE IE system. A Graphical User Interface combines processing resources, text resources and annotation data stores.

Previous research in analysing Road Traffic Accident reports [19] had investigated extracting location information from text in road traffic accident reports and matching this against a database schema using a 'deeper' NLP technique based on description logic. As a proof of concept, a grammar was developed and used with the standard GATE components to extract location information from the reports, and the results compared favourably with those from the earlier work.


Design of Demonstrator Software. To develop and test our initial ideas, a demonstrator called the Experimental Software To Extract Structure From Text (ESTEST) has been designed and is in development. The application will make use of the Automed infrastructure for the various graph based representations of the data, and will make use of the GATE IE infrastructure to access both existing GATE language processing modules and new modules developed for the system. When constructed, the system will be used for a series of experiments into both the development of database focused annotators and alternative strategies for the later extraction and data appending steps. The system will be implemented in Java, with relational tables created in Oracle and then treated as an Automed data source.

Figure 1 shows an overview of the system; each of the steps is now described:

Import Partially Structured Database Into Automed. The initial partially structured database will be assumed to be in the form of relational database tables, with the text stored as columns on a table. Using the existing Automed relational model definitions, the schema will be mapped onto Automed's Hypergraph Data Model (HDM). Each of the subsequent steps will make use of the Automed model definition and schema transformation facilities to manage the graph manipulations required.
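The flavour of this mapping can be suggested with a much-simplified illustration: one node per table, one node per column, and an edge linking each column to its table. This is not Automed's actual HDM encoding, which also models constraints and uses reversible schema transformations:

```python
# A much-simplified node/edge view of one relational table: one node for
# the table, one per column, and an edge tying each column to the table.
def table_to_graph(table, columns):
    nodes = [("table", table)] + [("column", c) for c in columns]
    edges = [(("table", table), ("column", c)) for c in columns]
    return nodes, edges

nodes, edges = table_to_graph("report", ["officer", "time", "location", "text"])
print(len(nodes), len(edges))   # 5 4
```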

IE Components Annotate the Text. The text will then be processed by IE programs to discover annotations. IE components, e.g. tokenisers, NE recognisers, pattern matchers and their grammars, will need to be specified and customised for the domain. These will be implemented as CREOLE language components called from the GATE infrastructure and bundled together as a GATE application, which will be a Java program. Off the shelf components distributed through the GATE project will be accessed in this way, but the GATE infrastructure is also suitable for building customised components and linking to other resources such as WordNet. The previously configured GATE components will process the text column and build a data structure representing the string as a list of tokens with annotations. The resulting GATE style annotations will then be transformed into the schema described above and made accessible from Automed.
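The annotation structure produced by this step can be pictured as typed spans over the text. The class below is illustrative and is not GATE's actual annotation API:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start: int      # character offset where the span begins
    end: int        # character offset just past the span
    ann_type: str   # e.g. "COLOUR" or "CAR"

text = "dark blue Mondeo parked outside"

# Annotations as a tokeniser / entity recogniser might produce them;
# note that spans may overlap.
anns = [
    Annotation(0, 9, "COLOUR"),    # covers "dark blue"
    Annotation(0, 16, "CAR"),      # covers "dark blue Mondeo"
]

# The CAR annotation covers the COLOUR one, so the colour can later be
# attached to the car as an attribute.
car = anns[1]
covered = [a for a in anns if car.start <= a.start and a.end <= car.end]
print([a.ann_type for a in covered])   # ['COLOUR', 'CAR']
```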

Extracted Information is Matched. It is likely that the result of the previous step will be a large number of annotations, many covering the same text. The next step will be to see which of the annotations represent information in the text that is a reference to entity types stored in the structured part of the database. The task is to target annotations that refer to entities (entity annotations, e.g. CAR but not COLOUR). These entity annotations will have an "annotation schema" to show possible attributes that can be found from the annotations, e.g. cars may have colours and/or registration marks. These will be stored in a separate Automed schema, and each of these annotation schemas will be linked to the text in the original schema. While it is useful that the IE components are separated from those concerned with database evolution, one weakness of the initial ESTEST design is that rules to add new attributes will have to be added separately to both the IE component and the ESTEST matching code. For example, extensions may have to be made to a JAPE grammar to create the new annotations, and then the matching schema will have to be altered independently to make use of the new annotation. It is intended to consolidate entity / attribute definitions in future development.

Fig. 1. Overview of the ESTEST system including the database transformations that take place

Having the grammar rules separate from the annotation schema means that some of the information on why a rule fired to produce an annotation is lost, e.g. a CAR annotation could be fired just on the word "car" or on a combination of other tokens, e.g. a CAR plus a COLOUR. But a colour can refer to things other than cars. ESTEST will assume that any attributes contained in the text covered by an entity annotation refer to that entity, e.g. if the CAR annotation covers the whole of the string "dark blue Mondeo" and there is a COLOUR annotation covering "dark blue", then ESTEST will assume that the COLOUR refers to the CAR. These entity schemas will also be used to map onto data in the structured database. For each attribute in the annotation schema there will be an Automed transformation defined, e.g. to map the SUNROOF attribute to the any_sunroof column in the partially structured database schema. The presence or absence of a particular attribute may give a differing amount of evidence in trying to match to entities in the structured database, e.g. the registration number of a car is better evidence of its identity than its colour or make. Each possible attribute will have a weight from 0 to 1 showing what evidence the presence of that attribute on its own should be given, e.g. 0.97 certainty given a registration number, 0.10 for a colour. If attributes themselves have 'modifier' attributes, e.g. the shade of a colour, then this will show a replacement weight for the attribute to be used if the modifier is present and also matches, e.g. a shade modifier weight of 0.14 will replace the 0.10 for a match on "dark blue" instead of just "blue". The set of possible entity occurrences, or facts, is now extracted using a previously developed search algorithm.

Once the discovery of facts has been completed, each will be taken in turn and compared to all the instances of the entity type that exist in the database (using the annotation schema mappings). When comparing to an instance, each attribute will be compared. If the attributes agree, the match confidence will be increased by the amount of the attribute weight (and if the attribute itself has attributes, each of these will be examined in turn). If the attributes are both present but disagree (e.g. different registration numbers), then the match confidence is decreased by the amount of the attribute weight. If this confidence level is the best found so far, it will be remembered as the best match.

Facts are Appended to the Database. The result of the previous step will be a set of facts, the entity in the database each best matches, and the confidence that this match is correct. These matches will be added to a new schema of derived data in the same format as the original database. Once this has been created, an additional combined schema will be created as a union of the two to give a single view over all the known and suggested information. Matches where the confidence level is low (say between -0.5 and 0.5) will be discarded. Matches where the level is > 0.5 will be regarded as matches, and only the attributes that exist solely in the extracted fact will be added to the database, i.e. these are new attributes which have been discovered for existing entity instances. Matches where the level is < -0.5 will be regarded as new instances of the entity type and will be added. The merged schema will be available for queries over the combined data. Additionally, an 'index' of the text will be available, linking instances in the text to the combined information on the entity instances extracted.
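The weighted matching and thresholding described above might be sketched as follows. The weights and threshold values echo the illustrative figures in the text, but the functions themselves are an invented simplification of the intended algorithm:

```python
# Illustrative evidence weights for vehicle attributes (figures as in
# the text: a registration number is far stronger evidence than colour).
WEIGHTS = {"registration": 0.97, "colour": 0.10}

def match_confidence(fact, instance):
    """Sum the weights of agreeing attributes; subtract the weight when
    an attribute is present in both but disagrees; ignore absent ones."""
    conf = 0.0
    for attr, w in WEIGHTS.items():
        if attr in fact and attr in instance:
            conf += w if fact[attr] == instance[attr] else -w
    return conf

def disposition(conf, low=-0.5, high=0.5):
    """Classify a best-match confidence: discard the uncertain middle
    band, treat strong agreement as a match to an existing instance and
    strong disagreement as a new instance of the entity type."""
    if conf > high:
        return "match"
    if conf < low:
        return "new instance"
    return "discard"

# A discovered fact against a known instance: registration agrees
# (+0.97) but the colour disagrees (-0.10).
fact = {"registration": "X188 CLD", "colour": "blue"}
known = {"registration": "X188 CLD", "colour": "red"}

c = match_confidence(fact, known)
print(round(c, 2), disposition(c))   # 0.87 match
```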

6 Contribution Made by the PhD Research

The research proposed will make the following contributions:

Current database technology fails to meet the requirements of a class of applications with some structured and some free text data. The research will investigate how graph based data models can be used to better make use of the information currently stored as free text.

Information Extraction has had significant success in detecting structure in text. It is hoped that the database driven approach suggested will make further progress by leveraging the database schema in the IE process.

Allowing the structure extracted to be used both to extend the instance data in the structured part of the database and to extend the schema will enhance database technology.

The effectiveness of this new database text handling application will be explored by the implementation of a prototype system. This will include exploring the need for new kinds of user interface for this task.


The scope of the Automed project will be extended to cover unstructured data in addition to the structured and semi-structured data already envisaged [20,21].

References

1 P.J.H. King and A. Poulovassilis. Enhancing database technology to better manage and exploit partially structured data. Technical Report BBKCS-00-14, Birkbeck College, University of London, 2000.

2 D. Appelt. An Introduction to Information Extraction. Artificial Intelligence Communications, 1999.

3 http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/overview.html

4 J. Hobbs, D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel and M. Tyson. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. In Finite State Devices for Natural Language Processing, MIT Press, 1996.

5 http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/dur_muc7.pdf

6 H. Cunningham, R. Gaizauskas, K. Humphreys and Y. Wilks. Experience with a Language Engineering Architecture: Three Years of GATE. In Proceedings of the AISB'99 Workshop on Reference Architectures and Data Standards for NLP, The Society for the Study of Artificial Intelligence and Simulation of Behaviour, Edinburgh, U.K., April 1999.

7 http://www.xanalys.com/quenza.html

8 http://www.doc.ic.ac.uk/automed/

9 http://www.dcs.bbk.ac.uk/TriStarp

10 M.N. Smith and P.J.H. King. Incrementally Visualizing Criminal Networks. 6th International Conference on Information Visualisation, 2002.

11 http://www.lazysoft.co.uk/sentences/default.htm

12 T. Berners-Lee and M. Fischetti. Weaving the Web. Harper, San Francisco, 1999.

13 G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross and K. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235-244, 1990.

14 B. Adelberg. NoDoSE: A tool for semi-automatically extracting structured and semistructured data from text documents. In Proc. SIGMOD'98, pages 283-294, 1998.

15 S. Brin. Extracting Patterns and Relations from the World Wide Web. WebDB Workshop at EDBT'98, 1998.

16 E. Agichtein, L. Gravano, J. Pavel, V. Sokolova and A. Voskoboynik. Snowball: A Prototype System for Extracting Relations from Large Text Collections. SIGMOD Conference, 2001.

17 A. Doan, P. Domingos and A. Levy. Learning Source Descriptions for Data Integration. WebDB (Informal Proceedings), 2000.

18 http://www.gate.ac.uk

19 J. Wu and B. Heydecker. Natural language understanding in road accident data analysis. Advances in Engineering Software, 29(7-9):599-610, 1998.

20 P. McBrien and A. Poulovassilis. A Uniform Approach to Inter-Model Transformations. Proc. CAiSE'99, Heidelberg, June 1999. Springer-Verlag LNCS 1626, pp. 333-348.

21 P. McBrien and A. Poulovassilis. A Semantic Approach to Integrating XML and Structured Data Sources. Proc. CAiSE'01, Interlaken, June 2001. Springer-Verlag LNCS 2068, pp. 330-345.