NL Interface for Database - EJSR 20(4)

Imran Sarwar Bajwa, Shahzad Mumtaz, M. Shahid Naweed (2008), "Database Interfacing using Natural Language Processing", European Journal of Scientific Research, Jul 2008, Vol. 20 No. 04, pp. 844-851

European Journal of Scientific Research ISSN 1450-216X Vol.20 No.4 (2008), pp.844-851 © EuroJournals Publishing, Inc. 2008 http://www.eurojournals.com/ejsr.htm

Database Interfacing using Natural Language Processing

Imran Sarwar Bajwa Department of Computer Science and IT, The Islamia University of Bahawalpur

Shahzad Mumtaz Department of Computer Science and IT, The Islamia University of Bahawalpur

M. Shahid Naweed Department of Computer Science and IT, The Islamia University of Bahawalpur

Abstract

Writing technically correct SQL queries is a complex, skill-intensive task, especially for a novice user. The situation becomes harder when a low-skilled person has to use a database management system for a specific business purpose and must write queries and perform various tasks on his or her own. This scenario demands expertise in understanding and writing accurate, functional queries. The novice user's task can be simplified by providing an easy interface that is already familiar to that user. To resolve these issues, automated software is needed that assists both users and software engineers. The user writes the requirements in a few statements of simple English, and the designed system analyzes the given script. After analysis and extraction of the associated information, the designed system generates the intended SQL queries, which can be run directly. The paper describes a system that creates SQL queries automatically. The designed system provides a quick and reliable way to generate SQL queries, saving the time and budget of both the user and the system analyst.

Keywords: Information extraction, Automatic Query Generation, Knowledge Retrieval, Natural language processing.

1.0. Introduction

Relational databases are the premier way of storing common data repositories. Once the data is stored in a database, an interfacing mechanism is required to communicate with that repository. The conventional way of communicating with a database is to first build a connection stream and then add, delete or update the data in the database by using a standardized interfacing mechanism [1]. Simple command shells are typically used, and they are often bundled with each database product. These command shells are typically simple filters which help a user to log on to the database, execute particular commands and receive output. They provide access to the database from the machine on which the database is actually running [2]. After connecting to a particular database, a user or a programmer requires an interface, and typically that interface is provided by technical languages. These languages are called query languages; they consist of the database commands used to pose questions to a particular database and obtain the intended response. SQL [3] (Structured Query Language) is the most popular query language and is the de facto language of databases today. SQL is the standard tool for database querying. Different database management systems implement this standardized language with minor alterations and adjustments. However, in spite of these proprietary extensions by the vendors, the core of this querying language is the same in all environments.

From an application programmer's point of view, the major novelty of the relational database is that one uses a declarative query language, SQL. Most computer languages are procedural: the programmer tells the computer what to do, step by step, specifying a procedure. Using the SQL interface, the programmer states his requirements and questions, and the RDBMS query planner figures out how to satisfy them [5]. There are two advantages of using a declarative language. The first is that queries no longer depend on the data representation: the RDBMS is free to store data according to its own design requirements [6]. The second is improved software reliability. For various web-based and stand-alone applications, generic SQL is used to keep things simple and straightforward. Despite these advantages, SQL's technical interface makes the language tedious and difficult to learn and use. It is quite difficult to remember the SQL commands and use them accurately and precisely.
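The contrast between procedural and declarative access can be illustrated with a minimal sketch. The table layout (RollNo, Name, Age), the sample rows and both function names are assumptions made only for this illustration, and Python's built-in sqlite3 module stands in for whatever RDBMS is actually used.

```python
import sqlite3

# Hypothetical sample rows (RollNo, Name, Age), used only for illustration.
ROWS = [(1, "Ali", 22), (2, "Sara", 27), (3, "Omar", 31)]

def older_than_procedural(records, limit):
    """Procedural style: the programmer spells out every step."""
    names = []
    for roll_no, name, age in records:
        if age > limit:
            names.append(name)
    return names

def older_than_declarative(records, limit):
    """Declarative style: the query states what is wanted; the engine plans how."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Students (RollNo INTEGER, Name TEXT, Age INTEGER)")
    conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", records)
    cursor = conn.execute(
        "SELECT Name FROM Students WHERE Age > ? ORDER BY RollNo", (limit,)
    )
    names = [row[0] for row in cursor.fetchall()]
    conn.close()
    return names
```

Both functions return the same names; only the second leaves the storage layout and the search strategy to the database engine.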

To resolve these issues, automated software is needed that assists both users and software engineers. As far as this software is concerned, the time it takes to explore all of its facilities and services should be well under a minute, and this information is quite useful for users.

2.0. Problem Description

Modern software engineering requires quick, automated solutions that can create accurate and precise SQL queries automatically. For complex queries, even an expert programmer benefits from assistance in the form of automatic query generation: he can use the automatically generated queries after making appropriate adjustments and alterations, with less effort and in less time than with traditional approaches.

The task of the novice user can be simplified by providing an easy interface that is familiar to that user. To resolve these issues, automated software is needed that assists both users and software engineers. The user writes the requirements in a few statements of simple English, and the designed system analyzes the given script. After analysis and extraction of the associated information, the designed system generates the intended SQL queries, which can be run directly. The designed system creates code automatically without depending on an external environment. It provides a quick and reliable way to generate SQL queries, saving the time and budget of both the user and the system analyst.

3.0. Used Methodology

The understanding and multi-aspect processing of natural languages, also termed "speech languages", is one of the topics of greatest interest in the field of artificial intelligence [8]. Natural languages are irregular and asymmetrical. Traditionally, natural languages are based on informal grammars. Geographical, psychological and sociological factors influence the behaviour of natural languages [12]. Their vocabularies are open-ended and vary from area to area and from time to time. Because of these variations and inconsistencies, natural languages come in different flavours; English, for example, has more than half a dozen well-known flavours around the world [14]. These flavours differ in accent, vocabulary and phonology. These discrepancies and inconsistencies make natural languages more difficult to process than formal languages [13].

English language statements are converted into SQL queries by using the newly designed rule-based algorithm. The SELECT query is the common query used to choose a set of values from a table [4]. An example college database has been used in this research. Student data is retrieved, inserted and deleted by queries generated automatically from simple English text.

3.1. SELECT Query

The ‘SELECT’ query is processed first. A ‘SELECT’ query has four parts, as follows:

SELECT     *               FROM       Students
Keyword    Required Set    Keyword    Table Name

A ‘SELECT’ query can easily be generated from the provided input string, as it has two fixed keywords, ‘SELECT’ and ‘FROM’. The other two required values are the ‘Required Set’ and the ‘Table Name’. To process the natural language text and find the ‘Required Set’ and ‘Table Name’, the conventional norms and grammatical rules of the English language are used. The conventional structure of a simple English sentence is the key to comprehending and analyzing natural language text [13], as in the following example:

“I need names of all students.”

Following is the complete analysis of this simple sentence.

Table 01: Generating SELECT Query from text

Lexicons    Phase-I        Phase-II
I           Noun           ----------
need        Verb           ----------
names       Noun           Field Name
of          Preposition    ----------
all         Noun           *
students    Noun           Table Name
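The analysis in Table 01 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names, the list of available tables and the similarity cutoff are hypothetical, and the standard-library difflib module stands in for whatever "nearest table name" rule the designed system uses. Following the query printed in the paper for this sentence, the Required Set is left as * rather than projected onto the tagged field name.

```python
import difflib

def nearest_table(word, tables, cutoff=0.8):
    """Return the available table whose name is closest to `word`, or None.

    The 0.8 cutoff is an assumed threshold, chosen so that "student"
    still matches "Students" while unrelated words do not.
    """
    by_lower = {name.lower(): name for name in tables}
    hits = difflib.get_close_matches(word.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None

def build_select(sentence, tables):
    """Fill the two open slots of SELECT <Required Set> FROM <Table Name>."""
    words = [w.strip(".,?!") for w in sentence.split()]
    required_set = "*"  # as in the paper's example, where "all" maps to *
    table = None
    for word in words:
        if word:
            table = nearest_table(word, tables) or table
    if table is None:
        raise ValueError("no table name recognised in the sentence")
    return f"SELECT {required_set} FROM {table}"

# Hypothetical list of tables in the college database.
TABLES = ["Students", "Teachers", "Courses"]
query = build_select("I need names of all students.", TABLES)
```

Matching against the list of existing tables, rather than trusting the raw word, is what lets the lower-case "students" resolve to the table name as it is actually spelled in the database.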

In this example the ‘Required Set’ field is filled by the ‘Field Name’ attribute and the ‘Table Name’ field is filled by the ‘Table Name’ attribute, as follows:

Select * from Students

Here the table name is searched for in the array of all tables available in the database. From all available tables, the nearest table name is picked, which is ‘students’ in this example.

3.2. INSERT Query

After the ‘SELECT’ query, the ‘INSERT’ query is processed. An ‘INSERT’ query has five fragments, as follows:

INSERT     INTO       Students      VALUES     (5, 'Ali')
Keyword    Keyword    Table Name    Keyword    Record

An ‘INSERT’ query can also be produced from the given statement, as it has three fixed keywords, ‘INSERT’, ‘INTO’ and ‘VALUES’ [6]. The other two required parameters are the ‘Table Name’ and the ‘Record’. These are extracted using the same rule-based algorithm, as in the following example:

“I want to insert a student whose Roll No. is 5 and Name is Ali.”

Following is the complete analysis of this simple sentence.

Table 02: Generating INSERT Query from text

Lexicons    Phase-I         Phase-II
I           Noun            -----------
want        Verb            -----------
to          Preposition     -----------
insert      Verb            Action
a           Article         -----------
student     Noun            Table Name
whose       Conjunction     -----------
Roll No     Noun            Attribute
is          Helping Verb    -----------
5           Noun            Value
and         Conjunction     -----------
Name        Noun            Attribute
is          Helping Verb    -----------
Ali         Noun            Value
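The Attribute and Value rows of Table 02 suggest how the ‘Record’ fragment might be assembled. The following is a minimal sketch under assumptions: the function name is hypothetical, the table name is taken as already resolved, and the pattern "whose <Attribute> is <Value> and ..." is assumed to hold for the input sentence.

```python
import re

def build_insert(sentence, table):
    """Build INSERT INTO <table> VALUES (...) from "<Attribute> is <Value>" pairs."""
    # Everything after "whose" is assumed to hold the attribute/value clauses.
    _, _, tail = sentence.rstrip(" .").partition(" whose ")
    values = []
    for clause in re.split(r"\s+and\s+", tail):
        _attribute, separator, value = clause.partition(" is ")
        if not separator:
            continue  # not an "<Attribute> is <Value>" clause
        value = value.strip()
        if value.isdigit():
            values.append(value)  # numeric literal, left unquoted
        else:
            values.append("'" + value.replace("'", "''") + "'")  # quoted text
    if not values:
        raise ValueError("no attribute/value pairs recognised")
    return f"INSERT INTO {table} VALUES ({', '.join(values)})"

query = build_insert(
    "I want to insert a student whose Roll No. is 5 and Name is Ali.", "Students"
)
```

The sketch emits the values as literal text because that is the form of the query shown in the paper; a system that executes the result directly would more safely pass the values as bound parameters.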

In this example the ‘Table Name’ field is filled by the ‘Table Name’ attribute and the ‘Record’ field is built from the ‘Value’ attributes. Here the table name is searched for in the array of all tables available in the database. From all available tables, the nearest table name is picked, which is ‘students’ in this example.

3.3. DELETE Query

Like the ‘SELECT’ and ‘INSERT’ queries, the ‘DELETE’ query can also be easily processed. A ‘DELETE’ query has five parts, as follows:

DELETE     FROM       Students      WHERE      Age > 25
Keyword    Keyword    Table Name    Keyword    Condition

The ‘DELETE’ query typically consists of three keywords, ‘DELETE’, ‘FROM’ and ‘WHERE’. The other two required values are the ‘Table Name’ and the ‘Condition’. To find the ‘Table Name’ and ‘Condition’ parameters, the grammatical rules of the English language are used, as in the following example:

“I want to delete the students more than 25 years age.”

Following is the complete analysis of this simple sentence.

Table 03: Generating DELETE Query from text

Lexicons    Phase-I        Phase-II
I           Noun           ---------
want        Verb           ---------
to          Preposition    ---------
delete      Verb           Action
the         Article        ---------
students    Noun           Table Name
more        Preposition    Condition
than        Noun           ---------
25          Noun           Value
years       Noun           ---------
age         Noun           Parameter

For the ‘DELETE’ query, the condition is defined first. In this example the Parameter and Value are combined with the Condition to form the condition clause. The table name is again retrieved from the array of all tables available in the database.
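How the Condition, Value and Parameter of Table 03 combine into a WHERE clause can be sketched as follows. The phrase-to-operator table, the column list and the function name are assumptions introduced for illustration; the paper does not list the comparison phrases the designed system recognises.

```python
import re

# Assumed mapping from English comparison phrases to SQL operators.
COMPARATORS = {"more than": ">", "less than": "<"}

def build_delete(sentence, table, columns):
    """Build DELETE FROM <table> WHERE <Parameter> <op> <Value>."""
    text = sentence.rstrip(" .").lower()
    by_lower = {name.lower(): name for name in columns}
    for phrase, operator in COMPARATORS.items():
        match = re.search(re.escape(phrase) + r"\s+(\d+)", text)
        if match is None:
            continue
        value = match.group(1)
        # The Parameter is the first known column named after the Value.
        following = text[match.end():].split()
        parameter = next((by_lower[w] for w in following if w in by_lower), None)
        if parameter is None:
            raise ValueError("condition found but no matching column")
        return f"DELETE FROM {table} WHERE {parameter} {operator} {value}"
    raise ValueError("no condition recognised in the sentence")

# Hypothetical column names of the Students table.
COLUMNS = ["RollNo", "Name", "Age"]
query = build_delete(
    "I want to delete the students more than 25 years age.", "Students", COLUMNS
)
```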

4.0. Work Flow of Designed System

The designed system, “Computational Linguistics based System for Automatic Database Query Generation”, generates queries automatically. It performs its function in a multi-phase procedure. There are four modules in total: text input acquisition, text comprehension, information retrieval and, ultimately, generation of SQL queries. Following is a brief description of each phase.

4.1. Text input Acquisition

This module acquires the input text scenario. The user provides the business scenario in the form of text strings. The module reads the input text character by character and generates words by concatenating the input characters. This module is the implementation of the lexical phase: lexicons and tokens are generated here. Once the lexicons have been generated, further processing can be performed on the input text.

Figure 01: Lexical analysis of input text string
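The character-by-character word building described for this module can be sketched as below. The function name and the choice to treat every non-alphanumeric character as a separator are assumptions; the paper does not specify how punctuation is handled.

```python
def tokenize(text):
    """Read the input one character at a time and concatenate characters into words."""
    tokens = []
    current = []
    for ch in text:
        if ch.isalnum():
            current.append(ch)  # still inside a word
        elif current:
            tokens.append("".join(current))  # a separator closes the word
            current = []
    if current:
        tokens.append("".join(current))  # flush the last word
    return tokens

tokens = tokenize("I need names of all students.")
```

With this separator rule a phrase such as "Roll No." yields the two tokens "Roll" and "No", so a later phase would have to recombine multi-word attribute names.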

4.2. Text Comprehension

This module reads the output of module one in the form of words or lexicons. These words are categorized into classes such as verbs, helping verbs, nouns, pronouns, adjectives, prepositions and conjunctions. These classes are then used to understand the various parts of the given sentence.

Figure 02: Parts of speech tagging of input text
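A dictionary lookup is one simple way to assign such classes. The sketch below is an assumption, not the tagging method of the designed system: the two word lists are hypothetical and far from complete, and every unknown word falls back to Noun, which happens to reproduce the Phase-I column of Table 01.

```python
# Hypothetical closed-class lexicon; a real system would need a far larger one.
CLOSED_CLASS = {
    "is": "Helping Verb",
    "a": "Article",
    "the": "Article",
    "of": "Preposition",
    "to": "Preposition",
    "and": "Conjunction",
    "whose": "Conjunction",
}
VERBS = {"need", "want", "insert", "delete"}

def tag(tokens):
    """Pair each token with a part-of-speech class."""
    tagged = []
    for token in tokens:
        key = token.lower()
        if key in CLOSED_CLASS:
            label = CLOSED_CLASS[key]
        elif key in VERBS:
            label = "Verb"
        else:
            label = "Noun"  # assumed fallback for every unknown word
        tagged.append((token, label))
    return tagged

tagged = tag(["I", "need", "names", "of", "all", "students"])
```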

4.3. Information Retrieval

This module extracts the keywords of the SQL queries, such as Select, Insert, Delete, From, Into and Where. Keywords are found by matching the tokens against the given array of all possible keywords. These keywords are then used to generate the respective queries. Information such as the table name, field names, number of attributes and logical conditions is also extracted in this phase.

Figure 03: Query information extraction
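The keyword matching step can be sketched as a lookup of each token in a table of statement types. The fixed keyword lists come from Section 3; the function name and the fallback to SELECT when no action word is present are assumptions (Table 01 contains no Action entry, so the paper does not say how a SELECT is recognised).

```python
# Fixed keywords of each statement type, as listed in Section 3.
FIXED_KEYWORDS = {
    "SELECT": ["SELECT", "FROM"],
    "INSERT": ["INSERT", "INTO", "VALUES"],
    "DELETE": ["DELETE", "FROM", "WHERE"],
}

def statement_keywords(tokens):
    """Match tokens against the known action words and return the fixed keywords."""
    actions = [t.upper() for t in tokens if t.upper() in FIXED_KEYWORDS]
    # Assumption: a sentence without an explicit action word is a SELECT request.
    action = actions[0] if actions else "SELECT"
    return FIXED_KEYWORDS[action]

delete_keywords = statement_keywords("I want to delete the students".split())
select_keywords = statement_keywords("I need names of all students".split())
```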

4.4. SQL Queries generation

This module combines the keywords and the other required parameters for a particular query. The SQL query is ultimately generated here according to the rules of the designed algorithm. Because each type of query involves a separate scenario, a separate function has been implemented for each query type.

Figure 04: Generation of SQL Query
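The arrangement of one function per query type can be sketched as a dispatch table. All names here (the three builder functions, the `slots` dictionary and its keys) are hypothetical; the sketch only shows how the extracted parameters might be combined with the fixed keywords.

```python
def make_select(slots):
    return f"SELECT {slots.get('required_set', '*')} FROM {slots['table']}"

def make_insert(slots):
    return f"INSERT INTO {slots['table']} VALUES ({', '.join(slots['record'])})"

def make_delete(slots):
    return f"DELETE FROM {slots['table']} WHERE {slots['condition']}"

# One generator function per supported query type.
GENERATORS = {"SELECT": make_select, "INSERT": make_insert, "DELETE": make_delete}

def generate(action, slots):
    """Pick the generator for `action` and fill it with the extracted slots."""
    try:
        builder = GENERATORS[action.upper()]
    except KeyError:
        raise ValueError(f"unsupported query type: {action}") from None
    return builder(slots)

select_query = generate("select", {"table": "Students"})
insert_query = generate("insert", {"table": "Students", "record": ["5", "'Ali'"]})
delete_query = generate("delete", {"table": "Students", "condition": "Age > 25"})
```

Keeping the generators in a table means that support for a further query type could be added by registering one more function, without touching the dispatch logic.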

5.0. Results and Analysis

After the query generating system was designed and coded, its accuracy and efficiency were tested. For testing, both simple and complex queries were generated by the designed system. Queries from each category (Select, Insert, Delete) were checked.

15 sample queries were generated, and the results are shown in the following table.

Table 04: Accuracy ratio of various types of queries

Types     Simple    Complex    Total
SELECT    14        13         90%
INSERT    13        11         80%
DELETE    14        12         87%

Total Accuracy = 86%

A matrix representing the accuracy of the query generation test (%) for simple and complex queries has been constructed. The overall accuracy for all types of queries is determined by adding the total accuracy of all categories and calculating the average, which is 86% in this case.
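The percentages in Table 04 can be reproduced with a short calculation. It rests on one assumption that the text does not state outright: that the 15 sample queries apply to each cell, i.e. 15 simple and 15 complex queries per type. Under that reading the counts give exactly the reported figures.

```python
SAMPLES_PER_CELL = 15  # assumption: 15 simple and 15 complex queries per type

# Correctly generated queries (simple, complex), copied from Table 04.
CORRECT = {"SELECT": (14, 13), "INSERT": (13, 11), "DELETE": (14, 12)}

def type_accuracy(simple, complex_):
    """Percentage of correct queries for one query type."""
    return 100 * (simple + complex_) / (2 * SAMPLES_PER_CELL)

def overall_accuracy(correct):
    """Average of the per-type accuracies, as described in the text."""
    per_type = [type_accuracy(s, c) for s, c in correct.values()]
    return sum(per_type) / len(per_type)

per_type_rounded = {name: round(type_accuracy(*counts)) for name, counts in CORRECT.items()}
overall_rounded = round(overall_accuracy(CORRECT))
```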

Figure 05: Graphical representation of the results (bar chart; vertical axis 0 to 14; categories SELECT, INSERT, DELETE; series Simple and Complex)

The graph above shows the accuracy of SELECT, INSERT and DELETE queries for simple and complex queries.

6.0. Conclusion

The designed system, “Computational Linguistics based System for Automatic Database Query Generation”, assists both users and software engineers by generating SQL queries automatically. The task of the novice user can be simplified by providing an easy interface that is familiar to that user. To resolve these issues, automated software is needed that assists both users and software engineers. The user writes the requirements in a few statements of simple English, and the designed system analyzes the given script. After analysis and extraction of the associated information, the designed system generates the intended SQL queries, which can be run directly. The designed system creates code automatically without depending on an external environment. It provides a quick and reliable way to generate SQL queries, saving the time and budget of both the user and the system analyst. A graphical user interface has also been provided for entering the input scenario and generating the SQL queries.

7.0. Future Work

There is still room for improvement in the algorithms for generating the intended SQL queries. The current accuracy of query generation is about 80% to 85%. It can be enhanced up to 95% by improving the algorithms and adding learning ability to the system. In this research only three types of queries have been addressed: SELECT, INSERT and DELETE. Other types of queries still require a solution.

References

[1] Allen, J. (1994). Natural Language Understanding. Benjamin-Cummings Publishing Company, New York.

[2] Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge Univ. Press, Cambridge, U.K.

[3] D. DeHaan, D. Toman, M. P. Consens, and T. Ozsu. (2003). A Comprehensive XQuery to SQL Translation using Dynamic Interval Encoding. In SIGMOD.

[4] C. A. Thompson, R. J. Mooney and L. R. Tang, Learning to parse natural language database queries into logical form, in: Workshop on Automata Induction, Grammatical Inference and Language Acquisition (1997).

[5] Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York.

[6] A. Rosenthal, D. Reiner, Extending the Algebraic Framework of Query Processing to Handle Outer joins, Proc. VLDB Singapore 1984, pp. 334-343.

[7] Fagan, J. L. (1989). The effectiveness of a non-syntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2), 115–132.

[8] J. M. Zelle and R. J. Mooney, Learning semantic grammars with constructive inductive logic programming, in: Proceedings of the 11th National Conference on Artificial Intelligence (AAAI Press/MIT Press, Washington, D.C., 1993), pp. 817–822.

[9] Kowalski, G. (1998). Information Retrieval Systems: Theory and Implementation. Kluwer, Boston.

[10] Krovetz, R., & Croft, W. B. (1992). Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10, 115–141.

[11] Losee, R. M. (1988). Parameter estimation for probabilistic document retrieval models. Journal of the American Society for Information Science, 39(1), 8–16.

[12] Losee, R. M. (1996a). Learning syntactic rules and tags with genetic algorithms for information retrieval and filtering: An empirical basis for grammatical rules. Information Processing and Management, 32(2), 185–197.

[13] Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Mass.

[14] Partee, B. H., Meulen, A. t., & Wall, R. E. (1990). Mathematical Methods in Linguistics. Kluwer, Dordrecht, The Netherlands.