NEHRU COLLEGE OF ENGINEERING AND RESEARCH CENTRE
RECORD MATCHING OVER QUERY RESULTS
TABLE OF CONTENTS

Table of Contents
List of Tables
List of Figures
Acknowledgements
Synopsis
Chapters
1 INTRODUCTION
2 LITERATURE SURVEY
3 SYSTEM ANALYSIS
4 TOOLS REQUIRED
5 SYSTEM DESIGN
6 SYSTEM TESTING
7 OBSERVATION AND ANALYSIS
CONCLUSION
REFERENCES
1. INTRODUCTION
Today, more and more databases that dynamically generate web pages in response to
user queries are available on the web. These web databases make up the deep or
hidden web, which is estimated to contain a much larger amount of high-quality,
usually structured information than the static web and to be growing faster.
Most web databases are accessible only via a query interface through which users
can submit queries. Once a query is received, the web server retrieves the
corresponding results from the back-end database and returns them to the user. To
build a system that helps users integrate and, more importantly, compare the query
results returned from multiple web databases, a crucial task is to match the
records from different sources that refer to the same real-world entity.
Benefits:
Record matching, which identifies the records that represent the same real-world
entity, is an important step in data integration. To address the problem of
record matching in the web database scenario, we present an unsupervised, online
record matching method, UDD, which, for a given query, can effectively identify
duplicates among the query result records of multiple web databases. We use two
cooperating classifiers, a weighted component similarity summing (WCSS) classifier
and an SVM classifier, to iteratively identify duplicates in the query results from
multiple web databases. Experimental results show that UDD works well in the web
database scenario, where existing supervised methods do not apply. This system was
designed to overcome the disadvantages of the existing system. Existing record
matching methods are supervised, requiring the user to provide training data, and
are therefore not applicable to the web database scenario, where the records to
match are query results generated dynamically on the fly. Such records are
query-dependent, so a method learned from training examples drawn from previous
query results may fail on the results of a new query. In addition, the existing
system has no efficient storage mechanism for these details, and it consumes more
time.
Record Matching:
Input design is the process of converting user-originated inputs into
computer-usable formats. The goal of input design is to make data entry logical
and free from errors; errors in the input data are controlled by the input design.
This application is developed in a user-friendly manner. The forms are designed so
that, during processing, the cursor is placed at the position where the data must
be entered, and an option for selecting an appropriate input from a set of valid
values is provided for each field. For the client's comfort, the project is
designed with thorough validation on every field and displays error messages with
appropriate suggestions. Help prompts are also provided, so that whenever the user
moves to a new field he or she can understand what is to be entered; if the user
enters erroneous data, an error message is displayed and the user can move to the
next field only after entering correct data. After removal of same-source
duplicates, the "presumed" non-duplicate records from the same source can be used
as training examples, alleviating the burden on users of manually labeling
training examples. Starting from this non-duplicate set, we use two cooperating
classifiers.
2. LITERATURE SURVEY

General
According to Weifeng et al. (2010), most record matching methods are supervised,
requiring the user to provide training data. Since query results are generated
dynamically, such methods are not applicable to the web database scenario. The
record matching work most closely related to unsupervised duplicate detection
(UDD) is Christen's method. Using a nearest-neighbor-based approach, Christen
first performs a comparison step to generate a weighted vector for each pair of
records and then selects certain weight vectors as training examples.
UDD Algorithm
The UDD algorithm is used for online duplicate detection. A linear kernel, which
is as fast as any kernel function, is used in duplicate detection. Two classifiers
are employed to address the duplication problem: the weighted component similarity
summing (WCSS) classifier and the support vector machine (SVM) classifier. In this
algorithm, WCSS plays an important role: it is used to identify some duplicate
vectors when there are no positive examples, and after the iteration begins it is
used again, in cooperation with the SVM, to identify new duplicate vectors. Since
no labeled duplicate vectors are available initially, classifiers that need class
information for training, such as decision trees, cannot be used. WCSS relies on
two intuitions, a duplicate intuition and a non-duplicate intuition: the
similarity between two duplicate records should be close to one, while the
similarity between two non-duplicate records should be close to zero.
Experimental results show that UDD works well in the web database scenario, where
existing supervised methods are not applicable.
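The WCSS idea can be sketched in a few lines of Python (used here only for
illustration; the choice of Jaccard token similarity per field, the field
weights, and the sample records are all hypothetical, not the authors' actual
implementation):

```python
def field_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased whitespace tokens
    (one simple choice of per-component similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def wcss_score(rec1, rec2, weights):
    """Weighted component similarity sum for two records
    (tuples of field strings); weights should sum to 1."""
    return sum(w * field_similarity(f1, f2)
               for w, f1, f2 in zip(weights, rec1, rec2))

# Pairs scoring above a high threshold are treated as duplicates,
# pairs below a low threshold as non-duplicates.
r1 = ("data mining concepts and techniques", "jiawei han")
r2 = ("data mining concepts", "j han")
score = wcss_score(r1, r2, weights=(0.7, 0.3))
```

In the full algorithm, the pairs that WCSS confidently labels at each iteration
become training examples for the SVM, which then labels further pairs.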
Schema Matching
Bin and Kevin (2006) explain that schema matching is fundamental for supporting
query mediation across deep web sources. Schema matching provides co-occurrence
information about attributes, and these attributes are grouped by the DCM
framework for complex matching. The DCM framework consists of data preprocessing,
matching discovery, and matching construction. Before the DCM framework can be
executed, the query schemas in web interfaces, which are embedded in hypertext
markup language (HTML) and not directly machine-readable, must be extracted. Two
experiments evaluate the performance of the DCM framework: the first isolates and
evaluates the effectiveness of the framework itself, and the second evaluates it
on automatically extracted interfaces.
Metaquerier
The goal of the metaquerier is to build a middleware system that helps users find
and query web sources. The metaquerier system has two critical parts that use the
results of matching query interfaces. The first part builds a unified interface
for each domain through which users can issue queries. The second part can be
used to improve the quality of source selection. In the form assistant method, a
form assistant toolkit is developed instead of a complete metaquerier system.
This method helps users translate queries from one interface to other relevant
interfaces: if a user fills in the query form of one source, the form assistant
can suggest translated queries for another source of interest. To enable such
query translation, the form assistant needs to find matching attributes between
the two interfaces. The matching algorithm employs a domain thesaurus that
specifies the correspondences of attributes in the domain. In this scenario
matching and translation errors can be tolerated, and they can be reduced by
providing best-effort query suggestions.
Bayesian Decision Models
Vassilios et al. (2002) suggest that classification is one of the primary tasks
of data mining. The goal of classification is to correctly assign cases to one of
a finite number of classes. Bayesian decision theory is a fundamental statistical
approach to the problem of pattern classification, based on decision problems in
which the relevant probability values are known. In this setting, record matching,
or linking, is the process of identifying records in a data store that refer to
the same real-world entity. There are two types of record matching: the first,
called exact or deterministic, is used primarily when there are unique identifiers
for each record; the other is called approximate. The two principal steps in the
record matching process are searching and matching. In the searching step,
potentially linkable pairs of records are found; the matching step decides whether
a given pair is correctly matched. The theoretical decision rule is: if
r > upper, designate the pair as a link; if lower ≤ r ≤ upper, designate the pair
as a possible link and hold it for clerical review; if r < lower, designate the
pair as a non-link. The upper and lower cutoff thresholds are determined by error
bounds on false matches and false non-matches.
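The three-way decision rule above translates directly into code. This is a
minimal Python sketch; the concrete threshold values are hypothetical, since in
practice they come from the chosen error bounds:

```python
def designate(r: float, lower: float, upper: float) -> str:
    """Threshold decision rule for a record pair with match score r."""
    if r > upper:
        return "link"
    if r >= lower:  # lower <= r <= upper: hold for clerical review
        return "possible link"
    return "non-link"

# Hypothetical cutoffs, as if derived from error bounds on
# false matches and false non-matches.
LOWER, UPPER = 0.3, 0.8
```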
Similarity Functions
Mikhail (2003) proposed two learnable string similarity measures: edit distance
with affine gaps and vector-space similarity based on pairwise classification.
These similarity functions can be trained using a corpus of labeled pairs of
equivalent and non-equivalent strings. Record linkage algorithms fundamentally
depend on string similarity functions for discriminating between equivalent and
non-equivalent record fields. An advanced record linkage system built on these
ideas, adaptive record linkage using induction (MARLIN), learns to combine
trainable string metrics in a two-layer framework for identifying duplicate
database records.
Positive Example Based Learning
Hwanjo et al. (2004) document that positive example based learning (PEBL) for
web page classification eliminates the need to manually collect negative training
examples in preprocessing. The goal is to achieve high classification accuracy
from positive and unlabeled data, and PEBL does so without loss of efficiency in
testing. There are two challenges in this approach: the first is to collect
unbiased unlabeled data from a universal set, and the second is to achieve high
classification accuracy from positive and unlabeled data. The PEBL framework
applies an algorithm called mapping convergence (MC), which uses SVM techniques
to ensure classification accuracy from positive and unlabeled data. PEBL runs
the MC algorithm in the training phase to construct an accurate SVM from positive
and unlabeled data. Once the SVM is constructed, classification performance in
the testing phase is the same as that of a typical SVM in terms of both accuracy
and efficiency. A one-class support vector machine (OSVM) distinguishes one class
of data from the rest of the feature space given only a positive data set; it
draws the class boundary of the positive data in the feature space.
Mapping Convergence Algorithm
The main goal of MC is to achieve high classification accuracy. MC runs in two
stages: the mapping stage and the convergence stage. In the mapping stage, the
algorithm uses a weak classifier to draw an initial approximation of the strong
negative data. The convergence stage uses a second base classifier to make
progressively better approximations of the negative data. MC identifies strong
positive features from the positive and unlabeled data by checking the frequency
of each feature within the positive and unlabeled training data. A feature that
occurs in ninety percent of the positive data but only ten percent of the
unlabeled data is a strong positive feature. Using the list of positive features
that occur more often in the positive training data than in the unlabeled data,
any positive data point can be filtered out; the unlabeled data that remain,
containing only strongly negative examples, are called the strong negatives.
OSVM draws its boundary around the positive data set in feature space, but its
boundary cannot be as accurate as that of MC.
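The frequency test for strong positive features can be sketched as follows
(a minimal illustration in Python; the 90%/10% cutoffs match the example above,
and the toy documents and feature names are hypothetical):

```python
def strong_positive_features(positive, unlabeled,
                             pos_min=0.9, unl_max=0.1):
    """Return features occurring in at least pos_min of the positive
    documents but at most unl_max of the unlabeled documents.
    Each document is represented as a set of features."""
    def freq(f, docs):
        return sum(f in d for d in docs) / len(docs)
    vocabulary = set().union(*positive, *unlabeled)
    return {f for f in vocabulary
            if freq(f, positive) >= pos_min
            and freq(f, unlabeled) <= unl_max}

# Toy example: "cheap" appears in all positive docs, no unlabeled ones.
pos = [{"cheap", "flight"}, {"cheap", "hotel"}]
unl = [{"flight"}, {"hotel"}, {"news"}]
strong = strong_positive_features(pos, unl)
```

Filtering the unlabeled set with these features is what yields the strong
negatives that seed the convergence stage.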
Fast Duplication Detection
Cyju and Sundar (2011) give an idea of duplicate detection, which identifies the
records that represent the same real-world entity. To address the problem of
record matching in the web database scenario, fast duplication detection (FDD) is
introduced. FDD uses clustering to reduce the number of record comparisons. For a
given query, FDD can effectively identify duplicates among the query result
records of multiple web databases. FDD is an efficient approach to the duplicate
detection problem in the web database scenario, where the records to match are
query-dependent and can change dynamically. The steps in duplicate detection
include similarity calculation, k-means clustering, vector generation, dynamic
weight allocation, support vector machine classification, and actual duplicate
identification. In the similarity calculation, a threshold value is set to
separate duplicate vectors from non-duplicate vectors. The k-means clustering
method applies clustering before obtaining the initial potential duplicate and
non-duplicate vectors; clustering is the unsupervised classification of patterns
or data items into groups. The vector generation method generates the potential
duplicate and non-duplicate vectors, which are then clustered more efficiently.
The dynamic allocation of weights to the different fields in each record is
performed by the dynamic allocation algorithm, which relies on the duplicate and
non-duplicate intuitions described earlier. The SVM classifier is a useful
technique for the final data classification.
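The clustering step can be illustrated with a tiny k-means over scalar
similarity values (a sketch only, not FDD's implementation; in FDD the vectors
are multi-dimensional, and the sample values below are hypothetical):

```python
def kmeans_1d(values, k=2, iters=20):
    """Lloyd's k-means on scalar similarity values.
    Returns (centroids, labels) with labels[i] in range(k)."""
    # Seed centroids at the extremes so the low-similarity
    # (non-duplicate-like) and high-similarity (duplicate-like)
    # groups separate immediately.
    cents = [min(values), max(values)] if k == 2 else sorted(values)[:k]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest centroid.
        labels = [min(range(k), key=lambda c, v=v: abs(v - cents[c]))
                  for v in values]
        # Update step: mean of each cluster's members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                cents[c] = sum(members) / len(members)
    return cents, labels

sims = [0.1, 0.15, 0.2, 0.8, 0.9]
cents, labels = kmeans_1d(sims)
```

Only pairs falling in the high-similarity cluster need the more expensive
downstream comparison, which is how clustering reduces record comparisons.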
Duplicate Elimination
According to Vijayaraja et al. (2011), there are two system methodologies:
unsupervised duplicate elimination (UDE) and fuzzy ontological document
clustering (FODC). UDE is based on adjusting the weight set for record fields. It
employs a similarity function to compute field similarity and uses a similarity
vector to represent a pair of records. There are two classifiers in UDE: the
weighted component similarity summing (WCSS) classifier, used to identify
duplicate vectors, and the support vector machine (SVM) classifier, which should
be insensitive to the relative sizes of the positive and negative example sets.
Database schema matching, the process of finding mappings between attributes of
two schemas that semantically correspond to each other, is an essential step in
data integration. In UDE, a global schema for a specific type of record is
predefined, and each query result is matched to this global schema.
Fuzzy Ontological Document Clustering
The fuzzy ontological document clustering (FODC) methodology includes the
following steps. The first step, building a patent ontology, requires a
knowledge-based editing tool called Protégé, which assists domain experts in
defining an ontology schema through a graphical interface. Natural language
processing and terminology training, performed on a set of patent documents, is
the second step. Next, all of the sentence concepts are inferred in the
terminology analyzer step. After the terminology analysis, the knowledge
extraction step computes the concept probabilities for each chunk; the chunks
implying concepts as predicates are the first to enter the ontology. The patent
similarity matching method is the final step in FODC.
Record Linkage
Peter (2004) explains that record, or data, linkage is an important enabling
technology in the health sector. Linked data is a cost-effective resource that
can help improve research into health policies and uncover fraud within the
health system. Significant advances, originating in data mining and machine
learning, have been made in recent years in various areas of record linkage.
Most of these new methods are not yet implemented in current record linkage
systems, or are hidden within black-box commercial software, which makes it
difficult for users to learn about new record linkage techniques or to compare
existing techniques with new ones. As most real-world data collections contain
noisy, incomplete, and incorrectly formatted information, data cleaning and
standardization are important preprocessing steps for successful record linkage;
the cleaned data can also be loaded into data warehouses or used for further
analysis and data mining.
Febrl GUI Structure and Functionality
Febrl, a freely available record linkage system, is implemented in Python, a
free object-oriented programming language available on all major computing
platforms and operating systems. Originally developed as a scripting language,
Python is now used in a large number of applications, ranging from internet
search engines and web applications to the steering of computer graphics for
Hollywood movies and large scientific simulation codes. Many organizations use
Python, including Google and NASA, and thanks to its clear structure and syntax
it is also used by various universities in introductory undergraduate
programming courses. Python is an ideal platform for rapid prototype development,
as it provides data structures such as sets, lists, and dictionaries (associative
arrays) that allow efficient handling of very large data sets, and it includes
many modules offering a wide variety of functionality: for example, excellent
built-in string handling, and a large number of extension modules for database
access and graphical user interface (GUI) development. A general schematic
outline of the record linkage process is given in Figure.1.
Figure.1. General Record Linkage Process.
Entity Resolution
Omar et al. (2009) observe that entity resolution (ER) is the process of
identifying and merging records that represent the same real-world entity.
Record matching is computationally expensive and application specific; for
example, customer information management solutions from one company combine
nickname algorithms, edit distance algorithms, fuzzy logic algorithms, and
trainable engines to match customer records. The assumptions used in this entity
resolution model are pairwise decisions, no confidences, no relationships, and
consistent labels. Pairwise decisions: the match and merge functions operate on
two records at a time, and their result depends only on the data in those
records, not on evidence in other records. No confidences: the functions may
compute numeric similarities internally, but in the end they make a yes-or-no
decision about whether two records match. No relationships: each record contains
all the information that pertains to its entity. Consistent labels: input data
goes through a schema-level integration phase in which incoming data is mapped
to a common set of well-defined labels.
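The pairwise match-and-merge model can be sketched as two small functions
(an illustration only; the shared-value match rule and the sample records are
hypothetical, far simpler than the nickname, edit-distance, and fuzzy-logic
engines the paper describes):

```python
def match(r1: dict, r2: dict) -> bool:
    """Pairwise yes/no decision: two records match if any field
    shares a value. Records map field names to sets of values."""
    return any(r1[k] & r2.get(k, set()) for k in r1)

def merge(r1: dict, r2: dict) -> dict:
    """Merge two matching records by taking the union of their
    field values."""
    return {k: r1.get(k, set()) | r2.get(k, set())
            for k in set(r1) | set(r2)}

a = {"name": {"J. Smith"}, "phone": {"555-1234"}}
b = {"name": {"J. Smith", "John Smith"}, "phone": {"555-9999"}}
merged = merge(a, b) if match(a, b) else None
```

Note how both functions honor the pairwise-decision and no-confidences
assumptions: they inspect only the two records at hand and return a plain
yes/no or a merged record, never a score.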
Input Data Initialization
In the first step, the user selects whether to conduct a project for cleaning
and standardization of a data set, deduplication of a data set, or linkage of two
data sets. The data page of the Febrl graphical user interface (GUI) changes
accordingly, showing either one or two data set selection areas. Several
text-based data set types are currently supported, including the most commonly
used comma-separated values (CSV) file format; SQL database access will be added
in the near future.
Information Integration
Sharma (2009) discusses information integration, querying by keywords, and the
ranking of results in the context of web queries. Although internet search itself
has been around for a while and is used by the general populace as well as by
technical users, querying the web is still in its infancy and is limited to
specific domains and applications. In contrast, querying a structured database
has been around for several decades, and both query answering and query
optimization have advanced to a significant stage. The challenge now is whether
the work on querying structured databases can be redirected meaningfully towards
querying the web. Unlike search, querying the internet requires intelligent
integration of information from multiple web sources to construct meaningful
answers. A number of new issues must be addressed to solve the information
integration problem: how to pose a query, how to determine the sources for
answering it, how to deal with the lack of a schema, how to extract data from web
sources, web query optimization, integration or combination of data from
multiple sources, and ranking of results.
Query Specification
The differences between search, metasearch, and information integration are
described, along with the differences between query processing over a database
and query processing over the internet. A general framework and an architecture
are introduced to highlight the subproblems involved in processing an arbitrary
query over the internet. This brings out a number of new problems, their
complexity, and the current state of their solutions; integrating or combining
data from multiple sources presents the general-purpose problem, along with the
subproblems that must be solved to accomplish true information integration.
3. SYSTEM ANALYSIS
Analysis involves a detailed study of the current system, leading to the
specification of a new system. It is a detailed study of the various operations
performed by a system and their relationships within and outside the system.
During analysis, data are collected on the available files and decision points;
transaction questionnaires are among the tools used. The main outputs of system
analysis are: a specification of what the new system is to accomplish, based on
the user requirements, and a functional hierarchy showing the functions to be
performed by the new system and their relationships with each other. The software
engineering principles applied here, system study and analysis, system
requirement specification, system design, system coding, system testing, and
implementation, were drawn from the book Fundamentals of Software Engineering.
Project Features:
The project features section gives an overview of the various tasks in the
project along with their interpretation in the project's phases. In the system
study and analysis phase, the existing system was compared with the proposed
system by means of the analysis done during the project; the feasibility study
was also carried out in this phase. All the requirements needed for the project,
both software and hardware, are specified in the system requirement
specification phase. The system design phase of the software development life
cycle, in which the entire framework of the software is laid out, is an
inevitable part of the development of the project.
Requirement Analysis:
Systems analysis is the study of systems: sets of interacting entities,
including computer systems. The field is closely related to operations research.
It is also an explicit formal inquiry carried out to help someone, referred to as
the decision maker, identify a better course of action and make a better decision
than he or she might otherwise have made. Roles employing systems analysis
include systems analysts, business analysts, manufacturing engineers, and
enterprise architects. Systems analysis is the process of examining a business
situation for the purpose of developing a system solution to a problem or
devising improvements to the situation. Before the development of any system can
begin, a project proposal is prepared by the users of the potential system and/or
by systems analysts and submitted to an appropriate managerial structure within
the organization. The objective of the system analysis phase is thus to establish
the requirements for the system to be acquired, developed, and installed.
Relevant Analytics:
Relevant analytics capabilities are often interwoven into applications for
sales, marketing, and customer service. Sales analytics let companies monitor and
understand customer actions and preferences through sales forecasting and data
quality management. Applications generally come with predictive analytics to
improve customer segmentation and targeting, and with features for measuring the
effectiveness of online, offline, and search marketing campaigns. Web analytics
have evolved significantly from their starting point of merely tracking mouse
clicks on websites: by evaluating customer buy signals, marketers can see which
prospects are most likely to transact and can also identify those who are bogged
down in the sales process and need assistance.
4. TOOLS REQUIRED
Dot NET Framework an Overview:
The C# language is intended for use in developing software components suitable
for deployment in distributed environments. Source code portability is very
important, especially for programmers already familiar with C and C++, as is
support for internationalization. C# is intended to be suitable for writing
applications for both hosted and embedded systems, ranging from very large
systems that use sophisticated operating systems down to very small systems with
dedicated functions. Although C# applications are intended to be economical with
regard to memory and processing power, the language was not intended to compete
directly on performance and size with C or assembly language. The Microsoft .NET
Framework is a software framework that can be installed on computers running
Microsoft Windows operating systems. It includes a large library of coded
solutions to common programming problems and a virtual machine that manages the
execution of programs written specifically for the framework. The .NET Framework
is a key Microsoft offering and is intended to be used by most new applications
created for the Windows platform. The framework's Base Class Library provides a
large range of features, including user interfaces, data and data access,
database connectivity, cryptography, web application development, numeric
algorithms, and network communications. Programmers combine the class library
with their own code to produce applications. Programs written for the .NET
Framework execute in a software environment that manages the program's runtime
requirements; also part of the .NET Framework, this runtime environment is known
as the Common Language Runtime (CLR). The CLR provides the appearance of an
application virtual machine, so that programmers need not consider the
capabilities of the specific CPU that will execute the program, and it also
provides other important services such as security, memory management, and
exception handling.
ASP.NET:
To enable dynamic web pages generated by server-side scripts, Microsoft
introduced ASP; ASP.NET is its .NET version. An ASP.NET page is a standard HTML
file containing embedded server-side script, and ASP.NET provides the various
advantages of server-side scripting. ASP.NET enables access to information from
data sources, such as back-end databases and text files, that are stored on a web
server or on a computer accessible to a web server. It also enables the use of a
set of programming code, called templates, to create HTML documents. The
advantage of using templates is that content retrieved from data sources, such as
back-end databases and text files, can be inserted dynamically into an HTML
document before it is displayed to users; for this reason, the information need
not be changed manually whenever the contents retrieved from the data source
change.
ASP.NET in .NET Framework:
ASP.NET, the .NET version of ASP, is built on the Microsoft .NET framework,
which Microsoft introduced to help developers create globally distributed
software with internet functionality and interoperability. The elements of an
ASP.NET application include the web forms, the configuration files, and the XML
web service files. XML web services provide a mechanism for programs to
communicate over the internet. ASP.NET run-time services include state
management (view, session, and application state), web security, and caching.
The ASP.NET architecture is shown in Figure.2.
Figure.2. ASP.NET Framework (layers: the ASP.NET page framework and XML web
services sit above the .NET Framework base classes, which sit above the Common
Language Runtime)

ASP.NET application elements:
- Web forms: pages with the .aspx extension and corresponding class files
- Configuration files with the .config extension
- XML web service files

ASP.NET run-time services: state management (view state, session state, and
application state), web security, and caching.
Hardware Specifications:
The hardware specifications vary from time to time; heavier applications demand
higher specifications, and the higher the hardware capability, the greater the
convenience for the developer. The main memory (RAM) is the working memory and
largely determines the speed of the system, while processing speed determines the
speed of program execution. A hard disk drive, used for storing data, is a
non-volatile storage device that stores digitally encoded data on rapidly
rotating platters with magnetic surfaces. Strictly speaking, "drive" refers to a
device distinct from its medium, as with a tape drive and its tape or a floppy
disk drive and its floppy disk. HDDs record data by magnetizing ferromagnetic
material directionally to represent either a 0 or a 1 binary digit, and read the
data back by detecting the magnetization of the material. A typical HDD design
consists of a spindle that holds one or more flat circular disks, called
platters, onto which the data are recorded; the platters are spun at very high
speeds. The hardware specification is shown in Table.1.
Table.1. Hardware Specifications
Main Memory        512 MB or above
Hard Disk          Minimum 10 GB free
Processor          Pentium 4
Processor Speed    1.5 GHz
Software Specifications:
The preferred operating system is Windows XP, since XP is more widely used than
Windows Vista or other versions. Microsoft Visual Studio 2005 is the
development framework, or Integrated Development Environment, for the
application; it is the working turf for the chosen language. SQL Server 2005 is
the database in which all data related to this project is stored. The Microsoft
.NET Framework is a software framework that can be installed on computers
running Microsoft Windows operating systems. It includes a large library of
coded solutions to common programming problems and a virtual machine that
manages the execution of programs written specifically for the framework. The
.NET Framework is a key Microsoft offering and is intended to be used by most
new applications created for the Windows platform; its Base Class Library
provides a large range of features. The software specification is shown in
Table.2.
Table.2. Software Specifications
Operating System   All versions of Windows
.NET Framework     Version 2.0
Front end          Microsoft Visual Studio 2005, Visual C#
Back end           SQL Server 2005
Services of ASP.NET:
The runtime services of ASP.NET interact with the .NET Framework base classes,
which in turn interact with the common language runtime to provide a robust
web-based development environment. An application includes web forms with
controls such as text boxes and list boxes, along with the application logic of
the web application. Configuration files store the configuration settings of an
ASP.NET application. The elements of an ASP.NET application also include web
services, which provide a mechanism for programs to communicate over the
internet. On top of web forms, the ASP.NET runtime services add state
management (session and application state), web security, and caching.
Execution of an ASP.NET File:
To execute an ASP.NET file, the following steps are taken. A web browser sends
a request for an ASP.NET file to a web server using a Uniform Resource Locator.
The web server receives the request, retrieves the appropriate ASP.NET file
from disk or memory, and forwards it to the ASP.NET script engine for
processing. The script engine reads the file from top to bottom and executes
any server-side script it encounters. The processed ASP.NET file is generated
as an HTML document, which the script engine sends back to the web server. The
web server then sends the HTML page to the client, and the browser interprets
the output and displays it.
Features of ASP.NET:
In addition to hiding script commands, ASP.NET has some advanced features that
help develop robust web applications. Code written in ASP.NET is compiled
rather than interpreted, which makes ASP.NET applications faster to execute
than server-side scripts that are interpreted, such as scripts written in
previous versions of ASP. ASP.NET is provided with a rich toolbox and designer
in the Visual Studio .NET IDE; among the features of this powerful tool are
what-you-see-is-what-you-get editing, drag-and-drop server controls, and
automatic deployment. ASP.NET applications are based on the common language
runtime, so the power and flexibility of the .NET platform is available to
ASP.NET application developers. The security features of ASP.NET provide a
number of options for implementing security and restricting user access, and
its scalability features are designed to improve performance in a
multiprocessor environment.
Applications:
ASP.NET ensures that the .NET Framework class library, messaging, and data
access solutions are seamlessly accessible on the web, and these features are
language independent. ASP.NET enables building user interfaces that separate
application logic from presentation content. In addition, the common language
runtime simplifies application development through managed code services such
as automatic reference counting and garbage collection. Because code written in
ASP.NET is compiled and not interpreted, ASP.NET makes it easy to perform
common tasks ranging from form submission and client authentication to web site
configuration.
Extensible Markup Language:
SQL Server includes native support for managing XML data in addition to
relational data. For this purpose, it defines an xml data type that can be used
either as the type of a database column or as a literal in queries. XML columns
can be associated with schemas, and the data being stored is verified against
the schema. XML data is converted to an internal binary representation before
being stored in the database, and specialized indexing methods are available
for it. The data is queried using XQuery. Common language runtime integration
is a main feature of this edition, allowing SQL code to be written as managed
code. In addition, SQL Server defines a new extension that allows query-based
modifications to XML data, and when data is accessed over web services, results
are returned as XML.
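A minimal T-SQL sketch of these features, with hypothetical table and column names, could look as follows: an xml column holds a document, and the built-in query() method evaluates an XQuery expression against it.

```sql
-- Illustrative sketch (BookCatalog and its columns are hypothetical names):
-- an xml column, which could also be typed against an XSD schema collection.
CREATE TABLE BookCatalog (
    BookId  INT PRIMARY KEY,
    Details XML
);

INSERT INTO BookCatalog (BookId, Details)
VALUES (1, '<book><title>Record Matching</title><year>2010</year></book>');

-- XQuery over the stored XML: extract the title element of each book.
SELECT Details.query('/book/title') AS Title
FROM BookCatalog;
```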
Data Storage:
The main unit of data storage is a database, which is a collection of tables
with typed columns. SQL Server supports different data types, including
primitive types such as integer, float, decimal, char (fixed-length character
strings), varchar (variable-length character strings), binary (for unstructured
blobs of data), and text (for textual data), among others. The rounding of
floats to integers uses either symmetric arithmetic rounding or symmetric round
down (fix), depending on the arguments. Microsoft SQL Server also allows
user-defined composite types to be defined and used, and it makes server
statistics available as virtual tables and views called dynamic management
views. In addition to tables, a database can also contain other objects
including views, stored procedures, indexes, and constraints, along with a
transaction log. The data in the database is stored in primary data files with
the extension .mdf. The amount of memory available to SQL Server decides how
many pages will be cached in memory.
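A table definition using several of the primitive types named above might be sketched as follows (the Book table and its columns are hypothetical, chosen to match the book-search scenario of this project):

```sql
-- Illustrative sketch: one table exercising the main primitive types.
CREATE TABLE Book (
    BookId     INT            PRIMARY KEY,  -- integer
    Price      DECIMAL(8, 2)  NOT NULL,     -- exact numeric
    Rating     FLOAT,                       -- approximate numeric
    IsbnCode   CHAR(13),                    -- fixed-length character string
    Title      VARCHAR(200),                -- variable-length character string
    CoverImage VARBINARY(MAX)               -- unstructured blob of data
);
```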
Features of SQL:
Storage space allocated to a database is divided into sequentially numbered
pages, each 8 KB in size. A page is the basic unit of input-output for SQL
Server operations. Each page is marked with a 96-byte header which stores
metadata about the page, including the page number, the page type, the free
space on the page, and the ID of the object that owns it. The page type defines
the data contained in the page: data stored in the database, an index, an
allocation map which holds information about how pages are allocated to tables
and indexes, a change map which holds information about the changes made to
other pages since the last backup or logging, or large data types such as image
or text. While the page is the basic unit of an operation, space is actually
managed in terms of an extent, which consists of 8 pages. A database object can
either span all 8 pages of an extent (a uniform extent) or share an extent with
up to 7 other objects (a mixed extent). A row in a database table cannot span
more than one page, and so is limited to 8 KB in size. Log files are identified
with the .ldf extension, and secondary data files, identified with an .ndf
extension, are used to store optional metadata.
Database Table:
A table can be split into multiple partitions in order to spread a database
over a cluster. Rows in each partition are stored in either a B-tree or a heap
structure. If the table has an associated index to allow fast retrieval of
rows, the rows are stored in order according to their index values, with a
B-tree providing the index. The data is in the leaf nodes of the tree, with the
other nodes storing the index values through which the leaf data is reachable
from the respective nodes. If the index is non-clustered, the rows are not
sorted according to the index keys. An indexed view has the same storage
structure as an indexed table, while a table without an index is stored in an
unordered heap structure. Both heaps and B-trees can span multiple allocation
units of data storage.
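The two index kinds described above can be sketched in T-SQL as follows (the Book table and column names are hypothetical): a clustered index stores the rows themselves in B-tree order on the key, while a non-clustered index keeps a separate B-tree that points back at the rows.

```sql
-- Illustrative sketch: rows are physically ordered by IsbnCode.
CREATE CLUSTERED INDEX IX_Book_Isbn
    ON Book (IsbnCode);

-- A secondary B-tree on Title; the rows themselves stay in clustered order.
CREATE NONCLUSTERED INDEX IX_Book_Title
    ON Book (Title);
```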
Data Retrieval:
The main mode of retrieving data from a database is querying it. A query is
expressed using a variant of SQL called T-SQL, which declaratively specifies
what is to be retrieved. The query is processed by the query processor, which
figures out the sequence of steps necessary to retrieve the requested data;
this sequence of actions is called a query plan. When a row is updated, both
the old and the new versions of the row are stored and maintained, though the
old versions are moved out of the database into a system database identified as
tempdb. While a row is in the process of being updated, other requests are not
blocked (unlike locking) but are executed on the older version of the row. If
the other request is also an update statement, it results in two different
versions of the row, both of which are stored by the database and identified by
their respective transaction IDs.
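A query like the following (table and column names are hypothetical) illustrates this declarative style: it states which rows are wanted, and the query processor decides the plan for fetching them.

```sql
-- Illustrative T-SQL query: the statement says *what* to retrieve;
-- the query plan chosen by the processor says *how*.
SELECT Title, Author, Year
FROM Book
WHERE Author = 'Knuth'
ORDER BY Year;
```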
Query Processing:
There might be multiple ways to process the same query. For example, for a
query that contains a join and a selection, executing the join on both tables
and then the selection would give the same result as selecting from each table
first and then joining, but the two approaches result in different execution
plans. In such cases, the database server chooses the plan that is expected to
yield the results in the shortest possible time. Given a query, the query
optimizer looks at the database schema, the database statistics, and the system
load at that time. It then decides in which sequence to access the tables
referred to in the query, in which sequence to execute the operations, and what
access method to use for each table. If a table has an associated index, the
optimizer decides whether the index should be used: if the index is on a column
that is not unique for most of the rows (low selectivity), it might not be
worthwhile to use the index to access the data. Queries can also be
encapsulated as stored procedures, which take the values sent by the client as
input parameters and send back results as output parameters; stored procedures
can call user-defined functions and other stored procedures, including the same
stored procedure recursively up to a set number of times.
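The join-versus-selection example above can be sketched as two logically equivalent T-SQL formulations (tables and columns are hypothetical) that the optimizer may turn into different execution plans:

```sql
-- Formulation 1: join first, then filter.
SELECT s.ShopName, b.Title
FROM BookShop AS s
JOIN Book AS b ON b.ShopId = s.ShopId
WHERE b.Year = 2010;

-- Formulation 2: filter first, then join; same result, different plan.
SELECT s.ShopName, b.Title
FROM (SELECT ShopId, Title, Year FROM Book WHERE Year = 2010) AS b
JOIN BookShop AS s ON s.ShopId = b.ShopId;
```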
5. SYSTEM DESIGN
Systems design is the process or art of defining the architecture, components,
modules, interfaces, and data for a system to satisfy specified requirements. One
could see it as the application of systems theory to product development. There is
some overlap with the disciplines of systems analysis, systems architecture and
systems engineering. The broader topic of product development blends the
perspective of marketing, design, and manufacturing into a single approach. Design
is the act of taking the marketing information and creating the design of the product
to be manufactured. Systems design is therefore the process of defining and
developing systems to satisfy specified requirements of the user. Until the 1990s
systems design had a crucial and respected role in the data processing industry. In the
1990s standardization of hardware and software resulted in the ability to build
modular systems. The increasing importance of software running on generic
platforms has enhanced the discipline of software engineering. Object-oriented
analysis and design methods are becoming the most widely used methods for
computer system design.
Unified Modeling Language:
The Unified Modeling Language (UML) has become the standard language used in
object-oriented analysis and design. It is widely used for modeling software
systems and is increasingly used for the high-level design of non-software
systems and organizations. Graphical system design is a modern approach to
designing, prototyping, and deploying embedded systems that combines open
graphical programming with commercial off-the-shelf (COTS) hardware. Graphical
system design simplifies development, resulting in higher-quality designs with
an easier migration to custom design. It is basically a way for domain experts
or non-embedded experts to undertake embedded designs for which they would
traditionally need to outsource to an embedded design expert. This approach to
embedded system design is a super-set of Electronic System Level (ESL) design:
graphical system design expands on the EDA-based ESL definition to include
other types of embedded system design, including industrial machines and
medical devices, and many of these expanded applications can be described as
long-tail applications. Graphical system design is a complementary but
encompassing approach that includes embedded and electronic system design,
implementation, and verification tools. ESL and graphical system design are
really part of the same movement: higher abstraction and more design
automation, aimed at solving the real engineering challenges that designers
face today and at addressing design flaws introduced at the specification stage
so that they are detected well before validation, for on-time product delivery.
Integrated Development Environment:
Microsoft Visual Studio is an Integrated Development Environment from
Microsoft. It can be used to develop console and graphical user interface
applications, Windows Forms applications, web sites, web applications, and web
services, in both native code and managed code, for platforms including
Microsoft Windows, Windows Mobile, Windows CE, the .NET Framework, the .NET
Compact Framework, and Microsoft Silverlight. Visual Studio includes a code
editor supporting IntelliSense as well as code refactoring. The integrated
debugger works both as a source-level debugger and a machine-level debugger.
Other built-in tools include a forms designer for building GUI applications, a
web designer, a class designer, and a database schema designer. It also accepts
plug-ins that enhance the functionality at almost every level, including
support for source control systems like Subversion and Visual SourceSafe, new
tool sets like editors and visual designers for domain-specific languages, and
toolsets for other aspects of the software development lifecycle. Visual Studio
supports languages by means of language services, which allow the code editor
and debugger to support nearly any programming language, provided a
language-specific service exists. It supports languages such as M, Python, and
Ruby, which are available through language services installed separately, as
well as JavaScript and CSS. Language-specific versions of Visual Studio also
exist, which provide more limited language services to the user; these
individual packages are called Microsoft Visual Basic, Visual J#, Visual C#,
and Visual C++. Microsoft provides Express editions of its Visual Studio 2010
components Visual Basic, Visual C#, Visual C++, and Visual Web Developer at no
cost. Visual Studio 2010, 2008, and 2005 Professional Editions, along with
language-specific versions of Visual Studio 2005, are available for free to
students as downloads through Microsoft's DreamSpark program.
Code Editor:
Visual Studio, like any other IDE, includes a code editor that supports syntax
highlighting and code completion using IntelliSense, not only for variables,
functions, and methods but also for language constructs like loops and queries.
IntelliSense is supported for the included languages as well as for XML,
Cascading Style Sheets, and JavaScript when developing web sites and web
applications. Autocomplete suggestions pop up in a modeless list box overlaid
on top of the code editor; from Visual Studio 2008 onwards, the list box can be
made temporarily semi-transparent to see the code obstructed by it. The code
editor is used for all supported languages. The Visual Studio code editor also
supports setting bookmarks in code for quick navigation. Other navigational
aids include collapsing code blocks and incremental search, in addition to
normal text and regex search. The code editor also includes a multi-item
clipboard and a task list. It supports code snippets, which are saved templates
for repetitive code that can be inserted into code and customized for the
project being worked on. Visual Studio provides background compilation, also
called incremental compilation: as code is being written, Visual Studio
compiles it in the background in order to provide feedback about syntax and
compilation errors, which are flagged with a red wavy underline, while warnings
are marked with a green underline. Background compilation does not generate
executable code, since it requires a different compiler than the one used to
generate executable code. Background compilation was initially introduced with
Microsoft Visual Basic but has now been expanded to all included languages.
Debugger:
Visual Studio includes a debugger that works both as a source-level debugger
and a machine-level debugger. It works with both managed code and native code
and can be used for debugging applications written in any language supported by
Visual Studio. In addition, it can attach to running processes and monitor and
debug those processes. If source code for the running process is available, it
displays the code as it is being run; if source code is not available, it can
show the disassembly. The Visual Studio debugger can also create memory dumps
and load them later for debugging. Multi-threaded programs are also supported.
The debugger can be configured to be launched when an application running
outside the Visual Studio environment crashes and the system reports the error.
The debugger allows setting breakpoints (which allow execution to be stopped
temporarily at a certain position) and watches (which monitor the values of
variables as the execution progresses). Breakpoints can be conditional, meaning
they get triggered only when the condition is met. Code can be stepped over,
i.e., run one line of source code at a time. During debugging, the Visual
Studio debugger lets certain functions be invoked manually from the Immediate
window, with the parameters to the method supplied at that window.

The system design that follows is described with data flow diagrams. The DFD is
also known as a bubble chart: it is a simple graphical formalism that can be
used to represent a system in terms of the data input to the system, the
various processing carried out on that data, and the output data generated by
the system.
Search Engine:
Most web databases are only accessible via a query interface through which
users can submit queries. Once a query is received, the web server retrieves
the corresponding results from the back-end database and returns them to the
user. To build a system that helps users integrate and, more importantly,
compare the query results returned from multiple web databases, a crucial task
is to match the records from different sources that refer to the same
real-world entity, as shown in Figure.3.
Figure.3. Search Engine (a user query is run against multiple databases, and
the result sets are combined for the user)
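In the book-search scenario of this project, a naive version of the matching step can be sketched in T-SQL (the result tables and their columns are hypothetical): records from two shops are paired when they refer to the same book, using the ISBN as the identifying field and falling back on title and author.

```sql
-- Illustrative sketch of naive record matching across two query result sets.
SELECT a.Title, a.BookShop AS ShopA, b.BookShop AS ShopB
FROM ShopA_Results AS a
JOIN ShopB_Results AS b
  ON a.ISBN = b.ISBN                                  -- exact key match
  OR (a.Title = b.Title AND a.Author = b.Author);     -- fallback field match
```

The UDD method presented in this project goes beyond such exact matching by identifying duplicates without training data, but the query above shows where record matching sits in the integration pipeline.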
Level 0:
The authentication module is used for security purposes. Only an authenticated
person can perform all of the operations in this project, and that
authentication is given to the administrator only; the administrator is the
only authorised user of this record matching software. Ordinary users are only
allowed to search, as shown in Figure.4.
Figure.4.Level 0
Level 1:
The fields of several records are uploaded, within which the search must
proceed in order to find the matching records. Only the administrator is
allowed to upload the book data. This module contains the book shop name,
author name, ISBN number, book title, year, and the specified book for
uploading, as shown in Figure.5.
Figure.5.Level 1
Use Case Diagram:
Many book shops exist, and each book shop has a separate database for future
reference. A book shop that wants to upload its books to the website must
create a database on the web for its books. Before uploading any book, the shop
has to fill in the registration form on the website, which contains the book
shop name, the address of the book shop, a contact number, a mail ID, and a
user name; each book shop also has a separate password for later login. Users
are allowed to search for books, as shown in Figure.6.
Figure.6. Use Case Diagram (users submit searches through the search engine,
while the book shops upload books into their respective databases)
6. SYSTEM TESTING
Software testing is an investigation conducted to provide stakeholders with
information about the quality of the product or service under test. Software
testing also provides an objective, independent view of the software that
allows the business to appreciate and understand the risks of implementing the
software. Testing techniques include the process of executing a program or
application with the intent of finding software bugs. Software testing can also
be stated as the process of validating and verifying that a software program
meets the business and technical requirements that guided its design and
development, works as expected, and can be implemented with the same
characteristics.
Software Testing:
Software testing can be implemented at any time in the development process
depending on the testing method employed. However, most of the test effort occurs
after the requirements have been defined and the coding process has been completed.
As such, the methodology of the test is governed by the software development
methodology adopted. Different software development models will focus the test
effort at different points in the development process. Newer development models,
such as Agile, often employ test driven development and place an increased portion
of the testing in the hands of the developer, before it reaches a formal team of testers.
In a more traditional model, most of the test execution occurs after the requirements
have been defined and the coding process has been completed. Testing can never
completely identify all the defects within software. Instead, it furnishes a criticism or
comparison that compares the state and behavior of the product against oracles,
principles or mechanisms by which someone might recognize a problem. These
oracles may include (but are not limited to) specifications, contracts, comparable
products, past versions of the same product, inferences about intended or expected
purpose, user or customer expectations, relevant standards, applicable laws, or other
criteria. Every software product has a target audience. For example, the audience for
video game software is completely different from banking software. Therefore, when
an organization develops or otherwise invests in a software product, it can assess
whether the software product will be acceptable to its end users, its target audience,
its purchasers, and other stakeholders. Software testing is the process of attempting
to make this assessment. A primary purpose for testing is to detect software failures
so that defects may be uncovered and corrected. This is a non-trivial pursuit. Testing
cannot establish that a product functions properly under all conditions but can only
establish that it does not function properly under specific conditions.
Scope:
The scope of software testing often includes examination of code as well as
execution of that code in various environments and conditions, examining
whether the code does what it is supposed to do and what it needs to do. In the
current culture of software development, a testing organization may be separate
from the development team, and there are various roles for testing team
members. Information derived from software testing may be used to correct the
process by which software is developed. Functional testing refers to tests that
verify a specific action or function of the code. These are usually found in
the code requirements documentation, although some development methodologies
work from use cases or user stories. Functional tests tend to answer questions
such as whether the user can do this, or whether this particular feature works.
Non-functional testing refers to aspects of the software that may not be
related to a specific function or user action, such as scalability or security.
Non-functional testing tends to answer questions such as how many people can
log in at once, or how easy it is to hack the software. Not all software
defects are caused by coding errors. One common source of expensive defects is
requirement gaps: unrecognized requirements that result in errors of omission
by the program designer. A common source of requirement gaps is non-functional
requirements such as testability, scalability, maintainability, usability,
performance, and security.
Tolerance:
Software faults occur through the following process: a programmer makes an
error (mistake), which results in a defect (fault, bug) in the software source
code. If this defect is executed, in certain situations the system will produce
wrong results, causing a failure. Not all defects will necessarily result in
failures; for example, defects in dead code will never result in failures. A
defect can turn into a failure when the environment is changed. Examples of
such changes in environment include the software being run on a new hardware
platform, alterations in source data, or interaction with different software. A
single defect may result in a wide range of failure symptoms. A common cause of
software failure, real or perceived, is a lack of compatibility with other
application software, operating systems or operating system versions (old or
new), or target environments that differ greatly from the original, such as a
terminal or GUI application intended to be run on the desktop now being
required to become a web application that must render in a web browser. For
example, in the case of a lack of backward compatibility, this can occur
because the programmers develop and test software only on the latest version of
the target environment, which not all users may be running. This results in the
unintended consequence that the latest work may not function on earlier
versions of the target environment, or on older hardware that earlier versions
of the target environment were capable of using. A very fundamental problem
with software testing is that testing under all combinations of inputs and
preconditions (initial state) is not feasible, even with a simple product. This
means that the number of defects in a software product can be very large, and
defects that occur infrequently are difficult to find in testing. More
significantly, the non-functional dimensions of quality (how it is supposed to
be versus what it is supposed to do), such as usability, scalability,
performance, compatibility, and reliability, can be highly subjective;
something that constitutes sufficient value to one person may be intolerable to
another.
Validation:
There are many approaches to software testing. Reviews, walkthroughs, or
inspections are considered static testing, whereas actually executing programmed
code with a given set of test cases is referred to as dynamic testing. Static testing can
be, and unfortunately in practice often is, omitted. Dynamic testing takes place when
the program itself is used for the first time, which is generally considered the
beginning of the testing stage. Dynamic testing may begin before the program is
100% complete, in order to test particular sections of code (modules or discrete
functions). Typical techniques for this are either using stubs/drivers or execution
from a debugger environment. For example, spreadsheet programs are, by their very
nature, tested to a large extent interactively on the fly, with results displayed
immediately after each calculation or text manipulation. Though controversial,
software testing may be viewed as an important part of the software quality
assurance (SQA) process. In SQA, software process specialists and auditors take a
broader view of software and its development. They examine and change the
software engineering process itself to reduce the number of faults that end up in the
delivered software: the so-called defect rate.
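The stubs/drivers technique mentioned above can be sketched in a minimal, hypothetical Python example (the function and data names are illustrative, not part of the project): a stub stands in for the real back-end database so a search routine can be tested before the rest of the system is complete.

```python
def search_books(query, fetch):
    """Return titles containing `query`, using an injected fetch function."""
    return [title for title in fetch() if query.lower() in title.lower()]

def fetch_stub():
    # Stub: stands in for the real database call during early testing.
    return ["Data Mining Basics", "Web Databases", "Record Matching"]

# The driver exercises search_books against the stub instead of a live database.
print(search_books("data", fetch_stub))  # ['Data Mining Basics', 'Web Databases']
```

Because the data source is passed in as a parameter, the same search function can later be wired to the real database without change.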
Software Faults:
Software faults occur through the following process. A programmer makes an error
(mistake), which results in a defect (fault, bug) in the software source code. If this
defect is executed, in certain situations the system will produce wrong results,
causing a failure. Not all defects will necessarily result in failures; for example,
defects in dead code will never result in failures. A defect can turn into a failure
when the environment is changed. Test techniques include, but are not limited to, the
process of executing a program or application with the intent of finding software
bugs. Software testing can also be stated as the process of validating and verifying
that a software program meets its requirements. What constitutes an acceptable
defect rate depends on the nature of the software.
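The error-defect-failure chain can be illustrated with a small, hypothetical example (not taken from the project's code): a defect that stays latent until a particular input executes the faulty path.

```python
def average(values):
    # Defect: the empty-list case is not handled; it causes a failure
    # only when that input actually occurs.
    return sum(values) / len(values)

print(average([2, 4, 6]))   # 4.0 - the defect is latent, no failure observed
try:
    average([])             # the defective path executes
except ZeroDivisionError:
    print("failure observed")
```

This is why testing cannot prove the absence of defects: every input the tests happened to use above worked, yet the defect was present all along.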
7. OBSERVATION AND ANALYSIS
The record matching system is proposed to overcome the disadvantages of the
existing system. The existing system is based on predefined matching rules hand-
coded by domain experts, or on matching rules learned offline by some learning
method from a set of training examples. Hand-coding and offline-learning
approaches are not appropriate for two reasons. First, the full data set is not
available beforehand, so good representative data for training are hard to obtain.
Second, even if good representative data are found and labelled for learning, the
rules learned on representatives of a full data set may not work well on a partial and
biased part of that data set.
Record Matching:
To overcome the disadvantages of the existing system, a new record matching
method called unsupervised duplicate detection (UDD) is proposed. UDD is used
for identifying duplicates among records in query results from multiple Web
databases. The method focuses on techniques for adjusting the weights of the record
fields when calculating the similarity between two records; two records are
considered duplicates if they are similar enough on their fields. In the absence of
labelled training examples, the method uses a sample of universal data, consisting of
record pairs from different data sources together with record pairs from the same
data source, as an approximation for a negative training set. The experimental results
verify that this approximation is reasonable, since the proportion of duplicate
records in the universal set is usually much smaller than the proportion of
non-duplicates. The two main advantages of this system are that it solves the online
duplicate detection problem and that it provides query-specific record matching.
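The weighted field-similarity idea can be sketched as follows. This is a minimal Python illustration; the field names, weights, threshold, and the Jaccard word-set measure are assumptions made for the example, not the report's actual implementation.

```python
def field_similarity(a, b):
    """Jaccard similarity over the lowercase word sets of two field values."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def record_similarity(r1, r2, weights):
    """Weighted sum of per-field similarities; weights are assumed to sum to 1."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())

# Two hypothetical query results that refer to the same real-world entity.
r1 = {"title": "Record Matching over Query Results", "author": "W. Su"}
r2 = {"title": "Record Matching over Query Results", "author": "Weifeng Su"}
weights = {"title": 0.7, "author": 0.3}

sim = record_similarity(r1, r2, weights)
# The pair is flagged as a duplicate when sim exceeds a chosen threshold.
print(sim > 0.5)  # True
```

Adjusting the weights (here fixed by hand) is precisely what UDD does automatically, so that fields that discriminate well between duplicates and non-duplicates count for more.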
Programming Environment:
Microsoft Visual Studio is an integrated development environment (IDE) from
Microsoft. It can be used to develop console and graphical user interface
applications, along with Windows Forms applications, web sites, web applications,
and web services, in both native and managed code, for all platforms supported by
Microsoft Windows, Windows Mobile, Windows CE, the .NET Framework, the
.NET Compact Framework, and Microsoft Silverlight. The integrated development
environment is shown in Figure.7.
Figure.7. Developing Environment
Form Screen Shots:
The screen shots provide an overview of the various forms required for the
implementation of the project. This system covers the registration of books, their
searching, and their uploading. The project uses several forms and several tables in
the database, including user registration, user login, new book registration, searching
based on number and name, the user registration report, and the book report.
Home Page:
The home page provides options for registration, the search engine, and contact
details. It also explains record matching, so that users can understand record
matching and the function of this system. Clicking the registration button opens a
registration page, through which the user can register by giving the company name
or book name. The search engine button can be clicked to search for books by book
name, author name, or number. The home page is shown in Figure.8.
Figure.8.Home page
User Registration:
The user registration page lets new users register with the system. Here, the user
registers by giving the company name, address, contact number, email-id, user name,
and password. The page below shows the registration of a bookshop named winner.
By clicking the submit button, the user confirms the registration. After registration,
the user can log in to the system using the user name and password. The user
registration page is shown in Figure.9.
Figure.9.User Registration Page
User Login:
The user login page allows registered users to log in to the system by giving a user
name and password. Only registered users can log in. If the user name and password
entered are correct, the login succeeds; otherwise, it fails. A logged-in user can add
new books and search for required books whenever necessary. The user login page is
shown in Figure.10.
Figure.10.User Login Page
New Book Registration:
This page provides provisions for adding new books by giving the book shop name,
ISBN number, author name, book title, publication, URL (Uniform Resource
Locator), year of publication, and cost. It also allows uploading books: the user can
upload a book from the provided URL. The book registration page is shown in
Figure.11.
Figure.11.Book Registration Page
ISBN Number Wise Searching:
The searching page is reached by clicking the search engine button on the home
page. Searching can be done by ISBN number, author name, book name, or year. In
ISBN number wise searching, the search is based on the ISBN number; the user can
search by giving the whole ISBN number or a part of it. The result is returned after
duplicates are removed. Here, the user has given the last three digits of the ISBN
number, as shown in Figure.12.
Figure.12.ISBN Number Wise Searching
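The part-of-ISBN search with duplicate suppression could be modelled as in the following sketch. This is illustrative Python with made-up data; the project itself queries SQL Server rather than an in-memory list.

```python
# Hypothetical merged result set from two sources, containing a duplicate.
books = [
    {"isbn": "9780131103627", "title": "The C Programming Language"},
    {"isbn": "9780131103627", "title": "The C Programming Language"},  # second source
    {"isbn": "9780596007126", "title": "Head First Design Patterns"},
]

def search_by_isbn(fragment):
    """Match a whole ISBN or any part of it, returning each ISBN once."""
    seen, results = set(), []
    for book in books:
        if fragment in book["isbn"] and book["isbn"] not in seen:
            seen.add(book["isbn"])      # suppress the duplicate record
            results.append(book)
    return results

print(len(search_by_isbn("627")))  # 1 - the duplicate entry is filtered out
```

Searching on the last three digits, as in Figure.12, corresponds to passing a short fragment such as "627" here.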
Book Name Wise Searching:
In this mode, searching is done by book name. The user can search by giving the
whole book name or a part of it. Here, the user has given part of a book name and
retrieved all the results that match the given query. The result is returned after
duplicates are removed. From the retrieved results, the user can select the required
book and upload it. Book name wise searching is shown in Figure.13.
Figure.13.Book Name Wise Searching
Year Wise Searching:
In this mode, searching is done by year of publication. The user searches by giving
the year. Here, the user has given the year 2008 and retrieved all the results that
match the given query. The result is returned after duplicates are removed. From the
retrieved results, the user can select the required book and upload it; for example,
the user can upload a book published in the year 2008. Year wise searching is shown
in Figure.14.
Figure.14.Year Wise Searching

Author Wise Searching:
In this mode, searching is done by the author's name. The user can search by giving
the author's name in whole or in part. Here, the user has given part of the author
name and retrieved all the results that match the given query. The result is returned
after duplicates are removed. From the retrieved results, the user can select the
required book and upload it. Author wise searching is shown in Figure.15.
Figure.15.Author Wise Searching

Administrator Login:
The administrator can log in to the system by giving the admin name and password.
After login, the administrator can view the registration report and the book reports
from both databases: the list of registered users via the registration report button, and
the list of books from both databases via the book report buttons. The administrator
login provision is shown in Figure.16.
Figure.16.Administrator Login

Database Design:
Microsoft SQL Server is a relational database server produced by Microsoft. Its
primary query languages are T-SQL and ANSI SQL. SQL Server allows multiple
clients to use the same database concurrently, so it needs to control concurrent
access to shared data to ensure data integrity, for example when multiple clients
update the same data, or when a client attempts to read data that is in the process of
being changed by another client. SQL Server provides two modes of concurrency
control: pessimistic concurrency and optimistic concurrency. The protocol layer
implements the external interface to SQL Server. The database design window is
shown in Figure.17.
Figure.17. Database Design
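Optimistic concurrency can be illustrated with a simplified version-check sketch. This is hypothetical Python modelling the idea only; SQL Server implements it internally (for example, with row versioning) rather than through application code like this.

```python
class Row:
    """A single shared row with a version counter."""
    def __init__(self, value):
        self.value, self.version = value, 0

def update(row, new_value, read_version):
    """Commit only if nobody changed the row since it was read."""
    if row.version != read_version:
        return False                       # conflict: another client got there first
    row.value, row.version = new_value, row.version + 1
    return True

row = Row("old")
v = row.version                            # client A reads the row (version 0)
update(row, "from B", row.version)         # client B reads and commits first
print(update(row, "from A", v))            # False - A's stale update is rejected
```

Under pessimistic concurrency, client A would instead have taken a lock at read time, blocking B until A finished.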
User Registration Report:
User registration is done by all users. The user registration report comprises the id,
the name of the book shop, the address of the book shop, the contact number, and the
email address. This helps a user contact different book shops: the shops can be
reached by mail or by phone, since both details are provided in the report. The user
registration report is shown in Figure.18.
Figure.18. User Registration Report
CONCLUSION
Record matching, which identifies the records that represent the same real-world
entity, is an important step in data integration. Most duplicate detection methods are
based on offline learning techniques that require training data: they are supervised,
requiring the user to provide training examples. Since query results are dynamically
generated, such record matching methods are not applicable to the Web database
scenario. In this scenario, where the records to match are greatly query-dependent, a
pretrained approach is not applicable because the set of records in each query's
results is a biased subset of the full data set. To overcome this problem, an
unsupervised, online approach, unsupervised duplicate detection (UDD), is
introduced. UDD is used for detecting duplicates over the query results of multiple
Web databases and supports online duplicate detection. A linear kernel, which is the
fastest kernel function, is used in duplicate detection. Two classifiers are
implemented to address the duplication problem: the weighted component similarity
summing (WCSS) classifier and the support vector machine (SVM) classifier. In this
algorithm, WCSS plays an important role: it is used to identify some duplicate
vectors when there are no positive examples, and after iteration begins it is used
again, cooperating with the SVM, to identify new duplicate vectors. Since no
labelled duplicate vectors are available initially, classifiers that need class
information to train, such as a decision tree, cannot be used. The two types of
intuition in WCSS are the duplicate intuition and the non-duplicate intuition: under
the duplicate intuition, the similarity between two duplicate records should be close
to one, and under the non-duplicate intuition, the similarity between two
non-duplicate records should be close to zero.
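The duplicate intuition can be sketched with a simplified, hypothetical weighting scheme (not the report's exact algorithm): among pairs already judged to be duplicates, fields that agree more receive larger weights.

```python
def adjust_weights(duplicate_pairs, fields, sim):
    """Weight each field by its average similarity over known duplicate pairs."""
    raw = {f: sum(sim(a[f], b[f]) for a, b in duplicate_pairs) for f in fields}
    total = sum(raw.values()) or 1.0       # guard against an all-zero case
    return {f: raw[f] / total for f in fields}

def sim(a, b):
    # Toy field similarity: exact match ignoring case.
    return 1.0 if a.lower() == b.lower() else 0.0

# One hypothetical pair already identified as a duplicate.
pairs = [({"title": "UDD", "year": "2008"}, {"title": "udd", "year": "2009"})]
weights = adjust_weights(pairs, ["title", "year"], sim)
print(weights["title"] > weights["year"])  # True: titles agree, years do not
```

In the full algorithm these reweighted fields feed back into the similarity computation, and the SVM then uses the vectors WCSS identifies as its positive examples for the next iteration.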