Semantic Annotation and Retrieval in E-Recruitment
By
Malik Nabeel Ahmed Awan
2011-NUST-DirPhD-IT-44
Supervisor
Dr. Sharifullah Khan (TI)
Department of Computing
A thesis submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Information Technology
In
School of Electrical Engineering and Computer Science,
National University of Sciences and Technology (NUST),
Islamabad, Pakistan.
(August 2019)
Abstract
E-recruitment processes prioritize matching between job descriptions and user queries to identify relevant candidates. Existing e-recruitment systems face challenges in extracting information from job descriptions because the content is unstructured and the same content can be expressed with different text nomenclatures. In particular, the systems are unable to effectively extract contextual entities, such as job requirements and job responsibilities, from job descriptions. They also fail to produce the desired search results effectively because of semantic differences between job descriptions and users' English natural language queries. This thesis proposes a framework to address these challenges in existing e-recruitment systems.
The proposed Semantic Extraction, Enrichment and Transformation (SExEnT) framework extracts entities from job descriptions using a domain-specific dictionary. The extraction process first performs linguistic analysis and then extracts entities and compound words. After the extraction of entities and compound words, it builds the job context using a job description domain ontology. The ontology provides an underlying schema that defines how concepts are related to each other. Besides building contextual relationships among entities, the entities are also enriched using Linked Open Data (LOD), which improves the search capability in finding suitable jobs. In the proposed framework, the Web Ontology Language (OWL) is used to represent information for machine understanding. The framework appropriately matches users' queries against job descriptions.
The evaluation data set has been collected from various job portals, such as Indeed, Personforce and DBWorld. A total of 860 jobs were collected, belonging to multiple categories such as technology, medical, management and others. The data set was vetted and verified by HR experts. The evaluation has been performed using precision, recall, F1-measure, accuracy and error rate. The proposed framework achieved an overall F1-measure of 87.83% and an accuracy of 94% for entity extraction. The application has a precision of 99.9% in representing and retrieving job descriptions from its knowledge base. The job description ontology has an overall concept coverage of 96%. The evaluation results show that the proposed framework performs well in extracting, modelling, enriching, and retrieving job descriptions against queries. At present, the proposed framework can neither automatically generate pattern/action rules, nor provide a complex ranked retrieval of job descriptions against a user profile, nor automatically extend the dictionary to increase extraction precision. In future work, the framework can be extended to resolve these limitations.
Table of Contents

1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Research Challenges
1.4 Research Objectives
1.5 Overview of Proposed Solution
1.6 Thesis Organization

2 Literature Review
2.1 Information Extraction in E-Recruitment
2.1.1 Information Extraction
2.1.2 Data Extraction in E-Recruitment
2.2 Ontology Design
2.2.1 Ontology Design Frameworks
2.2.2 Domain Ontology
2.2.3 E-Recruitment Ontology
2.3 Information Enrichment
2.4 Natural Language Queries
2.5 Critical Analysis

3 SExEnT Framework
3.1 Introduction
3.2 Information Extraction and Enrichment
3.3 Job Description Domain Ontology
3.4 Job Query Transformation
3.5 Summary

4 SAJ Framework
4.1 Introduction
4.2 Segmentation
4.3 Dictionary
4.4 Linguistic Analysis and Extraction
4.4.1 Linguistic Analysis
4.4.2 Entities Extraction
4.5 Context Builder
4.6 Enrichment
4.6.1 Knowledge Base
4.7 Evaluation
4.7.1 Data-set Acquisition
4.7.2 Evaluation Metrics
4.7.3 Evaluation Results
4.7.4 Comparative Analysis
4.8 Summary

5 Job Description Ontology
5.1 Ontology Design Methodology
5.2 Ontology Expressiveness
5.3 Job Description Ontology
5.3.1 Identify Purpose
5.3.2 Build Ontology
5.3.3 Ontology Development and Documentation
5.4 Evaluation
5.4.1 Domain Coverage
5.4.2 Application based Evaluation
5.5 Summary

6 Sem-QA Framework
6.1 The Sem-QA Framework
6.2 Semantic Linguistic Analysis
6.3 Query Template Matching
6.4 Query Generation
6.5 Working Example
6.6 Evaluation and Results
6.6.1 Experimental Setup
6.6.2 Data Set Specification
6.6.3 Evaluation Results
6.6.4 System Performance for Semantic Association of Atomic FC
6.7 Summary

7 Conclusion
7.1 Research Description
7.2 Application Areas
7.3 Research Contributions
7.4 Limitations and Future Work
List of Figures

1.1 A sample job description with marked segments
1.2 A semantic difference between a requirement and a user query
3.1 Proposed SExEnT framework
4.1 High-level block diagram for the proposed framework SAJ
4.2 Extracting a context-aware requirement entity from a job description
4.3 Sample of an educational requirement in a job description
4.4 Graph structure showing entities and connections in SAJ
4.5 Entities enrichment process using LOD in SAJ
4.6 N3 notation of a job description in the knowledge base
4.7 Evaluation of extraction: comparison of accuracy vs. error for SAJ
4.8 SAJ, Alchemy API and OpenCalais extraction comparison for job titles
4.9 SAJ and OpenCalais extraction comparison for requirements
4.10 Comparison of precision, recall and F1-measure with ground truth
5.1 Uschold and King's enterprise methodology
5.2 Job description ontology
5.3 Job description basic entities in N3 notation
5.4 Job description requirements entity in N3 notation
5.5 Job description responsibilities in N3 notation
5.6 Job description education in N3 notation
5.7 Job description profile in N3 notation
5.8 A sample N3 representation of the job description ontology
5.9 A sample SPARQL query to retrieve job title labels after execution
6.1 A set of sample queries from the Mooney data set
6.2 A sample query processing representation of the Mooney data set
6.3 Time comparison between various Filter Constraints queries
List of Tables

2.1 Gap analysis for information extraction
2.2 Gap analysis for ontology design
2.3 Gap analysis for question answering
4.1 Sample rules for segmentation and extraction with description
4.2 Compound words identification and extraction rules
4.3 Sample text showing nomenclature variation in job descriptions
4.4 Basic entities along with examples from a job description in SAJ
4.5 Sample rule for boundary detection for requirements using JAPE in SAJ
4.6 Sample rule for job requirements using JAPE in SAJ
4.7 Sample rule for detecting responsibilities from a job description in SAJ
4.8 A sample rule for education extraction from a job description in SAJ
4.9 Statistics of jobs collected from various e-recruitment systems
4.10 Statistics of job descriptions in various job categories collected randomly
4.11 Results of entity extraction from job descriptions in SAJ
5.1 DL basic expressivity labels along with details
5.2 DL extension expressivity labels along with details
5.3 Important concepts in the job description ontology
5.4 Job requirements properties and descriptions
5.5 Job responsibilities properties and descriptions
5.6 Job position properties and descriptions
5.7 Domain coverage of the job description ontology
5.8 Job description evaluation queries categorization
5.9 Job description user retrieval summary
6.1 Examples of entities detected from natural language job queries
6.2 Query count for the Mooney and Personforce data sets
6.3 Query categorization based on number of Filter Constraints
6.4 Comparative analysis of the Mooney and Personforce data sets
Chapter 1
Introduction
This chapter provides an overview of the problem domain, identifies the critical research gaps and objectives, and presents an outline of the dissertation.
1.1 Motivation
With recent advancements in technology, human reliance on the internet has significantly increased. Information is now mostly available and shared via the internet through sources such as websites, social media and web portals. This advancement in internet technology has also had an impact on how organizations recruit potential employees. Automatic ways of exploring, joining and sharing information in Web 3.0 have improved the usability of web resources. This evolution of the web (Bogh, 2012) had a direct impact on application design and content sharing over the web. Years ago, during the Web 1.0 era, recruitment was plain vanilla and lateral (Bhagia, 2015). Organizations posted jobs via various channels, such as newspapers or digital print media, and then waited for job seekers to send in their resumes via email or postal mail. Organizations mainly created in-house resume banks. The advancement from Web 1.0 to Web 2.0 also brought a shift in recruitment systems.
Figure 1.1: A sample job description with marked segments
Web 2.0 was the era of social media. The old-style static recruitment processes evolved into passive recruitment; for example, platforms like LinkedIn 1 enabled passive recruitment. Organizational recruiters now had a larger pool of information in which to search for their required candidates. It was no longer only job seekers looking for jobs; organizational recruitment agents were also looking for knowledgeable candidates using Web 2.0. However, was it enough? Web 3.0 gave a new dimension to recruitment (Matthew Jeffery, 2011). It enabled systems to understand complex job descriptions and process search queries more effectively. Understanding job descriptions and search queries was only possible by understanding the context hidden in the content (McConell, 2014).
1 https://www.linkedin.com/
The job description in Fig 1.1 outlines the location, job title, requirements and
responsibilities. Features in a job description, such as location, job title, skills and
expertise level, are described as entities. These entities are contextually associated
with each other to yield contextual entities, such as the job requirements. The
employer’s primary emphasis in candidate filtering is on job requirements because
the requirements define a baseline for the selection of a potential candidate.
1.2 Problem Statement
Existing e-recruitment systems, such as Indeed 2, Monster 3, Personforce 4, Angel.co 5, LinkedIn 6, Career Builder 7, Glassdoor 8 and SimplyHired 9, store information as raw text and apply keyword matching or faceted search to provide better search results to both organizations and users. The recruitment process starts with advertising a job description. The job description outlines the requirements for selecting a potential user/candidate. A typical job description may comprise the location, job title, requirements and responsibilities. Some features in a job description, such as location, job title, skills and expertise level, are described as entities. These entities are contextually associated with each other to yield contextual entities, such as the job requirements.
2 https://www.indeed.com
3 https://www.monster.com/
4 https://www.personforce.com/
5 https://angel.co/
6 https://www.linkedin.com
7 https://www.careerbuilder.com/
8 https://www.glassdoor.com/
9 https://www.simplyhired.com/
The contents of the job description and candidate profile thus hold crucial importance. However, the information provided in a job description or user profile presents challenges for extraction: the content is unstructured, there is no standard format for defining content, and there are text nomenclature differences for defining the same content. Recruitment processes prioritize matching/relevance between job descriptions and candidate queries to filter out irrelevant candidates or to find the best-fitting job for a candidate. Manually performing this matching process is time-consuming and challenging 10.
Figure 1.2: A semantic difference between a requirement and a user query
The process has been carried out automatically in e-recruitment systems. However, this process is not merely the matching of text, because there can be semantic heterogeneities in the texts. For example, Fig 1.2 illustrates the text of a job description and a user query. They do not match lexically; however, they match semantically, because Android Development is a type of mobile application. The matching process is complex and needs to understand the context (i.e., domain-specific information) of the text to resolve semantic heterogeneity in matchmaking.
10 https://ckscience.co.uk/is-your-recruitment-process-costing-you-time-money-and-good-candidates
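To make the idea concrete, the following is a minimal sketch of how an is-a relationship can resolve such a lexical mismatch. The tiny taxonomy below is invented purely for illustration; the thesis' actual knowledge base and matching process are far richer.

```python
# Illustrative toy taxonomy (child concept -> parent concept).
# These entries are assumptions for the sketch, not the thesis' data.
IS_A = {
    "android development": "mobile application",
    "ios development": "mobile application",
    "mobile application": "software development",
}

def ancestors(concept):
    """Return the concept together with all of its ancestors."""
    chain = [concept]
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def semantic_match(requirement, query):
    """True if the two concepts match lexically or via an is-a chain."""
    req, q = requirement.lower(), query.lower()
    return req == q or req in ancestors(q) or q in ancestors(req)

# Lexical comparison fails, but the semantic comparison succeeds:
print("mobile application" == "android development")                # False
print(semantic_match("mobile application", "Android Development"))  # True
```

A purely keyword-based system answers the first comparison; a context-aware system answers the second.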
Existing e-recruitment systems (Owoseni et al., 2017; Valle et al., 2007) do not extract domain-specific information, such as 'mobile application' in the above example, from the job requirement text to match with the candidate query of Android Development. These domain-specific entities are contextually associated with each other. Existing systems extract entities from text independently of each other, without considering the context associated with the entities. For example, the job requirement 'Strong fundamentals (OO, algorithm, data structure)' has the entity strong as the expertise level and OO, algorithm, data structure as skills, whereas in fact OO, algorithm, data structure require strong expertise. These entities combine to generate a contextual entity, i.e., a job requirement. Another drawback of existing systems is the limited availability of information (Silvello et al., 2017; Candela et al., 2017) contained in the knowledge base for enrichment, either through in-house data or through an external source that remains static as data grows. On the other hand, Linked Open Data (LOD) (lod, 2017) does not suffer from data staleness, and the data can expand over time. Multiple sources actively contribute to it, such as Wikipedia 11, Getty 12, and GeoNames 13. Existing approaches (Shin et al., 2015; Gregory et al., 2011) do not properly implement LOD principles. Besides extracting, enriching and building the information context, retrieval of information is another important aspect. System users are not domain experts who can query machine-understandable data in its required format, i.e., the SPARQL query language. A translation mechanism is required to translate users' Natural Language Queries (NLQ) into the machine-understandable format, i.e., the SPARQL query language.
11 https://www.wikipedia.org/
12 http://www.getty.edu/research/tools/vocabularies/lod/
13 https://lod-cloud.net/dataset/geonames-semantic-web
The main problems summarized from the above discussion are:
1. Information loss in the extraction of domain-specific e-recruitment entities from job descriptions, due to the lack of their context and of inter- and intra-document linkages
2. Information loss due to the absence of a comprehensive domain ontology in e-recruitment for building relationships among the entities extracted from job descriptions
3. Use of static external sources for data enrichment (expansion), resulting in data staleness
4. Deficiency in translating natural language queries into machine-understandable queries, due to the difficulty of identifying entities and their context
1.3 Research Challenges
The extraction of contextually associated entities for filtering job candidates poses major challenges, which are:
1. Unstructured content: The text written in a job description is unstructured in nature, such as 'Previous working experience as a Java Developer for minimum 2 years'. The aforementioned example includes structured information, such as skill: Java, job title: Java Developer and experience: 2 years, but mentions it in an unstructured way.
2. Nomenclature differences: The text represents the same information in different ways, such as 'Requires 2+ years of java experience' and 'Java experience of 2+ years required'. The job requirement is an experience of 2+ years, but the representation is different.
3. Semantic heterogeneities: The text represents the same information in a synonymous form, such as Android Developer and Mobile Developer. In the aforementioned example, Android Developer is a type of Mobile Developer but is represented in a synonymous form.
4. Contextual entities: The extraction of contextually associated entities is itself a major challenge; for example, 'Expertise in Spring Boot' is a contextual entity, but its extraction requires an in-depth understanding of the content.
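The first challenge can be sketched in a few lines: recovering structured fields from the example sentence above. The regular expressions and the tiny skill gazetteer below are toy assumptions, far simpler than the dictionary and JAPE rules used in SAJ.

```python
import re

TEXT = "Previous working experience as a Java Developer for minimum 2 years"

# A tiny skill gazetteer, assumed here as a dictionary-based extractor would use.
SKILLS = {"Java", "Python", "SQL"}

def extract(text):
    """Pull a structured record out of an unstructured requirement sentence."""
    record = {}
    m = re.search(r"as an? ([A-Z][a-z]+ Developer)", text)
    if m:
        record["job_title"] = m.group(1)
    m = re.search(r"(\d+)\s*\+?\s*years?", text)
    if m:
        record["experience_years"] = int(m.group(1))
    record["skills"] = [s for s in SKILLS if s in text]
    return record

print(extract(TEXT))
# {'job_title': 'Java Developer', 'experience_years': 2, 'skills': ['Java']}
```

Even this toy version shows why hand-written patterns are brittle: the second challenge (nomenclature differences) would require a new pattern for every alternative phrasing.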
1.4 Research Objectives
The main research objectives are:
1. To design an extraction and transformation methodology for the identification of entities and compound words from job descriptions for building the information context
2. To design a comprehensive web ontology for the representation of e-recruitment job descriptions in order to resolve semantic heterogeneities
3. To design an enrichment methodology for enriching entities and compound words from Linked Open Data to cater for data staleness in e-recruitment
4. To design a methodology for translating natural language queries into machine-understandable queries, i.e., SPARQL queries
1.5 Overview of Proposed Solution
The proposed solution extracts, enriches, retrieves and transforms the natural language content of e-recruitment job descriptions (Gupta, 2016) and NLQs into a machine-understandable format. The transformation improves searching in e-recruitment systems by considering the context when matching for relevance. It transforms the raw text into a machine-understandable format (Graupner et al., 2017) using the job description domain ontology (Ahmed et al., 2016). The extracted entities are also enriched using Linked Open Data, which is a non-stale data source. The enrichment of job description entities increases the accuracy and precision of the retrieval process; for example, Java microservices have connected concepts such as Eureka, Ribbon and Feign, and connecting them via LOD enhances the exploration capabilities of the knowledge base. The system also facilitates users to input complex queries in plain English natural language. Input in plain English natural language provides users a way to express their requirements more efficiently and effectively. The system transforms the plain English natural language query into SPARQL, which is a machine-understandable format, and retrieves accurate and precise matching results.
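The transformation idea can be illustrated with a minimal template-based sketch: match the user's English query against a pattern and fill a SPARQL skeleton. The template, property names (`jd:hasTitle`, `jd:hasLocation`) and prefix URI are all invented for illustration; the thesis' Sem-QA component, described in Chapter 6, is far more general.

```python
import re

# Hypothetical SPARQL skeleton; predicate names and prefix are assumptions.
SPARQL_SKELETON = """PREFIX jd: <http://example.org/jobdesc#>
SELECT ?job WHERE {{
  ?job jd:hasTitle "{title}" ;
       jd:hasLocation "{location}" .
}}"""

def translate(nlq):
    """Translate a '<title> jobs in <location>' query into SPARQL, else None."""
    m = re.search(r"(.+?) jobs in (.+)", nlq, re.IGNORECASE)
    if not m:
        return None
    return SPARQL_SKELETON.format(title=m.group(1).strip(),
                                  location=m.group(2).strip())

print(translate("Android Developer jobs in Islamabad"))
```

A single template obviously handles only one query shape; supporting free-form queries is exactly why the full transformation process needs linguistic analysis and template matching.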
Each component of the framework has been evaluated separately, i.e., extraction and enrichment, the job description domain ontology, and the NLQ transformation process. The framework evaluation metrics are precision, recall, F1-measure, and accuracy. The data set consists of 860 jobs captured from various job portals, such as Indeed, Personforce and DBWorld. The data set was evaluated by domain experts to build an evaluation ground truth. Overall, the framework achieved comparably good results. The extraction and enrichment component achieved its highest F1-measure of 95% for job title, whereas education has the highest accuracy of 99.9% due to fewer variations in nomenclature. The application-driven evaluation of the job description ontology model has been able to retrieve all relevant jobs. The transformation process evaluation has been carried out on two different data sets, i.e., the Mooney data set 14 containing 620 job queries and the Personforce data set 15 containing 500 job queries. The transformation process achieved an F1-measure of 99.9%.
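For reference, the evaluation metrics used throughout the thesis can be written as plain functions over true/false positive/negative counts. The example counts below are hypothetical, chosen only to land near the reported overall F1-measure.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for an entity-extraction run:
print(round(f1(tp=90, fp=10, fn=15), 4))  # 0.878
```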
1.6 Thesis Organization
The thesis is organized as follows:
1. Chapter 2 discusses the existing work in the domain. A research gap analysis is also carried out in this chapter in order to identify the research objectives.
2. Chapter 3 acts as a bridge for the overall discussion of the SExEnT framework. The chapter briefly discusses all the components of the framework and builds a logical bridge among them. The subsequent chapters then discuss each component in detail, along with its evaluation.
3. Chapter 4 provides in-depth details and an evaluation of the e-recruitment job description extraction, context building and enrichment framework. The chapter discusses in detail the algorithms used for the extraction, context building and enrichment of job description text.
4. Chapter 5 provides in-depth details and a discussion of the evaluation of the job description ontology. The chapter discusses in detail the design methodology and the rationale behind the design of the job description ontology. It also provides a comparison against other similar schemas.
14 https://www.ifi.uzh.ch/en/ddis/research/talking/OWL-Test-Data.html
15 http://personforce.com/
5. Chapter 6 provides in-depth details on how the transformation takes place and evaluates the job description query transformation process. It discusses how critical and important the transformation is and then presents in-depth details of the transformation.
6. Chapter 7 concludes the discussion. It summarizes the overall content of the thesis, providing critical information from each chapter. It also provides future directions for the work.
Chapter 2
Literature Review
The primary objective of the current chapter is to review and discuss the various techniques, approaches, solutions, and methodologies that contribute to the development of an e-recruitment solution. The discussion starts with a review of existing extraction methodologies and techniques. The analysis of extraction techniques is followed by a discussion of enrichment, domain ontologies and the retrieval of information.
The current chapter is organized as follows: Section 2.1 discusses various extraction techniques and methodologies, both in general and specific to e-recruitment. Section 2.2 discusses ontology design in various domains along with e-recruitment-specific systems. Section 2.3 discusses enrichment in existing systems, in general and specific to e-recruitment. Section 2.4 discusses content retrieval, in general and specific to e-recruitment. Section 2.5 concludes the discussion in the chapter.
2.1 Information Extraction in E-Recruitment
The purpose of this section is to first discuss existing work in the domain of information extraction, followed by the various extraction techniques proposed specifically for e-recruitment. The discussion covers conventional extraction techniques as well as ontology-based extraction techniques for a better understanding of the domain.
2.1.1 Information Extraction
Information extraction is a technique for the identification of entities, mentions and relations in unstructured text (Karkaletsis et al., 2011; Jayram et al., 2006). The extraction techniques include: (a) wrapper-based techniques, (b) rule- and pattern-based techniques, (c) machine learning based techniques, and (d) ontology-based techniques.
2.1.1.1 Wrapper based techniques
The wrapper-based technique is a fundamental technique designed to extract data from web pages using the Document Object Model (DOM). WebQL (Arocena and Mendelzon, 1999) uses a wrapper-based technique to extract meaningful information from web pages. Manually written wrappers proved inefficient, because changes in the DOM structure of web pages result in a complete rewrite. Mingsheng et al. (Mingsheng et al., 2012) proposed a technique to overcome the manual effort; it first understands the DOM structure and then extracts meaningful information. The focus of all these techniques was on web pages; they were not able to extract content from other document formats, such as MS Word and PDF (Flesca et al., 2011). Rule- and pattern-based (Jayram et al., 2006), machine learning based (Bijalwan et al., 2014) and ontology-based (Vicient et al., 2011) techniques were then used to cater to other formats.
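A wrapper in this sense is simply a hand-written rule tied to a known DOM position. The sketch below illustrates the idea with Python's standard-library HTML parser; the HTML snippet and the `job-title` class name are invented for illustration, and the brittleness is visible: renaming that class would silently break the rule.

```python
from html.parser import HTMLParser

class JobTitleWrapper(HTMLParser):
    """Toy wrapper: extract text from any <h2 class="job-title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "job-title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

page = '<html><body><h2 class="job-title">Java Developer</h2><p>2+ years</p></body></html>'
w = JobTitleWrapper()
w.feed(page)
print(w.titles)  # ['Java Developer']
```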
2.1.1.2 Rule and pattern-based techniques
Rule- and pattern-based techniques identify hidden features in text by utilizing predefined rules or known patterns (Jayram et al., 2006); e.g., the extraction of a person's phone number may require the occurrence of phrases such as 'at', 'can be reached at' and similar pattern phrases. The absence of these phrases will result in the non-extraction of the person's phone number. Rule-based techniques have been applied in multiple domains, such as the aspect extraction of product reviews by exploiting practical knowledge and sentence dependency trees (Poria et al., 2014), relation extraction using background knowledge (Rocktaschel et al., 2015), the extraction of patients' clinical data from medical texts (Mykowiecka et al., 2009) and numerous others. These methods may process unstructured text multiple times to obtain meaningful information. Besides this, the rule-based technique has been applied to the extraction of compound entities in the bio-medical domain (Ramakrishnan et al., 2008) using the BioInfer and GENIA corpora. The drawback of compound word extraction using the technique of Cartic Ramakrishnan et al. (Ramakrishnan et al., 2008) is that any concept missing from the BioInfer and GENIA corpora will not be recognized as a compound word. These methods cannot identify any new rule or pattern that is not already defined. Machine learning based techniques can deal with such problems.
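The phone-number example above can be sketched as a cue-phrase rule: the number is extracted only when one of the trigger phrases precedes it. The cue list and number pattern below are illustrative assumptions.

```python
import re

# Cue phrases that must precede a phone number for it to be extracted.
CUES = r"(?:can be reached at|phone:|call (?:him|her|them) at)"
PHONE = r"(\+?\d[\d\- ]{6,}\d)"
RULE = re.compile(CUES + r"\s*" + PHONE, re.IGNORECASE)

def extract_phone(text):
    """Return a phone number only if it follows a known cue phrase."""
    m = RULE.search(text)
    return m.group(1) if m else None

print(extract_phone("Mr. Khan can be reached at 051-1234567."))  # 051-1234567
print(extract_phone("His number is 051-1234567."))               # None
```

The second call shows exactly the limitation the section describes: a perfectly valid number is missed because its surrounding phrase is not in the predefined rule set.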
2.1.1.3 Machine learning based techniques
Machine learning based techniques help in extracting existing and new information from unstructured texts. They require large data sets for training and evaluation purposes. These techniques mostly focus on text categorization and classification problems. Existing techniques, such as the Term Model Graph (Wang et al., 2005), kNN (Bijalwan et al., 2014), Naive Bayes (Tang et al., 2016) and Support Vector Machines (Guenther et al., 2016), can be applied for classification and categorization. These techniques can also be applied to scenarios such as sentiment analysis (Gautam and Yadav, 2014) on Twitter data or micro-blog data (Bontcheva et al., 2013), or even extracting data from bio-medical domain texts or e-recruitment data. However, they fail to link information together with a context and require large amounts of training data. The lack of context and scarce training data may result in information loss (Gutierrez et al., 2016). Ontology-based techniques can fill this gap (Gutierrez et al., 2016).
2.1.1.4 Ontology based techniques
The ontology based technique cover this gap in information extraction, and mainly
use domain-specific knowledge for extracting meaningful information from unstruc-
tured text (Kiryakov et al., 2004; Maree et al., 2018). Some well known existing
systems that use domain ontology are KIM (Popov et al., 2003) and TextPresso
(Muller et al., 2004). These systems only use information present in domain ontol-
ogy to facilitate entity extraction. TextPresso mainly focuses on entity extraction
in the bio-medical domain. It uses the Gene Ontology (GO) during extraction, which comprises approximately 80% of its lexicon; information falling outside this lexicon is lost during extraction. This limitation was addressed by the technique proposed by Vicient et al. (Vicient et al., 2011). According to their technique, the
newly extracted knowledge is merged with existing domain knowledge resulting in
enhanced domain knowledge. Subsequently, ontology-supported information extraction has been extended with fuzzy ontologies (Ali et al., 2015) and applied to extracting travellers' reviews about hotels, building business intelligence systems that gather company intelligence and country/region information (Saggion et al., 2007), extracting information from clinical documents such as admission reports, radiology findings and discharge letters (Geibel et al., 2015), and retrieving images from web data (Vijayarajan et al., 2016).
The technique proposed by Vicient et al. (Vicient et al., 2011) has the limitation of using only the lexical English database WordNet 1 as an external source for enhancing information. The enhancement was thus limited to WordNet, which suffers from a data staleness issue. This issue has been addressed in other studies (Al-Yahya et al., 2014; Vicient et al., 2013; Nabeel Ahmed, 2008), which update the domain ontology independently of WordNet, thus increasing extraction precision. These studies use a pattern-based approach together with the ontology to extract new concepts that are not yet modeled in the domain ontology, thereby enriching it.
2.1.2 Data Extraction in E-Recruitment
E-recruitment has gained much popularity over time and is currently one of the most widely used ways to recruit talent for organizations. The rapid growth of the Internet has paved the way for many online Human Resource (HR) job systems, such as Indeed 2, Monster 3, Personforce and hundreds of others.
PROSPECT (Singh et al., 2010) is a domain-dependent research initiative to extract data from e-recruitment content. The main aim of the PROSPECT system is job candidate screening. It builds facets on resumes and then screens candidates
1https://wordnet.princeton.edu/
2https://www.indeed.com
3https://www.monster.com/
on the basis of these facets. The job posting created by PROSPECT includes role, job category, skills, skill experience, location, number of positions, and total experience. The PROSPECT system does not define or follow any ontology model to represent this information; rather, a job is posted using existing job posting channels. Resumes are collected through the same channel through which a job is advertised and are then processed for screening.
SCREENER (Sen et al., 2012) is also a domain-dependent research initiative that facilitates the e-recruitment process by extracting information only from resumes. It identifies text segments that are likely to contain a specific set of information, including skills, education, experience, and other related information. The extracted information is then indexed using Lucene 4 for searching and ranking all applicants for a given job opening. The authors (Sen et al., 2012) claim that this automated process makes the screening task simpler and more efficient.
JobOlize (Buttinger et al., 2008) and WoLMIS (Boselli et al., 2018) extract structured information from unstructured job documents. JobOlize utilizes a hybrid approach that combines existing Natural Language Processing (NLP) techniques with a new form of context-driven extraction for extracting the layout, structure and content information of a job description. The WoLMIS system aims to collect and classify multilingual Web job descriptions with respect to a standard taxonomy of occupations.
The existing information extraction approaches are unable to extract domain-specific e-recruitment entities from job descriptions due to the unavailability of their context and of inter- and intra-document linkages.
4http://lucene.apache.org/
2.2 Ontology Design
This section discusses the work done in the area of ontology design and development. Ontology design and development is generally carried out by a team of people, such as domain experts, ontology engineers, and pedagogues. The main motivation behind ontology design and development, as mentioned by Gruber (1995), is to share a common understanding of knowledge and information structure among people or applications. It also enables the reuse of domain knowledge, thus becoming a significant enabler of the current increase in ontology research, design and development. Ontology design frameworks define steps and guide domain experts, ontology engineers, and pedagogues in building better models.

This section reviews work in ontology design and development from various aspects, including: (1) existing design frameworks, (2) ontologies from various domains, and (3) ontology design for e-recruitment.
2.2.1 Ontology Design Frameworks
Various ontology design frameworks exist, including: (1) Cyc, (2) TOVE, (3) Uschold and King's, and (4) Methontology. A review of these four well-known frameworks is presented here.
Cyc (Elkan and Greiner, 2006) is an ontology as well as an ontology engineering methodology. The main intent behind this methodology is to make common sense knowledge accessible and processable for computers. The key steps to develop an ontology based on the Cyc methodology are:

1. Manual identification of common sense knowledge

2. Computer-assisted extraction of common sense knowledge

3. Computer-managed extraction of common sense knowledge
The TOVE (TOronto Virtual Enterprise) ontology (Gruninger and Fox, 1995) introduced another ontology engineering methodology. This methodology makes use of story-driven cases that provide informal semantics for objects and relations.
Uschold and King's Enterprise methodology (Uschold and King, 1995; Uschold and Gruninger, 1996) provides guidelines for building ontologies based on an enterprise modeling approach. This approach was developed as part of the Enterprise project by AIAI at the University of Edinburgh in collaboration with IBM, Lloyd's Register, Logica UK Limited and Unilever (Uschold and King, 1995; Uschold and Gruninger, 1996).
The methodologies above provide guidelines for ontology design, whereas Methontology (Gomez-Perez, 1996; Gomez-Perez, 1999) also addresses ontology maintenance. This framework facilitates the construction of ontologies in a systematic way and is compatible with software development processes and knowledge engineering methodologies such as the Rational Unified Process (RUP) (Shahid et al., 2009). The life-cycle of this methodology is based on evolving prototypes.
Various domain ontologies have been developed based on these frameworks; these are discussed next.
2.2.2 Domain Ontology
Ontology development has influenced various domains. Education is one such domain, with multiple ontologies ranging from a course ontology (Ameen et al., 2012) to a university ontology (Malik et al., 2010a), among others. These ontologies comprehensively model their respective domains. Cultural heritage is another domain that has been described using an ontology (Pattuelli, 2011). The authors describe the core concepts of cultural heritage and their interconnections. Domain experts evaluated the ontology's coverage and usage in cultural heritage applications.
Another significant ontology construction effort is in botany, i.e., plants. The Plant Ontology (Li et al., 2016) provides comprehensive details related to plants. It covers concepts and relationships for classification, structure, growth, multiple names, flora details, and the possible uses of plants, mainly in the medical domain.
2.2.3 E-Recruitment Ontology
Multiple initiatives and research works have been carried out to define a schema for human resource management. One such project is the SEEMP project (Gomez-Perez et al., 2007) of the European Union (EU). Under this project, existing human resource management standards are reused to build a common language called the Reference Ontology. The Reference Ontology includes a compensation ontology, economic activity ontology, occupation ontology, education ontology, skill ontology, job offer ontology, and several others. These modular ontologies are combined to form a comprehensive human resource ontology. One major problem with this model is that it is mainly influenced by HR-XML, so it inherits any shortcomings of HR-XML. Moreover, the job offer model is not comprehensive enough to handle the job posting domain as a whole: requirements are kept as raw plain text (Gomez-Perez et al., 2007), and entities such as skills, experience, and expertise level are not identified.
2.3 Information Enrichment
This section discusses existing work in the domain of information enrichment, i.e., the process of enhancing, refining or improving existing data. The enrichment process increases the value of data (Weichselbraun et al., 2014). Work has been carried out in various domains, such as cultural heritage, scientific publications, and question answering, to enhance, refine and improve existing data by adding knowledge from external sources, and the process is gaining popularity with time (Silvello et al., 2017). Significant work has been carried out on scientific experimental evaluation data (Silvello et al., 2017). Global large-scale campaigns produce large quantities of scientific and experimental data, which are a fundamental pillar of the scientific and technological advancement of information retrieval (Silvello et al., 2017). The proposed system semantically annotates and interlinks the data, which is shared via the Linked Open Data (LOD) cloud. The interconnections of data in LOD provide a means of data enrichment, i.e., the depth and breadth of the LOD graph increase.

Another work, carried out by Candela et al. (2017), enriches data from the cultural heritage domain. Cultural heritage institutions are now progressing towards sharing knowledge via LOD, and sharing data using LOD increases its value. Their system connects the Biblioteca Virtual Miguel de Cervantes records (200,000) to other data sources on the web. At present, the focus is on enriching location and date information with additional knowledge, using the GeoNames API to link the data.
Existing work on enrichment mostly uses static dictionaries to enrich information. Cultural heritage efforts currently use LOD to link artefact information with that available in LOD. However, no work has been carried out to enrich e-recruitment entities/concepts from LOD, which would increase their value many times over.
2.4 Natural Language Queries
The remarkable advancement of web applications is attracting more and more users. Alongside trained, technical web users, the number of novice users is also increasing at a rapid pace, making it inevitable that web data be explorable for everyone in a non-technical way. According to Copestake and Jones (1990), the most flexible and convenient method of communication with software is Natural Language (NL). Although NL-based systems are the most intuitive means of user communication, they have proven much more challenging to implement than was expected in the past (Copestake and Jones, 1990). The central problem for an NL-based Question Answering (QA) system is identifying user intent by disambiguating the concepts and their mutual relationships in a particular domain.
This section discusses the work done on Natural Language Interfaces (NLI). The basic purpose of an NLI is to make querying data easier for users. NLIs generally fall into two categories: open-domain NLIs and restricted-domain NLIs. Open-domain NLIs answer questions using general ontologies and information available on the web (Strzalkowski and Harabagiu, 2006). Since they target general web resources, answer reliability may be doubtful, as information can be outdated, conflicting or wrong. In terms of information reliability, closed-domain QA systems win over open-domain ones, as their information resources are comparatively smaller and more specific (Frank et al., 2007). Some of the recent work on NLIs includes QACID (Oscar et al., 2009), PANTO (Wang et al., 2007), ORAKEL (Philipp et al., 2008) and AquaLog (Lopez et al., 2005).
QACID (Oscar et al., 2009) is designed for the cinema domain. It stores question patterns in a query formulation database and statically binds a SPARQL query to each pattern, while an entailment engine finds a match for an input query within the query formulation database.
AquaLog (Lopez et al., 2005) and PANTO (Wang et al., 2007) both initially generate query triples and then convert them into onto-triples. Both techniques use ontologies as their knowledge base. AquaLog supports scalability, learning through user interaction and portability better than PANTO, but supports only 23 question categories. PANTO, on the other hand, supports more questions but lacks the other features of AquaLog. PRECISE (Popescu et al., 2003) maps NL questions to SQL queries and has shown 100% precision for semantically tractable questions; answering only semantically tractable questions, however, leaves the intractable ones unanswered.
ORAKEL (Philipp et al., 2008) is another ontological QA system. It supports factoid questions but requires a lexicon engineer to map the query lexicon to ontological relations. Some NLI-based QA systems, including (Bernstein and Kaufmann, 2006), (Funk et al., 2007), and (Popescu et al., 2003), work with a controlled language to overcome the ambiguity and vagueness of natural language questions. Since a controlled natural language is a subset of the representation language understandable by the system (Fuchs et al., 2006), an end user must learn it and be trained enough to express all types of questions using its supported constructs. Machine-learning-based QA systems include (Zelle and Mooney, 1996) and (Thompson et al., 1999).
All approaches modeled so far exhibit one or more of the following weaknesses: (1) they require a considerable amount of domain-specific training data to achieve high accuracy, (2) they require better interpretation of semantics in the context of the question domain, (3) they need to provide complete or near-complete coverage of questions, (4) they must be capable of handling a large amount of data, (5) they involve end-user effort, and (6) they are not flexible enough to adapt to unrecognized question patterns and new information while maintaining system accuracy.
2.5 Critical Analysis
Table 2.1, Table 2.2 and Table 2.3 show the gap analysis for three broad domains that together encapsulate our work.
Table 2.1: Gap analysis for information extraction

Feature / Systems        Prospect  Screener  JobOlize  KIM             TextPresso  OpenCalais      AlchemyAPI
                         (2010)    (2012)    (2008)    (2003-to-date)  (2004)      (2017-to-date)  (2005-to-date)
Contextual Entities      -         -         -         -               -           Partial         -
Basic Entities           YES       YES       YES       YES             YES         Partial         Partial
Context Building         -         -         -         -               -           -               -
Inter and Intra
  document linkages      -         -         -         YES             YES         -               -
Enrichment               -         -         -         -               -           YES             YES
RDF Representation       -         -         -         YES             YES         -               -
Table 2.2: Gap analysis for ontology design

Feature / Systems           HR-XML          Prospect  Schema.org      Indeed
                            (1999-to-date)  (2010)    (2011-to-date)  (2004-to-date)
Hierarchical Relationship   Partial         -         -               -
Associative Relationship    -               -         -               -
Expressiveness              Low             Low       Medium          Low
Table 2.3: Gap analysis for question answering

Feature / Systems               QACID   PANTO   ORAKEL   AQUALOG
                                (2009)  (2007)  (2008)   (2005)
Domain Specific Training        YES     YES     YES      YES
Complete Coverage of Question   YES     -       -        -
End-user Efforts                -       -       YES      -
Adapt to new questions          -       YES     -        -
Handle Large Data               -       YES     -        YES
Summarizing the researchers' contributions, the following research gaps exist based on Table 2.1, Table 2.2 and Table 2.3:

1. Information loss in the extraction of domain-specific e-recruitment entities from job descriptions due to the unavailability of their context and of inter- and intra-document linkages

2. Information loss due to the absence of a comprehensive domain ontology in e-recruitment for building relationships among the entities extracted from job descriptions

3. Usage of static sources for data enrichment (causing expansion and matching hurdles) resulting in data staleness

4. Deficiency in translating natural language queries against machine-understandable data due to inappropriately identified entities and context
Chapter 3
SExEnT Framework
This chapter explains the proposed Semantic Extraction, Enrichment and Transformation (SExEnT) framework for the retrieval of information in the e-recruitment domain, presenting its logical divisions for clear understanding.
3.1 Introduction
The proposed framework aims to provide a comprehensive solution to issues related to information extraction, contextual representation of the extracted information, enrichment of the extracted information, domain representation in a machine-understandable format, and the use of Natural Language Queries (NLQ) for retrieving the information.
The proposed framework has three logical divisions: (1) extraction and enrichment, (2) job description ontology, and (3) job query transformation. Fig 3.1 shows a pictorial representation of the framework.
Figure 3.1: Proposed SExEnT framework
The purpose of information extraction and enrichment is to extract entities, enrich them and build context from the plain text of job descriptions. Extraction and context building make use of the job description ontology (Ahmed et al., 2016), which provides the relationships among entities; defining the relations among entities builds the context. The knowledge-base stores the contextual e-recruitment information. Users can query the knowledge-base using Natural Language Queries (NLQ). A user query is transformed into a machine-understandable format, i.e., SPARQL 1, to fetch the desired results.
Section 3.2 presents an overview of extraction and enrichment, Section 3.3 an overview of the job description ontology, and Section 3.4 an overview of the job search.
1https://www.w3.org/TR/rdf-sparql-query/
3.2 Information Extraction and Enrichment
The extraction and enrichment process starts with a job description submitted as raw plain English text. In the first step, the text is classified into segments, such as job title, job requirements, job responsibilities, and others. A rule-based dictionary contains rules for segmentation as well as entity extraction. Once the text has been segmented using rules from the dictionary, the next step is entity extraction, which is preceded by sentence splitting and POS tagging using the Penn TreeBank tag set 2. Entity extraction is an essential and critical process.
During entity extraction, generic information such as places, organizations, money, and email addresses is extracted. Besides these generic entities, domain-specific entities are also extracted, such as job title, expertise level, career level, skills, job requirements, and job responsibilities. The dictionary helps in the extraction of domain-specific entities: it contains pattern/action rules developed using features such as POS tags and word lists. The rules also carry priorities that determine their execution order, and this order affects the input/output of subsequent pattern/action rules. After entity identification, the next step is to connect entities to build context. The hierarchical and associative relationships among the entities of a job description define its context, and the job description ontology plays a vital role in building this context among the extracted entities. The details of extraction, enrichment and context building are presented in Chapter 4.
2http://www.anc.org/oanc/penn.html
3.3 Job Description Domain Ontology
The job description ontology represents the domain knowledge using the Web Ontology Language (OWL). It provides the semantics and structure for representing a job description in a machine-understandable format. The core schema classes are Job Description, Job Title, Requirements, Responsibilities, Expertise Level, Education, Career Level, and Job Type. Some of the core properties are job description, requirements, job type, education, job title, expertise level, and job position. Logically, the job description ontology has two parts, i.e., job description and job position; one job description can be part of multiple job positions. This logical segregation is incorporated to increase the reusability of a single job description. The details of the job description ontology are discussed in Chapter 5.
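The two-part structure described above can be sketched as a small set of schema triples. The class and property names below follow those listed in the text, but the triple layout and the helper function are illustrative assumptions, not the thesis's actual OWL model:

```python
# Illustrative schema triples for the two-part job description ontology.
# Class and property names follow the text; the structure is an assumption.
SCHEMA = [
    ("JobPosition", "hasJobDescription", "JobDescription"),  # one description, many positions
    ("JobDescription", "hasJobTitle", "JobTitle"),
    ("JobDescription", "hasRequirement", "Requirements"),
    ("JobDescription", "hasResponsibility", "Responsibilities"),
    ("JobDescription", "hasEducation", "Education"),
    ("JobDescription", "hasCareerLevel", "CareerLevel"),
    ("JobDescription", "hasJobType", "JobType"),
    ("Requirements", "hasExpertiseLevel", "ExpertiseLevel"),
]

def domain_classes(schema, root="JobDescription"):
    """Classes reachable from the root class via a simple transitive walk."""
    reach, frontier = set(), {root}
    while frontier:
        cls = frontier.pop()
        reach.add(cls)
        frontier |= {o for s, _, o in schema if s == cls and o not in reach}
    return reach

print(sorted(domain_classes(SCHEMA)))
```

Note how JobPosition points to JobDescription but not vice versa, mirroring the reuse of a single job description across multiple positions.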
3.4 Job Query Transformation
Users post natural language queries for the retrieval of relevant jobs. NLQ queries are not machine-understandable, as they are raw plain English text; they are therefore transformed into SPARQL for execution against the machine-understandable knowledge-base. The proposed job search solution is designed to answer user questions posed in natural language against an RDF store of job descriptions. It enables users to explore the RDF-annotated data without knowing SPARQL or the underlying job ontology. Sem-QAS (Semantic Question Answering Solution) is designed as a hybrid of a pattern storage approach and a dynamic SPARQL triple generation approach, making use of the ontologies and linguistic analysis of the input query. Its most distinctive features are (1) the use of atomic filtering constraints to generate SPARQL query triple patterns without depending on a back-end question database and (2) the semantic association of the generated triple patterns according to the user's intent for the dynamic generation of complex SPARQL queries.
The transformation process for an NLQ starts with the identification of named entities, such as skills, location, experience, and expertise level. The identification of named entities uses the same dictionary discussed in Section 3.2. After the identification and extraction of named entities, the next step is to match the entities with the respective query template(s); each identified named entity replaces the corresponding ontological concept in the template with its actual value. Job query transformation and retrieval are discussed in detail in Chapter 6.
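As a minimal illustration of template-based NLQ-to-SPARQL transformation: the template, prefix, and property names below are invented for the sketch and do not reflect the actual Sem-QAS templates or ontology schema.

```python
import re

# One illustrative query template; ':hasRequirement', ':hasSkill' and
# ':minYears' are assumed property names, not the thesis's real schema.
TEMPLATE = (
    'SELECT ?job WHERE {{ ?job :hasRequirement ?r . '
    '?r :hasSkill "{skill}" ; :minYears ?y . FILTER(?y >= {years}) }}'
)

def nlq_to_sparql(question):
    """Map a question like 'jobs requiring 3 years of java' to SPARQL."""
    m = re.search(r"(\d+)\s+years?\s+of\s+(\w+)", question, re.I)
    if not m:
        return None  # question does not match this template's entity pattern
    years, skill = m.group(1), m.group(2).lower()
    return TEMPLATE.format(skill=skill, years=years)

print(nlq_to_sparql("Find jobs requiring 3 years of Java"))
```

A real system would select among many templates based on the recognized entity types; this sketch shows only the slot-filling step for a single template.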
3.5 Summary
This chapter briefly discussed the logical divisions of the proposed framework. An overview of each component was presented, with details to follow in subsequent chapters.
Chapter 4
SAJ Framework
The purpose of this chapter is to discuss in detail the proposed extraction, enrichment and context building framework for job descriptions, named SAJ. Each component of the SAJ framework is discussed in detail, followed by its evaluation.
4.1 Introduction
SAJ extracts, enriches and builds the context of the information that exists in job descriptions in e-recruitment by exploiting Linked Open Data, a job description domain ontology, and a domain-specific dictionary. SAJ enriches the extracted information to minimize information loss in the extraction process. It is an overall framework that encapsulates various processes to achieve extraction, enrichment and context building of data from job descriptions in e-recruitment.

Fig 4.1 shows a pictorial representation of the extraction, enrichment and context building approach. First, the raw plain English text is segmented into predefined categories using a self-generated dictionary. Linguistic analysis and the dictionary help in identifying entities in the text. The extracted entities are processed in parallel by the context builder and the enrichment component. The knowledge-base stores the context-aware and enriched entities.

Figure 4.1: High level block diagram for the proposed framework SAJ

Extracting annotations from unstructured text is non-trivial and challenging work (Malik et al., 2010b). The SAJ technique not merely extracts entities from the job description text but also enriches them, contrary to existing e-recruitment systems (Buttinger et al., 2008; Roman et al., 2015). The following sections discuss the SAJ framework in detail.
4.2 Segmentation
Segmentation is the process of categorizing the text in a job description. The primary objective of segmentation is to ensure that the extracted entities are correct and belong to the correct text segment. The starting and ending index locations of each text segment are marked.

Segments such as job title, requirements, responsibilities, and career level are identified in the job description text. At present, a dictionary-based approach (Sen et al., 2012) is adopted for text categorization. The dictionary contains an extensive list of possible rules and heading values that can occur in a job description, and its rules ensure that each split is correct along with its category. Fig 1.1 shows the categories, such as location, job title, requirements, and responsibilities of a job description, that are identified by the segmentation process. The next section discusses the dictionary in detail.
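As a rough illustration of this dictionary-driven segmentation (the heading list below is an assumption for the sketch; the real dictionary holds an extensive JAPE rule set):

```python
# Illustrative dictionary of segment headings; the real dictionary is a set
# of JAPE rules and gazetteer lists, not this hard-coded mapping.
HEADINGS = {"job title": "job_title", "requirements": "requirements",
            "responsibilities": "responsibilities", "career level": "career_level"}

def segment_job_description(text):
    """Split raw job description text into labelled segments by heading lines."""
    segments, current = {}, "preamble"
    for line in text.splitlines():
        key = line.strip().rstrip(":").lower()
        if key in HEADINGS:
            current = HEADINGS[key]          # a heading line starts a new segment
            segments[current] = []
        elif line.strip():
            segments.setdefault(current, []).append(line.strip())
    return segments

doc = "Java Developer\nRequirements:\n2+ years of Java.\nResponsibilities:\nWrite code."
print(segment_job_description(doc))
```

A rule-based version would additionally record the start and end character indices of each segment, as the text above describes.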
4.3 Dictionary
The purpose of the dictionary is to assist in text segmentation and entity extraction. The dictionary is a combination of rules designed for identifying segments and entities in a job description. The rules are written using the grammatical syntax of the Java Annotation Patterns Engine (JAPE). Table 4.1 shows sample rules for segmentation and entity extraction, along with natural language descriptions for comprehension.
Table 4.1: Sample rules for segmentation and extraction with description

Segmentation rules:
  text.sentence.index == 1      Job title is the first line of text
  text.sentence.token < 4       A heading line has no other text

Extraction rule:
  Rule:expDurationForSkill
  ({Token.kind==number}
   {Token.string=="+"}{SpaceToken}
   ({Token.string=="years"}
    |{Token.string=="yrs"})):exp -->
  :exp.ExpDuration = {rule = "expDurationForSkill"}

  The rule detects the experience duration for a skill, e.g., "2+ years of
  experience is required in Java".
The rules in the dictionary are manually designed using the JAPE grammar and have been validated by domain experts. The dictionary comprises two types of files: (1) JAPE rule files and (2) gazetteer lists. Each rule in a JAPE file has two parts, a left-hand side (L.H.S) and a right-hand side (R.H.S). The L.H.S contains the inputs, i.e., the annotation patterns to be identified; an annotation pattern comprises regular expressions and operators (e.g., *, ?, +). The R.H.S is the rule outcome, i.e., one or more annotations to be created based on the L.H.S. The rules in the dictionary are not all applied at once; rather, they are applied in order of priority, as specified by each rule's priority parameter. This reduces the chance of false positives and also provides a way to catch errors that arise during requirement boundary identification. The file extension of the rule files is .jape. The gazetteers are specialized lists of concepts that serve as input to the rules for inferring or calculating output values; these values can be input for another rule or for a concept classification, such as job requirement, job responsibilities, or expertise level.
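The priority-driven rule application described above can be emulated outside GATE. The sketch below is a Python approximation of JAPE-style pattern/action rules; the rule names echo Table 4.1, but the tiny gazetteer and regex patterns are invented for illustration:

```python
import re

# Priority-ordered pattern/action rules emulating JAPE behaviour (sketch only;
# the real dictionary uses JAPE grammar files and gazetteer lists).
RULES = [
    # Higher priority fires first: the duration pattern runs before bare skills.
    {"name": "expDurationForSkill", "priority": 100,
     "pattern": re.compile(r"\d+\+?\s*(?:years|yrs)"), "label": "ExpDuration"},
    {"name": "skillFromGazetteer", "priority": 50,
     "pattern": re.compile(r"\b(java|python|sql)\b", re.I), "label": "Skill"},
]

def annotate(text):
    """Apply rules in descending priority, collecting (label, match, offset)."""
    annotations = []
    for rule in sorted(RULES, key=lambda r: -r["priority"]):
        for m in rule["pattern"].finditer(text):
            annotations.append((rule["label"], m.group(0), m.start()))
    return annotations

print(annotate("2+ years of experience required in Java."))
```

In JAPE proper, a fired rule can also consume its span so lower-priority rules do not re-match it; this sketch omits that detail.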
4.4 Linguistic Analysis and Extraction
The linguistic analysis and entity extraction component has two purposes: (1) analysis of the text to identify sentences, parts of speech (POS) and compound words, and (2) identification of entities.
4.4.1 Linguistic Analysis
Linguistic analysis is the process of understanding the text. It covers sentence identification, tokenization, lemmatization and cleaning, and part-of-speech (POS) tagging. Sentences are extracted from the job description text after segmentation. After splitting the text into sentences, each word/token is marked with a particular POS tag, such as noun, adjective, or verb. POS tagging uses the Penn TreeBank 1 tag set, with tags such as JJ for an adjective and NN for a singular noun. Identifying the POS tags of the words/tokens in sentences lays the ground for identifying compound words (Nabeel Ahmed, 2008), such as "software development".

1https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Table 4.2: Compound words identification and extraction rules

Rules                                                 Description
∀a, a ∈ N                                             Every compound word contains a noun
∀a1, a2 ∈ N, if succeed(a1, a2) → join(a1, a2)        A noun succeeded by a noun: the terms are joined
∀a, b, where a ∈ N ∧ b ∈ A,                           A noun succeeded by an adjective: the noun term is
  if succeed(a, b) → term(a) ∧ drop(a)                saved and the adjective is compared with the next token
Compound word extraction uses a set of rules. The rules are designed using knowledge of word construction from English dictionaries and an in-depth analysis of scientific and technical English texts. English literature experts validated the rules. A few rules are shown in Table 4.2 with explanations; the detailed rules are available in (Awan, 2009).
Besides the identification of compound words, POS tags also help in identifying cardinals, such as '1' or 'three'. The identification of cardinals has an impact on text search results, e.g., "java with two years of experience". Here, two is a cardinal that represents the level of experience for the skill Java, and identifying it helps to match the experience mentioned in the job description with a user query/profile. After the text is marked with POS tags, compound words and cardinals, entity extraction is next in the pipeline.
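The noun-joining rule from Table 4.2 can be sketched over (token, POS-tag) pairs. This simplified illustration implements only the noun-noun join, not the full rule set of (Awan, 2009):

```python
# Sketch of the noun-noun joining rule from Table 4.2, applied to
# (token, POS-tag) pairs using Penn TreeBank tags (NN* = noun).
def compound_words(tagged):
    """Join runs of two or more successive nouns into compound words."""
    compounds, current = [], []
    for token, tag in tagged:
        if tag.startswith("NN"):
            current.append(token)            # successive nouns accumulate
        else:
            if len(current) > 1:
                compounds.append(" ".join(current))
            current = []
    if len(current) > 1:                     # flush a trailing noun run
        compounds.append(" ".join(current))
    return compounds

tagged = [("software", "NN"), ("development", "NN"), ("is", "VBZ"),
          ("fun", "JJ")]
print(compound_words(tagged))  # ['software development']
```

The adjective rule from Table 4.2 would extend this by holding an adjective back and comparing it with the following token before deciding to join or drop it.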
4.4.2 Entities Extraction
Entity extraction is the process of extracting important information, such as places, organizations, names, and monetary amounts, from unstructured text. Fig 4.2 shows domain-specific entities for e-recruitment from a job description, such as expertise level and skills.
Figure 4.2: Extracting context-aware requirement entity from a job description
The extraction of entities from a job description is non-trivial and challenging. Entity extraction faces challenges such as similar information being described with varying nomenclature, and the existence of contextual entities. Table 4.3 shows how a requirement for the skill Java with 2 or more years of experience can be represented with varying nomenclature.
Table 4.3: Sample text showing nomenclature variation in job descriptions
1. 2+ years of experience required in Java.
2. Must have worked at least 2 years in java development.
3. Experience of 2+ years is required in java development.
This type of variation in the text makes it challenging to extract information with minimal information loss. The other problem, as discussed, is that of contextual entities. Entities extracted from a job description are of two types: (1) entities that are directly identified from the text, such as job title, skills, organization, and location, and (2) entities that are determined based on the occurrence of other entities, such as job requirements, job responsibilities, and job and skill experience. In Fig 4.2, Skill and Expertise Level are directly identified entities, whereas the requirement is a contextual entity identified based on the skill and its expertise level. The steps to cater for these problems are described next.
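The nomenclature variation and the contextual pairing of a skill with its experience level can be illustrated with a small extractor. The regex below is a hedged stand-in for the framework's JAPE rules and covers only the three sample phrasings from Table 4.3:

```python
import re

# Single pattern covering the three phrasings in Table 4.3. This regex is an
# illustrative stand-in; the actual framework uses JAPE pattern/action rules.
PATTERN = re.compile(
    r"(?:at\s+least\s+)?(\d+)\+?\s*(?:years?|yrs)\b.*?\bin\s+(\w+)", re.I)

def extract_requirement(sentence):
    """Return {'skill', 'min_years'} when a skill-experience pair is found."""
    m = PATTERN.search(sentence)
    if not m:
        return None
    return {"skill": m.group(2).lower(), "min_years": int(m.group(1))}

samples = [
    "2+ years of experience required in Java.",
    "Must have worked at least 2 years in java development.",
    "Experience of 2+ years is required in java development.",
]
for s in samples:
    print(extract_requirement(s))
```

All three surface forms collapse to the same contextual requirement entity, which is exactly the pairing the framework builds between a skill and its expertise level.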
4.4.2.1 Basic Information Extraction
The entities that are easily and readily available for extraction are basic entities.
Job title, location, career level, and organization are the basic entities for a job
description. Table 4.4 shows each of these entities with examples.
Table 4.4: Basic entities along with examples from a job description in SAJ

Entity        Example
Job Title     Java Software Engineer
Location      St. Louis, MO
Career Level  Mid-Level
Organization  Google Inc.
A job must have a job title as a mandatory entity, but others are optional.
The extraction of basic information is carried out via a hybrid approach using
heuristics and rules from the dictionary. In the case of a job title, when multiple
job titles are detected, the first-line heuristic mentioned earlier in Table 4.1 is
applied. The position of the job title plays a vital role. Another aspect that needs
much deliberation is the use of special characters in job titles, which can cause
wrong boundary detection of a job title entity.
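The first-line heuristic and the special-character boundary issue can be sketched as follows. This is a simplified stand-in for SAJ's hybrid dictionary-and-heuristics approach, and the separator list is an assumption:

```python
def extract_job_title(description):
    """First-line heuristic: assume the first non-empty line holds the title,
    and trim it at special characters that distort the entity boundary."""
    for line in description.splitlines():
        line = line.strip()
        if line:
            # Cut the title at the first boundary-breaking special character.
            for sep in ("-", "(", "/", ","):
                idx = line.find(sep)
                if idx > 0:
                    line = line[:idx]
            return line.strip()
    return None

print(extract_job_title("Java Software Engineer (Mid-Level)\nGoogle Inc.\nSt. Louis, MO"))
```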
4.4.2.2 Requirements Extraction
Requirements are contextual entities that define essential skills or capabilities that
an employer seeks in a potential candidate. For example, the Job Requirements
segment in Fig 1.1 shows requirements in the job description. The extraction of
requirements is vital due to its significance for both employers and candidates.
Requirements are not plain basic entities; rather, they are basic entities in a
specific context. For example, a skill java has various expertise levels, such as
novice and proficient. Here the skill and its expertise level together make a single
requirement as they occur in a specific context, as shown in Fig 4.2.
In this process, there are two main steps: (1) identification of the requirement
boundary, and (2) identification of entities. The requirement boundary is the start
and end of a requirement. Table 4.5 shows a sample rule that marks the boundary
for a requirement. The rule uses POS tags in combination with words in a sentence
to mark a boundary for the requirement. After identification of the requirement
boundary, the next step is to identify the actual requirement. An essential
aspect of a rule is also setting its priority. Priorities are set in rules using the
inherited JAPE property Priority.
Table 4.5: Sample rule for boundary detection for requirement using JAPE in SAJ
Rule Description
Rule:requirementboundarymarker
Priority: 100
{Lookup.majorType==Req_BeginKeywords}
({SpaceToken})[0,2]
({Token.category==IN}| {Token.category==TO}|
{Token.category==VBG}| {Token.category==VB}|
{Token.category==VBZ}|{Token.category==DT})?
((({SpaceToken})[0,3] ({Token.kind==word,
!Lookup.majorType==Req_NotAfterKeywords}
|{Token.kind==symbol}| {Token.kind==number}|
{Token.kind==punctuation,
!Token.string=="."}) )+)
: req --> :req.Requirement =
{rule = "requirementboundarymarker"}
This rule detects the boundary of the requirement. It detects the token categories
as POS. The tokens are either verbs (VBZ, VBG, VB), determiners, or prepositions.
Besides POS, the placement of requirement keywords in a sentence is verified.
The primary purpose of setting a priority is to define the execution order of
rules. The result obtained from one rule is input to the next rule. A higher value
of priority defines a higher order of execution of the rule. The rule in Table 4.5
has priority set to 100, meaning it is the first rule that will be executed and will
detect a start and end boundary for a requirement.
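The boundary-marking idea behind the rule in Table 4.5 can be approximated outside GATE as follows. This is only a sketch: the keyword list below is a small assumed sample, not the dictionary's actual Req_BeginKeywords list, and the real JAPE rule additionally checks POS categories:

```python
# Illustrative analogue of the JAPE boundary rule: a requirement span starts
# at a begin-keyword from the dictionary and runs to the first full stop.
REQ_BEGIN_KEYWORDS = ["proficiency", "experience", "knowledge", "must have"]

def mark_requirement_boundary(sentence):
    """Return the text span marked as a requirement, or None."""
    low = sentence.lower()
    for kw in REQ_BEGIN_KEYWORDS:
        start = low.find(kw)
        if start != -1:
            end = sentence.find(".", start)
            if end == -1:
                end = len(sentence)
            return sentence[start:end]  # the marked requirement span
    return None

print(mark_requirement_boundary(
    "Proficiency in Object Oriented Programming in Java. Other text."))
```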
Table 4.6: Sample rule for job requirement using JAPE in SAJ
Rule Description
Phase: requirementSubParts
Input: RequirementsBeg Token
RequirementsNot RequirementsMid
RequirementsEnd Skill Split
ToolsAndTechnology OperatingSystem
Database Course TechnicalLanguage Protocol
ExpertiseLevel MandatoryConditionTrue
MandatoryConditionFalse ExpDuration
Options: control = appelt
Rule:requirementSubPartsStart
Priority: 50
{RequirementsBeg} (((({Token})* | {Skill}
| {ToolsAndTechnology} | {OperatingSystem}
| {Database} {Course} | {TechnicalLanguage}|
{Protocol} |{ExpertiseLevel}|
{MandatoryConditionTrue} |
{MandatoryConditionFalse}|
{ExpDuration} |{ExpertiseLevel}))+{Split} )
:req --> :req.Requirements =
{rule ="requirementSubPartsStart"}
This rule applies various lists in the dictionary, such as ToolAndTechnology,
OperatingSystem, Database, Course, TechnicalLanguage, and others to detect the
entities. Besides extracting entities, the rule also detects the experience duration
and expertise level. The rule is dynamic, i.e., the placement of these entities in a
sentence will not affect the rule.
The contextual entity requirements are extracted from a job description using
pattern/action rules defined in the dictionary. The rules identify entities from
unstructured text that constitute the requirements for a job description. Con-
sider the requirement, Proficiency in Object Oriented Programming in Java and
Groovy+Grails for Web-based application. The rule in Table 4.5 will mark the
boundary of the requirement, Proficiency in Object Oriented Programming in Java,
as it is a high-priority rule with value 100. Table 4.6 shows a pattern/action rule
with priority 50. This rule extracts skills, such as Java, and is executed after the
rule with priority 100 mentioned in Table 4.5.
The dictionary defines the domain knowledge for requirement identification.
The rule in Table 4.6 uses various lists, such as skill, database, course, and
technical knowledge, for the extraction of requirements. A sentence not satisfying
the rule is discarded.
4.4.2.3 Responsibilities Extraction
Responsibilities are the duties that an employee performs while employed in an
organization, as shown in Fig 1.1. Responsibilities are a non-mandatory text
segment of a job description. Sometimes they are described along with job
requirements. Sometimes responsibilities are defined using similar entities, such
as must have a knowledge of AWS cloud and will manage AWS cloud. In this
example, the first statement is a job requirement, whereas the second statement is
a job responsibility. It is difficult to draw a clear line of distinction between a job
responsibility and a job requirement. After detailed analysis and experimentation
on a real-world data-set, SAJ was successful in segregating job requirements from
job responsibilities.
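A minimal sketch of this segregation is a cue-phrase classifier. The cue lists below are illustrative assumptions; SAJ itself uses its dictionary lists and prioritized JAPE rules:

```python
# Assumed cue phrases, not the thesis dictionary lists.
REQUIREMENT_CUES = ("must have", "required", "proficiency in", "experience of")
RESPONSIBILITY_CUES = ("will manage", "responsible for", "will develop", "duties include")

def classify(sentence):
    """Label a sentence as a requirement, a responsibility, or unknown."""
    low = sentence.lower()
    if any(cue in low for cue in REQUIREMENT_CUES):
        return "requirement"
    if any(cue in low for cue in RESPONSIBILITY_CUES):
        return "responsibility"
    return "unknown"

print(classify("Must have a knowledge of AWS cloud"))  # requirement
print(classify("He will manage AWS cloud"))            # responsibility
```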
Table 4.7: Sample rule for detecting responsibilities from a job description in SAJ
Rule Description
Phase: Responsibility
Input: Lookup Token SpaceToken
Options: control = appelt
debug=true
Rule:keywordResponsibility
Priority: 10
{Lookup.majorType==Responsibilty_BeginKeywords}
({SpaceToken})[0,2] ({Token.category==IN}|
{Token.category==TO}|{Token.category==VBG}|
{Token.category==VB}|{Token.category==DT} |
{Token.category==NN}|{Token.category==NNS})?
((({SpaceToken})[0,3]({Token.kind==word,
!Lookup.majorType==Res_NotAfterKeywords} |
{Token.kind==symbol}|{Token.kind==number}|
{Token.kind==punctuation,
!Token.string=="."}))*)
:req --> :req.Responsibility =
{rule = "keywordResponsibility"}
This rule detects the boundary of the responsibility. It detects the token categories
as POS. The tokens are either verbs (VBZ, VBG, VB), determiners, or prepositions.
Besides POS, the placement of responsibility keywords in a sentence is verified.
Table 4.7 shows a rule for detecting the responsibility boundary in a job
description. The sample rule extracts responsibilities using domain background
knowledge from a dictionary and the morphological sentence structure. The rule
has a low priority of 10. It uses the Responsibilty_BeginKeywords list in the
dictionary along with POS tags and word kinds to identify responsibility boundaries.
Figure 4.3: Sample of educational requirement in a job description
4.4.2.4 Education Extraction
Education defines mandatory or minimal qualification required for a job.
Table 4.8: A sample rule for education extraction from a job description in SAJ
Rule Description
Rule:degreeextractioninfull
Priority: 40
({Degree}
{Token.string=="in"}({Token.category==NNP,
!Lookup.majorType==date})+{Token.string=="and"}
({Token.category==NNP,!Degree})+ ):Degree
--> :Degree.FullDegree
={rule="degreeextractioninfull"}
This rule extracts an educational requirement categorized as Degree. It uses the
POS tag (NNP) and degree dictionaries. The rule also verifies the existence of the
token "in".
Education has four categories: degree, diploma, training, and certification.
These categories will be useful during the matching of a job with a profile. Fig 4.3
shows an educational requirement for a job, i.e., BS/MS in Computer Science.
Table 4.8 shows a rule to extract the educational requirement as a Degree. The
sample rule identifies educational entities that contain the token in. In addition
to token identification, the Degree list is used with the POS tag to correctly detect
the educational requirement.
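The effect of the rule in Table 4.8 can be sketched with a regular expression. The degree list and pattern below are assumptions for illustration; the actual rule works over the Degree dictionary and NNP POS tags:

```python
import re

# Assumed degree abbreviations, standing in for the Degree dictionary list.
DEGREES = r"(?:BS|MS|BSc|MSc|PhD|MBA)"
DEGREE_RE = re.compile(
    rf"({DEGREES}(?:\s*/\s*{DEGREES})*)\s+in\s+([A-Z][A-Za-z ]+)")

def extract_degree(text):
    """Extract degree abbreviations and the field of study, mirroring the
    'Degree ... in ... Field' pattern (e.g., BS/MS in Computer Science)."""
    m = DEGREE_RE.search(text)
    if not m:
        return None
    return {"degrees": [d.strip() for d in m.group(1).split("/")],
            "field": m.group(2).strip()}

print(extract_degree("Candidates should hold a BS/MS in Computer Science."))
```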
4.5 Context Builder
The extracted entities are forwarded to the context builder and enrichment module
in parallel.
Figure 4.4: Graph structure showing entities and connections in SAJ
The context builder creates relationships (both hierarchical and associative)
among extracted entities using a job description ontology, as shown in Fig 5.2.
The job description ontology is designed using job posting schema from schema.org
2 and job description domain studies from various existing job portals discussed
above. HR domain experts evaluated the ontology schema concepts and relation-
ships for validating the domain coverage of job description ontology. The details
of the Job Description Ontology are available in (Ahmed et al., 2016).
The job description ontology provides a schema for structuring and building
the context of extracted entities, as shown in Fig 4.4. The core schema classes
2https://schema.org/JobPosting
are Job Description, Job Title, Requirements, Education, Career Level, and Job
Type. Some of the core properties are job description, requirements, job type,
education, and job title. The ontology defines not only hierarchical relationships
but also associative relationships, such as skos:altLabel, owl:sameAs, and others.
Fig 4.4 represents the requirements of a job description in an ontological model
along with all its semantics. A relationship exists between a skill and an expertise
level in the requirement. The relationships are not automatically extracted from
the job description text; instead, the job description ontology already defines these
relationships. The context builder uses entity types, such as skill, job requirement,
expertise level, career level and others for identification of relationships.
For example, S1 is an instance of the Skills class, which is an intermediary node
connecting the Skill instance Object Oriented Programming and the Expertise
Level instance Proficiency. The intermediary node S1 is then connected to R1, an
instance of the Requirement class, which is connected to a Job Description
instance JD1.
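The structure described above can be sketched as a small triple store. The instance names follow the example in the text; the predicate names are simplified stand-ins for the ontology properties:

```python
# JD1 -> R1 -> S1 -> (skill, expertise level), encoded as subject-predicate-object
# triples. Predicate names are illustrative, not the ontology's exact properties.
triples = [
    ("JD1", "hasRequirement", "R1"),
    ("R1",  "hasSkills",      "S1"),
    ("S1",  "skill",          "Object Oriented Programming"),
    ("S1",  "expertiseLevel", "Proficiency"),
]

def objects(graph, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]

# Walk JD1 -> R1 -> S1 -> skill, as the context builder connects them.
r1 = objects(triples, "JD1", "hasRequirement")[0]
s1 = objects(triples, r1, "hasSkills")[0]
print(objects(triples, s1, "skill")[0])
```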
4.6 Enrichment
Enrichment is the process of adding knowledge to existing entities. The
enrichment of job description entities helps in increasing the search space and
enables better job-profile matching. The enrichment process receives its input
from the entity extraction process as a list of entities. At present, enrichment
processes only skill entities. The enrichment of skills is performed to cater for
the variation of nomenclatures for skills. The primary aim of processing skills
is to have all alternate forms, e.g., Object-Oriented Programming as OOP. The
enrichment will help SAJ in identifying Object Oriented Programming and OOP
as the same skill. The process achieves this by using Linked Open Data, as shown
in Fig 4.5.
Fig 4.4 shows a pictorial representation of inter-document concept enrichment.
The concept object oriented programming is linked to two job descriptions
with different job titles. A query that needs to search all jobs that have
object oriented programming as a requirement will thus get precise results.
Figure 4.5: Entities enrichment process using LOD in SAJ
Fig 4.5 shows a pictorial representation of the enrichment process using Linked
Open Data. Using Linked Open Data for enrichment provides up-to-date
information for entities. The process does not suffer from the traditional data
staleness problem, as LOD data is regularly updated by the open source community.
The enrichment process receives new labels from LOD based on the properties
rdfs:label and skos:altLabel, with a language condition of lang=en. These are
standard Semantic Web 3 properties and filters. The similarity is computed
between the new labels fetched from LOD and the entities extracted
from a job description. If the number of entities returned from LOD is less than
five, then all returned entities are stored; if the number exceeds five, then the
similarity is calculated using the Cosine Similarity (Thada and Jaglan, 2013).
DISCO API (Kolb, 2008) facilitates the calculation of cosine similarity. In addition
to cosine similarity, a distributional similarity is also calculated using DISCO API.
DISCO API allows for calculating the semantic similarity between arbitrary words
and phrases. It calculates the similarities using the Wikipedia 4 SIM type data
set. The data-set used for computing similarities was published in April 2013 5.
The selected entities after computing similarity are stored in the knowledge-base
using skos:altLabel.
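The cosine step of this filter can be sketched as follows, using token-count vectors. This illustrates only the similarity-threshold idea, not DISCO's distributional similarity; the threshold value is an assumption:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two label strings, as token-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_labels(entity, labels, limit=5, threshold=0.5):
    """Keep all labels when five or fewer are returned; otherwise keep only
    those sufficiently similar to the extracted entity (assumed threshold)."""
    if len(labels) <= limit:
        return labels
    return [lbl for lbl in labels if cosine_similarity(entity, lbl) >= threshold]

print(cosine_similarity("object oriented programming",
                        "object oriented programming language"))
```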
4.6.1 Knowledge Base
The knowledge base is responsible for storing the data. It receives data from the
context builder and the enrichment process. After integrating the data, the
knowledge-base stores the information as a graph structure using the job
description ontology. Fig 4.6 shows the N3 notation of a job description in the
knowledge-base.
3 https://www.w3.org/OWL/
4 https://www.wikipedia.org/
5 https://www.linguatools.de/disco/discodownload en.html
Figure 4.6: N3 notation of a job description in knowledge-base
Fig 4.6 represents a single job description. The snapshot in Fig 4.4 visualizes
two job descriptions, JD1 and JD2. Both job descriptions have the same skill
requirement, Object Oriented Programming, but their expertise levels are
different. The graph structure representation of the knowledge base connects the
same instance Object Oriented Programming to all Skill instances with different
expertise levels. This structure of the knowledge base becomes more resourceful
when exploring a query in a graph, such as find all jobs which have a requirement
of object-oriented programming.
4.7 Evaluation
The evaluation rationale of the proposed system originates from its primary
objectives, i.e., to design an extraction and transformation methodology for the
identification of entities and compound words from e-recruitment content and to
build an information context, and to design an enrichment methodology that
enriches entities and compound words from Linked Open Data to cater for data
staleness in e-recruitment. The information extracted by SAJ should have minimal
information loss and a larger search space, and should adhere to the Linked Open
Data principles. The current evaluation tries to achieve all the aspects mentioned
above.
CHAPTER 4. SAJ FRAMEWORK 47
Table 4.9: Statistics of jobs collected from various e-recruitment systems

Source             Descriptions
Personforce.com 6          101
DBWorld 7                  139
Indeed.com 8               620
Total                      860
4.7.1 Data-set Acquisition
At present, no standard gold data-set exists for job descriptions. The data-set was
self-collected from various e-recruitment systems and a community mailing list. A
total of 860 job descriptions have been collected. Table 4.9 shows the sources along
with the statistics of the job descriptions collected from each source. A
self-developed automatic crawler collected data from Indeed and DBWorld. Indeed
provides a REST API to fetch data. Personforce provided data as an industrial
partner.
The collected job descriptions belong to multiple categories, as shown in
Table 4.10. These categories range from information technology to management
to health care. The job descriptions were collected at random and then placed in
these predefined categories. The random selection was carried out to ensure that
the data-set is not biased and instead contains jobs from multiple domains and
disciplines.
The collected data-set was evaluated by Human Resource (HR) Experts who
had more than five years of experience working in the area of human recruitment
and staffing. The primary entities selected for evaluation after discussion with
HR experts were job title, job responsibilities, job requirements, job category,
and education, such as degree, diploma, training, or certification. These selected
entities play a pivotal role in job descriptions. The results of entity extraction
from job descriptions are compared with the manually verified data from HR
experts.
Table 4.10: Statistics of job descriptions in various job categories collected randomly

Job Category                             Count
Engineering and Technical Services          55
Business Operations                         20
Computer and Information Technology        125
Internet                                    73
Project Management                          85
Health-care and Safety                       9
Arts, Design and Entertainment              26
Sales and Marketing                         38
Office Support and Administrative          203
Architecture and Engineering                10
Construction and Production                  9
Customer Care                               21
Management and Executive                    22
Financial Services                           9
Government and Policy                        6
Post-doctoral                               45
Research and Teaching                       66
Others                                      38
Total                                      860
4.7.2 Evaluation Metrics
The evaluation has been carried out using the standard metrics of recall, precision,
and F1-measure, as shown in equations 4.1, 4.2, and 4.3. An error analysis
(Powers, 2011) is also provided for more solid grounds of evaluation.

Recall = |relevant jobs ∩ retrieved jobs| / |relevant jobs|    (4.1)

Precision = |relevant jobs ∩ retrieved jobs| / |retrieved jobs|    (4.2)

F1-Measure = (2 · precision · recall) / (precision + recall)    (4.3)
Besides recall, precision, and F1-measure, the overall system accuracy and error
rate are also calculated. The error rate quantifies the inaccurate extractions, i.e.,
one minus the accuracy.

Accuracy = (tp + tn) / (tp + tn + fp + fn)    (4.4)

Error Rate = 1 − Accuracy    (4.5)
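As a small worked example of equations 4.1 to 4.5, using illustrative counts rather than the thesis data:

```python
# Illustrative confusion-matrix counts (assumed, not the thesis results).
tp, fp, fn, tn = 90, 10, 12, 88

precision = tp / (tp + fp)                           # eq. 4.2
recall = tp / (tp + fn)                              # eq. 4.1
f1 = 2 * precision * recall / (precision + recall)   # eq. 4.3
accuracy = (tp + tn) / (tp + tn + fp + fn)           # eq. 4.4
error_rate = 1 - accuracy                            # eq. 4.5

print(round(precision, 3), round(recall, 3), round(f1, 3),
      round(accuracy, 3), round(error_rate, 3))
```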
4.7.3 Evaluation Results
Table 4.11 shows the results of the entity extraction process. The table shows the
recall, precision, and F1-measure values for various entity types. These values are
computed by comparison against the gold standard, the data-set manually verified
by HR experts. Education has the highest recall, i.e., 99.9%, whereas job title has
the highest precision, 99.9%. Overall, job title had the highest F1-measure value
of 95.00%. This table shows only the proposed system's evaluation results against
the gold standard. The next section discusses the comparison with other systems.
Table 4.11: Results of entities extraction from job description in SAJ
S.No.  Entity Type       Precision  Recall  F-Measure

1      Requirements        90.50     87.90    88.76
2      Responsibilities    76.14     75.00    75.76
3      Education           38.00     99.90    55.05
4      Job Title           99.90     90.67    95.00
5      Job Category        79.24     97.67    87.50
Besides making a comparison on the basis of the standard parameters of precision,
recall, and F1-measure, an accuracy vs. error comparison is also performed for SAJ
to give a clear idea of how well or poorly SAJ performs. From the graph in
Fig 4.7, it is quite evident that education has an error rate of zero. The 99.9%
accuracy is due to the low variation in the education entity. The system overall
has an accuracy of 94% and an error rate of 6%.
Figure 4.7: Evaluation of extraction comparison of accuracy vs error for SAJ
4.7.4 Comparative Analysis
This sub-section presents the result comparison among SAJ, OpenCalais 9 and
Alchemy API 10. Both OpenCalais and Alchemy API are industry leaders for
information extraction.
All systems were able to extract job titles, as shown in Fig 4.8. The comparison
parameters are precision, recall, and f-measure.
9 http://www.opencalais.com/about
10 http://www.alchemyapi.com/about-us
Figure 4.8: SAJ, Alchemy API and OpenCalais extraction comparison for job titles
From the graph, it is evident that SAJ performs well compared to OpenCalais
and Alchemy API. SAJ achieved an overall precision of 98.1%, compared to 39%
for OpenCalais and 34.32% for Alchemy API. The other entity that OpenCalais
was able to extract was requirements. Alchemy API was unable to extract
requirements. The graph in Fig 4.9 shows a comparative analysis of the
requirement entity between SAJ and OpenCalais. From the graph in Fig 4.9, it is
evident that SAJ has a much higher precision of 90.5%, compared to OpenCalais's
42.78%. OpenCalais has a recall of 76.1%, whereas SAJ has a recall of 87.09%.
OpenCalais and Alchemy API were not able to extract education, responsibilities,
and job category.
Figure 4.9: SAJ and OpenCalais extraction comparison for requirements
Therefore, no comparison is presented for these named entities, as OpenCalais
and Alchemy API were not able to extract them, those being domain-specific
entities. Fig 4.10 shows a comparison of the evaluation metrics, that is, precision,
recall, and F1-measure, against the built ground truth for the extraction of various
entity types.
Figure 4.10: Comparison of precision, recall and f1-measure with ground truth
4.8 Summary
In this research, the SAJ extracts context-aware information from job descriptions
by exploiting Linked Open Data, job description domain ontology, and domain-
specific dictionaries. SAJ enriches and builds context among extracted entities to
minimize the information loss in the extraction process. SAJ encapsulates various
processes together to achieve context-aware information extraction and enrichment
from the job description in e-recruitment. SAJ segments the text into predefined
categories using a self-generated dictionary. Natural Language Processing (NLP)
and the dictionary help in the identification of entities. The extracted entities are
enriched
using Linked Open Data, and job context is built using a job description domain
ontology. The knowledge-base stores the enriched and context-aware information
built using Linked Open Data principles. The data-set comprises 860 jobs,
verified by HR experts. The initial assessment is carried out by comparing the
manually verified data and the system-extracted entities. The SAJ framework
achieved an overall F1-measure of 87.83% in entity extraction.
In comparison with other systems, such as OpenCalais and Alchemy API, SAJ
performed better than both. OpenCalais was able to extract job titles and job
requirements, while Alchemy API was only able to extract job titles. Both
OpenCalais and Alchemy API were not able to extract education, responsibilities,
and job category, those being domain-specific entities. SAJ can facilitate the
searching and retrieval, and the scoring and ranking, of human candidates.
Chapter 5
Job Description Ontology
The purpose of this chapter is to discuss in detail the proposed job description
ontology. The job description ontology describes the underlying semantics and
structure of a job. The job description ontology provides a comprehensive schema
for defining the relationships among concepts to logically connect them.
Section 5.1 defines the ontology design methodology, Section 5.2 discusses the
expressiveness of the ontology, Section 5.3 discusses in detail the job description
ontology, and finally the evaluation is discussed in Section 5.4.
5.1 Ontology Design Methodology
The job description ontology has been designed by following the Uschold and King
Enterprise Methodology (Uschold and King, 1995). The Uschold and King
Enterprise Methodology presents in detail the way to build an ontology effectively
and efficiently. The methodology is targeted at ontology developers and engineers.
The Uschold and King Enterprise Methodology comprises the following steps.
CHAPTER 5. JOB DESCRIPTION ONTOLOGY 55
1. Define purpose: The ontology purpose defines the scope and granularity of
the ontology. Various aspects to cover here are vocabulary definition, meta-
level specification, and ontology re-use.
2. Build ontology: During this step, the ontology developer/engineer focuses on
the identification of key concepts and their relationships, defines the ontology
in a formal language, and, if required, integrates existing ontologies.
Figure 5.1: Uschold and King enterprise methodology
3. Document ontology: This means formally documenting the ontology in some
language, i.e., RDF/RDF(S) or OWL. Formally defining the ontology will
facilitate ontology sharing among the community.
4. Evaluate ontology: This is a process to assess the performance of an ontology.
During the evaluation, all requirement specifications are carefully examined
with respect to the ontology's ability to answer questions for the purpose for
which it is built.
Fig 5.1 presents a pictorial representation of the Uschold and King Enterprise
Methodology. In the subsequent Section 5.3, each step of the Uschold and King
Enterprise Methodology is discussed in detail with respect to the job description
ontology.
5.2 Ontology Expressiveness
Expressiveness 1 is a way to define a concept more effectively. Ontology
specification languages mainly focus on abstracting away from data structures
and implementations. They mostly focus on the semantic level of information
rather than the logical/physical level of information.
Table 5.1: DL basic expressive labels along with details
Label Description
AL   Attributive language. This is the base language, which allows:
     1. Atomic negation (negation of concept names that do not
        appear on the left-hand side of axioms)
     2. Concept intersection
     3. Universal restrictions
     4. Limited existential quantification

FL   Frame-based description language, which allows:
     • Concept intersection
     • Universal restrictions
     • Limited existential quantification
     • Role restriction

EL   Existential language, which allows:
     • Concept intersection
     • Existential restrictions (full existential quantification)
An ontology's expressiveness is defined using Description Logic (DL) 2, which is
more expressive than propositional logic 3 but less expressive than First-Order
Logic 4 (FOL). DL uses a different formalism and naming convention than FOL.
Table 5.1 shows the basic DL expressiveness labels and Table 5.2 shows
1 https://www.dictionary.com/browse/expressiveness
2 https://en.wikipedia.org/wiki/Description_logic
3 https://en.wikipedia.org/wiki/Propositional_calculus
4 https://en.wikipedia.org/wiki/First-order_logic
the DL extensions along with their details. Each label encodes an aspect of
expressiveness.
Table 5.2: DL extension expressivity labels along with details
Label  Description

F      Functional properties.
E      Full existential qualification.
U      Concept union.
C      Complex concept negation.
H      Role hierarchy.
R      Complex role inclusion.
O      Nominals.
I      Inverse properties.
N      Cardinality restrictions.
Q      Qualified cardinality restrictions.
(D)    Use of datatype properties, data values, or data types.
Based on the DL expressiveness labels, the job description ontology has an
expressiveness of ALCHOF(D).
5.3 Job Description Ontology
5.3.1 Identify Purpose
The job description ontology aims to provide granular details of concepts and the
relationships among them. It also provides concept sub-classifications in order to
enable a better understanding at a granular level. In addition, its homogeneous
and comprehensive schema resolves the semantic heterogeneity that exists in
describing job descriptions and provides a common ground for sharing them. The
job description ontology serves multiple purposes:
1. Defines the semantic structure for job description
2. Provides common grounds for knowledge sharing
3. Provides concept hierarchies using generalization/specialization
4. Defines the relationship among concepts for context building
Existing e-recruitment platforms, such as Indeed 5, Monster 6, Personforce 7,
Angel.co 8, LinkedIn 9, Career Builder 10, Glassdoor 11, SimplyHired, and many
others use heterogeneous schemas for describing a job description. No
interoperability of concepts exists among these schemas. The job description
ontology provides a homogeneous and comprehensive schema for representing the
concepts and relationships, and for sharing job descriptions among various
e-recruitment platforms.
5.3.2 Build Ontology
The design of the job description ontology has been motivated by the schema.org
model of a job posting. The job posting schema defines an outline of the elements
that must exist in defining a job position. The elements are not linked together;
instead, they are presented in a flat structure in the schema.org model. In addition,
essential elements, such as education, requirements, and responsibilities, are just
plain text instead of being defined by concrete concepts, such as requirements
being defined by skills, experience, and expertise levels. The improved and
comprehensive model of the job description ontology is a result of an in-depth
study of the HR recruitment process domain and a review of ontology design
methodologies.
5 https://www.indeed.com
6 https://www.monster.com/
7 https://www.personforce.com/
8 https://angel.co/
9 https://www.linkedin.com
10 https://www.careerbuilder.com/
11 https://www.glassdoor.com/
Figure 5.2: Job description ontology
The proposed job description ontology, as shown in Fig 5.2, has two logical
divisions: (1) job description, and (2) job position. A job description defines
aspects related to the core job, such as job title, requirements, responsibilities,
and education, whereas a job position determines elements such as post date,
last-apply date, organization name, and available positions. The primary
advantage of designing the job description ontology in this way is that the same
job description can be posted multiple times, with variations only in the values of
the properties mentioned in the job position. Fig 5.2 shows a pictorial
representation of the job description ontology. Job Description and Job Position
are the two main concepts of the ontology. A job description comprises basic
information, requirements, responsibilities, and education, whereas a job position
comprises an organization and a job opening. One job description can be posted
in multiple job positions, thus increasing its reusability.
5.3.2.1 Basic Information
Basic ontology concepts are derived from basic entities, as discussed in Section 4.4.2.1.
These concepts include job title, career level, type, experience, salary and others,
as shown in Table 5.3.
Table 5.3: Important concepts in job description ontology
Concept                  Values
title                    Software Engineer, Web Developer.
occupationalCategory     Internet, Health-care.
careerLevel              Manager, Entry Level.
type                     Full Time, Permanent.
experienceRequirements   1 Year, 2 Years.
salary                   It can be monthly or hourly.
Figure 5.3: Job description basic entities as N3 notation
Fig 5.3 shows an N3 representation of the basic entities in the job description ontology.
5.3.2.2 Requirements
Requirements are contextual entities, as discussed in the previous chapter.
Table 5.4 shows the details of the requirements present in the job description
ontology. Requirements define the baseline criteria against which the employer
judges a candidate for acceptance into the organization.
Table 5.4: Job requirements properties and description

Concept: Values
skill: Java, HTML 5.
expertiseLevel: Expert, Novice.
mandatory: True or False.
Fig 5.4 shows an N3 representation of the requirements contextual entity in the
job description ontology.
Figure 5.4: Job description requirements entity as N3 notation
5.3.2.3 Responsibilities
Responsibilities are the duties that employees perform on a job, such as:
1. responsible for managing daily sales
2. make daily reports and dispatch them to the headquarters
Fig 5.5 shows an N3 representation of the contextual entity responsibility in
the job description ontology.
Figure 5.5: Job description responsibilities as N3 notation
5.3.2.4 Education
Education defines the minimum qualification required by a job. Table 5.5 show
properties present in the job description ontology associated with education con-
cept.
Table 5.5: Job education properties and description

Concept: Values
educationTitle: BS, MS.
educationType: Degree, Certificate.
postEducationExperiance: 1 Year, 2 Years.
Fig 5.6 shows an N3 representation of the education entity in the job
description ontology.
Figure 5.6: Job description education as N3 notation
A job position defines the actual job advertised for hiring candidates. The
information associated with a job position is shown in Table 5.6.
Table 5.6: Job position properties and descriptions

Concept: Values
hiringOrganization: Google Inc.
datePosted: Morning, Evening.
jobLocation: 9 - 5, 10 - 7.
positions: 1, 2.
workingShift: morning, night.
workTiming: 9 - 6, 17 - 4.
Fig 5.7 shows an N3 representation of the job position in the job description ontology.
Figure 5.7: Job description profile as N3 notation
5.3.3 Ontology Development and Documentation
The job description ontology is developed using Protege tool. Fig 5.8 shows a sam-
ple N3 representation of the job description ontology. The ontology is documented
using Protege annotation properties that include rdfs:versionInfo, rdfs:comment,
rdfs:label, rdfs:seeAlso, rdfs:priorVersion and other properties.
Figure 5.8: A sample N3 representation of the job description ontology
5.4 Evaluation
The job description ontology is evaluated using two approaches: (1) domain
coverage and (2) application-driven evaluation. In the e-recruitment domain,
there is no gold standard for evaluating a job description ontology, and there
is no existing application that reuses one.
5.4.1 Domain Coverage
Domain coverage evaluation compares the job description ontology concepts
against concepts from different existing models in the same domain. The
ontology concepts were compared against schema.org, Indeed, HR-XML, and
PROSPECT. The comparative evaluation was based on how well each model defines
the domain. Table 5.7 shows a comparison of the job description ontology with
the other existing domain models. Based on the concepts identified for
evaluation in consultation with HR experts, the job description ontology has
96% coverage, whereas schema.org had 75%, Indeed had 38%, PROSPECT had 21%,
and HR-XML had 46% coverage. From the comparison, it is evident that the job
description ontology is more comprehensive in concept coverage than any other
model.
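The coverage percentage above is the share of expert-identified reference concepts that a model defines. A minimal sketch of that computation follows; the concept sets are small illustrative samples, not the actual evaluation set from Table 5.7.

```python
# Hedged sketch of the domain-coverage metric: the percentage of reference
# concepts (chosen with HR experts) that a candidate model defines.
def coverage(model_concepts, reference_concepts):
    """Percentage of reference concepts covered by a model."""
    covered = model_concepts & reference_concepts
    return 100.0 * len(covered) / len(reference_concepts)

# Illustrative concept sets (assumptions, not the thesis's full lists).
reference = {"title", "skills", "responsibilities", "jobLocation",
             "datePosted", "hiringOrganization", "experienceRequirements",
             "educationRequirements"}
model = {"title", "skills", "jobLocation", "datePosted", "hiringOrganization"}

print(round(coverage(model, reference), 1))  # 62.5
```

A model covering 5 of the 8 sample reference concepts scores 62.5%; running the same computation over the full expert concept list yields the percentages reported above.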
Besides comparing the ontology concepts of the various existing models with the
job description ontology, the ontology was also presented to six HR experts from
various national and international organizations for feedback. Positive
feedback was received from them in terms of ontology comprehensiveness and
domain representation.
5.4.2 Application based Evaluation
An application is built on top of the job description ontology for the
application-based evaluation. The application analyzes how well the ontology
captures real-world scenarios. The application was developed in Java, with the
graph store in GraphDB.
The evaluation is performed on a data-set of 101 job descriptions collected
from 15 different domains, such as information technology, management, and
finance, to ensure that the ontology captures all domains. The job descriptions
were stored in the knowledge base using the application. Queries were executed
on the knowledge base, and the results were analyzed. Table 5.8 shows the
queries used in the evaluation.
Table 5.7: Domain coverage of job description ontology
[Table 5.7 marks, for each concept, whether it is covered by Schema.org, the
Job Description ontology, Indeed.com, PROSPECT, and HR-XML. Concepts compared:
baseSalary, datePosted, educationRequirements, degreeRequirements,
certificationRequirements, trainingRequirements, diplomaRequirements,
employmentType, experienceRequirements, hiringOrganization,
incentiveCompensation, jobBenefits, industry, jobLocation,
occupationalCategory, qualifications, responsibilities, salaryCurrency,
skills, specialCommitments, title, validThrough, workHours, and
skillsExpertiseLevel.]
Table 5.8: Job description evaluation queries categorization

Job Titles | Requirements | Career Level
Product Manager | Word with 3+ years | Manager
Full Time Writer | Microsoft with 3+ years | Director
CTO | MySQL | Entry Level
Account Manager | AJAX | Executive
Program Manager | Java | Management
For each search query, jobs were manually categorized, and the counts were
calculated. After that, the same queries were applied to the job description
graph store for evaluation. The queries were written in the SPARQL language,
as shown in Fig 5.9. The primary purpose is to evaluate system precision. The
application compares the number of job descriptions retrieved by the
application with the number retrieved manually. Table 5.9 shows the aggregated
number of results for each query category.
Figure 5.9: A sample SPARQL query to retrieve job title labels after execution
The results for manually and system-retrieved job descriptions using the
designed application, shown in Table 5.9, are promising.
Table 5.9: Job description user retrieval summary

Category | Manual | System Retrieved
Job Titles | 25 | 25
Requirements | 33 | 33
Career Level | 45 | 45
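The precision check behind Table 5.9 can be sketched as a per-category comparison of system-retrieved counts against manually categorized counts. The code below is an illustration of that comparison, using the counts reported in the table.

```python
# Sketch of the Table 5.9 comparison: for each query category, the count of job
# descriptions retrieved by the application is checked against the manual count.
manual = {"Job Titles": 25, "Requirements": 33, "Career Level": 45}
system_retrieved = {"Job Titles": 25, "Requirements": 33, "Career Level": 45}

matches = {cat: system_retrieved[cat] == manual[cat] for cat in manual}
print(all(matches.values()))  # True
```

All three categories match exactly, which is what the table's identical Manual and System Retrieved columns express.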
5.5 Summary
The current chapter presented the proposed job description ontology for the
e-recruitment domain. The ontology design takes its inspiration from the
JobPosting schema of schema.org 12. The ontology segregates two key concepts:
job description and job position. The job description ontology was evaluated
for domain coverage. Alongside domain coverage, an application-based evaluation
was also carried out. For the domain coverage method, the base criterion was a
set of concepts identified by an HR expert team. The application-based
evaluation was carried out by designing a small in-house application for
storing and retrieving job descriptions using the ontology model.

12https://schema.org/JobPosting
Chapter 6
Sem-QA Framework
The focus of the current chapter is to present a semantic query translation
framework, Sem-QA. The focus of Sem-QA is to translate natural language queries
for searching machine-understandable data. Natural language queries let end
users define their search requirements in the format they can best describe.
The solution handles all the underlying transformation and search complexities.
6.1 The Sem-QA Framework
Sem-QA (Semantic Query Translation Framework) is a comprehensive solution for
the transformation of natural language queries into a machine-understandable
format. Its most differentiating features are (i) the use of atomic filtering
constraints to generate SPARQL query triple patterns without depending on a
back-end question database, and (ii) the semantic association of generated
triple patterns according to user intent for the dynamic generation of complex
SPARQL queries.
The research contributions of the presented technique are as follows:
• flexibility to handle grammatically incorrect and incomplete queries
• use of atomic question patterns to dynamically generate complex question
patterns
• semantic processing of broadening and narrowing terms, for instance: at
least, at most, and 3+
• production of structured output per user demands
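The broadening and narrowing terms listed above can be normalized into comparison operators before query generation. The sketch below illustrates the idea; the rule set is a hypothetical assumption, not Sem-QA's actual rules.

```python
import re

# Illustrative normalization of scope-modifier phrases ("3+", "at least",
# "at most") into comparison operators usable in a SPARQL FILTER.
def normalize_scope(text):
    text = text.lower()
    m = re.search(r"(\d+)\s*\+", text)        # e.g. "3+ years"
    if m:
        return (">=", int(m.group(1)))
    m = re.search(r"at least\s+(\d+)", text)  # broadening lower bound
    if m:
        return (">=", int(m.group(1)))
    m = re.search(r"at most\s+(\d+)", text)   # narrowing upper bound
    if m:
        return ("<=", int(m.group(1)))
    return None

print(normalize_scope("Java with 3+ years"))  # ('>=', 3)
print(normalize_scope("at most 2 years"))     # ('<=', 2)
```

A tuple such as ('>=', 3) can then be rendered as FILTER (?experience >= 3) in the generated query.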
Sem-QA transforms NLQ into SPARQL queries to be executed over
machine-understandable data. It has three modules: (i) Linguistic Analysis,
(ii) Query Template Matching, and (iii) SPARQL Query Generation. The proposed
technique is extensively tested using two different data sets: (i) the Mooney
job data set (Mooney, 2016) and (ii) queries posted on the Personforce
(Personforce, 2016) job portal by different job seekers. The evaluation results
show that the proposed methodology successfully translates user queries into
valid SPARQL with high accuracy.
The rest of the chapter is structured as follows: Section 6.2 discusses linguis-
tic analysis on user queries, Section 6.3 discusses template matching, Section 6.4
discusses the SPARQL query generation, Section 6.5 discusses a working example,
and Section 6.6 shows the evaluation of the system.
6.2 Semantic Linguistic Analysis
Linguistic analysis is the process of understanding text. It covers sentence
splitting, tokenization, lemmatization and cleaning, and part-of-speech (POS)
tagging. A detailed discussion of linguistic analysis has already been provided
in Section 4.4.1. The only significant difference between the analysis in
Section 4.4.1 and the current analysis is the type of text analyzed. The NLQ
queries posted by users are not very long, for example: mobile development
jobs, jobs in new york. Besides being short, queries may also have one or more
of the following weaknesses: grammatical and spelling mistakes, incomplete
questions, and use of jargon; some users type only keywords. Fig 6.1 shows
sample queries.
Figure 6.1: A set of sample queries from Mooney data set
To serve a larger group of users, Sem-QA handles most of the aforementioned
errors during Linguistic Analysis. It uses a dictionary for mapping question
words (entities) to similar ontology concepts. Since NL questions are short and
offer neither a significant pattern for entity detection nor usable contextual
information, the designed dictionary is used to identify entities that cannot
be identified using pattern analysis. The dictionary for NLQ consists of nearly
80 rule files used for the detection of different entities, including Career
Level, Expertise Level, Organization, Job Title, Skills, Experience, Job Type,
Job Category, Person Name, and many more. Table 6.1 shows examples of entities
extracted from job queries.
Table 6.1: Examples of entities detected from natural language job queries

Sample Query | Detected Entities
Are their any jobs for odbc specialist? | Skill: odbc; Expertise Level: specialist
Java jobs in houston | Skill: Java; Place: Houston
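A dictionary-based detector in the spirit of the rule files described above can be sketched in a few lines. The entries below are illustrative; the real system uses roughly 80 JAPE rule files covering many entity types.

```python
# Minimal dictionary-based entity detection: map known tokens to entity types.
# The dictionary entries are illustrative samples, not the thesis's rule files.
DICTIONARY = {
    "java": "Skill",
    "odbc": "Skill",
    "houston": "Place",
    "specialist": "Expertise Level",
}

def detect_entities(query):
    found = {}
    for token in query.lower().replace("?", " ").split():
        if token in DICTIONARY:
            found[DICTIONARY[token]] = token
    return found

print(detect_entities("Are their any jobs for odbc specialist?"))
# {'Skill': 'odbc', 'Expertise Level': 'specialist'}
print(detect_entities("Java jobs in houston"))
# {'Skill': 'java', 'Place': 'houston'}
```

Note that the lookup succeeds even on the grammatically flawed first query, which is the point of dictionary lookup over pure pattern analysis for short, noisy NLQ.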
6.3 Query Template Matching
This module is designed using the pattern matching approach discussed in (Oscar
et al., 2009). Each question is composed of Filter Constraints (FC) and Desired
Information (DI). An FC specifies a user priority for the job search, while the
DI specifies the search intent. A question may have a null FC, for instance,
All the jobs please, while another query may have multiple FCs, such as Are
there any jobs at dell that require no experience and pay 50000. Therefore,
correct answering demands: (i) identification of all user-specified FCs, (ii)
correct association of the FCs while generating a formal query, (iii)
identification of the negation of an FC, and (iv) special processing of
broadening and narrowing terms, for instance, at least, at most, and 3+. The
Mooney (Mooney, 2016) and Personforce (Personforce, 2016) data-sets have been
analyzed to cater for the maximum possible number of FCs. All atomic FCs from
the data set queries have been identified and converted into generalized
expressions, called FC Expressions (FCExp). An FCExp is a well-defined
generalized expression that specifies an atomic filter constraint in terms of
ontology concepts. A user question may consist of any number of FCs. The
template matcher maps an input text FC to an FCExp. Instead of generating and
storing all possible combinations of FCExp, complex FCExp are dynamically
generated using the existing atomic FCExp.
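Atomic FCExp matching can be sketched as a list of generalized patterns, each tied to an ontology concept. The patterns and concept names below are illustrative assumptions, not the thesis's actual FCExp set.

```python
import re

# Sketch of atomic filter-constraint expressions (FCExp): each generalized
# pattern specifies one atomic constraint in terms of an ontology concept.
FC_EXPRESSIONS = [
    (re.compile(r"\bjobs?\s+at\s+(\w+)", re.I), "hiringOrganization"),
    (re.compile(r"\bpay\s+(\d+)", re.I), "salary"),
    (re.compile(r"\bno experience\b", re.I), "experienceRequirements"),
]

def match_fc(question):
    """Return the ontology concept of every atomic FC found in the question."""
    return [concept for pattern, concept in FC_EXPRESSIONS
            if pattern.search(question)]

q = "Are there any jobs at dell that require no experience and pay 50000"
print(match_fc(q))  # ['hiringOrganization', 'salary', 'experienceRequirements']
```

A question with no match, such as "All the jobs please", yields an empty list, i.e., a null FC; complex constraints fall out of combining several atomic matches rather than storing every combination up front.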
6.4 Query Generation
This module is designed using the pattern matching approach discussed in
Section 6.3 and the dynamic triple generation approach of (Lopez et al., 2005)
and (Wang et al., 2007). It maps the previously identified FCExp to FC
Templates (FCTemp). Each FCExp is bound to an FCTemp. An FCTemp is a string
template specifying a SPARQL group graph pattern. The SPARQL group graph
pattern contains ontology concepts representing the subject and object parts of
a triple pattern. The template takes as an argument a hash map of key-value
pairs, whose keys are ontological concepts, such as skill, job title, city, and
country, and whose values are identified during the semantic linguistic
analysis phase. In the invoked string template, the ontological concepts (keys)
are replaced with their actual values to generate question-specific triple
patterns. In a similar manner, an FCTemp is invoked for every identified FCExp
to generate SPARQL query triple patterns. These triple patterns are then joined
using a SPARQL operator that is semantically equivalent to the user-specified
connecting words.
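The template-invocation step can be sketched as plain string templates whose concept keys are replaced by the values found during linguistic analysis. The jdo: property names follow the notation of the examples in this chapter; the template texts themselves are assumptions, not the thesis's actual FCTemp set.

```python
# Sketch of FCTemp invocation: each template holds a SPARQL group graph
# pattern; concept keys are substituted with values from linguistic analysis.
FC_TEMPLATES = {
    "skill": '?jID jdo:hasSkill ?skill .\nFILTER regex(str(?skill), "{skill}", "i")',
    "city": '?jID jdo:jobLocation ?loc .\nFILTER regex(str(?loc), "{city}", "i")',
}

def invoke_templates(entities):
    """Invoke one FCTemp per identified concept and collect the patterns."""
    patterns = [FC_TEMPLATES[key].format(**{key: value})
                for key, value in entities.items() if key in FC_TEMPLATES]
    # The triple lines end in the SPARQL Dot operator, so concatenating the
    # groups associates them as mandatory (AND) constraints.
    return "\n".join(patterns)

where_body = invoke_templates({"skill": "Java", "city": "Houston"})
print(where_body)
```

The resulting string becomes the body of the WHERE clause; a different joining operator would be chosen when the user's connecting words imply OR or an optional condition.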
6.5 Working Example
The working of the proposed technique is demonstrated with an example query
from the Mooney data set. The input query q1 is: List the companies that
desire 'c++' experience?. Initially, q1 is processed by Semantic Linguistic
Analysis. Query q1 is annotated, checked for negations and special words, and
then the FC and DI are determined. In the example query, the FC include
[Skill] = [c++] and [ExpertiseLevel] = [not null], while the DI is
[Organization]. In the Query Template Matching phase, a matching FCExp is
searched for each of the two atomic FC found in the input query, as shown in
Fig 6.2. The matched FCExp are used to invoke FCTemp. The query under
discussion invokes three FCTemp to form the WHERE clause of the SPARQL query.
The SPARQL Query Generator also looks for the appropriate association
operators, along with the task of invoking FCTemp. In q1, the user is looking
for [Organization] that satisfies two atomic FC: [Skill] = [c++] and
[ExpertiseLevel] = [not null]. Therefore, the generated triple patterns are
associated using the SPARQL Dot operator. After generating the WHERE clause,
the generator also adds the SELECT clause to the final SPARQL query. The
SELECT clause is generated using the DI: [Organization]. The generated FCExp
and FCTemp are shown in Fig 6.2.

[Figure 6.2 content: the user query q1 = List the companies that desire 'c++'
experience? is annotated (DI = [Organization: ?]; FC = [Skill: c++,
ExpertiseLevel: not mentioned]), matched to the template [Organization: ?;
Skill: c++; ExpertiseLevel: not null], and translated into the SPARQL query

SELECT ?organization
?jID jdo:jobTitle ?title .
?jID jdo:publishedBy ?organization .
?jID jdo:hasSkill ?skill .
FILTER regex(str(?skill), "<entity_list.Skill>", "i")
?jID jdo:hasExpertiseLevel ?expertise .
FILTER (bound(?expertise))

which is executed over the Semantic Job Store to produce structured results,
such as IBM, Apple, and Microsoft.]

Figure 6.2: A sample query processing representation of the Mooney data set
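The final assembly step for q1 can be sketched as follows. The triple patterns mirror those shown in Fig 6.2 (with the skill placeholder already bound to c++); the assembly helper itself is an illustrative sketch, not the thesis implementation.

```python
# Assemble a SPARQL query from the DI variable and the generated triple
# patterns, joining patterns inside a single WHERE group.
def assemble_query(di_variable, triple_patterns):
    where_body = "\n  ".join(triple_patterns)
    return "SELECT ?%s\nWHERE {\n  %s\n}" % (di_variable, where_body)

patterns = [
    "?jID jdo:jobTitle ?title .",
    "?jID jdo:publishedBy ?organization .",
    "?jID jdo:hasSkill ?skill .",
    'FILTER regex(str(?skill), "c++", "i")',
    "?jID jdo:hasExpertiseLevel ?expertise .",
    "FILTER (bound(?expertise))",
]
query = assemble_query("organization", patterns)
print(query.splitlines()[0])  # SELECT ?organization
```

The SELECT clause is derived from the DI [Organization], while the Dot-joined patterns of the WHERE clause come from the invoked FCTemp, exactly as described for q1.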
6.6 Evaluation and Results
6.6.1 Experimental Setup
The Sem-QA generated SPARQL queries are compared with manually generated
SPARQL queries written by domain experts. A translated query is correct if it
is equivalent to the manually generated one. Tests have been conducted to
evaluate the correctness and translation efficacy of Sem-QA. A discussion of
the data sets and evaluation results is provided in Subsections 6.6.2 and 6.6.3.
6.6.2 Data Set Specification
Sem-QA evaluation is performed on two data-sets. One is the standard benchmark
data set (Mooney, 2016), and the second consists of real-world user queries
posted on the Personforce (Personforce, 2016) job portal. Fig 6.1 shows sample
queries. Table 6.2 shows the data set query counts. Both data-sets contain
sample job queries in plain English.
Table 6.2: Query count for Mooney and Personforce data sets

Data-set | Total Queries
Mooney | 620
Personforce | 500
Most NLI-based QA systems have not discussed the processing of question words
such as at least, negation, at most, each, outside, and inside, known as scope
specifiers. We performed a statistical analysis on the Mooney job data set. It
shows that questions with some scope modifier constitute 16.29% of the total
job-related queries. Although they are not the major part of job-related
questions, scope specifiers occur quite often in NL questions; therefore, they
cannot be ignored. According to the Mooney geographical data set statistics
discussed in (Cimiano and Minock, 2010), one-quarter of questions are missed by
NLI-based QA systems because scope modifiers are missing or under-processed.
Therefore, along with the remaining 83.71% of questions, Sem-QA has paid
particular attention to the processing of the 16.29% of questions that involve
scope modifiers.
Another less focused aspect is the correct association of multiple FCs, on
which the precision of the result depends. In the question Show me jobs using
lisp that require a bscs and desire a msee, the basic filtering criterion is
bscs, while msee needs to be added as an OPTIONAL condition. If the second
filtering constraint, msee, is associated with the first, bscs, using AND, the
query will miss all jobs that require only a bscs and have not mentioned an
msee as a requirement. The incorrect association of filtering constraints may
therefore cause an opportunity loss to the user and is unacceptable. To show
the correct association of FCs in Sem-QA, we have translated questions
involving multiple FCs associated using different operators.
Table 6.3: Query categorization based on number of Filter Constraints

Category | Category Description
QC 1 | Questions without any FC
QC 2 | Questions with a single FC
QC 3 | Questions with 2 different FCs, both of them mandatory
QC 4 | Questions with 3 different FCs, 2 mandatory and the 3rd OPTIONAL
QC 5 | Questions with 3 different FCs, 1 mandatory and the other 2 associated with the 1st using OR
Table 6.3 shows the categorization of the evaluation queries based on the
number of Filter Constraints. Experimental results for the queries mentioned
in Table 6.3 are discussed in Section 6.6.4.
6.6.3 Evaluation Results
Correctness is measured in terms of system recall and precision. Recall is the
ratio of correctly answered questions to the total number of questions in the
data-set, while precision is the number of correctly answered questions divided
by the total number of questions answered by the system (Damljanovic et al.,
2010).
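These definitions can be checked directly against the Mooney numbers reported in Table 6.4 (620 questions in the data set, 619 translated and answered correctly).

```python
# Correctness measures as defined above, applied to the Mooney counts.
def recall(correct, total_questions):
    return 100.0 * correct / total_questions

def precision(correct, answered):
    return 100.0 * correct / answered

def f1(p, r):
    return 2 * p * r / (p + r)

p = precision(619, 619)  # the system answered 619 questions, all correctly
r = recall(619, 620)
print(round(p, 2), round(r, 2))  # 100.0 99.84
print(round(f1(p, r), 2))  # 99.92 (reported as 1 after rounding in Table 6.4)
```

The recovered values match the Precision and Recall columns of Table 6.4 for the Mooney data set.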
Table 6.4: Comparative analysis of Mooney and Personforce data-sets

Data Set | Total Queries | Sem-QA Translated Queries | Precision | Recall | F1-measure
Mooney | 620 | 619 | 100 | 99.84 | 1
Personforce | 500 | 500 | 100 | 100 | 1
The F1-measure results for the two data sets, shown in Table 6.4, prove the
correctness of the Sem-QA technique.
6.6.4 System Performance for Semantic Association of Atomic
FC
Another important feature of an NL-based QA system is translation efficiency;
it must perform considerably well for complex NL questions. The increasing
number of FCs is one measure of input query complexity, and an NLI-based QA
system must maintain efficiency while processing complex queries. To show this
side of Sem-QA, the five query categories discussed earlier are translated.
Each category involves queries of varying complexity with different association
operators. The values shown in Fig 6.3 represent the mean processing time of
the different queries belonging to the same category. The results confirm that
the translation time remains almost linear irrespective of the increasing
complexity of the input query.
[Figure 6.3 plots the mean translation time in milliseconds (y-axis, 85 to 95)
for the query categories QC_1 through QC_5 (x-axis).]

Figure 6.3: Time comparison between various Filter Constraints queries
6.7 Summary
The current chapter discussed the solution for the transformation of NLQ into
SPARQL queries. The solution uses query pattern templates and the dynamic
invocation of string templates for answering natural language questions from an
underlying RDF store. Its distinguishing features are the special processing of
scope modifiers and the use of atomic filtering constraints to generate complex
queries. The query matching and dynamic query generation techniques are
evaluated using two data sets. The results show a high recall and precision for
the proposed technique.
Chapter 7
Conclusion
The focus of the current chapter is to summarize the extraction, enrichment,
and transformation framework called SExEnT. SExEnT extracts knowledge from the
unstructured text of job descriptions, enriches entities and compound words,
builds context, transforms them into a machine-understandable format, and
stores them in a knowledge base. Besides this, SExEnT also transforms natural
language English queries into a machine-understandable format so that jobs can
be retrieved.
The chapter is organized as follows: Section 7.1 describes the research,
Section 7.2 outlines application areas, Section 7.3 discusses research
contributions, and Section 7.4 discusses limitations and future work.
7.1 Research Description
In this research, an extraction and transformation methodology for the
identification of entities and compound words from a job description is
proposed for building an information context. The proposed framework extracts
context-aware information from a job description by exploiting Linked Open
Data. The extracted information is represented using a comprehensive proposed
web ontology for job descriptions in e-recruitment in order to resolve semantic
heterogeneity. Alongside resolving semantic heterogeneities, an enrichment
methodology is proposed for enriching entities and compound words from Linked
Open Data to cater for data staleness in e-recruitment. The enrichment process
enriches and builds context between extracted entities to minimize information
loss in the extraction process. Additionally, the framework transforms job
description Natural Language Queries from plain English text into a
machine-understandable format. It combines various processes to achieve
context-aware information extraction and enrichment from job descriptions in
e-recruitment. The framework segments the text into predefined categories
using a self-generated dictionary. The entities are extracted using Natural
Language Processing (NLP) and the dictionary. The extracted entities are
enriched using Linked Open Data, and the job context is built using the job
description domain ontology. The knowledge base stores the enriched and
context-aware information using Linked Open Data principles 1. The user
searches the context-aware information stored in the knowledge base using
Natural Language Queries (NLQ). The transformation process encapsulates
various processes together to achieve machine-understandable and context-aware
information.
The evaluation has been performed on a data-set of 860 jobs, verified by HR
experts. Initially, a comparison of the manually verified data and the
system-extracted entities was carried out. The SAJ framework achieved an
overall F1-measure of 87.83%. In comparison with the other techniques,
OpenCalais and Alchemy API, SAJ performed the best. OpenCalais was able to
extract job titles and job requirements, while Alchemy API was only able to
extract job titles, as the remaining entities are domain-dependent. SAJ can
facilitate the searching and retrieval, scoring, and ranking of job candidates.
The evaluation of the transformation from NLQ to SPARQL has been performed on
two data-sets, i.e., the Mooney data-set and the Job Portal data-set. The
results are promising and show a high recall and precision for the proposed
technique.

1https://www.ontotext.com/knowledgehub/fundamentals/linked-data-linked-open-data/
7.2 Application Areas
The proposed SExEnT framework can also be applied to the following domains
other than e-recruitment:
1. Legal domain, such as court orders, legal proceedings.
2. Health-care data, such as textual discharge summaries.
3. Scientific documents, such as research articles, reports.
7.3 Research Contributions
The current research has the following contributions:
1. A framework, named SAJ, for the extraction, transformation, and context
building of information from job descriptions. The core focus is to extract
contextual entities, such as job requirements and job responsibilities, using a
dictionary comprising JAPE rules and seed tokens.
2. A framework for enriching skill entities using Linked Open Data, a
continuously growing data source managed by an open-source community.
3. A job description ontology that defines concepts and relationships among
those concepts. The job description ontology provides hierarchical and
associative relationships among the concepts.
4. A framework, named Sem-QA, for the transformation of natural language plain
English queries into a machine-understandable format, i.e., SPARQL. The
transformation helps a layman search machine-understandable data content.
7.4 Limitations and Future Work
The current work is unable to:
1. Automatically generate pattern/action extraction rules for unstructured
text. The rules would be learned from the text based on some predefined
features, such as words/tokens, POS tags, named entities, and others.
2. Extend search query matching to profile matching for job recommendation
based on profiles. The user profile would be used as the search query, which is
complex in nature compared to existing job search queries. The results produced
would be ranked with respect to the user profile.
3. Generate a profile score based on matching with a job description. After
matching the user profile with the job descriptions, ranked results would be
retrieved, thus facilitating the decision process.
4. Automatically learn the extraction dictionary to enhance entity extraction.
Automatic dictionary learning will enhance extraction through the addition of
new rules learned from the text.
Bibliography
(2017). Introduction to the principles of linked open data. Accessed: 2018-12-01.
Ahmed, N., Khan, S., and Latif, K. (2016). Job description ontology. In Fron-
tiers of Information Technology (FIT), 2016 International Conference on, pages
217–222. IEEE.
Al-Yahya, M., Aldhubayi, L., and Al-Malak, S. (2014). A pattern-based approach
to semantic relation extraction using a seed ontology. In Semantic Computing
(ICSC), 2014 IEEE International Conference on, pages 96–99. IEEE.
Ali, F., Kim, E. K., and Kim, Y. (2015). Type-2 fuzzy ontology-based opinion min-
ing and information extraction: A proposal to automate the hotel reservation
system. Appl. Intell., 42(3):481–500.
Ameen, A., Khan, K. U. R., and Rani, B. P. (2012). Creation of ontology in edu-
cation domain. In 2012 IEEE Fourth International Conference on Technology
for Education, T4E 2012, Hyderabad, India, July 18-20, 2012, pages 237–238.
Arocena, G. O. and Mendelzon, A. O. (1999). WebOQL: restructuring documents,
databases, and webs. Theory and Practice of Object Systems, 5(3):127–141.
Awan, M. N. A. (2009). Extraction and generation of semantic annotations from
digital documents. Master’s thesis, NUST School of Electrical Engineering &
Computer Science.
Bernstein, A. and Kaufmann, E. (2006). Gino - a guided input natural language
ontology editor. In The Semantic Web - ISWC, pages 144–157. Springer.
Bhagia, L. (2015). The evolution of web technologies.
Bijalwan, V., Kumar, V., Kumari, P., and Pascual, J. (2014). Knn based machine
learning approach for text and document mining. International Journal of
Database Theory and Application, 7(1):61–70.
Bogh, C. (2012). The evolution of web technologies.
Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M. A., Maynard, D., and
Aswani, N. (2013). Twitie: An open-source information extraction pipeline for
microblog text. In RANLP, pages 83–90.
Boselli, R., Cesarini, M., Marrara, S., Mercorio, F., Mezzanzanica, M., Pasi, G.,
and Viviani, M. (2018). Wolmis: a labor market intelligence system for classi-
fying web job vacancies. Journal of Intelligent Information Systems, 51(3):477–
502.
Buttinger, C., Pröll, B., Palkoska, J., Retschitzegger, W., Schauer, M., and
Immler, R. (2008). Jobolize - headhunting by information extraction in the era
of web 2.0. In Proceedings of the 7th International Workshop on Web-Oriented
Software Technologies, IWWOST.
Candela, G., Escobar, P., and Marco-Such, M. (2017). Semantic enrichment on
cultural heritage collections: A case study using geographic information. In
Proceedings of the 2nd International Conference on Digital Access to Textual
Cultural Heritage, pages 169–174. ACM.
Cimiano, P. and Minock, M. (2010). Natural language interfaces: What is the
problem? a data-driven quantitative analysis. In Natural Language Processing
and Information Systems, volume 5723, pages 192–206. Springer Berlin Heidel-
berg.
Copestake, A. and Jones, K. S. (1990). Natural language interfaces to
databases. Number 187. Cambridge Univ Press.
Damljanovic, D., Agatonovic, M., and Cunningham, H. (2010). Natural lan-
guage interfaces to ontologies: Combining syntactic analysis and ontology-
based lookup through the user interaction. In The Semantic Web: Research
and Applications, pages 106–120. Springer.
Elkan, C. and Greiner, R. (2006). Building large knowledge-based systems: rep-
resentation and inference in the cyc project. Artificial Intelligence, (1):41–52.
Flesca, S., Masciari, E., and Tagarelli, A. (2011). A fuzzy logic approach to wrap-
ping pdf documents. Knowledge and Data Engineering, IEEE Transactions on,
23(12):1826–1841.
Frank, A., Krieger, H.-U., Xu, F., Uszkoreit, H., Crysmann, B., Jörg, B., and
Schäfer, U. (2007). Question answering from structured knowledge sources.
Journal of Applied Logic, 5(1):20–48.
Fuchs, N. E., Kaljurand, K., and Schneider, G. (2006). Attempto controlled english
meets the challenges of knowledge representation, reasoning, interoperability
and user interfaces. In FLAIRS Conference, volume 12, pages 664–669.
Funk, A., Tablan, V., Bontcheva, K., Cunningham, H., Davis, B., and Handschuh,
S. (2007). Clone: Controlled language for ontology editing. In The Semantic
Web, pages 142–155. Springer.
Gautam, G. and Yadav, D. (2014). Sentiment analysis of twitter data using ma-
chine learning approaches and semantic analysis. In Contemporary computing
(IC3), 2014 seventh international conference on, pages 437–442. IEEE.
Geibel, P., Trautwein, M., Erdur, H., Zimmermann, L., Jegzentis, K., Bengner, M.,
Nolte, C. H., and Tolxdorff, T. (2015). Ontology-based information extraction:
Identifying eligible patients for clinical trials in neurology. J. Data Semantics,
4(2):133–147.
Gomez-Perez, A. (1996). Towards a framework to verify knowledge sharing tech-
nology. Expert Systems with Applications, 11(4):519–529.
Gomez-Perez, A. (1999). Ontological engineering: A state of the art. Expert
Update: Knowledge Based Systems and Applied Artificial Intelligence, 2(3):33–
43.
Gómez-Pérez, A., Ramírez, J., and Villazón-Terrazas, B. (2007). An ontology for
modelling human resources management based on standards. In International
Conference on Knowledge-Based and Intelligent Information and Engineering
Systems, pages 534–541. Springer.
Graupner, S., Nezhad, H. R. M., and Basu, S. (2017). Generating machine-
understandable representations of content. US Patent 9,633,332.
Gregory, M. L., McGrath, L., Bell, E. B., O’Hara, K., and Domico, K. (2011). Do-
main independent knowledge base population from structured and unstructured
data sources. In Proceedings of the Twenty-Fourth International Florida Arti-
ficial Intelligence Research Society Conference, May 18-20, 2011, Palm Beach,
Florida, USA.
Grüninger, M. and Fox, M. (1995). Methodology for the design and evaluation of ontologies. In International Joint Conference on Artificial Intelligence (IJCAI-95), Workshop on Basic Ontological Issues in Knowledge Sharing.
Guenther, N. and Schonlau, M. (2016). Support vector machines. Stata Journal,
16(4):917–937.
Gupta, Y. (2016). Literature review on e-recruitment: A step towards paperless HR. International Journal, 4(1).
Gutierrez, F., Dou, D., Fickas, S., Wimalasuriya, D., and Zong, H. (2016). A
hybrid ontology-based information extraction system. Journal of Information
Science, 42(6):798–820.
Jayram, T. S., Krishnamurthy, R., and Raghavan, S. (2006). Avatar Information
Extraction System. IEEE Data Engineering Bulletin, 29(1):40–48.
Karkaletsis, V., Fragkou, P., Petasis, G., and Iosif, E. (2011). Ontology Based
Information Extraction from Text. Knowledge-Driven Multimedia Information
Extraction and Ontology Evolution, 6050:89–109.
Kiryakov, A., Popov, B., Terziev, I., Manov, D., and Ognyanoff, D. (2004). Se-
mantic Annotation, Indexing, and Retrieval. Web Semantics: Science, Services
and Agents on the World Wide Web, 2(1):49–79.
Kolb, P. (2008). DISCO: A multilingual database of distributionally similar words.
Proceedings of KONVENS-2008, Berlin, 156.
Li, X., Zhang, Y., Wang, J., and Pu, Q. (2016). A preliminary study of plant
domain ontology. In 2016 IEEE 14th Intl Conf on Dependable, Autonomic and
Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing,
2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and
Technology Congress, DASC/PiCom/DataCom/CyberSciTech 2016, Auckland,
New Zealand, August 8-12, 2016, pages 109–112.
Lopez, V., Pasin, M., and Motta, E. (2005). AquaLog: An ontology-portable question answering system for the semantic web. In The Semantic Web: Research and Applications, pages 546–562. Springer.
Malik, S. K., Prakash, N., and Rizvi, S. (2010a). Developing an university ontology
in education domain using Protégé for semantic web. International Journal of
Science and Technology, 2(9):4673–4681.
Malik, S. K., Prakash, N., and Rizvi, S. (2010b). Semantic annotation frame-
work for intelligent information retrieval using KIM architecture. International
Journal of Web & Semantic Technology (IJWest), 1(4):12–26.
Maree, M., Kmail, A. B., and Belkhatir, M. (2018). Analysis and shortcomings of e-recruitment systems: Towards a semantics-based approach addressing
knowledge incompleteness and limited domain coverage. Journal of Information
Science, page 0165551518811449.
Jeffery, M. (2011). A vision for the future of recruitment: Recruitment 3.0.
McConell, I. (2014). Web 3.0 and what it means for the future of recruitment.
Mingsheng, H., Zhijuan, J., and Xiangyu, Z. (2012). An approach for text extrac-
tion from web news page. In Robotics and Applications (ISRA), 2012 IEEE
Symposium on, pages 562–565. IEEE.
Mooney, R. (2016). OWL test data. https://www.ifi.uzh.ch/en/ddis/research/talking/OWL-Test-Data.html.
Müller, H.-M., Kenny, E. E., and Sternberg, P. W. (2004). Textpresso: an ontology-
based information retrieval and extraction system for biological literature. PLoS
biology, 2(11).
Mykowiecka, A., Marciniak, M., and Kupść, A. (2009). Rule-based information
extraction from patients’ clinical data. Journal of Biomedical Informatics,
42(5):923–936.
Nabeel Ahmed, Sharifullah Khan, K. L. A. M. (2008). Extracting semantic an-
notation and their correlation with document. In 4th International Conference
on Emerging Technologies, pages 32–37.
Ferrández, O., Izquierdo, R., Ferrández, S., and Vicedo, J. L. (2009). Addressing ontology-based question answering with collections of user queries. Information Processing & Management, 45(2):175–188.
Owoseni, A. T., Olabode, O., and Ojokoh, B. (2017). Enhanced e-recruitment
using semantic retrieval of modeled serialized documents.
Pattuelli, M. C. (2011). Modeling a domain ontology for cultural heritage resources: A user-centered approach. J. Am. Soc. Inf. Sci. Technol., 62(2):314–
342.
Personforce (2016). User job queries. http://www.personforce.com/.
Cimiano, P., Haase, P., Heizmann, J., Mantel, M., and Studer, R. (2008). Towards portable natural language interfaces to knowledge bases – the case of the ORAKEL system. Data and Knowledge Engineering, 65(2):325–354.
Popescu, A.-M., Etzioni, O., and Kautz, H. (2003). Towards a theory of natu-
ral language interfaces to databases. In Proceedings of the 8th international
conference on Intelligent user interfaces, pages 149–157. ACM.
Popov, B., Kiryakov, A., Kirilov, A., and Manov, D. (2003). KIM – Semantic Annotation Platform. In International Semantic Web Conference, pages 834–
848.
Poria, S., Cambria, E., Ku, L., Gui, C., and Gelbukh, A. F. (2014). A rule-
based approach to aspect extraction from product reviews. In Proceedings of
the Second Workshop on Natural Language Processing for Social Media, So-
cialNLP@COLING 2014, Dublin, Ireland, August 24, 2014, pages 28–37.
Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1):37–63.
Ramakrishnan, C., Mendes, P. N., Wang, S., and Sheth, A. P. (2008). Unsuper-
vised discovery of compound entities for relationship extraction. In Gangemi,
A. and Euzenat, J., editors, Knowledge Engineering: Practice and Patterns,
pages 146–155, Berlin, Heidelberg. Springer Berlin Heidelberg.
Rocktäschel, T., Singh, S., and Riedel, S. (2015). Injecting logical background
knowledge into embeddings for relation extraction. In NAACL HLT 2015,
The 2015 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Denver, Colorado,
USA, May 31 - June 5, 2015, pages 1119–1129.
Roman, D., Kopecký, J., Vitvar, T., Domingue, J., and Fensel, D. (2015). WSMO-Lite and hRESTS: Lightweight semantic annotations for web services and RESTful APIs. Web Semantics: Science, Services and Agents on the World Wide Web,
31:39–58.
Saggion, H., Funk, A., Maynard, D., and Bontcheva, K. (2007). Ontology-based
information extraction for business intelligence. In The Semantic Web, 6th
International Semantic Web Conference, 2nd Asian Semantic Web Conference,
ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007, pages 843–
856.
Sen, A., Das, A., Ghosh, K., and Ghosh, S. (2012). Screener: a system for extract-
ing education related information from resumes using text based information
extraction system. In International Conference on Computer and Software
Modeling, volume 54, pages 31–35.
Shahid, N., Khan, O. A., Anwar, S. K., and Pirzada, U. T. (2009). Rational
unified process. Online Notes on RUP. http://ovais.khan.tripod.com/papers/RationalUnifiedProcess.pdf.
Shin, J., Wu, S., Wang, F., De Sa, C., Zhang, C., and Ré, C. (2015). Incremental knowledge base construction using DeepDive. Proceedings of the VLDB
Endowment, 8(11):1310–1321.
Silvello, G., Bordea, G., Ferro, N., Buitelaar, P., and Bogers, T. (2017). Seman-
tic representation and enrichment of information retrieval experimental data.
International Journal on Digital Libraries, 18(2):145–172.
Singh, A., Rose, C., Visweswariah, K., Chenthamarakshan, V., and Kambhatla, N.
(2010). Prospect: a system for screening candidates for recruitment. In Proceed-
ings of the 19th ACM international conference on Information and knowledge
management, pages 659–668. ACM.
Strzalkowski, T. and Harabagiu, S. M. (2006). Advances in open domain question
answering. Springer Heidelberg.
Tang, B., Kay, S., and He, H. (2016). Toward optimal feature selection in naive
Bayes for text categorization. IEEE Transactions on Knowledge and Data En-
gineering, 28(9):2508–2521.
Thada, V. and Jaglan, V. (2013). Comparison of Jaccard, Dice, cosine similarity
coefficient to find best fitness value for web retrieved documents using genetic
algorithm. International Journal of Innovations in Engineering and Technology,
2(4):202–205.
Thompson, C. A., Califf, M. E., and Mooney, R. J. (1999). Active learning for
natural language parsing and information extraction. In Machine Learning
Conference, pages 406–414. Citeseer.
Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowl-
edge sharing. International Journal of Human-Computer Studies, 43(4-5):907–
928.
Uschold, M. and Gruninger, M. (1996). Ontologies: Principles, methods and
applications. The Knowledge Engineering Review, 11(2):93–136.
Uschold, M. and King, M. (1995). Towards a methodology for building ontologies.
In Workshop on basic ontological issues in knowledge sharing, volume 74.
Valle, E. D., Cerizza, D., Celino, I., Estublier, J., Vega, G., Kerrigan, M., Ramírez, J., Villazón-Terrazas, B., Guarrera, P., Zhao, G., and Monteleone, G. (2007). SEEMP: a semantic interoperability infrastructure for e-government services
in the employment sector. In The Semantic Web: Research and Applications,
4th European Semantic Web Conference, ESWC 2007, Innsbruck, Austria,
June 3-7, 2007, Proceedings, pages 220–234.
Vicient, C., Sánchez, D., and Moreno, A. (2011). Ontology-based feature extrac-
tion. In Proceedings of the 2011 IEEE/WIC/ACM International Conferences
on Web Intelligence and Intelligent Agent Technology - Volume 03, WI-IAT
’11, pages 189–192, Washington, DC, USA. IEEE Computer Society.
Vicient, C., Sánchez, D., and Moreno, A. (2013). An automatic approach for ontology-based feature extraction from heterogeneous textual resources. Engineering Applications of Artificial Intelligence, 26(3):1092–1106.
Vijayarajan, V., Dinakaran, M., Tejaswin, P., and Lohani, M. (2016). A generic
framework for ontology-based information retrieval and image retrieval in web
data. Human-centric Computing and Information Sciences, 6(1):18.
Wang, C., Xiong, M., Zhou, Q., and Yu, Y. (2007). PANTO: A portable natural
language interface to ontologies. In The Semantic Web: Research and Appli-
cations, pages 473–487. Springer.
Wang, W., Do, D. B., and Lin, X. (2005). Term graph model for text classification.
In ADMA, pages 19–30. Springer.
Weichselbraun, A., Gindl, S., and Scharl, A. (2014). Enriching semantic knowledge
bases for opinion mining in big data applications. Knowledge-based systems,
69:78–85.
Zelle, J. M. and Mooney, R. J. (1996). Learning to parse database queries using
inductive logic programming. In Proceedings of the National Conference on
Artificial Intelligence, pages 1050–1055.