bioPDFX: preparing PDF scientific articles for biomedical text
mining
Shitij Bhargava 1 , Tsung-Ting Kuo 2 , Ankit Goyal 1 , Vincent Kuri 1 , Gordon Lin 1 , Chun-Nan Hsu Corresp. 2
1 Department of Computer Science and Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California, United States
2 Health System Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California, United States
Corresponding Author: Chun-Nan Hsu
Email address: [email protected]
Background. A huge amount of full-text biomedical literature is available in public repositories such as PubMed Central (PMC). However, a substantial number of the papers are in Portable Document Format (PDF) and do not provide a plain-text format ready for text mining and natural language processing (NLP). Although many PDF-to-text converters exist, they still face several challenges when processing biomedical PDFs: transcribing titles and abstracts correctly, segmenting references and acknowledgements, recognizing special characters, avoiding jumbling errors (text extracted in the wrong order), and restoring word boundaries.
Methods. In this paper, we present bioPDFX, a novel tool that combines multiple state-of-the-art methods so that the strengths of each offset the weaknesses of the others, and then applies machine learning methods to address all of the issues above.
Results. Experiments on publications of Genome Wide Association Studies (GWAS) demonstrated that bioPDFX significantly improved the quality of the XML compared to a state-of-the-art PDF-to-XML converter, leading to a biomedical database more suitable for text mining.
Discussion. Overall, the pipeline developed in this paper makes published literature in the form of PDF files much better suited for text mining tasks, while also slightly improving the overall text quality. The service is freely accessible at http://textmining.ucsd.edu:9000. A list of PubMed Central IDs of the 941 articles used in this study (see Supplemental File 1) is available for download at the same URL. Instructions for running the service with a PubMed ID are given in Supplemental File 2.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.2993v1 | CC BY 4.0 Open Access | rec: 26 May 2017, publ: 26 May 2017
Corresponding author postal address:
Chun-Nan Hsu, 9500 Gilman Drive, La Jolla, CA 92093, United States
1. Introduction
PubMed Central (PMC) (2015e) contains the full text of about 4 million biomedical articles and is one of the most important freely available data sources for biomedical natural language processing (NLP) research. Although many new publications now provide plain text along with the Portable Document Format (PDF), a substantial number of them do not. Text mining algorithms, however, work more effectively on text-based formats such as plain text, XML, or HTML. NLP researchers therefore usually need to transcribe biomedical articles from PDF to text as a preprocessing step of the NLP pipeline.
However, transcribing PDF to text accurately is not trivial. The PDF standard was designed for human readability rather than for electronic consumption of data by other software tools. That is, the PDF standard does not attempt to encode any semantic connection between characters in a word or between paragraphs; instead, characters are “painted” individually at specific 2-dimensional coordinates. Berg et al. outlined some challenges encountered in transcribing PDF documents to text accurately, such as varied reading orders, sectional formatting, and number of columns (Berg 2011). These issues create variability in the text quality of the generated outputs and impede accurate text mining.
Although a variety of tools are already available for PDF-to-text transcription (such as Apache PDFBox (2015b), Mozilla PDF.js (2015d), Adobe Acrobat SDK (2015a), Tesseract Optical Character Recognition (OCR) (Smith 2007), and PDFX (Constantin et al. 2013)), several challenges remain to be solved when converting biomedical literature PDFs to XML for text mining purposes:
1. Extracting titles and abstracts correctly. For manuscripts whose first page contains content other than the title or abstract, conversion tools such as PDFX struggle to identify them correctly.
2. Identifying reference and acknowledgement sections in various formats. The format of references varies considerably among publications, ranging from a separate section towards the end to individual references in footnotes running alongside the content body. PDFX, for example, is a purely rule-based system and often falters in recognizing references when the common pattern of a headed reference section at the end is absent. Similar problems were noticed in identifying acknowledgements.
3. Recognizing special characters (such as '!' and '@'). PDFX and other PDF-to-text converters often fail to recognize special characters, depending on how the PDF file was published.
4. Detecting and fixing jumbling errors (text extracted in the wrong order during PDF-to-text conversion). PDF conversion tools often jumble the word order within or across sentences because of mistakes in reading-order detection or segmentation. An example is given in Figure 1, where the exponents of numbers in a table are extracted as a separate column and hence are jumbled with respect to the actual text.
5. Correcting word boundaries. Words straddling the column boundary often have a hyphen inserted in them (e.g., “cancer” might become “can-” and “cer” if it is the last word of a row). Although this is not an error caused by conversion, we would still like to correct it and merge the hyphenated parts into a single word wherever possible to support subsequent NLP.
Figure 1. A kind of jumbling error in a table.
The table at the top is what is present in the PDF. The lower portion shows the transcribed XML. The exponents marked in the red box are incorrectly extracted as a separate column in the XML.
To address these issues, we propose bioPDFX, a novel tool that integrates current state-of-the-art PDF conversion methods to transcribe biomedical PDF articles to high-quality text in XML format. bioPDFX leverages the following four tools to transcribe the text of a given PDF: PDFX (Constantin et al. 2013), the PMC Entrez e-utilities API (Sayers et al. 2011), Tesseract OCR (Smith 2007), and Apache PDFBox (2015b). The input of bioPDFX is a biomedical literature PDF file and its corresponding PubMed ID; the output is the converted XML file. The overall processing pipeline of bioPDFX consists of the following five components, each addressing one of the challenges above:
1. Title and Abstract Correction. We used the PMC Entrez e-utilities API to retrieve the correct title and abstract from PubMed.
2. Reference and Acknowledgment Correction. To detect whether a paragraph is a reference, an acknowledgement, or neither, we designed two binary classifiers, one for references and the other for acknowledgements. For reference detection, we extracted features such as the density of years (e.g., “2016”), the density of “et al.”, and the density of numbers followed by a dot (e.g., “12.”). We normalized these features and trained a classifier to predict whether a paragraph is reference text. For acknowledgment detection, we used a classifier whose features are TF-IDF followed by Latent Semantic Analysis (Deerwester et al. 1990).
3. Special Character Correction. The basic idea is to compare the results from PDFX with those from Tesseract OCR to identify suspicious characters (i.e., characters identified differently within the same n-gram), and then to apply a Hidden Markov Model (HMM) (Baum & Petrie 1966) with the Viterbi inference algorithm (Viterbi 1967) and a language model to choose the “most probable” candidate character for each mismatched character by maximizing the overall likelihood of recovering the n-gram.
4. Jumbling Error Correction. We used the text extracted by Apache PDFBox to correct possible jumbling errors from PDFX. We first detected jumbled n-grams, and then replaced them with the non-jumbled version from Apache PDFBox.
5. Word Boundary Correction. We adopted a simple dictionary lookup method, comparing the different parts (e.g., “can-” and “cer”) with the English dictionary in the Enchant Spellchecking System (2015c) and merging them as required.
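The dictionary lookup in step 5 can be illustrated with a short Python sketch. This is only a minimal sketch under stated assumptions, not the bioPDFX implementation: the function name `merge_hyphenated`, the regular expression, and the toy word set `TOY_DICTIONARY` (standing in for the Enchant English dictionary) are all hypothetical.

```python
import re

# Hypothetical stand-in for the Enchant English dictionary used in the paper.
TOY_DICTIONARY = {"cancer", "risk", "well-known", "fact"}


def merge_hyphenated(text, dictionary):
    """Join word parts split by a line-break hyphen when the result is a known word."""
    # Matches a word ending in '-', followed by whitespace and another word.
    pattern = re.compile(r"(\w+)-\s+(\w+)")

    def join(match):
        left, right = match.group(1), match.group(2)
        merged = left + right
        if merged.lower() in dictionary:
            return merged                      # "can- cer" -> "cancer"
        hyphenated = left + "-" + right
        if hyphenated.lower() in dictionary:
            return hyphenated                  # "well- known" -> "well-known"
        return match.group(0)                  # unknown split: leave untouched

    return pattern.sub(join, text)


print(merge_hyphenated("can- cer risk", TOY_DICTIONARY))
```

Leaving unknown splits untouched is a conservative choice: a wrong merge would introduce a new error, whereas an unmerged split only preserves an existing one.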
To evaluate bioPDFX, we randomly selected 100 biomedical articles related to Genome Wide Association Studies (GWAS) (Hindorff et al. 2009; Welter et al. 2014) from PMC, and used the XML versions (i.e., NXMLs) of those 100 articles provided in PMC as gold standards. We compared bioPDFX with the state-of-the-art PDFX tool (Constantin et al. 2013). As a use case, we also compared the extraction of important information (p-values and numbers) and the overall text quality in the GWAS papers using bioPDFX and PDFX. We evaluated both the individual correction steps and the overall conversion results. The experimental results show significant improvement for all five types of correction tasks, especially for abstract, reference/acknowledgement, special character, and jumbling error corrections.
Our contributions in this study are three-fold:
• We identified five issues (title/abstract, reference/acknowledgment, special character, jumbling error, and word boundary) in state-of-the-art PDF-to-XML tools, which form the very first step of a biomedical NLP pipeline.
• We integrated the output of four popular tools (PDFX, PMC Entrez e-utilities API, Tesseract OCR, and Apache PDFBox) in bioPDFX and designed five corresponding correction steps to solve the abovementioned issues.
• We evaluated bioPDFX and compared it to PDFX on each correction step as well as on a real-world GWAS information extraction task; the results demonstrated that bioPDFX improves on all five issues and increases the extraction accuracy for GWAS literature.

Related Work. The rest of this section reviews related studies and the tools that we use in our approach. PDF text extraction involves extracting all textual content from the PDF file in an order that makes sense even without the formatting of the PDF. A variety of tools parse and convert PDF to text, such as Apache PDFBox (2015b), Mozilla PDF.js (2015d), and Adobe Acrobat SDK (2015a). Most text extractors use a page segmentation algorithm to determine all the textual areas of the PDF based on the width of the whitespace, and from these boundaries also determine a reading order. A major problem with text-based extraction stems from the variability of PDF publishing software. Some tools embed a special character that is not included in a standard font by drawing vector graphics over it. Thus, although the page might look perfectly correct to a reader, the information is not present in the text layer of the PDF for text extractors to extract. In bioPDFX, we integrate text extraction from Apache PDFBox with other tools, such as PDFX and the optical character recognition tool Tesseract OCR, to overcome challenges like this.
XML transcription of a PDF file involves extracting text, tables, images, etc. and then tagging them with the appropriate sections or regions to form an XML document. The exact format of the transcribed XML depends on the specific tool and might include tags for title, abstract, introduction, tables, etc. Some tools, such as PDFX (Constantin et al. 2013), use a schema similar to the JATS/NLM document-type definition (DTD), a format specific to scientific articles, and are highly optimized around the organization of such PDF files. PDFX is a freely available rule-based system that converts scholarly PDF articles to XML with detailed sectional information such as title, abstract, references, tables, and acknowledgment. It derives transcription parameters from the relative font sizes, styles, and spacing within each individual PDF file. In our work, we integrate PDFX with other extraction tools to improve the extraction results for texts such as title, abstract, reference, and acknowledgement.
Text post-processing is a step that improves the correctness of the generated text. Several tools, such as Tesseract OCR (Smith 2007), deploy techniques from Optical Character Recognition (OCR) to increase accuracy in the post-processing stage. Because most OCR engines classify each character independently of other characters, it is important to post-process the text using contextual information such as statistical language models or syntactic and semantic rules. One such method, suggested by Zhuang and Zhu (Zhuang & Zhu 2005), reduces the candidate character search space by using the candidate distance information from an OCR engine in conjunction with an n-gram based language model and a semantic lexicon. In another method, proposed by Velagapudi (Velagapudi 1999), contextual information was extracted by modeling words/n-grams as sequences of characters, and a Hidden Markov Model (HMM) was subsequently used to decide the best sequence, using the given OCR output as the observation. Transition probabilities from one character to another are calculated from a corpus, and emission probabilities are calculated from the output of the OCR engine on a corpus, to statistically record the error patterns of the OCR engine. At the word level, Tong and Evans (Tong & Evans 1996) corrected OCR text by modeling the text as a sequence of bigrams and then using an HMM to compute the best word sequence. The system learns the character-level confusion probabilities for a specific OCR engine and uses them to achieve better performance. In our study, we follow a similar approach to post-process the output from PDFX using the Tesseract OCR engine.
In addition to the above methodologies, several systems extract metadata such as authors, year of publication, journal name, and bibliographical information from natural language text or directly from PDFs. CERMINE (Tkaczyk et al. 2014) uses supervised and unsupervised machine learning techniques to extract parsed bibliographic references and metadata directly from PDFs. FLUX-CiM (Cortez et al. 2007) uses unsupervised, non-template methods to extract citation components from articles. Other PDF transcribers include LA-PDFText (Ramakrishnan et al. 2012). PDFJailbreak (Garcia et al. 2013) is a communal project to create an architecture and shared API to support open community development of semantic information extractors for biomedical literature in PDF form. Although the abovementioned tools are available for PDF-to-text transcription, several issues are yet to be dealt with, such as how to extract titles and abstracts correctly, identify reference and acknowledgement sections in various formats, recognize special characters correctly, detect and fix jumbling errors, and correct word boundaries.

2. Materials and Methods
The goal of this study is to extract high-quality XML from biomedical literature PDFs to make them better suited for our text mining tasks. Given a PDF file and its corresponding PubMed ID, our system integrates the following four tools to generate the output XML file:

• PDFX (Constantin et al. 2013) to transcribe the given PDF to XML
• PMC Entrez e-utilities API (Sayers et al. 2011) to retrieve text from PubMed
• Tesseract OCR (Smith 2007) to extract text from the PDF file
• Apache PDFBox (2015b) to transcribe the text of the given PDF

As shown in Figure 2, there are five correction stages (title/abstract, reference/acknowledgment, special character, jumbling error, and word boundary corrections) in the bioPDFX pipeline. Each stage is a filter that accepts an XML document as input and gives a corrected XML document as output. We start from the XML generated by PDFX, correct the title and abstract using the PMC Entrez e-utilities API, identify reference and acknowledgement texts, exploit Tesseract OCR to fix special characters, utilize Apache PDFBox to correct jumbling errors, and recover word boundaries as our final stage. It should be noted that the pipeline design of bioPDFX allows the correction stages to be reordered and new correction stages to be added. The details of each correction stage are discussed in the following subsections.

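The filter design described above, in which every stage maps an XML document to a corrected XML document, can be sketched in Python as follows. This is an illustrative sketch only: the stage functions `set_title` and `strip_text`, the helper `run_pipeline`, and the tag names `article`, `title`, and `body` are hypothetical and do not reflect the PDFX schema or the actual bioPDFX stages.

```python
import xml.etree.ElementTree as ET


def run_pipeline(root, stages):
    """Apply each stage in order; every stage takes and returns an XML element."""
    for stage in stages:
        root = stage(root)
    return root


def set_title(new_title):
    """Hypothetical stage factory: overwrite the <title> element with a trusted title."""
    def stage(root):
        node = root.find("title")
        if node is None:
            node = ET.SubElement(root, "title")
        node.text = new_title
        return root
    return stage


def strip_text(root):
    """Hypothetical stage: trim surrounding whitespace in every element's text."""
    for node in root.iter():
        if node.text:
            node.text = node.text.strip()
    return root


doc = ET.fromstring("<article><title> wrong </title><body> text </body></article>")
result = run_pipeline(doc, [set_title("bioPDFX"), strip_text])
print(ET.tostring(result, encoding="unicode"))
```

Because every stage shares the same input and output type, reordering the list or appending a new stage requires no change to `run_pipeline`, which mirrors the flexibility noted in the text.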
Figure 2. A flow diagram showing the stages of correction and their inputs.
Stages are arranged in a pipelined fashion and take an XML as input and return a corrected XML as output.
2.1 Title and Abstract Correction
The flow diagram illustrating this step is given in Figure 3. PDFX, whose output is the input of this step, performs well in detecting the title and abstract of regular research articles; but for manuscripts whose first page contains content other than the title or abstract, PDFX struggles to identify them correctly. To overcome this, we used the PMC Entrez e-utilities API (Sayers et al. 2011) to retrieve the correct title and abstract (using the PubMed ID of the paper) and substituted them for whatever was present in the PDFX XML. It should be noted that we used the PMC Entrez e-utilities API because our focus is the biomedical literature archived in PubMed Central.

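A lookup of this kind might be sketched in Python as below. The sketch assumes the public E-utilities `efetch` endpoint with `db=pubmed` and `retmode=xml`, and assumes that the returned record stores the title in `ArticleTitle` and the abstract in one or more `AbstractText` elements; the function names and the inline `SAMPLE` record are hypothetical, and no network request is made here.

```python
import urllib.parse
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pubmed_id):
    """Build the request URL for one PubMed record in XML form."""
    query = urllib.parse.urlencode({"db": "pubmed", "id": pubmed_id, "retmode": "xml"})
    return EFETCH_URL + "?" + query


def parse_title_and_abstract(pubmed_xml):
    """Extract (title, abstract) from a PubMed XML record; either may be None."""
    root = ET.fromstring(pubmed_xml)
    title_node = root.find(".//ArticleTitle")
    title = "".join(title_node.itertext()).strip() if title_node is not None else None
    # Structured abstracts carry several labelled AbstractText parts; join them.
    parts = ["".join(node.itertext()).strip()
             for node in root.iterfind(".//Abstract/AbstractText")]
    abstract = " ".join(part for part in parts if part) or None
    return title, abstract


# Hypothetical, hand-written record used instead of a live response.
SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation><Article>
<ArticleTitle>Example title.</ArticleTitle>
<Abstract><AbstractText Label="BACKGROUND">First part.</AbstractText>
<AbstractText Label="RESULTS">Second part.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""

print(build_efetch_url("12345"))
print(parse_title_and_abstract(SAMPLE))
```

In a full pipeline the retrieved strings would replace the title and abstract elements of the PDFX XML; how those elements are located depends on the PDFX schema and is not shown.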
Figure 3. A flow diagram showing the title and abstract correction.
The title and abstract correction pipeline, with PDFX XML as input and Title/Abstract corrected XML as output.
2.2 Reference and Acknowledgement Correction
The format of references varies considerably among publications, ranging from a separate section towards the end to individual references in footnotes running alongside the content body. As PDFX is a purely rule-based system, it often falters in recognizing references when the common pattern of a headed reference section at the end is absent. Similar problems were noticed in identifying acknowledgements.
A key observation was that whenever PDFX identifies references or acknowledgements in a paper, it is generally correct; correction was needed only when it identifies neither. We realized that the text content of references and acknowledgments shows clear patterns in language. Because, for our purpose, we did not need to extract individual bibliographic items in the references section as PDFX does, classifying each section as reference/acknowledgment or not was enough. It is also important to note that the only requirement was to detect reference text, which is much coarser grained and simpler than complete metadata extraction from journals or metadata extraction for each citation.
Although references and acknowledgments are usually clumped together, in this correction step we designed two separate classifiers, as the patterns they show are different:
(1) For reference detection, we used PDFX XMLs that had a PDFX “reference” tag to train the classifier (there are 451 such XMLs in our experiment), taking the reference text as positive examples and the rest of the text from the paper as negative examples. Some of the most prominent features were the density of year-like numbers, the density of “et al.” and similar strings, the density of numbers followed by a dot, and so on. For each of these features, the ratio of the number of occurrences to the total number of tokens in the text was used instead of the raw count. We normalized these features and trained a Random Forest classifier (Breiman 2001) to predict whether a paragraph is reference text.
(2) For acknowledgment detection, we used PDFX XMLs that had a PDFX “acknowledgment” tag to train the classifier (there are 451 such XMLs in our experiment). We used another Random Forest classifier whose features are TF-IDF followed by Latent Semantic Analysis (Deerwester et al. 1990) (Truncated Singular Value Decomposition (Hansen 1987; Kolda & O'leary 1998) in our experiment) to perform the detection. The features we adopted are the TF-IDF of phrases such as “funded by” and “express gratitude”.

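The density features used for reference detection in (1) can be illustrated with the following Python sketch. It is a rough sketch under assumptions: the whitespace tokenization, the regular expressions, and the feature names are hypothetical choices, and the Random Forest training step is omitted.

```python
import re


def reference_features(paragraph):
    """Return occurrence counts divided by the number of whitespace-separated tokens."""
    tokens = paragraph.split()
    total = len(tokens)
    if total == 0:
        return {"year_density": 0.0, "et_al_density": 0.0, "number_dot_density": 0.0}
    # Year-like numbers such as 1998 or 2016 (assumed range: 1900-2099).
    years = len(re.findall(r"\b(?:19|20)\d{2}\b", paragraph))
    # "et al" and the variant "et. al".
    et_al = len(re.findall(r"\bet\.?\s+al\b", paragraph, flags=re.IGNORECASE))
    # Short numbers followed by a dot, as in numbered reference lists ("12.").
    number_dot = sum(1 for token in tokens if re.fullmatch(r"\d{1,3}\.", token))
    return {
        "year_density": years / total,
        "et_al_density": et_al / total,
        "number_dot_density": number_dot / total,
    }


print(reference_features("12. Smith J et al. 2016. Nature."))
print(reference_features("We measured blood pressure in mice."))
```

Using ratios rather than raw counts keeps the features comparable between a short footnote reference and a long reference section.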
2.3 Special Characters Correction
PDFX and other PDF-to-text converters often fail to recognize special characters (such as '!' and '@'), depending on how the PDF file was published. The basic idea of the special character correction step is that, given a possibly erroneous n-gram from PDFX at the token/word level, we find the corresponding n-gram in the Tesseract OCR output and then align the two n-grams character by character. Then, for differing characters, we decide which character should be present in that position using a Hidden Markov Model (HMM) (Baum & Petrie 1966) with the Viterbi inference algorithm (Viterbi 1967) and language models. The flowchart for this pipeline is summarized in Figure 4.

Figure 4. A flow diagram for special character correction.
This pipeline shows the steps taken in correcting special character errors in an n-gram.
2.3.1 Suspicious N-Gram and Character Identification
Before applying the HMM, we first select similar character-level n-grams (we treat each character, including the space, as a unigram) from the output of OCR and PDFX, align the n-grams, and then mark suspicious characters in the n-grams. To select the most similar n-gram from the OCR and PDFX text, we used the Levenshtein Distance (Levenshtein 1966), normalized for length, to measure the similarity between the two n-grams. The alignment of the two n-grams is done at the character level and is based on a widely used gene sequence alignment mechanism (Hsu et al. 2008). If the similarity between the OCR n-gram and the PDFX n-gram (based on the normalized Levenshtein Distance) is at least 0.6, the characters that need to be replaced are marked. This is done by looking at the characters that differ in the two n-grams after the alignment and selecting those characters which are not aligned with whitespaces or digits. These mismatch characters are then fed to an HMM to predict the top possible correct characters. The example in Figure 5 shows the OCR, PDFX, and gold standard n-grams. In this example, the mismatch characters in the OCR n-gram are “α”, “u”, and “3”, while those in the PDFX n-gram are “a”, “a”, and “s”.

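The similarity test and the marking of mismatch characters can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than details from the text: similarity is taken as one minus the Levenshtein distance divided by the longer length, the two n-grams are assumed to be already aligned to equal length (the sequence alignment step is not reproduced), and only whitespace-aligned positions are skipped, so the digit rule is left out for brevity. The function names and example strings are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def mark_mismatches(ocr, pdfx, threshold=0.6):
    """Return positions where two equal-length, pre-aligned n-grams differ."""
    if len(ocr) != len(pdfx) or similarity(ocr, pdfx) < threshold:
        return []
    return [i for i, (o, p) in enumerate(zip(ocr, pdfx))
            if o != p and not (o.isspace() or p.isspace())]


ocr = "the α-value wa3 less"
pdfx = "the a-value was less"
print(similarity(ocr, pdfx), mark_mismatches(ocr, pdfx))
```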
Figure 5. An example showing three types of n-gram: OCR, PDFX, and gold standard.
The mismatch characters are shown in bold and underlined style.
2.3.2 Prediction of Top-N Possible Correct N-Gram Lists for OCR and PDFX
After marking the suspicious characters in the n-gram from OCR and the one from PDFX, we predict the list of the most probable correct n-grams for OCR and the list for PDFX. Our solution is to train an HMM (16) using a training corpus of aligned OCR, PDFX, and gold standard texts, and to infer the “most probable” n-gram with the Viterbi algorithm.
In formulating this problem in terms of HMM Viterbi inference, one of our assumptions is that characters in the words of meaningful sentences follow the Markov assumption to a good extent, which is well supported by handwriting and speech recognition applications that have used HMMs to achieve good performance (10) (19). The other assumption is that OCR and PDFX have a statistical pattern or bias in the character-level errors that they make, and that these patterns can be learned from a training corpus. The same technique of modeling words as sequences of characters with an HMM has been used to boost OCR accuracy in (24).
To formally state the problem, we first define a character $x_t$ as the $t$-th letter in an OCR n-gram. Therefore, each OCR n-gram can be represented as $X = x_1 x_2 \ldots x_N$, which is a concatenation of $N$ characters. It should be noted that the whitespace (i.e., “ ”) is also treated as a character. In Figure 5, the number of characters in the OCR n-gram “the α-value wa3 less than 0.01” is 30.
Next, we compute an ambiguity dictionary, which records occurrences of observed mismatched characters in the corpus. Specifically, the ambiguity dictionary is a list of tuples of the form (observed character, actual character, frequency). For example, a tuple (“α”, “p”, 30) means that on 30 occasions when we saw “α” in the OCR / PDFX text corpus, the actual character in the gold standard was “p”. Multiple tuples for an observed character, like “α” in our example, can be aggregated and written as:

“α”: [(“p”, 150), (“α”, 30), (“a”, 5)],

which means that for “α” in an OCR / PDFX n-gram, the true character was “p” 150 times, “α” 30 times, and “a” 5 times.
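As an illustration, an ambiguity dictionary of this shape could be built from aligned text pairs with the Python sketch below. The sketch assumes that each observed string has already been aligned to its gold standard string at equal length, and it counts every aligned pair, including positions where the two characters agree, which matches the (“α”, “α”, 30) entry in the example. The function name and the toy pairs are hypothetical.

```python
from collections import Counter, defaultdict


def build_ambiguity_dictionary(aligned_pairs):
    """Map each observed character to a list of (actual character, frequency)."""
    counts = defaultdict(Counter)
    for observed, gold in aligned_pairs:
        if len(observed) != len(gold):
            raise ValueError("observed and gold texts must be aligned to equal length")
        for observed_char, gold_char in zip(observed, gold):
            counts[observed_char][gold_char] += 1
    # most_common() sorts the actual characters by descending frequency.
    return {char: counter.most_common() for char, counter in counts.items()}


# Toy aligned (observed, gold) pairs, invented for illustration.
pairs = [("α-value", "p-value"), ("α-value", "p-value"), ("α=0.05", "α=0.05")]
ambiguity = build_ambiguity_dictionary(pairs)
print(ambiguity["α"])
```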
Thus, each mismatch character $x_t$ (from the OCR n-gram) corresponds to a candidate-character set $\delta_t$, which can be computed from the ambiguity dictionary. For non-mismatch characters, the candidate-character set $\delta_t = \{x_t\}$ contains only one element, which is exactly $x_t$. In our OCR n-gram example, the candidate-character set of the first character “t” is {“t”}, while that of the fifth character “α” may be, for example, {“α”, “a”, “p”, “o”, “0”}. Therefore, our problem can be regarded as choosing the “most probable” candidate character for each mismatch character in the OCR n-gram, so as to maximize the overall likelihood of recovering the gold standard n-gram.
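The construction of the candidate-character sets can be sketched in Python as follows, assuming the ambiguity dictionary is stored as a mapping from an observed character to a list of (actual character, frequency) tuples. The names `AMBIGUITY` and `candidate_sets` are hypothetical, and keeping the observed character itself as a fallback candidate is an assumption of this sketch.

```python
# Ambiguity dictionary entry taken from the example in the text.
AMBIGUITY = {"α": [("p", 150), ("α", 30), ("a", 5)]}


def candidate_sets(ngram, mismatch_positions, ambiguity):
    """Return one list of candidate characters per position of the n-gram."""
    sets = []
    for position, char in enumerate(ngram):
        if position in mismatch_positions:
            candidates = [actual for actual, _frequency in ambiguity.get(char, [])]
            if char not in candidates:
                candidates.append(char)  # assumed fallback: keep what was observed
            sets.append(candidates)
        else:
            sets.append([char])          # non-mismatch: the single observed character
    return sets


sets = candidate_sets("the α-value", {4}, AMBIGUITY)
print(sets[0], sets[4])
```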
In other words, we want to find $Y^* = y_1^* y_2^* \ldots y_N^* = \arg\max_Y P(Y \mid X)$, where $y_1^* y_2^* \ldots y_N^*$ are the “most probable” candidate characters given $X$, the pair of OCR and PDFX n-grams. We simplify the notation by using $P(Y)$ as the target probability that we want to maximize.
We apply a Hidden Markov Model (HMM) to solve our problem. Based on the Markov assumption, the probability we want to maximize can be written as $P(Y) = P(y_1 y_2 \ldots y_N) = P(y_1)\,P(y_2 \mid y_1)\,P(y_3 \mid y_2) \cdots P(y_N \mid y_{N-1})$. The model of our HMM is defined as follows:

• Observation
$\sigma = \{p_1, p_2, \ldots, p_M\}$, which is the set of all $M$ possible values for all characters.

• State
$Q = \{q_1, q_2, \ldots, q_K\}$, which is the set of all $K$ possible values for all candidate characters.

• Initialization Probabilities
$\Pi = \{\pi_i \mid \pi_i = P(y_1 = q_i)\}$, which is the set of initial probabilities for each state. Therefore, $|\Pi| = K$. Suppose $L_1(q_i)$ is the probability given by a 1-gram language model (trained on the gold standard n-grams) for each candidate character $q_i$ (e.g., “α”); we compute $\pi_i$ as follows:
$\pi_i = L_1(q_i)$
Note that $\sum_{i=1}^{K} \pi_i = 1$.

• Transition Probabilities
$A = \{a_{ij} \mid a_{ij} = P(y_{t+1} = q_j \mid y_t = q_i)\}$, which is the set of probabilities of
transiting from the t-th candidate character $y_t$ with state $q_i$ to the (t+1)-th candidate
character $y_{t+1}$ with state $q_j$. Therefore, $|A| = K^2$. Suppose $L_2(q_i, q_j)$ is the conditional
probability given by the 2-gram language model (trained from the gold standard n-grams) for two
consecutive candidate characters $q_i$ and $q_j$ (for example, “α-”); we compute $a_{ij}$ as
follows:
$a_{ij} = L_2(q_i, q_j)$

Note that $\sum_{j=1}^{K} a_{ij} = 1$ for all states $i = 1, 2, \ldots, K$.

• Emission Probabilities
$B = \{b_{ik} \mid b_{ik} = P(x_t = p_k \mid y_t = q_i)\}$, which is the set of probabilities for the t-th
candidate character $y_t$ with state $q_i$ to emit the t-th character $x_t$ with observation
$p_k$. Therefore, $|B| = K \times M$. Suppose $C(p_k, q_i)$ is the number of times that a candidate
character $q_i$ (e.g., “p” from the gold standard n-gram) is recognized as a character $p_k$ (e.g., “α”
from the OCR n-gram), and $\lambda$ is a regularization constant; we compute $b_{ik}$ as follows:

Note that $\sum_{k=1}^{M} b_{ik} = 1$ for all states $i = 1, 2, \ldots, K$. In our example ambiguity
dictionary for character “α” = [(“p”, 150), (“α”, 30), (“a”, 5)], the emission probability from “p” to
“α” (i.e., the probability that we observe “α” in the OCR n-gram, but the true character
in the gold standard n-gram is “p”) can be estimated as $(30 + \lambda) / (150 + 30 + 5 + \lambda)$.
Finally, we apply the Viterbi algorithm with our HMM to decode the N most probable
Y*. In this way, we obtain the top N most probable n-grams given an OCR n-gram as the
observation.
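The decoding step can be illustrated with a short Python sketch. It is a simplification under stated assumptions: it returns only the single best sequence (the pipeline described here decodes the top-N sequences), it restricts the states at each position to that position's candidate-character set, and the function name `viterbi`, the dictionary-based parameter layout, and all toy probabilities are hypothetical.

```python
import math


def viterbi(observed, candidates, log_init, log_trans, log_emit):
    """Return (best_sequence, log_probability) under a first-order HMM.

    observed   : list of observed characters x_1..x_N
    candidates : list of candidate-character lists, one per position
    log_init   : dict state -> log initial probability
    log_trans  : dict (prev_state, state) -> log transition probability
    log_emit   : dict (state, observed_char) -> log emission probability
    Missing entries are treated as log-probability -inf.
    """
    neg_inf = float("-inf")
    # best[t][q] = (score of the best path ending in q at position t, back-pointer)
    best = [{}]
    for q in candidates[0]:
        score = log_init.get(q, neg_inf) + log_emit.get((q, observed[0]), neg_inf)
        best[0][q] = (score, None)
    for t in range(1, len(observed)):
        best.append({})
        for q in candidates[t]:
            emit = log_emit.get((q, observed[t]), neg_inf)
            prev, score = max(
                ((p, best[t - 1][p][0] + log_trans.get((p, q), neg_inf))
                 for p in candidates[t - 1]),
                key=lambda item: item[1],
            )
            best[t][q] = (score + emit, prev)
    # Backtrack from the best final state.
    last = max(best[-1], key=lambda q: best[-1][q][0])
    log_prob = best[-1][last][0]
    path = [last]
    for t in range(len(observed) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return "".join(reversed(path)), log_prob


# Toy example: the observed "α" is a mismatch character with candidates "α" and "p".
observed = ["α", "<"]
candidates = [["α", "p"], ["<"]]
log_init = {"p": math.log(0.6), "α": math.log(0.4)}
log_trans = {("p", "<"): math.log(0.5), ("α", "<"): math.log(0.1)}
log_emit = {("p", "α"): math.log(0.8), ("α", "α"): math.log(0.9), ("<", "<"): 0.0}
best_ngram, best_logp = viterbi(observed, candidates, log_init, log_trans, log_emit)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied along a path.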
We follow the same process for PDFX as well, that is, treating the PDFX n-gram as the
observation and using the OCR n-gram to see which characters are not in consensus. Similarly, the
ambiguity dictionary for PDFX was derived by applying the same method, except that we record
ambiguities for PDFX instead of OCR. Hence, we can compute another list of the top N most
probable PDFX n-grams along with their probabilities.

2.3.3 Combination of Top-N Possible Correct N-Gram Lists
Next, we select the most probable n-gram as the n-gram common to both lists (with
respect to marked characters only) which has the maximum combined probability, using the idea
of integrating gene mention tagging models (9). This is illustrated in Figure 6.

Figure 6. An example showing two top-N lists of the most probable NXML n-grams with
their log probabilities given by Viterbi inference.
Candidate number 5 in the left list (where the PDFX n-gram is the observation) matches candidate
number 0 in the right list (where the OCR n-gram is the observation), which is the “chosen best”.

The PDFX n-gram in Figure 6 is “rs2252931 max P = 2.2610 29” and the OCR n-gram is
“rs2252931 max P=2.2x10-9”. The characters highlighted in green/blue are the characters marked
for correction. We do not apply correction to other characters, but instead keep whatever we
observe in the PDFX n-gram, because the other mismatches are pairs of digits. The left list of 10
candidates comes from treating the PDFX n-gram as the observation, while the right list shows the
candidates from treating the OCR n-gram as the observation. The negative numbers are the log
probabilities for each sequence as given by the Viterbi algorithm.
The candidate numbered 5 in the left list matches the candidate numbered 0 in the right list
(for the highlighted characters), and this combination has the highest joint probability (although
other combinations, such as (5, 1) and (5, 4), also match). Thus, we choose this candidate as the most
probable n-gram.
There can be a case where the two lists do not share a common candidate with respect to
marked characters. In that case, we use the n-gram language model trained on our corpus to
obtain the probabilities for each of these twenty candidates, and select the most probable n-gram
candidate. If there is again a tie among all twenty, we take the top n-gram candidate from
the PDFX candidate list as the most probable n-gram. An example of this scenario is shown in
Figure 7, where the two lists do not share a candidate in common and the language model chooses
the best candidate in the OCR candidate list, as it scores the highest.
Finally, in the post-processing step, we pick the n-gram with the highest score as the “most
probable” n-gram, which is the output of the special character correction step.
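The selection rule of this section can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the paper: candidates are represented as hypothetical (marked_chars, ngram, log_prob) tuples, the "combined probability" is taken to be the sum of the two log probabilities, and the language model is passed in as an arbitrary scoring function `lm_score`.

```python
def choose_best(pdfx_list, ocr_list, lm_score):
    """Pick the final n-gram from two top-N candidate lists.

    pdfx_list, ocr_list : lists of (marked_chars, ngram, log_prob), where
        marked_chars is the tuple of characters at the marked positions.
    lm_score : callable ngram -> language-model log-probability (fallback only).
    """
    ocr_best = {}
    for marked, _, logp in ocr_list:
        # Keep the highest-scoring OCR candidate for each marked-character pattern.
        if marked not in ocr_best or logp > ocr_best[marked]:
            ocr_best[marked] = logp
    matches = [
        (logp + ocr_best[marked], ngram)
        for marked, ngram, logp in pdfx_list
        if marked in ocr_best
    ]
    if matches:
        # Common candidate with the maximum combined log-probability.
        return max(matches, key=lambda m: m[0])[1]
    # No common candidate: rescore every candidate with the language model.
    # max() returns the first maximal item, so a full tie favours the top PDFX candidate.
    pool = [ngram for _, ngram, _ in pdfx_list] + [ngram for _, ngram, _ in ocr_list]
    return max(pool, key=lm_score)


# Toy lists (hypothetical values) with one marked position.
pdfx_candidates = [(("6",), "2.2610", -1.0), (("x",), "2.2x10", -3.0)]
ocr_candidates = [(("x",), "2.2x10", -0.5), (("×",), "2.2×10", -2.0)]
chosen = choose_best(pdfx_candidates, ocr_candidates, lambda ngram: 0.0)
```

Listing the PDFX candidates first in the fallback pool is what makes a complete tie resolve to the top PDFX candidate, provided each list is sorted by descending probability.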
Figure 7. An example showing two top-N lists of the most probable n-grams with their log
probabilities given by Viterbi inference.
None of the pairs of candidates in the lists match.

2.4 Jumbling Correction
Jumbling errors (the wrong order of text when converting PDF to text, as
illustrated in Figure 1) in the output XML from PDFX are corrected by using the textual output
generated by Apache PDFBox. We start by making a non-overlapping n-gram set from the PDFX XML
text and an overlapping n-gram set from the Apache PDFBox text. Then, we use a simple exhaustive
method to check whether an n-gram contains a jumbling error: if an overlapping n-gram from Apache
PDFBox is a permutation of a PDFX n-gram, we consider that the PDFX n-gram contains a jumbling
error. Next, we replace the suspicious n-gram with the non-jumbled version of the n-gram from
Apache PDFBox. Finally, we string together the non-overlapping n-grams with a space to construct
the jumbling-corrected text.
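A minimal Python sketch of this procedure is given below. It assumes whitespace tokenization and works on plain strings rather than XML; the function name `fix_jumbling` is hypothetical, and indexing the PDFBox n-grams by their sorted tokens is a shortcut used here in place of a pairwise exhaustive comparison.

```python
def fix_jumbling(pdfx_text, pdfbox_text, n=5):
    """Replace jumbled PDFX n-grams with the matching PDFBox n-gram."""
    pdfx_tokens = pdfx_text.split()
    box_tokens = pdfbox_text.split()
    box_grams = set()
    by_key = {}
    # Index every overlapping PDFBox n-gram by its sorted tokens (an order-free key).
    for i in range(len(box_tokens) - n + 1):
        gram = tuple(box_tokens[i:i + n])
        box_grams.add(gram)
        by_key.setdefault(tuple(sorted(gram)), gram)
    out = []
    # Walk the PDFX text in non-overlapping chunks of n tokens.
    for i in range(0, len(pdfx_tokens), n):
        chunk = tuple(pdfx_tokens[i:i + n])
        if len(chunk) == n and chunk not in box_grams:
            # A permutation of a PDFBox n-gram is treated as a jumbling error.
            chunk = by_key.get(tuple(sorted(chunk)), chunk)
        out.extend(chunk)
    return " ".join(out)


fixed = fix_jumbling("a b d c e f g h i j", "a b c d e f g h i j")
```

Because the PDFX chunks do not overlap, a jumbling error that straddles a chunk boundary is not detected by this sketch.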

2.5 Word Boundary Correction
To correct word boundary errors (e.g., “cancer” might become “can-” and “cer” if it is the last
word of a row, and should therefore be recombined as “cancer” instead of two words), we
adopted a simple dictionary lookup method: we compare the different parts (e.g., “can-” and “cer”)
with the English dictionary in the Enchant Spellchecking System (2015c) and merge them as
required. Specifically, if the hyphenated word is not present in the English dictionary but the word
without the hyphen is present in the dictionary, we remove the hyphen from the word. If the word
is not present in the dictionary either with or without the hyphen, we keep it unchanged.
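The lookup rule can be sketched in Python as follows. To keep the example self-contained, a small hand-made vocabulary stands in for the Enchant dictionary; the function name `merge_hyphenated`, the `is_word` callback, and the assumption that the two fragments arrive as adjacent whitespace-separated tokens are all illustrative choices.

```python
def merge_hyphenated(text, is_word):
    """Merge 'can-' + 'cer' into 'cancer' when the dictionary supports it.

    is_word : callable word -> bool, standing in for a spellchecker lookup.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.endswith("-") and len(tok) > 1 and i + 1 < len(tokens):
            hyphenated = tok + tokens[i + 1]    # e.g. "can-cer"
            joined = tok[:-1] + tokens[i + 1]   # e.g. "cancer"
            # Remove the hyphen only if the hyphenated form is unknown
            # and the joined form is a dictionary word.
            if not is_word(hyphenated) and is_word(joined):
                out.append(joined)
                i += 2
                continue
        out.append(tok)
        i += 1
    return " ".join(out)


# Hypothetical mini-dictionary used in place of Enchant.
vocabulary = {"cancer", "well-known", "risk", "lung"}
merged = merge_hyphenated("lung can- cer risk", lambda w: w.lower() in vocabulary)
```

Fragments whose joined form is not in the dictionary, such as identifiers or numbers, are left untouched.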
2.6 Experiment Settings
We conduct experiments to compare bioPDFX with the state-of-the-art PDFX tool (Constantin et
al. 2013), and test the following two hypotheses:

(a) For each of the individual correction steps, bioPDFX provides better quality of PDF to
XML conversion.
(b) For the overall pipeline, bioPDFX also provides better conversion quality.

In our experiment, we set n=5 for n-grams. For special character correction, we use
StochHMM (14) to calculate the top-N lists of possible correct n-grams by Viterbi inference, and
KenLM (8) to build language models. We set the regularization parameter to $10^{-6}$.

2.6.1 Dataset and Gold Standards
To test the two hypotheses, the National Human Genome Research Institute (NHGRI) provided
2,185 biomedical articles (Jain et al. 2016) related to Genome Wide Association Studies (GWAS)
(Hindorff et al. 2009; Welter et al. 2014). Among them, 941 are indexed by PubMed Central
(PMC), where both PDF and XML versions were available for download when we searched in 2015.
From these 941 articles, we randomly sampled 100 pairs of PDF and XML versions (i.e., NXMLs)
as gold standards for four of our correction tasks: title and abstract, special
characters, jumbling, and word boundary. For reference and acknowledgement, we built our own
gold standard from PDF files directly rather than using those present in NXMLs, because the
reference sections in most of the NXMLs were incomplete and the acknowledgement sections were
omitted completely. For 70 randomly selected PubMed articles, we manually compiled
references and acknowledgement text from their PDFs to act as the gold standard. For hypothesis (b),
we further manually labeled the p-values and numbers (which are important for GWAS literature) as
our gold standard. We randomly split the data into 50% training and 50% test.

2.6.2 Evaluation Metrics
We developed a scoring metric that measures how a corrected XML from our system
performs with respect to the corresponding NXML. This metric uses an n-gram based similarity
method that is common in plagiarism detection. The idea is to make two sets of n-grams,
one from the NXMLs and one from our system, and use them to measure the occurrence and correct
order of words. We then compute the F1-score of the two n-gram sets as our evaluation metric for title
and abstract, reference and acknowledgement, special characters, and word boundary. We applied
the macro F1-score for most of the correction tasks, except special characters and word boundary (for
which the micro F1-score is computed). This is because the number of special characters or words
with incorrect boundaries in an article can vary considerably. For jumbling errors, we cannot apply
the n-gram similarity method directly, as the errors may appear in tables (as shown in Figure
1). Therefore, we compute the average count of such errors over all the articles in the dataset as
our evaluation metric. For hypothesis (b), we further compute the micro F1-score for the p-values
and numbers (for the same reason as for special characters and word boundary), and the macro
F1-score for the overall text quality (which is the n-gram similarity score for the whole document).
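The difference between the macro and micro F1-scores over n-gram overlap can be made concrete with a short Python sketch. It is an illustration under assumptions, not the scoring script used in the experiments: n-grams are built from whitespace tokens, repeated n-grams are counted as a multiset rather than a set, and all function names are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n=5):
    """Count the word n-grams of a text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(system_text, gold_text, n=5):
    """Return (matched, system_total, gold_total) n-gram counts."""
    sys_c, gold_c = ngram_counts(system_text, n), ngram_counts(gold_text, n)
    matched = sum((sys_c & gold_c).values())
    return matched, sum(sys_c.values()), sum(gold_c.values())


def f1(matched, system_total, gold_total):
    """F1-score from match counts; 0.0 when nothing matches."""
    if matched == 0:
        return 0.0
    precision = matched / system_total
    recall = matched / gold_total
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs, n=5):
    """Average of per-document F1-scores."""
    scores = [f1(*overlap(s, g, n)) for s, g in pairs]
    return sum(scores) / len(scores)


def micro_f1(pairs, n=5):
    """F1-score computed from counts pooled over all documents."""
    totals = [sum(col) for col in zip(*(overlap(s, g, n) for s, g in pairs))]
    return f1(*totals)


# Two toy (system, gold) document pairs, scored with bigrams for brevity.
docs = [("a b c", "a b c"), ("a b x y", "a b c d")]
macro = macro_f1(docs, n=2)
micro = micro_f1(docs, n=2)
```

The macro score weights every article equally, whereas the micro score weights every n-gram equally, so articles with many items contribute more to it.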
3. Results
The results for each of the five correction steps are shown in Table 1. That is, each step is
evaluated independently. As Table 1 shows, bioPDFX outperforms PDFX in all correction steps,
especially for abstract, reference/acknowledgement, special character, and jumbling error
corrections.
Table 2 shows the scores at the end of the pipeline, where the corrections
are run in the order of the pipeline shown in Figure 2. As with the results of the individual correction
steps, bioPDFX in general performs better. The additional evaluation tasks (p-value, number, and
overall text quality) are also included in Table 2. It should be noted that although no correction is
aimed at improving p-values, numbers, or overall text quality, bioPDFX was still able to provide
higher conversion quality for these three tasks.

Table 1. Stage-wise quality score comparison of raw and corrected XMLs.

Correction Step             PDFX      bioPDFX   Metric
Title                       0.9135    0.9763    Macro F1
Abstract                    0.5428    0.8920    Macro F1
Reference/Acknowledgment    0.6496    0.8026    Macro F1
Special Character           0.7071    0.8860    Micro F1
Jumbling                    2.89      1.92      Average Error
Word-Boundary               0.7619    0.8053    Micro F1

Jumbling score is the average number of jumbling errors detected per paper (lower is better).

Table 2. Final quality score comparison of raw and corrected XMLs at the end of the
correction pipeline.

Correction Step             PDFX      bioPDFX   Metric
Title                       0.9135    0.9695    Macro F1
Abstract                    0.5428    0.8877    Macro F1
Reference/Acknowledgment    0.6496    0.8026    Macro F1
Special Character           0.7071    0.8932    Micro F1
Jumbling                    2.89      1.95      Average Error
Word-Boundary               0.7619    0.8189    Micro F1
P-value                     0.5465    0.6615    Micro F1
Number                      0.8733    0.8853    Micro F1
Overall Text Quality        0.9006    0.9107    Micro F1

Jumbling score is the average number of jumbling errors detected per paper (lower is better).

4. Discussion
The experimental results support the two hypotheses. Next, we discuss our results for each
correction step in detail:

• Title and Abstract. The scores for title and abstract are quite good in the case of raw
XMLs and increase substantially in the final corrected XMLs. This is expected
behavior, as we retrieve titles and abstracts directly using the PMC Entrez e-utilities API,
which can be thought of as a gold standard itself. The F1-scores are still not perfect
due to syntactic differences between the title/abstract text present in NXMLs and the
title/abstract text retrieved from the e-utilities API. For example, the API does not seem to use
Unicode characters like “≤”, “≥” and “∼” and instead has the phrases “less than or
equal”, “greater than or equal” and “approximately” in their place. Furthermore, there
were cases where the data from NXMLs was incorrect because of errors from the
authors of those papers, such as missing titles/abstracts and inconsistencies
in punctuation. Apart from this, we can consider the titles and abstracts to be near perfect in
the corrected XMLs (assuming the e-utilities API gives correct results). Some other errors
are introduced by the accumulation of errors from previous stages of correction in the
pipeline, but as can be seen by comparing the stage-wise and end-of-pipeline scores in Table 1
and Table 2, they are quite small and can be neglected.
• Reference and Acknowledgement. Reference/Acknowledgment scores also improved
considerably and led to better scores for the remaining stages of the pipeline by removing
a significant number of the false positives, as can be observed by comparing the
stage-wise results in Table 1 and the final results in Table 2.
• Special Character. We could improve the special character F1-score by a considerable margin
through our technique of combining HMMs, n-gram alignment, OCR, and language
models.
• Jumbling Error. The average count of jumbling errors is lower for bioPDFX than for
PDFX. It should be noted that there is no guarantee that jumbling errors occur only over
5-grams, or equivalently within 5 words. In the same document, there might be jumbling
errors spanning arbitrarily large and varied n. A better approach might be to start with
a much larger n, attempt to detect jumbling from there, and progressively reduce n one
by one; however, this approach is computationally intensive. Even operating on 5-grams,
the jumbling error correction step takes a significant amount of time compared with the other
correction steps.
• Word Boundary. The F1-score of word boundary correction using bioPDFX is also
modestly improved compared with PDFX. We also attempted to use a language model (along
with the English dictionary) to increase the range of corrections possible for biomedical
terms, but it did not give us any performance improvement. We believe this is because
word boundary errors are infrequent relative to the total number of words in
an article, and because most words in an article are English words or numbers, most word
boundary errors happen in English words, which are already corrected by the English
dictionary.
Also, the usability of bioPDFX can be greatly increased by a multiple-file upload
functionality, so that users can submit many PDF files of interest for conversion instead of
uploading them one by one. We are currently implementing an interface for such a multiple-upload
feature (Figure 8) for a biocuration tool that will include this feature to maximize the utility of
bioPDFX.

Figure 8. The interface for multiple-file uploading.
The usability of bioPDFX can be greatly increased by integrating it within a biocuration tool.

5. Conclusions
We have developed a tool to convert PDF files into XML format specifically for biomedical
articles, targeting five challenges of state-of-the-art converters: title/abstract,
reference/acknowledgement, special character, jumbling error, and word boundary. Overall, the
whole pipeline developed in this paper makes the published literature in the form of PDF files much
better suited for text mining tasks, while slightly improving the overall text quality as well.
Although the tool discussed in this paper is trained to be specifically optimized for our target
corpus (GWAS), we believe that these techniques can be applied to other kinds of literature as
well.
It should be noted that although we use PDFX XMLs as the main text source, and use Apache
PDFBox and OCR text as auxiliary text sources, bioPDFX is general and is not dependent on these
text sources. Given a primary text source in which to correct substitution errors, other secondary
text sources can also be exploited in the correction steps.
There are some limitations of the approach, but we believe that these can be easily overcome
in different domains and by investigating several different text sources. These limitations are
summarized below:

• Title and abstract corrections depend on the PMC Entrez e-Utilities API
• As we model n-grams as sequences of characters and rely on the Markov assumption, we
cannot correct errors in which one character is replaced by multiple ones, or is deleted
• Jumbling error correction is only an approximation since we fix n=5

In the future, we plan to process each of the text sources in parallel to further improve the overall
performance; our initial results were reported in Goyal et al. (2016). We also plan to
explore more challenging issues in the PDF to XML conversion process. Finally, we plan to extend
our experiments to more diverse types of large-scale corpora.

Acknowledgements
We would like to thank Dr. Lucia Hindorff of NHGRI for her generous support of this project by
providing us with PDF copies of test articles and the curation guidelines of the Catalog of GWAS to
make this research possible. We would also like to thank Dr. Helen Parkinson, Dr. Jackie
MacArthur and their team members at EBI for their assistance.

References
2015a. Adobe Acrobat SDK. Available at http://www.adobe.com/devnet/acrobat/overview.html.
2015b. Apache PDFBox. Available at https://pdfbox.apache.org/index.html.
2015c. Enchant Spellchecking System.
2015d. Mozilla PDF.js. Available at https://mozilla.github.io/pdf.js/.
2015e. National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), PubMed Central (PMC) Open Access Subset. Available at http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/.
Baum LE, and Petrie T. 1966. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37:1554-1563.
Berg ØR. 2011. High precision text extraction from PDF documents.
Breiman L. 2001. Random forests. Machine Learning 45:5-32.
Constantin A, Pettifer S, and Voronkov A. 2013. PDFX: fully-automated PDF-to-XML conversion of scientific literature. Proceedings of the 2013 ACM symposium on Document engineering. Florence, Italy: ACM. p 177-180.
Cortez E, da Silva AS, Gonçalves MA, Mesquita F, and de Moura ES. 2007. FLUX-CIM: flexible unsupervised extraction of citation metadata. Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries: ACM. p 215-224.
Deerwester S, Dumais ST, Furnas GW, Landauer TK, and Harshman R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41:391-407. doi: 10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9
Garcia A, Murray-Rust P, Burns G, Stevens R, Tkaczyk D, McLaughlin C, Belin A, Di Iorio A, García L, and Gruson-Daniel C. 2013. PDFJailbreak-a communal architecture for making biomedical PDFs semantic. Proceedings of BioLINK SIG 2013:63.
Goyal A, Singh A, Bhargava S, Crawl D, Altintas I, and Hsu C-N. 2016. Natural Language Processing using Kepler Workflow System: First Steps. Procedia Computer Science 80:712-721.
Hansen PC. 1987. The truncated SVD as a method for regularization. BIT Numerical Mathematics 27:534-553.
Hindorff L, Sethupathy P, Junkins H, Ramos E, Mehta J, Collins F, and Manolio T. 2009. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences 106:9362-9367. doi: 10.1073/pnas.0903103106
Hsu C-N, Chang Y-M, Kuo C-J, Lin Y-S, Huang H-S, and Chung IF. 2008. Integrating high dimensional bi-directional parsing models for gene mention tagging. Bioinformatics 24:i286-i294.
Jain S, Tumkur K, Kuo T-T, Bhargava S, Lin G, and Hsu C-N. 2016. Weakly Supervised Learning of Biomedical Information Extraction from Curated Data. BMC Bioinformatics.
Kolda TG, and O'Leary DP. 1998. A semidiscrete matrix decomposition for latent semantic indexing information retrieval. ACM Transactions on Information Systems (TOIS) 16:322-346.
Levenshtein VI. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. p 707.
Ramakrishnan C, Patnia A, Hovy E, and Burns GA. 2012. Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine 7:1.
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, and Federhen S. 2011. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 39:D38-D51.
Smith R. 2007. An overview of the Tesseract OCR engine.
Tkaczyk D, Szostek P, Dendek PJ, Fedoryszak M, and Bolikowski L. 2014. CERMINE--Automatic Extraction of Metadata and References from Scientific Literature. Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on: IEEE. p 217-221.
Tong X, and Evans DA. 1996. A statistical approach to automatic OCR error correction in context. Proceedings of the fourth workshop on very large corpora. p 88-100.
Velagapudi P. 1999. Using HMMs to boost accuracy in optical character recognition. Proceedings of SPIE, 27th AIPR Workshop: Advances in Computer-Assisted Recognition: Citeseer. p 96-104.
Viterbi A. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13:260-269.
Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, and Parkinson H. 2014. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research 42:D1001-D1006. doi: 10.1093/nar/gkt1229
Zhuang L, and Zhu X. 2005. An OCR post-processing approach based on multi-knowledge. International Conference on Knowledge-Based and Intelligent Information and Engineering Systems: Springer. p 346-352.