
bioPDFX: preparing PDF scientific articles for biomedical text mining

Shitij Bhargava 1, Tsung-Ting Kuo 2, Ankit Goyal 1, Vincent Kuri 1, Gordon Lin 1, Chun-Nan Hsu 2

1 Department of Computer Science and Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California, United States
2 Health System Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California, United States

Corresponding Author:
Chun-Nan Hsu
9500 Gilman Drive, La Jolla, CA 92093, United States
Email address: [email protected]


Abstract

Background. A huge amount of full-text biomedical literature is available in public repositories such as PubMed Central (PMC). However, a substantial number of papers are available only in Portable Document Format (PDF) and do not provide plain text ready for text mining and natural language processing (NLP). Although many PDF-to-text converters exist, they still face several challenges when processing biomedical PDFs, such as correctly transcribing titles and abstracts, segmenting references and acknowledgements, recognizing special characters, fixing jumbling errors (text transcribed in the wrong order), and correcting word boundaries.

Methods. In this paper, we present bioPDFX, a novel tool that complements the weaknesses of multiple state-of-the-art methods with their strengths and then applies machine learning methods to address all of the issues above.

Results. Experimental results on publications of Genome-Wide Association Studies (GWAS) demonstrate that bioPDFX significantly improves the quality of the XML compared to a state-of-the-art PDF-to-XML converter, leading to a biomedical database more suitable for text mining.

Discussion. Overall, the pipeline developed in this paper makes published literature in PDF form much better suited for text mining tasks, while slightly improving overall text quality as well. The service is freely accessible at http://textmining.ucsd.edu:9000. A list of PubMed Central IDs of the 941 articles used in this study (see Supplemental File 1) is available for download at the same URL. Instructions on how to run the service with a PubMed ID are given in Supplemental File 2.

1. Introduction

PubMed Central (PMC) (2015e) contains the full text of about 4 million biomedical articles and is one of the most important freely available data sources for the biomedical natural language processing (NLP) research field. Although many new publications now provide plain text along with the Portable Document Format (PDF), a substantial number do not. However, text mining algorithms work more effectively on text-based formats such as plain text, XML, or HTML documents. Therefore, NLP researchers usually need to transcribe biomedical literature from PDF to text as a preprocessing step of the NLP pipeline.

However, transcribing PDF to text accurately is not trivial. The design goals of the PDF standard target ease of human readability rather than electronic consumption of the data by other software tools. That is, the PDF standard does not attempt to encode any semantic connection between characters in a word or between paragraphs; rather, characters are "painted" individually at specific 2-dimensional coordinates. Berg et al. outlined some challenges encountered in transcribing PDF documents to text accurately, such as varied reading orders, sectional formatting, and number of columns (Berg 2011). These issues create variability in the text quality of the generated outputs and impede accurate text mining.

Although a variety of tools are already available for PDF-to-text transcription (such as Apache PDFBox (2015b), Mozilla PDF.js (2015d), Adobe Acrobat SDK (2015a), Tesseract Optical Character Recognition (OCR) (Smith 2007), and PDFX (Constantin et al. 2013)), several challenges remain to be solved when converting biomedical literature PDFs to XML for text mining purposes:

1. Extracting titles and abstracts correctly. For manuscripts that have content other than the title or abstract on their first page, conversion tools like PDFX struggle to identify them correctly.

2. Identifying reference and acknowledgement sections in various formats. The format of references varies considerably among publications, ranging from a separate section towards the end to individual references in footnotes running alongside the body text. PDFX, for example, as a purely rule-based system, often falters in recognizing references when the common pattern of a headed reference section at the end is not present. Similar problems were noticed in identifying acknowledgements.

3. Recognizing special characters (such as '!' and '@'). PDFX and other PDF-to-text converters often fail to recognize special characters correctly, depending on how the PDF files were published.

4. Detecting and fixing jumbling errors (the wrong order of text when converting PDF to text). Most PDF conversion tools jumble the word order within or across sentences due to mistakes in reading order detection and segmentation. An example is given in Figure 1, where the exponents of numbers in a table are extracted as a separate column and hence are jumbled with respect to the actual text.

5. Correcting word boundaries. Words straddling a column boundary often have a hyphen inserted between them (e.g., "cancer" might become "can-" and "cer" if it is the last word of a row). Although this is not a conversion error per se, we would still like to correct it and merge the hyphenated parts into a single word wherever possible to support downstream NLP.

Figure 1. A kind of jumbling error in a table.
The table at the top is what is present in the PDF. The lower portion shows the transcribed XML. The exponents marked in the red box are incorrectly extracted as a separate column in the XML.

To address these issues, we propose bioPDFX, a novel tool that integrates current state-of-the-art PDF conversion methods to transcribe biomedical PDF articles to high-quality text in XML format. bioPDFX leverages the following four tools to transcribe the text of a given PDF: PDFX (Constantin et al. 2013), the PMC Entrez e-utilities API (Sayers et al. 2011), Tesseract OCR (Smith 2007), and Apache PDFBox (2015b). The input of bioPDFX is a biomedical literature PDF file and its corresponding PubMed ID; the output is the converted XML file. The overall processing pipeline of bioPDFX consists of the following five components, each addressing one of the challenges above:

1. Title and Abstract Correction. We use the PMC Entrez e-utilities API to retrieve the correct title and abstract from PubMed.

2. Reference and Acknowledgment Correction. To detect whether a paragraph is a reference, an acknowledgement, or neither, we designed two binary classifiers, one for references and the other for acknowledgements. For reference detection, we extracted features such as the density of years (e.g., "2016"), the density of "et al.", and the density of numbers followed by a dot (e.g., "12."). We normalized these features and trained a classifier to predict whether a paragraph is reference text. For acknowledgment detection, we used a classifier with features being TF-IDF followed by Latent Semantic Analysis (Deerwester et al. 1990).

3. Special Character Correction. The basic idea is to compare the results from PDFX with those from Tesseract OCR to identify suspicious characters (i.e., characters identified differently within the same n-gram), and then apply a Hidden Markov Model (HMM) (Baum & Petrie 1966) with the Viterbi inference algorithm (Viterbi 1967) and a language model to choose the "most probable" candidate character for each mismatched character by maximizing the overall likelihood of recovering the n-gram.

4. Jumbling Error Correction. We use the text extracted by Apache PDFBox to correct possible jumbling errors from PDFX. We first detect jumbled n-grams and then replace them with the non-jumbled versions from Apache PDFBox.

5. Word Boundary Correction. We adopt a simple dictionary lookup method, comparing the hyphenated parts (e.g., "can-" and "cer") against the English dictionary in the Enchant Spellchecking System (2015c) and merging them as required.

To evaluate bioPDFX, we randomly sampled 100 biomedical articles related to Genome-Wide Association Studies (GWAS) (Hindorff et al. 2009; Welter et al. 2014) from PMC and used the XML versions (i.e., NXMLs) of those 100 articles provided by PMC as gold standards. We compare bioPDFX with the state-of-the-art PDFX tool (Constantin et al. 2013). As a use case, we also compare the results of converting important information (p-values and numbers) and the overall text quality in the GWAS papers using bioPDFX and PDFX. We evaluate both the individual correction steps and the overall conversion results. The experimental results show significant improvement for all five types of correction tasks, especially for abstract, reference/acknowledgement, special character, and jumbling error corrections.

Our contributions in this study are three-fold:

• We identified five issues (title/abstract, reference/acknowledgment, special character, jumbling error, and word boundary) of state-of-the-art PDF-to-XML tools at the very first step of the biomedical NLP pipeline.

• We integrated the output of four popular tools (PDFX, PMC Entrez e-utilities API, Tesseract OCR, and Apache PDFBox) in bioPDFX and designed five corresponding correction steps to solve the abovementioned issues.

• We evaluated bioPDFX against PDFX for each correction step as well as on a real-world GWAS information extraction task, and the results demonstrate that bioPDFX improves on all five issues and increases extraction accuracy for GWAS literature.

Related Work. The rest of this section reviews related studies along with the tools that we use in our approach. PDF text extraction involves extracting all textual content from a PDF file in an order that makes sense even without the formatting of the PDF. A variety of tools are available that parse and convert PDF to text, such as Apache PDFBox (2015b), Mozilla PDF.js (2015d), and Adobe Acrobat SDK (2015a). Most text extractors use a page segmentation algorithm to determine the textual areas of the PDF based on the width of the whitespace, and from these boundaries also determine a reading order. A major problem with text-based extraction stems from the variability of PDF publishing software. Some tools embed a special character that is not included in a standard font by drawing vector graphics over it. Thus, although the character may look perfectly correct to a reader, the information is not present in the text layer of the PDF for text extractors to extract. In bioPDFX, we integrate text extraction from Apache PDFBox with other tools such as PDFX and the optical character recognition tool Tesseract OCR to overcome challenges like this.

XML transcription of a PDF file involves extracting text, tables, images, etc., and then tagging them with appropriate sections/regions to form an XML document. The exact format of the transcribed XML depends on the specific tool and might include tags for the title, abstract, introduction, tables, etc. Some tools like PDFX (Constantin et al. 2013) use a schema similar to the JATS/NLM document-type definition (DTD), a format specific to scientific articles, and are highly optimized around the organization of such PDF files. PDFX is a freely available rule-based system that converts scholarly PDF articles to XMLs with detailed sectional information such as title, abstract, references, tables, and acknowledgment. It derives transcription parameters from the relative font sizes, styles, and spacing of each individual PDF file. In our work, we integrate PDFX with other extraction tools to improve the extraction results for text such as the title, abstract, references, and acknowledgement.

Text post-processing is a step that improves the correctness of the generated text. Several tools, such as Tesseract OCR (Smith 2007), deploy techniques from Optical Character Recognition (OCR) to increase accuracy in the post-processing stage. Because most OCR engines classify each character independently of other characters, it is important to post-process the text using contextual information such as statistical language models or syntactic and semantic rules. One such method, suggested by Zhuang and Zhu (Zhuang & Zhu 2005), reduces the candidate search space for characters by using the candidate distance information from an OCR engine in conjunction with an n-gram language model and a semantic lexicon. In another method, proposed by Velagapudi (Velagapudi 1999), contextual information was extracted by modeling words/n-grams as sequences of characters, and a Hidden Markov Model (HMM) was then used to decide the best sequence, using the given OCR output as the observation. Transition probabilities from one character to another are calculated from a corpus, and emission probabilities are calculated from the output of the OCR engine on a corpus, to statistically capture the error patterns of the OCR engine. At the word level, Tong and Evans (Tong & Evans 1996) corrected OCR text by modeling the text as a sequence of bigrams and then using an HMM to compute the best word sequence. Their system learns character-level confusion probabilities for a specific OCR engine and uses them to achieve better performance. In our study, we follow a similar approach to post-process the output from PDFX using the Tesseract OCR engine.

In addition to the above methodologies, several systems extract metadata such as authors, year of publication, journal name, and bibliographic information from natural language text or PDFs directly. CERMINE (Tkaczyk et al. 2014) uses supervised and unsupervised machine learning techniques to extract parsed bibliographic references and metadata directly from PDFs. FLUX-CiM (Cortez et al. 2007) uses unsupervised, non-template methods to extract citation components from articles. Other PDF transcribers include LA-PDFText (Ramakrishnan et al. 2012). Although the abovementioned tools are available for PDF-to-text transcription, several issues remain, such as how to extract titles and abstracts correctly, identify reference and acknowledgement sections in various formats, recognize special characters correctly, detect and fix jumbling errors, and correct word boundaries. PDFJailbreak (Garcia et al. 2013) is a communal project to create an architecture and shared API to support open community development of semantic information extractors for biomedical literature in PDF form.

2. Materials and Methods

The goal of this study is to extract high-quality XMLs from biomedical literature PDFs to make them better suited for text mining tasks. Given a PDF file and its corresponding PubMed ID, our system integrates the following four tools to generate the output XML files:

• PDFX (Constantin et al. 2013) to transcribe the given PDF to XML
• PMC Entrez e-utilities API (Sayers et al. 2011) to retrieve text from PubMed
• Tesseract OCR (Smith 2007) to extract text from the PDF file via optical character recognition
• Apache PDFBox (2015b) to transcribe the text of the given PDF

As shown in Figure 2, there are five correction stages (title/abstract, reference/acknowledgment, special character, jumbling error, and word boundary corrections) in the bioPDFX pipeline. Each stage is a filter that accepts an XML as input and produces a corrected XML as output. We start from the XML generated by PDFX, correct the title and abstract using the PMC Entrez e-utilities API, identify reference and acknowledgement texts, exploit Tesseract OCR to fix special characters, utilize Apache PDFBox to correct jumbling errors, and recover word boundaries as the final stage. Note that the pipeline design of bioPDFX allows a different order of correction stages, as well as the addition of new correction stages. The details of each correction stage are discussed in the following subsections.
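To illustrate the filter design, the following is a minimal sketch of the pipeline as a chain of XML-to-XML functions. The stage names mirror Figure 2; the function bodies are placeholders rather than our implementation.

```python
# A minimal sketch of the bioPDFX filter-pipeline design. Each stage is a
# function that takes an XML string and returns a corrected XML string, so
# stages can be reordered or extended. The stage bodies are placeholders;
# only the composition pattern reflects the pipeline described above.
from typing import Callable, List

XmlFilter = Callable[[str], str]

def correct_title_abstract(xml: str) -> str:
    return xml  # placeholder: substitute title/abstract fetched via Entrez

def correct_references_acknowledgements(xml: str) -> str:
    return xml  # placeholder: classify paragraphs with the two classifiers

def correct_special_characters(xml: str) -> str:
    return xml  # placeholder: align with OCR text and run HMM/Viterbi

def correct_jumbling(xml: str) -> str:
    return xml  # placeholder: compare n-grams against Apache PDFBox output

def correct_word_boundaries(xml: str) -> str:
    return xml  # placeholder: dictionary lookup on hyphenated words

PIPELINE: List[XmlFilter] = [
    correct_title_abstract,
    correct_references_acknowledgements,
    correct_special_characters,
    correct_jumbling,
    correct_word_boundaries,
]

def run_pipeline(pdfx_xml: str) -> str:
    """Run every correction stage in order, starting from the PDFX XML."""
    for stage in PIPELINE:
        pdfx_xml = stage(pdfx_xml)
    return pdfx_xml
```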


Figure 2. A flow diagram showing the stages of correction and their inputs.
Stages are arranged in a pipelined fashion; each takes an XML as input and returns a corrected XML as output.

2.1 Title and Abstract Correction

The flow diagram illustrating this step is given in Figure 3. PDFX performs well in detecting the title and abstract of regular research articles, but for manuscripts that have content other than the title or abstract on their first page, PDFX struggles to identify them correctly. To overcome this, we used the PMC Entrez e-utilities API (Sayers et al. 2011) to retrieve the correct title and abstract (using the PubMed ID of the paper) and substituted them for whatever was present in the PDFX XML. We applied the PMC Entrez e-utilities API because our focus is the biomedical literature archived in PubMed Central.
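As an illustration of this step, the following is a minimal sketch of retrieving a title and abstract through the NCBI E-utilities efetch endpoint, assuming network access; the PubMed XML element names used (ArticleTitle, AbstractText) are standard, but error handling and rate limiting are omitted.

```python
# A sketch of the title/abstract retrieval step using the NCBI E-utilities
# efetch endpoint. The db/id/retmode parameters are standard E-utilities
# usage; the fetched fields then replace whatever PDFX put in the XML.
import requests
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_title_abstract(pubmed_id: str) -> tuple:
    resp = requests.get(EFETCH, params={
        "db": "pubmed", "id": pubmed_id, "retmode": "xml"})
    root = ET.fromstring(resp.content)
    title = root.findtext(".//ArticleTitle") or ""
    # Abstracts may be split into several labeled sections; join them.
    abstract = " ".join(
        (node.text or "") for node in root.iter("AbstractText"))
    return title, abstract
```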


Figure 3. A flow diagram showing the title and abstract correction.
The title and abstract correction pipeline, with the PDFX XML as input and the title/abstract-corrected XML as output.

2.2 Reference and Acknowledgement Correction

The format of references varies considerably among publications, ranging from a separate section towards the end to individual references in footnotes running alongside the body text. As PDFX is a purely rule-based system, it often falters in recognizing references when the common pattern of a headed reference section at the end is not present. Similar problems were noticed in identifying acknowledgements.

A key observation was that whenever PDFX identifies references or acknowledgements in a paper, it is generally correct; correction is needed only when it fails to identify them. We observed that the text content of references and acknowledgments shows clear patterns in language, and because we did not need to extract individual bibliographic items in the references section as PDFX does, classifying each section as reference/acknowledgment was enough. Note that the only requirement was to detect reference text, which is much coarser grained and simpler than complete metadata extraction from journals, or metadata extraction for each citation.

Although references and acknowledgments are usually clumped together, in this correction step we designed two separate classifiers, as the patterns they exhibit are different (a sketch follows this list):

(1) For reference detection, we used PDFX XMLs that had a PDFX "reference" tag to train the classifier (there are 451 such XMLs in our experiment), taking reference text as positive examples and the rest of the text from the paper as negative examples. Some of the most prominent features were the density of year-like numbers, the density of "et al." and similar strings, and the density of numbers followed by a dot. For each of these features, the ratio of the matches found to the total number of tokens in the text was used instead of the raw counts. We normalized these features and trained a Random Forest classifier (Breiman 2001) to predict whether a paragraph is reference text.

(2) For acknowledgment detection, we used PDFX XMLs that had a PDFX "acknowledgment" tag to train the acknowledgment classifier (there are 451 such XMLs in our experiment). We exploited another Random Forest classifier with features being TF-IDF followed by Latent Semantic Analysis (Deerwester et al. 1990) (Truncated Singular Value Decomposition (Hansen 1987; Kolda & O'leary 1998) in our experiment). The features we adopted are the TF-IDF of phrases such as "funded by" and "express gratitude".
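As a sketch of the reference classifier, the snippet below computes the three density features described above and trains a Random Forest with scikit-learn; the regular expressions and the two toy training paragraphs are illustrative assumptions, not our actual training data.

```python
# A sketch of the reference-detection classifier. The three density
# features (year-like numbers, "et al."-like strings, numbers followed by
# a dot), each normalized by token count, follow the description above;
# the regular expressions themselves are illustrative approximations.
import re
from sklearn.ensemble import RandomForestClassifier

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
ETAL = re.compile(r"\bet\.?\s?al\.?", re.IGNORECASE)
NUMDOT = re.compile(r"\b\d+\.")

def reference_features(paragraph: str) -> list:
    tokens = paragraph.split()
    n = max(len(tokens), 1)  # avoid division by zero on empty paragraphs
    return [
        len(YEAR.findall(paragraph)) / n,
        len(ETAL.findall(paragraph)) / n,
        len(NUMDOT.findall(paragraph)) / n,
    ]

# Hypothetical training data: paragraphs labeled 1 (reference) or 0
# (other), e.g. taken from PDFX XMLs that carry a "reference" tag.
paragraphs = ["Smith J, et al. Cancer studies. 2016. 12.", "We thank ..."]
labels = [1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit([reference_features(p) for p in paragraphs], labels)

def is_reference(paragraph: str) -> bool:
    return bool(clf.predict([reference_features(paragraph)])[0])
```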

2.3 Special Character Correction

PDFX and other PDF-to-text converters often fail to recognize special characters (such as '!' and '@') correctly, depending on how the PDF files were published. The basic idea of the special character correction step is that, given a possibly erroneous n-gram from PDFX at the token/word level, we find the corresponding n-gram in the Tesseract OCR output and align the two n-grams character by character. Then, for differing characters, we decide which character should be present at that position using a Hidden Markov Model (HMM) (Baum & Petrie 1966) with the Viterbi inference algorithm (Viterbi 1967) and language models. The flowchart for this pipeline is summarized in Figure 4.

Figure 4. A flow diagram for special character correction.
This pipeline shows the steps taken to correct special character errors in an n-gram.

2.3.1 Suspicious N-Gram and Character Identification

Before applying the HMM, we first select similar character-level n-grams (we treat each character, including the space, as a unigram) from the output of OCR and PDFX, align the n-grams, and then mark suspicious characters in them. To select the most similar n-gram from the OCR and PDFX text, we use the Levenshtein distance (Levenshtein 1966), normalized for length, to measure the similarity between two n-grams. The alignment of the two n-grams is done at the character level and is based on a widely used gene sequence alignment mechanism (Hsu et al. 2008). If the normalized similarity between the OCR n-gram and the PDFX n-gram is at least 0.6, the characters that need to be replaced are marked. This is done by looking at the characters that differ between the two n-grams after alignment and selecting those characters that are not aligned with whitespace or digits. These mismatched characters are then fed to an HMM to predict the most likely correct characters. The example in Figure 5 shows the OCR, PDFX, and gold standard n-grams. In this example, the mismatched characters in the OCR n-gram are "α", "u", and "3", while those in the PDFX n-gram are "a", "a", and "s".
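The following is a minimal sketch of this step: a dynamic-programming Levenshtein distance normalized by length, with the 0.6 threshold from above, and a simplified position-wise comparison standing in for the character-level sequence alignment used in our system.

```python
# A sketch of suspicious-character identification. Levenshtein distance is
# computed with plain dynamic programming and normalized by length; for
# simplicity the alignment step is replaced by a position-wise comparison
# of the (already aligned) n-grams. The 0.6 threshold is from the text.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b), 1)
    return 1.0 - levenshtein(a, b) / longest

def mark_mismatches(ocr: str, pdfx: str, threshold: float = 0.6) -> list:
    """Return positions of characters to correct in two aligned n-grams,
    skipping positions where either side is whitespace or a digit."""
    if similarity(ocr, pdfx) < threshold:
        return []  # not similar enough to be the same n-gram
    return [i for i, (x, y) in enumerate(zip(ocr, pdfx))
            if x != y and not (x.isspace() or y.isspace()
                               or x.isdigit() or y.isdigit())]
```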


Figure 5. An example showing three types of n-grams: OCR, PDFX, and gold standard.
The mismatched characters are shown in bold and underlined.

2.3.2 Prediction of Top-N Possible Correct N-Gram Lists for OCR and PDFX

After marking the suspicious characters in the n-grams from OCR and PDFX, we predict a list of the most probable correct n-grams for OCR and another for PDFX. Our solution is to train an HMM (Baum & Petrie 1966) on a training corpus of aligned OCR, PDFX, and gold standard texts, and infer the "most probable" n-gram with the Viterbi algorithm.

In formulating this problem as HMM Viterbi inference, one of our assumptions is that characters in the words of meaningful sentences follow the Markov assumption to a good extent, something already well proven by handwriting and speech recognition applications, which have used HMMs to achieve good performance. The other assumption is that OCR and PDFX have statistical patterns or biases in the character-level errors they make, and that these patterns can be learned from a training corpus. The same technique of modeling words as sequences of characters with an HMM has been used to boost OCR accuracy (Velagapudi 1999).

To state the problem formally, we first define a character $x_t$ as the $t$-th letter in an OCR n-gram. Each OCR n-gram can therefore be represented as $X = x_1 x_2 \ldots x_N$, a concatenation of $N$ characters. Note that whitespace (i.e., " ") is also treated as a character. In Figure 5, the number of characters in the OCR n-gram "the α-value wa3 less than 0.01" is 30.

Next, we compute an ambiguity dictionary, which records occurrences of observed mismatched characters in the corpus. Specifically, the ambiguity dictionary is a list of tuples of the form (observed character, actual character, frequency). For example, the tuple ("α", "p", 30) means that 30 times when we saw "α" in the OCR/PDFX text corpus, the actual character in the gold standard was "p". Multiple tuples for an observed character, like "α" in our example, can be aggregated and written as:


    "α": [("p", 150), ("α", 30), ("a", 5)],

which means that for "α" in an OCR/PDFX n-gram, the true character was "p" 150 times, "α" 30 times, and "a" 5 times.

Thus, each mismatched character $x_t$ (from the OCR n-gram) corresponds to a candidate-character set $\delta_t$, which can be computed from the ambiguity dictionary. For non-mismatched characters, the candidate-character set $\delta_t = \{x_t\}$ contains only the element $x_t$ itself. In our OCR n-gram example, the candidate set of the first character "t" is {"t"}, while that of the fifth character "α" might be {"α", "a", "p", "o", "0"}. Our problem can therefore be regarded as choosing the "most probable" candidate character for each mismatched character in the OCR n-gram so as to maximize the overall likelihood of recovering the gold standard n-gram.

In other words, we want to find $Y^* = y_1^* y_2^* \ldots y_N^* = \arg\max_Y P(Y \mid X)$, where $y_1^* y_2^* \ldots y_N^*$ are the "most probable" candidate characters given $X$, the pair of OCR and PDFX n-grams. We simplify the notation by using $P(Y)$ as the target probability that we want to maximize.

We apply a Hidden Markov Model (HMM) to solve this problem. Under the Markov assumption, the probability we want to maximize can be written as $P(Y) = P(y_1 y_2 \ldots y_N) = P(y_1) P(y_2 \mid y_1) P(y_3 \mid y_2) \ldots P(y_N \mid y_{N-1})$. Our HMM is defined as follows:

• Observation
$\Sigma = \{p_1, p_2, \ldots, p_M\}$, which is the set of all $M$ possible values for the observed characters.

• State
$Q = \{q_1, q_2, \ldots, q_K\}$, which is the set of all $K$ possible values for the candidate characters.

• Initialization Probabilities
$\Pi = \{\pi_i \mid \pi_i = P(y_1 = q_i)\}$, which is the set of initial probabilities for each state; therefore, $|\Pi| = K$. Suppose $L_1(q_i)$ is the probability given by a 1-gram language model (trained on the gold standard n-grams) for each candidate character $q_i$ (e.g., "α"); we compute $\pi_i$ as follows:

$\pi_i = L_1(q_i)$

Note that $\sum_{i=1}^{K} \pi_i = 1$.

• Transition Probabilities
$A = \{a_{ij} \mid a_{ij} = P(y_{t+1} = q_j \mid y_t = q_i)\}$, which is the set of probabilities of transiting from the $t$-th candidate character $y_t$ with state $q_i$ to the $(t+1)$-th candidate character $y_{t+1}$ with state $q_j$; therefore, $|A| = K^2$. Suppose $L_2(q_i, q_j)$ is the conditional probability given by a 2-gram language model (trained on the gold standard n-grams) for two consecutive candidate characters $q_i$ and $q_j$ (for example, "α-"); we compute $a_{ij}$ as follows:

$a_{ij} = L_2(q_i, q_j)$

Note that $\sum_{j=1}^{K} a_{ij} = 1$ for all states $i = 1, 2, \ldots, K$.

• Emission Probabilities
$B = \{b_{ik} \mid b_{ik} = P(x_t = p_k \mid y_t = q_i)\}$, which is the set of probabilities for the $t$-th candidate character $y_t$ with state $q_i$ to emit the $t$-th character $x_t$ with observation $p_k$; therefore, $|B| = K \cdot M$. Suppose $C(p_k, q_i)$ is the count of times a candidate character $q_i$ (e.g., "p" from a gold standard n-gram) is recognized as a character $p_k$ (e.g., "α" from an OCR n-gram), and $\lambda$ is a regularization constant; we compute $b_{ik}$ as follows:

$b_{ik} = \frac{C(p_k, q_i) + \lambda}{\sum_{k'=1}^{M} C(p_{k'}, q_i) + \lambda}$

Note that $\sum_{k=1}^{M} b_{ik} = 1$ for all states $i = 1, 2, \ldots, K$. In our example ambiguity dictionary for the character "α" = [("p", 150), ("α", 30), ("a", 5)], the emission probability from "p" to "α" (i.e., the probability that we observe "α" in the OCR n-gram when the true character in the gold standard n-gram is "p") can be estimated as $(30 + \lambda)/(150 + 30 + 5 + \lambda)$.

Finally, we apply the Viterbi algorithm with our HMM model to decode the top-N most probable sequences $Y^*$. In this way, we obtain the top-N most probable n-grams given an OCR n-gram as the observation.

We follow the same process for PDFX as well, that is, treating the PDFX n-gram as the observation and using the OCR n-gram to determine which characters are not in consensus. Similarly, the ambiguity dictionary for PDFX was derived by applying the same method, except that we record ambiguities for PDFX instead of OCR. Hence, we can compute another list of the top-N most probable PDFX n-grams along with their probabilities.
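To make the decoding step concrete, the following is a minimal pure-Python sketch of Viterbi inference over per-position candidate-character sets (our system uses StochHMM and KenLM; the dictionary-based pi, trans, and emit tables here are assumed to come from the 1-gram/2-gram language models and the ambiguity dictionary described above).

```python
# An illustrative pure-Python Viterbi decoder over per-position candidate
# sets (the actual system uses StochHMM and KenLM). pi, trans and emit are
# assumed to be probability dictionaries built from the language models
# and the ambiguity dictionary; 1e-12 stands in for unseen events.
import math

def viterbi(observed, candidates, pi, trans, emit):
    """observed: the n-gram characters x_1..x_N;
    candidates: one candidate-character set per position (delta_t);
    returns the most probable candidate sequence y_1..y_N."""
    # score[q] = best log probability of a path ending in state q
    score = {q: math.log(pi.get(q, 1e-12)) +
                math.log(emit.get((observed[0], q), 1e-12))
             for q in candidates[0]}
    back = [{}]
    for t in range(1, len(observed)):
        new_score, back_t = {}, {}
        for q in candidates[t]:
            prev_q, best = max(
                ((p, score[p] + math.log(trans.get((p, q), 1e-12)))
                 for p in candidates[t - 1]), key=lambda kv: kv[1])
            new_score[q] = best + math.log(emit.get((observed[t], q), 1e-12))
            back_t[q] = prev_q
        score, back = new_score, back + [back_t]
    # Trace the best path backwards from the most probable final state.
    state = max(score, key=score.get)
    path = [state]
    for back_t in reversed(back[1:]):
        state = back_t[state]
        path.append(state)
    return "".join(reversed(path))
```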

2.3.3 Combination of Top-N Possible Correct N-Gram Lists

Next, we select as the most probable n-gram the candidate common to both lists (with respect to the marked characters only) that has the maximum combined probability, following the idea of integrating gene mention tagging models (Hsu et al. 2008). This is illustrated in Figure 6.

Figure 6. An example showing two top-N lists of the most probable NXML n-grams with their log probabilities given by Viterbi inference.
Candidate number 5 in the left list (where the PDFX n-gram is the observation) matches candidate number 0 in the right list (where the OCR n-gram is the observation), and is the chosen best.

The PDFX n-gram in Figure 6 is "rs2252931 max P = 2.2610 29" and the OCR n-gram is "rs2252931 max P=2.2x10-9". The characters highlighted in green/blue are the characters marked for correction. We do not apply correction to the other characters, but instead keep whatever we observe in the PDFX n-gram, because the other mismatches are pairs of digits. The left list of 10 candidates comes from treating the PDFX n-gram as the observation, while the right list shows the candidates from treating the OCR n-gram as the observation. The negative numbers are the log probabilities of each sequence as given by the Viterbi algorithm.

Candidate number 5 in the left list matches candidate number 0 in the right list (for the highlighted characters), and this combination has the highest joint probability (although other combinations also match, such as 5 and 1, and 5 and 4). Thus, we choose this candidate as the most probable n-gram.

There can be cases where the two lists do not share a common candidate with respect to the marked characters. In that case, we use the n-gram language model trained on our corpus to obtain probabilities for each of the twenty candidates and select the most probable n-gram candidate. If there is still a tie among all twenty, we take the top n-gram candidate from the PDFX candidate list as the most probable n-gram. An example of this scenario is shown in Figure 7, where the two lists do not share a common candidate and the language model chooses the best candidate from the OCR candidate list, as it scores the highest.

Finally, in the post-processing step, we pick the n-gram with the highest score as the "most probable" n-gram, which is the output of the special character correction step.


Figure 7. An example showing two top-N lists of the most probable n-grams with their log probabilities given by Viterbi inference.
No pair of candidates in the two lists matches.

2.4 Jumbling Correction

The correction of jumbling errors (the wrong order of text when converting PDF to text, as illustrated in Figure 1) in the output XML from PDFX is done using the textual output generated by Apache PDFBox. We start by building a set of non-overlapping n-grams from the PDFX XML text and a set of overlapping n-grams from the Apache PDFBox text. Then, we use a simple exhaustive method to check whether an n-gram contains a jumbling error: if an overlapping n-gram is a permutation of a non-overlapping n-gram, we consider that n-gram to contain a jumbling error. Next, we replace the suspicious n-gram with the non-jumbled version from Apache PDFBox. Finally, we string the non-overlapping n-grams together with spaces to construct the jumbling-corrected text.
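A minimal sketch of this step is shown below, assuming whitespace tokenization; the permutation test compares sorted word multisets, n = 5 follows the experiment settings, and trailing words that do not fill a complete n-gram are dropped for simplicity.

```python
# A sketch of jumbling detection and correction. Non-overlapping 5-grams
# from the PDFX text are compared against overlapping 5-grams from the
# Apache PDFBox text; a PDFX 5-gram that is a word-level permutation of a
# PDFBox 5-gram (but not identical to it) is replaced by the PDFBox one.
def ngrams(words, n, overlapping):
    step = 1 if overlapping else n
    return [tuple(words[i:i + n]) for i in range(0, len(words) - n + 1, step)]

def correct_jumbling(pdfx_text: str, pdfbox_text: str, n: int = 5) -> str:
    pdfx = ngrams(pdfx_text.split(), n, overlapping=False)
    # Index PDFBox n-grams by their sorted word multiset for fast lookup.
    pdfbox = {}
    for gram in ngrams(pdfbox_text.split(), n, overlapping=True):
        pdfbox.setdefault(tuple(sorted(gram)), gram)
    corrected = []
    for gram in pdfx:
        match = pdfbox.get(tuple(sorted(gram)))
        if match is not None and match != gram:
            gram = match  # jumbled: use the PDFBox word order instead
        corrected.append(" ".join(gram))
    # Trailing words that do not fill a complete n-gram are dropped here.
    return " ".join(corrected)
```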

2.5 Word Boundary Correction

To correct word boundary errors (e.g., "cancer" might become "can-" and "cer" if it is the last word of a row, and should be combined and corrected to "cancer" instead of two words), we adopted a simple dictionary lookup method, comparing the parts (e.g., "can-" and "cer") against the English dictionary in the Enchant Spellchecking System (2015c) and merging them as required. Specifically, if the hyphenated word is not present in the English dictionary but the word without the hyphen is, we remove the hyphen. If neither the word with the hyphen nor the word without it is present in the dictionary, we keep it unchanged.
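The following is a minimal sketch of this rule using the pyenchant binding to Enchant, assuming an en_US dictionary is installed.

```python
# A sketch of word boundary correction with the Enchant spellchecker
# (via the pyenchant binding, assuming an en_US dictionary is installed).
import enchant

DICT = enchant.Dict("en_US")

def merge_hyphenated(first: str, second: str) -> str:
    """Given a line-final fragment like 'can-' and the following token
    'cer', decide whether to merge them into one word."""
    hyphenated = first + second            # e.g. "can-cer"
    merged = first.rstrip("-") + second    # e.g. "cancer"
    if not DICT.check(hyphenated) and DICT.check(merged):
        return merged     # drop the hyphen: it came from a line break
    return hyphenated     # otherwise keep the text unchanged

print(merge_hyphenated("can-", "cer"))  # -> "cancer"
```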


2.6 Experiment Settings

We conducted experiments to compare bioPDFX with the state-of-the-art PDFX tool (Constantin et al. 2013) and to test the following two hypotheses:

(a) For each of the individual correction steps, bioPDFX provides better quality of PDF-to-XML conversion.

(b) For the overall pipeline, bioPDFX also provides better conversion quality.

In our experiments, we set n = 5 for n-grams. For special character correction, we exploit StochHMM (Lott & Korf 2014) to calculate the top-N possible correct n-gram lists by Viterbi inference, and KenLM (Heafield 2011) to build language models. We set the regularization parameter λ to 10⁻⁶.

2.6.1 Dataset and Gold Standards

To test the two hypotheses, the National Human Genome Research Institute (NHGRI) provided 2,185 biomedical articles (Jain et al. 2016) related to Genome-Wide Association Studies (GWAS) (Hindorff et al. 2009; Welter et al. 2014). Among them, 941 are indexed by PubMed Central (PMC), where both PDF and XML versions were available for download when we searched in 2015. From these 941 articles, we randomly sampled 100 pairs of PDF and XML versions (i.e., NXMLs) as gold standards for four of our correction tasks: title and abstract, special characters, jumbling, and word boundary. For reference and acknowledgement, we built our own gold standard from the PDF files directly rather than using those present in the NXMLs, because the reference sections in most NXMLs were incomplete and the acknowledgement sections were omitted entirely. For 70 randomly selected PubMed articles, we manually compiled reference and acknowledgement text from their PDFs to act as the gold standard. For hypothesis (b), we further manually labeled the p-values and numbers (which are important for GWAS literature) as our gold standard. We randomly split the data into 50% training and 50% test.

2.6.2 Evaluation Metrics

We developed a scoring metric that measures how a corrected XML from our system performs with respect to the corresponding NXML. This metric uses an n-gram-based similarity method common in plagiarism detection. The idea is to make two sets of n-grams, one from the NXML and one from our system's output, and use them to measure the occurrence and correct order of words. We then compute the F1-score between the two n-gram sets as our evaluation metric for title and abstract, reference and acknowledgement, special characters, and word boundary. We applied the macro F1-score for most of the correction tasks except special characters and word boundary (for which the micro F1-score is computed), because the number of special characters or words with incorrect boundaries in an article can vary considerably. For jumbling errors, we cannot apply the n-gram similarity method directly, as the errors may appear in tables (as shown in Figure 1). Therefore, we compute the average count of such errors over all articles in the dataset as our evaluation metric. For hypothesis (b), we further compute the micro F1-score for p-values and numbers (for the same reason as special characters and word boundary), and the macro F1-score for the overall text quality (which is the n-gram similarity score for the whole document).
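As an illustration, the following is a minimal sketch of the n-gram F1 computation on word 5-grams, assuming both documents are plain strings; it uses n-gram sets, so it measures occurrence and local word order but ignores duplicate n-grams.

```python
# A sketch of the n-gram F1 metric used for evaluation, comparing the
# NXML gold standard against the system output on word 5-grams (n = 5
# follows the experiment settings above).
def ngram_set(text: str, n: int = 5) -> set:
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_f1(gold: str, system: str, n: int = 5) -> float:
    gold_set, sys_set = ngram_set(gold, n), ngram_set(system, n)
    overlap = len(gold_set & sys_set)
    if not gold_set or not sys_set or not overlap:
        return 0.0
    precision = overlap / len(sys_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```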


3. Results

The results for each of the five correction steps are shown in Table 1; each step is evaluated independently. As Table 1 shows, bioPDFX outperforms PDFX in all correction steps, especially for the abstract, reference/acknowledgement, special character, and jumbling error corrections.

Table 2, on the other hand, shows the scores at the end of the pipeline, where the corrections are run in the order shown in Figure 2. As with the individual correction steps, bioPDFX in general performs better. The additional evaluation tasks (p-value, number, and overall text quality) are also included in Table 2. Note that although no correction step is aimed specifically at improving p-values, numbers, or overall text quality, bioPDFX was still able to provide higher conversion quality for these three tasks.

Table 1. Stage-wise quality score comparison of raw and corrected XMLs.

Correction Step            PDFX     bioPDFX   Metric
Title                      0.9135   0.9763    Macro F1
Abstract                   0.5428   0.8920    Macro F1
Reference/Acknowledgment   0.6496   0.8026    Macro F1
Special Character          0.7071   0.8860    Micro F1
Jumbling                   2.89     1.92      Average Error
Word-Boundary              0.7619   0.8053    Micro F1

The jumbling score is the average number of jumbling errors detected per paper (lower is better).

Table 2. Final quality score comparison of raw and corrected XMLs at the end of the correction pipeline.

Correction Step            PDFX     bioPDFX   Metric
Title                      0.9135   0.9695    Macro F1
Abstract                   0.5428   0.8877    Macro F1
Reference/Acknowledgment   0.6496   0.8026    Macro F1
Special Character          0.7071   0.8932    Micro F1
Jumbling                   2.89     1.95      Average Error
Word-Boundary              0.7619   0.8189    Micro F1
P-value                    0.5465   0.6615    Micro F1
Number                     0.8733   0.8853    Micro F1
Overall Text Quality       0.9006   0.9107    Micro F1

The jumbling score is the average number of jumbling errors detected per paper (lower is better).

4. Discussion

The experimental results provide evidence supporting both hypotheses. Below, we discuss the results for each correction step in detail:

• Title and Abstract. The correction scores for title and abstract are already quite good for the raw XMLs and increase substantially in the final corrected XMLs. This is expected behavior, as we retrieve titles and abstracts directly through the PMC Entrez e-utilities API, which can be thought of as a gold standard itself. The F1-scores are still not perfect due to syntactic differences between the title/abstract text present in the NXMLs and that retrieved from the e-utilities API. For example, the API does not seem to use Unicode characters like "≤", "≥", and "∼", and instead uses the phrases "less than or equal", "greater than or equal", and "approximately" in their place. Furthermore, there were cases where the data from the NXMLs was incorrect because of errors by the papers' authors, such as missing titles/abstracts and similar inconsistencies in punctuation. Apart from this, we can consider the titles and abstracts in the corrected XMLs to be near perfect (assuming the e-utilities API gives correct results). Some other errors are introduced by the accumulation of errors from earlier stages of the correction pipeline, but as a comparison of the stage-wise and end-of-pipeline scores in Table 1 and Table 2 shows, they are quite small and can be neglected.

• Reference and Acknowledgement. Reference/acknowledgment scores also improved considerably and led to better scores in the remaining stages of the pipeline by removing a significant number of false positives, as can be observed by comparing the stage-wise results in Table 1 with the final results in Table 2.

• Special Character. We improved the special-character F1-score by a considerable margin through our technique of combining HMMs, n-gram alignment, OCR, and language models (a toy decoding sketch follows this list).

• Jumbling Error. The average count of jumbling errors is lower for bioPDFX than for PDFX. It should be noted that there is no guarantee that jumbling errors occur only over 5-grams, or equivalently within 5 words; in the same document, there may be jumbling errors spanning arbitrarily large and varied n. A better approach might be to start with a much larger n, attempt to detect jumbling there, and progressively reduce n one by one; however, this approach is computationally intensive. Even operating on 5-grams, the jumbling error correction step takes a significant amount of time compared to the other correction steps (a sketch of why follows this list).

• Word Boundary. The F1-score of word-boundary correction using bioPDFX is also modestly improved compared to PDFX (the dictionary checks are sketched after this list). We also attempted to use a language model (along with the English dictionary) to increase the range of corrections possible for biomedical terms, but it did not give us any performance improvement. We believe this is because word boundary errors are infrequent relative to the total number of words in an article, and because most words in an article are English words or numbers, most word boundary errors occur in English words, which are already corrected by the English dictionary.
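
To make the title/abstract retrieval concrete, the following is a minimal sketch (in Python) of fetching both fields through the public NCBI Entrez e-utilities efetch endpoint. It is an illustration of the API call rather than our production code; the element names are those of the standard PubMed XML, and error handling and rate limiting are omitted.

    # Minimal sketch of title/abstract retrieval via NCBI Entrez e-utilities.
    # Illustrative only; not the exact code used in the bioPDFX pipeline.
    import requests
    import xml.etree.ElementTree as ET

    EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

    def fetch_title_abstract(pmid):
        resp = requests.get(EFETCH, params={"db": "pubmed", "id": pmid,
                                            "retmode": "xml"})
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        title = root.findtext(".//ArticleTitle")
        # AbstractText may be split into labeled sections; join them.
        abstract = " ".join(t.text or "" for t in root.iter("AbstractText"))
        return title, abstract

Note that, as discussed above, the returned text may spell out symbols such as "≤" as "less than or equal", which is one source of the residual F1 gap.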
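The noisy-channel intuition behind the special-character step can be illustrated with a toy Viterbi decoder: hidden true characters, a character language model for transitions, and a confusion model for emissions. This is a simplified illustration of the decoding idea, not the exact model used in the pipeline; the probability tables are assumed to be estimated elsewhere (e.g., from aligned text sources).

    # Toy Viterbi decoding for character correction (illustrative sketch).
    # log_start[c], log_trans[p][c], and log_emit[c][o] are assumed
    # log-probabilities over an alphabet of characters.
    def viterbi(observed, alphabet, log_start, log_trans, log_emit):
        # Forward pass: best score for each hidden character at each position.
        V = [{c: log_start[c] + log_emit[c][observed[0]] for c in alphabet}]
        back = []
        for obs in observed[1:]:
            col, ptr = {}, {}
            for c in alphabet:
                prev = max(alphabet, key=lambda p: V[-1][p] + log_trans[p][c])
                col[c] = V[-1][prev] + log_trans[prev][c] + log_emit[c][obs]
                ptr[c] = prev
            V.append(col)
            back.append(ptr)
        # Backtrace from the best final state.
        state = max(alphabet, key=lambda c: V[-1][c])
        path = [state]
        for ptr in reversed(back):
            state = ptr[state]
            path.append(state)
        return "".join(reversed(path))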
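To see why the jumbling step is expensive, consider a naive detector that flags a 5-gram as jumbled when some reordering of its words scores much higher under a language model than the observed order. This is a sketch of the cost, not our exact detector; lm_logprob is an assumed scoring function, and the 5! = 120 permutations per window explain the running time noted above.

    # Naive jumbling detection over 5-grams (illustrative sketch).
    # lm_logprob is an assumed function mapping a token sequence to a
    # log-probability under some language model.
    from itertools import permutations

    def looks_jumbled(five_gram, lm_logprob, margin=5.0):
        observed = lm_logprob(five_gram)
        # 120 permutations per window; cost grows factorially with n.
        best = max(lm_logprob(list(p)) for p in permutations(five_gram))
        return best - observed > margin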
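The dictionary checks behind word-boundary repair amount to two simple tests, sketched below with a small stand-in lexicon (a real run would load a full English dictionary): merge two adjacent tokens when neither is a word but their concatenation is, and split a non-word token at a point that yields two dictionary words.

    # Dictionary-based word-boundary repair (illustrative sketch).
    DICT = {"protein", "the", "gene", "expression"}  # stand-in lexicon

    def merge_if_split(w1, w2):
        # e.g. "pro" + "tein" -> "protein" when the parts are not words.
        joined = w1 + w2
        if joined in DICT and not (w1 in DICT and w2 in DICT):
            return joined
        return None

    def split_if_merged(w):
        # e.g. "thegene" -> ("the", "gene"): first split giving two words.
        if w in DICT:
            return None
        for i in range(1, len(w)):
            if w[:i] in DICT and w[i:] in DICT:
                return w[:i], w[i:]
        return None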

Also, the usability of bioPDFX could be greatly increased by a multiple-file upload feature, so that users can submit many PDF files of interest for conversion at once instead of uploading them one by one. We are currently implementing an interface for such a feature (Figure 8) as part of a biocuration tool, to maximize the utility of bioPDFX.


Figure 8. The interface for multiple-file uploading.

The usability of bioPDFX can be greatly increased by integrating it within a biocuration tool.


5. Conclusions

We have developed a tool to convert PDF files into XML format specifically for biomedical articles, targeting the five challenges faced by state-of-the-art converters: title/abstract, reference/acknowledgement, special characters, jumbling errors, and word boundaries. Overall, the pipeline developed in this paper makes the published literature in the form of PDF files much better suited for text mining tasks, while slightly improving the overall text quality as well. Although the tool discussed in this paper is trained and optimized specifically for our target corpus (GWAS), we believe that these techniques can be applied to other kinds of literature as well.

It should be noted that although we use PDFX XMLs as the main text source, and Apache PDFBox and OCR text as auxiliary text sources, bioPDFX is general and is not dependent on these particular sources. Given a primary text source in which to correct substitution errors, other secondary text sources can also be exploited in the correction steps.

There are some limitations to the approach, but we believe that these can be readily overcome in different domains and by investigating several different text sources. These limitations are summarized below:


• Title and abstract corrections depend on the PMC Entrez e-utilities API.
• Because we model n-grams as sequences of characters and due to the Markov assumption, we cannot correct errors in which one character is replaced by multiple characters, or is deleted.
• Jumbling error correction is only an approximation, since we fix n = 5.


In the future, we plan to process each of the text sources in parallel to further improve overall performance; our initial results in this direction were reported in (Goyal et al. 2016). We also plan to explore more challenging issues in the PDF-to-XML conversion process. Finally, we plan to extend our experiments to more diverse types of large-scale corpora.


Acknowledgements

We would like to thank Dr. Lucia Hindorff of NHGRI for her generous support of this project; by providing us with PDF copies of the test articles and the curation guidelines of the GWAS Catalog, she made this research possible. We would also like to thank Dr. Helen Parkinson, Dr. Jackie MacArthur, and their team members at EBI for their assistance.


References

2015a. Adobe Acrobat SDK. Available at http://www.adobe.com/devnet/acrobat/overview.html.

2015b. Apache PDFBox. Available at https://pdfbox.apache.org/index.html.

2015c. Enchant Spellchecking System.

2015d. Mozilla PDF.js. Available at https://mozilla.github.io/pdf.js/.

2015e. National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), PubMed Central (PMC) Open Access Subset. Available at http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/.

Baum LE, and Petrie T. 1966. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37:1554-1563.

Berg ØR. 2011. High precision text extraction from PDF documents.

Breiman L. 2001. Random forests. Machine Learning 45:5-32.

Constantin A, Pettifer S, and Voronkov A. 2013. PDFX: fully-automated PDF-to-XML conversion of scientific literature. Proceedings of the 2013 ACM Symposium on Document Engineering. Florence, Italy: ACM. p 177-180.

Cortez E, da Silva AS, Gonçalves MA, Mesquita F, and de Moura ES. 2007. FLUX-CIM: flexible unsupervised extraction of citation metadata. Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries: ACM. p 215-224.

Deerwester S, Dumais ST, Furnas GW, Landauer TK, and Harshman R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41:391-407. doi: 10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9

Garcia A, Murray-Rust P, Burns G, Stevens R, Tkaczyk D, McLaughlin C, Belin A, Di Iorio A, García L, and Gruson-Daniel C. 2013. PDFJailbreak: a communal architecture for making biomedical PDFs semantic. Proceedings of BioLINK SIG 2013:63.

Goyal A, Singh A, Bhargava S, Crawl D, Altintas I, and Hsu C-N. 2016. Natural language processing using Kepler workflow system: first steps. Procedia Computer Science 80:712-721.

Hansen PC. 1987. The truncated SVD as a method for regularization. BIT Numerical Mathematics 27:534-553.

Hindorff L, Sethupathy P, Junkins H, Ramos E, Mehta J, Collins F, and Manolio T. 2009. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences 106:9362-9367. doi: 10.1073/pnas.0903103106

Hsu C-N, Chang Y-M, Kuo C-J, Lin Y-S, Huang H-S, and Chung IF. 2008. Integrating high dimensional bi-directional parsing models for gene mention tagging. Bioinformatics 24:i286-i294.

Jain S, Tumkur K, Kuo T-T, Bhargava S, Lin G, and Hsu C-N. 2016. Weakly supervised learning of biomedical information extraction from curated data. BMC Bioinformatics.

Kolda TG, and O'Leary DP. 1998. A semidiscrete matrix decomposition for latent semantic indexing information retrieval. ACM Transactions on Information Systems (TOIS) 16:322-346.

Levenshtein VI. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady. p 707.

Ramakrishnan C, Patnia A, Hovy E, and Burns GA. 2012. Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine 7:1.

Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, and Federhen S. 2011. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 39:D38-D51.

Smith R. 2007. An overview of the Tesseract OCR engine.

Tkaczyk D, Szostek P, Dendek PJ, Fedoryszak M, and Bolikowski L. 2014. CERMINE: automatic extraction of metadata and references from scientific literature. Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on: IEEE. p 217-221.

Tong X, and Evans DA. 1996. A statistical approach to automatic OCR error correction in context. Proceedings of the Fourth Workshop on Very Large Corpora. p 88-100.

Velagapudi P. 1999. Using HMMs to boost accuracy in optical character recognition. Proceedings of SPIE, 27th AIPR Workshop: Advances in Computer-Assisted Recognition: Citeseer. p 96-104.

Viterbi A. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13:260-269.

Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, and Parkinson H. 2014. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research 42:D1001-D1006. doi: 10.1093/nar/gkt1229

Zhuang L, and Zhu X. 2005. An OCR post-processing approach based on multi-knowledge. International Conference on Knowledge-Based and Intelligent Information and Engineering Systems: Springer. p 346-352.
