7/31/2019 Http Www.basistech.com Knowledge-center Forensics Extracting-text-from-Arabic-PDF
Extracting Searchable Text from Arabic PDFs
Brian Carrier, Ph.D.
Director of Digital Forensics
Basis Technology Corp.
Motivation
Need to get the text out of files before they can be
indexed and searched.
Arabic PDF files can be challenging.
PDF Basics
Raw file contents are organized into objects.
Each object stores a specific type of info:
Document (Root) object
Page objects
Font objects
Basic structure of file is viewable text:
[]
7 0 obj
endobj
[]
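Because the object markers are plain text, a first look at a file's structure can be had with a simple scan. The following is a minimal sketch only: the function name list_objects and the sample bytes are made up for illustration, and a real parser would rely on the cross-reference table and handle compressed object streams, which this scan ignores.

```python
import re

# Matches "<number> <generation> obj ... endobj"; DOTALL lets the body span lines.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_objects(raw: bytes):
    """Return (object number, generation, body bytes) for each object found."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in OBJ_RE.finditer(raw)
    ]


# Hypothetical, heavily simplified file contents used only for this example.
sample = (
    b"%PDF-1.4\n"
    b"7 0 obj\n<< /Type /Page >>\nendobj\n"
    b"8 0 obj\n<< /Type /Font >>\nendobj\n"
)

for number, generation, body in list_objects(sample):
    print(number, generation, body)
```

A regular expression is used here only because it keeps the example short; binary stream data that happens to contain the word endobj would confuse it.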
PDF Text
Text is stored in chunks of one or more characters.
Each chunk is located at a given X,Y coordinate
Chunks can be stored in any order in the file
PDF Fonts and Encodings
PDF fonts typically store only the glyphs that are used.
Text chunk stores an index into a PDF font object.
Font object may map glyph to a Unicode value.
Rendering Difference
Displaying a PDF requires the PDF Engine to map fonts
Note that standard encoding values are not required.
Basic Extraction Approach
1. Parse PDF file to identify page content objects
2. Parse page content stream into text chunks
3. Sort text chunks based on coordinates
4. Process chunks in order:
   1. Get index for each character
   2. Use font information to map index to Unicode (if defined)
   3. Add Unicode value to end of string
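Steps 3 and 4 can be sketched as follows, assuming steps 1 and 2 have already produced the chunks. The TextChunk type, the font_maps dictionary, and the choice of U+FFFD for unmapped indexes are all assumptions made for this example, not part of any PDF library.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    x: float        # horizontal position of the chunk
    y: float        # vertical position (default PDF origin is bottom-left)
    font: str       # name of the font object the indexes refer to
    indexes: list   # glyph indexes into that font


def extract_text(chunks, font_maps):
    """Sort chunks by position, then map each glyph index to Unicode."""
    # Step 3: top of the page first (larger y), then left to right.
    ordered = sorted(chunks, key=lambda c: (-c.y, c.x))
    out = []
    for chunk in ordered:
        mapping = font_maps.get(chunk.font, {})
        for index in chunk.indexes:
            # Step 4.2: use the font's mapping if defined; otherwise emit
            # U+FFFD (an assumption of this sketch) so the gap stays visible.
            out.append(mapping.get(index, "\ufffd"))
    return "".join(out)


# Hypothetical font with three glyphs, and two chunks stored out of order.
font_maps = {"F1": {1: "H", 2: "i", 3: "!"}}
chunks = [
    TextChunk(x=30, y=700, font="F1", indexes=[3]),
    TextChunk(x=10, y=700, font="F1", indexes=[1, 2]),
]
print(extract_text(chunks, font_maps))  # prints "Hi!"
```

Sorting on exact Y values is a simplification; chunks on the same visual line often differ slightly in Y and would need to be grouped with a tolerance.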
English Extraction Example
Arabic Glyphs
Arabic characters have different shapes depending on
their location in a word.
Each shape is a different glyph in a font.
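Unicode happens to encode these positional shapes as separate presentation-form code points, which makes them easy to inspect. The sketch below lists the four forms of the letter Beh using Python's standard unicodedata module; the helper name beh_forms is made up for this example.

```python
import unicodedata

# The four positional shapes of Beh in the Arabic Presentation Forms-B block.
BEH_FORMS = (0xFE8F, 0xFE90, 0xFE91, 0xFE92)


def beh_forms():
    """Return (code point label, Unicode character name) for each shape."""
    return [(f"U+{cp:04X}", unicodedata.name(chr(cp))) for cp in BEH_FORMS]


for label, name in beh_forms():
    print(label, name)
```

A PDF font is free to assign its own indexes to these shapes, so the code points here only illustrate the idea of one letter having several glyphs.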
Arabic Extraction Example
Logical and Presentation Orders
Text in computers is typically stored in logical order
First character stored is first character read or written
Presentation order is based on screen layout
Orders are same for Left to Right (LTR) Languages:
Opposite for Right to Left (RTL) Languages:
Possible Order Solution
PDF stores data in presentation (display) order.
Text editors need the text in logical order though.
Need to convert from presentation to logical order.
Obvious solution:
After decoding each line, reverse the order of the Arabic text:
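The obvious solution amounts to a plain string reversal. This sketch assumes the whole line is Arabic; the function name reverse_line and the sample word are chosen only for illustration.

```python
def reverse_line(presentation: str) -> str:
    """Reverse a decoded line from presentation order to logical order."""
    return presentation[::-1]


# The word "salam" is Seen, Lam, Alef, Meem in logical order. Read left to
# right off the page (presentation order) the letters come out reversed.
presentation = "\u0645\u0627\u0644\u0633"  # Meem, Alef, Lam, Seen
logical = reverse_line(presentation)
print([f"U+{ord(ch):04X}" for ch in logical])
# prints ['U+0633', 'U+0644', 'U+0627', 'U+0645']
```

The reversal is only correct when every character on the line runs right to left.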
Bi-directional Text
How should the following be logically stored?
Bi-directional Text
Text can have both RTL and LTR characters and each
should go in the correct direction
Unicode Bi-directional Text (BiDi) algorithm defines
how to order characters in a paragraph based on:
Dominant direction of text in paragraph
Direction of each character in text
Punctuation and neighboring characters
Implicit direction markers
BiDi lets you convert from logical to presentation
order.
Reverse Bi-directional Algorithm
We need Reverse BiDi to convert from presentation to
logical order.
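A heavily simplified version of the idea, for a line whose dominant direction is RTL, is to reverse the whole line and then restore each left-to-right run (Latin words, numbers) to its original order. The sketch below is not the Unicode BiDi algorithm and not the implementation described in these slides: it ignores embedding levels, mirrored punctuation, and explicit markers, and the function names are invented for this example. It uses only unicodedata.bidirectional from the Python standard library.

```python
import unicodedata

LTR_CLASSES = {"L", "EN", "AN"}   # letters and numbers displayed left to right
RTL_CLASSES = {"R", "AL"}         # Hebrew-type and Arabic-type letters


def is_ltr(ch: str) -> bool:
    return unicodedata.bidirectional(ch) in LTR_CLASSES


def is_rtl(ch: str) -> bool:
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def visual_to_logical_rtl(visual: str) -> str:
    """Convert an RTL-dominant line from presentation to logical order."""
    chars = list(visual[::-1])  # reversing fixes the RTL parts
    n = len(chars)
    i = 0
    while i < n:
        if is_ltr(chars[i]):
            # Extend the run up to the last LTR character before the next
            # RTL character, so spaces between Latin words stay inside it.
            last = i
            j = i
            while j < n and not is_rtl(chars[j]):
                if is_ltr(chars[j]):
                    last = j
                j += 1
            chars[i:last + 1] = chars[i:last + 1][::-1]  # undo the reversal
            i = last + 1
        else:
            i += 1
    return "".join(chars)


# Presentation order: "PDF" sits to the left of the Arabic word on the page.
visual = "PDF \u0645\u0627\u0644\u0633"
logical = visual_to_logical_rtl(visual)
print([f"U+{ord(ch):04X}" for ch in logical])
```

In the example the Arabic word comes first in logical order and "PDF" keeps its left-to-right spelling, which a plain reversal would have turned into "FDP".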
Updated Extraction Approach
1. Parse PDF file to identify page content objects
2. Parse page content stream into text chunks
3. Sort text chunks based on coordinates
4. Determine dominant text direction
5. Process chunks in order and by line:
   1. Get index for each character
   2. Use font information to map index to Unicode
   3. Add Unicode value to end of presentation order string
   4. Apply reverse BiDi algorithm to presentation order string
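One way to approach step 4 is to count strongly directional characters and pick the majority. This is an assumption of the sketch, not necessarily how the slides' implementation decides; the Unicode BiDi algorithm itself defaults to the first strong character in a paragraph. The function name dominant_direction is hypothetical.

```python
import unicodedata


def dominant_direction(text: str) -> str:
    """Return "RTL" if right-to-left letters outnumber left-to-right ones."""
    rtl = 0
    ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic-type letters
            rtl += 1
        elif bidi_class == "L":         # Latin and other LTR letters
            ltr += 1
        # Digits, spaces, and punctuation are not counted.
    return "RTL" if rtl > ltr else "LTR"


print(dominant_direction("\u0633\u0644\u0627\u0645 PDF"))  # prints "RTL"
print(dominant_direction("Hello \u0633"))                   # prints "LTR"
```

Counting over a whole page rather than a single line makes the decision less sensitive to a line that happens to contain mostly Latin text.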
Presentation Forms / Ligatures
Encodings typically define only the general form of
Arabic characters.
Unicode is an exception.
The OS determines which glyph form to use (initial,
medial, etc.) based on the context of the character.
PDF stores the specific form of each Arabic character.
Unicode presentation forms should not be used in a
string and many tools cannot process them.
Need to normalize text from presentation to general
forms
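Unicode compatibility normalization is one readily available way to do this. The sketch below uses NFKC from Python's standard unicodedata module, which is a substitute chosen for illustration rather than the method used in the slides. NFKC also rewrites unrelated compatibility characters (full-width Latin letters, for example), so a production tool might restrict it to the Arabic presentation-form blocks.

```python
import unicodedata


def to_general_forms(text: str) -> str:
    """Replace presentation forms and ligatures with general-form letters."""
    return unicodedata.normalize("NFKC", text)


beh_initial = "\ufe91"  # ARABIC LETTER BEH INITIAL FORM
lam_alef = "\ufefb"     # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

for sample in (beh_initial, lam_alef):
    normalized = to_general_forms(sample)
    print([f"U+{ord(ch):04X}" for ch in normalized])
# prints ['U+0628'] and then ['U+0644', 'U+0627']
```

The ligature expands into two letters, so string lengths and character offsets change after normalization.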
Arabic Extraction Example 2
Font-specific Ligature Implementations
U+FDF2 is the Unicode Arabic ligature for Allah (ﷲ).
The single ligature represents four characters: Alef, Lam, Lam, Heh.
Some fonts implement the ligature differently:
The U+FDF2 glyph draws only Lam, Lam, Heh
They add a separate Alef before the ligature:
Alef (U+0627) Allah (U+FDF2)
Decomposing this using the Unicode specs yields an extra Alef:
Alef Alef Lam Lam Heh
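The doubled Alef can be reproduced and worked around as follows. This is a sketch under two assumptions: the text is already in logical order, and the caller knows (through the hypothetical font_adds_alef flag) that the font in question draws the ligature without its Alef.

```python
import unicodedata

ALEF = "\u0627"
ALLAH_LIGATURE = "\ufdf2"


def decompose_allah(text: str, font_adds_alef: bool) -> str:
    """Decompose the Allah ligature without doubling the leading Alef."""
    if font_adds_alef:
        # The font supplied its own Alef, and the Unicode decomposition
        # supplies another, so drop the one that precedes the ligature.
        text = text.replace(ALEF + ALLAH_LIGATURE, ALLAH_LIGATURE)
    return unicodedata.normalize("NFKC", text)


extracted = ALEF + ALLAH_LIGATURE
naive = unicodedata.normalize("NFKC", extracted)
fixed = decompose_allah(extracted, font_adds_alef=True)
print(len(naive), len(fixed))  # prints "5 4"
```

The flag matters because removing the Alef for a font that implements the ligature fully would delete a legitimate character.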
Diacritic Placement
Vocalizations and diacritics can be separate glyphs
With Unicode:
Diacritics are stored after the base character in logical order
Diacritics are placed over the base character when rendered on screen
With PDF:
Diacritics are stored in separate text chunks
Coordinates cause them to overlap
Diacritic chunk can be before or after the chunk it modifies
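A possible way to merge such chunks is to attach each diacritic to the base character whose horizontal center is nearest and emit it right after that base. The sketch below assumes one character per glyph record, base characters already in logical order, and that diacritics can be recognized by the Unicode category Mn; the Glyph type and function names are invented for this example.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float       # left edge of the glyph
    width: float


def is_diacritic(ch: str) -> bool:
    # Nonspacing marks (category Mn) include the Arabic vowel signs.
    return unicodedata.category(ch) == "Mn"


def merge_diacritics(glyphs) -> str:
    """Place each diacritic directly after the base it overlaps."""
    bases = [g for g in glyphs if not is_diacritic(g.char)]
    marks = [g for g in glyphs if is_diacritic(g.char)]
    if not bases:
        return "".join(m.char for m in marks)
    attached = {id(b): [] for b in bases}
    for mark in marks:
        center = mark.x + mark.width / 2
        # Pick the base whose horizontal center is closest to the mark.
        owner = min(bases, key=lambda b: abs(b.x + b.width / 2 - center))
        attached[id(owner)].append(mark.char)
    return "".join(b.char + "".join(attached[id(b)]) for b in bases)


# Fatha is stored before the Beh it sits on; Alef follows to the left.
glyphs = [
    Glyph("\u064e", x=11, width=4),  # ARABIC FATHA
    Glyph("\u0628", x=10, width=6),  # ARABIC LETTER BEH
    Glyph("\u0627", x=4, width=3),   # ARABIC LETTER ALEF
]
merged = merge_diacritics(glyphs)
print([f"U+{ord(ch):04X}" for ch in merged])
# prints ['U+0628', 'U+064E', 'U+0627']
```

Only the X coordinate is compared here; on a page with several lines the Y coordinate would also have to match.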
Diacritic Insertion
Spacing Estimation
Spaces and newlines are not explicitly stored.
Spacing is achieved by direct placement of text.
Extraction requires guessing where spaces and newlines
should exist.
Is this text chunk's X-value further away than we expected?
Is this text chunk's Y-value further away than we expected?
Spacing estimation can be done by keeping track of
average character width thus far.
Newline estimation can be done by keeping track of
character heights.
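The two questions can be turned into a small heuristic. In this sketch the Chunk type, the function name, and the 0.5 thresholds are assumptions chosen for illustration; chunks are assumed to be sorted in left-to-right reading order, so RTL text would need the gap measured in the other direction.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    x: float       # left edge
    y: float       # baseline position
    width: float
    height: float


def join_with_spacing(chunks, gap_factor=0.5, line_factor=0.5) -> str:
    """Insert spaces and newlines based on gaps between sorted chunks."""
    out = []
    total_width = 0.0
    char_count = 0
    prev = None
    for chunk in chunks:
        if prev is not None:
            # Running average character width seen so far.
            avg_width = total_width / max(char_count, 1)
            if abs(chunk.y - prev.y) > line_factor * prev.height:
                out.append("\n")   # Y moved further than expected
            elif chunk.x - (prev.x + prev.width) > gap_factor * avg_width:
                out.append(" ")    # X gap is wider than expected
        out.append(chunk.text)
        total_width += chunk.width
        char_count += len(chunk.text)
        prev = chunk
    return "".join(out)


chunks = [
    Chunk("Hel", x=0, y=700, width=15, height=10),
    Chunk("lo", x=15, y=700, width=10, height=10),
    Chunk("World", x=30, y=700, width=25, height=10),
    Chunk("Next", x=0, y=688, width=20, height=10),
]
print(join_with_spacing(chunks))
```

With these sample values the first two chunks are joined directly, a space is inserted before "World", and "Next" starts a new line.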
PDFBox
PDFBox is an open source Apache Incubator project
It worked well for many documents in LTR languages
We enhanced it to:
Correct direction of RTL text
Normalize ligatures and presentation forms
Merge diacritics into text
Better estimate where to add spaces
Fix parsing issues
Deal with corrupt / non-compliant files
Can be freely downloaded (in next release):
http://incubator.apache.org/pdfbox/
Thank You!