Document Image Retrieval

LBSC 796/CMSC 828o
Douglas W. Oard
April 12, 2004

Mostly adapted from a lecture by David Doermann

Agenda

Questions
Definitions - Document, Image, Retrieval
Document Image Analysis
– Page decomposition
– Optical character recognition
Traditional Indexing with Conversion
– Confusion matrix
– Shape codes
Doing Things Without Conversion
– Duplicate Detection, Classification, Summarization, Abstracting
– Keyword spotting, etc.
Example: Chinese document images

Goals of this Class

Expand your definition of what is a “DOCUMENT”
To get an appreciation of the issues in document image indexing
To look at different ways of solving the same problems with different media
Your job: compare/contrast with other media

Document

[Figure: diagram with the labels DOCUMENT, DATABASE, and IMAGE]

Basic Medium for Recording Information
Transient
– Space
– Time
Multiple Forms
– Hardcopy (paper, stone, ...) / Electronic (CDROM, Internet, …)
– Written/Auditory/Visual (symbolic, scenic)
Access Requirements
– Search
– Browse
– “Read”

Sources of Document Images

The Web
– Some PDF files come from scanned documents
– Arabic news stories are often GIF images
Digital copiers
– Produce “corporate memory” as a byproduct
Digitization projects
– Provide improved access to hardcopy documents

Some Definitions

Modality
– A means of expression
Linguistic modalities
– Electronic text, printed, handwritten, spoken, signed
Nonlinguistic modalities
– Music, drawings, paintings, photographs, video
Media
– The means by which the expression reaches you
• Internet, videotape, paper, canvas, …

Document Images

A collection of dots called “pixels”
– Arranged in a grid and called a “bitmap”
Pixels often binary-valued (black, white)
– But greyscale or color is sometimes needed
300 dots per inch (dpi) gives the best results
– But images are quite large (1 MB per page)
– Faxes are normally 72 dpi
Usually stored in TIFF or PDF format
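As a rough check on the "1 MB per page" figure, the sketch below computes the uncompressed size of a page bitmap. The US Letter page size (8.5 × 11 inches) and the helper name `bitmap_bytes` are assumptions made for illustration; the slide does not state a page size.

```python
def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a scanned page bitmap."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size: US Letter, 8.5 x 11 inches.
for dpi in (72, 300):
    size = bitmap_bytes(8.5, 11, dpi)
    print(f"{dpi} dpi, 1 bit/pixel: {size / 1e6:.2f} MB")
print(f"300 dpi, 8 bits/pixel: {bitmap_bytes(8.5, 11, 300, 8) / 1e6:.2f} MB")
```

Under these assumptions a binary 300 dpi page is 2550 × 3300 pixels, or about 1.05 MB before compression, which is consistent with the figure on the slide. Greyscale at 8 bits per pixel is eight times larger.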

Images

Pixel representation of intensity map
No explicit “content”, only relations
Image analysis
– Attempts to mimic human visual behavior
– Draw conclusions, hypothesize and verify

[Figure: diagram with the labels DOCUMENT, DATABASE, and IMAGE]

Example pixel intensity values:
10 27 33 29
27 34 33 54
54 47 89 60
25 35 43 9

Image databases
– Use primitive image analysis to represent content
– Transform semantic queries into “image features”: color, shape, texture, …, spatial relations

Document Images

Scanned pixel representation of document
Data intensive (100-300 dpi, 1-24 bpp)
NO EXPLICIT CONTENT
Document image analysis or manual annotation required
– Takes pixels -> contents
– Automatic means are not guaranteed
Yet we want to be able to process them like text files!

[Figure: diagram with the labels DOCUMENT, DATABASE, and IMAGE]

Document Image Database

Collection of scanned images
Need to be available for indexing and retrieval, abstracting, routing, editing, dissemination, interpretation, …

[Figure: diagram with the labels DOCUMENT, DATABASE, and IMAGE, plus the labels Information Retrieval, Document Understanding, and Document Image Retrieval]

Managing Document Image Databases

Document Image Databases are often influenced by traditional DB indexing and retrieval philosophies
– We are comfortable with them
– They work
Problem: Requires content to be accessible
Techniques:
– Content-based retrieval (keywords, natural language)
– Query by structure (logical/physical)
– Query by functional attributes (titles, bold, …)
Requirements:
– Ability to browse, search, and read

Indexing Page Images (Traditional)

[Figure: processing pipeline with the labels Document, Scanner, Page Image, Page Decomposition, Structure Representation, Text Regions, Optical Character Recognition, and Character or Shape Codes]

Document Image Analysis

General Flow:
– Obtain Image - Digitize
– Preprocessing
– Feature Extraction
– Classification
General Tasks
– Logical and Physical Page Structure Analysis
– Zone Classification
– Language ID
– Zone Specific Processing
• Recognition
• Vectorization

Page Analysis

Skew correction
– Based on finding the primary orientation of lines
Image and text region detection
– Based on texture and dominant orientation
Structural classification
– Infer logical structure from physical layout
Text region classification
– Title, author, letterhead, signature block, etc.
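One common way to find the primary orientation of text lines is a projection-profile search: try several candidate angles and keep the one at which black pixels pile up into the fewest rows. The sketch below is one such approach, not necessarily the method the lecture has in mind. The function name, the scoring rule (sum of squared row counts), and the synthetic page are all assumptions for illustration.

```python
import math
from collections import Counter


def estimate_skew(points, candidate_angles):
    """Return the candidate angle (degrees) whose row histogram is most peaked.

    points: list of (x, y) coordinates of black pixels.
    A positive angle means text lines drift toward larger y as x grows.
    """
    best_angle, best_score = None, -1
    for angle in candidate_angles:
        theta = math.radians(angle)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # Rotate each point by -angle and keep only its row coordinate.
        rows = Counter(round(y * cos_t - x * sin_t) for x, y in points)
        # Sum of squared bin counts is large when pixels pile into few rows.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic page: five text lines, 20 pixels apart, skewed by 3 degrees.
slope = math.tan(math.radians(3.0))
page = [(x, 20 * line + x * slope) for line in range(5) for x in range(200)]
angles = [i * 0.5 for i in range(-20, 21)]  # -10.0 to 10.0 in 0.5 degree steps
print(estimate_skew(page, angles))
```

Because the search only looks at where ink accumulates, it makes no assumption about the script. For vertical writing the same idea could be applied to column coordinates instead of row coordinates.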

Image Detection

Text Region Detection

Language Identification

Language-independent skew detection
– Accommodate horizontal and vertical writing
Script class recognition
– Asian scripts have blocky characters
– Connected scripts can’t be segmented easily
Language identification
– Shape statistics work well for western languages
– Competing classifiers work for Asian languages

Optical Character Recognition

Pattern-matching approach
– Standard approach in commercial systems
– Segment individual characters
– Recognize using a neural network classifier
Hidden Markov model approach
– Experimental approach developed at BBN
– Segment into sub-character slices
– Limited lookahead to find best character choice
– Useful for connected scripts (e.g., Arabic)

OCR Accuracy Problems

Character segmentation errors
– In English, segmentation often changes “m” to “rn”
Character confusion
– Characters with similar shapes often confounded
OCR on copies is much worse than on originals
– Pixel bloom, character splitting, binding bend
Uncommon fonts can cause problems
– If not used to train a neural network

Improving OCR Accuracy

Image preprocessing
– Mathematical morphology for bloom and splitting
– Particularly important for degraded images
“Voting” between several OCR engines helps
– Individual systems depend on specific training data
Linguistic analysis can correct some errors
– Use confusion statistics, word lists, syntax, …
– But more harmful errors might be introduced
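A minimal sketch of the voting idea, assuming the outputs of the OCR engines have already been aligned character by character so that they have equal length. Real systems must first align outputs that differ in length, which this sketch does not attempt. The function name `vote` is hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Majority vote per character position across aligned OCR outputs."""
    if len({len(text) for text in outputs}) != 1:
        raise ValueError("outputs must be aligned to the same length")
    result = []
    for chars in zip(*outputs):
        # most_common(1) picks the most frequent character at this position;
        # ties go to the character seen first (the earliest engine).
        result.append(Counter(chars).most_common(1)[0][0])
    return "".join(result)


# Three hypothetical engines, each making a different single-character error.
print(vote(["document", "docunent", "doeument"]))
```

Voting helps here because the engines disagree about where the errors are. If all engines were trained on the same data and made the same mistake, the vote would simply confirm it.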

OCR Speed

Neural networks take about 10 seconds a page
– Hidden Markov models are slower
Voting can improve accuracy
– But at a substantial speed penalty
Easy to speed things up with several machines
– For example, by batch processing using desktop computers at night

Problem: Logical Page Analysis (Reading Order)

Can be hard to guess in some cases
– Newspaper columns, figure captions, appendices, …
Sometimes there are explicit guides
– “Continued on page 4” (but page 4 may be big!)
Structural cues can help
– Column 1 might continue to column 2
Content analysis is also useful
– Word co-occurrence statistics, syntax analysis

Processing Converted Text

Typical Document Image Indexing

Convert hardcopy to an “electronic” document
– OCR
– Page Layout Analysis
– Graphics Recognition
Use structure to add metadata
Manually supplement with keywords
Use traditional text indexing and retrieval techniques?

Information Retrieval on OCR

Requires robust ways of indexing
Statistical methods with large documents work best
Key Evaluations
– Success for high quality OCR (Croft et al 1994, Taghva 1994)
– Limited success for poor quality OCR (1996 TREC, UNLV)
– Clustering successful for > 85% accuracy (Tsuda et al, 1995)

Proposed Solutions

Improve OCR
Automatic Correction
– Taghva et al, 1994
Enhance IR techniques
– Lopresti and Zhou, 1996
N-Grams
Applications
– Cornell CS TR Collection (Lagoze et al, 1995)
– Degraded Text Simulator (Doermann and Yao, 1995)

N-Grams

Powerful, inexpensive statistical method for characterizing populations
Approach
– Split up document into overlapping n-character sequences
– Use traditional indexing representations to perform analysis
– “DOCUMENT” -> DOC, OCU, CUM, UME, MEN, ENT
Advantages
– Statistically robust to small numbers of errors
– Rapid indexing and retrieval
– Works from 70%-85% character accuracy where traditional IR fails
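The trigram example above can be reproduced with a few lines of code. The sketch also compares a clean word with a hypothetical OCR misreading of it using the Dice coefficient, which is one possible overlap measure and is not named in the lecture.

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def dice(a, b, n=3):
    """Dice coefficient over the sets of n-grams of two strings."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(char_ngrams("DOCUMENT"))        # DOC, OCU, CUM, UME, MEN, ENT
print(dice("DOCUMENT", "DOCUNENT"))   # one misread character
```

With whole-word indexing, "DOCUNENT" would not match "DOCUMENT" at all. With trigrams, three of the six keys (DOC, OCU, ENT) still match, which is the sense in which n-grams are robust to a small number of errors.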

Matching with OCR Errors

Above 80% character accuracy, use words
– With linguistic correction
Between 75% and 80%, use n-grams
– With n somewhat shorter than usual
– And perhaps with character confusion statistics
Below 75%, use word-length shape codes

Handwriting Recognition

With stroke information, can be automated
– Basis for input pads
Simple things can be read without strokes
– Postal addresses, filled-in forms
Free text requires human interpretation
– But repeated recognition is then possible

Conversion?

Full conversion often required
Conversion is difficult!
– Noisy data
– Complex layouts
– Non-text components
Points to Ponder
Do we really need to convert?
Can we expect to fully describe documents without assumptions?

Researchers are seeing a progression from full conversion to image-based approaches

Applications
– Indexing and Retrieval
– Information Extraction
– Duplicate Detection
– Clustering (Document Similarity)
– Summarization
Advantages
– Makes use of powerful image properties (Function, IVC 1998)
– Can be cheaper than conversion
– Makes use of redundancy in the language

Outline

Processing Converted Text
Manipulating Images of Text
– Title Extraction
– Named Entity Extraction
– Keyword Spotting
– Abstracting and Summarization
Indexing based on Structure
Graphics and Drawings
Related Work and Applications

Processing Images of Text

Characteristics
– Does not require expensive OCR/Conversion
– Applicable to filtering applications
– May be more robust to noise
Possible Disadvantages
– Application domain may be very limited
– Processing time may be an issue if indexing is otherwise required

Proper Noun Detection (DeSilva and Hull, 1994)

Problem: Filter proper nouns in images of text
– People, Places, Things
Advantages of the Image Domain:
– Saves converting all of the text
– Allows application of word recognition approaches
– Limits post-processing to a subset of words
– Able to use features which are not available in the text
Approach:
– Identify Word Features
• Capitalization, location, length, and syntactic categories
– Classify using rule-set
– Achieve 75-85% accuracy without conversion

Keyword Spotting

Techniques:
– Word Shape/HMM (Chen et al, 1995)
– Word Image Matching (Trenkle and Vogt, 1993; Hull et al)
– Character Stroke Features (DeCurtins and Chen, 1995)
– Shape Coding (Tanaka and Torii; Spitz 1995; Kia, 1996)
Applications:
– Filing System (Spitz - SPAM, 1996)
– Numerous IR
– Processing handwritten documents
Formal Evaluation:
– Scribble vs. OCR (DeCurtins, SDIUT 1997)

Shape Coding

Approach
– Use of generic character descriptors
– Make use of the power of language to resolve ambiguity
– Map characters based on shape features, including ascenders, descenders, punctuation, and characters with holes

[Figure: example grouping of characters into shape classes; the extracted labels read: a aeox cmnrsuvwxyzA fhklti Ij;b bdg gpq]

Shape Codes

Group all characters that have similar shapes
– {A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, 2, 3, 4, 5, 6, 7, 8, 9, 0}
– {a, c, e, n, o, r, s, u, v, x, z}
– {b, d, h, k, }
– {f, t}
– {g, p, q, y}
– {i, j, l, 1}
– {m, w}
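The grouping above can be turned into a lookup table that maps each character to a class label. The one-letter labels (A, x, b, f, g, i, m) are invented here for illustration; the slide lists only the groups. The third group is used as it appears on the slide, with the four characters b, d, h, k.

```python
# Shape classes from the list above; the class labels are hypothetical.
SHAPE_CLASSES = {
    "A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",
    "x": "acenorsuvxz",
    "b": "bdhk",
    "f": "ft",
    "g": "gpqy",
    "i": "ijl1",
    "m": "mw",
}
CHAR_TO_CODE = {ch: label for label, chars in SHAPE_CLASSES.items() for ch in chars}


def shape_code(word):
    """Replace each character by its shape class; unmapped characters pass through."""
    return "".join(CHAR_TO_CODE.get(ch, ch) for ch in word)


print(shape_code("Document"))
print(shape_code("cat"), shape_code("eat"), shape_code("oat"))
```

Different words such as "cat", "eat", and "oat" collapse to the same code, so a search on shape codes returns extra matches. Misreading "c" as "e" no longer matters, because both characters fall in the same class.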

Why Use Shape Codes?

Can recognize shapes faster than characters
– Seconds per page, and very accurate
Preserves recall, but with lower precision
– Useful as a first pass in any system
Easily extracted from JPEG-2 images
– Because JPEG-2 uses object-based compression

Additional Applications

Handwritten Archival Manuscripts
– (Manmatha, 1997)
Page Classification
– (DeCurtins and Chen, 1995)
Matching Handwritten Records
– (Ganzberger et al, 1994)
Headline Extraction
Document Image Compression (UMD, 1996-1998)

Outline

Processing Converted Text
Manipulating Images of Text
Indexing Based on Structure
– Logical
– Physical
– Functional
Graphics and Drawings
Related Work and Applications

Document Functionality [ICDAR 1997]

Humans process documents very robustly
When interacting with documents, we can interpret without recognition
We can judge relevance without reading
We can rapidly navigate documents to find the information we want
Claims
We must provide basic ways to interact with documents, and interaction often relies as much on the structure of a document as on the content
Traditional geometric properties and type-dependent logical models are not sufficient

The Role of Documents

The role or “function” of a document is to store data in symbolic form which has been produced by a sender (the author) to facilitate transfer to a receiver (the reader)
Documents are designed to be interpreted by humans
Authors typically tailor this design to optimize the transfer of information
Readers use structure to enhance interpretation
In what ways does the design facilitate, disambiguate or enhance the flow of information?

Functional Structures

Structure     Example                         Use
header        Centered                        Relative importance, focal point
list          Enumerated, itemized            Conveys temporal sequence; suggests similar level of descriptiveness
separator     White space or rule line        Physical and possibly semantic disassociation
attachment    Footnote, boxed text, side bar  Supplemental information under some semantic hierarchy
illustration  Table, figure                   Supplemental information; preserves 2D association; graphical representation of information

Outline

Processing Converted Text
Manipulating Images of Text
Indexing based on Structure
Graphics and Drawings
Related Work and Applications

Map Interpretation (Samet et al)

Identify the legend on the map image
Extract images of map labels and descriptions
Identify labels in the map images
Allow user to query based on extracted images
Bootstraps the information extraction and interpretation problems

Outline

Processing Converted Text
Manipulating Images of Text
Indexing based on Structure
Graphics and Drawings
Related Work and Applications

Duplicate Detection

Same content, same format
– For example, a xerox copy
Same content, different format
– For example, as a web page or on paper
Shared content, same format
– For example, a paper with annotations
Shared content, different format
– For example, including text with cut-and-paste
Duplicate Reconciliation

Approach

Use global features to restrict search
– Number of pages, number of lines, page moments
Extract a signature
– Using shape codes
Convert signature
– Use a set of n-gram keys to index the database
Rank and verify
– Return top N documents
– Visual or algorithmic refinement
Advantages:
– Robust to noise, extracted quickly, extracted easily, efficiently stored
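The signature, indexing, and ranking steps above can be sketched as a small in-memory index. This is an illustration under several assumptions rather than the system described in the lecture: the shape-class labels are invented, the signature is computed from text rather than from page images, the global-feature filter and the verification step are omitted, and the n-gram length of 5 is arbitrary.

```python
from collections import Counter, defaultdict

# Shape classes as listed on the "Shape Codes" slide; labels are hypothetical.
_CLASSES = {
    "A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",
    "x": "acenorsuvxz", "b": "bdhk", "f": "ft",
    "g": "gpqy", "i": "ijl1", "m": "mw",
}
_CODE = {ch: label for label, chars in _CLASSES.items() for ch in chars}


def signature(text):
    """Shape-code string for a page of text, skipping unmapped characters."""
    return "".join(_CODE[ch] for ch in text if ch in _CODE)


def ngram_keys(sig, n=5):
    """Set of overlapping n-grams used as index keys."""
    return {sig[i:i + n] for i in range(len(sig) - n + 1)}


class DuplicateIndex:
    def __init__(self, n=5):
        self.n = n
        self.postings = defaultdict(set)  # n-gram key -> document ids

    def add(self, doc_id, text):
        for key in ngram_keys(signature(text), self.n):
            self.postings[key].add(doc_id)

    def query(self, text, top_n=3):
        """Rank indexed documents by the number of keys shared with the query."""
        votes = Counter()
        for key in ngram_keys(signature(text), self.n):
            for doc_id in self.postings.get(key, ()):
                votes[doc_id] += 1
        return votes.most_common(top_n)


index = DuplicateIndex()
index.add("a", "Document image retrieval with shape codes")
index.add("b", "Optical character recognition of printed Chinese newspapers")
# A degraded copy of document "a" with three misread characters.
print(index.query("Docunent inage retrieval with shape eodes"))
```

The candidates returned by `query` would then go to the "rank and verify" step, for example a visual comparison of the top N pages.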

Cross-Language Duplicate Detection (= finding translations!)

Evaluation

The usual approach: Model-based evaluation
– Apply confusion statistics to an existing collection
A bit better: Print-scan evaluation
– Scanning is slow, but availability is no problem
Best: Scan-only evaluation
– No existing IR collections have printed materials
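Model-based evaluation can be sketched as a function that corrupts clean text according to a table of character confusions. The confusion table and its weights below are made up for illustration; in practice they would be estimated from real OCR output. The function name and the fixed random seed are also assumptions.

```python
import random


def corrupt(text, confusions, error_rate, seed=0):
    """Simulate OCR errors by substituting characters.

    confusions: {char: [(replacement, weight), ...]}
    error_rate: probability that a confusable character is replaced.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    out = []
    for ch in text:
        options = confusions.get(ch)
        if options and rng.random() < error_rate:
            replacements, weights = zip(*options)
            out.append(rng.choices(replacements, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical confusion statistics (not measured values).
CONFUSIONS = {
    "m": [("rn", 3), ("n", 1)],
    "c": [("e", 1)],
    "l": [("1", 1), ("i", 1)],
}
print(corrupt("document image retrieval", CONFUSIONS, error_rate=0.3))
```

Running an existing text collection through such a function yields a degraded collection with known ground truth. That makes the approach convenient, although the simulated errors may not match what a real scanner and OCR engine would produce.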

Summary

Many applications benefit from image-based indexing
– Less discriminatory features
– Features may therefore be easier to compute
– More robust to noise
– Often computationally more efficient
Many classical IR techniques have application for DIR
Structure as well as content are important for indexing
Preservation of structure is essential for in-depth understanding

Example Title Pages (#4 & #9)

Title Page Overall Accuracy

57 Title pages, 891 non-title pages
Overall Accuracy = 906/948 = 95.57%
Title Page Accuracy = 37/57 = 64.91%
False Positives = 22
False Negatives = 20
Observations
– All without Type-Specific Information
– Need Functional (or Logical) Features
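The figures above can be cross-checked from the four counts in the confusion matrix. The sketch below derives the true positives and true negatives from the stated totals and also computes precision, which is not reported on the slide and is added here only as a derived value.

```python
def title_page_metrics(tp, fp, fn, tn):
    """Accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),      # "Title Page Accuracy" on the slide
        "precision": tp / (tp + fp),   # derived here, not on the slide
    }


title_pages, non_title_pages = 57, 891
fp, fn = 22, 20
tp = title_pages - fn       # 37 title pages found
tn = non_title_pages - fp   # 869 non-title pages correctly rejected
metrics = title_page_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The high overall accuracy mostly reflects the 891 non-title pages. On the title pages themselves, about a third are missed and roughly 37% of the pages flagged as title pages are wrong.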

Agenda

Questions
Definitions - Document, Image, Retrieval
Document Image Analysis
Traditional Indexing with Conversion
Doing Things Without Conversion
Recent work on IR with Chinese document images
– Tseng and Oard

Document Retrieval Approaches for Images of Text

Full-text search based on manually re-keying the text
– Prohibitively expensive at large scale
Search based on bibliographic metadata
– May be difficult to adequately describe the materials
Full text based on Optical Character Recognition (OCR)
– Inexpensive and relatively rapid
– Sensitive to OCR accuracy

Key Questions for Information Retrieval

What to index?
– Phrases, words, characters, or shape codes
– Unigrams or n-grams
How to weight a term in a document?
– Term frequency (TF)
– Document frequency (DF)
– Document length normalization
– (Term position)
How to assign scores to documents?
– Boolean, vector space, and probabilistic models
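To make the weighting questions concrete, here is a minimal vector space sketch that combines term frequency, document frequency, and length normalization. The particular formula (log-scaled TF times log inverse DF, with cosine normalization) is one common textbook choice and is an assumption here; the lecture does not commit to a formula. The toy documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency
        weights = {
            term: (1 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Length normalization: scale each document vector to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def score(query_terms, vector):
    """Dot product of a binary query vector with a document vector."""
    return sum(vector.get(term, 0.0) for term in query_terms)


docs = [["shape", "codes", "shape"], ["ocr", "errors"], ["shape", "ocr"]]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, round(score(["shape"], vec), 3))
```

The terms could equally be words, character n-grams, or shape codes, since the weighting scheme does not depend on what was chosen as the indexing unit.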

Chinese Text Retrieval Issues

Words may be any number of characters (typically 2-5)
– But some contain only 1 character or more than 5 characters
– e.g., “貓” (cat), “聯合國教科文組織” (UNESCO)
Longer words (over 2 characters) often have shorter sub-word units
– Transliteration is an exception
Written Chinese has no word separator
– A sentence can be segmented in different ways, all of which may be legal
– Similar to the phrase detection problem in English
Chinese character inventory is very large
– 13,500 characters in Big-5 code (traditional Chinese: Taiwan and Hong Kong)
– Over 6,000 characters in GB code (simplified Chinese: China, Singapore)
– About 3,000 commonly used characters in each character set
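Because written Chinese has no word separator, one way to sidestep segmentation is to index overlapping character n-grams instead of words. The sketch below is a generic illustration of that idea and is not a description of the experiments reported later in the lecture. The function name is hypothetical.

```python
def chinese_ngrams(text, n=2):
    """Overlapping character n-grams, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


print(chinese_ngrams("聯合國教科文組織"))   # bigrams of the 8-character word
print(chinese_ngrams("貓", n=2))            # a 1-character word has no bigrams
print(chinese_ngrams("貓", n=1))            # unigrams still capture it
```

The example shows the trade-off raised on the slide. Bigrams cover the sub-word units of longer words without any segmenter, but a single-character word such as "貓" is only reachable through unigrams.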

Socio-Cultural Research Center (SCRC) Collection

800,000 newspaper clippings from 1950-1976
– Scanned over 300,000 at 300 dpi
30 China, Hong Kong, and Taiwan news agencies
– Mostly simplified Chinese, some traditional Chinese
Focus on diplomatic and military activities

Document Preparation

Selected 11,108 scanned document images

OCR yielded 8,438 valid docs (Presto! OCR Pro, Big-5)
– Avg valid document had a 69% system-reported “recognition rate”
  • Computed on a sample of 1,300 documents

Second version prepared using Big-5 to GB conversion
– GB version used in experiments

Topic Preparation

Based on contemporaneous Chinese journal articles
– From 100 paper titles, 30 were selected and rewritten as Chinese topics

Made English translations for cross-language experiments
– Translated by native speakers of Chinese

<top>
<num> 12
<title> Anti-Chinese Movements
<description> Activities related to the anti-Chinese movements in Indonesia
<narrative> Articles must deal with activities related to the anti-Chinese movement in Indonesia; case reports or articles dealing with PRC's criticism of the Anti-Chinese movement will be considered partly relevant.
</top>

Relevance Judgments

Exhaustive tri-state relevance judgments
– Irrelevant (=0), partially relevant (=1), fully relevant (=2)

Every topic-document pair judged by 3 assessors
– 2 majored in history, 1 majored in library science
– Averaged 4 minutes per document image (for all 30 topics)

Sum of the judgments provides a final estimate
– 0=not relevant, 1…5=partially relevant, 6=fully relevant
– Threshold as desired to reflect the intended application
  • In our experiments, any score > 0 is treated as “relevant”

Query ID  Doc. ID  1st Assessor  2nd Assessor  3rd Assessor  Total Score
01        0053487  1             1             0             2
01        0053489  1             2             1             4
…
02        0054425  2             2             2             6
02        0054452  1             1             1             3
…
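The aggregation of the three assessor scores can be sketched as follows. The rows are copied from the sample judgments; the function names and the `threshold` parameter are hypothetical, with the default reflecting the "any score > 0" rule stated for the experiments.

```python
# Sample rows: (query_id, doc_id, scores from the three assessors).
judgments = [
    ("01", "0053487", (1, 1, 0)),
    ("01", "0053489", (1, 2, 1)),
    ("02", "0054425", (2, 2, 2)),
    ("02", "0054452", (1, 1, 1)),
]


def total_score(scores):
    """Sum the per-assessor scores (each 0, 1, or 2) into a 0..6 total."""
    return sum(scores)


def label(total):
    """Map a summed score to the three-way label used on the slide."""
    if total == 0:
        return "not relevant"
    if total == 6:
        return "fully relevant"
    return "partially relevant"


def is_relevant(total, threshold=0):
    """Binary decision; threshold=0 treats any score > 0 as relevant."""
    return total > threshold


for query_id, doc_id, scores in judgments:
    total = total_score(scores)
    print(query_id, doc_id, total, label(total), is_relevant(total))
```

Raising `threshold` (for example to 3) would give a stricter notion of relevance for applications that need higher precision.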

Chinese OCR Text Retrieval Strategies

Indexing method:
– Both 1-gram (for partial match) and 2-gram (for preserving sequence)
– Example: “ABC” will be indexed with “A”, “B”, “C”, “AB”, “BC”
– Compared to 1-gram only and 2-gram only

Weighting scheme:
– document terms: TF*IDF = log(1 + tf) * log(N/df)
– query terms: tf * (3w-1), where w is the length of the term

Retrieval model:
– Vector space model compared with probabilistic model

Document length normalization:
– byte size for document terms, compared to cosine
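A minimal sketch of the combined 1-gram and 2-gram indexing and the two weighting formulas listed above. The function names are hypothetical, and the natural logarithm is an assumption, since the slide does not state the log base.

```python
import math
from collections import Counter


def index_terms(text):
    """Return all character 1-grams followed by all overlapping 2-grams."""
    unigrams = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return unigrams + bigrams


def document_weight(tf, df, n_docs):
    """Document term weight: log(1 + tf) * log(N / df). Natural log assumed."""
    return math.log(1 + tf) * math.log(n_docs / df)


def query_weights(query):
    """Query term weight: tf * (3w - 1), where w is the term length."""
    counts = Counter(index_terms(query))
    return {term: tf * (3 * len(term) - 1) for term, tf in counts.items()}


print(index_terms("ABC"))    # ['A', 'B', 'C', 'AB', 'BC']
print(query_weights("ABC"))  # 1-grams get weight 2, 2-grams get weight 5
```

Under the query formula a 2-gram counts 2.5 times as much as a 1-gram with the same frequency, so matches that preserve character sequence are favored over isolated character matches.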

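Byte size normalization:

\[ Sim(d_i, q_j) = \frac{\sum_{k=1}^{T} d_{i,k}\, q_{j,k}}{(bytesize_{d_i})^{0.375} \cdot \sqrt{\sum_{k=1}^{T} q_{j,k}^{2}}} \]

Cosine normalization:

\[ Sim(D_i, Q_j) = \frac{\sum_{k=1}^{t} d_{i,k}\, q_{j,k}}{\sqrt{\sum_{k=1}^{t} d_{i,k}^{2}} \cdot \sqrt{\sum_{k=1}^{t} q_{j,k}^{2}}} \]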
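The two similarity functions can be compared with a short sketch. It assumes documents and queries are represented as dictionaries mapping index terms to weights; the function names and the sample values are hypothetical.

```python
import math


def dot(doc, query):
    """Inner product over the terms shared by document and query."""
    return sum(w * query[t] for t, w in doc.items() if t in query)


def norm(vec):
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(w * w for w in vec.values()))


def sim_bytesize(doc, query, doc_bytes):
    """Document side normalized by (byte size)^0.375, query side by its length."""
    return dot(doc, query) / ((doc_bytes ** 0.375) * norm(query))


def sim_cosine(doc, query):
    """Both document and query normalized by their Euclidean lengths."""
    return dot(doc, query) / (norm(doc) * norm(query))


# Hypothetical weights and byte size, chosen so the arithmetic is easy to check.
doc = {"A": 3.0, "B": 4.0}
query = {"A": 1.0}
print(sim_cosine(doc, query))                   # 3 / (5 * 1)
print(sim_bytesize(doc, query, doc_bytes=256))  # 3 / (256^0.375 * 1)
```

The only difference between the two functions is the document-side denominator: cosine depends on every term weight in the document, while byte size depends only on how long the document is.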

OCR and Length Normalization

Experiments by Taghva et al. showed that
– some sophisticated weighting schemes shown to be more effective for ordinary text might lead to more unstable results for OCR-degraded text

Singhal, Salton, and Buckley [‘96] analyzed this phenomenon using
– Vector space model (SMART system)
– Word-based indexing
– Simulated OCR output of a TREC collection (2GB of 742,202 docs)
– 50 TREC queries (numbered from 151 to 200)
– Specifically, the effects of cosine normalization and IDF are analyzed
– Incorrect terms like ‘systom’ have large IDF and thus affect the weights of other terms in the same document if cosine normalization is used:

\[ \sqrt{w_1^2 + w_2^2 + \cdots + w_T^2} \]

– They correct this problem by using byte size normalization: (byte size)^0.375
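The effect described above can be reproduced on a toy example. The sketch assumes made-up IDF values (with the misrecognized term 'systom' given a large IDF because it is rare) and the log(1 + tf) * IDF weighting; all names and numbers are hypothetical.

```python
import math
from collections import Counter

# Hypothetical IDF values; the OCR error 'systom' is rare, so its IDF is large.
IDF = {"system": 1.0, "retrieval": 1.5, "systom": 8.0}


def raw_weights(text):
    """Unnormalized term weights: log(1 + tf) * IDF."""
    counts = Counter(text.split())
    return {t: math.log(1 + tf) * IDF[t] for t, tf in counts.items()}


def cosine_normalized(text):
    """Divide each weight by sqrt(w_1^2 + ... + w_T^2)."""
    weights = raw_weights(text)
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}


def bytesize_normalized(text):
    """Divide each weight by (byte size)^0.375."""
    weights = raw_weights(text)
    factor = len(text.encode("utf-8")) ** 0.375
    return {t: w / factor for t, w in weights.items()}


clean = "system retrieval system"
ocr = "systom retrieval system"  # one occurrence misrecognized

print(cosine_normalized(clean)["retrieval"], cosine_normalized(ocr)["retrieval"])
print(bytesize_normalized(clean)["retrieval"], bytesize_normalized(ocr)["retrieval"])
```

Under cosine normalization the weight of the correctly recognized term 'retrieval' drops sharply in the OCR version, because the large 'systom' weight inflates the vector length. Under byte size normalization it is unchanged, since both strings have the same number of bytes.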

Results Summary

[Figure: bar chart of mean average precision (y-axis, 0.3 to 0.5) for Inquery, ByteSize, and Cosine, with bars for 1-gram, 2-gram, and 1+2-gram indexing, shown for long queries and title queries]

• 1+2 gram is best
• ByteSize beats Cosine
• Long queries beat Titles
• Inquery does well
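The results are reported as mean average precision. The slides do not say how it was computed, so the sketch below assumes the standard uninterpolated definition with binary relevance; the function names and sample rankings are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """Average the per-topic average precision over all topics."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings for two topics: (ranked list, set of relevant docs).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(mean_average_precision(runs))
```

For the first topic the relevant documents appear at ranks 1 and 3, giving (1/1 + 2/3) / 2; for the second the single relevant document is at rank 2, giving 1/2.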

Conclusions of Study

The SCRC test collection is useful
– But more than 30 topics may be needed for statistical significance

Indexing 1-grams and 2-grams together works well
– If 2-grams are given greater weight in the query

Byte size normalization outperforms cosine normalization
– But Inquery does better than either on short queries

OCR errors adversely affect blind relevance feedback
– A clean comparable collection would probably work better
– Pruning seems to help
– Considerable parameter tuning is needed (several parameters, including k)