Digital Reformatting of Text Aaron Choate Digital Library Production Services The University of Texas Libraries


Digital Reformatting of Text

Aaron Choate

Digital Library Production Services

The University of Texas Libraries


From last time:

Calculating potential file size (no really… this time we got it!)

file size in bytes = (height x width x bit-depth x dpi^2) / 8

(8 bits per byte; height and width in inches)
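As a quick check of the formula above, here is a minimal Python sketch. It assumes height and width are given in inches and that the image is stored uncompressed. The function name `file_size_bytes` and the letter-size sample page are illustrative and do not come from the slides.

```python
def file_size_bytes(height_in, width_in, bit_depth, dpi):
    """Uncompressed size in bytes: height x width x bit-depth x dpi^2 / 8."""
    total_bits = height_in * width_in * bit_depth * dpi ** 2
    return total_bits / 8  # 8 bits per byte


# Hypothetical example: an 8.5 x 11 inch page captured three different ways.
settings = [("bitonal", 1, 600), ("grayscale", 8, 300), ("color", 24, 300)]
for label, bits, dpi in settings:
    size = file_size_bytes(11, 8.5, bits, dpi)
    print(f"{label:9s} {bits:2d}-bit at {dpi} dpi: {size / 1_000_000:.1f} MB")
```

Because dpi is squared, doubling the resolution quadruples the file size, while doubling the bit depth only doubles it.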


Imaging: Benchmarking

Subjective evaluation becomes more problematic when the goal is legibility rather than fidelity.


Imaging: Benchmarking

Physical type, size, and presentation


Imaging: Benchmarking

Physical condition
• Darkening pages
• Fading ink
• Stains
• Bleed-through
• Uneven printing
• Fold lines
• Smearing


Imaging: Benchmarking

Document classification
• Simple text / printed line art
  • Distinct-edge-based representation
  • Bitonal?
• Manuscripts
  • Soft-edge-based
  • Grayscale / color
• Mixed material


Imaging: Benchmarking

Medium and support
• Support (paper, clay tablet, etc.)
  • Thin paper? (bleed-through)
• Medium (graphite pencil, inks, etc.)
  • Fading of ink
  • Variations in color or density


Imaging: Benchmarking

Tonal Representation


Imaging: Benchmarking

Color Appearance
• Is color reproduction necessary to the document’s meaning?
• What purpose does the color serve?
• How important is maintaining the color appearance?


Imaging: Benchmarking

Detail
• Printed text
  • Measure the height of the smallest lowercase letter that typifies the item or group of items.
• Manuscripts, line art
  • Measure the finest stroke width that must be represented and characterize the needed level of quality.


Imaging: Benchmarking

QI (Quality Index)
• Defines detail as character height
• ANSI/AIIM preservation microfilming standard for determining requirements for text legibility
• Defines a range from barely legible through excellent that maps to technical test targets


Imaging: Benchmarking

Line pairs

Excellent = 8 line pairs

Good = 5 line pairs

Marginal = 3.6 line pairs

Barely legible = 3.0 line pairs
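The thresholds above can be read as a lookup from a measured value to a legibility rating. The following is a minimal sketch, not part of the standard: it assumes that a value between two thresholds takes the lower rating and that anything under 3.0 counts as illegible. The name `rate_quality` is hypothetical.

```python
# Thresholds from the slide, highest first.
THRESHOLDS = [
    (8.0, "excellent"),
    (5.0, "good"),
    (3.6, "marginal"),
    (3.0, "barely legible"),
]


def rate_quality(line_pairs):
    """Return the best rating whose threshold the value meets or exceeds."""
    for minimum, label in THRESHOLDS:
        if line_pairs >= minimum:
            return label
    return "illegible"  # assumption: below the lowest listed threshold


for value in (8.2, 5.0, 4.0, 3.0, 2.5):
    print(value, "->", rate_quality(value))
```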


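Imaging: Benchmarking

Digital QI

Bitonal (only black pixels)

QI = (dpi x .039h) / 3

h = 3QI / (.039 x dpi)

dpi = 3QI / (.039 x h)

Tonal images (grayscale for printed text)

QI = (dpi x .039h) / 2

h = 2QI / (.039 x dpi)

dpi = 2QI / (.039 x h)

Here h is the height of the smallest character in millimeters; the factor .039 converts millimeters to inches.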
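To make the digital QI formulas concrete, here is a minimal Python sketch that solves them in each direction. It assumes h is the character height in millimeters and that .039 is the millimeter-to-inch conversion factor. The function names and the 1 mm sample character are illustrative only.

```python
MM_TO_INCH = 0.039  # conversion factor used in the QI formulas


def _divisor(bitonal):
    # 3 for bitonal capture, 2 for tonal (grayscale) capture
    return 3 if bitonal else 2


def digital_qi(dpi, h_mm, bitonal=True):
    """QI = (dpi x .039h) / divisor."""
    return dpi * MM_TO_INCH * h_mm / _divisor(bitonal)


def required_dpi(qi, h_mm, bitonal=True):
    """dpi = divisor x QI / (.039 x h)."""
    return _divisor(bitonal) * qi / (MM_TO_INCH * h_mm)


def smallest_height_mm(qi, dpi, bitonal=True):
    """h = divisor x QI / (.039 x dpi)."""
    return _divisor(bitonal) * qi / (MM_TO_INCH * dpi)


# Hypothetical example: resolution needed for QI 8 on a 1 mm character.
print(f"bitonal:   {required_dpi(8, 1.0, bitonal=True):.0f} dpi")
print(f"grayscale: {required_dpi(8, 1.0, bitonal=False):.0f} dpi")
```

For the same target QI and character height, the tonal formula asks for two thirds of the bitonal resolution, since the divisor drops from 3 to 2.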


Text Capture

Methods
• Rekeying
• OCR

Accuracy …


Software

• ScanSoft: OmniPage Pro
• ABBYY: FineReader
• Adobe Acrobat …
• Prime Recognition: PrimeOCR


Encoding


XML vs SGML

SGML (Standard Generalized Markup Language) is the grand-daddy of all markup languages.

XML is a subset of SGML intended to be the format for use on the Internet.

XML attempts to fill the gap between SGML, which can be used for just about anything, and HTML, which is severely limited and, because of this, is currently being abused (table structures for layout, clear 1-pixel GIFs, etc.).
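As an illustration of markup that describes structure rather than layout, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The `<letter>` document and its element names are invented for this example and do not follow any particular DTD or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names say what the content is,
# not how it should look on screen.
doc = """<letter>
  <sender>A. Writer</sender>
  <date when="1901-05-04">May 4, 1901</date>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</letter>"""

root = ET.fromstring(doc)  # raises ParseError if the XML is not well-formed
sender = root.findtext("sender")
when = root.find("date").get("when")
paragraphs = [p.text for p in root.iter("p")]

print(sender, when, len(paragraphs))
```

Because the tags carry meaning, a program can pull out the sender or the normalized date directly, which is not possible when the same text is arranged with layout tables.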


XML: DTDs vs Schemas


XML: TEI

Text Encoding Initiative
• Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.


XML: TEI

Levels of encoding
• Level 1: Fully Automated Conversion and Encoding
• Level 2: Minimal Encoding
• Level 3: Simple Analysis
• Level 4: Basic Content Analysis
• Level 5: Scholarly Encoding Projects


Character sets

Unicode

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.


Character sets: Unicode

Greek & Coptic
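A short Python sketch can show what "a unique number for every character" means in practice. It uses the standard `unicodedata` module and the Greek and Coptic block range U+0370 to U+03FF. The helper name `in_greek_and_coptic` is hypothetical.

```python
import unicodedata

GREEK_AND_COPTIC = range(0x0370, 0x0400)  # block U+0370..U+03FF


def in_greek_and_coptic(ch):
    """True if the character's code point falls in the Greek and Coptic block."""
    return ord(ch) in GREEK_AND_COPTIC


for ch in "Aα":
    code_point = ord(ch)  # the character's unique Unicode number
    print(
        f"U+{code_point:04X}",
        unicodedata.name(ch),
        ch.encode("utf-8"),
        in_greek_and_coptic(ch),
    )
```

The code point stays the same on every platform; only the byte encoding (UTF-8 here) is a storage choice.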


Software

• XMetaL
• oXygen
• Cooktop


Software

MetaE