Overview
• Text: the main form of communicating knowledge
• Document: a single unit of information
• A document has
– syntax and structure: dictated by the application or by the person who created it
– semantics: specified by the author
– presentation style: specifies how it should be displayed or printed
Characteristics of a Document
[Figure: a document and its characteristics: syntax (text + structure + other media), semantics, presentation style]
Overview
• The syntax of a document can express
– structure
– presentation style
– semantics
– external actions
• This syntax can be
– implicit
– expressed in a simple declarative language
– expressed in a programming language
Metadata
Metadata, ‘data about the data’, is information on
– the organization of the data
– the various data domains
– the relationship between them
Metadata Examples
• Database management system:
– name of the relations
– fields or attributes of each relation
– domain of each attribute
• Text:
– author
– date of publication
– source of publication
– document length
– document genre
Descriptive Metadata
Descriptive Metadata: metadata that
– is external to the meaning of the document
– pertains to how the document was created
Example: the Dublin Core Metadata Element Set, which proposes 15 fields to describe a document
Semantic Metadata
Semantic Metadata: metadata that
– characterizes the subject matter found in the document’s contents
– is associated with a wide number of documents
– is increasingly available
Example:
– All books published in the USA are assigned Library of Congress subject codes
Semantic Metadata
Example:
– Many journals require author-assigned key terms (from a closed vocabulary of relevant terms)
– topical metadata for biomedical articles in the MEDLINE system include disease, anatomy, pharmaceuticals, etc.
Metadata for Web Document
• On the Web, metadata can be used for:
– cataloging: a popular format is BibTeX
– content rating
– intellectual property rights
– digital signatures
– privacy levels
– applications to electronic commerce
Metadata for Web Document
Resource Description Framework (RDF): a new standard for Web metadata
– provides interoperability between applications
– allows the description of Web resources to facilitate automated processing of the information
– consists of a description of nodes and attached attribute/value pairs
Metadata for Web Document
• node: a Web resource, identified by a Uniform Resource Identifier (URI)
• attribute: a property of a node
• value: a text string or another node
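The node/attribute/value description can be pictured as a set of triples. The following is only an illustrative sketch in plain Python, not RDF syntax. The resource URIs under `example.org`, the `terms/name` attribute, and the helper `describe` are hypothetical; the `title` and `creator` attributes borrow Dublin Core element URIs.

```python
# Each statement is (node, attribute, value).
# A value is either a text string or the URI of another node.
DC = "http://purl.org/dc/elements/1.1/"
doc = "http://example.org/report"           # hypothetical Web resource
author = "http://example.org/people/ana"    # hypothetical node used as a value

triples = [
    (doc, DC + "title", "Annual Report"),
    (doc, DC + "creator", author),                   # value is another node
    (author, "http://example.org/terms/name", "Ana"),
]


def describe(node, statements):
    """Collect the attribute/value pairs attached to one node."""
    return {attr: value for n, attr, value in statements if n == node}


print(describe(doc, triples))
```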
Text
• Text is coded in binary digits for the computer
• First coding schemes: EBCDIC (eight bits per symbol) and ASCII (seven bits per symbol)
• Later, ASCII was extended to eight bits (ISO-Latin)
– accommodates several languages
– includes accents and diacritical marks
• Unicode (ISO 10646) uses a 16-bit code
– accommodates East Asian languages
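The difference between these coding schemes can be seen with Python's built-in codecs. This is a small sketch; the sample word is an arbitrary choice, and `utf-16-be` is used here as a stand-in for a fixed 16-bit code.

```python
text = "naïve"  # contains one accented letter

# Seven-bit ASCII has no code for the accented letter.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

latin1 = text.encode("latin-1")    # ISO-Latin: one byte per symbol
utf16 = text.encode("utf-16-be")   # 16-bit code: two bytes per symbol here

print(ascii_ok, len(latin1), len(utf16))
```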
Formats
• In the past, IR systems converted a document to an internal format
Disadvantages:
– the original application related to the document is no longer useful
– the contents of a document cannot be changed
• Current IR systems use filters
– filtering might not be possible with proprietary or non-public formats
Formats
• Full ASCII syntax: TeX
• Binary syntax: Word, WordPerfect, FrameMaker
• Rich Text Format (RTF):
– used by word processors
– has ASCII syntax
– developed for document interchange
Formats
• Portable Document Format (PDF) and PostScript
– developed for displaying and printing documents
• Multipurpose Internet Mail Extensions (MIME)
– interchange format
– used to encode electronic mail
Formats
• Compressed text:
– Compress (Unix)
– ARJ (PCs)
– ZIP (gzip-Unix, Winzip-Windows), etc.
• Conversion tools: convert binary files (compressed text) to ASCII text for transmission:
– uuencode/uudecode
– binhex
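A rough sketch of the compress-then-convert pipeline, using Python's standard `zlib` and `binascii` modules. It produces only the uuencoded data lines, not a complete uuencode file with its `begin`/`end` framing; the helper names and the sample text are assumptions made for illustration.

```python
import binascii
import zlib


def to_ascii_lines(data):
    """Encode binary data as uuencoded lines (at most 45 bytes per line)."""
    return [binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)]


def from_ascii_lines(lines):
    """Decode uuencoded lines back into the original binary data."""
    return b"".join(binascii.a2b_uu(line) for line in lines)


original = b"information retrieval " * 50
compressed = zlib.compress(original)      # binary, not safe for 7-bit channels
lines = to_ascii_lines(compressed)        # printable ASCII only
restored = zlib.decompress(from_ascii_lines(lines))
```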
Information Theory
• the distribution of symbols is related to the information (or semantics) in written text
• entropy: used to capture information content (or information uncertainty)
– σ: the number of symbols in the alphabet
– p_i: probability of appearance of each symbol (the symbol frequency over the total number of symbols)
– E: the entropy of this text

$$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$$
Entropy
• the σ symbols of the alphabet are coded in binary → the entropy is measured in bits
• example: for σ = 2,
– the entropy is 1 if both symbols appear the same number of times
– the entropy is 0 if only one symbol appears
• the text model determines the probabilities p_i and the amount of information in a text
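A minimal sketch of the entropy computation, assuming the simplest text model: each character is a symbol and its probability is its relative frequency in the text. The function name `entropy` is an illustrative choice.

```python
from collections import Counter
from math import log2


def entropy(text):
    """Entropy in bits per symbol, with p_i estimated from symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    # p * log2(1/p) is the same as -p * log2(p)
    return sum((c / total) * log2(total / c) for c in counts.values())


print(entropy("abab"))  # two symbols, equally frequent
print(entropy("aaaa"))  # only one symbol appears
```

The two calls reproduce the σ = 2 example: the first text gives 1 bit per symbol, the second gives 0.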
Modeling Natural Language
• text is composed of symbols from a finite alphabet
• symbols can be divided into two subsets
– symbols that separate words
– symbols that belong to words
• A simple model to generate text is the binomial model
• In natural language, these symbols are not uniformly distributed → each symbol depends on the previous symbols
Modeling Natural Language
• a finite-context or Markovian model can be used to compute this dependency
• more complex models: finite-state models and grammar models
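As an illustration of a finite-context model, the sketch below records which symbol follows each context of one character and then generates text by sampling from those followers. The function names, the order of 1, and the training string are assumptions chosen to keep the example small.

```python
import random
from collections import defaultdict


def train(text, order=1):
    """Map each context of `order` symbols to the symbols observed after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model


def generate(model, context, length, rng):
    """Extend `context` by sampling each next symbol given the previous ones."""
    out = context
    order = len(context)
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:      # context never seen in training
            break
        out += rng.choice(followers)
    return out


model = train("the cat sat on the mat")
sample = generate(model, "t", 30, random.Random(0))
print(sample)
```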
Distribution of the Frequencies
• Zipf’s Law is used to model the distribution of word frequencies in the text
– the frequency of the i-th most frequent word is 1/i^θ times that of the most frequent word
– in a text of n words with a vocabulary of V words, the i-th most frequent word appears n/(i^θ H_V(θ)) times
• H_V(θ) is the harmonic number of order θ of V

$$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}$$

• θ depends on the text, usually θ > 1 (1.5-2.0)
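A short sketch of the counts predicted by Zipf's Law, written directly from the two formulas above. The values of n, V, and θ are arbitrary illustrative choices, not measurements from any collection.

```python
def harmonic(V, theta):
    """Harmonic number of order theta of V: sum of 1/j^theta for j = 1..V."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))


def zipf_counts(n, V, theta):
    """Predicted number of occurrences of the i-th most frequent word, i = 1..V."""
    h = harmonic(V, theta)
    return [n / (i ** theta * h) for i in range(1, V + 1)]


counts = zipf_counts(n=100_000, V=5_000, theta=1.5)
print(round(counts[0]), round(counts[1]), round(counts[2]))
```

Dividing by H_V(θ) is what makes the predicted counts add up to the text length n.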
Distribution of Words
A simple model: assume each word appears the same number of times in every document
A better model: a negative binomial distribution
– the fraction of documents containing a word k times is

$$F(k) = \binom{\alpha + k - 1}{k} p^{k} (1 + p)^{-\alpha - k}$$

where p and α are parameters that depend on the word and the document collection
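The negative binomial fraction can be evaluated numerically as sketched below. Because α need not be an integer, the binomial coefficient is computed through the log-gamma function; the function name and the parameter values p = 0.5, α = 2 are illustrative assumptions.

```python
from math import exp, lgamma, log


def fraction_with_k(k, p, alpha):
    """F(k) = C(alpha + k - 1, k) * p^k * (1 + p)^(-alpha - k)."""
    # log of the generalized binomial coefficient C(alpha + k - 1, k)
    log_binom = lgamma(alpha + k) - lgamma(k + 1) - lgamma(alpha)
    return exp(log_binom + k * log(p) - (alpha + k) * log(1 + p))


fractions = [fraction_with_k(k, p=0.5, alpha=2.0) for k in range(200)]
print(fractions[:3])
```

Summed over all k, the fractions add up to 1, as expected of a distribution.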
Document Vocabulary
• Heaps’ Law is used to predict the growth of the vocabulary size in natural language text

$$V = K n^{\beta} = O(n^{\beta})$$

• V: vocabulary size of a text of n words
• K, β: free parameters that depend on the text; typically 10 ≤ K ≤ 100 and 0 < β < 1
• See Figure 6.2
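A small sketch of Heaps' Law next to an empirical vocabulary count. The parameter values passed to `heaps_vocabulary` are arbitrary examples within the stated ranges, and `vocabulary_growth` is a hypothetical helper that measures the actual growth on a word sequence.

```python
def heaps_vocabulary(n, K, beta):
    """Predicted vocabulary size V = K * n^beta for a text of n words."""
    return K * n ** beta


def vocabulary_growth(words):
    """Number of distinct words seen after each word of the text."""
    seen, sizes = set(), []
    for word in words:
        seen.add(word)
        sizes.append(len(seen))
    return sizes


print(heaps_vocabulary(10_000, K=10, beta=0.5))
print(vocabulary_growth("a b a c b d".split()))
```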
Average Length of Words
• Heaps’ law:
– the length of the words in the vocabulary increases logarithmically with the text size
• In practice:
– the average length of the words is constant
• Finite-state model
– the space character has probability close to 0.2
– the space character can’t appear twice in a row
– there are 26 letters
Similarity Models
• similarity is measured by a distance function, for example
– Hamming distance
– edit or Levenshtein distance
– longest common subsequence (LCS)
• a distance function should
– be symmetric: the order of the arguments does not matter
– satisfy the triangle inequality: distance(a,c) ≤ distance(a,b) + distance(b,c)
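The three measures can be sketched with standard dynamic programming, as below. The function names are illustrative, and unit costs are assumed for every edit operation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(hamming("karolin", "kathrin"), levenshtein("kitten", "sitting"))
```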
Similarity Models
• similarity can be extended to documents by
– considering lines as single symbols and computing the longest common subsequence of lines between two files (the diff command in Unix)
Problems:
– time consuming
– does not consider lines that are similar
• The second problem can be fixed by
– taking a weighted edit distance between lines
– computing the LCS over all the characters
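A minimal sketch of the line-based comparison: each line is treated as one symbol and the longest common subsequence of lines is computed. It is not the actual algorithm of the Unix diff command; the function name and the two sample files are made up for illustration.

```python
def lcs_lines(a_lines, b_lines):
    """Length of the longest common subsequence, with whole lines as symbols."""
    prev = [0] * (len(b_lines) + 1)
    for line_a in a_lines:
        cur = [0]
        for j, line_b in enumerate(b_lines, 1):
            if line_a == line_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


old = "alpha\nbeta\ngamma\n".splitlines()
new = "alpha\nbetta\ngamma\n".splitlines()
shared = lcs_lines(old, new)
print(shared)
```

The lines `beta` and `betta` differ by one character but count as entirely different symbols, which is the second problem listed above.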
Document Similarity
Other solutions include
• extracting fingerprints of the documents and comparing them, or finding large repeated pieces
• using visual tools to see document similarity: Dotplot draws a rectangular map where
– both coordinates are file lines
– the entry for each coordinate is a gray pixel that depends on the edit distance between the associated lines
Markup Languages
• Markup: extra textual syntax used to describe
– formatting actions
– structure information
– text semantics
– attributes, etc.
Example: the formatting commands of TeX
• formal markup languages are much more structured
• the marks are called tags (initial tag + text + ending tag)
• Sample markup languages: SGML, HTML, XML
SGML
• Standard Generalized Markup Language (ISO 8879): a metalanguage for tagging text
– developed by a group led by Goldfarb
– based on earlier work done at IBM
– provides the rules for defining a markup language based on tags
• an SGML document is defined by
– a description of the structure of the document
– the text marked with tags describing the structure
SGML
• each instance of SGML includes a description of the document structure called a document type definition (DTD)
• the document type definition is used to
– describe and name the pieces that a document is composed of
– define how those pieces relate to each other
• part of the definition can be specified by an SGML document type declaration
SGML
• SGML cannot formally express
– semantics of elements and attributes
– application conventions
These can only be described informally (as comments)
• SGML tags are denoted by angle brackets <>
– <tagname> text </tagname>
TEI
• One important use of SGML is in the Text Encoding Initiative (TEI), a cooperative project started in 1987
– to generate guidelines for the preparation and interchange of electronic texts for scholarly research and for industry
– one of the most used formats is TEI Lite
HTML
• HyperText Markup Language (HTML):
– is an instance of SGML
– created in 1992, the latest version is 4.0
– is being extended to address its limitations
– HTML tags follow SGML conventions
– HTML tags include format directives
– other media can be embedded in HTML documents
HTML
• HyperText Markup Language (HTML):
– supports backward and forward compatibility
• Cascading Style Sheets (CSS)
– offer a powerful and manageable way to create visual effects in HTML pages
HTML 4.0
• specified in three versions: strict, transitional, and frameset
• Strict: concerned only with non-presentational markup, leaving all the display information to CSS
• Transitional: uses all the presentation features for pages
• Frameset: used when frames are used
HTML Limitation
• HTML does not
– allow users to specify their own tags or attributes
– support the specification of nested structures
– support the kind of language specification that allows consuming applications to check data for structural validity on importation
XML
• eXtensible Markup Language (XML)
– is a simplified subset of SGML
– is not a markup language
– is a metalanguage capable of containing markup languages
– allows human-readable semantic markup (also machine-readable)
– makes it easier to develop and deploy new specific markup
XML
• eXtensible Markup Language (XML)
– enables automatic authoring, parsing, and processing of networked data
– does not have many of the restrictions imposed by HTML
– imposes a more rigid syntax on the markup
– distinguishes upper and lower case
– is easier to parse without knowledge of the tags (all attribute values must be between quotes)
XML
• eXtensible Markup Language (XML)
– allows users to define new tags and more complex structures
– has data validation capabilities
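The stricter XML syntax can be observed with Python's standard `xml.etree.ElementTree` parser, as in the sketch below. The `recipe` vocabulary is a made-up example of user-defined tags, and `is_well_formed` is a hypothetical helper. This parser checks only well-formedness; it does not validate a document against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(source):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# User-defined tags, parsed without any prior knowledge of the vocabulary.
doc = '<recipe lang="en"><title>Bread</title><step>Mix</step><step>Bake</step></recipe>'
root = ET.fromstring(doc)
steps = [step.text for step in root.findall("step")]

print(steps)
print(is_well_formed('<recipe lang=en></recipe>'))  # attribute value not quoted
print(is_well_formed('<Title>Bread</title>'))       # tag names differ in case
```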
Recent Uses of XML
• Mathematical Markup Language (MathML): two sets of tags
– for presentation of formulas
– for the meaning of mathematical expressions
Recent Uses of XML
• Synchronized Multimedia Integration Language (SMIL):
– a declarative language for scheduling multimedia presentations on the Web
– the position and activation time of different objects can be specified
• Resource Description Framework (RDF): used as metadata information for XML
Multimedia
Multimedia: applications that handle different types of digital data originating from distinct types of media
Most common types of media are
– text, sound, images, video (animated sequence of images)
The media types differ in
– volume, format, processing requirements
Image Formats
Several formats for images:
• direct representations of a bit-mapped display: XBM, BMP, PCX
– consume too much space
• compressed:
– Graphics Interchange Format (GIF)
– Joint Photographic Experts Group (JPEG)
• Tagged Image File Format (TIFF):
– exchanges documents between different applications and different computer platforms
– has fields for metadata and supports compression
Image Formats
Several formats for images:
• Truevision Targa image file (TGA):
– associated with video game boards
• Other formats:
– fax (bi-level image formats): JBIG
– fingerprints (highly accurate and compressed): WSQ
– satellite (large resolution and full-color images)
– Portable Network Graphics (PNG)
Audio Formats
Several formats for small pieces of digital audio:
– AU: created by Sun Microsystems, one of the most common formats on the Web
– MIDI: standard format to interchange music between electronic instruments and computers
– WAVE: the native sound format within the Windows environment, one of the most common on the Web
Formats for audio libraries:
– RealAudio or CD formats
Animation Formats
Formats for animations or moving images:
– Moving Picture Experts Group (MPEG): related to JPEG
– AVI: includes compression (Cinepak)
– FLI: originally developed by Autodesk, Inc.; plays back faster than MPEG for computer-generated animations at 640x480
– QuickTime: developed by Apple
Textual Images
Very important in office systems
– images of documents that contain mainly typed or typeset text
– obtained by scanning the documents
– usually for archiving purposes
• A large portion of a textual image is text
– can be used for retrieval purposes
– allows efficient compression
Textual Images
• further compression can be achieved by
– extracting the different text symbols or marks from the image
– building a library of symbols
– representing each one by a position in the library
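The three steps above can be sketched as follows, under a strong simplifying assumption: the marks have already been extracted from the image and identical marks compare equal. A real system would have to match marks approximately. Here each mark is stood in for by a short string, and `build_library` is a hypothetical helper.

```python
def build_library(marks):
    """Return (library, positions): distinct marks plus one index per mark."""
    library, positions, index = [], [], {}
    for mark in marks:
        if mark not in index:        # first time this symbol is seen
            index[mark] = len(library)
            library.append(mark)
        positions.append(index[mark])
    return library, positions


# Stand-ins for extracted glyph bitmaps, in reading order.
marks = ["t", "e", "x", "t", "e", "d"]
library, positions = build_library(marks)
rebuilt = [library[p] for p in positions]
print(library, positions)
```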
Retrieval of Textual Images
• associate a set of keywords with the image at creation time, or add them to the database
• use OCR to extract the text of the image
• use the symbols extracted from the images as basic units to combine image retrieval techniques with sequence retrieval techniques
Graphics and Virtual Reality
For graphics interchange
• Computer Graphics Metafile (CGM) standard (ISO 8632):
– defined for the open interchange of structured graphical objects and associated attributes
– specifies a two-dimensional data interchange standard
Graphics and Virtual Reality
• CGM (continued):
– allows graphical data to be stored and exchanged between graphics devices, applications, and computer systems (device-independent)
– can represent vector graphics and raster formats
– supports a collection of elements, called a metafile
– specifies which elements are allowed to occur in which positions in a metafile
Graphics and Virtual Reality
For three-dimensional graphics
• Virtual Reality Modeling Language (VRML, ISO/IEC 14772-1):
– a file format for describing interactive 3D objects and worlds
– a subset of the Silicon Graphics OpenInventor file format
– intended to be a universal interchange format for integrated 3D graphics and multimedia
HyTime
The Hypermedia/Time-based Structuring Language (HyTime) is a standard (ISO/IEC 10744)
– defined for multimedia document markup
– is an SGML architecture that specifies the generic hypermedia structure of documents
– allows DTDs to be written for individual document models
HyTime
The hypermedia concepts directly represented by HyTime include
– complex locating of document objects
– relationships (hyperlinks) between document objects
– numeric, measured associations between document objects
HyTime
The HyTime architecture has three parts:
• The base linking and addressing architecture: addresses the syntax and semantics of hyperlinks
• The scheduling architecture (derived from the base architecture): defines the abstract representation of complex hypermedia structures (including music and interactive presentations)
HyTime
• The rendition architecture (an application of the scheduling architecture): defines a general mechanism for creating new schedules from existing schedules (by applying special ‘rendition rules’ of different types)
Applications of HyTime
Standard Music Description Language (SMDL)
– an architecture for the representation of music information
– supports multimedia time sequencing information
Metafile for Interactive Documents (MID)
– a common interchange structure
– based on SGML and HyTime
– takes data from various authoring systems and structures it for display on different presentation systems (with minimal human intervention)
Trends and Research Issues
• The main trend is the convergence and integration of the different efforts (the Web is the main application)
• ODA (Open Document Architecture):
– designed to share documents electronically without losing control over the content, structure, and layout of those documents
– defines a logical structure, a layout, and the content
– an ODA file can be formatted, processable, or formatted processable
Trends and Research Issues
• Formatted files
– cannot be edited
– have information about content and layout
• Processable files
– can be edited
– have content and logical information
• Formatted processable files
– have everything
Trends and Research Issues
• Recent developments include:
– the document object model (DOM)
– integration between VRML and Dynamic HTML
– integration between the Standard for the Exchange of Product Model Data (STEP, ISO 10303) and SGML
– an effort to convert MARC to SGML by defining a DTD, as well as MARC to XML
Trends and Research Issues
• Recent developments include:
– CGM: developing a new encoding which can be parsed by XML
– Several new proposals such as
    • SDML (Signed Document Markup Language)
    • VML (Vector Markup Language)
    • PGML (Precision Graphics Markup Language)