
Hypertext (1)

• Historically, text is sequential: read from beginning to end

• Hypertext is non-sequential, with internal links from one part to another

• Hypertext, the word, coined by Ted Nelson in 1966.

• First hypertext system, Xanadu, named for Coleridge’s magical world.

Hypertext (2)

Links in hypertext give access to:

• topics or information directly related to the current idea

• notes, such as footnotes or endnotes

• explanations of special words or phrases

• biographical information about people behind the current idea

Claims about Hypertext

• Represents large body of information organized into numerous fragments

• Fragments relate to one another

• User needs only a small fraction of the fragments at any time

• Exists only in cooperation with the reader

• Is a legitimate literary concept

Claims about Hypertext (2)

• Integrates three technologies
  – Publishing (as a book publisher would)
  – Computing (as the infrastructure)
  – Broadcasting (over a computer network)

• Depends on computer environment for high-speed transitions between nodes

• Modelled by network ADT

Using Hypertext

• Browser, or hypertext engine: a computer-based system that allows links to be followed easily

• Navigation aids: parts of the user interface that provide a sense of location and direction

• Notation: a convenient way of specifying links as a hypertext author

WWW as a Hypertext System

• Browser: Netscape, for example

• Navigational aids:
  – Forward, back, home
  – History list
  – Colored anchors
  – Consistent titles

• Notation: HTML

Network ADT

• Model of hypertext

• Similar to tree ADT, but allows cycles

• Links have an explicit direction, capturing the idea of going forward and going back

Network ADT (2)

• Definition: A network is a collection of nodes and links between pairs of nodes such that
  – Each link has a direction.
  – Each node is reachable from any other node. However, the path is not necessarily unique.
  – No node is linked to itself.
  – There are no duplicate links in the same direction.
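
The definition above can be sketched in code. This is a minimal illustration, not the course's implementation: the class name `Network` and its methods are hypothetical, and reachability is checked by allowing travel along a link in either direction (an assumption taken from the remark that reverse travel is possible).

```python
from collections import deque


class Network:
    """Nodes joined by directed links; rejects self-links and duplicate links."""

    def __init__(self):
        self.links = {}  # node -> set of nodes it links to

    def add_node(self, node):
        self.links.setdefault(node, set())

    def add_link(self, src, dst):
        if src == dst:
            raise ValueError("no node may be linked to itself")
        self.add_node(src)
        self.add_node(dst)
        if dst in self.links[src]:
            raise ValueError("duplicate link in the same direction")
        self.links[src].add(dst)

    def is_connected(self):
        """True if every node can be reached from every other node,
        following links forward or backward (assumption, see lead-in)."""
        if not self.links:
            return True
        # Build a two-way neighbour map so links can be travelled in reverse.
        neighbours = {node: set(out) for node, out in self.links.items()}
        for src, outs in self.links.items():
            for dst in outs:
                neighbours[dst].add(src)
        start = next(iter(self.links))
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return len(seen) == len(self.links)
```

Note that the link A to B and the link B to A may both be added: they have different directions, so they are not duplicates, and together they form a cycle.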

Network ADT (3)

• Observations:
  – There is no hierarchy; all nodes are considered the same. (In a tree, the root is special.)
  – Links have direction, but reverse travel is possible. (One can go backwards on a link, or forwards on a link that goes in the opposite direction.)
  – Cycles are allowed.

Directed Graphs

• Both networks and rooted trees are examples of a connected directed graph, sometimes called a digraph.

• Formally, a digraph is a set of nodes and a set of links joining ordered pairs of nodes. The link (A,B) that joins A to B is different from the link (B,A) that joins B to A.

Navigation in Sequential Text

• Low level:
  – Punctuation
  – Fonts
  – Separation into sentences and paragraphs

• High level:
  – Chapters, sections, subsections
  – Table of contents
  – Index

Navigation in Sequential Text (2)

• Page layout
  – Page numbers
  – Running heads
  – Displayed text

Navigating in Hypertext

• Issues:
  – Where am I? Have I been here before? When?
  – How did I get here?
  – Where can I go?
    • Anchors (or links)
    • Implicit anchors (or links): clipboard, glossary, calculator
    • Computed links: next train
    • Back
    • Forward
    • Home
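
One common way to support Back, Forward, and Home is a pair of stacks. The sketch below is only an illustration under that assumption; the class `History` and its method names are hypothetical and do not describe any particular browser.

```python
class History:
    """Tracks the current node plus back and forward stacks."""

    def __init__(self, home):
        self.home = home
        self.current = home
        self.back_stack = []
        self.forward_stack = []
        self.visited = {home}  # answers "Have I been here before?"

    def follow(self, node):
        """Follow an anchor to a new node; this discards the forward path."""
        self.back_stack.append(self.current)
        self.forward_stack.clear()
        self.current = node
        self.visited.add(node)

    def back(self):
        if self.back_stack:
            self.forward_stack.append(self.current)
            self.current = self.back_stack.pop()
        return self.current

    def forward(self):
        if self.forward_stack:
            self.back_stack.append(self.current)
            self.current = self.forward_stack.pop()
        return self.current

    def go_home(self):
        if self.current != self.home:
            self.follow(self.home)
        return self.current
```

The back stack also answers "How did I get here?": read from bottom to top, it is the path that led to the current node.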

Navigating in Hypertext (2)

• Within a node:
  – Save to disk
  – Print
  – Annotate
  – Scroll
  – Zoom

Navigating in Hypertext (3)

• User interface support
  – Give power to the users through
    • short response time
    • low cognitive load
    • path clues, perhaps decaying over time
  – Follow a path forward or backward
  – Return to a node

Text Markup

• Unified view of text and hypertext presentation

• Foundation of all word processors

• Describes all electronic manuscripts by
  – separating logical elements
  – specifying processing functions for these elements

Text Markup (2)

• Originated by William Tunnicliffe (Sept. 1967), in talk advocating separating information content of document from format

• Control formatting with embedded codes

Generalized Markup

• Goal: allow editing, formatting, and retrieval systems to share documents

• Devised by Goldfarb, Mosher, Lorie at IBM, 1969

• Formally defined
  – document types
  – explicit nested element structure
  – generic identifier associated with each element

SGML

• Standard Generalized Markup Language

• First draft standard, 1980

• ISO 8879, 1986

• Based on the ADT tree

• Allows the description of a document, considered as a tree, to be embedded in the file containing the document

Functions of SGML

• Tags documents in a formal language

• Describes internal logical structures

• Links files with an addressing scheme

• Acts as a database language for text

• Accommodates multimedia and hypertext

• Provides a grammar for style sheets

• Allows coded text reuse in surprising ways

Functions of SGML (2)

• Represents documents independent of computing platform

• Provides a standard for transferring documents among platforms and applications

• Acts as a metalanguage for document types

• Represents hierarchies

• Extends to accommodate new document types

Generic Identifiers

• Tagging vs. formatting
  – Tagging shows document structure
  – Formatting describes document display
  – Example: A paragraph is a sequence of closely connected sentences and can be delimited by a tag. A paragraph can be displayed with either
    • initial indenting or not
    • extra separation or not

Generic Identifiers (2)

• Syntax
  – Beginning: < identifier >
  – End: </ identifier >

• Attribute list, with assigned values, may follow identifier

Generic Identifiers (3)

• Typical identifiers:
  – p: paragraph
  – q: quotation
  – ol: numbered (ordered) list
  – ul: unnumbered list
  – li: list item
  – b: bold face
  – i: italics
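
To see how begin and end tags describe a tree, the sketch below walks a tagged fragment and records each identifier at its nesting depth. It uses Python's standard `html.parser` module as a stand-in for an SGML parser (an assumption made for convenience); the class `OutlineBuilder` and the function `outline` are hypothetical names.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Records each start tag, indented by its depth in the element tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def outline(markup):
    """Return the nesting outline of a fragment whose tags are all closed."""
    builder = OutlineBuilder()
    builder.feed(markup)
    builder.close()
    return builder.outline


for line in outline("<ol><li><p>one</p></li><li>two</li></ol>"):
    print(line)
```

The printed outline shows `ol` at the top level, each `li` one level down, and `p` nested inside the first list item. The sketch assumes every begin tag has a matching end tag.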

Display of Text

• ASCII codes for printing characters carry no information about display

• Printed or displayed characters are described by their font.

Fonts

• Fonts come in families, which are groups of fonts with similar design characteristics.

• A font is a set of displayed characters in a particular design. To describe a font, we specify:
  – The font face, or type face, which is the design of the font.
  – The size, measured in points, which is the height of representative characters.
  – The appearance: bold, italic, underline, outline, shadow, small cap, redline, strikeout, etc.

Fonts (2)

• Font families include standard modifications of a base font, such as italics and bold, to change the appearance. (This family is Times New Roman.)

• Some families are sans serif, without the cross strokes accentuating the ends of the main strokes.

Fonts (3)

• Typical examples of fonts are
  – Times New Roman
  – Arial
  – Century Schoolbook
  – Lucida Calligraphy
  – Verdana

Fonts (4)

• The size of this font is 32 points

• This is 54 points

• This is 24 points

• There are about 72.27 traditional printer's points per inch; the desktop-publishing (PostScript) point is exactly 1/72 inch

Fonts (5)

To render a character in a font, one must know

• The computer code (ASCII) of the character

• The font name and properties

Then the computer creates the glyph that represents the character in the specified font.

Fonts (6)

In the process, the computer uses the following to form and locate the glyph:

• Baseline: the invisible line on which characters are aligned.

• x-height: the actual height of the character x

• Kerning: spacing between two letters. Note that in printing “wo” the “o” slides under the “w”

Input devices for text

• Keyboard

• Scanning with optical character recognition
  – Hand printed
  – Hand written (cursive)
  – Machine printed

• Voice recognition

• Pen-based

Input errors

• Human-based, e.g.
  – Typographic
  – Poor writing

• Machine dependent
  – Small typeface differences: O vs. D

• Limits of technology

• Pre-existing errors

Automatic error correction

• Matching the error rate of keyboard input requires about 98% OCR accuracy plus automatic correction

• Automatic correction is also helpful in:
  – Computer-aided authoring
  – Communication enhancement for disabled
  – Natural language responses
  – Database interaction

• Example: MS Word AutoCorrect

Automatic spelling correction

• Three increasingly difficult tasks:
  – Non-word detection: string in text not in dictionary
  – Isolated word correction: thier automatically becomes their
  – Context-dependent correction: here automatically becomes hear

MS Word AutoCorrect

General spelling correction

• Can allow human intervention, e.g. choose the correct spelling from a list of candidates

• No context-dependent, general-purpose correction tool exists yet.

Issues for spelling correction

• Type of input device
  – Focus on adjacent keys: b vs. n
  – Focus on similar shapes: O vs. D

• Interactive vs. automatic correction
  – How many choices are reasonable? (One for automatic correction.)
  – How accurate should guesses be?

• Proper choice of dictionary

Proper Dictionary

Word list choice

• Use lexicon--a word list appropriate to a particular topic

• As opposed to dictionary -- a comprehensive list of words

• Include provision for adding new words

Word list choice: Example 1

• Compare NY Times news wire text with Webster’s 7th Collegiate Dictionary

• 8 million words in news wire text:
  – only 36% in dictionary
  – only 39% of dictionary words used in text

Example 1 (continued)

• Of text words not in dictionary
  – 1/4 inflected forms (change in case, gender, tense)
  – 1/4 proper names
  – 1/6 hyphenated forms
  – 1/12 misspellings
  – 1/4 unresolved by investigators (new words, etc.)

• How to handle proper names?

Example 2

• Corpus of 22 million words from a variety of genres

• Effect of changing lexicon from 50,000 to 60,000 words?
  – Eliminated 1348 false rejections (words are now included in lexicon)
  – Created 23 false acceptances (originally misspelled, now occur in lexicon and are therefore treated as correctly spelled)

Unintentionally correct spellings

• Misuse of word: there for their, to for too

• Typo: from for form

• Quote from Mozart: I’ll see you in five minuets

Issues in detection

• Given document as a sequence of words, lexicon as ordered list of words, report all document words not in lexicon, but:

• How to handle upper case letters?

• How to handle suffixes and prefixes?

• What definition of word to use?

Issues in detection (2)

• Upper case: Change all to lower case
  – Handles first word of sentence and proper names that are words: Bob Brown
  – Confuses: DEC (ok), Dec (abbreviation), dec (misspelling)
  – Must put back capitalization

Types of errors

• From keyboard input, 80% of misspellings involve a single
  – Insertion
  – Deletion
  – Substitution, especially nearby keys
  – Transposition

• Few errors occur in first letter

• Mostly, length is same or changes by 1

Suggestion Strategies

• Words with same first letter first

• Order rest by change in length
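
The two ordering rules above can be expressed as a sort key. This is a minimal sketch: the function name `rank_candidates` is hypothetical, and breaking remaining ties alphabetically is an added assumption.

```python
def rank_candidates(misspelled, candidates):
    """Order candidates: same first letter first, then by change in length."""
    def key(word):
        different_first_letter = word[:1] != misspelled[:1]
        length_change = abs(len(word) - len(misspelled))
        # False sorts before True, so matching first letters come first.
        return (different_first_letter, length_change, word)
    return sorted(candidates, key=key)


print(rank_candidates("recieve", ["believe", "receive", "relieved", "deceive"]))
```

Here the two words starting with r are listed first (receive before relieved, because its length matches), followed by the remaining candidates.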

Types of errors (2)

• Improper spacing: run-ons or splits
  – Significant unsolved problem

• Cognitive
  – recieve for receive; procede for proceed
  – conspiricy for conspiracy; mispell for misspell

• Phonetic
  – abiss for abyss; nacherly for naturally

Spelling Rules

• I before E except after C

• Exceed, succeed, and proceed end in -ceed. All others end in -cede, except supersede

Suggestion Strategies (2)

• Words with same first letter first

• Order rest by change in length

• Use standard spelling rules

Suggestion Principles

• Edit distance: The minimum number of insertions, deletions, or substitutions needed to change one string to another, defined by Levenshtein in 1966

• Provide suggestions in increasing order of edit distance
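
Edit distance as defined above can be computed with the standard dynamic-programming table, kept one row at a time. The function names `edit_distance` and `suggestions` are illustrative, and scanning the whole lexicon for every query is a simplification suited only to small word lists.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def suggestions(word, lexicon, limit=5):
    """Lexicon words in increasing order of edit distance from word."""
    return sorted(lexicon, key=lambda w: (edit_distance(word, w), w))[:limit]


print(edit_distance("thier", "their"))  # 2
print(suggestions("mispell", ["misspell", "dispel", "spell"], limit=1))
```

Because transposition is not one of the three operations in this definition, swapping two adjacent letters (thier for their) costs 2 rather than 1.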

Detection Algorithms

• For each word in text, search for word in dictionary. If not found, report spelling error.

• Issues:
  – Efficiency when text or dictionary is large
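
The lookup approach can be sketched as follows. Storing the dictionary in a hash-based set keeps each lookup fast even for a large word list. The function name `non_words` is hypothetical, and the sketch assumes a word is a run of letters a to z after lower-casing, which is only one of the possible definitions of a word.

```python
import re


def non_words(text, dictionary):
    """Return, in order, the words of text that are not in the dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in dictionary]


dictionary = {"their", "dog", "saw", "the", "cat"}
print(non_words("Thier dog saw the cat.", dictionary))  # ['thier']
```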

Detection Algorithms (2)

• n-gram analysis

• Issues:
  – Requires preprocessing of dictionary
  – Extremely fast if misspelling creates unusual n-gram

n-gram Fundamentals

• Definition: an n-gram is a substring of length n of a given word.

• Examples:
  – The word weasel contains 5 digrams (2-grams), namely we, ea, as, se, el.
  – The word monkey contains 4 trigrams (3-grams), namely mon, onk, nke, key.
  – The word turkey contains 6 monograms (1-grams), namely t, u, r, k, e, y.
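
The n-grams of a word can be listed with one slice per starting position. The function name `ngrams` is illustrative.

```python
def ngrams(word, n):
    """All substrings of length n, in order of position within the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(ngrams("weasel", 2))  # ['we', 'ea', 'as', 'se', 'el']
print(ngrams("monkey", 3))  # ['mon', 'onk', 'nke', 'key']
```

A word of length L has L - n + 1 n-grams, so a word shorter than n has none.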

n-gram Strategy

• Preprocess the dictionary to create a list of all the n-grams contained in words in the dictionary.
  – Eliminate duplicates from the list
  – Perhaps record the position within the word of the n-gram.

• Detect a spelling error by discovering an n-gram in the target word that is not in the n-gram list.
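
A minimal sketch of this strategy, without positions, is shown below. A Python set removes duplicates automatically. The function names and the three-word dictionary are made up for illustration.

```python
def build_ngram_set(dictionary, n=2):
    """Preprocess: collect every n-gram that occurs in some dictionary word."""
    seen = set()
    for word in dictionary:
        for i in range(len(word) - n + 1):
            seen.add(word[i:i + n])
    return seen


def has_unknown_ngram(word, ngram_set, n=2):
    """True if some n-gram of word never occurs in the dictionary."""
    return any(word[i:i + n] not in ngram_set
               for i in range(len(word) - n + 1))


digrams = build_ngram_set(["weasel", "monkey", "turkey"])
print(has_unknown_ngram("monkwy", digrams))  # True: "kw" is not in the list
print(has_unknown_ngram("turkel", digrams))  # False, yet "turkel" is not a word
```

The second call shows the limit of the method: an unknown n-gram proves an error, but passing the check does not prove the word is in the dictionary.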

Arrays

• Definition: A data structure is a particular way of storing data in a computer.

• Definition: An array is an indexed set of values. Informally, an array can be viewed as a table.

• Example (of a data structure): An array is a data structure.

Arrays (2)

• Array index:
  – Usually positive integers to some maximum size, e.g. 1 to 500.
  – Can also be another ordered set, e.g. the alphabet, the characters in ASCII order

• Values: Whatever one wants to store: numbers, letters, strings, other arrays.

Array Examples

• Table of hex and binary numbers corresponding to base 10 numbers. The index set is the base 10 numbers; the array values (table entries) are the corresponding hex and binary numbers.

• List of words for searching. The index is the position in the list, the array values are the words viewed as strings.

Array Examples (2)

• Shift table for Boyer-Moore searching. The index is the set of characters. The array value is the number representing the shift amount for that index character.

• List of ASCII codes. The index is the ASCII code, 00 to FF in hex numbers. The array value is the character represented by the index hex number.
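
Two of these examples can be written as Python lists, which are indexed from 0. The names `conversion_table` and `code_table` are illustrative. Codes above 7F hex lie outside 7-bit ASCII, so the characters shown for them here follow Python's `chr` (Unicode code points), which is an assumption about the intended character set.

```python
def conversion_table(limit):
    """Index: base 10 number. Value: (hex string, binary string)."""
    return [(format(n, "X"), format(n, "b")) for n in range(limit)]


# Index: character code 00 to FF (hex). Value: the character for that code.
code_table = [chr(code) for code in range(0x100)]

print(conversion_table(16)[10])  # ('A', '1010')
print(code_table[0x41])          # A
```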

Digram Arrays

• A digram array is an array indexed by the letters a through z. Each value is, in turn, an array indexed by the letters a through z.

• A digram array can be viewed as a table whose rows and columns are indexed by the 26 lower case letters.

• Typically, we use binary digits as the values in a digram array, creating a binary digram array, or BDA.

Digram Arrays (2)

• Assume that a dictionary is given.

• Preprocess the dictionary by setting the value in a digram array for each digram that appears in each word in the dictionary.

• Notes:
  – The digram array depends on the dictionary
  – Typically 42% of entries are 0
  – Trigram arrays may be constructed in the same way.

Nonpositional BDA

• Each value, or cell, in a BDA is associated with the digram represented by the row and column index of the cell.

• Example: The digram ck is associated with the value in the cell in row c, column k.

• The value in a nonpositional BDA associated to a digram is 1 if that digram appears in some word in the dictionary and is 0 otherwise.

Nonpositional BDA (2)

• Example: The value associated with the digram ck is 1 if some word containing ck appears in the dictionary (e.g. cuckoo). The value is 0 if no word in the dictionary contains ck.

• Example: If the word whose spelling is being checked contains the digram mv and the value associated with this digram is 0, then the word does not appear in the dictionary.

Nonpositional BDA (3)

• Example: If the word whose spelling is being checked contains the digram gh and the value associated with this digram in the array is 1, then one cannot say whether the word is spelled correctly, based just on this information.
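
A nonpositional BDA can be sketched as a 26 by 26 list of lists. The sketch assumes every word consists only of the lower case letters a to z; the function names and the two-word dictionary are made up for illustration.

```python
import string

LETTERS = string.ascii_lowercase  # row and column index set: a..z


def build_bda(dictionary):
    """Set the cell (row, column) to 1 for each digram in each dictionary word."""
    bda = [[0] * 26 for _ in range(26)]
    for word in dictionary:
        for first, second in zip(word, word[1:]):
            bda[LETTERS.index(first)][LETTERS.index(second)] = 1
    return bda


def passes_bda(word, bda):
    """False means some digram has value 0, so the word is not in the dictionary."""
    return all(bda[LETTERS.index(a)][LETTERS.index(b)] == 1
               for a, b in zip(word, word[1:]))


bda = build_bda(["cuckoo", "night"])
print(bda[LETTERS.index("c")][LETTERS.index("k")])  # 1, from "cuckoo"
print(passes_bda("comvoy", bda))  # False
print(passes_bda("nigh", bda))    # True, although "nigh" is not in this dictionary
```

The last line illustrates the gh example: when every digram has value 1, nothing can be concluded about the spelling from the array alone.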

Example: Moby Dick

• Class examined Chapters 31-93

• Summary file contains
  – 284,591 characters
  – 63,851 words
  – 63,853 sentences
  – 63,585 lines
  – 63,583 paragraphs
  – 1413 pages

Example: Moby Dick (2)

• After processing (removing numbers, upper case letters, and punctuation), file contains
  – 70039 characters
  – 9578 words
  – 9577 sentences
  – 9577 lines
  – 9577 paragraphs
  – 213 pages

Example: Moby Dick (3)

• Checking digrams, we find

Positional BDA

• Assume that the longest word in the dictionary has length M.

• Denote the position of a digram by k. Then k has value 1, 2, ... , M-1.

• For each digram, create an array of length M-1, where the value at index k is 1 if the digram appears in a word in the dictionary in position k. The value is 0 if no word in the dictionary has this digram at position k.

Positional BDA (2)

• Example: In the positional BDA for the digram at, the value indexed by k=3 is equal to 1 if some word in the dictionary has the form ??at*

• Example: In the positional BDA for the digram sp, the value indexed by k=7 is equal to 0 if no word in the dictionary is of the form ??????sp*
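
A positional BDA can be sketched as a mapping from each digram to a list of M-1 bits, where list index k-1 holds the value for position k. Using a Python dict keyed by digram instead of a 26 by 26 table is a simplification, and the function names and sample words are hypothetical. The dictionary is assumed to be non-empty.

```python
def build_positional_bda(dictionary):
    """Map each digram to M-1 bits; bit k is 1 if the digram occurs at position k."""
    longest = max(len(word) for word in dictionary)  # M
    table = {}
    for word in dictionary:
        for k in range(1, len(word)):        # positions 1 .. len(word)-1
            digram = word[k - 1:k + 1]
            table.setdefault(digram, [0] * (longest - 1))[k - 1] = 1
    return table


def passes_positional(word, table):
    """False if some digram of word never occurs at that position."""
    for k in range(1, len(word)):
        bits = table.get(word[k - 1:k + 1])
        if bits is None or k > len(bits) or bits[k - 1] == 0:
            return False
    return True


table = build_positional_bda(["cheats", "spat"])
print(table["at"])                        # [0, 0, 1, 1, 0]
print(passes_positional("atlas", table))  # False: no word has "at" at k=1
```

In this small dictionary, the value for at is 1 at k=3 because spat has the form ??at*, and 1 at k=4 because of cheats.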

Effectiveness

• Typically about 42% of entries in a non-positional BDA are 0

• Randomly changing one letter in a word will produce a digram with value 0 in NP BDA about 70% of time

• In a study of handprinted 6-letter words, positional trigram analysis detected 7561 of the 7662 words with a single substitution error (about 98.7%)

Encryption

• Goal: provide privacy and security for text transmitted by computer network.
  – Confidentiality of contents
  – Authenticity of sender and receiver
  – Integrity of contents

• Interested parties
  – Military and diplomatic officers
  – Mathematicians and computer scientists
  – E-commerce providers

Encryption History

• Early work
  – Cryptography book by George Fisher published by Benjamin Franklin

• Present day
  – Text transmitted by computer network
  – Techniques regulated by federal government

Encryption on Networks

• Situation: no transmission on any computer network can be considered absolutely private
  – Network tap is not physically difficult
  – Legitimate use for monitoring traffic to detect problems and potential bottlenecks

Intruders

• Passive: listens, gathers information

• Active: captures and (perhaps) replaces
  – Changes amount in a financial transaction
  – Uses a stolen credit card number

Encryption Model

Encryption Techniques

• Character-based
  – Shift (Caesar cipher)
  – Monoalphabetic substitution (cryptograms)
  – Polyalphabetic cipher

• Numeric
  – Each character is represented by 8 bits
  – Four characters form a 32 bit number
  – Encode these numbers
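
The first two steps of the numeric approach (8 bits per character, four characters per 32 bit number) can be sketched as below. The function names are hypothetical, and padding the last block with zero bytes and using big-endian byte order are assumptions; the slides do not say how a short final block is handled. The encoding of the resulting numbers is not shown.

```python
def pack_blocks(text):
    """Group ASCII characters four at a time into 32 bit integers."""
    data = text.encode("ascii")
    data += b"\x00" * (-len(data) % 4)  # pad to a multiple of 4 bytes (assumption)
    return [int.from_bytes(data[i:i + 4], "big")
            for i in range(0, len(data), 4)]


def unpack_blocks(blocks):
    """Inverse of pack_blocks; strips the zero padding."""
    data = b"".join(block.to_bytes(4, "big") for block in blocks)
    return data.rstrip(b"\x00").decode("ascii")


print(hex(pack_blocks("CATS")[0]))  # 0x43415453
```

The hex digits show the four character codes side by side: 43 (C), 41 (A), 54 (T), 53 (S).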

Shift Encryption

• Encryption: Each letter is encoded with the letter k positions from it in the alphabet

• Key: The integer k, in the range –25..25

Shift Encryption (2)

• Example 1: Shift

Replace each letter by the one three positions forward in the alphabet, k=+3

WILDCATS ---> ZLOGFDWV

• Example 2: Shift, k = +5

CATS ---> HFYX

Decrypt using k = –5

Shift Encryption (3)

Notes on shift encryption

• Only 26 different strategies are possible, and one of those is the null strategy (no encrypting is done).

• If encryption uses the key k, then decryption uses the key –k (or the key 26 – k)
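The shift cipher is short enough to sketch in a few lines. The following is a minimal illustration, not part of the original slides; the function names `shift_encrypt` and `shift_decrypt` are chosen here for convenience, and the sketch assumes uppercase English letters, leaving every other character unchanged.

```python
import string

ALPHABET = string.ascii_uppercase  # the 26 letters A..Z


def shift_encrypt(text, k):
    """Replace each letter by the one k positions away, wrapping around."""
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + k) % 26])
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(out)


def shift_decrypt(text, k):
    """Decrypting with key k is the same as encrypting with key -k."""
    return shift_encrypt(text, -k)


print(shift_encrypt("WILDCATS", 3))   # Example 1 from the slides
print(shift_encrypt("CATS", 5))       # Example 2 from the slides
print(shift_decrypt("HFYX", 5))       # decrypt with -k
print(shift_encrypt("HFYX", 26 - 5))  # decrypt with 26 - k
```

The last two calls give the same result, which is the point of the second note above: shifting by –k and shifting by 26 – k are the same operation modulo 26.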


Monoalphabetic Substitution

• Encrypt by using a random permutation of the alphabet.

• Key is the permutation, 26! choices are available.

• Decryption by checking all permutations is infeasible in practice.

• However, this is the Daily Cryptogram in the newspaper.
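To get a feel for the size of the key space, the count 26! can be computed directly. This is a rough sketch; the testing rate of one billion permutations per second is an assumption made purely for illustration.

```python
import math

# Number of possible keys: every permutation of the 26-letter alphabet
keys = math.factorial(26)
print(keys)

# Assumed (hypothetical) rate: one billion permutations tested per second
rate = 10**9
seconds_per_year = 60 * 60 * 24 * 365

# Whole years needed to try every permutation at that rate
years = keys // (rate * seconds_per_year)
print(years)
```

Even at that assumed rate, an exhaustive search would take more than ten billion years, which is why cryptogram solvers rely on letter patterns instead.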


Monoalphabetic Substitution (2)

• Example:

XE XU BRUXBF EM EROJ EARI EM

IT IS EASIER TO TALK THAN TO

AMOP MIB’U EMIWCB

HOLD ONE’S TONGUE


Monoalphabetic Substitution (3)

• Notes on monoalphabetic substitution

– Decryption strategy uses letter patterns, e.g. common digrams and trigrams

– Heuristics, as opposed to an algorithm
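A first step in the pattern-based strategy is simply counting letters and digrams in the ciphertext. The sketch below does this for the first line of the cryptogram example; the variable names are illustrative, and digrams are counted within words only, which is an assumption of this sketch.

```python
from collections import Counter

ciphertext = "XE XU BRUXBF EM EROJ EARI EM"

# Frequency of each cipher letter, ignoring spaces
letters = Counter(ch for ch in ciphertext if ch.isalpha())

# Frequency of each pair of adjacent letters inside a word
digrams = Counter(
    word[i:i + 2]
    for word in ciphertext.split()
    for i in range(len(word) - 1)
)

print(letters.most_common(3))
print(digrams.most_common(3))
```

Here the most frequent cipher letter is E and the most frequent digram is EM, which stand for T and TO in the plaintext. A solver would guess such assignments from typical English frequencies and refine them by trial and error, which is why the method is a heuristic rather than an algorithm.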


Polyalphabetic Substitution

• Caesar cipher has too few keys

• Monoalphabetic substitution has enough keys, but word patterns (digrams and trigrams) allow easy code breaking

• Develop strategy with

– large number of keys

– disrupted word patterns


Polyalphabetic Substitution (2)

• Start with a 26 x 26 array of letters, shifted by one letter in each row

• Choose a string as a key

• Example: key = springforward

spr ingforwardsp ringf or ward
The confidential terms of your

springforw ardsprin gfo rw ardspri
employment contract are as follows


Polyalphabetic Substitution (3)

• The ith character in the text, denoted by c, is replaced by m(d,c), where d is the corresponding character in the key, and the replacement is the character m, appearing in the dth row and cth column of the array.

• Example:

d = s, c = t, m(d,c) = l

d = p, c = h, m(d,c) = w

d = r, c = e, m(d,c) = v


Polyalphabetic Substitution (4)

• The encoded message starts

lwvkb tkwua nklsa kmesx cwuol uwbgt …

where the letters have been written in groups of five.
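The table lookup m(d,c) can be sketched as follows. This is an illustrative sketch rather than the slides' own code: the names `TABLE`, `m`, `encrypt`, and `groups_of_five` are chosen here, and it assumes that row a of the array is the unshifted alphabet, which is consistent with the examples above.

```python
import string

ALPHABET = string.ascii_lowercase

# 26 x 26 array: row r is the alphabet shifted left by r letters
# (assumption: row 'a' is the unshifted alphabet)
TABLE = [ALPHABET[r:] + ALPHABET[:r] for r in range(26)]


def m(d, c):
    """Character in row d, column c of the array."""
    return TABLE[ALPHABET.index(d)][ALPHABET.index(c)]


def encrypt(plaintext, key):
    """Encrypt the letters of plaintext, repeating the key as needed."""
    letters = [ch for ch in plaintext.lower() if ch in ALPHABET]
    return "".join(m(key[i % len(key)], c) for i, c in enumerate(letters))


def groups_of_five(s):
    """Write a string in groups of five letters."""
    return " ".join(s[i:i + 5] for i in range(0, len(s), 5))


message = "The confidential terms of your employment contract are as follows"
print(groups_of_five(encrypt(message, "springforward")))
```

Spaces are dropped before encryption, so word lengths no longer show in the output; this is one of the ways the method disrupts word patterns.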


Polyalphabetic Substitution (5)

• To decode a message, knowing the key, match the key with the message.

• Example: key = declaration

decla ratio ndecl arati ondec

zlgyi etamq bxvup owhnu cahzg


Polyalphabetic Substitution (6)

• The ith character p of the plaintext message is the character such that m(d,p) = e, where d is the character of the key corresponding to the ith character e of the encrypted message.

• Operationally,

– Go to the dth row of the array

– Find e in this row by scanning across

– Record p, the column index of e


Polyalphabetic Substitution (7)

• Example: d = d, e = z, appears in column w, p = w

d = e, e = l, appears in column h, p = h

d = c, e = g, appears in column e, p = e
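The row-scan procedure can be sketched in the same way. Again this is only an illustration with names chosen here (`TABLE`, `decrypt`), assuming row a of the array is the unshifted alphabet. It is applied to the first four groups of the example message.

```python
import string

ALPHABET = string.ascii_lowercase

# 26 x 26 array: row r is the alphabet shifted left by r letters
# (assumption: row 'a' is the unshifted alphabet)
TABLE = [ALPHABET[r:] + ALPHABET[:r] for r in range(26)]


def decrypt(ciphertext, key):
    """Recover the plaintext letters, repeating the key as needed."""
    letters = [ch for ch in ciphertext.lower() if ch in ALPHABET]
    out = []
    for i, e in enumerate(letters):
        d = key[i % len(key)]
        row = TABLE[ALPHABET.index(d)]      # go to the dth row
        column = row.index(e)               # scan across to find e
        out.append(ALPHABET[column])        # record p, the column label
    return "".join(out)


print(decrypt("zlgyi etamq bxvup owhnu", "declaration"))
```

The first three letters recovered are w, h, and e, matching the three lookups worked by hand above.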


Text Compression

• Text is represented as a long string of binary digits, 8 digits per character.

• A 2000-word essay has about

– 10,000 characters

– 2000 spaces

– 96,000 bits

• Question: Can we represent this essay in substantially fewer bits?


Text Compression (2)

• Answer: Most likely, since we really only need 7 bits per character for the 94 printing characters plus white space characters.
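The arithmetic behind the question and answer can be checked quickly. This sketch uses the essay figures from the previous slide; the variable names are illustrative.

```python
# Essay size from the slides: 10,000 characters plus 2000 spaces
symbols = 10_000 + 2_000

bits_at_8 = symbols * 8   # one byte per character
bits_at_7 = symbols * 7   # 7 bits are enough for 128 distinct codes

saving = 1 - bits_at_7 / bits_at_8

print(bits_at_8)
print(bits_at_7)
print(saving)
```

Dropping the eighth bit saves one bit in eight, or 12.5%. The techniques that follow aim for larger savings.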


Techniques

• Represent fixed text with a short symbol string, e.g.

– Stock exchange symbols for company names

– ISBN numbers for book title and author

• Shorter symbol strings for more frequently occurring text strings

– Use one bit for the most frequent character, etc.


Techniques (2)

• Context dependent strings

– Represent common combinations with their own codes

– Represent constant bit strings


Huffman Coding

• Frequency dependent coding

• Uses frequency distribution of characters in text

– Most commonly occurring letter is E, 13.05%

– Next most is T, 9.02%

– Rarest is Z, 0.09%


Huffman Coding (2)

• Creating a Huffman code for a set of characters

– List the characters and their relative frequencies

– Sort the list in order of least frequent to most frequent

– Build a coding tree, which is a binary tree, as described below


Binary Tree

• A binary tree is a tree in which

– Each interior node has degree 2

– The child nodes are ordered


Huffman Coding (3)

• To build a Huffman tree

– List the characters in order of frequency from least to most

– Make the two least frequent characters leaf nodes and join them to a new node.

– Label the new node with the sum of the frequencies of the two child nodes

– Label the link to the least frequent with 0 and the other link with 1


Huffman Coding (4)

– Join the newly created node with the next least frequent character.

– Again add the frequencies, label the new node, and label the link to the least frequent node with 0, the other link with 1. Caution: compare the character frequency with the new node frequency

– Continue until all characters have been joined.

– The last node (the root of the tree) will be labeled with frequency 1.00. (Why?)


Huffman Coding (5)

To compress text with a Huffman code:

• Follow the tree from the root to the leaf labeled by a character to find the code of the character, the code being the sequence of link labels on the (unique) path to the character


Huffman Coding (6)

Example: Assume only 4 characters (so that the tree doesn’t get too large) with relative frequencies:

A = .40

B = .20

C = .15

D = .25

Total = 1.00


Huffman Coding (7)

Sort the characters by frequency, smallest first

C = .15

B = .20

D = .25

A = .40

Join B and C to get a node labeled .35 = .15 + .20, with the link from C to .35 labeled 0 and the link from B to .35 labeled 1.


Huffman Coding (8)

Join the next least frequent character node (D) to the new node (.35) and create a node labeled .60 = .35 + .25

Label the link from D to .60 with 0 and the link from .35 to .60 with 1


Huffman Coding (9)

Join the next least frequent character node (A) to the new node (.60) and create a node labeled 1.00 = .60 + .40

Label the link from A to 1.00 with 0 and the link from .60 to 1.00 with 1


Huffman Coding (10)

Follow the tree from the root to the leaves to find the codes:

A = 0

B = 111

C = 110

D = 10

Without compression, BAD takes 24 bits

With compression BAD = 111010, 6 bits
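The whole construction can be sketched in code. This is a minimal illustration, not the slides' own procedure written out: it uses the usual formulation of Huffman's method, which repeatedly joins the two least frequent nodes (characters or previously created nodes), and it keeps the slides' labeling rule of 0 for the less frequent link and 1 for the other. The names `huffman_codes`, `encode`, and `decode` are chosen here, and frequencies are given as whole percentages to avoid rounding.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a Huffman tree and return a dict mapping symbol -> bit string."""
    tiebreak = count()  # unique counter so trees are never compared directly
    # Each heap entry is (frequency, tiebreak, tree); a tree is either a
    # symbol or a pair (child on link 0, child on link 1).
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, t0 = heapq.heappop(heap)  # least frequent: link labeled 0
        f1, _, t1 = heapq.heappop(heap)  # next least: link labeled 1
        heapq.heappush(heap, (f0 + f1, next(tiebreak), (t0, t1)))
    _, _, root = heap[0]

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # a lone symbol still needs one bit

    walk(root, "")
    return codes


def encode(text, codes):
    """Concatenate the code of each character."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits left to right, emitting a symbol whenever a code matches."""
    by_code = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in by_code:
            out.append(by_code[current])
            current = ""
    return "".join(out)


codes = huffman_codes({"A": 40, "B": 20, "C": 15, "D": 25})
print(codes)
print(encode("BAD", codes))
print(decode(encode("BAD", codes), codes))
```

Decoding works without separators because no code is the beginning of another code: every character sits at a leaf of the tree.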