Digital Text and Data Processing Week 2


Digital Text and Data Processing. Week 2. Text Mining Research. This class: the focus is mostly on computational analysis of literary texts. Different names: ‘Text analysis’, Digital Literary Studies, Literary informatics (Martin Mueller), Algorithmic Criticism (Stephen Ramsay).


Page 1: Digital Text and  Data Processing

Digital Text and

Data Processing

Week 2

Page 2: Digital Text and  Data Processing

“The book is a machine to think with”
I.A. Richards, Principles of Literary Criticism

“The technologising of the word”
Walter Ong, Orality and Literacy

Page 3: Digital Text and  Data Processing

□ Discussion of the reading

□ Regular expressions

□ Tokenisation

□ Frequency lists

□ Individual research projects

Today’s class

Page 4: Digital Text and  Data Processing

□ Text analysis

□ Digital Literary Studies

□ Algorithmic Criticism (Stephen Ramsay)

□ Literary informatics (Martin Mueller)

Terminology

Page 5: Digital Text and  Data Processing

Becket, Andrew, A concordance to Shakespeare suited to all the editions, in which the distinguished and parallel passages in the plays of that justly admired writer are methodically arranged. 1787

Page 6: Digital Text and  Data Processing

□ Segmentation or tokenisation

□ Often based on the fact that there are spaces in between words (at least since scriptura continua was abandoned in late 9th C.)

□ “soft mark up”

Studies based on vocabulary

Source: Christopher Kelty, Abracadabra: Language, Memory, Representation

Page 7: Digital Text and  Data Processing

□ Token counts reflect the total number of words; Types are the unique words in a text

□ ‘Bag of words’ model: original word order is ignored

Frequency lists

“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity”

Tokens: 36
Types: 13

the 6
it 6
of 6
was 6
epoch 2
age 2
times 2
foolishness 1
wisdom 1
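The counts above can be reproduced with a few lines of Perl. This is only a minimal sketch: it assumes that a token is any unbroken run of letters and that upper and lower case are treated as the same word. The variable names are made up for the illustration, and the /g modifier (which returns all matches at once) is not covered in these slides.

```perl
use strict ;
use warnings ;

my $text = "It was the best of times, it was the worst of times, "
         . "it was the age of wisdom, it was the age of foolishness, "
         . "it was the epoch of belief, it was the epoch of incredulity" ;

# Lower-case the text and collect every run of letters as one token
my @tokens = ( lc($text) =~ /[a-z]+/g ) ;

# Count how often each type occurs
my %freq ;
foreach my $t ( @tokens ) {
    $freq{ $t }++ ;
}

my $token_count = scalar( @tokens ) ;
my $type_count  = scalar( keys %freq ) ;

print "Tokens: $token_count\n" ;   # prints "Tokens: 36"
print "Types: $type_count\n" ;     # prints "Types: 13"
```

Because the hash only keeps one entry per distinct word, the number of keys is the number of types.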

Page 8: Digital Text and  Data Processing


Stylometrics

David Hoover, Textual Analysis

□ Study of style on the basis of quantitative aspects

□ Analyses of differences and similarities between texts in different genres, in different periods, texts by different authors

Page 9: Digital Text and  Data Processing

Hugh Craig, Stylistic Analysis and Authorship Studies

Page 10: Digital Text and  Data Processing

Common words

□ Zipf’s law: a small number of words have a high frequency, while a large number of words are ‘hapax legomena’ (words that appear only once)

□ Function words and lexical words

□ Common words may be ignored by making use of a list of stop words, e.g. Glasgow stop word list
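As a rough illustration of both ideas, the sketch below filters a short token list against a stop word list and then picks out the hapax legomena. The stop word list here is a tiny made-up one, not the Glasgow list, and all variable names are assumptions made for the example.

```perl
use strict ;
use warnings ;

# A tiny made-up stop word list (an assumption, NOT the Glasgow list)
my %stop ;
foreach my $s ( qw( the of it was a an and in to ) ) {
    $stop{ $s } = 1 ;
}

my @tokens = qw( it was the age of wisdom it was the age of foolishness ) ;

# Keep only the tokens that are not stop words, and count them
my @content ;
my %freq ;
foreach my $t ( @tokens ) {
    next if exists $stop{ $t } ;
    push( @content , $t ) ;
    $freq{ $t }++ ;
}

# Hapax legomena: types that occur exactly once
my @hapax ;
foreach my $type ( sort keys %freq ) {
    push( @hapax , $type ) if $freq{ $type } == 1 ;
}

print "Content words: " . join( " " , @content ) . "\n" ;
print "Hapax legomena: " . join( ", " , @hapax ) . "\n" ;
```

Storing the stop words as hash keys means that each lookup is a single test, however long the list is.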

Page 11: Digital Text and  Data Processing

Authorship attribution

John Burrows, Never Say Always Again: Reflections on the Numbers Game

□ Suggesting an author for texts whose authorship is disputed

Page 12: Digital Text and  Data Processing

Digital Shakespeare

□ “Secondary query potential” of digital text
□ “non-reading” or scalable reading
□ “The underlying methods (…) are probabilistic and in many ways more compatible with a spirit of tentative inquiry”
□ “The impossibly impoverishing reduction of a text into lists of its constituent parts may let you see some salient differences and resemblances across many texts that you could not as readily see by reading”
□ “Is it an instance of the old joke about the drunk who is looking for his lost car key under a lamp post because that is where the light is?”
□ Digital methods are concerned more with establishing the ‘fact that’ than with explaining the ‘reason why’.

Page 13: Digital Text and  Data Processing

Recapitulation W1

□ Scalar variables begin with a dollar sign. They hold two types of values: strings and numbers

□ Statements end in a semi-colon

□ “use strict” has the effect that all variables need to be declared on first use with the “my” keyword

□ “use warnings” means that programmers will be warned when there are errors, even when these are “non-fatal”

Page 14: Digital Text and  Data Processing

Reading a file

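# Open the file for reading; stop with a message if this fails
open ( IN , "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;

# Read the file line by line; each line is placed in $_
while ( <IN> ) {
    print $_ ;
}

close ( IN ) ;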

Curly brackets create a “block” of code

Page 15: Digital Text and  Data Processing

Operators

□ Concatenation of strings with the dot

$string1 = "Hello" ;
$string2 = "World" ;
$string3 = $string1 . " " . $string2 ;
# $string3 now contains "Hello World"

□ Mathematical operators:

$sum = 5 + 1 ;
$sum++ ;
# $sum is now 7

$number = 2 ;
$number += 3 ;
# $number is now 5

Page 16: Digital Text and  Data Processing

Functions

□ Functions “cluster” a number of instructions

□ Examples:

□ length()

my $title = "Ulysses" ;
print length($title) ;
# output of this line: 7

□ lc() and uc()

my $title = "Ulysses" ;
print lc($title) ;
# output of this line: "ulysses"

Page 17: Digital Text and  Data Processing

Regular expressions

□ Text patterns

□ Simplest regular expression: simple sequence of characters

Example:

/sun/
Also matches: disunited, sunk, asunder (and Sunday, but only when case is ignored)

/ sun /
Does NOT match, because the pattern requires a space on both sides:
[…] the gate of the eastern sun,
[…] gloom beneath the noonday sun.

Page 18: Digital Text and  Data Processing

□ \b can be used in regular expressions to represent word boundaries

□ If you place “i” directly after the second forward slash, the comparison will take place in a case insensitive manner.

/\bsun\b/i

[…] Points to the unrisen sun! […]
[…] Startles the dreamer, sun-like truth […]
[…] stamped upon the sun; […]
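The effect of \b and the “i” modifier can be checked with a short script. This is a sketch only: the first three strings are taken from the examples above, and the last two are made-up lines added for contrast.

```perl
use strict ;
use warnings ;

my @lines = (
    "Points to the unrisen sun!" ,
    "Startles the dreamer, sun-like truth" ,
    "stamped upon the sun;" ,
    "The Sunday bells" ,    # made-up line: "sun" is followed by a letter
    "torn asunder" ,        # made-up line: "sun" is inside a longer word
) ;

# Collect the lines in which "sun" occurs as a separate word
my @hits ;
foreach my $line ( @lines ) {
    if ( $line =~ /\bsun\b/i ) {
        push( @hits , $line ) ;
    }
}

foreach my $hit ( @hits ) {
    print $hit . "\n" ;
}
```

Punctuation and hyphens count as word boundaries, which is why “sun!”, “sun-like” and “sun;” are all found, while “Sunday” and “asunder” are not.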

Page 19: Digital Text and  Data Processing

.      Any character (except a newline, by default)

\w     Any alphanumerical character: alphabetical characters, numbers and underscore

\d     Any digit

\s     White space: space, tab, newline

[..]   Any of the characters supplied within square brackets, e.g. [A-Za-z]

Character classes
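To see the character classes at work, the sketch below reports which of them occur somewhere in a string. The helper name classes() and the sample strings are made up for this illustration.

```perl
use strict ;
use warnings ;

# List which of the character classes occur somewhere in a string
sub classes {
    my ( $s ) = @_ ;
    my @found ;
    push( @found , "word character" ) if $s =~ /\w/ ;
    push( @found , "digit" )          if $s =~ /\d/ ;
    push( @found , "white space" )    if $s =~ /\s/ ;
    push( @found , "capital letter" ) if $s =~ /[A-Z]/ ;
    return join( ", " , @found ) ;
}

# Made-up sample strings
foreach my $sample ( "Canto 4" , "1818" , "a_b" , "?!" ) {
    print "'$sample': " . classes($sample) . "\n" ;
}
```

Note that a digit also counts as a word character, and that “?!” contains none of the four classes.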

Page 20: Digital Text and  Data Processing

{n,m} Pattern must occur at least n times, at most m times

{n,} At least n times

{n} Exactly n times

? is the same as {0,1}

+ is the same as {1,}

* is the same as {0,}

Quantifiers
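The equivalences between the short and the explicit quantifiers can be tested directly. In this sketch the test strings “gd”, “god” and “good” are made up. Each one is matched with the short form, and the script counts how often the explicit form gives a different answer.

```perl
use strict ;
use warnings ;

my @words = qw( gd god good ) ;   # made-up test strings
my @report ;
my $disagreements = 0 ;

foreach my $w ( @words ) {
    # Short forms: zero or one "o", one or more, zero or more
    my $optional = ( $w =~ /go?d/ ) ? "yes" : "no" ;
    my $plus     = ( $w =~ /go+d/ ) ? "yes" : "no" ;
    my $star     = ( $w =~ /go*d/ ) ? "yes" : "no" ;

    # The explicit forms should give exactly the same answers
    $disagreements++ if $optional ne ( ( $w =~ /go{0,1}d/ ) ? "yes" : "no" ) ;
    $disagreements++ if $plus     ne ( ( $w =~ /go{1,}d/ )  ? "yes" : "no" ) ;
    $disagreements++ if $star     ne ( ( $w =~ /go{0,}d/ )  ? "yes" : "no" ) ;

    push( @report , "$w\t?: $optional\t+: $plus\t*: $star" ) ;
}

foreach my $r ( @report ) {
    print $r . "\n" ;
}
print "Disagreements: $disagreements\n" ;   # prints "Disagreements: 0"
```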

Page 21: Digital Text and  Data Processing

/\d{4}/

Matches: 1234, 2013, 1066

/b[aeiou]{1,2}t\w*/
Matches: bit, but, beat, boathouse
Not: beauty, blister, boyhood

/[a-zA-Z]+/

Matches any word that consists of alphabetical characters only

Does not FULLY match: e-mail, catch22, can’t

Examples
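The examples on this slide can be verified with a small script. This is a sketch: the word lists are the ones given above, while “42” and “MMXIII” are made-up counterexamples for the four-digit pattern.

```perl
use strict ;
use warnings ;

# Sort the words into those that match the pattern and those that do not
my ( @matched , @not_matched ) ;
foreach my $w ( qw( bit but beat boathouse beauty blister boyhood ) ) {
    if ( $w =~ /b[aeiou]{1,2}t\w*/ ) {
        push( @matched , $w ) ;
    }
    else {
        push( @not_matched , $w ) ;
    }
}

# /\d{4}/ needs four digits in a row
my @four_digits ;
foreach my $n ( qw( 1234 2013 1066 42 MMXIII ) ) {
    push( @four_digits , $n ) if $n =~ /\d{4}/ ;
}

print "Matched: " . join( ", " , @matched ) . "\n" ;
print "Not matched: " . join( ", " , @not_matched ) . "\n" ;
print "Four digits: " . join( ", " , @four_digits ) . "\n" ;
```

“beauty” fails because three vowels stand between the b and the t, one more than {1,2} allows.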

Page 22: Digital Text and  Data Processing

Do not match characters, but locations within strings.

\b Word boundaries

^ Start of a line

$ End of a line

Anchors
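Anchors make it possible to demand that a pattern covers a whole string rather than just a part of it. The sketch below applies this to the pattern /[a-zA-Z]+/: with ^ and $ added, only strings that consist of letters from start to end are accepted. The list of candidate strings is made up for the illustration.

```perl
use strict ;
use warnings ;

my @candidates = ( "email" , "e-mail" , "catch22" , "can't" , "Shelley" ) ;

# Keep only the strings that consist of letters from start to end
my @alphabetical ;
foreach my $c ( @candidates ) {
    if ( $c =~ /^[a-zA-Z]+$/ ) {
        push( @alphabetical , $c ) ;
    }
}

print join( ", " , @alphabetical ) . "\n" ;   # prints "email, Shelley"
```

Without the anchors, all five strings would match, because each of them contains at least one letter.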

Page 23: Digital Text and  Data Processing

Match variables

□ Parentheses capture the substring matched by a part of a regular expression

□ In Perl, this substring is stored in the variable $1

□ Example:

$keyword = "quick-thinking" ;

if ( $keyword =~ /(\w+)-\w+/ ) {
    print $1 ;
    # This will print "quick"
}
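A pattern may contain more than one pair of parentheses. In Perl the second captured substring is stored in $2, the third in $3, and so on. The sketch below extends the example with a second pair; the names $first and $second are made up for the illustration.

```perl
use strict ;
use warnings ;

my $keyword = "quick-thinking" ;
my ( $first , $second ) = ( "" , "" ) ;

# Two pairs of parentheses: the first fills $1, the second fills $2
if ( $keyword =~ /(\w+)-(\w+)/ ) {
    $first  = $1 ;
    $second = $2 ;
}

print $first . "\n" ;    # prints "quick"
print $second . "\n" ;   # prints "thinking"
```

The match variables are only meaningful after a successful match, which is why they are copied inside the if block.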

Page 24: Digital Text and  Data Processing

□ Regular expressions can be combined with vertical bar (‘|’)

/\bsun\b|\bstar\b|\bmoon\b/

□ ‘special characters’ need to be escaped with the backslash (‘\’)

/\?/
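Both techniques can be tried out on a few lines of text. In this sketch the sample lines are made up. The first pattern finds lines that mention the sun, a star or the moon as a separate word, and the second finds lines that contain a literal question mark.

```perl
use strict ;
use warnings ;

# Made-up sample lines
my @lines = (
    "The moon is up" ,
    "Is that a star?" ,
    "A starling sang" ,
    "No sun today." ,
) ;

my ( @celestial , @questions ) ;
foreach my $line ( @lines ) {
    # Any one of the three alternatives is enough for a match
    if ( $line =~ /\bsun\b|\bstar\b|\bmoon\b/ ) {
        push( @celestial , $line ) ;
    }
    # The backslash turns "?" from a quantifier into an ordinary character
    if ( $line =~ /\?/ ) {
        push( @questions , $line ) ;
    }
}

print "Celestial: " . scalar( @celestial ) . "\n" ;   # prints "Celestial: 3"
print "Questions: " . scalar( @questions ) . "\n" ;   # prints "Questions: 1"
```

“A starling sang” is not counted, because \b requires the word “star” to end after the r.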

Page 25: Digital Text and  Data Processing

Three types of variables

□ Scalars: a single value; start with $

□ Arrays: multiple values; start with @

□ Hashes: multiple values which can be referenced with ‘keys’; start with %

Page 26: Digital Text and  Data Processing

$line = "If music be the food of love, play on" ;

@array = split(" " , $line ) ;

# $array[0] contains "If"
# $array[4] contains "food"

Basic tokenisation
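Splitting on spaces alone leaves punctuation attached to the words: in the example above, one of the tokens is “love,” with its comma. One possible refinement, shown here only as a sketch, is to lower-case the line and to split on any run of white space or punctuation marks. The set of punctuation marks in the character class is an assumption and will need to be extended for real texts.

```perl
use strict ;
use warnings ;

my $line = "If music be the food of love, play on" ;

# Basic tokenisation: split on white space only
my @array = split( " " , $line ) ;

# Refined tokenisation: lower-case first, then split on any run of
# white space or common punctuation marks
my @words = split( /[\s,.;:!?]+/ , lc($line) ) ;

print $array[6] . "\n" ;            # prints "love," (comma still attached)
print $words[6] . "\n" ;            # prints "love"
print scalar( @words ) . "\n" ;     # prints 9
```

Lower-casing ensures that “If” and “if” are later counted as the same type.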

Page 27: Digital Text and  Data Processing

Looping through an array

foreach my $w ( @words ) {

print $w ;

}


Page 28: Digital Text and  Data Processing

my %freq ;

$freq{"if"}++ ;
$freq{"music"}++ ;

print $freq{"if"} . "\n" ;

Page 29: Digital Text and  Data Processing

Calculation of frequencies

my %freq ;

foreach my $w ( @words ) {

$freq{ $w }++ ;

}

Page 30: Digital Text and  Data Processing

foreach my $f ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    print $f . "\t" . $freq{$f} . "\n" ;
}

Looping through a hash
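The steps from this class (reading text line by line, tokenisation, counting, and sorting) can be combined into one script. The following is a sketch under a few assumptions: the text is read from a string through an in-memory filehandle so that the script runs without a file such as shelley.txt, the punctuation list is a simple made-up one, and words with the same frequency are sorted alphabetically.

```perl
use strict ;
use warnings ;

# Stand-in for a text file, so that the sketch runs without shelley.txt
my $text = "If music be the food of love, play on;\n"
         . "Give me excess of it, that, surfeiting,\n"
         . "The appetite may sicken, and so die.\n" ;

# Open the string as if it were a file
open( my $in , "<" , \$text ) or die "Cannot open text: $!" ;

my %freq ;
while ( my $line = <$in> ) {
    # Tokenise: lower-case, then split on white space and punctuation
    foreach my $w ( split( /[\s,.;:!?]+/ , lc($line) ) ) {
        next if $w eq "" ;    # skip an empty field at the start of a line
        $freq{ $w }++ ;
    }
}
close( $in ) ;

# Sort by descending frequency; ties are sorted alphabetically
my @ranked = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq ;

foreach my $f ( @ranked ) {
    print $f . "\t" . $freq{$f} . "\n" ;
}
```

To process a real file, the open statement can be given a file name instead of the reference to $text. The rest of the script stays the same.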