Transcript
Page 1: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

DiaView: Visualise Cultural Change in Diachronic Corpora

David Beavan, UCL Centre for Digital Humanities

@DavidBeavan
www.scottishcorpus.ac.uk/corpus/diaview

Page 2: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

Google Books corpus/Ngram Viewer

http://books.google.com/ngrams/

Page 3: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

Google Books corpus

• OCR quality variable, particularly poor in 1700s (difficulties with long-s: ſ)

• Does not evenly sample across genres (data collection fairly opportunistic)

• Chronological placement questionable (implicit metadata not always correct)

• Very large data set (155 billion tokens)

Page 4: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

DiaView uses

• English One Million corpus: “Books with low OCR quality were removed, and serials were removed.”

• 1850 to present (avoids long-s)

• 98 billion tokens (still very large)

• Filter out very infrequently used words (or keep large sample of most frequently used)
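The filtering step in the last bullet could look roughly like the sketch below. The function name `keep_frequent`, its parameters, and the thresholds are hypothetical; the slides do not say which cut-off DiaView actually uses.

```python
from collections import Counter


def keep_frequent(counts, min_count=None, top_n=None):
    """Reduce a word -> count mapping to its frequent words.

    Either keep only the top_n most frequent words, or drop every word
    seen fewer than min_count times. Both cut-offs are illustrative
    assumptions, not DiaView's real settings.
    """
    counter = Counter(counts)
    if top_n is not None:
        # most_common returns (word, count) pairs, highest count first
        return dict(counter.most_common(top_n))
    if min_count is not None:
        return {word: n for word, n in counter.items() if n >= min_count}
    return dict(counter)


# Toy counts for illustration only
toy_counts = {"and": 100, "sausage": 20, "zyx": 1}
frequent = keep_frequent(toy_counts, min_count=5)
```

With the toy counts above, `frequent` keeps 'and' and 'sausage' and drops the word that occurs only once.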

Page 5: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

DiaView concept

• Quick and easy to use
• Aggregate and summarise data
• Promote browsing and opportunistic discovery
• Help identify cultural trends across time
• Highlight salient or ‘interesting’ terms
• Provide links to more in-depth analysis
• Inspect corpus by decade or year
• Ability to work with any corpora or any dataset

Page 6: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

DiaView method/measuring salience

Proportion of term occurrences in entire corpus

vs

Proportion of term occurrences in particular year
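One way to write this comparison down, inferred from the worked example on the next slide rather than stated explicitly by the author, is as a relative difference. The symbols are assumptions introduced here: c_y(w) is the count of word w in year y, N_y the total number of words in year y, c(w) the count of w in the whole corpus, and N the total number of words in the corpus.

```latex
\[
  p_y(w) = \frac{c_y(w)}{N_y}, \qquad
  p(w) = \frac{c(w)}{N}, \qquad
  \mathrm{salience}(w, y) = \frac{p_y(w) - p(w)}{p(w)} \times 100\%
\]
```

A positive value means the word is overused in that year relative to the corpus as a whole; a negative value means it is underused.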

Page 7: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

Word ‘and’

100 of 1000 words in entire corpus are ‘and’ = 10%

Year 1: 45 of 500 words = 9% = -10% relative to corpus proportion (10%)
Year 2: 55 of 500 words = 11% = +10% relative to corpus proportion (10%)

Word ‘sausage’

20 of 1000 words in entire corpus are ‘sausage’ = 2%

Year 1: 4 of 500 words = 0.8% = -60% relative to corpus proportion (2%)
Year 2: 16 of 500 words = 3.2% = +60% relative to corpus proportion (2%)

Rank for salience by year, ignoring underuse (negative percentages are discarded)

Year 1: (none)
Year 2: ‘sausage’ (+60%), ‘and’ (+10%)
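The arithmetic on this slide can be checked with a short sketch. It recomputes each percentage directly from the raw counts given above (for example 4 of 500 words), so the printed figures follow from the counts alone. The function names `salience` and `rank_overused` are hypothetical, and this is an illustration of the calculation, not DiaView's actual code.

```python
def salience(year_count, year_total, corpus_count, corpus_total):
    """Percentage over- or underuse of a word in one year,
    relative to its proportion in the whole corpus."""
    year_prop = year_count / year_total
    corpus_prop = corpus_count / corpus_total
    return (year_prop / corpus_prop - 1.0) * 100.0


# Counts taken from the slide
year_counts = {
    "Year 1": {"and": 45, "sausage": 4},
    "Year 2": {"and": 55, "sausage": 16},
}
year_totals = {"Year 1": 500, "Year 2": 500}
corpus_counts = {"and": 100, "sausage": 20}
corpus_total = 1000


def rank_overused(year):
    """Rank words by salience for one year, ignoring underuse."""
    scores = {
        word: salience(count, year_totals[year], corpus_counts[word], corpus_total)
        for word, count in year_counts[year].items()
    }
    # Keep only overused words (positive salience), rounded for display
    overused = [(word, round(score, 1)) for word, score in scores.items() if score > 0]
    return sorted(overused, key=lambda item: item[1], reverse=True)


for year in year_counts:
    print(year, rank_overused(year))
```

Year 1 yields an empty list because both words are underused there, while Year 2 ranks ‘sausage’ above ‘and’ even though ‘and’ is far more frequent.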

Page 8: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

DiaView method

• Word frequency alone does not dictate salience (extraordinary overuse does)

• Traverse entire corpus by year/decade
• Calculate salience for each type
• Rank types according to salience
• Apply visual style to word lists
• Create links back to Ngram Viewer for in-depth analysis
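The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the DiaView implementation: the record layout `(word, year, count)`, the function names, the toy data (including the filler type ‘other’), and the Ngram Viewer `graph` path with its `content`, `year_start`, and `year_end` query parameters are all assumptions. The visual styling step is left out.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Assumed link target; the path and parameter names are not from the slides
NGRAM_BASE = "http://books.google.com/ngrams/graph"


def bucket(year, width=10):
    """Start year of the period containing `year` (width=10 gives decades)."""
    return year - year % width


def salient_terms(records, width=10, top_n=3):
    """Traverse (word, year, count) records, score each type per period,
    and return the top_n overused types for every period."""
    period_counts = defaultdict(lambda: defaultdict(int))
    corpus_counts = defaultdict(int)
    for word, year, count in records:
        period_counts[bucket(year, width)][word] += count
        corpus_counts[word] += count
    corpus_total = sum(corpus_counts.values())

    result = {}
    for period, counts in sorted(period_counts.items()):
        period_total = sum(counts.values())
        scored = []
        for word, count in counts.items():
            # Ratio of period proportion to corpus proportion; > 1 means overuse
            ratio = (count / period_total) / (corpus_counts[word] / corpus_total)
            if ratio > 1:
                scored.append((word, ratio))
        scored.sort(key=lambda item: item[1], reverse=True)
        result[period] = scored[:top_n]
    return result


def ngram_link(word, start, end):
    """Build a link back to the Ngram Viewer for one word and period."""
    query = urlencode({"content": word, "year_start": start, "year_end": end})
    return NGRAM_BASE + "?" + query


# Toy records for illustration only
records = [
    ("and", 1851, 45), ("sausage", 1851, 4), ("other", 1851, 451),
    ("and", 1862, 55), ("sausage", 1862, 16), ("other", 1862, 429),
]
ranked = salient_terms(records)
for period, terms in ranked.items():
    for word, _ in terms:
        print(period, word, ngram_link(word, period, period + 9))
```

Ranking by the ratio of proportions gives the same order as ranking by the percentage difference, since one is the other plus a constant.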

Page 9: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

www.scottishcorpus.ac.uk/corpus/diaview


