DiaView: Visualise Cultural Change in Diachronic Corpora
David Beavan, UCL Centre for Digital Humanities
@DavidBeavan
www.scottishcorpus.ac.uk/corpus/diaview
Google Books corpus/Ngram Viewer
http://books.google.com/ngrams/
Google Books corpus
• OCR quality variable, particularly poor in the 1700s (difficulties with the long s: ſ)
• Does not sample evenly across genres (data collection fairly opportunistic)
• Chronological placement questionable (implicit metadata not always correct)
• Very large data set (155 billion tokens)
DiaView uses
• English One Million corpus: “Books with low OCR quality were removed, and serials were removed.”
• 1850 to present (avoids the long s)
• 98 billion tokens (still very large)
• Filter out very infrequently used words (or keep a large sample of the most frequently used)
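The filtering step in the last bullet could look something like the sketch below. The slides do not give a threshold or sample size, so `min_count` and `keep_top` are hypothetical parameters, and `filter_vocabulary` is an illustrative name rather than anything from DiaView itself.

```python
from collections import Counter


def filter_vocabulary(counts, min_count=None, keep_top=None):
    """Drop rare types from a Counter of corpus-wide term frequencies.

    min_count: keep only terms seen at least this many times (assumed cutoff).
    keep_top:  keep only this many of the most frequent terms (assumed size).
    """
    # most_common(None) returns every term, ordered by descending frequency.
    items = counts.most_common(keep_top)
    if min_count is not None:
        items = [(term, n) for term, n in items if n >= min_count]
    return Counter(dict(items))


counts = Counter({"the": 50, "and": 30, "zyzzyva": 1})
frequent = filter_vocabulary(counts, min_count=5)
```

Either option shrinks the vocabulary before any per-year comparison, which also removes terms whose proportions would be too noisy to compare.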
DiaView concept
• Quick and easy to use
• Aggregate and summarise data
• Promote browsing and opportunistic discovery
• Help identify cultural trends across time
• Highlight salient or ‘interesting’ terms
• Provide links to more in-depth analysis
• Inspect corpus by decade or year
• Ability to work with any corpus or dataset
DiaView method/measuring salience
Proportion of term occurrences in the entire corpus
vs
Proportion of term occurrences in a particular year
Word ‘and’
100 of 1000 words in the entire corpus are ‘and’ = 10%
Year 1: 45 of 500 words = 9% = -10% relative to the corpus proportion (10%)
Year 2: 55 of 500 words = 11% = +10% relative to the corpus proportion (10%)
Word ‘sausage’
20 of 1000 words in the entire corpus are ‘sausage’ = 2%
Year 1: 4 of 500 words = 0.8% = -60% relative to the corpus proportion (2%)
Year 2: 16 of 500 words = 3.2% = +60% relative to the corpus proportion (2%)
Rank by salience within each year, ignoring underuse (negative percentages are dropped)
Year 1: (none)
Year 2: ‘sausage’ (+60%), ‘and’ (+10%)
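The worked example can be checked with a short script. This is a minimal sketch, assuming salience is the relative difference between a term's proportion in one year and its proportion in the whole corpus, expressed as a percentage; the function and variable names are illustrative, not taken from DiaView.

```python
def salience(year_count, year_total, corpus_count, corpus_total):
    """Percentage by which a term's share of one year differs from
    its share of the entire corpus (positive = overuse)."""
    corpus_prop = corpus_count / corpus_total
    year_prop = year_count / year_total
    return (year_prop - corpus_prop) / corpus_prop * 100


# Toy figures from the example: a 1000-word corpus split into two years.
corpus_total = 1000
corpus_counts = {"and": 100, "sausage": 20}
years = {
    "Year 1": (500, {"and": 45, "sausage": 4}),
    "Year 2": (500, {"and": 55, "sausage": 16}),
}


def rank_year(year_total, year_counts):
    """Return overused terms only, most salient first."""
    scores = {
        term: salience(n, year_total, corpus_counts[term], corpus_total)
        for term, n in year_counts.items()
    }
    overused = [(term, round(s, 1)) for term, s in scores.items() if s > 0]
    return sorted(overused, key=lambda pair: pair[1], reverse=True)


ranked = {year: rank_year(total, counts) for year, (total, counts) in years.items()}
```

‘sausage’ outranks ‘and’ in Year 2 even though ‘and’ is far more frequent, because the ranking uses relative overuse rather than raw counts.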
DiaView method
• Word frequency alone does not dictate salience (extraordinary overuse does)
• Traverse entire corpus by year/decade
• Calculate salience for each type
• Rank types according to salience
• Apply visual style to word lists
• Create links back to Ngram Viewer for in-depth analysis
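The steps above can be sketched end to end on a toy corpus. This is an illustration under stated assumptions, not DiaView's actual code: the input is a hypothetical mapping from year to a list of tokens, the visual styling step is omitted, and the Ngram Viewer link format (the `graph` path with `content`, `year_start` and `year_end` query parameters) is an assumption about that site's URLs.

```python
from collections import Counter
from urllib.parse import urlencode

# Assumed Ngram Viewer endpoint and query parameters.
NGRAM_VIEWER = "http://books.google.com/ngrams/graph"


def ngram_link(term, year_start=1850, year_end=2000):
    """Build a link back to the Ngram Viewer for one term."""
    query = urlencode(
        {"content": term, "year_start": year_start, "year_end": year_end}
    )
    return NGRAM_VIEWER + "?" + query


def diaview_lists(tokens_by_year, top_n=10):
    """For each year, return the top_n overused types as
    (term, salience %, link) tuples, most salient first."""
    year_counts = {year: Counter(toks) for year, toks in tokens_by_year.items()}
    corpus_counts = Counter()
    for counts in year_counts.values():
        corpus_counts.update(counts)
    corpus_total = sum(corpus_counts.values())

    lists = {}
    for year, counts in sorted(year_counts.items()):
        year_total = sum(counts.values())
        scored = []
        for term, n in counts.items():
            corpus_prop = corpus_counts[term] / corpus_total
            score = (n / year_total - corpus_prop) / corpus_prop * 100
            if score > 0:  # ignore underuse
                scored.append((term, score, ngram_link(term)))
        scored.sort(key=lambda row: row[1], reverse=True)
        lists[year] = scored[:top_n]
    return lists


toy = {
    1900: ["tea", "tea", "and", "horse"],
    1950: ["tea", "and", "and", "radio"],
}
lists = diaview_lists(toy)
```

On real data the per-year counts would come from the Ngram files rather than token lists, and grouping by decade only changes the key used for `tokens_by_year`.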
www.scottishcorpus.ac.uk/corpus/diaview