DiaView: Visualise Cultural Change in Diachronic Corpora
David Beavan, UCL Centre for Digital Humanities
@DavidBeavan
www.scottishcorpus.ac.uk/corpus/diaview

DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012


DESCRIPTION

Talk given at Digital Humanities 2012 (DH2012) in Hamburg, Germany on 18 July 2012.

Web site: http://www.scottishcorpus.ac.uk/corpus/diaview/
Video: http://lecture2go.uni-hamburg.de/konferenzen/-/k/13916
Abstract: http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/diaview-visualise-cultural-change-in-diachronic-corpora/

This paper will introduce and demonstrate DiaView, a new tool to investigate and visualise word usage in diachronic corpora. DiaView highlights cultural change over time by exposing salient lexical items from each decade or year, and providing them to the user in an effortless visualisation. This is made possible by examining large quantities of diachronic textual data, in this case the Google Books corpus (Michel et al. 2010) of one million English books. This paper will introduce the methods and technologies at its core, perform a demonstration of the tool and discuss further possibilities.


Page 1: DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

DiaView: Visualise Cultural Change in Diachronic Corpora

David Beavan
UCL Centre for Digital Humanities

@DavidBeavan
www.scottishcorpus.ac.uk/corpus/diaview

Page 2

Google Books corpus/Ngram Viewer

http://books.google.com/ngrams/

Page 3

Google Books corpus

• OCR quality variable, particularly poor in 1700s (difficulties with long-s: ſ)

• Does not evenly sample across genres (data collection fairly opportunistic)

• Chronological placement questionable (implicit metadata not always correct)

• Very large data set (155 billion tokens)

Page 4

DiaView uses

• English One Million corpus: “Books with low OCR quality were removed, and serials were removed.”

• 1850 to present (avoids long-s)

• 98 billion tokens (still very large)

• Filter out very infrequently used words (or keep large sample of most frequently used)
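The two filtering options in the last bullet could be sketched as follows. This is only an illustration: the function names, the threshold values and the toy counts are made up here, and the slides do not say which cut-off DiaView actually uses.

```python
from collections import Counter

def drop_rare(counts, min_count):
    """Option 1: filter out types seen fewer than min_count times."""
    return {term: n for term, n in counts.items() if n >= min_count}

def keep_most_frequent(counts, top_n):
    """Option 2: keep only the top_n most frequently used types."""
    return dict(Counter(counts).most_common(top_n))

# Toy counts, invented for illustration only.
counts = {"the": 5000, "and": 3200, "sausage": 40, "zyxt": 1}

print(drop_rare(counts, min_count=5))
print(keep_most_frequent(counts, top_n=2))
```

Either approach shrinks the vocabulary before salience is calculated, which also keeps very rare types (often OCR noise) out of the rankings.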

Page 5

DiaView concept

• Quick and easy to use
• Aggregate and summarise data
• Promote browsing and opportunistic discovery
• Help identify cultural trends across time
• Highlight salient or ‘interesting’ terms
• Provide links to more in-depth analysis
• Inspect corpus by decade or year
• Ability to work with any corpora or any dataset

Page 6

DiaView method/measuring salience

Proportion of term occurrences in entire corpus

vs

Proportion of term occurrences in particular year
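One way to write this comparison down, assuming (as the worked example on the next slide suggests) that salience is the relative difference between the two proportions. The symbols are notation introduced here, not taken from the slides: c(w) is the count of term w in the whole corpus of N tokens, and c_y(w) is its count in year y, which has N_y tokens.

```latex
p(w) = \frac{c(w)}{N}, \qquad
p_y(w) = \frac{c_y(w)}{N_y}, \qquad
s_y(w) = \frac{p_y(w) - p(w)}{p(w)}
```

Under this reading, s_y(w) > 0 means the term is overused in year y relative to the corpus as a whole, and s_y(w) < 0 means it is underused.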

Page 7

Word ‘and’

100 of 1000 words in the entire corpus are ‘and’ = 10%

Year 1: 45 of 500 words = 9% = -10% relative to corpus proportion (10%)
Year 2: 55 of 500 words = 11% = +10% relative to corpus proportion (10%)

Word ‘sausage’

20 of 1000 words in the entire corpus are ‘sausage’ = 2%

Year 1: 4 of 500 words = 0.8% = -60% relative to corpus proportion (2%)
Year 2: 16 of 500 words = 3.2% = +60% relative to corpus proportion (2%)

Rank for salience by year, ignoring underuse (negative percentages are dropped):

Year 1: (none)
Year 2: ‘sausage’ (+60%), ‘and’ (+10%)
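The arithmetic on this slide can be checked with a short script. It is a minimal sketch that uses only the counts given above; the variable and function names are illustrative, and salience is assumed to be the relative difference between a term's share of a year and its share of the whole corpus.

```python
# Counts taken from the worked example; names and structure are illustrative.
corpus_total = 1000
corpus_counts = {"and": 100, "sausage": 20}
year_totals = {1: 500, 2: 500}
year_counts = {
    1: {"and": 45, "sausage": 4},
    2: {"and": 55, "sausage": 16},
}

def relative_salience(term, year):
    """Relative difference between a term's share of one year and of the corpus."""
    corpus_share = corpus_counts[term] / corpus_total
    year_share = year_counts[year][term] / year_totals[year]
    return (year_share - corpus_share) / corpus_share

def rank_year(year):
    """Terms overused in `year`, most salient first; underused terms are dropped."""
    scores = {term: relative_salience(term, year) for term in year_counts[year]}
    overused = [(term, s) for term, s in scores.items() if s > 0]
    return sorted(overused, key=lambda pair: pair[1], reverse=True)

for year in sorted(year_counts):
    ranked = ", ".join(f"{term} ({s:+.0%})" for term, s in rank_year(year))
    print(f"Year {year}: {ranked or '(none)'}")
```

Although ‘and’ is far more frequent than ‘sausage’ in absolute terms, ‘sausage’ ranks first in Year 2 because its share of that year departs much further from its corpus-wide share.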

Page 8

DiaView method

• Word frequency alone does not dictate salience (extraordinary overuse does)
• Traverse entire corpus by year/decade
• Calculate salience for each type
• Rank types according to salience
• Apply visual style to word lists
• Create links back to Ngram Viewer for in-depth analysis
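The last two steps (styling the ranked list and linking back to the Ngram Viewer) might look something like the sketch below. Several things here are assumptions rather than details from the talk: the slides do not say what the "visual style" is, so font size scaled by salience is used purely as an example, and the link format, the `year_end` default and all function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed Ngram Viewer graph URL and query parameters; not taken from the slides.
NGRAM_URL = "https://books.google.com/ngrams/graph"

def ngram_link(term, year_start=1850, year_end=2008):
    """Build a link to the Ngram Viewer for one term (year range is a placeholder)."""
    query = urlencode({"content": term, "year_start": year_start, "year_end": year_end})
    return f"{NGRAM_URL}?{query}"

def font_size(salience, top_salience, smallest=10, largest=32):
    """Example visual style: scale font size linearly against the top-ranked term."""
    return round(smallest + (largest - smallest) * salience / top_salience)

def render_year(label, ranked):
    """ranked: list of (term, salience) pairs, most salient first, all saliences > 0."""
    lines = [f"{label}:"]
    if ranked:
        top = ranked[0][1]
        for term, salience in ranked:
            lines.append(f"  {term} [{font_size(salience, top)}pt] {ngram_link(term)}")
    return "\n".join(lines)

# Ranked list from the worked example (Year 2).
print(render_year("Year 2", [("sausage", 0.6), ("and", 0.1)]))
```

Keeping the rendering separate from the salience calculation means the same ranked lists could be styled differently, or produced from a different corpus, without touching the scoring step.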

Page 9

www.scottishcorpus.ac.uk/corpus/diaview

Page 10

www.scottishcorpus.ac.uk/corpus/diaview

Page 11
Page 12

DiaView: Visualise Cultural Change in Diachronic Corpora

David Beavan
UCL Centre for Digital Humanities

@DavidBeavan
www.scottishcorpus.ac.uk/corpus/diaview