Data Science at the Alan Turing Institute and British Library Web Archiving
Dr Scott A. Hale @computermacgyve http://scott.hale.us
1. Fifteen Years of British Universities on the Web
2. Live versus Archive
3. Full Stack Data Science
Mapping the UK Webspace: Fifteen Years of British Universities on the Web
Scott A. Hale, Taha Yasseri, Josh Cowls, Eric T. Meyer, Ralph Schroeder, Helen Margetts
With our thanks to Ning Wang, Adham Tamer, Andreas Kaltenbrunner, and our reviewers.
WebSci 2014, https://arxiv.org/abs/1405.2856
Few longitudinal studies of the Web
To what extent can online proxies reproduce traditional measures?
Does physical distance matter for universities online?
30 TB compressed data
6.2 TB metadata and links
2.5 TB temporal links
Grouped to 3rd level domain (e.g., ox.ac.uk)
Grouped pages crawled at similar times (within 1,000 seconds)
Edge weight between any two domains for a given year is the largest number of hyperlinks between those two domains for any group that year
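The grouping and edge-weight rules above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input format (one `(source_host, target_host, unix_time)` tuple per hyperlink), the function names, the rule that a group closes once a capture is more than 1,000 seconds after the group's first capture, the assignment of a group to the year of its first capture, and the exclusion of links within a single domain are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 1000  # "similar times" threshold from the slide


def third_level_domain(host):
    """Reduce a hostname to its last three labels, e.g. www.ox.ac.uk -> ox.ac.uk."""
    return ".".join(host.lower().split(".")[-3:])


def yearly_edge_weights(links, window=WINDOW_SECONDS):
    """Return {(source_domain, target_domain, year): weight}.

    links: iterable of (source_host, target_host, unix_time), one per hyperlink
    (hypothetical input format). The weight for a year is the largest number
    of hyperlinks found in any single crawl group that year.
    """
    # Collect capture times per directed domain pair.
    by_pair = defaultdict(list)
    for src_host, dst_host, ts in links:
        src, dst = third_level_domain(src_host), third_level_domain(dst_host)
        if src != dst:  # assumption: ignore links within one domain
            by_pair[(src, dst)].append(ts)

    weights = defaultdict(int)
    for (src, dst), times in by_pair.items():
        times.sort()
        group_start, count = times[0], 0
        # A trailing None flushes the final group.
        for ts in times + [None]:
            if ts is None or ts - group_start > window:
                year = datetime.fromtimestamp(group_start, tz=timezone.utc).year
                key = (src, dst, year)
                weights[key] = max(weights[key], count)
                if ts is None:
                    break
                group_start, count = ts, 0
            count += 1
    return dict(weights)
```

Taking the maximum over groups, rather than the sum, keeps a domain pair from looking more strongly connected simply because it was crawled more often in a given year.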
Example temporal edge (cam.ac.uk, ox.ac.uk): (2005, 2), (2006, 8), ..., (2010, 13)
Colour ~ intensity
$\sigma_{ij} = \frac{s_{ij}}{s_i^{out} s_j^{in}}$
$s_{ij} = \frac{s_i^{out} s_j^{in}}{r^{0.28}}$
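The normalisation on this slide appears to divide each edge weight by the product of the source domain's out-strength and the target domain's in-strength. A minimal sketch of that calculation is below; it assumes that out-strength and in-strength are the sums of outgoing and incoming edge weights (the usual network definition, not stated on the slide), and the function name and input format are hypothetical.

```python
from collections import defaultdict


def normalised_strengths(weights):
    """Return {(i, j): s_ij / (s_i_out * s_j_in)} for every edge with s_ij > 0.

    weights: {(i, j): s_ij} for directed edges i -> j (hypothetical format).
    """
    out_strength = defaultdict(float)  # s_i_out: total weight leaving i
    in_strength = defaultdict(float)   # s_j_in: total weight arriving at j
    for (i, j), s in weights.items():
        out_strength[i] += s
        in_strength[j] += s

    return {
        (i, j): s / (out_strength[i] * in_strength[j])
        for (i, j), s in weights.items()
        if s > 0
    }
```

Dividing by both strengths controls for domains that link out a lot or are linked to a lot, so a high value indicates a tie that is strong relative to the overall activity of the two endpoints.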
University affiliations weakly reflected
Correlation between network centrality and league table rankings increasing
Physical distance still important
Completeness
Variable timing of captures
Boundary effects (.uk): not really an issue for .ac.uk
Live versus Archive: Comparing a Web Archive to a Population of Webpages
Scott A. Hale, Grant Blank, & Victoria D. Alexander
In Web as History, R. Schroeder and N. Brügger (Eds.), London: UCL Press.
Ainsworth et al. (2013): 35-90% of the Web is archived
Unclear how much of any website is archived
Comparison of 1996-2013 JISC data for tripadvisor.co.uk to the live Web
Why TripAdvisor? We can determine when new webpages were added.
24% of TripAdvisor’s London attractions were in the JISC/Internet Archive data
Archived pages biased toward prominent pages; not a random sample
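A coverage figure like the one above is the share of live pages that also appear in the archive. The sketch below shows one way to compute it; the URL canonicalisation (ignoring the scheme, host case, and trailing slashes) and the function names are assumptions for illustration, not the method used in the study.

```python
from urllib.parse import urlsplit


def canonical(url):
    """Reduce a URL to host + path so http/https and trailing slashes match."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def archive_coverage(live_urls, archived_urls):
    """Fraction of live URLs that have at least one archived counterpart."""
    live = {canonical(u) for u in live_urls}
    archived = {canonical(u) for u in archived_urls}
    if not live:
        raise ValueError("live_urls is empty")
    return len(live & archived) / len(live)
```

Because the denominator is the live population rather than the archive, the result says how complete the archive is for that site, not how much of the archive is still online.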
Full Stack Data Science
Methods to discover and evaluate whether a site not in .uk is ‘British’
More complete crawls / machine-readable metadata on what is not crawled
For social science research: appropriate network null models for missing/biased data
Rich and accessible metadata