Workshop session given at the Institutional Web Management Workshop 2012 (IWMW 2012) event held at the University of Edinburgh on 18th - 20th June 2012.
1
UKOLN is supported by:
Big and Small Web Data
Marieke Guy, Institutional Support Officer, Digital Curation Centre, UKOLN, University of Bath, UK
Institutional Web Management Workshop 2012
This work is licensed under a Creative Commons Licence: Attribution-ShareAlike 2.0
2
Who Am I?
• Have worked for UKOLN for over 12 years
• Worked on a variety of projects: Subject portals project, IMPACT, Good APIs, JISC Observatory, cultural heritage work, digital preservation work, etc.
• Remote worker, into amplified events
• Co-chair of IWMW for a number of years
• Now working for the Digital Curation Centre
• Institutional Support Officer helping HEIs with their RDM
• New to data….
3
The Digital Curation Centre
• A consortium comprising units from the Universities of Bath (UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII)
• launched 1st March 2004 as a national centre for solving challenges in digital curation that could not be tackled by any single institution or discipline
• Funded by JISC with additional HEFCE funding from 2011 for the provision of support to national cloud services
• Targeted institutional development
• http://www.dcc.ac.uk/
4
Assessing Data Use
5
Data Management Tools
6
How to cite data
Advocacy and Training
• Informatics: disciplinary metadata schema, standards, formats, identifiers, ontologies
• Storage: file-store, cloud, data centres, funder policy
• Access: embargoes, FOI
• Policy: making the case
7
• Are you part of a Web team?
• Are you part of a MIS team?
• Are you a researcher?
• Do you know what data is?
• Do you use structured data?
• Do you manage data?
Who Are You?
8
• Presentation: What is data anyway? Looking at current data trends and what it has to do with Web managers
• Break out groups: What data do you deal with? Anything goes from personnel data to key information sets and Web stats…
• Presentation/Show and Tell: Taster of tools that help with data (mining, citation, visualization, analytics, etc.)
• Presentation: Case study - Data @ Southampton
• Discussion and buzzword bingo
Today’s Workshop: A Data Journey!
9
• All URLs at: http://www.delicious.com/mariekeguy/iwmw12
• All slides at: http://www.slideshare.net/MariekeGuy
• Also on IWMW12 Web site
Today’s Resources
10
http://www.flickr.com/photos/thinkmulejunk/352387473/
http://www.google.co.uk/imgres?q=illumina+bgi&hl=en&client=firefox-a&hs=Jl2&rls=org.mozilla:en-GB:official&biw=1366&bih
http://www.flickr.com/photos/wasp_barcode/4793484478/
http://www.flickr.com/photos/charleswelch/3597432481//
http://www.flickr.com/photos/usfsregion5/4546851916//
What is Data Anyway?
11
• Datum is / data are (!!!):
– Facts and statistics collected together for reference or analysis
– Typically the results of measurements
– Can be qualitative or quantitative
– Unstructured or structured
– Raw data, field data, experimental data
– Data – information – knowledge
– Data is the lowest level of abstraction
• Even researchers don’t know what data is….
A Data Definition
12
“Data underpins our economy and our society - data about how much is being spent and where, data about how schools, hospitals and police are performing, data about where things are and data about the weather.”
Tim Berners-Lee, director of W3C.
A Data Present
13
• Big data
• DIY data
• Consumer data
• Activity data
• Crowd Sourced data
• Linked data / Web of data / semantic Web
• Open data
Some Flavours of Data
14
Big Data
“Data that is too big to manage using ‘normal’ (database) tools.”
“Big data people obviously like alliteration – ‘volume, velocity, variety, value’, ‘speed, size, scope’” Andy Powell
15
“The cost of sequencing DNA has taken a nosedive...and is now dropping by 50% every 5 months”
“I worry there won’t be enough people around to do the analysis” Chris Ponting, University of Oxford
“A single sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project”
“The 1000 Genomes Project generated more DNA sequence data in its first 6 months than GenBank had accumulated in its entire 21 year existence”
“Raw image files for a single human genome have been estimated at 28.8 terabytes, which is approaching 30,000 gigabytes”
Big Data
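The figures quoted on this slide can be sanity-checked with a little arithmetic. The sketch below is illustrative only. It converts terabytes to gigabytes under both the binary (1 TB = 1024 GB) and decimal (1 TB = 1000 GB) conventions, and it works out what "dropping by 50% every 5 months" implies over longer periods, assuming a smooth exponential decline (an assumption, since the quote gives no curve). The function names are hypothetical.

```python
def terabytes_to_gigabytes(terabytes, binary=True):
    """Convert TB to GB using 1024 (binary) or 1000 (decimal) GB per TB."""
    factor = 1024 if binary else 1000
    return terabytes * factor


def cost_fraction(months, halving_period=5):
    """Fraction of the original cost left after `months`, assuming the
    cost halves every `halving_period` months (smooth exponential decay)."""
    return 0.5 ** (months / halving_period)


if __name__ == "__main__":
    print(terabytes_to_gigabytes(28.8))         # binary convention
    print(terabytes_to_gigabytes(28.8, False))  # decimal convention
    print(cost_fraction(12))                    # share of cost left after a year
```

Under the binary convention 28.8 TB comes to roughly 29,491 GB, which fits "approaching 30,000 gigabytes". Halving every 5 months would leave about 19% of the original cost after one year.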
16
Big Data
• 3 Vs: volume, velocity and variety
• Could include scientific & research data, data Web logs, RFID data, social data, search data, video, e-commerce
• Likely to require different tools and practices from what ‘we are used to’
• Technologies include massively parallel processing (MPP) databases, datamining grids, distributed file systems, distributed databases, cloud computing platforms and scalable storage systems
• Example tools are Hadoop, NoSQL, CouchDB
• Issues regarding storage, speed of access, exponential growth, infrastructure, complexity
17
“DIY”
http://www.technologyreview.com/biomedicine/37784/
Kyle Machulis
Human physiology data
DIY Data
18
Consumer Data
19
http://www.touchagency.com/free-twitter-infographic/
Consumer Data
20
Consumer Data
1 in every 9 people on Earth is on Facebook
There are over 6 billion photos on Flickr
30 billion pieces of content are shared on Facebook each month
Google has been estimated to run over 1 million servers in data centers around the world
Walmart take data from 1 million customer transactions per hour
21
Activity Data
• “Data about users’ actions and attention”
• Access, attention and activity
• Many systems in institutions store data about the actions of students, teachers and researchers
• It’s good business
• http://www.activitydata.org/
• JISC Projects:
– Recommender systems
– Improving the student experience
– Resource management
• JISC Info kit – Business intelligence
• Student retention
22
23
“Crowd-sourced” astronomy
Crowd Sourced Data
24
• “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.” Open Knowledge Foundation
• Why? Use of public money, advancement of science
• Why not? Commercial and reputation reasons, cost of preparing data
• “You can do all types of stuff with data” TBL
• But tricky to open access to data (cost, preparation, capturing meaning, annotations, context etc.)
• Data is more valuable when accessible
• Open data on Web: CKAN, open.gov, infochimps, openstreetmap, dbpedia, freebase, numbrary, etc.
Open Data
25
http://www.flickr.com/photos/reedsturtevant/4288406572/
Linked Data
• Repurposing and aggregating data in machine readable format
• Southampton
• data.open.ac.uk
• Lucero project
• Linkeduniversities.org
• XCRI
• Lincoln
• Data.gov.uk
26
• Scale and complexity – data deluge – volume, pace, infrastructure
• Sensitivity of data
• Openness – why aren’t people sharing?
• Quality of data
• Reputation – FOI, DPA, computer misuse
• Management – Storage, incentive, costs & sustainability
• Preservation – where is your data?
• Funding for researchers
• Analysis
• Doing something useful with it…
The Key Data Issues
27
• DPA 1998
– Sensitive Personal Data
“Data regarding an individual’s race or ethnic origin, political opinion, religious beliefs, trade union membership, physical or mental health, sex life, criminal proceedings or convictions…”
– Personal data
  • Relates to a living individual
  • The individual can be identified from those data and other information
  • Includes any expression of opinion about the individual
• Data that may incriminate a person
• Data a person prefers not to share with wider society
Sensitive Data
28
Choices are made according to context, with degrees of openness reached according to:
• The kinds of data to be made available
• The stage in the research process
• The groups to whom data will be made available
• On what terms and conditions it will be provided
Default position of most:
• YES to protocols, software, analysis tools, methods and techniques
• NO to making research data content freely available to everyone
After all, where is the incentive? Angus Whyte, RIN/NESTA, 2010
Openness
29
Reputation
30
The case for cloud computing in genome informatics. Lincoln D Stein, May 2010
• Scalable
• Cost-effective (rent on-demand)
• Secure (privacy and IPR)
• Robust and resilient
• Low entry barrier / ease-of-use
• Has data-handling / transfer / analysis capability
What about Cloud services?
Data Storage Challenges
31
“So what has all this got to do with me..?”
The Web Managers ask:
32
What data do you deal with?
• Personnel data
• Admissions
• Timetables
• Curriculum
• Key information sets
• Web stats…
What do you do with this data?
Could you do more? What?
Break Out Groups
http://sidspace.info/
33
“So what has all this got to do with me..?”
Are the Web Managers still asking?
34
“The ability to take data - to be able to understand it, to process it, to extract value from it, to visualise it, to communicate it – that’s going to be a hugely important skill in the next decades.”
Hal Varian, Google’s chief economist.
A Data Future
35
• Data is relevant to those working with the Web at HEIs because:
• Data will affect your IT infrastructure, if it doesn’t already
• Data is becoming increasingly important for the REF and for funding so it will be increasingly important to your HEI
• It is getting easier to ask for data
• Structured data could make your life easier
• The Web itself is becoming more structured
• Data can show impact
• It’s all about the data….
Web Teams and Data
36
• Unstructured data accounts for more than 90% of digital universe (2011 Digital Universe study)
• Structured data on the rise for some time – deep web, annotation schemes, search data
• In the past web pages have contained information, now is the time for them to contain data
• Some key data areas Web teams need to think about:
– Structure
– Metrics
– Patterns, data mining and analytics
– Preservation (maybe one for another day?)
Web Teams and Data
37
• Move toward a Web that’s more fluid, less fixed, and more easily accessed on a multitude of devices
• futurefriend.ly’s Brad Frost: “get your content ready to go anywhere because it’s going to go everywhere.”
• Karen McGrane calls them “content blobs” – “we can embrace meaningful, modular chunks that are ready to travel”
• Google Knowledge Graph: “currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects”
• Schema.org: “a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers”
Web Data: Structure
38
• There is a need for structured content in Web sites
• ‘Future ready content’ - Sara Wachter-Boettcher
– 1. Get Purposeful – why do users want this content?
– 2. Get Micro – get granular, break content down (schema.org – microdata)
– 3. Get Meaningful – considering the meaning of elements
– 4. Get Organised – looking at your CMS
– 5. Get Structured – DITA? XML? HTML5 (microdata)
• ‘Create once, publish everywhere’ idea – mobile, APIs, etc.
Preparing for Structure
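A minimal sketch of the ‘create once, publish everywhere’ idea, assuming a single structured record that is rendered twice: once as HTML5 microdata using the schema.org Event type, and once as JSON for an API or mobile app. The record, its values and the function names are hypothetical and not taken from any institution's site.

```python
import json
from html import escape

# Hypothetical structured record for an open day. The property names
# (name, startDate, location) follow schema.org's Event type.
open_day = {
    "name": "Undergraduate Open Day",
    "startDate": "2012-09-15",
    "location": "Main Campus",
}


def to_microdata(event):
    """Render the record as HTML5 microdata for a Web page."""
    return (
        '<div itemscope itemtype="https://schema.org/Event">'
        f'<span itemprop="name">{escape(event["name"])}</span> '
        f'<time itemprop="startDate" datetime="{escape(event["startDate"])}">'
        f'{escape(event["startDate"])}</time> '
        f'<span itemprop="location">{escape(event["location"])}</span>'
        "</div>"
    )


def to_json(event):
    """Serialise the same record for an API or mobile app."""
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    print(to_microdata(open_day))
    print(to_json(open_day))
```

The content is authored once as data. Each output format is just a different rendering of it, so a new channel means a new rendering function and no re-authoring.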
39
• Metrics – the new black? Kristen Ratan
• “The more you know the more you realise you don’t know”
• What should we be tracking? e.g. Figures opened, downloaded, links clicked, time spent on article page, supplemental info viewed, authors’ info viewed
• Look at the pathways that info travels
• Data can drive tenure and promotion, grants, reputation, discovery, prioritization, attention
• Issues: Missed citation data, data sources that aren’t reliable, digital addresses change, usage doesn’t mean useful
Web Data: Metrics
40
“In other words, we no longer need to speculate and hypothesise; we simply need to let machines lead us to the patterns, trends, and relationships in social, economic, political, and environmental relationships.”
Mark Graham, Big Data blog, the Guardian.
Web Data: Patterns
41
• Customers expect us to be leveraging their activity to benefit their user experience
• “the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data.” Adam Cooper, CETIS
• Reporting and descriptive methods Vs inferential and predictive methods
• Data driven decisions? “human decisions supported by the use of good tools to provide us with data-derived insights”
• Don’t “let the numbers speak for themselves” – data only one input to decision process
• Data specialists and domain specialists work together
• Need to ask the right questions
Web Data: Analytics
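The contrast between reporting/descriptive methods and inferential/predictive methods can be sketched as follows. This is an illustration under stated assumptions: the page-view numbers are made up, the function names are hypothetical, and a straight-line least-squares fit stands in for whatever statistical model an analyst would actually choose.

```python
from statistics import mean

# Hypothetical monthly page views for one section of a Web site.
page_views = [1200, 1350, 1280, 1500, 1620, 1710]


def describe(values):
    """Descriptive: summarise what has already happened."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def linear_forecast(values, steps_ahead=1):
    """Predictive: fit a least-squares line through the values
    (x = 0, 1, 2, ...) and extrapolate it `steps_ahead` periods."""
    n = len(values)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * (n - 1 + steps_ahead)


if __name__ == "__main__":
    print(describe(page_views))
    print(linear_forecast(page_views))
```

The forecast is only one input to a decision. Whether a linear trend is a sensible model for the data is exactly the kind of question a domain specialist needs to answer.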
42
• “The measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs.” 1st International Conference on Learning Analytics & Knowledge
• Open University Learner Analytics Project
– Looked at withdrawals - e.g. when students stop study before completion of a module towards a degree
– Possible to map the points on paths of study at which withdrawals occur.
• Other uses: personalisation, recommendation, research profiles, marketing and surveys, help desk, CRM, library
• Looking at disabled students/accessibility – linking learner analytics and web metrics
Web Data: Learning Analytics
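Mapping where on a path of study withdrawals occur can be as simple as counting them per module and week. The sketch below assumes a made-up record layout and made-up values. It is not the Open University project's method, and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical records: (student id, module, week of withdrawal).
# A week of None means the student completed the module.
records = [
    ("s1", "M101", 3),
    ("s2", "M101", None),
    ("s3", "M101", 3),
    ("s4", "M202", 10),
    ("s5", "M202", None),
]


def withdrawal_points(rows):
    """Count withdrawals per (module, week), ignoring completions."""
    return Counter(
        (module, week) for _, module, week in rows if week is not None
    )


if __name__ == "__main__":
    for (module, week), count in withdrawal_points(records).most_common():
        print(module, week, count)
```

Sorting by count puts the points with the most withdrawals first, which is where support might be targeted.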
43
• The Protection of Freedoms Bill is a UK parliamentary bill introduced in February 2011
• Has completed its readings – now passing through the House of Lords
• 102 - amendments to FOIA - mandatory for public authorities to permit re-use of datasets when communicating them in response to a FOI request
• Datasets are collections of information held in electronic form i.e. 'raw data' gathered or created in connection with the university's functions or 'services'
• Government’s Innovation and Research Strategy for Growth - “a transformation in the accessibility of research and data”
Protection of Freedoms Bill
44
Tools that Could Help
http://www.flickr.com/photos/luc/5418037955/
45
Tools: Structure
• Schema.org
• Google Rich Snippets testing tool – tests microdata, microformats, RDFa
• List of tools on Semanticweb.org
46
Tools: Metrics & Text Mining
• Google Analytics
• Elsevier
• total-impact
• altmetric.com
47
Tools: Analytics
• SNAPP: Social Networks Adapting Pedagogical Practice
• GLASS (Gradient’s Learning Analytics System)
• International Educational Data Mining society
• Learning Analytics and Knowledge Conference
48
Data Visualisations
• Use your IT and your graphics design department
• Make it interactive
• Getting Awesome Results from Data Visualisation – Rich Kirk
• Data visualisation strategy
– Have a purpose
– Have measurable KPIs vs purpose
– Plan distribution in advance
– Resource
– Ensure visualisation matches purpose
• Chart chooser (Gene Zelazny's Saying It With Charts)
• Measurement: pageviews, buzz, links, key word ranking
• “Tell a story with your data” – Ewan McIntosh at IDCC11
49
Data Visualisation Help
• Great Web sites
– Ewan McIntosh
– Information is Beautiful
– Pinterest
– Guardian data blog
– Flowing data
– Infosthetics – information aesthetics – where form follows data
• Great tools
– Manyeyes
– Chartsbin, icharts, Google chart tools – Google developer
– Google Fusion tables
– Tableau public
– Datamarket
– Colour Brewer
50
Visualisations: Google Maps
51
Data Case Study: Southampton
• Not big data but small data
• Got to be useful!!
Chris Gutteridge - http://blogs.ecs.soton.ac.uk/data/
52
• Places: Buildings, Rooms, Campuses, Counties, Disabled Access
• Organisation Structure
• Products & Services: Coffee, Sandwiches, Library Services, Recycle Points
• Points of Service: Coffee Shops, Swimming Pools, Libraries, Receptions
• Teaching: Courses, Modules, Statistics, Student Satisfaction
• Travel: Stations, Bus-Stops, Bus-Routes, Bus Times
• Resources: EPrints, Videos, Learning Objects
• People: Contact Information, Experts for the Media
• Events: Open Days, University History
• Jargon
Southampton Data
53
Southampton Open Data
54
• Google Docs, Excel spreadsheets, RDF, triples
• Grinder – GitHub
• Graphite – PHP library
• Graphite (publishing RDF). Required skills:
– RDF structure
– RDF/XML
– XSLT
• Graphite (consuming RDF). Required skills:
– RDF structure
– PHP
Southampton Uses…
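For readers new to ‘RDF structure’, the core idea is that every statement is a subject, predicate, object triple. The sketch below is a plain-Python illustration of that model and of pattern-matching over it. It does not use Graphite (a PHP library) or any RDF toolkit, and the identifiers, labels and function names are all made up.

```python
# Hypothetical triples in the style of an institutional open data set.
# Each entry is (subject, predicate, object).
TRIPLES = [
    ("ex:building32", "rdf:type", "ex:Building"),
    ("ex:building32", "rdfs:label", "Building 32"),
    ("ex:cafe1", "rdf:type", "ex:CoffeeShop"),
    ("ex:cafe1", "ex:within", "ex:building32"),
    ("ex:cafe1", "rdfs:label", "Ground Floor Cafe"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def labels_within(triples, place):
    """Labels of everything stated to be within the given place."""
    things = [s for s, _, _ in match(triples, p="ex:within", o=place)]
    return [
        o
        for thing in things
        for _, _, o in match(triples, s=thing, p="rdfs:label")
    ]


if __name__ == "__main__":
    print(labels_within(TRIPLES, "ex:building32"))
```

Because every fact has the same three-part shape, data about buildings, coffee shops and bus stops can sit in one store and be joined by following shared identifiers.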
55
Data Case Study: Aberdeen
“I managed the Web and then inherited MIS. These two have now converged so that Web is using much better, structured data and standardising and consolidating sources. The MIS brings discipline to the Web – much needed if you ask me, anarchist though I am...”
Mike McConnell, Head of Web Services, University of Aberdeen.
56
• Loughborough University’s Pedestal for Progression
• Roehampton University’s fulCRM
• Southampton Student Dashboard at the University of Southampton
– tutees, directory info, whether coursework has been handed in, and attendance
• University of Derby’s SETL (Student Engagement Traffic Lighting)
• The ESCAPES (Enhancing Student Centred Administration for Placement ExperienceS) project at the University of Nottingham
Student Attendance Data
57
• At the moment it’s all about the data… (whether you like it or not!)
• Be aware of what is happening with data at your institution – data repository, MIS, RIM, CRIS, repository etc. Where do you sit in the picture?
• Structure your Web data – it makes sense
• You can start with ‘little data’…
• Think about what strategic questions you want to ask
• Be grounded – efficiency and effectiveness
• Start from the user end - think about the uses and output
• Follow up from the IT end – how can you automate processes?
• What can you use your data for? Can you show impact/success?
• How about telling a story with it?
Conclusions
58
Buzzword Bingo
cloud computing
Linked data
knowledge discovery in data (KDD)
data mining
predictive analytics
Big data
Data-Driven Decision making
clustering
data journalism
para data
data tsunami
data wrangler
data scientist
59
• From Guardian Datablog, by Jonathan Gray
• Data is not a force unto itself.
• Data is not a perfect reflection of the world.
• Data does not speak for itself.
• Data is not power.
• Interpreting data is not easy.
What Data Can and Cannot Do
60
Thanks!!
“The data that is valuable to you is already passing through your hands”
Doug Cutting, Chairman, Apache Software Foundation