
Big and Small Web Data


DESCRIPTION

Workshop session given at the Institutional Web Management Workshop 2012 (IWMW 2012) event held at the University of Edinburgh on 18th - 20th June 2012.


Page 1: Big and Small Web Data

UKOLN is supported by:

Big and Small Web Data

Marieke Guy, Institutional Support Officer, Digital Curation Centre, UKOLN, University of Bath, UK

Institutional Web Management Workshop 2012

This work is licensed under a Creative Commons Licence Attribution-ShareAlike 2.0

Page 2: Big and Small Web Data

Who Am I?

• Have worked for UKOLN for over 12 years
• Worked on a variety of projects: Subject portals project, IMPACT, Good APIs, JISC Observatory, cultural heritage work, digital preservation work, etc.
• Remote worker, into amplified events
• Co-chair of IWMW for a number of years
• Now working for the Digital Curation Centre
• Institutional Support Officer helping HEIs with their RDM
• New to data…

Page 3: Big and Small Web Data

The Digital Curation Centre

• A consortium comprising units from the Universities of Bath (UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII)
• Launched 1st March 2004 as a national centre for solving challenges in digital curation that could not be tackled by any single institution or discipline
• Funded by JISC with additional HEFCE funding from 2011 for the provision of support to national cloud services
• Targeted institutional development
• http://www.dcc.ac.uk/

Page 4: Big and Small Web Data

Assessing Data Use

Page 5: Big and Small Web Data

Data Management Tools

Page 6: Big and Small Web Data

How to cite data

Advocacy and Training

• Informatics: disciplinary metadata schema, standards, formats, identifiers, ontologies
• Storage: file-store, cloud, data centres, funder policy
• Access: embargoes, FOI
• Policy: making the case

Page 7: Big and Small Web Data

Who Are You?

• Are you part of a Web team?
• Are you part of a MIS team?
• Are you a researcher?
• Do you know what data is?
• Do you use structured data?
• Do you manage data?

Page 8: Big and Small Web Data

Today’s Workshop: A Data Journey!

• Presentation: What is data anyway? Looking at current data trends and what they have to do with Web managers
• Break out groups: What data do you deal with? Anything goes, from personnel data to Key Information Sets and Web stats…
• Presentation/Show and Tell: Taster of tools that help with data (mining, citation, visualization, analytics, etc.)
• Presentation: Case study - Data @ Southampton
• Discussion and buzzword bingo

Page 9: Big and Small Web Data

Today’s Resources

• All URLs at: http://www.delicious.com/mariekeguy/iwmw12
• All slides at: http://www.slideshare.net/MariekeGuy
• Also on IWMW12 Web site

Page 10: Big and Small Web Data

What is Data Anyway?

http://www.flickr.com/photos/thinkmulejunk/352387473/
http://www.google.co.uk/imgres?q=illumina+bgi&hl=en&client=firefox-a&hs=Jl2&rls=org.mozilla:en-GB:official&biw=1366&bih
http://www.flickr.com/photos/wasp_barcode/4793484478/
http://www.flickr.com/photos/charleswelch/3597432481/
http://www.flickr.com/photos/usfsregion5/4546851916/

Page 11: Big and Small Web Data

A Data Definition

• Datum is / data are (!!!):
– Facts and statistics collected together for reference or analysis
– Typically the results of measurements
– Can be qualitative or quantitative
– Unstructured or structured
– Raw data, field data, experimental data
– Data – information – knowledge
– Data is the lowest level of abstraction
• Even researchers don’t know what data is…

Page 12: Big and Small Web Data

A Data Present

“Data underpins our economy and our society - data about how much is being spent and where, data about how schools, hospitals and police are performing, data about where things are and data about the weather.”

Tim Berners-Lee, Director of W3C.

Page 13: Big and Small Web Data

Some Flavours of Data

• Big data
• DIY data
• Consumer data
• Activity data
• Crowd Sourced data
• Linked data / Web of data / semantic Web
• Open data

Page 14: Big and Small Web Data

Big Data

“Data that is too big to manage using ‘normal’ (database) tools.”

“big data people obviously like alliteration – ‘volume, velocity, variety, value’, ‘speed, size, scope’” Andy Powell

Page 15: Big and Small Web Data

Big Data

“The cost of sequencing DNA has taken a nosedive...and is now dropping by 50% every 5 months”

“I worry there won’t be enough people around to do the analysis” Chris Ponting, University of Oxford

“A single sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project”

“The 1000 Genomes Project generated more DNA sequence data in its first 6 months than GenBank had accumulated in its entire 21 year existence”

“Raw image files for a single human genome have been estimated at 28.8 terabytes, which is approaching 30,000 gigabytes”
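The figures quoted on this slide can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the function names are invented for this example, it assumes the quoted halving rate stays constant, and it assumes the "approaching 30,000 gigabytes" figure uses binary units (1 TB = 1,024 GB).

```python
def cost_fraction(months, halving_period=5.0):
    """Fraction of the starting cost left after `months`,
    assuming the cost halves every `halving_period` months."""
    return 0.5 ** (months / halving_period)


def terabytes_to_gigabytes(tb, binary=True):
    """Convert terabytes to gigabytes (1,024 GB per TB if binary, else 1,000)."""
    return tb * (1024 if binary else 1000)


# After one halving period the cost is half of what it was.
print(cost_fraction(5))
# After a year at this rate, less than a fifth of the original cost remains.
print(round(cost_fraction(12), 3))
# 28.8 TB expressed in gigabytes, in binary and decimal units.
print(terabytes_to_gigabytes(28.8), terabytes_to_gigabytes(28.8, binary=False))
```

In binary units 28.8 TB comes to roughly 29,500 GB, which matches the "approaching 30,000 gigabytes" wording; in decimal units it would be 28,800 GB.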

Page 16: Big and Small Web Data

Big Data

• 3 Vs: volume, velocity and variety
• Could include scientific & research data, Web logs, RFID data, social data, search data, video, e-commerce
• Likely to require different tools and practices from what ‘we are used to’
• Technologies include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms and scalable storage systems
• Example tools are Hadoop, NoSQL, CouchDB
• Issues regarding storage, speed of access, exponential growth, infrastructure, complexity
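Hadoop is built around the MapReduce pattern: work is split into chunks, each chunk is processed independently (map), and the partial results are merged (reduce). The sketch below shows that idea in plain Python on a handful of made-up Web log lines. It does not use Hadoop or any of its APIs; the log lines, function names and chunking are assumptions for illustration.

```python
from collections import Counter
from functools import reduce

# Made-up log lines in a common Web server log layout.
LOG_LINES = [
    '10.0.0.1 - - [18/Jun/2012:10:00:01 +0100] "GET /courses HTTP/1.1" 200 5120',
    '10.0.0.2 - - [18/Jun/2012:10:00:02 +0100] "GET /missing HTTP/1.1" 404 312',
    '10.0.0.1 - - [18/Jun/2012:10:00:05 +0100] "GET /courses HTTP/1.1" 200 5120',
    '10.0.0.3 - - [18/Jun/2012:10:00:09 +0100] "GET /events HTTP/1.1" 200 2048',
]


def map_line(line):
    """Map step: turn one log line into a count of one for its request path."""
    request = line.split('"')[1]  # e.g. 'GET /courses HTTP/1.1'
    path = request.split()[1]
    return Counter({path: 1})


def merge(left, right):
    """Reduce step: add two sets of partial counts together."""
    return left + right


def count_paths(chunks):
    """Count requests per path, processing each chunk independently first."""
    partials = [reduce(merge, map(map_line, chunk), Counter()) for chunk in chunks]
    return reduce(merge, partials, Counter())


# Two chunks stand in for data spread across two machines.
totals = count_paths([LOG_LINES[:2], LOG_LINES[2:]])
print(totals.most_common())
```

Because each chunk is counted on its own before the merge, the same logic could in principle be spread across many machines, which is the property that makes this pattern suit very large logs.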

Page 17: Big and Small Web Data

DIY Data

“DIY”

http://www.technologyreview.com/biomedicine/37784/

Kyle Machulis

Human physiology data

Page 18: Big and Small Web Data

Consumer Data

Page 19: Big and Small Web Data

Consumer Data

http://www.touchagency.com/free-twitter-infographic/

Page 20: Big and Small Web Data

Consumer Data

1 in every 9 people on Earth is on Facebook

There are over 6 billion photos on Flickr

30 billion pieces of content are shared on Facebook each month

Google has been estimated to run over 1 million servers in data centers around the world

Walmart take data from 1 million customer transactions per hour

Page 21: Big and Small Web Data

Activity Data

• “Data about users’ actions and attention”
• Access, attention and activity
• Many systems in institutions store data about the actions of students, teachers and researchers
• It’s good business
• http://www.activitydata.org/
• JISC Projects:
– Recommender systems
– Improving the student experience
– Resource management
• JISC Info kit – Business intelligence
• Student retention

Page 22: Big and Small Web Data

Page 23: Big and Small Web Data

Crowd Sourced Data

“Crowd-sourced” astronomy

Page 24: Big and Small Web Data

• “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and share-alike.” Open Knowledge Foundation

• Why? Use of public money, advancement of science
• Why not? Commercial and reputation reasons, cost of preparing data
• “You can do all types of stuff with data” TBL
• But tricky to open access to data (cost, preparation, capturing meaning, annotations, context, etc.)
• Data is more valuable when accessible
• Open data on Web: CKAN, open.gov, infochimps, openstreetmap, dbpedia, freebase, numbrary, etc.

Open Data

Page 25: Big and Small Web Data

Linked Data

http://www.flickr.com/photos/reedsturtevant/4288406572/

• Repurposing and aggregating data in machine readable format
• Southampton
• data.open.ac.uk
• Lucero project
• Linkeduniversities.org
• XCRI
• Lincoln
• Data.gov.uk

Page 26: Big and Small Web Data

The Key Data Issues

• Scale and complexity – data deluge – volume, pace, infrastructure
• Sensitivity of data
• Openness – why aren’t people sharing?
• Quality of data
• Reputation – FOI, DPA, computer misuse
• Management – Storage, incentive, costs & sustainability
• Preservation – where is your data?
• Funding for researchers
• Analysis
• Doing something useful with it…

Page 27: Big and Small Web Data

Sensitive Data

• DPA 1998
– Sensitive Personal Data: “Data regarding an individual’s race or ethnic origin, political opinion, religious beliefs, trade union membership, physical or mental health, sex life, criminal proceedings or convictions…”
– Personal data
  • Relates to a living individual
  • The individual can be identified from those data and other information
  • Includes any expression of opinion about the individual
• Data that may incriminate a person
• Data a person prefers not to share with wider society

Page 28: Big and Small Web Data

Openness

Choices are made according to context, with degrees of openness reached according to:
• The kinds of data to be made available
• The stage in the research process
• The groups to whom data will be made available
• On what terms and conditions it will be provided

Default position of most:
• YES to protocols, software, analysis tools, methods and techniques
• NO to making research data content freely available to everyone

After all, where is the incentive?

Angus Whyte, RIN/NESTA, 2010

Page 29: Big and Small Web Data

Reputation

Page 30: Big and Small Web Data

Data Storage Challenges

What about Cloud services?

• Scalable
• Cost-effective (rent on-demand)
• Secure (privacy and IPR)
• Robust and resilient
• Low entry barrier / ease-of-use
• Has data-handling / transfer / analysis capability

The case for cloud computing in genome informatics. Lincoln D Stein, May 2010

Page 31: Big and Small Web Data

The Web Managers ask:

“So what has all this got to do with me..?”

Page 32: Big and Small Web Data

Break Out Groups

What data do you deal with?

• Personnel data
• Admissions
• Timetables
• Curriculum
• Key Information Sets
• Web stats…

What do you do with this data?

Could you do more? What?

http://sidspace.info/

Page 33: Big and Small Web Data

Are the Web Managers still asking?

“So what has all this got to do with me..?”

Page 34: Big and Small Web Data

A Data Future

“The ability to take data - to be able to understand it, to process it, to extract value from it, to visualise it, to communicate it – that’s going to be a hugely important skill in the next decades.”

Hal Varian, Chief Economist, Google

Page 35: Big and Small Web Data

Web Teams and Data

• Data is relevant to those working with the Web at HEIs because:
• Data will affect your IT infrastructure, if it doesn’t already
• Data is becoming increasingly important for the REF and for funding so it will be increasingly important to your HEI
• It is getting easier to ask for data
• Structured data could make your life easier
• The Web itself is becoming more structured
• Data can show impact
• It’s all about the data…

Page 36: Big and Small Web Data

Web Teams and Data

• Unstructured data accounts for more than 90% of the digital universe (2011 Digital Universe study)
• Structured data has been on the rise for some time – deep web, annotation schemes, search data
• In the past web pages have contained information; now is the time for them to contain data
• Some key data areas Web teams need to think about:
– Structure
– Metrics
– Patterns, data mining and analytics
– Preservation (maybe one for another day?)

Page 37: Big and Small Web Data

Web Data: Structure

• Move toward a Web that’s more fluid, less fixed, and more easily accessed on a multitude of devices
• futurefriend.ly’s Brad Frost: “get your content ready to go anywhere because it’s going to go everywhere.”
• Karen McGrane calls them “content blobs” – “we can embrace meaningful, modular chunks that are ready to travel”
• Google Knowledge Graph: “currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects”
• Schema.org: “a collection of schemas, i.e., html tags, that webmasters can use to markup their pages in ways recognized by major search providers”
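To make the schema.org point concrete, here is a minimal sketch of what microdata markup looks like and how a program might read it back. The HTML snippet, the "University of Example" details and the `MicrodataParser` class are all invented for illustration; the parser uses only Python's standard `html.parser` module and handles just the simple, non-nested case, so it is not a full microdata implementation.

```python
from html.parser import HTMLParser

# A made-up page fragment marked up with schema.org microdata.
SAMPLE = """
<div itemscope itemtype="http://schema.org/CollegeOrUniversity">
  <h1 itemprop="name">University of Example</h1>
  <a itemprop="url" href="http://www.example.ac.uk/">Home page</a>
  <span itemprop="telephone">+44 1234 567890</span>
</div>
"""


class MicrodataParser(HTMLParser):
    """Collect itemprop values from flat (non-nested) microdata."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current = None  # itemprop whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag == "a" and "href" in attrs:
            # For links, the property value is the URL, not the link text.
            self.properties[prop] = attrs["href"]
        else:
            self._current = prop
            self.properties[prop] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._current = None


parser = MicrodataParser()
parser.feed(SAMPLE)
print(parser.item_type)
print(parser.properties)
```

The same page still reads normally to a person, but a search engine or script can now pick out the name, URL and telephone number without guessing.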

Page 38: Big and Small Web Data

Preparing for Structure

• There is a need for structured content in Web sites
• ‘Future ready content’ - Sara Wachter-Boettcher
– 1. Get Purposeful – why do users want this content?
– 2. Get Micro – get granular, break content down (schema.org – microdata)
– 3. Get Meaningful – considering the meaning of elements
– 4. Get Organised – looking at your CMS
– 5. Get Structured – DITA? XML? HTML5 (microdata)
• ‘Create once, publish everywhere’ idea – mobile, APIs, etc.
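A rough sketch of the ‘create once, publish everywhere’ idea: one structured record, several outputs. The `Event` class, its field names and the three output functions are assumptions made for this example rather than any particular CMS's model; the sample record uses this workshop's own event details.

```python
import json
from dataclasses import asdict, dataclass
from html import escape


@dataclass
class Event:
    """One granular, structured chunk of content."""
    title: str
    date: str
    location: str
    summary: str


def to_html(event):
    """Web page fragment."""
    return (
        f"<h2>{escape(event.title)}</h2>\n"
        f"<p>{escape(event.date)}, {escape(event.location)}</p>\n"
        f"<p>{escape(event.summary)}</p>"
    )


def to_json(event):
    """Payload for an API or mobile app."""
    return json.dumps(asdict(event), sort_keys=True)


def to_text(event, limit=140):
    """Short plain-text version, e.g. for a status update."""
    text = f"{event.title}, {event.date}, {event.location}: {event.summary}"
    return text if len(text) <= limit else text[: limit - 3] + "..."


event = Event(
    title="IWMW 2012",
    date="2012-06-18",
    location="University of Edinburgh",
    summary="Institutional Web Management Workshop 2012",
)
print(to_html(event))
print(to_json(event))
print(to_text(event))
```

The content is written once as fields; each channel decides how to present it, so a correction to the date or location only needs to be made in one place.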

Page 39: Big and Small Web Data

Web Data: Metrics

• Metrics – the new black? Kristen Ratan
• “The more you know the more you realise you don’t know”
• What should we be tracking? e.g. figures opened, downloaded, links clicked, time spent on article page, supplemental info viewed, authors’ info viewed
• Look at the pathways that info travels
• Data can drive tenure and promotion, grants, reputation, discovery, prioritization, attention
• Issues: missed citation data, data sources that aren’t reliable, digital addresses change, usage doesn’t mean useful
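As a small illustration of the kind of tracking listed on this slide, the sketch below tallies events per article and works out the average time spent per page view. The event names, article ids and timings are made up for the example, and no particular analytics product's data format is implied.

```python
from collections import Counter, defaultdict

# Hypothetical tracking events: (article id, event type, seconds on page).
EVENTS = [
    ("article-1", "page_view", 95),
    ("article-1", "figure_opened", 0),
    ("article-1", "link_clicked", 0),
    ("article-2", "page_view", 20),
    ("article-1", "page_view", 145),
    ("article-2", "download", 0),
]


def summarise(events):
    """Count each event type per article and average the time per page view."""
    counts = defaultdict(Counter)
    seconds = defaultdict(int)
    for article, kind, secs in events:
        counts[article][kind] += 1
        seconds[article] += secs
    summary = {}
    for article in counts:
        views = counts[article]["page_view"]
        summary[article] = {
            "counts": dict(counts[article]),
            "mean_seconds_per_view": seconds[article] / views if views else 0.0,
        }
    return summary


summary = summarise(EVENTS)
for article, stats in sorted(summary.items()):
    print(article, stats)
```

As the slide warns, usage doesn’t mean useful: a tally like this says what happened, not whether it mattered.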

Page 40: Big and Small Web Data

Web Data: Patterns

“In other words, we no longer need to speculate and hypothesise; we simply need to let machines lead us to the patterns, trends, and relationships in social, economic, political, and environmental relationships.”

Mark Graham, Big Data blog, the Guardian.

Page 41: Big and Small Web Data

Web Data: Analytics

• Customers expect us to be leveraging their activity to benefit their user experience
• “the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data.” Adam Cooper, CETIS
• Reporting and descriptive methods vs inferential and predictive methods
• Data driven decisions? “human decisions supported by the use of good tools to provide us with data-derived insights”
• Don’t “let the numbers speak for themselves” – data is only one input to the decision process
• Data specialists and domain specialists work together
• Need to ask the right questions
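The difference between descriptive and predictive methods can be shown on a tiny example. The weekly visit figures below are made up, and the straight-line fit is just one simple predictive technique chosen for illustration; the function names are not from any analytics tool.

```python
from statistics import mean

# Made-up weekly visit counts for weeks 0 to 3.
WEEKLY_VISITS = [100, 110, 120, 130]


def describe(values):
    """Descriptive: summarise what has already happened."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def fit_line(values):
    """Least-squares straight line through (week, value) points."""
    xs = range(len(values))
    mx, my = mean(xs), mean(values)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict(values, week):
    """Predictive: extend the fitted trend to a future week."""
    slope, intercept = fit_line(values)
    return intercept + slope * week


print(describe(WEEKLY_VISITS))
print(predict(WEEKLY_VISITS, 4))
```

The prediction is only as good as the assumption that the trend continues, which is one reason the numbers should not be left to speak for themselves.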

Page 42: Big and Small Web Data

Web Data: Learning Analytics

• “The measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs.” 1st International Conference on Learning Analytics & Knowledge
• Open University Learner Analytics Project
– Looked at withdrawals, e.g. when students stop study before completion of a module towards a degree
– Possible to map at what points on paths of study withdrawals occur
• Other uses: personalisation, recommendation, research profiles, marketing and surveys, help desk, CRM, library
• Looking at disabled students/accessibility – linking learner analytics and web metrics
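Mapping where withdrawals occur can start as a simple tally. The records below are invented, and the field names (`withdrew_in_week` and so on) are assumptions for this sketch; it is not based on the Open University project's actual data or methods.

```python
from collections import Counter

# Invented records: week of withdrawal, or None if the student completed.
RECORDS = [
    {"student": "s1", "module": "M101", "withdrew_in_week": 3},
    {"student": "s2", "module": "M101", "withdrew_in_week": None},
    {"student": "s3", "module": "M101", "withdrew_in_week": 3},
    {"student": "s4", "module": "M101", "withdrew_in_week": 9},
    {"student": "s5", "module": "M101", "withdrew_in_week": None},
]


def withdrawals_by_week(records):
    """Count withdrawals at each week of the module."""
    return Counter(
        r["withdrew_in_week"] for r in records if r["withdrew_in_week"] is not None
    )


def peak_week(records):
    """Return (week, count) for the week with most withdrawals, or None."""
    counts = withdrawals_by_week(records)
    return counts.most_common(1)[0] if counts else None


print(sorted(withdrawals_by_week(RECORDS).items()))
print(peak_week(RECORDS))
```

A peak at a particular week is a prompt to ask what happens at that point in the module, not an answer in itself.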

Page 43: Big and Small Web Data

Protection of Freedoms Bill

• The Protection of Freedoms Bill is a UK parliamentary bill introduced in February 2011
• Has completed its readings – now passing through the House of Lords
• 102 - amendments to FOIA - mandatory for public authorities to permit re-use of datasets when communicating them in response to a FOI request
• Datasets are collections of information held in electronic form, i.e. ‘raw data’ gathered or created in connection with the university’s functions or ‘services’
• Government’s Innovation and Research Strategy for Growth - “a transformation in the accessibility of research and data”

Page 44: Big and Small Web Data

Tools that Could Help

http://www.flickr.com/photos/luc/5418037955/

Page 45: Big and Small Web Data

Tools: Structure

• Schema.org
• Google Rich Snippets testing tool – tests microdata, microformats, RDFa
• List of tools on Semanticweb.org

Page 46: Big and Small Web Data

Tools: Metrics & Text Mining

• Google Analytics
• Elsevier
• total-impact
• altmetric.com

Page 47: Big and Small Web Data

Tools: Analytics

• SNAPP: Social Networks Adapting Pedagogical Practice
• GLASS (Gradient’s Learning Analytics System)
• International Educational Data Mining society
• Learning Analytics and Knowledge Conference

Page 48: Big and Small Web Data

Data Visualisations

• Use your IT and your graphics design department
• Make it interactive
• Getting Awesome Results from Data Visualisation – Rich Kirk
• Data visualisation strategy
– Have a purpose
– Have measurable KPIs vs purpose
– Plan distribution in advance
– Resource
– Ensure visualisation matches purpose
• Chart chooser (Gene Zelazny's Saying It With Charts)
• Measurement: pageviews, buzz, links, key word ranking
• “Tell a story with your data” – Ewan McIntosh at IDCC11

Page 49: Big and Small Web Data

Data Visualisation Help

• Great Web sites
– Ewan McIntosh
– Information is Beautiful
– Pinterest
– Guardian data blog
– Flowing data
– Infosthetics – information aesthetics – where form follows data
• Great tools
– Manyeyes
– Chartsbin, icharts, Google chart tools – Google developer
– Google Fusion tables
– Tableau public
– Datamarket
– Colour Brewer

Page 50: Big and Small Web Data

Visualisations: Google Maps

Page 51: Big and Small Web Data

Data Case Study: Southampton

• Not big data but small data

• Got to be useful!!

Chris Gutteridge - http://blogs.ecs.soton.ac.uk/data/

Page 52: Big and Small Web Data

Southampton Data

• Places: Buildings, Rooms, Campuses, Counties, Disabled Access
• Organisation Structure
• Products & Services: Coffee, Sandwiches, Library Services, Recycle Points
• Points of Service: Coffee Shops, Swimming Pools, Libraries, Receptions
• Teaching: Courses, Modules, Statistics, Student Satisfaction
• Travel: Stations, Bus-Stops, Bus-Routes, Bus Times
• Resources: EPrints, Videos, Learning Objects
• People: Contact Information, Experts for the Media
• Events: Open Days, University History
• Jargon

Page 53: Big and Small Web Data

Southampton Open Data

Page 54: Big and Small Web Data

Southampton Uses…

• Google docs, Excel spreadsheets, RDF, triples
• Grinder – GitHub
• Graphite – PHP library
• Graphite (publishing RDF). Required skills:
– RDF structure
– RDF/XML
– XSLT
• Graphite (consuming RDF). Required skills:
– RDF structure
– PHP
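For anyone new to "RDF structure", the core idea is the triple: subject, predicate, object. The sketch below writes a couple of triples out in the N-Triples text format using plain Python. It does not use Graphite, Grinder or any Southampton code; the `example.org` identifiers and the helper names are made up for illustration, while the `rdf:type` and `rdfs:label` predicates are the standard W3C ones.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def iri(value):
    """Wrap an identifier in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical identifiers for a building and its class.
building = iri("http://example.org/building/1")
triples = [
    (building, iri(RDF_TYPE), iri("http://example.org/vocab/Building")),
    (building, iri(RDFS_LABEL), literal("Library")),
]
print(to_ntriples(triples))
```

Each line says one thing about one resource, which is what makes data in this form easy to aggregate from many small sources, such as a spreadsheet of buildings or a list of bus stops.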

Page 55: Big and Small Web Data

Data Case Study: Aberdeen

“I managed the Web and then inherited MIS. These two have now converged so that Web is using much better, structured data and standardising and consolidating sources. The MIS brings discipline to the Web – much needed if you ask me, anarchist though I am...”

Mike McConnell, Head of Web Services, University of Aberdeen.

Page 56: Big and Small Web Data

Student Attendance Data

• Loughborough University’s Pedestal for Progression
• Roehampton University’s fulCRM
• Southampton Student Dashboard at the University of Southampton
– Tutees, directory info, whether coursework has been handed in, and attendance
• University of Derby’s SETL (Student Engagement Traffic Lighting)
• The ESCAPES (Enhancing Student Centred Administration for Placement ExperienceS) project at the University of Nottingham

Page 57: Big and Small Web Data

Conclusions

• At the moment it’s all about the data… (whether you like it or not!)
• Be aware of what is happening with data at your institution – data repository, MIS, RIM, CRIS, repository etc. Where do you sit in the picture?
• Structure your Web data – it makes sense
• You can start with ‘little data’…
• Think about what strategic questions you want to ask
• Be grounded – efficiency and effectiveness
• Start from the user end - think about the uses and output
• Follow up from the IT end – how can you automate processes?
• What can you use your data for? Can you show impact/success?
• How about telling a story with it?

Page 58: Big and Small Web Data

Buzzword Bingo

cloud computing

Linked data

knowledge discovery in data (KDD)

data mining

predictive analytics

Big data

Data-Driven Decision making

clustering

data journalism

para data

data tsunami

data wrangler

data scientist

Page 59: Big and Small Web Data

What Data Can and Cannot Do

• From Guardian Datablog, by Jonathan Gray
• Data is not a force unto itself.
• Data is not a perfect reflection of the world.
• Data does not speak for itself.
• Data is not power.
• Interpreting data is not easy.

Page 60: Big and Small Web Data

Thanks!!

“The data that is valuable to you is already passing through your hands”

Doug Cutting, Chairman, Apache Software Foundation