Metadata and Information Visualization Naomi Dushay Cornell Information Science National Science Digital Library

  • Slide 1
  • Metadata and Information Visualization Naomi Dushay Cornell Information Science National Science Digital Library
  • Slide 2
  • What is NSDL? National Science Digital Library. Purpose: educational; broad definition of Science (also Technology, Engineering, Mathematics, etc.); production; research. Users: teachers, students, researchers, general public; K-gray. http://nsdl.org; http://comm.nsdl.org. Virtual communities.
  • Slide 3
  • NSDL: Metadata Aggregator. Centralized Metadata Repository. Two-tiered model: collections & items. Item records harvested from collections. Diverse metadata formats and granularity levels.
  • Slide 4
  • NSDL Architecture [diagram; labels: Metadata Repository, collection, item, resource, Search Service, UI]
  • Slide 5
  • Goal: Provide Normalized Metadata. Why? Quality of NSDL services (e.g. search results, or UI display); enhance predictability of metadata for reharvesting services; improve metadata quality, when possible. How?
  • Slide 6
  • Metadata Normalization Challenges. Broad content: types of resources; topics. Metadata quality: wildly inconsistent (what fields are used, what info is present); missing information; consistent, controlled vocabularies? Fuggedaboutit. Disparate quantities (by subject, by collection): 7 vs. 300,000 items. Virtual communities: within communities, no agreement on needs. Reduce human effort to keep costs down.
  • Slide 7
  • Metadata in the MARC World. Relatively controlled, closed system with strong community. Comprehensive and current documentation. Edit checks at MARC application and bibliographic utility levels. Routine review at creation point. Random sampling at import/export. Trusted suppliers.
  • Slide 8
  • Metadata Wild West. Scattered community with many working in isolation, few with relevant background in describing resources. Wide variety of resources to describe. Insufficient documentation and training available. Harvesting model developed well before notion of data quality.
  • Slide 9
  • NSDL Harvesting Model [diagram; labels: collection AAA metadata, collection BBB metadata, OAI server, NSDL Metadata Repository (MR), scrubbed & normalized, NSDL MR OAI server, NSDL Search Service, NSDL Archive Service, http://nsdl.org]
  • Slide 10
  • Continuum of Approaches (1): Random sampling (XMLSpy). Advantages: includes some formatting and color coding. Disadvantages: assumes consistency/predictability; difficult to determine extent of problems found; tedious, at best.
  • Slide 11
  • Continuum of Approaches (2): Spreadsheets (Microsoft Excel). Advantages: better sorting and control by reviewer. Disadvantages: unwieldy for large files; requires sustained focus from reviewer; requires translation into tab-delimited file.
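The "translation into tab-delimited file" step mentioned above could look roughly like the following. This is a minimal sketch, not the NSDL tooling: the `<records>`/`<record id="...">` wrapper, the sample values, and the function names `flatten_records` and `to_tsv` are all assumptions made for illustration. It flattens each Dublin Core element into one row of (record, element, value), a shape that a spreadsheet or a tool such as Spotfire can load.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace of the 15 simple Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample input; the wrapper elements are an assumption.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record id="rec-1">
    <dc:title>Frog Life Cycle</dc:title>
    <dc:language>en</dc:language>
    <dc:date>2003</dc:date>
  </record>
  <record id="rec-2">
    <dc:title>Volcano Basics</dc:title>
    <dc:language>Jane Smith</dc:language>
    <dc:date></dc:date>
  </record>
</records>"""


def flatten_records(xml_text):
    """Yield (record_id, element_name, value) for every DC element."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"
    for record in root.iter("record"):
        for child in record:
            if child.tag.startswith(prefix):
                name = child.tag[len(prefix):]
                # Keep empty elements: they are findings, not noise.
                value = (child.text or "").strip()
                yield record.get("id"), name, value


def to_tsv(rows):
    """Serialize rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["record", "element", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = list(flatten_records(SAMPLE))
tsv_text = to_tsv(rows)
print(tsv_text)
```

One row per element value (rather than one row per record) keeps repeated elements and empty elements visible, which is what makes sorting by element value useful during review.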
  • Slide 12
  • Continuum of Approaches (3): Visual Graphical Analysis (Spotfire). Advantages: view of several data dimensions simultaneously; reviewer controls data display; tends to pull reviewer focus to anomalies; handles fairly large files at one time, while allowing subset views; display manipulation possible without programmers. Disadvantages: high cost of software; requires translation into tab-delimited file.
  • Slide 13
  • Visual Graphical Analysis: Allows you to review ALL the information in the file THOROUGHLY and QUICKLY. With a mouse click or two, you can: reassign which characteristics the axes represent in a scatter plot; assign color, shape, and/or size to any characteristic to represent up to 5 dimensions simultaneously; display or not display specific values, including empty values, for any characteristic; display a selection of values and/or characteristics, and have the selection apply to other visualizations (e.g. tables and plots); view the information as a table, or in other representations; sort tables by characteristic column(s).
  • Slide 14
  • Metadata Analysis: Spotfire demo
  • Slide 15
  • Metadata analysis questions: Are the element values plausible? Are there any glaring errors that must be addressed?
  • Slide 16
  • Spotfire Table View. DC Creator values in the language field! Only DC Language elements are selected for display. Sorted by element value. The ability to select interesting subsets of information on the fly allows for manageably sized, scrollable lists in which ALL values can be examined.
  • Slide 17
  • Metadata analysis questions: Are there non-empty values that supply no information and that may confuse end users? Are all the DC Date values in W3CDTF syntax?
  • Slide 18
  • Spotfire Table View. Non-empty, no-information values that may confuse end users. Only DC Date elements are selected for display. The only W3CDTF syntax present is four digits. Sorted by element value.
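The W3CDTF question above can also be asked programmatically. The following is a hedged sketch, not part of the original presentation: the regular expression follows the six granularities listed in the W3C "Date and Time Formats" note (year; year-month; full date; and date plus time to minutes, seconds, or fractional seconds, where a time zone designator is required). It checks syntax only, so an impossible date such as 2003-02-31 would still pass. The name `split_by_w3cdtf` and the sample values are illustrative assumptions.

```python
import re

# Syntax-only pattern for W3CDTF. Nested optional groups mirror the
# allowed granularities; a time part must end with Z or +hh:mm / -hh:mm.
W3CDTF = re.compile(
    r"\d{4}"
    r"(-(0[1-9]|1[0-2])"
    r"(-(0[1-9]|[12]\d|3[01])"
    r"(T([01]\d|2[0-3]):[0-5]\d(:[0-5]\d(\.\d+)?)?"
    r"(Z|[+-]([01]\d|2[0-3]):[0-5]\d))?"
    r")?)?"
)


def is_w3cdtf(value):
    """Return True if value is syntactically valid W3CDTF."""
    return W3CDTF.fullmatch(value) is not None


def split_by_w3cdtf(values):
    """Partition values into (valid, invalid) lists, preserving order."""
    valid, invalid = [], []
    for value in values:
        (valid if is_w3cdtf(value) else invalid).append(value)
    return valid, invalid


# Hypothetical DC Date values of the kind a reviewer might encounter.
sample_dates = ["2003", "2003-05-17", "n.d.", "unknown", "05/17/2003",
                "2003-05-17T10:30Z", "2003-13"]
valid, invalid = split_by_w3cdtf(sample_dates)
print("valid:  ", valid)
print("invalid:", invalid)
```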
  • Slide 19
  • Metadata analysis questions: Which of the values of the DC Type element are actually DCMIType terms?
  • Slide 20
  • Spotfire Table View. Callouts: not DCMIType terms; DCMIType term. Only DC Type elements are selected for display. Sorted by element value.
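Separating DCMIType terms from other DC Type values, as the table view above does by eye, can be sketched in a few lines. The twelve terms below are those of the DCMI Type Vocabulary. The "near" category (a match after ignoring case and whitespace) is an assumption added for illustration, not something the slides describe, and the function name `check_type` is hypothetical.

```python
# The twelve terms of the DCMI Type Vocabulary (case-sensitive).
DCMI_TYPES = frozenset({
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
})

# Lookup used to suggest a term for values that differ only in
# case or whitespace (an illustrative assumption).
_BY_LOWER = {term.lower(): term for term in DCMI_TYPES}


def check_type(value):
    """Classify a DC Type value.

    Returns ("exact", term) for a DCMIType term, ("near", term) when
    the value matches a term after ignoring case and whitespace, and
    ("other", None) for anything else.
    """
    if value in DCMI_TYPES:
        return "exact", value
    key = "".join(value.split()).lower()
    if key in _BY_LOWER:
        return "near", _BY_LOWER[key]
    return "other", None


# Hypothetical values of the kind found in harvested records.
for sample in ["Text", "text", "Interactive Resource", "Lesson Plan"]:
    print(repr(sample), "->", check_type(sample))
```

Reporting near matches separately, instead of silently rewriting them, leaves the decision about whether to transform the value with the reviewer.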
  • Slide 21
  • So: Visualizing metadata for analysis can: improve efficiency and thoroughness of review efforts; improve predictability of transformation results; allow extensive data analysis without an ongoing need for programming support.
  • Slide 22
  • How do we normalize metadata? Perform safe transforms to smarten up metadata. XSL stylesheets: from raw XML metadata to NSDL normalized XML metadata. Principles: Do no harm (don't lose information). Add information, when possible: indicate schemes for valid values. Remove meaningless text: "not available", "-"; empty elements. Correct erroneous information: text/pdf -> application/pdf. Remove characters that impede functionality or display: encoding fixes (e.g. &, double XML encodings, bad UTF-8). Scrub URLs.
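The slides describe these transforms as XSL stylesheets, which are not reproduced in this transcript. Purely as an illustration of three of the principles (dropping meaningless values, correcting text/pdf, undoing double XML encoding), here is a minimal Python sketch. The function names, the `MEANINGLESS` and `FORMAT_FIXES` tables, and the sample record are assumptions. Note that `html.unescape` also decodes HTML named entities, and that unescaping repeatedly would alter a value that legitimately contains a literal `&amp;`, so a production transform would need a more careful rule.

```python
import html

# Placeholder values treated as carrying no information
# (taken from the examples on the slide; extend as needed).
MEANINGLESS = {"", "-", "not available"}

# Known-wrong format values and their corrections.
FORMAT_FIXES = {"text/pdf": "application/pdf"}


def unescape_fully(text, max_rounds=3):
    """Undo double (or triple) entity encoding, e.g. '&amp;amp;' -> '&'."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


def normalize(pairs):
    """Return cleaned (element, value) pairs for one record."""
    cleaned = []
    for element, value in pairs:
        value = unescape_fully(value).strip()
        if value.lower() in MEANINGLESS:
            continue  # drop empty and no-information values
        if element == "format":
            value = FORMAT_FIXES.get(value.lower(), value)
        cleaned.append((element, value))
    return cleaned


# Hypothetical raw record as (element, value) pairs.
raw = [
    ("title", "Cats &amp;amp; Dogs"),
    ("date", "not available"),
    ("format", "text/pdf"),
    ("subject", " - "),
    ("creator", ""),
]
normalized = normalize(raw)
print(normalized)
```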
  • Slide 23
  • Goal 2: NSDL at a Glance. What's in the NSDL? Collections; subjects. Intuitive UI: interactive GUI displays.
  • Slide 24
  • NSDL at a Glance - Demos. Spotfire. Treemap: http://www.smartmoney.com. Star Tree: http://nsdl.org/collections/ataglance/browseBySubject.html
  • Slide 25
  • How About Better Online Browsing?
  • Slide 26
  • Search and Browse: False dichotomy! Many different user tasks. Multiple ways to present results to users. Should the presentation vary with quantity and/or context of results?