Metadata and Information Visualization Naomi Dushay Cornell Information Science National Science Digital Library


Page 1: Metadata and Information Visualization

Naomi Dushay
Cornell Information Science
National Science Digital Library

Page 2: What is NSDL?

National Science Digital Library
Purpose: Educational
  broad definition of Science: also Technology, Engineering, Mathematics, etc.
Production
Research
Users: teachers, students, researchers, general public
  K-gray
http://nsdl.org
http://comm.nsdl.org
  Virtual communities

Page 3: NSDL: Metadata Aggregator

Centralized Metadata Repository
Two-tiered model: collections & items
  Item records harvested from collections
Diverse metadata formats and granularity levels

Page 4: NSDL Architecture

[Diagram labels: Metadata Repository, containing collection records and their item records; resources; Search Service; UI]

Page 5: Goal: Provide Normalized Metadata

Why?
  Quality of NSDL services (e.g. search results, or UI display)
  Enhance predictability of metadata for reharvesting services
  Improve metadata quality, when possible
How?

Page 6: Metadata Normalization Challenges

Broad content
  Types of resources
  Topics
Metadata Quality
  Wildly inconsistent (what fields are used, what info is present)
  Missing information
  Consistent, controlled vocabularies? Fuggedaboutit
Disparate Quantities (by subject, by collection)
  7 vs. 300,000 items
Virtual Communities
  Within communities, no agreement on needs
Reduce human effort to keep costs down

Page 7: Metadata in the MARC World

Relatively controlled, closed system with strong community
Comprehensive and current documentation
Edit checks at MARC application and bibliographic utility levels
Routine review at creation point
Random sampling at import/export
Trusted suppliers

Page 8: Metadata Wild West

Scattered community with many working in isolation, few with relevant background in describing resources
Wide variety of resources to describe
Insufficient documentation and training available
Harvesting model developed well before notion of data quality

Page 9: NSDL Harvesting Model

[Diagram labels: collection AAA metadata and collection BBB metadata, each on an OAI server; NSDL Metadata Repository (MR), scrubbed & normalized; NSDL MR OAI server; NSDL Search Service (http://nsdl.org); NSDL Archive Service]

Page 10: Continuum of Approaches (1)

Random sampling (XMLSpy)
Advantages
  Includes some formatting and color coding
Disadvantages
  Assumes consistency/predictability
  Difficult to determine extent of problems found
  Tedious, at best

Page 11: Continuum of Approaches (2)

Spreadsheets (Microsoft Excel)
Advantages
  Better sorting and control by reviewer
Disadvantages
  Unwieldy for large files
  Requires sustained focus from reviewer
  Requires translation into tab-delimited file

Page 12: Continuum of Approaches (3)

Visual Graphical Analysis (Spotfire)
Advantages
  View of several data dimensions simultaneously
  Reviewer controls data display
  Tends to pull reviewer focus to anomalies
  Handles fairly large files at one time, while allowing subset views
  Display manipulation possible without programmers
Disadvantages
  High cost of software
  Requires translation into tab-delimited file
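
Both the spreadsheet and the Spotfire approaches need the XML metadata translated into a tab-delimited file. The slides do not show how that was done; the following is only a minimal Python sketch of one way to flatten records into one row per element. The wrapper elements (`records`, `record`), the `id` attribute, the column names, and the sample data are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: two simple Dublin Core records in one wrapper element.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record id="r1">
    <dc:title>Plate Tectonics</dc:title>
    <dc:language>en</dc:language>
  </record>
  <record id="r2">
    <dc:title>Cell Division</dc:title>
    <dc:type xml:lang="en">Text</dc:type>
  </record>
</records>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}name' form into (namespace, name)."""
    if tag.startswith("{"):
        namespace, name = tag[1:].split("}", 1)
        return namespace, name
    return "", tag


def flatten(xml_text):
    """Yield one (record, namespace, element, attributes, value) row per element."""
    root = ET.fromstring(xml_text)
    for record in root.findall("record"):
        for child in record:
            namespace, name = split_tag(child.tag)
            # Keep only the attribute names; join them if there are several.
            attributes = ";".join(split_tag(key)[1] for key in child.attrib)
            value = (child.text or "").strip()
            yield record.get("id"), namespace, name, attributes, value


def to_tsv(rows):
    """Serialize rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["record", "namespace", "element", "attributes", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = list(flatten(SAMPLE))
print(to_tsv(rows))
```

One row per element (rather than one row per record) keeps repeated elements and empty values visible, which is what the later table and scatter-plot views rely on.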

Page 13: Visual Graphical Analysis

Allows you to review ALL the information in the file THOROUGHLY and QUICKLY.
With a mouse click or two, you can:
  Reassign which characteristics the axes represent in a scatter plot
  Assign color, shape, and/or size to any characteristic to represent up to 5 dimensions simultaneously
  Display or not display specific values, including empty values, for any characteristic
  Display a selection of values and/or characteristics, and have the selection apply to other visualizations (e.g. tables and plots)
  View the information as a table, or in other representations
  Sort tables by characteristic column(s)

Page 14: Metadata Analysis

Spotfire demo

Page 15: Metadata analysis questions

Are the elements’ values plausible?
Are there any glaring errors that must be addressed?

Page 16: Spotfire Table View

[Screenshot annotations]
DC Creator values in the language field!
Only DC Language elements are selected for display
Sorted by element value

The ability to select interesting subsets of information – on the fly – allows for manageably sized, scrollable lists in which ALL values can be examined.

Page 17: Metadata analysis questions

Are there non-empty values that supply no information and that may confuse end users?
Are all the DC Date values in W3CDTF syntax?

Page 18: Spotfire Table View

[Screenshot annotations]
Non-empty, “no information” values that may confuse end users
Only DC Date elements are selected for display
The only W3CDTF syntax present is four digits.
Sorted by element value
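
The two date questions above can also be asked programmatically. The sketch below is not part of the talk (which used Spotfire table views); it is a hedged Python illustration that sorts DC Date values into three bins. The regular expression checks W3CDTF syntax only (it does not reject month 13, for example), and the `NO_INFORMATION` list is an assumption extended from the “…”, “not available”, and “-” examples that appear later in the slides.

```python
import re

# W3CDTF syntax: YYYY, YYYY-MM, YYYY-MM-DD, or a full date followed by
# Thh:mm, optional :ss and fractional seconds, and a required time zone.
W3CDTF = re.compile(
    r"^\d{4}"
    r"(-\d{2}"
    r"(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?"
    r")?"
    r")?$"
)

# Assumed placeholder strings; adjust to what the data actually contains.
NO_INFORMATION = {"", "-", "...", "…", "n/a", "none", "unknown", "not available"}


def classify_date(value):
    """Return 'no information', 'w3cdtf', or 'other' for one DC Date value."""
    text = value.strip()
    if text.lower() in NO_INFORMATION:
        return "no information"
    if W3CDTF.match(text):
        return "w3cdtf"
    return "other"


for sample in ["2003", "2003-05-17", "May 2003", "unknown"]:
    print(sample, "->", classify_date(sample))
```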

Page 19: Metadata analysis questions

Which of the values of the DC Type element are actually DCMIType terms?

Page 20: Spotfire Table View

[Screenshot annotations]
Not DCMIType terms
DCMIType term
Only DC Type elements are selected for display
Sorted by element value
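
As a rough companion to the table view, the DCMIType question can be sketched in Python. The term list below is the DCMI Type Vocabulary as currently published, which may differ from the list in force when these slides were written; the case- and space-insensitive "suggestion" step is an added assumption, not something the slides describe.

```python
# The twelve terms of the DCMI Type Vocabulary (current published list).
DCMI_TYPES = frozenset({
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
})

# Lookup used to suggest a term for near misses such as "text".
_BY_FOLDED = {term.lower(): term for term in DCMI_TYPES}


def check_type(value):
    """Return (is_dcmi_term, suggestion) for one DC Type value.

    A value is a DCMIType term only on an exact, case-sensitive match.
    Otherwise the suggestion is the term that matches once case and
    whitespace are ignored, or None if there is no such term.
    """
    text = value.strip()
    if text in DCMI_TYPES:
        return True, text
    folded = "".join(text.split()).lower()
    return False, _BY_FOLDED.get(folded)


for sample in ["Text", "text", "Interactive Resource", "lesson plan"]:
    print(repr(sample), check_type(sample))
```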

Page 21: So …

Visualizing metadata for analysis can:
  Improve efficiency and thoroughness of review efforts
  Improve predictability of transformation results
  Allow extensive data analysis without an ongoing need for programming support

Page 22: How do we normalize metadata?

Perform “safe” transforms to “smarten up” metadata
  XSL stylesheets -- from raw XML metadata to NSDL normalized XML metadata
Principles:
  Do no harm (Don’t lose information)
  Add information, when possible
    Indicate schemes for valid values
  Remove meaningless text
    “…”, “not available”, “-”
    Empty elements
  Correct erroneous information
    “text/pdf” → “application/pdf”
  Remove characters that impede functionality or display
    Encoding fixes (e.g. “&”, double XML encodings, bad UTF-8 …)
    Scrub URLs
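
The slides say the actual transforms were XSL stylesheets, which are not shown. Purely to illustrate the principles listed above, here is a small Python sketch (a swapped-in technique, not the NSDL implementation). It works on element content as serialized in the XML, that is, still escaped. The function names, the `MEANINGLESS` set, and the pair-based record shape are assumptions.

```python
import re

# Placeholder strings taken from the slide; extend as the data requires.
MEANINGLESS = {"", "-", "...", "…", "not available"}

# Known-wrong format values and their corrections (example from the slide).
FORMAT_FIXES = {"text/pdf": "application/pdf"}

# "&amp;" immediately followed by an entity body, e.g. "&amp;amp;".
DOUBLE_ENCODED = re.compile(r"&amp;(#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def normalize_value(element, value):
    """Return the cleaned value, or None if the element should be dropped."""
    text = value.strip()
    if text.lower() in MEANINGLESS:
        return None  # empty or meaningless text carries no information
    text = DOUBLE_ENCODED.sub(r"&\1;", text)  # undo one level of encoding
    if element == "format":
        text = FORMAT_FIXES.get(text.lower(), text)
    return text


def normalize_record(record):
    """Clean a record given as a list of (element, value) pairs."""
    cleaned = []
    for element, value in record:
        new_value = normalize_value(element, value)
        if new_value is not None:
            cleaned.append((element, new_value))
    return cleaned


raw = [
    ("title", "Tom &amp;amp; Jerry"),
    ("format", "text/pdf"),
    ("description", "not available"),
    ("date", " - "),
    ("format", "image/png"),
]
print(normalize_record(raw))
```

The double-encoding pattern is deliberately narrow; text such as "R&amp;D; etc." would be misread as double-encoded, so a real transform would need review of what it matches.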

Page 23: Goal 2: NSDL at a Glance

What’s in the NSDL?
  Collections
  Subjects
Intuitive UI
Interactive GUI displays

Page 24: NSDL at a Glance - Demos

Spotfire
Treemap
  http://www.smartmoney.com
Star Tree
  http://nsdl.org/collections/ataglance/browseBySubject.html

Page 25: How About Better Online Browsing?

Page 26: Search and Browse

False dichotomy!
Many different user tasks
Multiple ways to present results to users
  Should the presentation vary with quantity and/or context of results?
  e.g., “browse” may be a certain presentation of subject search results.

Page 27: A Short List of User Tasks

“Known Item Search”
Single Item Search
Answer to a Question
x “Best” Resources
  Most informative? Easiest to access? Most appropriate to 8th graders?
All Germane Resources
Sense of the Information Space
Serendipitous Finds
… still looking for user needs and tasks analysis for information discovery …
[Brace annotation: Inputs may be fuzzy]

Page 28: Problem Narrowed

Improve evaluation of resource relevance without having to “go there”
  “See and Go Manifesto” Ramana Rao
  Allow users to manipulate result presentation
What do we miss when we can’t walk through the stacks?
  Sense of information space
  Serendipitous finds

Page 29: Information Organization

Books, Bookcases, Bookspines, Catalogs
  all evolved over time
  library staff/user needs
  bookstore staff/customer needs
  organized by subject
We are taught how to use libraries
  how resources are organized
  how to use tools (card catalog, OPAC)

Page 30: A Brief, Recent History of Information Discovery

Card catalog (the world begins here)
OPAC w/o keyword
OPAC w/ keyword
Internet, before WWW
WWW before any cataloging
Yahoo, Alta Vista, etc.
Google
[Brace annotation: Open vs. Closed Stacks]

Page 31: More Information Organization

“binned” then
(possibly) sub-binned then
sorted (alphabetical, size, format …)
Note tension between linear ordering and hierarchical classification
Location and Bookspine

Page 32: Bookspines

Aid information discovery while allowing efficient book storage
Surrogate for book
  surrogate closely related to resource
Visual (color, size, shape …)
Aimed at multiple audiences
  Bookstore staff
  Potential users
NISO standard

Page 33: Can We Improve Reality?

A resource can be in multiple places at once
2 or 3 dimensional organization instead of linear
Organization can be dynamic
  User manipulability
Can use proximity to indicate relationships
Can we make visual surrogate richer?
  Semantic zoom for resource?
Different users have different needs
  Visual surrogate … user selected?
Staff can alter organization of stored resources without affecting users’ views
Flexibility: organizing a very large collection has different constraints than organizing a small collection

Page 34: The Big Questions

How do we present shelves of bookspine information to our users within a monitor screen?
What should a virtual bookspine look like?
(demo)

Page 35: Design Notes

Tension
  intuitive, familiar
  new capabilities, change
Semantic zoom
  spec (partial bookspine info: color, position)
  bookspine info
  full metadata
  resource itself
User manipulability
Text issues
  horizontal, not vertical
Most materials in English
  default sort is alphabetical

Page 36: Prototype Next Steps

Click through for resource
API
  Any fielded data
    Search results? Colored by rank?
  Any tree structure for any fielded data
Multiple field values
Jitter
Scaling
  When too much, scroll it (a la Spotfire)?
Table view (sortable, selectable, searchable, like Spotfire)

Page 37: The Metadata Frontier

Missing information
  Automatically generated (full text, iVia, kth nearest neighbor, support vector … based on training set)
  Via community (ENC?)
Controlled vocabularies
  Automatic translation?
  Data mining?
Value-added services to motivate providers

Page 38: Thank You!

Page 39: Goal 3 sub 1: Classification

LCC files on order
Star Tree?
Windows Explorer?
Other?

Page 40: Metadata analysis questions

Which XML elements are present in the metadata and with what namespaces are they associated?
Are there any non-DC elements in the metadata?

Page 41: Element Names vs. Namespaces (Scatter Plot)
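
The scatter plot itself did not survive extraction. As a stand-in, the underlying tally (element name against namespace) can be sketched in Python. The wrapper elements, the `local` namespace URI, and the sample data are hypothetical; only the Dublin Core element-set namespace URI is a real identifier.

```python
from collections import Counter
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample with one element from a non-DC namespace.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/"
                     xmlns:local="http://example.org/local">
  <record>
    <dc:title>A</dc:title>
    <dc:subject>x</dc:subject>
    <dc:subject>y</dc:subject>
  </record>
  <record>
    <dc:title>B</dc:title>
    <local:gradeLevel>8</local:gradeLevel>
  </record>
</records>"""


def element_namespace_counts(xml_text):
    """Count (element name, namespace) pairs over all record children."""
    counts = Counter()
    for record in ET.fromstring(xml_text):
        for node in record:
            if node.tag.startswith("{"):
                namespace, name = node.tag[1:].split("}", 1)
            else:
                namespace, name = "", node.tag
            counts[(name, namespace)] += 1
    return counts


counts = element_namespace_counts(SAMPLE)
non_dc = sorted(name for (name, namespace) in counts if namespace != DC_NS)
print(counts)
print("non-DC elements:", non_dc)
```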

Page 42: Metadata analysis questions

Do all the metadata records have
  DC Identifier
  DC Format
  …

Page 43: Missing Elements (Scatter Plot)

[Plot annotations]
2 records without language element
format element present inconsistently
Easy to rescale axis on the fly and scroll through records
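
The same "which records lack which elements" question can be answered without a plot. This Python sketch is an illustration under assumptions: the `REQUIRED` tuple (identifier, format, language) is chosen from the elements named on these slides, and the record IDs and the dictionary-of-element-names input shape are hypothetical.

```python
# Elements every record is expected to carry (assumed for this sketch).
REQUIRED = ("identifier", "format", "language")


def missing_elements(records, required=REQUIRED):
    """Map record id -> list of required elements the record lacks.

    `records` maps a record id to the element names present in it.
    Records that have every required element are left out of the report.
    """
    report = {}
    for record_id, elements in records.items():
        present = set(elements)
        missing = [name for name in required if name not in present]
        if missing:
            report[record_id] = missing
    return report


sample = {
    "r1": ["title", "identifier", "format", "language"],
    "r2": ["title", "identifier"],
    "r3": ["identifier", "language"],
}
print(missing_elements(sample))
```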

Page 44: Metadata analysis questions

Exactly which elements use XML attributes?
Do those elements also appear in the metadata without an attribute?
(this approach can be used to isolate empty and non-empty elements)

Page 45: Empty and Non-Empty Characteristics

[Plot annotations]
all WITH an attribute present
all WITHOUT an attribute present
There are subject fields with and without the nsdl_dc:GEM attribute value
There are no identifier fields without an attribute present
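
A rough Python equivalent of this with/without split is sketched below. It assumes each element occurrence has already been reduced to an (element name, attribute dictionary) pair; the `xsi:type` attribute name and the `dct:URI` value are assumptions for the sample, and only the `nsdl_dc:GEM` value comes from the slide.

```python
from collections import defaultdict


def attribute_usage(rows):
    """Count, per element name, occurrences with and without attributes."""
    usage = defaultdict(lambda: {"with": 0, "without": 0})
    for element, attributes in rows:
        key = "with" if attributes else "without"
        usage[element][key] += 1
    return dict(usage)


# Hypothetical occurrences: (element name, attributes on that occurrence).
sample = [
    ("subject", {"xsi:type": "nsdl_dc:GEM"}),
    ("subject", {}),
    ("identifier", {"xsi:type": "dct:URI"}),
    ("title", {}),
]

usage = attribute_usage(sample)
# Elements that appear both with and without an attribute.
mixed = sorted(name for name, n in usage.items() if n["with"] and n["without"])
print(usage)
print("used both ways:", mixed)
```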

Page 46: Data Problems: Missing Data

Defining what’s “missing” partially dependent on nature of implementation
Title and Description critical for user selection
Format and Type particularly critical for NSDL filtering of search results

Page 47: Data Problems: Incorrect Data

In wrong element
  misunderstood definitions or careless crosswalking
Nonsensical values (“promiscuous defaults”)
Bad crosswalks (may be non-standard or too limited)
Metadata record ID used for Identifier

Page 48: Data Problems: Confusing Data

Ambiguous separators (comma instead of semi-colon)
HTML tagging within elements
Encoding problems
  Double encoding: &amp;amp;
  Bad UTF-8
  Illegal XML characters (e.g., un-encoded ampersand)
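
Several of these problems can be flagged mechanically. The following Python sketch is an assumption-laden illustration, not the NSDL validation code: it inspects the raw bytes of one element's content as serialized in the XML (still escaped), so HTML tagging is looked for in its escaped `&lt;...&gt;` form. The problem labels and the function name are invented here, and ambiguous separators are not covered.

```python
import re

# "&amp;" immediately followed by an entity body, e.g. "&amp;amp;".
DOUBLE_ENCODED = re.compile(r"&amp;(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
# "&" that does not start a character or entity reference.
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#x[0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)")
# Escaped markup such as "&lt;b&gt;" or "&lt;/p&gt;".
ESCAPED_HTML = re.compile(r"&lt;/?[A-Za-z][^&]*&gt;")


def find_problems(raw_bytes):
    """Return a list of problem labels for one serialized element value."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return ["bad UTF-8"]  # nothing else can be checked reliably
    problems = []
    if DOUBLE_ENCODED.search(text):
        problems.append("double encoding")
    if BARE_AMPERSAND.search(text):
        problems.append("un-encoded ampersand")
    if ESCAPED_HTML.search(text):
        problems.append("HTML tagging")
    return problems


for sample in [b"Tom &amp;amp; Jerry", b"AT&T labs", b"&lt;b&gt;Volcanoes&lt;/b&gt;"]:
    print(sample, find_problems(sample))
```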

Page 49: Automated MR ingest process

[Diagram labels: NSDL Collection Registration; provider OAI server; OAI Harvest; “raw” or “native” metadata; Validation; Normalize; Validation; normalized metadata; Metadata Repository; NSDL MR OAI server; Notify provider of problems; may need to halt processing]