HATHI TRUST A Shared Digital Repository Delivering Data For New Generations of Research Strategies and Challenges Jeremy York NISO/BISG Forum ALA 2010



Page 1: Delivering Data For  New Generations of Research

HATHI TRUST A Shared Digital Repository

Delivering Data For New Generations of Research

Strategies and Challenges
Jeremy York

NISO/BISG Forum
ALA 2010

Page 2: Delivering Data For  New Generations of Research

Introduction

• Digital Repository
– Initial focus on digitized book and journal content
– “Light” archive

• Collections and Collaboration
– Comprehensive collection
– Shared strategies
– Local services
– Public Good

Page 3: Delivering Data For  New Generations of Research

Content Distribution

6,173,575 – Total
1,177,667 – Public Domain

* As of June 15, 2010

Page 4: Delivering Data For  New Generations of Research

Language Distribution (1)

* As of June 15, 2010

Page 5: Delivering Data For  New Generations of Research

Language Distribution (2)

The next 40 languages make up ~13% of total

* As of June 15, 2010

Page 6: Delivering Data For  New Generations of Research

Originating Institution

* As of June 15, 2010

Page 7: Delivering Data For  New Generations of Research

Content over time

* As of June 15, 2010

Page 8: Delivering Data For  New Generations of Research

Content Growth

Page 9: Delivering Data For  New Generations of Research
Page 10: Delivering Data For  New Generations of Research

Data Distribution & APIs

• OAI-PMH
• Metadata files
• Bibliographic API
• Data API
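The OAI-PMH entry in this list refers to a standard harvesting protocol in which a client sends HTTP requests carrying a `verb` parameter. The following is a minimal sketch of building `ListRecords` requests. The base URL is a hypothetical placeholder rather than an endpoint named in the slides, and `oai_dc` is the metadata format the protocol requires every repository to support.

```python
from urllib.parse import urlencode

# Hypothetical endpoint used only for illustration; not taken from the slides.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # The protocol treats resumptionToken as an exclusive argument:
        # it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


# First request of a harvest, then a follow-up using a made-up token.
first_page = list_records_url(BASE_URL, from_date="2010-06-15")
next_page = list_records_url(BASE_URL, resumption_token="abc123")
```

A harvester would fetch `first_page`, read the `resumptionToken` element from the XML response, and keep requesting follow-up pages until the repository returns an empty token.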

Page 11: Delivering Data For  New Generations of Research

Extended Services

• Community Development Environment
• Non-Google Ingest
• Non-Book/Non-Journal Ingest
• Computational Research

Page 12: Delivering Data For  New Generations of Research

Strategies for Computational Research

• Data distribution
• Protocol-based access
• Research Center

Page 13: Delivering Data For  New Generations of Research
Page 14: Delivering Data For  New Generations of Research

SEASR Architecture

[Architecture diagram. Labels: Components; Virtualization Infrastructure; Meandre Infrastructure; Visualization; Component Repository; Component Discovery; Meandre Data-Intensive Flows; Apps; Services; Plugins; Web Apps; Analytics; Data; Developer Tools; Repositories; Analysis; Flows; User Interfaces; Cloud Computing; Visualizations; Meandre Workbench]

Page 15: Delivering Data For  New Generations of Research

SEASR @ Work – Tag Cloud

• Count tokens

• Filter options supported

• Stem words
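The three steps on this slide (count, filter, stem) can be sketched in a few lines. This is an illustration only and not the SEASR component itself: the stop-word list is a tiny invented sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real filter would be much larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}


def naive_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tag_cloud_counts(text, min_length=3):
    """Tokenize, filter, stem, and count; returns a Counter of stem -> frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]
    return Counter(naive_stem(t) for t in kept)


# Invented sample sentence, chosen so the naive stemmer behaves sensibly.
counts = tag_cloud_counts(
    "The library indexed books, and indexing a book indexes the library."
)
```

The resulting frequencies would typically drive the font size of each term in the cloud.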

Page 16: Delivering Data For  New Generations of Research

SEASR @ Work – Entity Mash-up

• Entity Extraction with OpenNLP or Stanford NER

• Locations viewed on Google Map

• Dates viewed on Simile Timeline

Page 17: Delivering Data For  New Generations of Research

SEASR @ Work – Entities To Network

• Identify entities
• Define relationships between entities within same sentence
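One way to read "relationships within the same sentence" is as co-occurrence edges: two entities are linked whenever they appear in one sentence. The sketch below assumes the entity list already exists (for example from an extraction step) and uses a naive punctuation-based sentence split. The sample text is invented and is not a quotation.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(text, entities):
    """Count how often each pair of known entities shares a sentence."""
    edges = Counter()
    # Naive split on ., ! and ?; a real pipeline would use an NLP sentence splitter.
    for sentence in re.split(r"[.!?]+", text):
        present = sorted(e for e in entities if e in sentence)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


# Invented sample text; the entity list is assumed to come from an earlier step.
sample = "Waverley met Flora in Edinburgh. Flora left Edinburgh. Waverley stayed."
edges = cooccurrence_edges(sample, ["Waverley", "Flora", "Edinburgh"])
```

Each key in `edges` is an alphabetically ordered pair, and the count can serve as an edge weight when the network is drawn.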

Page 18: Delivering Data For  New Generations of Research

SEASR @ Work – Text Clustering

• Clustering of Text by token counts
• Filtering options for stop words, Part of Speech
• Dendrogram Visualization
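A dendrogram records the order in which clusters are merged. The sketch below shows one way such a merge order could be computed from token counts, assuming cosine distance and single-linkage agglomerative clustering. Both choices are assumptions, since the slides do not say which distance or linkage SEASR uses, and the stop-word and part-of-speech filters are omitted for brevity.

```python
import math
import re
from collections import Counter


def token_counts(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def single_linkage_merges(docs):
    """Return the merge order [(cluster_a, cluster_b), ...] for name -> text."""
    vectors = {name: token_counts(text) for name, text in docs.items()}
    clusters = [(name,) for name in sorted(docs)]
    merges = []
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cosine_distance(vectors[x], vectors[y])
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented toy documents: "a" and "b" share most of their tokens.
docs = {
    "a": "ships sail the sea",
    "b": "ships sail the ocean",
    "c": "castles guard the hills",
}
merges = single_linkage_merges(docs)
```

Reading `merges` from first to last gives the dendrogram from its leaves to its root: the two most similar documents join first, and the outlier joins last.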

Page 19: Delivering Data For  New Generations of Research

SEASR @ Work – Audio Analysis

• NEMA: Executes a SEASR flow for each run

– Loads audio data

– Extracts features for every 10 sec moving window of audio

– Loads and applies the models

– Sends results back to the WebUI

• NESTER: Annotation of Audio via Spectral Analysis
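The feature-extraction step in the NEMA flow above slides a 10-second window along the audio. A minimal sketch of that windowing follows. The 5-second hop between windows and the RMS energy feature are assumptions made for illustration, because the slides specify only the window length.

```python
import math


def moving_windows(samples, sample_rate, window_sec=10.0, hop_sec=5.0):
    """Yield (start_time_sec, window_samples) for each full window."""
    window = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)  # assumed hop size, not from the slides
    for start in range(0, len(samples) - window + 1, hop):
        yield start / sample_rate, samples[start:start + window]


def rms(window):
    """Root-mean-square energy, used here as a placeholder feature."""
    return math.sqrt(sum(x * x for x in window) / len(window))


# 30 seconds of synthetic audio at a deliberately tiny sample rate.
sample_rate = 4
samples = [1.0] * (30 * sample_rate)
features = [(t, rms(w)) for t, w in moving_windows(samples, sample_rate)]
```

In a real flow the per-window feature vectors would then be passed to the loaded models, as the later steps in the list describe.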

Page 20: Delivering Data For  New Generations of Research

SEASR @ Work – Zotero

• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics

– Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR

– Zotero Export to Fedora through SEASR

– Saves results from SEASR Analytics to a Collection

• Launch MONK Processing
– MONK DB Ingestion Workflow

Page 21: Delivering Data For  New Generations of Research

SEASR @ Work – Emotion Tracking

Goal is to have this type of Visualization to track emotions across a text document (Leveraging flare.prefuse.org)

Page 22: Delivering Data For  New Generations of Research

Sentiment Analysis: Visualization

Page 23: Delivering Data For  New Generations of Research

Person Extraction: Scott's Waverley, Ivanhoe, and The Heart of Midlothian.

Page 24: Delivering Data For  New Generations of Research

Location Extraction:
Top: Walter Scott's Waverley
Bottom: Maria Edgeworth's Castle Rackrent

Page 25: Delivering Data For  New Generations of Research

Thank you!

[email protected]@umich.edu