Digital Curation Specialists in the Library Setting June 16, 2009 Special Libraries Association P. Bryan Heidorn


Panel on Digital Curation at the Special Library Association 2009 meeting


Page 1: Sla2009 D Curation Heidorn

Societal Need for Digital Curation Specialists in the Library Setting

June 16, 2009

Special Libraries Association

P. Bryan Heidorn

Page 2: Sla2009 D Curation Heidorn

Introduction

Program Manager, Division of Biological Infrastructure, National Science Foundation

Associate Professor, Graduate School of Library and Information Science, University of Illinois

JRS Biodiversity Foundation, Board of Directors

Page 3: Sla2009 D Curation Heidorn

Why Libraries

Libraries manage the scholarly output of society

Scholars in the humanities and sciences are generating primary and secondary data at unprecedented rates

Social investment is not only in journal publications but all scholarly knowledge

Need for specialists for information organization, access and preservation

Libraries have the institutional structure and many of the skills needed to curate data and other digital resources.

Page 4: Sla2009 D Curation Heidorn

Cyberinfrastructure Vision

“The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long-term access.”

NSF Cyberinfrastructure Vision for 21st Century Discovery (2007), Chapter 3

Page 5: Sla2009 D Curation Heidorn

Recognition of need for data curation

“Recommendation 6: The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high-quality data scientists.”

Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century (2005), Recommendations

Page 6: Sla2009 D Curation Heidorn

Recognition of the importance of Information

Recognition of the need for education

New work roles within traditional institutions

Interagency Working Group on Digital Data

Page 7: Sla2009 D Curation Heidorn

New Information Disciplines

Digital Curator: an expert knowledgeable of and with responsibility for the content of a digital collection(s)

Digital Archivist: an expert competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form

Data Scientists: the information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others, who are crucial to the successful management of a digital data collection

(Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, report of the National Science Board, September 2005)

Page 8: Sla2009 D Curation Heidorn

Library Skills

Page 9: Sla2009 D Curation Heidorn

Where is the data now?

Not in reference collections
Varying mandates for sharing
Unsustainable models
Individual researchers
Boutique databases
Most data is from small projects
Big science and independent science

Page 10: Sla2009 D Curation Heidorn

Economics of the long tail

The Long Tail, by Chris Anderson. Wired Magazine, 12.10, 2004. (http://www.wired.com/wired/archive/12.10/tail_pr.html)

NetFlix versus BlockBuster

Genbank versus Mary’s Lab

Page 11: Sla2009 D Curation Heidorn

Naive View of Science Data

$f(x) = a x^{k} + o(x^{k})$

Power Law of Science Data

$f(x) = a x^{k} + o(x^{k}) \mid x < 0.20$

[Figure: Data Volume (vertical axis) versus Science Projects and Initiatives (horizontal axis)]
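To make the power-law form on this slide concrete, the following minimal sketch ranks a set of projects and asks how much of the total data volume sits in the largest 20%. The values a = 1, k = -1 and the count of 1000 projects are assumptions chosen for illustration, not figures from the talk, and only the leading term a·x^k is modeled.

```python
def power_law_volumes(n_projects, a=1.0, k=-1.0):
    """Return a * x**k for ranks x = 1..n_projects (leading term only)."""
    return [a * rank ** k for rank in range(1, n_projects + 1)]


def head_share(volumes, head_fraction=0.20):
    """Fraction of the total volume held by the largest `head_fraction` of projects."""
    ordered = sorted(volumes, reverse=True)
    head_count = int(len(ordered) * head_fraction)
    return sum(ordered[:head_count]) / sum(ordered)


# Hypothetical population: 1000 projects with a = 1 and k = -1.
volumes = power_law_volumes(1000)
share = head_share(volumes)
print(f"Top 20% of projects hold {share:.1%} of the volume")
```

Under these assumed parameters the head dominates but the tail still carries a substantial share; changing k changes how steeply volume falls off with rank.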

Page 12: Sla2009 D Curation Heidorn

Does NSF’s Data Follow the Power Law?

I do not know, but if $1 = X bytes...

[Chart: Awarded Amount 2007; vertical axis from $0 to $7,000,000; horizontal axis ticks from 1 to 8776]

Page 13: Sla2009 D Curation Heidorn

20-80 Rule: The small are big!

Total Grants: 9347
Total Dollars: $2,137,636,716

                 20%                    80%
Number Grants    1869                   7478
Total Dollars    $1,199,088,125         $938,548,595
Range            $6,892,810-$350,000    $350,000-$831
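The arithmetic behind "the small are big" can be checked directly from the slide's figures. The sketch below only divides the numbers given above; the variable names are illustrative.

```python
# Figures copied from the slide (NSF awards, 2007).
top_grants, rest_grants = 1869, 7478
top_dollars, rest_dollars = 1_199_088_125, 938_548_595

total_grants = top_grants + rest_grants      # 9347, matching the slide
total_dollars = top_dollars + rest_dollars   # within a few dollars of the slide's stated total

grant_share_top = top_grants / total_grants
dollar_share_top = top_dollars / total_dollars
dollar_share_rest = rest_dollars / total_dollars

print(f"Largest {grant_share_top:.0%} of grants: {dollar_share_top:.1%} of dollars")
print(f"Remaining grants: {dollar_share_rest:.1%} of dollars")
```

The largest fifth of grants accounts for about 56% of the dollars, so the smaller four fifths still account for about 44%, which is far more than a strict 80/20 split would leave them.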

Page 14: Sla2009 D Curation Heidorn

Dark data is the data that we know is/was there but we can’t see it.

Hubble Space Telescope composite image "ring" of dark matter in the galaxy cluster Cl 0024+17

Page 15: Sla2009 D Curation Heidorn

Related Ideas

John Porter: Deep versus Wide databases

Swanson: Undiscovered Public Knowledge

Science Commons: Big versus Small science

Page 16: Sla2009 D Curation Heidorn

Why is the tail also important?

Valuable science data is in the tail
Many scientists could use the tail data
• Unpublished observations of flowering time in Concord by Alfred Hosmer from 1888 to 1902
• Photographs of flowers
• Blue Hill Observatory meteorological data
Richard B. Primack, Abraham J. Miller-Rushing, Daniel Primack, and Sharda Mukunda (2007). Using Photographs to Show the Effects of Climate Change on Flowering Time. Arnoldia 65(1), pp. 2-9.
Science innovation occurs in the long tail
Unpublished negative results / aka dark data
We know very little about the tail
Transformative science happens in the tail
Computational thinking needed to free the tail
NSF current investments in the tail
OECD Principles and Guidelines for Access to Research Data from Public Funding

Page 17: Sla2009 D Curation Heidorn

The Case of Lake Victoria Data

Lake Victoria is the largest freshwater lake in Africa

Nile Perch, Water Hyacinth, Deforestation and human waste are destroying the fishery

Hundreds of data sets have been created over 50 years

There is no access to most of that information

Page 18: Sla2009 D Curation Heidorn

Barriers

Lack of professional reward structure
Lack of education in data curation
Intellectual property rights (IPR)
Lack of technology
Lack of financial reward structure
Undervaluation / lack of investment
Cost of infrastructure creation
Cost of infrastructure maintenance
PDF, Excel, MS Word, ArcView, floppy disks

Page 19: Sla2009 D Curation Heidorn

Technical Solutions: Move the tail to the head (increase k)

Data standards
e.g. Ecological Metadata Language (EML)
e.g. TaxonX - taXMLit

Metadata
Darwin Core (DwC)
Access to Biological Collection Data (ABCD)

Protocols
TAPIR
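As an illustration of what a shared metadata standard buys, here is a minimal sketch that writes one occurrence record using Darwin Core term names as column headers. The record itself is hypothetical and invented for this example; the helper name `to_dwc_csv` and the choice of six terms are assumptions, not part of the talk.

```python
import csv
import io

# Darwin Core term names used as column headers.
DWC_FIELDS = ["scientificName", "eventDate", "waterBody",
              "decimalLatitude", "decimalLongitude", "basisOfRecord"]

# Hypothetical observation, invented for illustration only.
record = {
    "scientificName": "Lates niloticus",  # Nile perch
    "eventDate": "1998-07-14",
    "waterBody": "Lake Victoria",
    "decimalLatitude": "-1.0",
    "decimalLongitude": "33.0",
    "basisOfRecord": "HumanObservation",
}


def to_dwc_csv(records):
    """Serialize records to CSV text with Darwin Core column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


dwc_csv = to_dwc_csv([record])
print(dwc_csv)
```

Because the column names come from a published vocabulary rather than one lab's spreadsheet conventions, a data set from a small project can be merged with others without per-file guesswork.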

Page 20: Sla2009 D Curation Heidorn

Solutions

Controlled Vocabularies
MeSH, ZooBank, IPNI, ITIS

Ontologies
Gene Ontology (GO)
Science Environment for Ecological Knowledge (SEEK)
EcoGrid
Leopold Semi-Automated ontology generation for Amphibian Morphology DBI-0640053

(Semantic) web software

DataNet

Page 21: Sla2009 D Curation Heidorn

Institutional Solutions

Well Paid Librarians
Well-heeled Museums
Professional societies
Generous Publishers

Library director John Hanson told the Associated Press that a couple of dozen people are cited each year for failure to return materials or pay fines. The incident cost Dalibor about $30 for the two overdue paperbacks. It cost her mother $172 to free her.

Page 22: Sla2009 D Curation Heidorn

Organizational Solutions

Phase One of a Lake Victoria Biodiversity Informatics Project

DataNet (DataONE and Data Conservancy)
Dryad
LTER, NEON, GBIF, TDWG
National Center for Ecological Analysis and Synthesis (NCEAS)
National Evolutionary Synthesis Center (NESCent)
European Union Networks of Excellence (NoE)
European Distributed Institute of Taxonomy (EDIT)

Page 23: Sla2009 D Curation Heidorn

Education Programs

Biological Information Specialist

Concentration in Data Curation (MSLIS)

Certificate of Advanced Study in Data Curation

Summer Institutes in Data Curation

Information and professional education in biodiversity informatics

Page 24: Sla2009 D Curation Heidorn

Biological Information Specialists

• At present:
• Biologists at all degree levels self-trained in information technology
• Information technologists at all degree levels self-trained in biology
• (both with gaps in knowledge for many months, years)

• Differing roles of BIS in large and small science

Page 25: Sla2009 D Curation Heidorn

Master of Science in Biological Informatics


Degree Program began September 2007

Part of campus-wide bioinformatics masters program

NSF/CISE/IIS, Education Research and Curriculum Development, 0534567 (Palmer, PI)

Combines Biology, Bioinformatics, Computer Science core with LIS courses


Page 26: Sla2009 D Curation Heidorn

What does a BIS need to know?

Biological training and interest in solving biological research problems

Information skills

• Evaluation and implementation of information systems: user-based assessment and continual quality improvement for the development of tools that work and are used.

• Information acquisition, management, and dissemination: development of digital libraries, data archives, institutional repositories, and related tools.

• Information organization and integration: ontology development, structuring information for optimal use and sharing, and standards development.

Page 27: Sla2009 D Curation Heidorn

UIUC bioinformatics core coursework

Cross-disciplinary course distribution requirement

Bioinformatics:
• Computing in Molecular Biology
• Algorithms in Bioinformatics
• Principles of Systematics

Computer Science:
• Algorithms
• Database Systems

Biology:
• Human Genetics
• Introductory Biochemistry
• Macromolecular Modeling

Page 28: Sla2009 D Curation Heidorn

Sample of existing LIS courses

Information Organization and Knowledge Representation
• LIS 551 Interfaces to Information Systems
• LIS 590DM Document Modeling
• LIS 590RO Representing and Organizing Information Resources
• LIS 590ON Ontologies in Natural Science

Information Resources, Uses and Users
• LIS 503 Use and Users of Information
• LIS 522 Information Sources in the Sciences
• LIS 590TR Information Transfer and Collaboration in Science

Information Systems
• LIS 456 Information Storage and Retrieval
• LIS 509 Building Digital Libraries
• LIS 566 Architecture of Network Information Systems
• LIS 590EP Electronic Publishing

Disciplinary Focus
• LIS 530B Health Sciences Information Services and Resources
• LIS 590HI Healthcare Informatics (Healthcare Infrastructure)
• LIS 590EI/BDI Ecological Informatics (Biodiversity Informatics)

Page 29: Sla2009 D Curation Heidorn

MSLIS Data Curation Concentration

Data Curation Educational Program (DCEP)

IMLS – Laura Bush 21st Century Librarian Program, RE-05-06-0036-06 (Heidorn, PI)

Students with the DC concentration will be trained to add value to data and promote sharing across labs and disciplinary specializations

Page 30: Sla2009 D Curation Heidorn

DCEP Curriculum

• Required of All Master's Students
  • LIS501 Information Organization and Access
  • LIS502 (2 hrs only) Libraries, Information and Society

• Required for the DC Concentration
  • LIS590DC Foundations of Data Curation
  • LIS590PD Digital Preservation
  • LIS453 Systems Analysis and Management
  • Field Experience Seminar (Req’d if taking a practicum, 2 hours)

Page 31: Sla2009 D Curation Heidorn

DCEP courses, cont’d

• DCEP List of Recommended Electives (Students required to take two, we recommend four)
  • LIS452 Foundations of Information Processing in LIS
  • LIS590BDI Biodiversity and Ecoinformatics
  • LIS590DI Digital Libraries: Research and Practice
  • LIS590DM Document Modeling
  • LIS590IM Information Modeling
  • LIS590MD Metadata in Theory and Practice
  • LIS590OD Ontology Development
  • LIS590RO Representing and Organizing Information Resources

Page 32: Sla2009 D Curation Heidorn

Core course content

Foundations of Data Curation
• Digital Data and Collections
• Scholarly Communication and Scientific Information Work
• Lifecycles, Workflows; Data Re-use and Value
• Infrastructures and Repositories
• Selection and Appraisal
• Metadata, Standards and Protocols
• Archiving and Preservation
• Intellectual Property and Legal Issues
• Policy, Collaboration and Cooperative Alignments

Assignments on:
• Analysis of Data Management Plans
• Discipline-based data curation needs assessment

Digital Preservation
• Archival Theory & Diplomatics
• OAIS Reference Model
• Data Formats
• Digital Archival Objects
• Data Curation
• Preservation Strategies:
  • Emulation vs. migration
• Authenticity, Integrity & Trust
• Evaluation & Value
• Digital Preservation & The Law

Assignments on:
• Planning Grant Application
• Trusted Repository Assessment

Page 33: Sla2009 D Curation Heidorn

Summer Institute in Data Curation 1

• Seminar format
  • opportunities for small group discussion
  • hands-on session

• 10 presenters (GSLIS; National Snow and Ice Data Center; Purdue, UIUC, and Johns Hopkins Univ. Libraries)

• 6-person panel (3 librarians and 3 scientists): Librarians and Scientists

• Keynote by Anna Gold, Associate Dean for Public Services at California Polytechnic State University

Page 34: Sla2009 D Curation Heidorn

Field Work Opportunities

Internships
• 6 week, funded placements
• project-oriented
• Digital Research and Curation Center, Sheridan Libraries, Johns Hopkins University (2008)
• National Agricultural Library, USDA (2009)
• Distributed Data Curation Center, Purdue University Libraries (2009)
• Smithsonian (2009)

Practica
• 100 hours, course credit
• organizational orientation; shadowing
• Nat’l Snow and Ice Data Center (2009)

Page 35: Sla2009 D Curation Heidorn

New research directions

Focus on integration and scale

Informatics infrastructure as competitive edge

Sample areas of development

Landinformatics Group: Atmospheric science, hydrology, nutrient balance, carbon cycle, ecology, agronomy

Focus on data integration problems across larger range of sciences

Page 36: Sla2009 D Curation Heidorn

Conclusion

Data is increasingly a scholarly product

Data is currently not managed for the long term

Libraries are the logical institutions to manage data

Additional training will be needed

Librarians do not do it for free unless they want to