The Path to Enlightened Solutions for Biodiversity's Dark Data Keynote at Scripting Life: the science behind ViBRANT http://vbrant.eu/presentations
P. Bryan Heidorn
University of Arizona and JRS Biodiversity Foundation
2011 Scripting Life: the science behind ViBRANT
Paris, France
20-21 January 2011
The Path to Enlightened Solutions for Biodiversity's Dark Data
University of Arizona
Today: 25°C, Sunny
Thesis
Large amounts of data remain uncurated
Most of that data is from small data sets and is currently largely invisible – Dark Data
This data should be curated locally but not by scientists alone
Need for long-lived institutions
Cyberinfrastructure Vision
“The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long-term access.”
NSF Cyberinfrastructure Vision for 21st Century Discovery, Chapter 3
Recognition of need for data curation
“Recommendation 6: The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high-quality data scientists.”
Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, Recommendations
Recognition of the importance of Information
Recognition of the need for education
New work roles within traditional institutions
Interagency Working Group on Digital Data
Why Libraries and Museums
Long history of scholarly data management
Skills overlap, such as development of metadata standards, ontologies, controlled vocabularies, thesauri
Long-lived institutions
Existing overlap with museums and archives
The problem
Recognition of the problem
Information is not in accessible format
Computer Science, Information Science and Technology has not addressed the problem
No training or incentive for data generators
Dark data is the data that we know is/was there but we can’t see it.
Hubble Space Telescope composite image of a "ring" of dark matter in the galaxy cluster Cl 0024+17
Related Ideas
John Porter: Deep versus Wide databases
Swanson: Undiscovered Public Knowledge
Science Commons: Big versus Small science
$f(x) = a x^{k} + o(x^{k})$
Power Law of Science Data
$f(x) = a x^{k} + o(x^{k}), \quad x < 0.20$
[Figure: Data Volume plotted against Science Projects and Initiatives]
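As a rough illustration of the power-law idea on this slide, the sketch below builds a hypothetical ranked list of projects whose data volume follows f(x) = a * x^k with an assumed negative exponent, then measures how much of the total sits in the largest 20% of projects versus the long tail. The values of a, k, and the project count are assumptions chosen for illustration, not figures from the talk.

```python
def power_law_volumes(n_projects: int, a: float = 1.0, k: float = -1.0) -> list[float]:
    """Data volume for projects ranked 1..n, following f(x) = a * x**k.

    a and k are assumed values; the o(x**k) remainder term is ignored.
    """
    return [a * rank ** k for rank in range(1, n_projects + 1)]


def head_share(volumes: list[float], head_fraction: float = 0.20) -> float:
    """Fraction of total volume held by the largest `head_fraction` of projects."""
    ranked = sorted(volumes, reverse=True)
    head_count = int(len(ranked) * head_fraction)
    return sum(ranked[:head_count]) / sum(ranked)


# Hypothetical population of 1000 projects.
volumes = power_law_volumes(n_projects=1000)
share = head_share(volumes)
print(f"Largest 20% of projects hold {share:.0%} of the volume")
print(f"Long tail (80% of projects) holds {1 - share:.0%}")
```

Changing k changes how steep the head is; a value closer to zero spreads more of the volume into the long tail.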
Does NSF’s Data Follow the Power Law?
I do not know, but if $1 = X bytes...
[Chart: Awarded Amount 2007. Vertical axis: $0 to $7,000,000. Horizontal axis: 1 to 8776.]
20-80 Rule: The small are big!
Total Grants: 9347 ($2,137,636,716)

                 20%                     80%
Number Grants    1869                    7478
Total Dollars    $1,199,088,125          $938,548,595
Range            $6,892,810-$350,000     $350,000-$831
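The arithmetic behind "the small are big" can be checked directly from the figures above. The sketch below only recomputes shares from the slide's own numbers: the largest 20% of grants account for roughly 56% of the dollars, and the remaining 80% of grants for roughly 44%.

```python
# Figures copied from the 2007 awards table on the slide above.
TOTAL_GRANTS = 9347
TOTAL_DOLLARS = 2_137_636_716

groups = {
    "largest 20% of grants": {"grants": 1869, "dollars": 1_199_088_125},
    "remaining 80% of grants": {"grants": 7478, "dollars": 938_548_595},
}


def share(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


for label, g in groups.items():
    grant_pct = share(g["grants"], TOTAL_GRANTS)
    dollar_pct = share(g["dollars"], TOTAL_DOLLARS)
    print(f"{label}: {grant_pct:.1f}% of grants, {dollar_pct:.1f}% of dollars")
```

If dollars are taken as a proxy for bytes, as the previous slide suggests, nearly half of the data would come from grants of $350,000 or less.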
Biology 2009
#Grants: 1886
$Total: $744,168,471 ≈ €550,000,000
Distribution: 1266 < $0.5 million ≈ €370,000
Mode: $304,691 ≈ €225,000
Myth of the mega-project
Because it is high volume
Because it is information rich – high entropy
While needs of large data are understood, small data and integration are not understood
Heidorn, P. Bryan (2008). Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2), Fall 2008. Institutional Repositories: Current State and Future. Edited by Sarah Shreeves and Melissa Cragin. (http://hdl.handle.net/2142/9127).
Small data is big science
Where to find dark data
Scientist’s backpacks and desks
Literature/Biodiversity Heritage Library
Museum Specimens
Field notes
Citizen Observations
What is dark data good for?
Ecological Niche Modeling
Climate Change niche change prediction
Taxonomic Name Resolution
Literature Search Support
Taxonomic intelligence
Key-like – character searching
Phenology and Phenology change
Food-web / trophic level
Problematic Transition
Personal Information Management vs Knowledge Organization
Pluralistic vs Unified (Hjørland, 2007)
Contrast in Styles (White, in press)
Personal Information Management
One-Few users
Visual/Spatial
Project Oriented
Knowledge Organization
Many users
Language based
Long-term orientation
New Information Disciplines
Digital Curator: an expert knowledgeable of and with responsibility for the content of a digital collection(s)
Digital Archivist: an expert competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form
Data Scientists: the information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others, who are crucial to the successful management of a digital data collection
(Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, report of the National Science Board, September, 2005)
Roles
Skills
Library Roles
Life Cycle Phases
Plan
Create
Keep
Dispose
Data Management Function
Access
Document
Organize
Protect
How to Organize at a higher level?
It is difficult to find what is already known
Clonal specimens may be stored in different museums around the world
DNA analysis may be conducted on one but not the other
Micrographs may be in a database
Taxonomic treatments or revisions may exist
Biological Science Collections (BiSciCol) Tracker
[Diagram: BiSciCol Tracker. Specimens: S1: KNM (Nairobi National Museum), S2: MNHN (Muséum national d'histoire naturelle), S3: MBG (Living Collection: Missouri Botanical Garden). Taxon label: Agave sisalana. Associated records: Determination, Gene Sequence (GENBANK), Parasitism. Several links between objects are marked "?".]
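The linking problem described above can be sketched as a small graph of related objects. This is not how the BiSciCol Tracker is implemented; it is only a minimal illustration, with hypothetical identifiers and made-up links, of how a sequence recorded against one specimen could be found from a related specimen held in another institution.

```python
from collections import defaultdict, deque

# Hypothetical identifiers; the links are illustrative, not real records.
links = [
    ("KNM:S1", "MNHN:S2"),          # assumed relationship between specimens
    ("MNHN:S2", "MBG:S3"),          # assumed relationship between specimens
    ("MNHN:S2", "GenBank:seq-1"),   # sequence derived from only one specimen
    ("KNM:S1", "determination:Agave sisalana"),
]


def build_graph(pairs):
    """Build an undirected adjacency map from (object, object) links."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def related(graph, start):
    """Breadth-first search: every object reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


graph = build_graph(links)
for item in sorted(related(graph, "MBG:S3")):
    print(item)
```

Starting from the living collection specimen, the traversal reaches the sequence and the determination even though neither was recorded against that specimen directly.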
The Future is all about Data
How do we get it?
How do we analyze it?
How do we disseminate it (maps, charts, tables...)?
How do we keep it?
Provenance, Storage, Weeding
How do we make it sustainable?
Digital/Data Curation Programs
University of Illinois
Graduate School of Library and Information Science
University of Arizona
School of Information Resources and Library Science
University of North Carolina
School of Information and Library Science
Education Needs
Biological Information Specialist
Concentration in Data Curation (MSLIS)
Certificate of Advanced Study in Data Curation for Libraries and Scientist
Information and professional education in biodiversity informatics
MSLIS Data Curation Concentration
Data Curation Educational Program (DCEP)
IMLS – Laura Bush 21st Century Librarian Program,
RE-05-06-0036-06 (Heidorn, PI)
Students with the DC concentration will be trained to add value to data and promote sharing across labs and disciplinary specializations
Biological Information Specialists
At present:
Biologists at all degree levels self-trained in information technology
Information technologists at all degree levels self-trained in biology
(both with gaps in knowledge for many months, years)
Differing roles of BIS in large and small
Master of Science in Biological Informatics
Degree Program began September 2007
Part of campus-wide bioinformatics masters program
NSF/CISE/IIS, Education Research and Curriculum Development, 0534567 (Palmer, PI)
Combines Biology, Bioinformatics, Computer Science core with LIS courses
What does a BIS need to know?
Biological training and interest in solving biological research problems
Information skills
Evaluation and implementation of information systems: user-based assessment and continual quality improvement for the development of tools that work and are used.
Information acquisition, management, and dissemination: development of digital libraries, data archives, institutional repositories, and related tools.
Information organization and integration: ontology development, structuring information for optimal use and sharing, and standards development.
UIUC bioinformatics core coursework
Cross-disciplinary course distribution requirement
Bioinformatics:
Computing in Molecular Biology
Algorithms in Bioinformatics
Principles of Systematics
Computer Science:
Algorithms
Database Systems
Biology:
Human Genetics
Introductory Biochemistry
Macromolecular Modeling
Sample of existing LIS courses
Information Organization and Knowledge Representation
LIS 551 Interfaces to Information Systems
LIS 590DM Document Modeling
LIS 590RO Representing and Organizing Information Resources
LIS 590ON Ontologies in Natural Science
Information Resources, Uses and users
LIS 503 Use and Users of Information
LIS 522 Information Sources in the Sciences
LIS 590TR Information Transfer and Collaboration in Science
Information Systems
LIS 456 Information Storage and Retrieval
LIS 509 Building Digital Libraries
LIS 566 Architecture of Network Information Systems
LIS 590EP Electronic Publishing
Disciplinary Focus
LIS 530B Health Sciences Information Services and Resources
LIS 590HI Healthcare Informatics (Healthcare Infrastructure)
LIS 590EI/BDI Ecological Informatics (Biodiversity Informatics)
University of Arizona
Graduate Certificate in Digital Records Management
Six Graduate Courses within MLA program
Focus on repositories
Cross over with Knowledge Representation and Metadata
Workforce
Data Curation Workforce Summit, Dec 6th at IDCC Chicago
Identify the skill sets needed for government data curation
Department of Energy, US National Science Foundation, Institute of Museum and Library Services, Oak Ridge National Laboratory, USGS National Biological Information Infrastructure, CIESIN
The Future is Collaboration and Data Sharing
• Libraries
• Museums
• Government
• Universities
• NGO
• Private Land Holders
• Ranches
• Farms
To bring the best data to the major problems and opportunities of our time and the future
Merci (Thank you)