

CalBug: Digitizing California’s Terrestrial Arthropods

Peter T Oboyski, Joan Ball, Rosemary Gillespie, Joyce Gross, Traci Grzymala, Gordon Nishida, Kipling Will
Essig Museum of Entomology, University of California at Berkeley, USA

Summary

Databasing of entomology collections has lagged behind that of other disciplines, primarily due to large collection sizes and the highly abbreviated and inconsistent data on very small specimen labels. CalBug is a National Science Foundation-funded collaboration of the eight major entomology collections in California* that intends to capture 1.1 million specimen-level data records from our combined holdings. Data from all institutions will be combined in a single online cache. We will analyze these data using geospatial technology to explore the relationship between changes in distribution and habitat modification.

Developing time-saving methods and technology for getting data from specimen labels into databases is paramount. We have focused on developing and testing methods and workflows that increase the rate of data capture while maximizing data quality. Digital imaging of labels provides an easy-to-view verbatim archive of specimen data and allows remote data entry from image files through manual entry, crowd-sourcing, and automated OCR and data parsing. Specimen handling remains a significant obstacle to efficient data capture from entomological collections because of its cost in time and its risk to specimens. Georeferencing is also a challenge because location data on specimen labels are highly abbreviated and inconsistent. To address these challenges we are exploring strategies that combine computer and human data handling.

Label Image Capture

Georeferencing and Mapping

*Collaborators: Bohart Museum – UC Davis, California Academy of Sciences, California State Collection of Arthropods, Entomology Research Museum – UC Riverside, Essig Museum of Entomology – UC Berkeley, LA County Natural History Museum, San Diego Natural History Museum, Santa Barbara Museum of Natural History

Figure 6. Annual average high temperatures under a high emissions scenario of climate change (Source: Cal-Adapt and the Public Interest Energy Research program, California Energy Commission). Records of arthropod collections over the past 100 years along with projections of future climates will be used to predict the impact of climate change on arthropod distributions.

Methods

Taxa and localities to database: Priority species were selected to address urgent environmental issues and target localities to examine changes in biodiversity at sites with long-term sampling, including Natural Area Reserves.

Sort specimens by location and date (optional): A “carry-over” function reduces time spent typing when consecutive specimens have similar data.

Digital imaging: DinoLite® digital microscopes (Figure 1) capture images of label data in JPEG format.

Manual data entry into MySQL database: Label data are interpreted and entered into appropriate database fields (Figure 4).

Error checking: Records are successively sorted by locality and date to identify typographic errors/inconsistencies.

Georeference locality data: Database records are uploaded to BioGeomancer georeferencing software (Figure 5), which suggests coordinates and an error radius for each locality based on standardized protocols.

Upload data to cache (in development): At the completion of the project each institution will upload records to a central cache for inter-institution analyses (Figure 4).

Temporospatial analyses (in development): GIS tools will be used to correlate species distributions with climate and habitat factors and to predict changes in species distributions based on climate change projections (Figure 6).
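The error-checking step in Methods (sorting records by locality and date so that typographic inconsistencies end up next to each other) can be illustrated with a small Python sketch. This is not the Essig database code: the record layout, the catalog numbers, the sample localities, and the similarity threshold are all assumptions made for illustration, and the standard-library `difflib` comparison stands in for a human scanning the sorted list.

```python
from difflib import SequenceMatcher
from itertools import combinations


def sort_for_review(records):
    """Sort by locality, then date, so near-identical entries sit side by side."""
    return sorted(records, key=lambda r: (r["locality"].lower(), r["date"]))


def suspicious_pairs(records, threshold=0.85):
    """Return pairs of distinct locality strings similar enough to be possible typos.

    The 0.85 threshold is an arbitrary illustrative value, not a tested setting.
    """
    localities = sorted({r["locality"] for r in records})
    pairs = []
    for a, b in combinations(localities, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Made-up example records (hypothetical catalog numbers and dates).
records = [
    {"catalog": "X0003", "locality": "Berkeley, Strawberry Canyon", "date": "1952-06-01"},
    {"catalog": "X0001", "locality": "Berkeley, Strawbery Canyon", "date": "1952-06-01"},
    {"catalog": "X0002", "locality": "Davis", "date": "1960-04-12"},
]

for record in sort_for_review(records):
    print(record["locality"], record["date"], record["catalog"])
print(suspicious_pairs(records))
```

Flagged pairs would still need a person to decide which spelling is correct, since two similar strings can also be two genuinely different localities.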

Workflow

Workflow diagram labels: Optional step; In development; Database

Assessment and Progress

Specimen handling: Handling takes significant time. It includes retrieving individual specimens, positioning labels for viewing, adding a catalog number label, and returning each specimen to its unit tray.

Digital Imaging: Protocols for entering data directly from specimens into a verbatim field, followed by parsing into interpreted fields, proved slow. Digital imaging of specimen labels provides advantages, including a true verbatim digital archive, the ability to enlarge labels onscreen, and the opportunity for remote data entry and/or Optical Character Recognition (OCR) to automate data extraction. Digital images are automatically linked to database records through a file naming convention that includes the specimen catalog number. Each specimen takes ~2 seconds to photograph, but naming and saving files adds ~7-10 seconds per specimen.

Databasing: Several fields, including higher taxonomy and “higher geography,” are filled automatically for names already in the database. Data are carried over from one specimen to the next (yellow fields in Figure 1). These features, along with pick lists and controlled fields, reduce errors.

Progress: 27,000 Hymenoptera, 8,400 Odonata, and 7,000 Lepidoptera entered into the Essig Database. 4,000 specimens fully georeferenced. 36,000 images taken, with 24,000 awaiting data entry.
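To make the image-linking and timing figures above concrete, here is a minimal Python sketch. The poster states only that file names include the specimen catalog number; the exact pattern used below (an uppercase prefix, digits, and an optional suffix such as `_label`) and the example file names are hypothetical. The throughput function simply applies the ~2 second and ~7-10 second figures quoted above.

```python
import re
from pathlib import Path

# Hypothetical convention: "<PREFIX><digits>[_<suffix>].jpg", e.g. "EMEC123456_label.jpg".
CATALOG_RE = re.compile(r"^(?P<catalog>[A-Z]+\d+)(?:_[A-Za-z0-9]+)?$")


def catalog_number_from_filename(filename):
    """Extract the catalog number from an image file name, or None if it does not fit."""
    match = CATALOG_RE.match(Path(filename).stem)
    return match.group("catalog") if match else None


def link_images(filenames):
    """Group image files by catalog number; collect names that cannot be linked."""
    linked, unmatched = {}, []
    for name in filenames:
        catalog = catalog_number_from_filename(name)
        if catalog is None:
            unmatched.append(name)
        else:
            linked.setdefault(catalog, []).append(name)
    return linked, unmatched


def specimens_per_hour(photo_seconds, naming_seconds):
    """Imaging throughput implied by per-specimen photographing and file-handling times."""
    return 3600 / (photo_seconds + naming_seconds)


linked, unmatched = link_images(["EMEC123456_label.jpg", "EMEC123456.jpg", "IMG_0001.jpg"])
print(linked)      # files grouped under their catalog number
print(unmatched)   # files that need manual renaming
print(specimens_per_hour(2, 7), specimens_per_hour(2, 10))
```

With the quoted times, imaging alone runs at roughly 300 to 400 specimens per hour, and most of each cycle is file naming and saving rather than photography, which is why automating that step is attractive.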

Improving image & data acquisition

Minimize imaging time: We are developing high-throughput assembly lines that increase the rate of image capture by arranging handling tasks spatially and by automating file naming and saving.

Online crowd-sourcing: We are collaborating with the Zooniverse citizen science program to engage thousands of volunteers in label data entry from digital images. Each label is entered several times by different volunteers, and the entries are compared for consistency (as a proxy for accuracy).

OCR and automated data parsing: We are developing user dictionaries for Optical Character Recognition software to increase recognition rate and accuracy. We are also looking for programmers to create a “smart” parsing program that can assign data elements to appropriate database fields based on context and dictionary terms.

Developing a data cache: Data from each collaborating institution will be added to a combined online cache (see required fields in Figure 4).
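The consistency comparison described under online crowd-sourcing could look something like the following Python sketch. It is not the Zooniverse implementation: the normalization (lowercasing and collapsing whitespace), the simple majority rule, and the 0.5 agreement threshold are assumptions chosen to keep the example small.

```python
from collections import Counter


def normalize(text):
    """Collapse whitespace and lowercase so trivial differences do not count as disagreement."""
    return " ".join(text.split()).lower()


def consensus(entries, min_agreement=0.5):
    """Return (value, agreement) for one field transcribed by several volunteers.

    value is the most common normalized entry, or None when its share of the
    entries does not exceed min_agreement (an illustrative threshold).
    """
    if not entries:
        return None, 0.0
    counts = Counter(normalize(entry) for entry in entries)
    value, votes = counts.most_common(1)[0]
    agreement = votes / len(entries)
    return (value if agreement > min_agreement else None), agreement


# Three hypothetical volunteer transcriptions of the same locality field.
print(consensus(["Cloverdale", "cloverdale ", "Cloverdal"]))
# Two entries that disagree: no consensus, so the label goes back for review.
print(consensus(["Cloverdale", "Cloverdal"]))
```

Fields without a consensus value are natural candidates for expert review, and the agreement score gives a rough way to rank which labels to check first.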

1. Select taxa for databasing
2. Sort specimens by location & date
3. Tease apart labels to view all text, add catalog # label
4. Take, name, and save digital image of labels
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache
9. Temporospatial analyses

Collecting Event Data
eventID (DC)
country (DC)
stateProvince (DC)
county (DC)
locality (DC)
minimumElevationMeters (DC)
maximumElevationMeters (DC)
decimalLatitude (DC)
decimalLongitude (DC)
coordinateUncertaintyMeters (DC)
geodeticDatum (DC)
verbatimCoordinateSystem (DC)
georeferenceSources (DC)
georeferencedBy (DC)
georeferencedDate
georeferenceRemarks (DC)
collectionBeginDate (*)
collectionEndDate (*)
recordedBy (DC) = collectors
samplingProtocol (DC)
associatedTaxa (DC)
sex (DC)
individualCount (DC)

Specimen Data
catalogNumber (DC)
institutionCode (DC)
kingdom (DC)
phylum (DC)
class (DC)
order (DC)
family (DC)
genus (DC)
specificEpithet (DC)
subspecies
taxonIDCertainty
scientificNameAuthorship (DC)
identifiedBy (DC)
dateIdentified (DC)
eventID (DC)

Bold = required
Normal = recommended
(DC) = Darwin Core field
(*) = Darwin Core recommends one field that accommodates several date options. We prefer “begin” and “end” dates.

Figure 4. Each institution uses its own database system. Records will be collected into a Darwin Core-compliant flat-file cache, with required fields for collecting event data and specimen data as indicated in the above tables from the Essig database. Labels are often highly abbreviated – unrecognized abbreviations are entered “as is” and bulk updated after data entry is completed.
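As a rough illustration of how an institution might prepare records for a flat-file cache like the one in Figure 4, the Python sketch below writes records to CSV using a subset of the field names listed above. Which fields are required is not recoverable from this text, so the `REQUIRED` set is a placeholder assumption, as are the example values; the actual cache format is still in development.

```python
import csv
import io

# A subset of the Figure 4 field names, used as flat-file columns.
FIELDS = [
    "catalogNumber", "institutionCode", "family", "genus", "specificEpithet",
    "eventID", "country", "stateProvince", "county", "locality",
    "decimalLatitude", "decimalLongitude", "coordinateUncertaintyMeters",
]

# Placeholder assumption: the real required set is the bold entries in Figure 4.
REQUIRED = {"catalogNumber", "institutionCode", "eventID"}


def missing_required(record):
    """Return the required field names that are absent or blank in a record."""
    return sorted(f for f in REQUIRED if not str(record.get(f, "")).strip())


def to_flat_file(records):
    """Serialize records to CSV text, refusing any record that lacks a required field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for record in records:
        problems = missing_required(record)
        if problems:
            raise ValueError(f"record is missing required fields: {problems}")
        writer.writerow(record)  # fields not supplied are written as empty strings
    return buffer.getvalue()


# Hypothetical record; only the species name comes from the Figure 5 caption.
example = {
    "catalogNumber": "X0001", "institutionCode": "EMEC", "eventID": "E0001",
    "genus": "Libellula", "specificEpithet": "luctuosa", "stateProvince": "California",
}
print(to_flat_file([example]))
```

Because each institution keeps its own database system, a check of this kind would run on that institution's export before upload, so the cache only ever receives records with the agreed columns.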

Figure 1. (upper left) DinoLite® digital microscope and software used to capture images of specimens and labels. (upper right) Essig database data entry screen with specimen image – clicking on the image icon makes the image appear in a separate movable window. Yellow fields are carried over to the next specimen. (lower right) Dragonfly with labels removed for imaging.

Figure 5. Semi-automated programs, such as BioGeomancer, estimate latitude-longitude coordinates with an adjustable error radius based on text descriptions (above example: 15 miles E of Cloverdale, CA). Queries of georeferenced specimens are mapped “on-the-fly” using Berkeley Mapper (right example: specimens near Sacramento, California of Libellula luctuosa Burmeister dragonflies in the Essig Database).
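The kind of offset description in the Figure 5 example ("15 miles E of Cloverdale, CA") can be turned into a point estimate with a standard great-circle destination calculation. The Python sketch below is only that calculation, not BioGeomancer's method: it ignores the error radius entirely, assumes a spherical Earth, and uses an approximate, assumed coordinate for Cloverdale.

```python
import math

EARTH_RADIUS_KM = 6371.0088   # mean Earth radius, spherical approximation
KM_PER_MILE = 1.609344
BEARINGS = {"N": 0.0, "NE": 45.0, "E": 90.0, "SE": 135.0,
            "S": 180.0, "SW": 225.0, "W": 270.0, "NW": 315.0}


def offset_point(lat, lon, distance_miles, direction):
    """Return (lat, lon) in degrees reached by travelling from a start point.

    Uses the spherical destination-point formula with an initial compass bearing.
    """
    bearing = math.radians(BEARINGS[direction])
    delta = distance_miles * KM_PER_MILE / EARTH_RADIUS_KM  # angular distance
    lat1, lon1 = math.radians(lat), math.radians(lon)
    lat2 = math.asin(math.sin(lat1) * math.cos(delta)
                     + math.cos(lat1) * math.sin(delta) * math.cos(bearing))
    lon2 = lon1 + math.atan2(math.sin(bearing) * math.sin(delta) * math.cos(lat1),
                             math.cos(delta) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)


# Approximate coordinates for Cloverdale, CA (an assumption for this example).
CLOVERDALE = (38.806, -123.017)
lat, lon = offset_point(*CLOVERDALE, 15, "E")
print(round(lat, 3), round(lon, 3))
```

A usable georeference also needs an uncertainty estimate covering the extent of the named place, the imprecision of "15 miles", and the vagueness of "E", which is the part that tools such as BioGeomancer add on top of a point calculation like this one.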

Figure 3. General workflow for image capture, databasing, georeferencing, and analysis. See Methods for workflow details.

© Joyce Gross

© PT Oboyski

Response to climate change