Data Quality Control &
Improvement methods
By: Alpheus Mothapo, Hannelie Snyman, Michelle Smith
Date: 12 May 2016
2016 BIMF/FBIP FORUM
What are the sources of data quality problems in specimen / occurrence data?
Insufficient data on collector labels.
Incorrect interpretation of specimen label information.
Difficulty deciphering handwritten information.
Information in a foreign language.
When scientists publish new names, both the database and the physical specimens must be updated.
The sources of data quality problems in specimen / occurrence data
Inconsistencies in data formatting – e.g. spacing, date formats, coordinate formats.
Inconsistencies can arise across herbaria if different standards are followed.
BRAHMS usually provides standard fields/drop-downs for selecting appropriate values.
Inconsistencies occur in memo fields if proper standards are not followed.
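Many of these formatting inconsistencies can be caught mechanically before QC. A minimal sketch in Python (the accepted date formats and the coordinate-spacing rule here are illustrative assumptions, not BRAHMS conventions):

```python
import re
from datetime import datetime

def normalize_date(raw):
    """Try common label date formats; return ISO 8601 (YYYY-MM-DD) or None."""
    for fmt in ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d", "%d %b %Y", "%d %B %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unrecognised format: flag for manual review

def normalize_coordinate(raw):
    """Collapse spacing variants like "30 ° 17 ' S" into "30°17'S"."""
    return re.sub(r"\s*([°'\"])\s*", r"\1", raw.strip())

print(normalize_date("12 May 2016"))        # -> 2016-05-12
print(normalize_coordinate("30 ° 17 ' S"))  # -> 30°17'S
```

A record whose date comes back as None can then be routed straight to the quality controller rather than silently stored in an inconsistent form.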
The sources of data quality problems in specimen / occurrence data
Errors in data capture from specimen labels – incorrect spelling of species names, collectors, locality details, etc.
These errors result from collectors' mistakes, or from typists not being allowed to correct them.
Different collectors use different types of field labels that do not match the database, so data capturers are sometimes confused about where the information belongs.
Collectors sometimes provide information on labels that a specialist knows to be incorrect, so the whole data capturing process is flawed from the start.
Incorrect capture of GPS readings, or the wrong llunit (coordinate unit) selected.
Example of hand-written field labels
Example of SANBI field label
Georeferencing inaccuracies
SANBI's policy is to copy all the information on the specimen label (even if it is incorrect) so as not to lose information.
A dedicated field in BRAHMS is used to record incorrect information after the record is updated to the correct information.
A dedicated team at SANBI is responsible for updating the QDS, lat and long values.
One of the best ways to identify identification or locality errors is to map the species'/specimens' distributions.
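The idea of mapping distributions to spot errors can be approximated even without GIS software: a record that falls far from the rest of a species' known localities deserves a second look. A rough sketch, assuming hypothetical records for a single species as (species, decimal latitude, decimal longitude) tuples and a simple median-offset rule:

```python
from statistics import median

# Hypothetical records for one species (species, decimal lat, decimal lon)
records = [
    ("Protea repens", -33.9, 18.4),
    ("Protea repens", -34.1, 18.6),
    ("Protea repens", -25.7, 28.2),  # suspiciously far from the others
]

def flag_outliers(records, max_offset_deg=2.0):
    """Flag records further than max_offset_deg (in degrees) from the
    median position of the batch; assumes all records are one species."""
    lat_med = median(r[1] for r in records)
    lon_med = median(r[2] for r in records)
    return [r for r in records
            if abs(r[1] - lat_med) > max_offset_deg
            or abs(r[2] - lon_med) > max_offset_deg]

print(flag_outliers(records))  # -> [('Protea repens', -25.7, 28.2)]
```

The flagged record is not necessarily wrong, but it tells the quality controller exactly which specimen to pull and re-check against the label.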
How is data quality checked?
For data capturing, draft labels are printed and the data quality controller checks the field labels, draft labels and the database for errors.
By allocating batches to staff knowledgeable of this work – the batches have field labels and draft printouts of typed labels.
Avoiding a situation where the person who captures the information also performs the QC on it.
Using the BRAHMS screen to double-check the QC draft label information, while also consulting the original data source (i.e. the field label).
Typed field label
Quality Controlled draft label
Final transferred label
Typing & error identification
QC and FoxPro update
Final label after corrections and FoxPro update
How is data quality checked?
At SANBI two processes are followed:
Data curation: Specimens updated to reflect new revisions and name changes.
Verification: Physical specimens used to verify data in the database.
Doing corrections
Errors are highlighted in red pen as this makes it easy for the typist to make corrections.
Corrections can only be done on the live system.
Should there be a way of tracking changes made?
Currently there is a way of tracking changes made in BRAHMS (the edithist field).
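The internal format of the BRAHMS edithist field is not documented here, but the general idea of an edit-history field can be illustrated with a generic sketch (the record layout, entry format and function name are assumptions for illustration only):

```python
from datetime import datetime, timezone

def record_edit(record, field, new_value, editor):
    """Append a timestamped entry to the record's edit-history string
    before overwriting the field (generic sketch, not the BRAHMS format)."""
    old = record.get(field, "")
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    entry = f"{stamp} {editor}: {field} '{old}' -> '{new_value}'"
    history = record.get("edithist", "")
    record["edithist"] = (history + "; " + entry).lstrip("; ")
    record[field] = new_value
    return record

rec = {"locality": "Cpe Town"}
record_edit(rec, "locality", "Cape Town", "AM")
print(rec["locality"])  # -> Cape Town
print(rec["edithist"])  # old value preserved in the history string
```

The point is the same as the SANBI policy above: the correction goes live, but the original (possibly incorrect) value is never lost.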
Are there tools or short cuts that can be used to do quality checking?
Data quality checking has no shortcuts – taking them compromises the data.
Data quality checking is a very important and necessary process.
The quality of data must be of a high standard.
Problems must be identified and dealt with during the encoding process, as fixing data quality problems later costs more work.
Resource requirements
What are the resource requirements for quality control and data cleaning with a large data set such as SANBI's plant specimen database?
Policies and procedures to ensure consistency in data cleaning and quality control standards across all herbaria: a standardized typing process, a standardized quality control process, and a standardized final transfer process.
Effective georeferencing tools (mapping software, electronic maps).
Training to be provided to make data input more accurate.
Well-trained staff with dedication and determination; skilled staff who take an interest in data quality.
Physical resources: hardware (computers, scanners, etc.) and an efficient, user-friendly database with minimal network and technical problems.
Improvement methods
It depends on the labels being checked. SANBI-designed field labels are easy to QC because field information is ticked rather than written, which keeps the process simple, with only small written additions.
Data quality control also depends on the section of data concerned (Taxa, People, Botanical Records).
How long does it take, how many people are involved?
Depending on the quantity of information on the label, the handwriting of the collector and the experience of the quality controller, it can take anything from 1.5 to 3 hours per batch of 50 specimens.
Data Capture: Must-haves
Geographical knowledge and interpretation (e.g. grid references and coordinates).
Georeferencing (understanding the quarter-degree system and being able to georeference – e.g. readings: 30.17°S vs 30°17.34'S).
Attention to detail (e.g. differentiating between collectors who share the same initials, names, etc.).
Familiarity with herbarium codes (in case material is donated to PRE).
Basic requirements for flagging type specimens within the database (able to use the relevant protologue or basionym publication/reference).
Understanding of botany and botanical terms (e.g. scrub vs shrub).
Good English reading and writing skills to decipher and interpret handwriting on labels.
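The two readings given above look alike but denote different places: 30.17°S is decimal degrees, while 30°17.34'S is degrees and decimal minutes (30 + 17.34/60 ≈ 30.289°S), which is why selecting the right coordinate unit matters. A sketch of the conversion, plus a quarter-degree square (QDS) code derived with the A–D grid lettering commonly used in southern Africa (the lettering convention and function names are assumptions for illustration):

```python
def ddm_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees + decimal minutes (e.g. 30° 17.34' S) to signed decimal degrees."""
    dd = degrees + minutes / 60.0
    return -dd if hemisphere in ("S", "W") else dd

def qds_code(lat, lon):
    """Quarter-degree square code for a southern-African point
    (negative lat = south, positive lon = east). Each degree square is
    split twice into quadrants lettered A (NW), B (NE), C (SW), D (SE)."""
    alat, alon = abs(lat), abs(lon)
    code = f"{int(alat):02d}{int(alon):02d}"
    flat, flon = alat - int(alat), alon - int(alon)
    for step in (0.5, 0.25):
        row = 0 if flat % (step * 2) < step else 1
        col = 0 if flon % (step * 2) < step else 1
        code += "ABCD"[row * 2 + col]
    return code

lat = ddm_to_decimal(30, 17.34, "S")
print(round(lat, 3))           # -> -30.289
print(qds_code(lat, 18.4))     # -> 3018AD
print(qds_code(-25.7, 28.2))   # Pretoria area -> 2528CA
```

A data capturer who treats 30°17.34'S as if it were 30.1734°S introduces a locality error of several kilometres, which is exactly the kind of mistake the mapping check described earlier is meant to catch.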
The End.
Thank you.