Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions
Christian Gendreau, David Shorthouse & Peter Desmet


TDWG 2013 talk on data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions. Authors: Christian Gendreau, David P. Shorthouse, Peter Desmet


Page 1: Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions

Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions

Christian Gendreau, David Shorthouse & Peter Desmet

Page 2

Game plan
• Introduction to Canadensys
• Data quality @ Canadensys
• Canadensys processing solutions
• Numbers from Canadensys
• Hopes and expectations

Page 3

A network of people and collections

Page 4

Canadensys Headquarters Université de Montréal Biodiversity Centre

Page 5

data.canadensys.net/vascan

Page 6

data.canadensys.net/ipt

Page 7

data.canadensys.net/explorer  

Page 8

Data quality related activities
From an aggregator perspective

Page 9

During data entry
• Help to avoid typographical errors
• Help to convert verbatim data

Actor: data entry person

Page 10

Before publication
• Detect file character encoding issues
• Detect duplicate or missing IDs

Actor: data publisher

Previous activity: Data entry
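The two pre-publication checks listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the Canadensys implementation: the function names `check_ids` and `is_valid_utf8` are hypothetical, and UTF-8 is assumed as the target encoding.

```python
from collections import Counter


def check_ids(ids):
    """Return (missing_rows, duplicate_ids) for a list of record identifiers.

    Hypothetical helper: a value counts as missing when it is None,
    empty, or whitespace only.
    """
    cleaned = [(value or "").strip() for value in ids]
    missing_rows = [row for row, value in enumerate(cleaned) if not value]
    counts = Counter(value for value in cleaned if value)
    duplicate_ids = sorted(value for value, n in counts.items() if n > 1)
    return missing_rows, duplicate_ids


def is_valid_utf8(raw_bytes):
    """Return True when the bytes decode cleanly as UTF-8 (assumed target)."""
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A publisher could run both checks on a file before handing it to an aggregator; a failed UTF-8 decode is a common sign that the file was saved in a legacy encoding such as Latin-1.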

Page 11

During aggregation
• Process data: validation, cleaning
• Produce structured reports: quality control

Actor: data aggregator

Previous activity: Before publication

Page 12

After aggregation
• Allow and facilitate community feedback
• Help data publishers integrate corrections

Actor: users and community

Previous activity: Aggregation

Page 13

Canadensys tools during data entry

data.canadensys.net/tools  

Page 14

Why do we process data?
• Enrich our Explorer, http://data.canadensys.net
• Provide structured reports to data providers
• Help identify records that need re-examination
• Help to improve data entry procedures

Page 15

Data processing

Page 16

Processing solutions
Narwhals to the rescue

Narwhal image Public Domain

Page 17

The narwhal-processor approach
● Single field processing to allow complex processing (combined fields)
● Processors with common interface ease integration and usage
● Collaboration

https://github.com/Canadensys/narwhal-processor
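To illustrate what "processors with a common interface" can look like, here is a minimal Python sketch. It does not reproduce the narwhal-processor API: `Result`, `Processor`, `WhitespaceProcessor`, and `process_record` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Result:
    """Outcome of processing one field (hypothetical structure)."""
    value: Optional[str]          # cleaned value, or None on failure
    error: Optional[str] = None   # short reason when value is None


class Processor(Protocol):
    """Common interface: every processor turns one raw value into a Result."""
    def process(self, raw: str) -> Result: ...


class WhitespaceProcessor:
    """Example single-field processor: collapse runs of whitespace."""
    def process(self, raw: str) -> Result:
        cleaned = " ".join(raw.split())
        if not cleaned:
            return Result(None, "empty value")
        return Result(cleaned)


def process_record(record, processors):
    """Apply one processor per field and return results keyed by field name."""
    return {
        field: processor.process(record.get(field, ""))
        for field, processor in processors.items()
    }
```

Because every processor exposes the same `process` method, single-field processors can be combined per record, and a multi-field check can be layered on top of their results.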

Page 18

Data usability before processing

[Bar chart, y-axis: % of non-null clean verbatim data]
country text: 92%
state/province text: 60%
coordinates: 96%
dates: 44%

Page 19

Data usability after processing
• 7% of provided country text

USA → ISO 3166-2:US, United States

Page 20

Data usability after processing
• 7% of provided country text
• 16% of provided state/province text

Qué → ISO 3166-2 CA-QC, Quebec
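The USA and Qué examples can be reproduced with a simple lookup after folding case, accents, and punctuation. This is a sketch under stated assumptions, not the Canadensys processor: the `fold` and `normalize` helpers and the small lookup tables are hypothetical, and a real system would use a maintained vocabulary of names and variants.

```python
import unicodedata


def fold(text):
    """Lowercase and strip accents and punctuation so that 'Qué' matches 'que'."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(ch for ch in without_accents.lower() if ch.isalnum())


# Illustrative tables only: folded variant -> (ISO 3166 code, preferred name).
COUNTRIES = {
    "usa": ("US", "United States"),
    "unitedstates": ("US", "United States"),
    "canada": ("CA", "Canada"),
}
PROVINCES = {
    "que": ("CA-QC", "Quebec"),
    "qc": ("CA-QC", "Quebec"),
    "quebec": ("CA-QC", "Quebec"),
}


def normalize(raw, table):
    """Return (code, name) for a verbatim value, or None when it is unknown."""
    return table.get(fold(raw))
```

Returning None for unknown values, instead of guessing, keeps unmatched records available for a structured report to the data provider.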

Page 21

Data usability after processing
• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates

45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778
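The coordinate conversion shown here follows the usual rule: decimal degrees = degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. The sketch below is one possible way to implement it; the pattern and the function name `dms_to_decimal` are assumptions for this example and cover only the notation seen on the slides.

```python
import re

# Degrees, optional minutes, optional seconds, then a hemisphere letter.
DMS_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*°\s*"            # degrees
    r"(?:(\d+(?:\.\d+)?)\s*['′]\s*)?"    # optional minutes
    r'(?:(\d+(?:\.\d+)?)\s*["″]\s*)?'    # optional seconds
    r"([NSEW])"                          # hemisphere
)


def dms_to_decimal(text):
    """Convert one coordinate such as 45° 32' 25" N to decimal degrees."""
    match = DMS_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = (
        float(degrees)
        + float(minutes or 0) / 60
        + float(seconds or 0) / 3600
    )
    if hemisphere in "SW":
        value = -value
    return round(value, 7)
```

For the pair on this slide, 45 + 32/60 + 25/3600 gives 45.5402778 and 129 + 40/60 + 31/3600 gives 129.6752778, which becomes negative because the longitude is west.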

Page 22

Data usability after processing
• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates
• 42% of provided dates

2008 VI 13 → 2008-06-13
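Dates written with a Roman-numeral month, as in this example, can be converted to ISO 8601 with a small lookup. The sketch assumes the "year month day" order shown on the slide; the function name `parse_roman_month_date` is hypothetical, and other verbatim date styles would need their own rules.

```python
from datetime import date

ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}


def parse_roman_month_date(text):
    """Convert 'YYYY <roman month> DD' to an ISO 8601 date string."""
    parts = text.split()
    if len(parts) != 3:
        raise ValueError(f"expected three parts, got: {text!r}")
    year, month, day = parts
    if month.upper() not in ROMAN_MONTHS:
        raise ValueError(f"unknown month: {month!r}")
    # date() also rejects impossible days such as 31 June.
    return date(int(year), ROMAN_MONTHS[month.upper()], int(day)).isoformat()
```

Building a real `date` object before formatting means impossible dates raise an error instead of being passed through as clean data.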

Page 23

Data usability including processed data

[Bar chart, y-axis: % of non-null provided]
country text: 92% clean verbatim, 7% processed
state/province text: 60% clean verbatim, 16% processed
coordinates: 96% clean verbatim, 4% processed
dates: 44% clean verbatim, 42% processed

Page 24

Projects with data quality tools
• Atlas of Living Australia
• GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
• GBIF libraries
• Most nodes have their own data quality routine

Page 25

Hopes and expectations

Page 26

We do not want to
• Maintain taxonomic authority files
• Maintain country, province and city lists

Page 27

We prefer to
• Efficiently use specialized resources/services
• Provide reports, quality indices

Page 28

Help from Semantic Web
• Data in other languages (French, Spanish, …) should not be flagged as errors
• Misspellings should be shared as a common resource (e.g. SKOS)
• Understand historical data (e.g. collected in USSR in 1980)

Page 29

Reporting and log
• Darwin Core annotations for processed data
• Shared vocabulary for structured reports and quality indices

Page 30

Summary
• Tools available for sharing
• Use, review, contribute
• Opportunity for broad coordination and increased efficiencies

Page 31

Thanks

Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal

Page 32

Contact
http://www.canadensys.net
http://github.com/Canadensys
@Canadensys

Gulo gulo, Larry Master (www.masterimages.org)

Page 33

Multi-field processing

DwC field           Raw data        Processed data
verbatimLatitude    45°30′N         45.5
verbatimLongitude   73°34′W         -73.5666667
country             Canada          Canada
stateProvince       QC              Quebec
municipality        Montreal City   Montreal

Page 34

Multi-field processing
1. Get information on coordinates 45.5, -73.5666667
2. Compare with processed data
3. Assert that these coordinates are in Montréal
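The three steps can be sketched as a consistency check between the coordinates and the processed municipality. Everything below is an assumption made for illustration: `lookup_municipality` stands in for whatever reverse-geocoding resource is actually used, and the bounding box for Montréal is a rough, hand-picked approximation.

```python
# Rough bounding box (south, north, west, east); illustrative values only.
MONTREAL_BBOX = (45.40, 45.71, -73.98, -73.47)


def lookup_municipality(latitude, longitude):
    """Hypothetical stand-in for step 1, a reverse-geocoding lookup."""
    south, north, west, east = MONTREAL_BBOX
    if south <= latitude <= north and west <= longitude <= east:
        return "Montreal"
    return None


def check_record(processed):
    """Steps 2 and 3: compare the lookup result with the processed field."""
    found = lookup_municipality(
        processed["decimalLatitude"], processed["decimalLongitude"]
    )
    expected = processed["municipality"]
    return {
        "expected": expected,
        "found": found,
        "consistent": found == expected,
    }
```

With the processed values from the table (45.5, -73.5666667, Montreal) the check reports a consistent record; a mismatch would flag the record for re-examination without altering the data.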