Solving informatics challenges to advance plant ecology: a vision for the next 100 years Brian McGill ESA Baltimore August 14, 2015


First point

5‐10


Where I’m coming from

• 1990s: worked in the computer consulting world doing datawarehousing (building databases for business analytics)
• 2000s: analyzed BBS and other large datasets
• 2010s: part of numerous ecoinformatics projects
  – BIEN – Botanical Information and Ecology Network
    • 3 iterations
    • 20 million records
  – ETE – Evolution of Terrestrial Ecosystems
  – E&O – Environment and Organisms
  – Consultation with lots of others


The trend

[Figure: papers in ecology with 'database*' per year, 1990 to 2015; y-axis from 0 to 1600]


Why the rise of ecoinformatics? ‐ push

• Hard disk capacity doubles every 23 months
• Soberón & Peterson (2004): data growing faster than exponentially


Why the rise of ecoinformatics? ‐ pull

• Conservation & policy
  – Large scales
  – Long time series
  – Inventory mentality
• Basic ecology
  – Macroecology
  – Ecosystems ecology

• Global change


The data

[Figure: two panels, each with Site and Time axes: biotic measurements and abiotic (environmental) measurements]


McGill’s 9 easy steps of ecoinformatics

Collect Scrub Join Store Update Analyze

Manage

Staff

Fund


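McGill’s 9 easy steps of ecoinformatics: the real order

[Figure: the nine steps (Collect, Scrub, Join, Store, Update, Analyze, Manage, Staff, Fund) relabelled with the order in which they really happen; labels as extracted: 1a, 2, 3, 5, 4, 6, 7, 8, 1b, 9, 10]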


McGill’s 9 easy steps of ecoinformatics: amount of work

[Figure: the nine steps (Collect, Scrub, Join, Store, Update, Analyze, Manage, Staff, Fund) drawn by amount of work]

Gartner Group: 70% of datawarehousing is in data preparation


Collecting data sources

• Some are open and on the web
• The rest is a complex social process
  – Use the evolution of sociality
    1. Compel/punish (journal, funder requirements)
    2. Trust/reputation (small homogeneous group)
    3. Joint fate (group selection)
• Write out an agreement in advance!
• Most ecoinformatics projects evolve towards openness


Gartner Group: 70% of datawarehousing is in data preparation


4 dimensions

• Values
  – 100 cm of rainfall yesterday
  – 0 for NA
  – 1.00 vs 10.0 (transcription errors)
  – Instrument errors
  – Data filling?
• Space
  – Geocoding (Convention center, Baltimore: 39.2883N, 76.6181W)
  – Geoscrubbing
    • 42, 100 for North America (longitude sign dropped)
    • 100, 42 for North America (latitude and longitude swapped)
    • 0, 0
    • State centers
• Time
  – Best tools, but amazing how often 6/14/2015 vs 2015/6/14 still causes trouble
• Taxonomy
  – Misspellings
  – Synonymy
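As an illustration of the Time dimension, here is a minimal Python sketch that brings mixed date strings such as 6/14/2015 and 2015/6/14 into ISO 8601 form. The list of formats and both function names are assumptions made for this example, not part of any project described in the talk.

```python
from datetime import datetime
from typing import Optional

# Hypothetical list of formats seen in source files; extend as needed.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y")


def to_iso_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def is_day_month_ambiguous(raw: str) -> bool:
    """True when a slash date like 6/7/2015 could be read as M/D or D/M."""
    parts = raw.strip().split("/")
    if len(parts) != 3 or len(parts[0]) == 4:
        return False
    try:
        first, second = int(parts[0]), int(parts[1])
    except ValueError:
        return False
    return first <= 12 and second <= 12 and first != second
```

Records that come back as None, or that are flagged as ambiguous, are better routed to manual review than silently guessed.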


Synonyms and errors

Soberón & Peterson (2004)


Taxonomic Scrubbing in BIEN

• 2.5M records → 600,000 “species” in New World!
• 600,000 names → 300,000 standardized names after synonymy and misspelling (fuzzy matching)
• TNRS service (Boyle et al. 2013)
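To give a feel for what fuzzy matching of names involves, here is a small Python sketch built on the standard library's `difflib`. It is a stand-in for illustration only and does not reproduce how TNRS works; the reference list, the function name, and the 0.85 cutoff are all assumptions.

```python
import difflib

# Hypothetical reference list of accepted names; a real project would load a
# full taxonomic authority file instead.
ACCEPTED_NAMES = ["Pinus strobus", "Pinus resinosa", "Quercus rubra", "Acer saccharum"]


def standardize_name(raw, accepted=ACCEPTED_NAMES, cutoff=0.85):
    """Return the closest accepted name, or None if nothing is similar enough."""
    # Collapse runs of whitespace and normalize case to "Genus species".
    cleaned = " ".join(raw.split()).capitalize()
    matches = difflib.get_close_matches(cleaned, accepted, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For example, `standardize_name("Pinus strobis")` resolves to "Pinus strobus", while a name with no close counterpart in the list returns None and can be queued for a human to check. Synonymy is a separate problem: it needs a lookup table of accepted names, which string similarity cannot supply.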


Geoscrubbing ‐ Synonymy

• MEX: ISO 3166‐1 alpha‐3
• MX: ISO 3166‐1 alpha‐2
• Mexico: “official” geonames.org (and gadm.org) name
• MEXICO: capitalization insensitive
• México: geonames.org alternate name
• MÃ©xico: recognizable misencoding of México
• M&eacute;xico: translatable HTML character code
• Mexi: not matched

• 439 country “names”
• 62 (14%) unrecognizably misspelled
• 377 recognized names for 193 recognized countries (49% synonyms)
• 43% canonical names
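The kinds of country-name variants listed above can be handled with a few standard-library steps. The Python sketch below is only an illustration under stated assumptions: the alias table is a tiny hypothetical sample, and the function names are invented for this example rather than taken from BIEN.

```python
import html
import unicodedata

# Hypothetical alias table; a real one would be built from ISO 3166-1 codes
# and a gazetteer's alternate names.
ALIASES = {"mex": "Mexico", "mx": "Mexico", "mexico": "Mexico"}


def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1, when possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def strip_accents(text):
    """Drop combining marks so that 'México' and 'Mexico' share one key."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_country(raw):
    """Map a raw country string to a canonical name, or None if unmatched."""
    text = html.unescape(raw.strip())      # M&eacute;xico -> México
    text = fix_mojibake(text)              # mis-decoded UTF-8 -> México
    key = strip_accents(text).casefold()   # México / MEXICO -> mexico
    return ALIASES.get(key)
```

Strings that still return None, such as "Mexi", are the unrecognizable remainder that needs fuzzy matching or manual review.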


GeoScrubbing in BIEN

• ~1/3 of records had no lat/lon (or 0/0)
• >1/2 had + longitude
• About 1% had other obvious lat/lon errors
• 15% not in the right country
• 25% not in the right state/province
• 67% × 85% × 75% ≈ 43% correct!
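The first three checks in this list are mechanical and can be sketched in a few lines of Python. The function name and flag labels are hypothetical, and the rule that New World records should have negative longitude is a simplifying assumption (it ignores, for example, islands west of the 180° meridian).

```python
def flag_new_world_coordinates(lat, lon):
    """Return a list of problem flags for a record expected in the New World."""
    if lat is None or lon is None:
        return ["missing"]
    flags = []
    if lat == 0 and lon == 0:
        flags.append("zero_zero")
    if not -90 <= lat <= 90:
        flags.append("latitude_out_of_range")
    if not -180 <= lon <= 180:
        flags.append("longitude_out_of_range")
    if lon > 0:
        # Assumption: western-hemisphere longitudes are negative.
        flags.append("positive_longitude")
    return flags
```

A record at the Baltimore convention center, `flag_new_world_coordinates(39.2883, -76.6181)`, comes back with no flags. Checking that a point falls in the stated country or state needs boundary polygons and a point-in-polygon test, which this sketch does not attempt.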


Joining data

• Cleaning & synonyms
  – Pinus strobus in USFIA vs Pinus strobus L. in MOBOT
• Semantic joining
  – USFIA has # stems per 0.04 ha plot
  – MOBOT has a specimen card/occurrence
• Record connecting
  – The easy part – databases do this well
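The cleaning step before a join can be as simple as reducing each name to a shared key. The Python sketch below assumes names start with "Genus species" and ignores hybrids and infraspecific ranks; the rows are made up for illustration and are not real USFIA or MOBOT data.

```python
def binomial_key(name):
    """Reduce a scientific name to a lowercase 'genus species' join key."""
    tokens = name.split()
    if len(tokens) < 2:
        return None
    return f"{tokens[0]} {tokens[1]}".lower()


def join_on_binomial(left, right):
    """Pair each left row with every right row sharing the same binomial key."""
    index = {}
    for row in right:
        key = binomial_key(row["name"])
        if key is not None:
            index.setdefault(key, []).append(row)
    pairs = []
    for row in left:
        key = binomial_key(row["name"])
        for match in index.get(key, []):
            pairs.append((row, match))
    return pairs


# Made-up rows for illustration only.
plot_rows = [{"name": "Pinus strobus", "stems_per_plot": 12}]
specimen_rows = [{"name": "Pinus strobus L.", "specimen_id": "A-001"}]
pairs = join_on_binomial(plot_rows, specimen_rows)
```

Matching keys only solves record connecting. The semantic problem remains: a stem count per plot and a single specimen occurrence are different kinds of measurement, and no key can reconcile them automatically.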


How to assemble data

• Three approaches:
  – Semantic web – automated assembly over network
  – Standards – agree on database format, individuals contribute
    • Bottom‐up (standards committees)
    • Top‐down
    • Only works for relatively uniform data
  – Datawarehousing – specific people pull it together and maintain

[Figure labels: Cost ($), Effectiveness]


Two laws of scrubbing & joining

• Gartner’s law #1 – expect it to be 70% of your work

• McGill’s law #2 – expect ~50% of the data to be wrong


OLTP vs OLAP

• OLTP = OnLine Transaction Processing
  – The preeminent need in business (since 1960s)
  – Frequent entry of new facts, frequent recall of individual facts, infrequent sweeping analyses
  – Bank account transactions, phone call recording & billing, order entry, accounting
• OLAP = OnLine Analytical Processing
  – Growing need in business (since 1980s, term coined in 1993)
  – Analysis of information stored in OLTP systems
  – Increasing recognition that OLAP systems are not just a query function on OLTP
  – Datawarehousing, multidimensional databases


OLAP context

[Figure: OLTP database feeding an OLAP database through periodic, incremental updates]


Fear 3rd normal form

• All undergraduate computer scientists are told to “normalize” their database schema

• They tell all ecologists to do this
• Ecologists always nod their heads knowingly
• Except experienced database people know normalization is a trade‐off gradient: it is really good for OLTP and really bad for OLAP


OLAP Schema (Star/Snowflake)=Dimensional modelling

[Figure: star/snowflake schema]
• Fact table: Abundance, Date recorded
• Species: Species, Genus, Family, Authority, Body size, Functional group
• Plot: Latitude, Longitude, Elevation, Landcover
• Site: MAT, MAP, Biome
• Time
• Further tables: Family, Diet, Order
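A star schema of this kind can be tried out with Python's built-in `sqlite3` module. The sketch below is a minimal version under stated assumptions: it keeps only a fact table and two dimensions, the table and column names are chosen for this example, and all rows are made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one wide, denormalized row per species and per plot.
CREATE TABLE species (species_id INTEGER PRIMARY KEY, species TEXT, genus TEXT, family TEXT);
CREATE TABLE plot (plot_id INTEGER PRIMARY KEY, latitude REAL, longitude REAL, elevation REAL);

-- Fact table: one row per measurement, pointing at the dimensions.
CREATE TABLE abundance_fact (
    species_id INTEGER REFERENCES species(species_id),
    plot_id INTEGER REFERENCES plot(plot_id),
    date_recorded TEXT,
    abundance INTEGER
);

-- Made-up rows for illustration only.
INSERT INTO species VALUES (1, 'Pinus strobus', 'Pinus', 'Pinaceae'),
                           (2, 'Quercus rubra', 'Quercus', 'Fagaceae'),
                           (3, 'Pinus resinosa', 'Pinus', 'Pinaceae');
INSERT INTO plot VALUES (1, 44.9, -68.7, 40.0), (2, 45.2, -69.1, 210.0);
INSERT INTO abundance_fact VALUES (1, 1, '2015-06-14', 12), (2, 1, '2015-06-14', 5),
                                  (3, 2, '2015-06-15', 7), (1, 2, '2015-06-15', 3);
""")

# A typical analytical query: one join per dimension, then aggregate.
rows = conn.execute("""
    SELECT s.family, SUM(f.abundance) AS total
    FROM abundance_fact AS f
    JOIN species AS s ON s.species_id = f.species_id
    GROUP BY s.family
    ORDER BY s.family
""").fetchall()
```

Keeping family on the species row, rather than in its own normalized table, is the denormalization trade-off: it costs some redundancy but keeps sweeping analyses to a single join per dimension.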


Kattge et al. 2011 (MEE) – TRY database


Multidimensional databases – the conceptual idea?

[Figure: two panels, each with Site and Time axes: biotic measurements and abiotic (environmental) measurements]


Spatial databases

• Spatial indexing on lat/lon
• SQL extensions
• Distance, area/length, point in polygon
• PostGIS

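-- Species whose geometry intersects the point 45°N, 100°W
-- (WKT points are written "longitude latitude")
SELECT binomial, body_size
FROM species
WHERE ST_Intersects(geom, ST_GeomFromText('POINT(-100 45)', 4326));

-- Each state paired with every species whose range intersects it
SELECT states.code, species.binomial
FROM states
JOIN species ON ST_Intersects(states.geom, species.range);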
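To show what a point-in-polygon test does underneath, here is a plain-Python ray-casting sketch. It is not how PostGIS implements `ST_Intersects`: it treats coordinates as planar, ignores spatial indexing and spherical geometry, and leaves points exactly on the boundary undefined. The function name and the example box are assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices, x = lon, y = lat."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical bounding box, vertices given as (longitude, latitude).
box = [(-105, 40), (-95, 40), (-95, 50), (-105, 50)]
hit = point_in_polygon(-100, 45, box)    # longitude first, then latitude
miss = point_in_polygon(45, -100, box)   # same numbers with the order swapped
```

The second call illustrates how easily a swapped longitude/latitude pair turns a valid point into one that matches nothing.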


Updating the database

• Key decision
  – Never (one‐time)
  – Snapshot (fixed points in time)
  – Continuous
• More work than you think
  – Expect a 50% reduction each time you update

[Figure labels: Amount of work, Benefit]


The fun part

• Lamanna et al 2014


Serving data to other scientists

• Pretty mapped screens are nice …

• But I want a subset of the data!

• Options
  – Downloadable dumps
  – Query & download
  – RESTful API
  – R API
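The "query & download" option amounts to running a filtered query and handing back a flat file. The Python sketch below shows that idea with the standard `sqlite3` and `csv` modules; the `occurrence` table, its columns, the function name, and the rows are all hypothetical.

```python
import csv
import io
import sqlite3


def export_subset(conn, family):
    """Return occurrence rows for one family as CSV text."""
    cursor = conn.execute(
        "SELECT species, latitude, longitude FROM occurrence "
        "WHERE family = ? ORDER BY species",
        (family,),
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
    return buffer.getvalue()


# Made-up table and rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrence (species TEXT, family TEXT, latitude REAL, longitude REAL)")
conn.executemany(
    "INSERT INTO occurrence VALUES (?, ?, ?, ?)",
    [
        ("Pinus strobus", "Pinaceae", 44.9, -68.7),
        ("Quercus rubra", "Fagaceae", 45.2, -69.1),
    ],
)
csv_text = export_subset(conn, "Pinaceae")
```

The same function could sit behind a web form or an API endpoint; passing the filter as a query parameter rather than pasting it into the SQL string keeps user input from being executed as SQL.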

Page 37: McGill ESA 2015 ecoinformatics - Amazon S3...The trend 0 200 400 600 800 1000 1200 1400 1600 1990 1995 2000 2005 2010 2015 Papers in ecology with 'database*' Why the rise of ecoinformatics?

McGill’s 9 easy steps of ecoinformatics

Collect Scrub Join Store Update Analyze

Manage

Staff

Fund

Page 38: McGill ESA 2015 ecoinformatics - Amazon S3...The trend 0 200 400 600 800 1000 1200 1400 1600 1990 1995 2000 2005 2010 2015 Papers in ecology with 'database*' Why the rise of ecoinformatics?

“Over the wall”

Scientists


Iterative prototyping

[Figure: iterative cycle linking Initial Analysis & Design, Software Development, and User Feedback]


This is now 1‐2 hours/week of your life


Software engineering

• Use cases
• Design reviews
• Version control
• Separate develop and test
• Automated testing
• Bug tracking software
• Steering committees


Students or not students

• Pros of having students do it
  – Valuable career skill
  – Need to train next generation
• Cons
  – Often years of work without papers or science
  – Often much time spent on training

• Computer scientists do cutting edge computer science, not boring data scrubbing & databases

• Consultants are an underconsidered alternative


Student training

• Not everybody is good at it. Not everybody wants to do it. Not everybody should do it. But the ones who do should be rewarded (or at least not penalized).
  – Paraphrasing Stephen Jackson
• Ecologists → computers or techies → ecology?
  – Both – but make sure the ecology is in there

• Every ecology student needs training as a consumer of databases (simple SQL)


Funding

• Good luck!

• More seriously
  – More grants need to build in data management costs (should NSF give free 15% supplements?)
  – NSF needs to figure out a way to fund maintaining infrastructure


Top community priorities

1. Better scrubbing tools
  – Taxonomy, space, time are pretty standard …
2. Developing capacity/training
3. Better spatio‐temporal tools
4. Imagery/raster tools
5. Broader exposure to more database types
6. Don’t reinvent the wheel – software best practices exist!
7. Keep the ecology in the driver’s seat