Solving informatics challenges to advance plant ecology: a vision for the next 100 years
Brian McGill
ESA Baltimore August 14, 2015
First point
5‐10
Where I’m coming from
• 1990s: work in the computer consulting world doing data warehousing (building databases for business analytics)
• 2000s: analyze BBS and other large datasets
• 2010s: part of numerous ecoinformatics projects
– BIEN – Botanical Information and Ecology Network
  • 3 iterations
  • 20 million records
– ETE – Evolution of Terrestrial Ecosystems
– E&O – Environment and Organisms
– Consultation with lots of others
The trend
[Figure: papers in ecology with 'database*', by year; x‐axis 1990 to 2015, y‐axis 0 to 1600]
Why the rise of ecoinformatics? ‐ push
The capacity of hard disks doubles every 23 months
Soberón & Peterson (2004)
Data growing >exponentially
Why the rise of ecoinformatics? ‐ pull
• Conservation & policy
– Large scales
– Long time series
– Inventory mentality
• Basic ecology
– Macroecology
– Ecosystems ecology
• Global change
The data
[Figure: two Site × Time tables, one of biotic measurements and one of abiotic (environmental) measurements]
McGill’s 9 easy steps of ecoinformatics
Collect Scrub Join Store Update Analyze
Manage
Staff
Fund
The real order
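[Figure: the step diagram labeled with the order in which the steps really happen; labels read 1a, 2, 3, 5, 4, 6, 7, 8, 1b, 9, 10]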
Amount of work
[Figure: the step diagram (Collect, Scrub, Join, Store, Update, Analyze, Manage, Staff, Fund) redrawn to show the amount of work in each step]
Gartner Group: 70% of data warehousing is in data preparation
Collecting data sources
• Some are open and on the web
• Rest is a complex social process
– Use evolution of sociality
  1. Compel/punish (journal, funder requirements)
  2. Trust/reputation (small homogeneous group)
  3. Joint fate (group selection)
• Write out an agreement in advance!
• Most ecoinformatics projects evolve towards openness
Gartner Group: 70% of data warehousing is in data preparation
4 dimensions
• Values
– 100 cm of rainfall yesterday
– 0 for NA
– 1.00 vs 10.0 (transcription errors)
– Instrument errors
– Data filling?
• Space
– Geocoding (Convention center, Baltimore → 39.2883N, 76.6181W)
– Geoscrubbing
  • 42,100 for North America
  • 100, 42 for North America
  • 0,0
  • State centers
• Time
– Best tools, but amazing how often 6/14/2015 vs 2015/6/14
• Taxonomy
– Misspellings
– Synonymy
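The value and time checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the missing-data markers, the 50 cm daily rainfall ceiling, and the function names are all hypothetical, and a "0 for NA" cannot be caught by any rule like this because a true zero looks identical.

```python
import math
from datetime import datetime

# Assumed markers for "no data"; each real source has its own.
NA_MARKERS = {"", "NA", "-999", "-9999"}

def scrub_rainfall_cm(raw, max_daily_cm=50.0):
    """Return daily rainfall in cm, or None if missing, unparseable, or implausible."""
    text = raw.strip()
    if text in NA_MARKERS:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    if math.isnan(value) or value < 0 or value > max_daily_cm:
        return None  # implausible: hold back for review instead of keeping it
    return value

def _try_parse(raw, fmt):
    """Parse raw with one format, returning None instead of raising."""
    try:
        return datetime.strptime(raw, fmt).date()
    except ValueError:
        return None

def parse_date(raw):
    """Parse a date string; return None if it is unrecognized or ambiguous."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
        parsed = _try_parse(raw, fmt)
        if parsed is not None:
            return parsed
    month_first = _try_parse(raw, "%m/%d/%Y")
    day_first = _try_parse(raw, "%d/%m/%Y")
    if month_first and day_first and month_first != day_first:
        return None  # e.g. 6/7/2015 could be June 7 or 7 June
    return month_first or day_first
```

Returning None rather than guessing keeps doubtful records out of the analysis until someone has looked at them; 6/14/2015 parses cleanly only because 14 cannot be a month.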
Synonyms and errors
Soberón & Peterson (2004)
Taxonomic Scrubbing in BIEN
• 2.5M records 600,000 “species” in New World!
• 600,000 names → 300,000 standardized names after synonymy and misspelling (fuzzy matching)
• TNRS service (Boyle et al. 2013)
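To make the idea of fuzzy matching plus synonymy concrete, here is a minimal Python sketch using the standard library's difflib. It is not how TNRS works internally; the reference list, the synonym entry, and the 0.85 cutoff are hypothetical values chosen for illustration and have not been checked against any taxonomic authority.

```python
import difflib

# Hypothetical reference data; a real project would query a name service.
ACCEPTED = ["Pinus strobus", "Quercus rubra", "Acer saccharum"]
SYNONYMS = {"Strobus strobus": "Pinus strobus"}  # illustrative entry only

def standardize(name, cutoff=0.85):
    """Map a raw name to an accepted name, or None if nothing is close enough."""
    # Collapse whitespace and normalize case: "pinus  Strobus" -> "Pinus strobus".
    cleaned = " ".join(name.split()).capitalize()
    candidates = ACCEPTED + list(SYNONYMS)
    matches = difflib.get_close_matches(cleaned, candidates, n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    # Resolve a matched synonym to its accepted name.
    return SYNONYMS.get(best, best)

print(standardize("Pinus strobis"))     # misspelling
print(standardize("strobus  strobus"))  # synonym with messy spacing and case
print(standardize("Zzz yyy"))           # no plausible match
```

The cutoff is the main design choice: set it too low and distinct species with similar names get merged, set it too high and ordinary misspellings go unmatched.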
Geoscrubbing ‐ Synonymy
• MEX: ISO 3166‐1 alpha‐3
• MX: ISO 3166‐1 alpha‐2
• Mexico: “official” geonames.org (and gadm.org) name
• MEXICO: capitalization insensitive
• México: geonames.org alternate name
• MÃ©xico: recognizable misencoding of México
• M&eacute;xico: translatable HTML character code
• Mexi: not matched
• 439 country “names”
– 62 (14%) unrecognizably misspelled
– 377 recognized → 193 recognized countries (49% synonyms)
• 43% canonical names
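The kinds of country-name variants listed above (codes, capitalization, accents, misencoding, HTML character codes) can each be undone by a small normalization step before lookup. The sketch below is one way to do that in Python; the lookup table and function names are hypothetical, and the misencoding repair assumes the common case of UTF‐8 text that was wrongly read as Latin‐1.

```python
import html
import unicodedata

# Hypothetical lookup from normalized key to canonical name.
CANONICAL = {"mex": "Mexico", "mx": "Mexico", "mexico": "Mexico"}

def fix_mojibake(text):
    """Undo UTF-8 bytes wrongly decoded as Latin-1; leave other text alone."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

def strip_accents(text):
    """Drop combining marks so that 'México' and 'Mexico' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def country_key(raw):
    """Normalize a raw country string into a lookup key."""
    text = html.unescape(raw.strip())  # M&eacute;xico -> México
    text = fix_mojibake(text)          # MÃ©xico -> México
    return strip_accents(text).casefold()

def canonical_country(raw):
    """Return the canonical country name, or None if the string is not recognized."""
    return CANONICAL.get(country_key(raw))
```

A truncated string such as "Mexi" still returns None here, which is the safer outcome: unmatched names go to a person rather than being guessed.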
GeoScrubbing in BIEN
• ~1/3 of records had no lat/lon (or 0/0)
• >1/2 had + longitude
• About 1% had other obvious lat/lon errors
• 15% not in the right country
• 25% not in the right state/province
• 67% × 85% × 75% ≈ 43% correct!
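The simplest of these coordinate checks need no map data at all. The following Python sketch flags missing, 0/0, out‐of‐range, and positive‐longitude records, then multiplies the three pass rates from the list above (67%, 85%, 75%). The function name is hypothetical, and the positive‐longitude rule assumes every record belongs in the western hemisphere, as in a New World dataset; country and state checks would need boundary polygons and are not shown.

```python
def check_new_world_point(lat, lon):
    """Return a list of problems for one record; an empty list means it passed."""
    if lat is None or lon is None:
        return ["missing coordinates"]
    if lat == 0 and lon == 0:
        return ["0,0 placeholder"]
    problems = []
    if not -90 <= lat <= 90:
        problems.append("latitude out of range (lat/lon swapped?)")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")
    elif lon > 0:
        # Assumption: every record should be in the western hemisphere.
        problems.append("positive longitude (missing minus sign?)")
    return problems

# Share of records passing all three filters, treating the pass rates as independent.
fraction_correct = 0.67 * 0.85 * 0.75
print(f"{fraction_correct:.1%}")  # 42.7%
```

Multiplying the rates treats the three kinds of error as independent, which is itself an assumption; if bad states cluster in records that already have the wrong country, the true share of clean records would be somewhat higher.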
Joining data
• Cleaning & synonyms
– Pinus strobus in USFIA vs Pinus strobus L. in MOBOT
• Semantic joining
– USFIA has # stems per 0.04 ha plot
– MOBOT has a specimen card/occurrence
• Record connecting
– The easy part – databases do this well
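A small Python sketch may help separate the three problems above. The record layouts are hypothetical stand‐ins for a plot source and a specimen source, and reducing stem counts to simple presence is just one possible semantic reconciliation, chosen here for illustration rather than taken from any project.

```python
def binomial(name):
    """Keep genus and species epithet, dropping an authority such as 'L.'."""
    return " ".join(name.split()[:2])

# Hypothetical records shaped like the two kinds of source described above.
plot_rows = [{"species": "Pinus strobus", "stems_per_plot": 12}]
specimen_rows = [{"species": "Pinus strobus L.", "lat": 39.29, "lon": -76.62}]

def to_occurrences(plots, specimens):
    """Reduce both sources to a common meaning: one presence record per row."""
    occurrences = []
    for row in plots:
        if row["stems_per_plot"] > 0:  # a count only implies presence when positive
            occurrences.append({"species": binomial(row["species"]), "source": "plot"})
    for row in specimens:
        occurrences.append({"species": binomial(row["species"]), "source": "specimen"})
    return occurrences

joined = to_occurrences(plot_rows, specimen_rows)
print(sorted({rec["species"] for rec in joined}))
```

Once names are cleaned and both sources mean the same thing, matching records on the species key is the easy step; the information lost along the way (here, abundance) is the real cost of a semantic join.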
How to assemble data
• Three approaches:
– Semantic web – automated assembly over network
– Standards – agree on database format, individuals contribute
  • Bottom‐up (standards committees)
  • Top‐down
  • Only works for relatively uniform data
– Data warehousing – specific people pull it together and maintain
[Figure: the three approaches compared on cost ($) and effectiveness]
Two laws of scrubbing & joining
• Gartner’s law #1 – expect it to be 70% of your work
• McGill’s law #2 – expect ~50% of the data to be wrong
OLTP vs OLAP
• OLTP = OnLine Transaction Processing
– The preeminent need in business (since 1960s)
– Frequent entry of new facts, frequent recall of individual facts, infrequent sweeping analyses
– Bank account transactions, phone call recording & billing, order entry, accounting
• OLAP = OnLine Analytical Processing
– Growing need in business (since 1980s, term coined in 1993)
– Analysis of information stored in OLTP systems
– Increasing recognition that OLAP systems are not just a query function on OLTP
– Data warehousing, multidimensional databases
OLAP context
[Figure: an OLTP database feeding an OLAP database through periodic, incremental updates]
Fear 3rd normal form
• All undergraduate computer scientists are told to “normalize” their database schema
• They tell all ecologists to do this
• Ecologists always nod their heads knowingly
• Except experienced database people know normalization is a trade‐off: really good for OLTP and really bad for OLAP
OLAP Schema (Star/Snowflake)=Dimensional modelling
• Fact table: Abundance, Date recorded
• Dimension tables
– Species: Species, Genus, Family, Authority, Body size, Functional group
– Plot: Latitude, Longitude, Elevation, Landcover
– Site: MAT, MAP, Biome
– Time
• Further (snowflake) tables: Family, Diet, Order
Kattge et al. 2011 (MEE) – TRY database
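A star schema is easy to try out with Python's built‐in sqlite3 module. The sketch below builds a tiny fact table with two of the dimension tables named above and runs one analytical query; the table names, column names, and data values are invented for illustration and are not taken from BIEN or TRY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, deliberately denormalized
-- (family is stored on each species row instead of in its own table).
CREATE TABLE species (species_id INTEGER PRIMARY KEY, binomial TEXT, genus TEXT, family TEXT);
CREATE TABLE plot (plot_id INTEGER PRIMARY KEY, latitude REAL, longitude REAL, elevation REAL);

-- Fact table: one row per measurement, pointing at the dimensions.
CREATE TABLE abundance_fact (
    species_id INTEGER REFERENCES species(species_id),
    plot_id INTEGER REFERENCES plot(plot_id),
    date_recorded TEXT,
    abundance INTEGER
);

INSERT INTO species VALUES (1, 'Pinus strobus', 'Pinus', 'Pinaceae'),
                           (2, 'Quercus rubra', 'Quercus', 'Fagaceae');
INSERT INTO plot VALUES (1, 39.29, -76.62, 10.0), (2, 44.90, -68.67, 40.0);
INSERT INTO abundance_fact VALUES (1, 1, '2015-08-14', 12),
                                  (2, 1, '2015-08-14', 3),
                                  (1, 2, '2015-08-14', 7);
""")

# A typical OLAP-style question: total abundance per family, one join away.
rows = conn.execute("""
    SELECT s.family, SUM(f.abundance)
    FROM abundance_fact AS f
    JOIN species AS s ON s.species_id = f.species_id
    GROUP BY s.family
    ORDER BY s.family
""").fetchall()
print(rows)  # [('Fagaceae', 3), ('Pinaceae', 19)]
```

Storing genus and family redundantly on each species row breaks third normal form, but every sweeping query then needs only one join per dimension, which is the trade‐off that favors this layout for analysis.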
Multidimensional Databases – the conceptual idea?
[Figure: two Site × Time tables, one of biotic measurements and one of abiotic (environmental) measurements]
Spatial databases
• Spatial indexing on lat/lon
• SQL extensions
• Distance, area/length, point in polygon
• PostGIS
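-- Species whose geometry covers latitude 45, longitude -100 (WKT points are written longitude first)
SELECT binomial, body_size FROM species WHERE ST_Intersects(geom, ST_GeomFromText('POINT(-100 45)', 4326));
-- Each state paired with every species whose range overlaps it
SELECT states.code, species.binomial FROM states JOIN species ON ST_Intersects(states.geom, species.range);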
Updating the database
• Key decision
– Never (one‐time)
– Snapshot (fixed points in time)
– Continuous
• More work than you think
– Expect a 50% reduction each time you update
[Figure: amount of work vs. benefit]
The fun part
• Lamanna et al 2014
Serving data to other scientists
• Pretty mapped screens are nice …
• But I want a subset of the data!
• Options
– Downloadable dumps
– Query & download
– RESTful API
– R API
“Over the wall”
Scientists
Iterative prototyping
[Figure: a cycle linking Initial Analysis & Design, Software Development, and User Feedback]
This is now 1‐2 hours/week of your life
Software engineering
• Use cases
• Design reviews
• Version control
• Separate develop and test
• Automated testing
• Bug tracking software
• Steering committees
Students or not students
• Pros of having students do it
– Valuable career skill
– Need to train next generation
• Cons
– Often years of work without papers or science
– Often much time spent on training
• Computer scientists do cutting edge computer science, not boring data scrubbing & databases
• Consultants are an underconsidered alternative
Student training
• Not everybody is good at it. Not everybody wants to do it. Not everybody should do it. But the ones who do should be rewarded (or at least not penalized).
– Paraphrasing Stephen Jackson
• Ecologists → computers or techies → ecology?
– Both – but make sure the ecology is in there
• Every ecology student needs training as a consumer of databases (simple SQL)
Funding
• Good luck!
• More seriously
– More grants need to build in data management costs (should NSF give free 15% supplements?)
– NSF needs to figure out a way to fund maintaining infrastructure
Top community priorities
1. Better scrubbing tools
– Taxonomy, space, time are pretty standard …
2. Developing capacity/training
3. Better spatio‐temporal tools
4. Imagery/raster tools
5. Broader exposure to more database types
6. Don’t reinvent the wheel – software best practices exist!
7. Keep the ecology in the driver’s seat