An Integrated and Comprehensive Data Mining System for Studying Environmental Impact of Nanomaterials: NEIMiner Nano Working Group Presentation 10/13/2011 Kaizhi Tang, Ph.D., David Mihalcik, Thomas Wavering, Roger Xu Intelligent Automation Inc Prof. Stacey Harper, OSU Sue Pan, SAIC Sponsor Agency: Dr. Jeff Steevens, Army ERDC


Page 1

An Integrated and Comprehensive Data Mining System for Studying Environmental Impact of Nanomaterials: NEIMiner

Nano Working Group Presentation

10/13/2011

Kaizhi Tang, Ph.D., David Mihalcik, Thomas Wavering, Roger Xu

Intelligent Automation Inc

Prof. Stacey Harper, OSU

Sue Pan, SAIC

Sponsor Agency: Dr. Jeff Steevens, Army ERDC

Page 2

Outline

Motivation and proposed approach

NEI modeling framework

Design of NEIMiner information system

NEIMiner

Page 3

Motivation and proposed approach of NEIMiner

NEED: To reduce the risk of nanomaterials (NM) in military use, NM environmental impact (NEI) analysis requires a comprehensive NEI modeling framework, a centralized NEI database, a powerful model discovery tool, and an integrated model composition strategy.

KEY COMPONENTS OF THE PROPOSED APPROACH
• Flexible data integration based on the ETL (Extract, Transform, Load) strategy of data warehousing
• Integrated and collaborative data management using a modern content management system
• Optimized data mining process that searches across many algorithms and parameters despite the heavy computational burden
• Flexible model composition based on a unified model abstraction, reusing FRAMES

DELIVERABLES
• Conceptual framework of NEI analysis
• Collaborative NEI information system with model discovery and composition capability

VALUE TO THE CUSTOMER / TRANSITION CUSTOMER
• Environmental impact estimation tool for nanomaterials
• Easy access to a large amount of NEI data in a centralized data warehouse and to the available model generation tool
• Potentially useful evaluation models of NEI

Page 4

Scope of NEI Modeling

[Figure: scope diagram; recoverable labels: Collaboratory of Structural Nanobiology, NEI Data, NEI Data Mining Models]

Page 5

NEIMiner System Architecture

[Figure: architecture diagram; recoverable labels: NEI Data, NEI Data Mining Models]

Page 6

Available NEI Data and Schemas

Nanomaterial-Biological Interactions Knowledgebase
– http://nbi.oregonstate.edu/

Cancer Nanotechnology Laboratory portal (caNanoLab)
– NCI, https://cananolab.nci.nih.gov/caNanoLab/

ICON: International Council on Nanotechnology
– Rice University, http://icon.rice.edu

Nano-Tab
– Tab-delimited spreadsheet format based on EBI and ISA-TAB

NanoParticle Ontology (NPO)
– Implemented in OWL

Most complete characterization capture

Largest number of publications, limited characterization capture

Wide range of characterization and health impact data


Page 7

Other Data and Schemas

OECD Database on Research into Safety of Manufactured Nanomaterials
– http://webnet.oecd.org

National Institute for Occupational Safety and Health (NIOSH)
– http://www.cdc.gov/niosh/topics/nanotech/NIL.html

SAFENANO - Institute of Occupational Health (UK)
– http://www.safenano.org/AdvancedSearch.aspx

University of Wisconsin - Madison: Nanoscale Science and Engineering Center
– http://www.nanoceo.net/nanorisks

National Reference Center for Bioethics Literature - Georgetown University, Kennedy Institute of Ethics
– http://bioethics.georgetown.edu/

Nanomedicine Research Portal
– http://www.nano-biology.net/

Center on Nanotechnology and Society (Chicago-Kent College of Law in the Illinois Institute of Technology)
– http://www.nano-and-society.org/

Page 8

Data Extraction Methods

Data extraction via web services
– Example: caNanoLab

Data extraction via web scraping
– Examples: ICON, NBI
– Approaches
  • Human copy-and-paste
  • HTTP programming
  • Text grepping and regular expression matching
  • HTML parsers
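The last two scraping approaches can be illustrated with a minimal Python sketch. It is not the NEIMiner extractor: the HTML fragment, its table layout, and the names `CellCollector`, `parse_rows`, and `sizes_by_regex` are all hypothetical, and real ICON or NBI pages will be structured differently.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment with made-up values; not taken from ICON or NBI.
SAMPLE_HTML = """
<table id="records">
  <tr><td>Gold nanoparticle</td><td>15 nm</td></tr>
  <tr><td>Silver nanoparticle</td><td>40 nm</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """HTML-parser approach: collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # tolerate a cell that appears outside any row
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def parse_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = CellCollector()
    parser.feed(html)
    return parser.rows


def sizes_by_regex(html):
    """Regular-expression approach: pull every '<number> nm' value from raw text."""
    return [int(value) for value in re.findall(r"(\d+)\s*nm", html)]


rows = parse_rows(SAMPLE_HTML)
sizes = sizes_by_regex(SAMPLE_HTML)
```

The parser keeps the row structure, while the regular expression is shorter but loses the link between a size and its material, which is the usual trade-off between the two approaches.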

Page 9

Design philosophy of NEI data Warehouse

Data Warehouse
– Centralized data from multiple data sources for analysis
  => multiple nano risk related data sources with different formats
– Consists of an ETL tool, a database, a reporting tool, and data modeling
  => tools useful for NM data integration and mining
– Subject-oriented data organization
  => risk assessment for nanomaterials
– Multi-dimensional
  => various nanomaterial properties
– Star schema
  => extensible schema design
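To make the star schema idea concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names, column names, and all values are invented for illustration and are not the NEIMiner schema; the point is only that one central fact table references several dimension tables, so a new dimension can be added without restructuring existing ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: two dimension tables and one fact table (the "star").
conn.executescript("""
CREATE TABLE dim_material (
    material_id INTEGER PRIMARY KEY,
    name        TEXT,
    shape       TEXT
);
CREATE TABLE dim_source (
    source_id INTEGER PRIMARY KEY,
    name      TEXT
);
CREATE TABLE fact_measurement (
    material_id INTEGER REFERENCES dim_material(material_id),
    source_id   INTEGER REFERENCES dim_source(source_id),
    size_nm     REAL,
    mortality   REAL
);
""")

# Made-up rows standing in for the "Load" step of an ETL run.
conn.executemany("INSERT INTO dim_material VALUES (?, ?, ?)",
                 [(1, "gold", "sphere"), (2, "silver", "rod")])
conn.executemany("INSERT INTO dim_source VALUES (?, ?)",
                 [(1, "source_a"), (2, "source_b")])
conn.executemany("INSERT INTO fact_measurement VALUES (?, ?, ?, ?)",
                 [(1, 1, 10.0, 0.25), (1, 2, 20.0, 0.75), (2, 1, 40.0, 0.5)])


def mean_mortality_by_material(connection):
    """Aggregate the fact table along the material dimension."""
    query = """
        SELECT m.name, AVG(f.mortality)
        FROM fact_measurement AS f
        JOIN dim_material AS m USING (material_id)
        GROUP BY m.name
        ORDER BY m.name
    """
    return connection.execute(query).fetchall()


result = mean_mortality_by_material(conn)
```

Slicing by data source instead of material only requires joining `dim_source` in the same way, which is the multi-dimensional access pattern the slide refers to.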

Page 10

NEI Model Discovery

• Physical properties
  – Material type
  – Particle size distribution
  – PDI
  – Shape
  – Structure

• Chemical properties
  – Surface reactivity
  – Surface charge
  – Water solubility

• Exposure and study scenario
  – Duration
  – Continuity
  – Exposure route
  – Number of nanoparticles
  – Number of ligands

• Biological properties
  – Species, age, gender, weight

• Environmental ecosystem response
  – Fate and transport
  – Bioavailability and uptake
  – Biomagnification

• Biological response
  – Genomic response
  – Cell death

Correlation?

Prediction?

Page 11

Interesting Mining Problems and Solutions

How to handle missing data
– Median on numerical values
– Median-frequency categories
– Classification or regression using existing data
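The first two options can be sketched in a few lines of Python. This is an illustration under assumptions, not the NEIMiner code: the function name `impute` and the records are hypothetical, and the categorical rule used here is the most frequent category, which is one common choice and may differ from the "median-frequency" rule named on the slide. Model-based imputation (the third option) is not shown.

```python
from collections import Counter
from statistics import median


def impute(records, numeric_keys, categorical_keys):
    """Return copies of the records with None replaced by a per-column fill value.

    Numeric columns use the median of the observed values; categorical columns
    use the most frequent observed category (an assumed rule, see lead-in).
    """
    fill = {}
    for key in numeric_keys:
        fill[key] = median(r[key] for r in records if r[key] is not None)
    for key in categorical_keys:
        counts = Counter(r[key] for r in records if r[key] is not None)
        fill[key] = counts.most_common(1)[0][0]
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in records
    ]


# Made-up records with gaps in both a numeric and a categorical attribute.
records = [
    {"size_nm": 10.0, "shape": "sphere"},
    {"size_nm": None, "shape": "rod"},
    {"size_nm": 30.0, "shape": None},
    {"size_nm": 20.0, "shape": "sphere"},
]
filled = impute(records, ["size_nm"], ["shape"])
```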

How to determine attribute significance
– Compare gain ratio for classification
– Compare relief ratio for numerical prediction
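As a reference point for the classification case, the sketch below computes gain ratio in its standard form (information gain divided by split information) for a categorical attribute. The attribute values and class labels are made up, and the Relief-based measure for numerical prediction is not shown.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of `values` about `labels`, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Made-up example: the first attribute separates the classes, the second does not.
labels = ["toxic", "toxic", "safe", "safe"]
informative = gain_ratio(["a", "a", "b", "b"], labels)
uninformative = gain_ratio(["a", "b", "a", "b"], labels)
```

Ranking attributes by this score and keeping the top few is one simple way to compare their significance before training.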

How to select algorithms and their parameters for training
– Meta-optimization on algorithms and parameters
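The slide does not say which meta-optimization method is used. As a stand-in, the sketch below shows the simplest possible variant, an exhaustive grid search over algorithm names and parameter grids. The function `grid_search`, the search space, and the scoring function `toy_score` are all hypothetical; in practice the score would come from training and validating a real model.

```python
from itertools import product


def grid_search(search_space, evaluate):
    """Try every parameter combination of every algorithm and keep the best.

    search_space: {algorithm_name: {parameter_name: [candidate values]}}
    evaluate:     callable(name, params) -> score, where higher is better
    Returns (best_score, best_name, best_params).
    """
    best = None
    for name, grid in search_space.items():
        keys = sorted(grid)
        for combo in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, combo))
            score = evaluate(name, params)
            if best is None or score > best[0]:
                best = (score, name, params)
    return best


def toy_score(name, params):
    """Placeholder for a cross-validated model score (made-up formula)."""
    if name == "knn":
        return -abs(params["k"] - 5)
    return -abs(params["depth"] - 3) - 1


space = {"knn": {"k": [1, 3, 5, 7]}, "tree": {"depth": [2, 3, 4]}}
best_score, best_name, best_params = grid_search(space, toy_score)
```

The number of evaluations is the product of all grid sizes, which is where the computational burden comes from; smarter search strategies reduce that count but follow the same interface.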

How to split the data sets for high-quality models
– Comparing various splitting strategies
– Clustering as a preprocessing step
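Two common splitting strategies that could be compared are a single hold-out split and k-fold cross-validation. The sketch below generates index sets for both; the function names, the default fractions, and the seeds are assumptions chosen for illustration, and clustering-based splitting is not shown.

```python
import random


def holdout_split(n, test_fraction=0.3, seed=0):
    """Shuffle indices 0..n-1 and cut once into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * (1 - test_fraction)))
    return indices[:cut], indices[cut:]


def kfold_splits(n, k=5, seed=0):
    """Return k (train, test) pairs; every index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((sorted(train), sorted(test)))
    return splits


train, test = holdout_split(10)
splits = kfold_splits(10, k=5)
```

Training the same model on each strategy's splits and comparing the validation scores gives a direct, if basic, way to see how sensitive model quality is to the split.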

Page 12

Demonstration of NEIMiner
