Pat Langley
School of Computing and Informatics, Arizona State University, Tempe, Arizona
Institute for the Study of Learning and Expertise, Palo Alto, California
Challenges for the Computational Discovery of Scientific Knowledge
Thanks to K. Arrigo, D. Billman, M. Bravo, S. Borrett, W. Bridewell, S. Dzeroski, and L. Todorovski for their contributions to this research, which is funded by a grant from the National Science Foundation.
Drawbacks of Scientific Data Mining
Because it borrows from work on commercial applications, most work on scientific data mining:
- generates models in forms inappropriate to most sciences
- makes incorrect assumptions about the available inputs
- focuses on convenient algorithmic issues, not scientists’ needs
We need to redirect attention toward a broader range of discovery tasks that actually arise in scientific fields.
Data-mining researchers would benefit from looking at the older literature on computational scientific discovery.
Claim 1: Scientific Notations
NPPc = Σ_month max (E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt – 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M   if Tempc > 0
PET = 0   if Tempc < 0
A = 0.00000068 · AHI^3 – 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min [(SR-FAS – 1.08) / SR (UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI – 1000)
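To make the equational notation concrete, the following minimal Python sketch evaluates the E, T1, T2, W, PET, and A equations as they are printed on the slide. The function names and argument names are illustrative assumptions, not part of the original model code. Two further assumptions are labeled in the comments: NPPc is read as a sum over months, and PET is taken to be 0 at exactly Tempc = 0, a case the slide leaves undefined. The IPAR branch is omitted, so IPAR is supplied as an input.

```python
import math


def ahi_exponent(ahi):
    """A = 0.00000068*AHI^3 - 0.000077*AHI^2 + 0.018*AHI + 0.49."""
    return 0.00000068 * ahi**3 - 0.000077 * ahi**2 + 0.018 * ahi + 0.49


def pet(tempc, ahi, pet_tw_m):
    """PET as printed on the slide.

    Assumption: the slide defines PET only for Tempc > 0 and Tempc < 0;
    this sketch returns 0 for Tempc == 0 as well.
    """
    if tempc <= 0:
        return 0.0
    return 1.6 * (10 * tempc / ahi) ** ahi_exponent(ahi) * pet_tw_m


def efficiency(topt, tempc, eet, pet_value):
    """E = 0.56 * T1 * T2 * W; pet_value must be nonzero."""
    t1 = 0.8 + 0.02 * topt - 0.0005 * topt**2
    t2 = 1.18 / (
        (1 + math.exp(0.2 * (topt - tempc - 10)))
        * (1 + math.exp(0.3 * (tempc - topt - 10)))
    )
    w = 0.5 + 0.5 * eet / pet_value
    return 0.56 * t1 * t2 * w


def nppc(monthly_e, monthly_ipar):
    """Assumption: NPPc sums max(E * IPAR, 0) over the months supplied."""
    return sum(max(e * ipar, 0) for e, ipar in zip(monthly_e, monthly_ipar))
```

Each function corresponds to one or two lines of the model, which is the point of the claim: a domain scientist can check the code against the equations term by term.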
[Figure: gene regulation model, a diagram whose nodes Light, DFR, NBLA, NBLR, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB are connected by links labeled + or –]
Traditional data-mining notations are not easily understood by or communicated to domain scientists.
Most sciences state and communicate models in formalisms they have used for decades.
We need more work on discovering scientific knowledge cast in communicable forms (Dzeroski & Todorovski, 2007).
[Panel titles: Ecosystem model; Gene regulation model]
Claim 2: Background Knowledge
Scientists often have initial knowledge that should influence the discovery process.
Ignoring this knowledge can produce models that scientists reject as nonsensical (Pazzani et al., 2001).
[Figure: an initial gene regulation model and a set of observations are input to Model Revision, which outputs a revised model. Both models are diagrams over Light, DFR, NBLA, NBLR, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB with links labeled + or –; × marks appear on the revised model.]
Claim 3: Small Data Sets
Most data-mining work assumes that large data sets are available.
But in many scientific domains, data are rare and hard to obtain.
Discovering scientific knowledge from small data sets raises an entirely different set of challenges (Lee et al., 1998).
We need more research on this important aspect of discovery.
Ecosystem model
  Number of variables        8
  Number of equations       11
  Number of parameters      20
  Number of samples        303

Gene regulation model
  Number of variables        9
  Number of initial links   11
  Number of possible links  70
  Number of samples         20
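The counts above can be turned into a rough measure of how scarce the data are. The short Python sketch below divides the number of samples by the number of quantities to be determined. Treating the 20 parameters and the 70 possible links as the "unknowns" is an assumption made here for illustration, not a measure proposed in the talk.

```python
# Counts copied from the two example models on the slide.
# Assumption: parameters (ecosystem) and possible links (gene regulation)
# are treated as the unknowns that the data must constrain.
models = {
    "ecosystem": {"unknowns": 20, "samples": 303},
    "gene regulation": {"unknowns": 70, "samples": 20},
}


def samples_per_unknown(samples, unknowns):
    """Return how many samples are available per unknown quantity."""
    return samples / unknowns


ratios = {
    name: samples_per_unknown(m["samples"], m["unknowns"])
    for name, m in models.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} samples per unknown")
# ecosystem: 15.15 samples per unknown
# gene regulation: 0.29 samples per unknown
```

Under this reading, the gene regulation task has fewer samples than candidate links, which illustrates why methods designed for large data sets transfer poorly.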
Claim 4: Scientific Explanation
Most work on data mining finds models that, although accurate, merely describe the observations.
However, scientists often want models that explain their data using familiar concepts.
Explanatory models can include theoretical entities and processes that link back to domain knowledge (Langley et al., 2002).
[Figure: ecosystem model, a diagram over NPPc, IPAR, PET, T1, T2, W, e_max, E, EET, Tempc, Topt, NDVI, SOLAR, AHI, A, PETTWM, SR, FPAR, and VEG]
[Figure: gene regulation model, a diagram over Light, DFR, NBLA, NBLR, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB with links labeled + or –]
Claim 5: Interactive Discovery
Most data-mining work focuses on entirely automated algorithms.
But most scientists want computational aids rather than systems that would replace them.
We need more work on interactive discovery (Bridewell et al., 2007).
[Figure: The PROMETHEUS System (Bridewell et al., 2007). An initial gene regulation model and a set of observations are input to Model Revision, which outputs a revised model; a domain user also appears in the diagram. × marks appear on the revised model.]