
Pat Langley School of Computing and Informatics Arizona State University Tempe, Arizona Institute for the Study of Learning and Expertise Palo Alto, California


Page 1:

Challenges for the Computational Discovery of Scientific Knowledge

Pat Langley
School of Computing and Informatics, Arizona State University, Tempe, Arizona
Institute for the Study of Learning and Expertise, Palo Alto, California

Thanks to K. Arrigo, D. Billman, M. Bravo, S. Borrett, W. Bridewell, S. Dzeroski, and L. Todorovski for their contributions to this research, which is funded by a grant from the National Science Foundation.

Page 2:

Drawbacks of Scientific Data Mining

Because it borrows from work on commercial applications, most work on scientific data mining:

- generates models in forms inappropriate to most sciences
- makes incorrect assumptions about the available inputs
- focuses on convenient algorithmic issues, not scientists’ needs

We need to redirect attention toward a broader range of discovery tasks that actually arise in scientific fields.

Data-mining researchers would benefit from looking at the older literature on computational scientific discovery.

Page 3:

Claim 1: Scientific Notations

Traditional data-mining notations are not easily understood by or communicated to domain scientists.

Most sciences state and communicate models in formalisms they have used for decades.

We need more work on discovering scientific knowledge cast in communicable forms (Dzeroski & Todorovski, 2007).

Ecosystem model:

NPPc = Σ_month max(E · IPAR, 0)

E = 0.56 · T1 · T2 · W

T1 = 0.8 + 0.02 · Topt – 0.0005 · Topt^2

T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]

W = 0.5 + 0.5 · EET / PET

PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M if Tempc > 0

PET = 0 if Tempc < 0

A = 0.00000068 · AHI^3 – 0.000077 · AHI^2 + 0.018 · AHI + 0.49

IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver

FPAR-FAS = min[(SR-FAS – 1.08) / SR(UMD-VEG), 0.95]

SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI – 1000)

Gene regulation model:

[Figure: gene regulation network linking Light, DFR, NBLA, NBLR, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with each link signed + or –.]
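The ecosystem model on this slide is a chain of algebraic equations, so it can be written down directly as code. The following is a minimal Python sketch, not the authors' implementation. The function and argument names are hypothetical, superscripts lost in the slide text are read as exponents (Topt^2, AHI^3, AHI^2, the power A, and the exponentials in T2), and two edge cases the slide leaves open are filled with labeled assumptions.

```python
import math


def t1(topt):
    """Temperature term T1 as a quadratic in the optimal temperature Topt."""
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2


def t2(topt, tempc):
    """Temperature term T2, a product of two logistic-style factors."""
    return 1.18 / (
        (1 + math.exp(0.2 * (topt - tempc - 10)))
        * (1 + math.exp(0.3 * (tempc - topt - 10)))
    )


def exponent_a(ahi):
    """Exponent A as a cubic polynomial in the annual heat index AHI."""
    return 0.00000068 * ahi ** 3 - 0.000077 * ahi ** 2 + 0.018 * ahi + 0.49


def pet(tempc, ahi, pet_tw_m):
    """Potential evapotranspiration PET (assumes AHI > 0 when Tempc > 0)."""
    # The slide defines PET only for Tempc > 0 and Tempc < 0.
    # Assumption: Tempc == 0 is treated like the Tempc < 0 case.
    if tempc <= 0:
        return 0.0
    return 1.6 * (10 * tempc / ahi) ** exponent_a(ahi) * pet_tw_m


def water_stress(eet, pet_value):
    """Water term W = 0.5 + 0.5 * EET / PET."""
    # Assumption: the slide does not say what W is when PET = 0;
    # this sketch drops the ratio term in that case.
    if pet_value == 0:
        return 0.5
    return 0.5 + 0.5 * eet / pet_value


def efficiency(topt, tempc, eet, pet_value):
    """Light-use efficiency E = 0.56 * T1 * T2 * W."""
    return 0.56 * t1(topt) * t2(topt, tempc) * water_stress(eet, pet_value)


def ipar(fpar_fas, monthly_solar, sol_conver):
    """IPAR = 0.5 * FPAR-FAS * Monthly-Solar * Sol-Conver."""
    return 0.5 * fpar_fas * monthly_solar * sol_conver


def nppc(monthly_e, monthly_ipar):
    """NPPc: sum over months of max(E * IPAR, 0)."""
    return sum(max(e * i, 0.0) for e, i in zip(monthly_e, monthly_ipar))
```

A call such as `efficiency(topt=20.0, tempc=20.0, eet=30.0, pet_value=60.0)` evaluates one month's E from illustrative inputs. The equations for FPAR-FAS and SR-FAS are omitted because SR(UMD-VEG) depends on a vegetation lookup the slide does not spell out.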

Page 4:

Claim 2: Background Knowledge

Scientists often have initial knowledge that should influence the discovery process.

Ignoring this knowledge can produce models that scientists reject as nonsensical (Pazzani et al., 2001).

[Figure: model revision takes an initial model and observations as input and produces a revised model. Both models are gene regulation networks over Light, DFR, NBLA, NBLR, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with links signed + or –; × marks appear on the revised model.]

Page 5:

Claim 3: Small Data Sets

Most data-mining work assumes that large data sets are available.

But in many scientific domains, data are rare and hard to obtain.

Discovering scientific knowledge from small data sets raises an entirely different set of challenges (Lee et al., 1998).

We need more research on this important aspect of discovery.

Ecosystem model
  Number of variables        8
  Number of equations       11
  Number of parameters      20
  Number of samples        303

Gene regulation model
  Number of variables        9
  Number of initial links   11
  Number of possible links  70
  Number of samples         20
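A rough calculation shows how thin these data sets are relative to what must be inferred. The sketch below assumes the gene regulation counts read as 9 variables, 11 initial links, 70 possible links, and 20 samples (the extracted digits are doubled). It also assumes, purely for illustration, that each parameter or possible link counts as one unknown and that each possible link is simply present or absent; neither assumption comes from the slides.

```python
def samples_per_unknown(n_samples, n_unknowns):
    """Ratio of observations to quantities the discovery system must settle."""
    return n_samples / n_unknowns


# Ecosystem model: 303 samples against 20 parameters.
ecosystem_ratio = samples_per_unknown(303, 20)

# Gene regulation model: 20 samples against 70 possible links.
gene_ratio = samples_per_unknown(20, 70)

# Assumption: every possible link is independently present or absent,
# ignoring link signs, so the structure space has 2**70 candidates.
candidate_structures = 2 ** 70

print(f"ecosystem samples per parameter: {ecosystem_ratio:.2f}")
print(f"gene regulation samples per possible link: {gene_ratio:.2f}")
print(f"candidate link structures: {candidate_structures:.2e}")
```

Under these assumptions the gene regulation task has fewer than one sample per possible link, which is the regime the slide argues standard data-mining methods do not address.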

Page 6:

Claim 4: Scientific Explanation

Most work on data mining finds models that, although accurate, merely describe the observations.

However, scientists often want models that explain their data using familiar concepts.

Explanatory models can include theoretical entities and processes that link back to domain knowledge (Langley et al., 2002).

[Figure: ecosystem model drawn as a dependency graph over NPPc, E, IPAR, e_max, T1, T2, W, PET, EET, Tempc, Topt, A, AHI, PETTWM, FPAR, SR, NDVI, VEG, and SOLAR.]

[Figure: gene regulation model, a network over Light, DFR, NBLA, NBLR, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with links signed + or –.]

Page 7:

Claim 5: Interactive Discovery

Most data-mining work focuses on entirely automated algorithms.

But most scientists want computational aids rather than systems that would replace them.

We need more work on interactive discovery (Bridewell et al., 2007).

[Figure: model revision takes an initial model and observations as input and produces a revised model, with a domain user taking part in the process. Both models are gene regulation networks over Light, DFR, NBLA, NBLR, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with links signed + or –; × marks appear on the revised model.]

Page 8:

The PROMETHEUS System (Bridewell et al., 2007)