AIRFrame - hawaii.edu

AIRFrame: Astrobiology Integrative Research Framework

Lisa Miller and Rich GazanDepartment of Information and Computer SciencesUH-NASA Astrobiology [email protected], [email protected]

September 13, 2010

mailto:[email protected]

mailto:[email protected]

Overview● AIRFrame project rationale● AIRFrame project goals● Current project status

– Textpresso system● Ontology● Database

● Current and future work direction– Clustering/Classification

– Visual representations of data

– Graphical interface

Project RationaleNature of Astrobiology● System-level science

– Concerned with complex, multidisciplinary, multi-phenomena behaviors of large physical and biological systems

– Information technology needed to consolidate and represent knowledge and data across many disciplines

● New field– No centralized repositories of knowledge

– No established, standardized vocabulary

Project RationaleInterdisciplinary Collaboration● Proven to be difficult

– Different disciplines have different:● Vocabularies/terminologies● Methods and formats for sharing and

presenting research● Assumed levels of precision

– Institutional and cultural boundaries

● Concern about duplication of research– Lack of access to existing knowledge

– Lack of discipline-specific knowledge to efficiently access what is available

A computer sees only your footprints: published research, websites, graphic records…

● Inputs: words, data, images…

● Processes: algorithms, categorizations, associations…

● Outputs: representations

Probabilistic, not deterministic

One way a computer sees NASA Astrobiology on the Web…

● Watch it happen: MIT Personas project– http://personas.media.mit.edu/personasWeb.html

● Inputs: words from Web documents● Processes: harvest words, associate them with

color-coded broader concepts● Outputs: representation of a person,

organization etc. by its proportion of constituent components

Personas

Another way computers see astrobiology…

● As of April 2010, under 10K scholarly publications indexed with astrobiology and related terms across major scholarly databases

– NASA Astrophysics Data System (ADS): 3717– Web of Science: 1092– Meteorological & Geoastrophysical Abstracts: 496– ScienceDirect: 3638

Astrobiology: how interdisciplinary?

● One (research) question

● Multiple scientists, datasets, disciplines, literatures, vocabularies…and ways to measure interdisciplinarity

Astrobiology publication profile

● Subject area of most current 1000 pubs indexed with “astrobiology” (ISI Web of Science)

Vocabulary problem● Same concept, different

words

● New fields like astrobiology have low query term overlap (QTO) until authors, indexers, journals, mapping algorithms etc. ‘catch up’

● QTO: ratio of docs findable with a given query term to the number of docs on the topic

Web of Science, April 2010

Keyword searches on Elsevier's ScienceDirect using some astrobiologically relevant synonyms

Keywords Number of results An article found only with this search

“prebiotic synthesis” 566 The origin of life – a review of facts and speculations

“prebiotic chemistry” 611 A possible circular RNA at the origin of life

“abiotic synthesis” 312 Recent advances in the chemical evolution and the origins of life

“organic synthesis” 53,434 Prebiotic organic synthesis in early Earth and Mars atmospheres

Keyword Searching Inadequate without a well-defined vocabulary

Ontology● Any arbitrary knowledge structure

● Which elements are included and excluded?

● How are the elements represented and related?

NASA ADS: over-associated ontology

Web of Science: broad ontology

ScienceDirect: narrow ontology

AIRFrame Project Goals

● Clustering not cloistering: – Discover and relate diverse research

concepts as a high level activity

● Eliminate the need to search for data using combinations of specific keywords

● Show users conceptual and functional relationships between diverse research documents

Astrobiology Integrative Research Framework

Textpresso-based ● Open-source information retrieval and extraction

system

– Developed at CalTech for biological researchers

– Deployed in 17 different focused literatures

● Two major components

– Ontology cataloging types of objects, abstract concepts, and relationships

– Database of documents marked up with XML based on the ontology

Current statusProof-of-concept system in place

Textpresso Ontology

● A shallow ontology catalogs terms

ConceptsConcepts DescriptionsDescriptions RelationshipsRelationships

Purpose

●Fulfill●Make●Govern●Produce● ...

NucleicAcid

●Adenine●Cytosine●DNA●Guanine● ...

Comparison

●Dissimilar●Equal sized●Like●Related● ...

Textpresso Database:Articles marked up with XML

All Documents

Single Document

Abstract

Body Text

Authors

Title

Individual sentencesWe report the synthesis of glycine on interstellar iceanalog films composed of water,

methylamine (MA), and carbon dioxide ...

XML MarkupWe report the <biologicalprocess>synthesis</biologicalprocess> of <aminoacid>glycine</aminoacid>

on interstellar iceanalog films <action>composed</action> of <volatile><water>water</water></volatile>, methylamine ( MA ), and

<volatile><gases><chemicalelement>carbon</chemicalelement> dioxide</gases></volatile>

Ontology

Textpresso system query

Allows search by meaning rather than specific keyword

Search by:● keyword,

●keyword + synonyms, ●and/or whole categories

Ontology categories

System Query - Results

Results ranked by number of sentenceswith matching terms

Category 2 Keyword Category 1

Category 2Synonym

Textpresso adoption challenges

● Other implementations use preexisting ontologies

● Other implementations are narrowly focused

● Many biological journals have open full-text access through a single source

Textpresso adoption challenges● Current interface is rudimentary (at best)

- Ugly!- Not intuitive

- Results spread across pages --Hard to see overall picture

Current Work

● To allow more breadth to the ontology – Convert to SKOS standard developed by

Vocabulary Explorer at IVOA

– Adapt Textpresso to directly read SKOS ontology

● Standards based ontology can– Ease porting existing ontologies to ours

– Allow easy sharing of our data with other systems

Ontology Standards Adoption

Why SKOS?

● Allows fast addition of terms from existing sources such as International Astronomical Union Thesaurus and IVOA.

● Created to allow conversion from other formats

● SKOS allows the kind of relationships we want to leverage with AIRFrame

Database Building

Workflow

New articles AIRFrameontology

Mark updocuments

&index

Mine for newTerms

Outsideontologies

AIRFramesearchabledatabase

Document Classification

● Use machine learning methods to discover connections in the data

– Possibly use several to give users many views.

● Developing a classifier will allow a user to enter a document and get related ones back

● My research area is in information theoretic clustering

– Information Bottleneck Method

Information Bottleneck Method● Based on Shannon's Theory of Mutual

Information– Measures the reduction in uncertainty or

the distortion between an original signal and its compressed representation

– Where uncertainty about an entity is its entropy

I [ x , y ]=H [ x ]−H [ x∣y ]

H [ x ]=∑x

N

p x log 1 / p x

Information Bottleneck Method 2

● Data is compressed so that information about a quantity of interest (in this case words) is kept maximally.

● Relative entropy or K-L divergence emerges as the distortion function

● The optimal cluster assignment rule is a Boltzmann distribution

max p c∣x[ I y , c −TI x , c]

DKL[ p y∣x ∣∣p y∣c]=∑y

p y∣x log2 [ p y∣x p y∣c ]

pc∣x=pc

Z x ,T exp

−1TDKL [ p y∣x ∣∣p y∣c]

Information Bottleneck ClusteringPreliminary Results

Cluster number

Database clustered into 16 clustersBubble size = number of documents in cluster

0 2 4 6 8 10 12 14 16 18

0

20

40

60

80

100

120

140

160

Num DifferentJournals

Information Bottleneck ClusteringPreliminary Results

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Cluster Journal Disciplines

OtherCompute/ElectrionicsBIOPHYSICSChemAstronomy/AstropysicsGeo

Cluster Number

Where are we going with this?● Visualizations!

Future Plans● Create a cleaner, browsable interface which

displays results and links in an easily understandable way

● Allow users to input a document and get back information such as:

– Connections to other researchers

– Relevant NASA/NAI goals

– Related articles

How do we get there?● Continue work on automatic document

classification/clustering– Focus on top-level classification first

– Need at least an idea of how data fits into classes for visualization planning

– Look at relationships that arise from unsupervised methods and how to represent those.

● Continue to gather documents for the database

● Continue to build and improve the ontology

● Project is at a point where formal working guidelines can be put in place

– Investigate user interface requirements

– Build a set of use cases/scenarios

● Need usability studies for various visualization paradigms

– Need to discover where visualizations will/will not work in the system

● Hoping to collaborate with graduate level HCI class this semester

System Interface Design Process

Open Questions● How best to use our semantic markup to

present our data? ● How to visualize our information to allow

discovery and understanding of connections?

● How best to search for and represent● Category links between terms?● Authorship linkages between projects?● NASA/NAI goal links to research?

● Ontology building– Can it be at least partially automated?

Long-term goals

● Iteratively build/design interface with user participation

● Integrate fine-grained Textpresso search with high-level visualizations

● Finish standardization of ontology and offer searchable interface or visualization for it

● Integrate Steve Freeland's amino acid database

– mechanism partly in place now

Thank you!● For updates visit our site at:

– www.ifa.hawaii.edu/airframe

– “Proof-of-concept” AIRFrame/Textpresso is currently up and functioning

● www.ifa.hawaii.edu/airframe/textpresso

● Mahalo to:

– Justin Rajkowski & Kathryn Aiko Arinaga, LIS

– Liz Jayne, NAI REU participant

http://www.ifa.hawaii.edu/airframe