AIRFrame: Astrobiology Integrative Research Framework
Lisa Miller and Rich Gazan
Department of Information and Computer Sciences
UH-NASA Astrobiology [email protected], [email protected]
September 13, 2010
Overview
● AIRFrame project rationale
● AIRFrame project goals
● Current project status
– Textpresso system
  ● Ontology
  ● Database
● Current and future work direction
– Clustering/Classification
– Visual representations of data
– Graphical interface
Project Rationale: Nature of Astrobiology
● System-level science
– Concerned with complex, multidisciplinary, multi-phenomena behaviors of large physical and biological systems
– Information technology needed to consolidate and represent knowledge and data across many disciplines
● New field
– No centralized repositories of knowledge
– No established, standardized vocabulary
Project Rationale: Interdisciplinary Collaboration
● Proven to be difficult
– Different disciplines have different:
  ● Vocabularies/terminologies
  ● Methods and formats for sharing and presenting research
  ● Assumed levels of precision
– Institutional and cultural boundaries
● Concern about duplication of research
– Lack of access to existing knowledge
– Lack of discipline-specific knowledge to efficiently access what is available
A computer sees only your footprints: published research, websites, graphic records…
● Inputs: words, data, images…
● Processes: algorithms, categorizations, associations…
● Outputs: representations
Probabilistic, not deterministic
One way a computer sees NASA Astrobiology on the Web…
● Watch it happen: MIT Personas project
– http://personas.media.mit.edu/personasWeb.html
● Inputs: words from Web documents
● Processes: harvest words, associate them with color-coded broader concepts
● Outputs: representation of a person, organization etc. by its proportion of constituent components
Personas
Another way computers see astrobiology…
● As of April 2010, under 10K scholarly publications indexed with astrobiology and related terms across major scholarly databases
– NASA Astrophysics Data System (ADS): 3717
– Web of Science: 1092
– Meteorological & Geoastrophysical Abstracts: 496
– ScienceDirect: 3638
Astrobiology: how interdisciplinary?
● One (research) question
● Multiple scientists, datasets, disciplines, literatures, vocabularies…and ways to measure interdisciplinarity
Astrobiology publication profile
● Subject areas of the 1,000 most recent publications indexed with “astrobiology” (ISI Web of Science)
Vocabulary problem
● Same concept, different words
● New fields like astrobiology have low query term overlap (QTO) until authors, indexers, journals, mapping algorithms etc. ‘catch up’
● QTO: ratio of docs findable with a given query term to the number of docs on the topic
Web of Science, April 2010
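The QTO definition above can be made concrete with a small calculation. The sketch below is illustrative only: the function name `query_term_overlap` and the four-abstract toy corpus are assumptions for this example, not AIRFrame data, and a simple substring test stands in for a real search engine.

```python
def query_term_overlap(docs_on_topic, term):
    """Fraction of on-topic documents that a single query term can find."""
    if not docs_on_topic:
        raise ValueError("need at least one on-topic document")
    term = term.lower()
    # A document counts as "findable" if the term occurs anywhere in its text.
    found = sum(1 for text in docs_on_topic if term in text.lower())
    return found / len(docs_on_topic)


# Hypothetical toy corpus: four abstracts on the same topic,
# each using different vocabulary for it.
toy_corpus = [
    "Prebiotic synthesis of amino acids on ice films",
    "Abiotic synthesis of glycine in interstellar ice analogs",
    "Prebiotic chemistry at hydrothermal vents",
    "Organic synthesis pathways on early Mars",
]

for term in ("prebiotic synthesis", "abiotic synthesis", "synthesis"):
    print(term, query_term_overlap(toy_corpus, term))
```

In this toy corpus each specific phrase finds only one of the four documents, while the broader word "synthesis" finds three, which mirrors the trade-off between narrow and broad keywords on the slide.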
Keyword searches on Elsevier's ScienceDirect using some astrobiologically relevant synonyms
Keywords | Number of results | An article found only with this search
“prebiotic synthesis” | 566 | The origin of life – a review of facts and speculations
“prebiotic chemistry” | 611 | A possible circular RNA at the origin of life
“abiotic synthesis” | 312 | Recent advances in the chemical evolution and the origins of life
“organic synthesis” | 53,434 | Prebiotic organic synthesis in early Earth and Mars atmospheres
Keyword searching is inadequate without a well-defined vocabulary
Ontology
● Any arbitrary knowledge structure
● Which elements are included and excluded?
● How are the elements represented and related?
NASA ADS: over-associated ontology
Web of Science: broad ontology
ScienceDirect: narrow ontology
AIRFrame Project Goals
● Clustering not cloistering:
– Discover and relate diverse research concepts as a high-level activity
● Eliminate the need to search for data using combinations of specific keywords
● Show users conceptual and functional relationships between diverse research documents
Astrobiology Integrative Research Framework
Textpresso-based
● Open-source information retrieval and extraction system
– Developed at Caltech for biological researchers
– Deployed in 17 different focused literatures
● Two major components
– Ontology cataloging types of objects, abstract concepts, and relationships
– Database of documents marked up with XML based on the ontology
Current status: Proof-of-concept system in place
Textpresso Ontology
● A shallow ontology catalogs terms
Concepts, Descriptions, Relationships
Purpose
● Fulfill
● Make
● Govern
● Produce
● ...
NucleicAcid
● Adenine
● Cytosine
● DNA
● Guanine
● ...
Comparison
● Dissimilar
● Equal sized
● Like
● Related
● ...
Textpresso Database: Articles marked up with XML
All Documents
Single Document
Abstract
Body Text
Authors
Title
Individual sentences
We report the synthesis of glycine on interstellar ice analog films composed of water, methylamine (MA), and carbon dioxide ...
XML Markup
We report the <biologicalprocess>synthesis</biologicalprocess> of <aminoacid>glycine</aminoacid> on interstellar ice analog films <action>composed</action> of <volatile><water>water</water></volatile>, methylamine ( MA ), and <volatile><gases><chemicalelement>carbon</chemicalelement> dioxide</gases></volatile>
Ontology
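The markup step can be illustrated with a minimal sketch. This is not Textpresso's actual implementation: the `ONTOLOGY` dictionary and the `mark_up` function are hypothetical, each term belongs to exactly one category, and nested tags such as `<volatile><water>` in the slide example are not reproduced.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mini-ontology: category name -> terms in that category.
ONTOLOGY = {
    "biologicalprocess": ["synthesis"],
    "aminoacid": ["glycine"],
    "action": ["composed"],
    "water": ["water"],
}


def mark_up(sentence, ontology):
    """Wrap every ontology term found in the sentence in its category tag."""
    term_to_cat = {t.lower(): cat for cat, terms in ontology.items() for t in terms}
    if not term_to_cat:
        return escape(sentence)
    # Try longer terms first so multi-word terms win over their sub-terms.
    alternatives = sorted(term_to_cat, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in alternatives) + r")\b",
        re.IGNORECASE,
    )
    out, pos = [], 0
    for m in pattern.finditer(sentence):
        out.append(escape(sentence[pos:m.start()]))  # untouched text before the term
        cat = term_to_cat[m.group(0).lower()]
        out.append(f"<{cat}>{escape(m.group(0))}</{cat}>")
        pos = m.end()
    out.append(escape(sentence[pos:]))
    return "".join(out)


print(mark_up("We report the synthesis of glycine", ONTOLOGY))
```

Escaping the plain text keeps the output well-formed XML even when a sentence contains characters such as `<` or `&`.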
Textpresso system query
Allows search by meaning rather than specific keyword
Search by:
● keyword,
● keyword + synonyms,
● and/or whole categories
Ontology categories
System Query - Results
Results ranked by number of sentences with matching terms
[Screenshot: result sentences with highlighted matches labeled Keyword, Synonym, Category 1, and Category 2]
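Ranking by the number of matching sentences can be sketched as follows. The data layout here is an assumption made for illustration: each document is a list of `(sentence text, set of ontology categories found in it)` pairs, and a sentence matches when it contains at least one of the keyword forms (keyword plus synonyms) and every requested category. The real system's matching rules may differ.

```python
def count_matching_sentences(sentences, keywords=(), categories=()):
    """Count sentences that contain any keyword form and all requested categories."""
    keywords = [k.lower() for k in keywords]
    hits = 0
    for text, cats in sentences:
        has_keyword = not keywords or any(k in text.lower() for k in keywords)
        has_categories = all(c in cats for c in categories)
        if has_keyword and has_categories:
            hits += 1
    return hits


def rank_documents(corpus, keywords=(), categories=()):
    """Return document ids ordered by number of matching sentences, best first."""
    scored = []
    for doc_id, sentences in corpus.items():
        n = count_matching_sentences(sentences, keywords, categories)
        if n > 0:
            scored.append((n, doc_id))
    # Sort by descending hit count; ties are broken alphabetically by id.
    return [doc_id for n, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical marked-up corpus.
corpus = {
    "doc_a": [
        ("We report the synthesis of glycine on ice films.", {"biologicalprocess", "aminoacid"}),
        ("Glycine formation increased with dose.", {"aminoacid"}),
    ],
    "doc_b": [("Abiotic formation of glycine was observed.", {"aminoacid"})],
    "doc_c": [("Water ice dominates the film.", {"volatile"})],
}

print(rank_documents(corpus, keywords=["synthesis", "formation"], categories=["aminoacid"]))
```

Passing only `categories` gives a pure category search, which is the case where no specific keyword needs to be known in advance.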
Textpresso adoption challenges
● Other implementations use preexisting ontologies
● Other implementations are narrowly focused
● Many biological journals have open full-text access through a single source
Textpresso adoption challenges
● Current interface is rudimentary (at best)
– Ugly!
– Not intuitive
– Results spread across pages; hard to see overall picture
Current Work
● To give the ontology more breadth
– Convert to the SKOS standard, as used by the Vocabulary Explorer at IVOA
– Adapt Textpresso to directly read SKOS ontology
● A standards-based ontology can
– Ease porting existing ontologies to ours
– Allow easy sharing of our data with other systems
Ontology Standards Adoption
Why SKOS?
● Allows fast addition of terms from existing sources such as the International Astronomical Union Thesaurus and IVOA.
● Created to allow conversion from other formats
● SKOS allows the kind of relationships we want to leverage with AIRFrame
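To show the kind of relationships SKOS expresses, here is a minimal sketch that writes one concept in Turtle syntax using the standard SKOS properties `skos:prefLabel`, `skos:altLabel`, `skos:broader`, and `skos:related`. The `air:` namespace URI, the concept names, and the helper `concept_to_turtle` are hypothetical; labels are assumed to contain no double quotes, and a production system would more likely use an RDF library.

```python
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix air: <http://example.org/airframe#> .\n"  # hypothetical namespace
)


def concept_to_turtle(local_name, pref_label, alt_labels=(), broader=(), related=()):
    """Serialize one SKOS concept as a Turtle statement."""
    props = [f'skos:prefLabel "{pref_label}"@en']
    props += [f'skos:altLabel "{label}"@en' for label in alt_labels]   # synonyms
    props += [f"skos:broader air:{name}" for name in broader]          # parent concepts
    props += [f"skos:related air:{name}" for name in related]          # associative links
    body = " ;\n    ".join(["a skos:Concept"] + props)
    return f"air:{local_name} {body} ."


turtle = PREFIXES + "\n" + concept_to_turtle(
    "prebioticSynthesis",
    "prebiotic synthesis",
    alt_labels=["abiotic synthesis"],
    broader=["chemicalProcess"],
)
print(turtle)
```

Synonyms map to `skos:altLabel` and category membership to `skos:broader`, which correspond to the "keyword + synonyms" and "whole categories" search modes.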
Database Building
Workflow
[Workflow diagram with the elements: New articles; AIRFrame ontology; Mark up documents & index; Mine for new terms; Outside ontologies; AIRFrame searchable database]
Document Classification
● Use machine learning methods to discover connections in the data
– Possibly use several to give users many views.
● Developing a classifier will allow a user to enter a document and get related ones back
● My research area is information-theoretic clustering
– Information Bottleneck Method
Information Bottleneck Method
● Based on Shannon's theory of mutual information
– Measures the reduction in uncertainty or the distortion between an original signal and its compressed representation
– Where uncertainty about an entity is its entropy
$$I[x, y] = H[x] - H[x \mid y]$$
$$H[x] = \sum_{x}^{N} p(x) \log \frac{1}{p(x)}$$
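The entropy and mutual information definitions can be checked numerically with a short sketch. The function names and the nested-list representation of the joint distribution are assumptions for this example, and base-2 logarithms are used so that results are in bits (the slide leaves the base of the logarithm in H unspecified).

```python
import math


def entropy(p):
    """H = sum over outcomes of p * log2(1 / p), skipping zero-probability outcomes."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)


def mutual_information(joint):
    """I[x, y] = H[x] - H[x | y] for a joint distribution given as joint[x][y]."""
    p_x = [sum(row) for row in joint]            # marginal over y
    p_y = [sum(col) for col in zip(*joint)]      # marginal over x
    h_x_given_y = 0.0
    for j, py in enumerate(p_y):
        if py > 0:
            conditional = [joint[i][j] / py for i in range(len(joint))]  # p(x | y = j)
            h_x_given_y += py * entropy(conditional)
    return entropy(p_x) - h_x_given_y


# x and y perfectly correlated: knowing y removes all uncertainty about x.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
# x and y independent: knowing y tells us nothing about x.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```

The two cases bracket the range for a binary variable: mutual information equals the full entropy of x when y determines x, and vanishes when the two are independent.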
Information Bottleneck Method 2
● Data is compressed so that as much information as possible about a quantity of interest (in this case, words) is retained.
● Relative entropy or K-L divergence emerges as the distortion function
● The optimal cluster assignment rule is a Boltzmann distribution
$$\max_{p(c \mid x)} \left[ I[y, c] - T\, I[x, c] \right]$$
$$D_{KL}[p(y \mid x) \,\|\, p(y \mid c)] = \sum_{y} p(y \mid x) \log_2 \left[ \frac{p(y \mid x)}{p(y \mid c)} \right]$$
$$p(c \mid x) = \frac{p(c)}{Z(x, T)} \exp\left( -\frac{1}{T} D_{KL}[p(y \mid x) \,\|\, p(y \mid c)] \right)$$
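A single evaluation of the Boltzmann assignment rule can be sketched as below. This is only one step under stated assumptions, not the full iterative method, which also re-estimates p(c) and p(y|c) after each assignment pass. The function names and the toy word distributions are hypothetical, and every cluster distribution is assumed to be nonzero wherever the document distribution is nonzero.

```python
import math


def kl_divergence(p, q):
    """D_KL[p || q] in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def assign(p_y_given_x, p_c, p_y_given_c, T):
    """p(c|x) = p(c) / Z(x,T) * exp(-(1/T) * D_KL[p(y|x) || p(y|c)]) for one document x."""
    weights = [
        pc * math.exp(-kl_divergence(p_y_given_x, pyc) / T)
        for pc, pyc in zip(p_c, p_y_given_c)
    ]
    z = sum(weights)  # the normalizer Z(x, T)
    return [w / z for w in weights]


# Hypothetical example: one document over a two-word vocabulary, two clusters.
doc_words = [0.9, 0.1]                       # p(y|x)
cluster_prior = [0.5, 0.5]                   # p(c)
cluster_words = [[0.8, 0.2], [0.2, 0.8]]     # p(y|c) for each cluster

print(assign(doc_words, cluster_prior, cluster_words, T=0.5))
```

The temperature T controls how sharp the assignment is: a small T pushes the document toward the cluster with the closest word distribution, while a very large T leaves the assignment close to the prior p(c).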
Information Bottleneck Clustering: Preliminary Results
Database clustered into 16 clusters
[Bubble chart: x-axis = cluster number, y-axis = number of different journals, bubble size = number of documents in cluster]
Information Bottleneck Clustering: Preliminary Results
Cluster Journal Disciplines
[Stacked bar chart: x-axis = cluster number (1 to 16), y-axis = percentage (0% to 100%); legend: Other, Compute/Electronics, Biophysics, Chem, Astronomy/Astrophysics, Geo]
Where are we going with this?
● Visualizations!
Future Plans
● Create a cleaner, browsable interface which displays results and links in an easily understandable way
● Allow users to input a document and get back information such as:
– Connections to other researchers
– Relevant NASA/NAI goals
– Related articles
How do we get there?
● Continue work on automatic document classification/clustering
– Focus on top-level classification first
– Need at least an idea of how data fits into classes for visualization planning
– Look at relationships that arise from unsupervised methods and how to represent those.
● Continue to gather documents for the database
● Continue to build and improve the ontology
● Project is at a point where formal working guidelines can be put in place
– Investigate user interface requirements
– Build a set of use cases/scenarios
● Need usability studies for various visualization paradigms
– Need to discover where visualizations will/will not work in the system
● Hoping to collaborate with a graduate-level HCI class this semester
System Interface Design Process
Open Questions
● How best to use our semantic markup to present our data?
● How to visualize our information to allow discovery and understanding of connections?
● How best to search for and represent
– Category links between terms?
– Authorship linkages between projects?
– NASA/NAI goal links to research?
● Ontology building
– Can it be at least partially automated?
Long-term goals
● Iteratively build/design interface with user participation
● Integrate fine-grained Textpresso search with high-level visualizations
● Finish standardization of ontology and offer searchable interface or visualization for it
● Integrate Steve Freeland's amino acid database
– mechanism partly in place now
Thank you!
● For updates visit our site at:
– www.ifa.hawaii.edu/airframe
– “Proof-of-concept” AIRFrame/Textpresso is currently up and functioning
● www.ifa.hawaii.edu/airframe/textpresso
● Mahalo to:
– Justin Rajkowski & Kathryn Aiko Arinaga, LIS
– Liz Jayne, NAI REU participant