Machete Shannon Bradshaw, Marc Light, and Brian Almquist Department of Management Sciences School of Library and Information Science Department of Linguistics Department of Computer Science The University of Iowa


Machete

Shannon Bradshaw, Marc Light, and Brian Almquist
Department of Management Sciences
School of Library and Information Science
Department of Linguistics
Department of Computer Science
The University of Iowa

Reducing Info Management Problems

• Evolutionary biology
  – Many organisms
  – Many proteins
  – Many pathways

• Many information management problems

• A veritable goldmine for people like us

Knowledge Management (KM)

• Key idea:
  – Reduce duplicated effort in an organization or community

• Simple example:
  – Bob has a question
  – An effective KM framework will point Bob to Alice or Sharon, who both know the answer and will share it
  – Ineffective KM would require Bob to invest a great deal of time deciphering the answer for himself

• Want to reuse the experiences and previous efforts of a community to help an individual

Text Mining

• Extracting structured information from prose

• Example:
  – A table of protein-protein interactions distilled from individual interactions described in sentences scattered across several documents

KM and Text Mining

• Large research communities in both spaces

• Want to interleave them in a single tool

• Targeted to bioscience literature

• We call this tool Machete

Example

• Context:
  – Experiments identified many genes that were ankyrins or contained ankyrin repeats.

• Need:
  – Learn about ankyrins and ankyrin repeats

Knowledge Artifact

Using Artifacts: Personal level

Using Artifacts: Organizational level

Instead of doing the digging again

Lab members can reuse this

Using Artifacts: Community level

Finding documents

Reference Directed Indexing (RDI)

• Objective: To combine strong measures of both relevance and significance in a single metric

• Intuition: The opinions of authors who cite a document effectively distinguish both what a document is about and how important a contribution it makes

• Builds on the idea of using anchor text to index Web documents

Example

• Paper by Andrade, Perez-Iratxeta, and Ponting on protein repeats

A single reference to Andrade

The ankyrin repeat motif mediates protein–protein interactions and is found in a diverse array of protein families, including transcription factors, cytoskeletal proteins, proteins which regulate development, and toxins (Andrade et al., 2001).

Leveraging multiple citations

• For any document cited more than once…

• We can compare the words of all authors

• Terms used by many referrers make good index terms for a document

• Phrases and statements in citation sentences bring to the surface important findings
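
A minimal sketch of the idea that terms used by many referrers make good index terms, assuming the citation sentences have already been extracted. The tokenizer, the tiny stopword list, and the `min_referrers` threshold are illustrative assumptions, not details given in these slides.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "and", "is", "are", "in", "of", "a", "for", "to", "be"}

def tokenize(text):
    """Lowercase a sentence and return its set of non-stopword terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def index_terms(citation_sentences, min_referrers=2):
    """Return terms used by at least `min_referrers` distinct citing sentences."""
    referrer_counts = Counter()
    for sentence in citation_sentences:
        # Each referrer counts a term once, however often it repeats the term.
        referrer_counts.update(tokenize(sentence))
    return {t: n for t, n in referrer_counts.items() if n >= min_referrers}

citations = [
    "The ankyrin repeat motif mediates protein-protein interactions",
    "Ankyrin repeats are thought to be important for protein-protein interaction events",
    "Repeat proteins mediate numerous key protein-protein interactions in nature",
]
terms = index_terms(citations)
```

Without stemming, "repeat" and "repeats" count as different terms, so a real system would likely normalize word forms first.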

Repeated use of words and phrases

Repeat proteins mediate numerous key protein–protein interactions in nature.[1. and 2.] Their repetitive architecture permits the adaptation of their size…

The ankyrin repeat motif mediates protein–protein interactions such as ankyrin and β-propeller repeats [42]

Ankyrin repeats are thought to be important for protein–protein interaction events between integral membrane proteins and cytoskeletal proteins [Andrade et al., 2001].

The ankyrin repeat motif mediates protein–protein interactions

A voting technique

• RDI treats each citing document as a voter

• The presence of a query term in referential text is a vote of “yes”

• The absence of that term, a “no”

• The documents with the most votes for the query terms rank highest
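
The voting scheme above can be sketched as follows. This is a simplified illustration, not the actual RDI scoring: it sums raw "yes" votes with no normalization, matches whole lowercase words only, and the document ids and referential text are made up.

```python
def rdi_rank(query_terms, referential_text):
    """Rank cited documents by votes from the documents that cite them.

    referential_text maps a cited-document id to a list of strings, one per
    citing document (the text that document wrote around its citation).
    """
    query = [t.lower() for t in query_terms]
    scores = {}
    for doc_id, citing_texts in referential_text.items():
        votes = 0
        for text in citing_texts:
            words = set(text.lower().split())
            # One "yes" vote per query term present in this voter's text;
            # an absent term is a "no" and adds nothing.
            votes += sum(1 for term in query if term in words)
        scores[doc_id] = votes
    # Most votes first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical referential text for two cited documents.
refs = {
    "andrade2001": [
        "the ankyrin repeat motif mediates protein interactions",
        "ankyrin repeats are important for protein interaction events",
        "repeat proteins mediate key protein interactions",
    ],
    "other_paper": [
        "a study of membrane transport",
        "ankyrin binding in membrane transport",
    ],
}
ranking = rdi_rank(["ankyrin", "repeat"], refs)
```

A document cited by many authors who use the query terms collects more votes, which is how relevance and significance end up in one number.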

Extraction possibilities

• In addition to retrieval, citation sentences may also provide a valuable source of data for information extraction

• However, for the time being we are focusing on the content of documents for extraction purposes

Finding information within documents

Text Mining

Passage Retrieval

• Summarize gene function
• Support for GO assignment
• Speculative passages

Retrieve Docs by First Finding Genes

• Associate words with genes
• Collect word counts from user query doc set
• Return genes for which counts of associated words went up
• For each such gene, return docs where associated words were found
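
A rough sketch of these four steps. The slides do not say what the counts "went up" relative to, so this version assumes a baseline rate per word taken from some background corpus; the function name, data shapes, and all numbers are hypothetical.

```python
from collections import Counter

def genes_first(gene_words, query_docs, baseline_rate):
    """Return {gene: [doc ids]} for genes whose associated words became more frequent.

    gene_words:    {gene: set of associated words}
    query_docs:    {doc id: text} for the documents matching the user's query
    baseline_rate: {word: average occurrences per document in a background
                    corpus} (an assumed reference point for "went up")
    """
    tokens = {doc_id: text.lower().split() for doc_id, text in query_docs.items()}
    counts = Counter(w for words in tokens.values() for w in words)
    n_docs = len(query_docs)
    results = {}
    for gene, words in gene_words.items():
        # Words whose per-document rate in the query set exceeds the baseline.
        raised = {w for w in words
                  if n_docs and counts[w] / n_docs > baseline_rate.get(w, 0.0)}
        if raised:
            # Return the documents in which those words were found.
            results[gene] = sorted(d for d, ws in tokens.items() if raised & set(ws))
    return results

# Made-up associations, documents, and baseline rates.
gene_words = {"XPD": {"lung", "cpd"}, "STAT6": {"mrna", "il-4"}}
query_docs = {
    "d1": "lung cancer and cpd repair",
    "d2": "cpd lesions in dna",
    "d3": "dna repair overview",
}
baseline = {"lung": 0.1, "cpd": 0.05, "mrna": 0.5, "il-4": 0.2}
hits = genes_first(gene_words, query_docs, baseline)
```

Here only XPD is returned, with the two documents that mention its associated words.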

Retrieve Docs by First Finding Genes

Query: (DNA AND repair)

[Figure: count/word tables ("c", "words") for genes including STAT6 … and XPD xeroderma…; the tables shown list 4 Lung, 6 CPD; 6 Lung, 2 CPD, 2 TTD; and 3 mRNA, 4 IL-4.]

Look at the XPD gene and documents containing the Lung and CPD words

Text Mining

Info Extraction

• prot interacts-with prot
• prot located-in organelle
• gene associated phenotype

How It Works

• Linguistics knowledge
  Ankyrins bind to cell adhesion molecules of the CD44 family and the L1 CAM family…
  This facilitates assembly of a repressor complex containing HDAC, Rb, and E2F that blocks transcription of the gene for IGF-1…

• Semantic knowledge
  – dictionaries and ontologies

• Counts
  – co-occurrence statistics
  – redundancy, e.g., that x interacts with y is mentioned 345 times

Protein Interaction Extraction System We’re Building

• Inputs:
  – PubMed query (“Ankyrins”)
  – List(s) of proteins

• Output:
  – Table of interacting protein pairs and links
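
To make the input/output shape concrete, here is a toy sketch that builds a pair table from sentence-level co-occurrence counts. It is a stand-in, not the extraction system described in these slides: co-occurrence alone does not establish an interaction, the sentence splitter is naive, and the abstracts and document ids are invented.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurring_pairs(abstracts, proteins):
    """Count sentences in which two listed proteins are mentioned together.

    abstracts: {document id: text}; proteins: iterable of protein names.
    Returns {(protein_a, protein_b): (sentence count, sorted doc ids)}.
    """
    names = {p.lower(): p for p in proteins}
    counts = Counter()
    sources = {}
    for doc_id, text in abstracts.items():
        # Naive sentence split on ., !, ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
            found = sorted(names[w] for w in words if w in names)
            for pair in combinations(found, 2):
                counts[pair] += 1
                # Keep the source documents so each row can link back to them.
                sources.setdefault(pair, set()).add(doc_id)
    return {pair: (n, sorted(sources[pair])) for pair, n in counts.items()}

# Invented abstracts standing in for the results of a PubMed query.
abstracts = {
    "doc1": "Ankyrin binds CD44 in vitro. HDAC forms a complex with Rb and E2F.",
    "doc2": "CD44 recruits ankyrin to the membrane.",
}
table = cooccurring_pairs(abstracts, ["ankyrin", "CD44", "HDAC", "Rb", "E2F"])
```

The sentence count plays the role of the redundancy signal: a pair mentioned together many times is a stronger candidate than one mentioned once.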

Screen Shots

Clause-based Extraction

Collectively, these mutations also suppressed association of VDR with the coactivators GRIP1 and steroid receptor coactivator 1 in vitro but had little or no effect on ligand binding, heterodimerization with the retinoid X receptor, or association with a VDR-specific DNA recognition element

Method Section Mining

What fold concentrated Taq DNA polymerase buffer is optimal for the PCR reaction?

What plasmid DNA concentrations are needed for restriction digests?

In preparation for a Western blot, how long should GST lysate columns be incubated?

• We’re trying to build a system that can find answers to such questions

Dictionary Construction

• You people use so many words for the same thing: abbreviations, different uses of punctuation, totally different names
  – histone deacetylase 4, HDAC, HD, KIAA0288

• What is a poor computer to do?
• Computers need synonym lists and other information about words

The Info Is Out There (it just needs to be collated)

• Gene and protein entries in SwissProt, HUGO, GDB, OMIM, GenAtlas, LocusLink, InterPro have aliases

• They are all stored in different formats
• They each contain some of the synonyms
• They are only partially cross-referenced
• Genes from non-model organisms are less likely to be in some database somewhere (unless there is a homolog) (???)

• Grunt work is required
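
One way the collation step could look, assuming each database entry has already been parsed into a plain list of aliases. The sketch merges entries that share any name using a union-find structure; the records below are made-up stand-ins, and real collation would also need per-database parsers and handling of ambiguous names.

```python
def merge_aliases(records):
    """Merge alias lists that share at least one name into synonym sets.

    records: iterable of alias lists, one per database entry.
    Names are compared case-insensitively; returns a sorted list of sorted lists.
    """
    parent = {}

    def find(name):
        # Walk to the representative of the group, shortening the path as we go.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for aliases in records:
        keys = [a.lower() for a in aliases]
        for key in keys:
            find(key)
        for key in keys[1:]:
            # Link every alias in this entry to the entry's first alias.
            parent[find(key)] = find(keys[0])

    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())

# Made-up entries standing in for partially overlapping database records.
records = [
    ["histone deacetylase 4", "HDAC"],
    ["HDAC", "HD"],
    ["KIAA0288", "histone deacetylase 4"],
    ["ankyrin 1", "ANK1"],
]
synonyms = merge_aliases(records)
```

No single record lists all four names for the first gene, but the shared names chain the records together into one synonym set.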

Nuts + bolts

PDF: Human sees

Machine sees

Challenges

• Each word/few words placed using x,y coordinates
• Acrobat is just painting a picture. It has no sense of the content of documents.
• Difficult to:
  – Follow flow of prose
    • Single or multi-column?
    • Some text spans multiple columns
    • Headers/footers
  – Determine section breaks
  – Distinguish image/figure caption from body text
  – Parse bibliography entries

• Every document has a different layout format
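
A small sketch of why reading order is hard to recover from positioned words, and one simple heuristic for it. It assumes each word comes with a left edge, right edge, and line position, that y increases down the page, that words on one line share the same y, and that columns are separated by a clear vertical gutter. The `min_gutter` value and the sample coordinates are invented, and text that spans columns, headers, and footers would all break this.

```python
def reading_order(words, min_gutter=15.0):
    """Order positioned words column by column, top to bottom.

    words: list of (x0, x1, y, text), where x0/x1 are the left/right edges
    and y is the line position, increasing down the page (an assumption).
    """
    # 1. Merge horizontal extents into column bands; a gap of at least
    #    min_gutter between extents starts a new column.
    bands = []
    for x0, x1, _, _ in sorted(words):
        if bands and x0 - bands[-1][1] < min_gutter:
            bands[-1][1] = max(bands[-1][1], x1)
        else:
            bands.append([x0, x1])
    # 2. Within each column, read line by line, then left to right.
    ordered = []
    for left, right in bands:
        column = [w for w in words if left <= w[0] <= right]
        column.sort(key=lambda w: (w[2], w[0]))
        ordered.extend(w[3] for w in column)
    return ordered

# Two columns, listed in the left-to-right order a naive extractor might emit.
words = [
    (10, 40, 100, "Ankyrin"), (45, 80, 100, "repeats"), (200, 240, 100, "Methods"),
    (10, 50, 112, "mediate"), (55, 90, 112, "binding"), (200, 230, 112, "were"),
]
flow = reading_order(words)
```

Reading the input list in order interleaves the two columns; grouping by column band first keeps each column's prose together.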

Article has 3 columns, but text in PDF file may flow from left to right

Is this one block of text or part of two columns?

Is this part of the body or footer information?

Is this part of the article?

PDF Highlighting

• Multivalent Browser Annotations
  – Primarily useful for highlighting
  – Alternative annotations
    • Highlighting with comments
  – Stored separately from document
    • Local to user/machine
• How would this information be shared?
• Can they be “fused” with the document?

Multivalent Interface

Multivalent

• Highly extensible
  – We have some degree of freedom to modify
  – Interface is treated as part of the viewed document

PDF: Inserting Hyperlinks

• Current system
  – Finds specified terms
  – Adds specified hyperlinks as an overlay over each instance of a search term
  – Outputs modified PDF

<<links to Before/After files???>>

PDF: Inserting Hyperlinks

• Design Goals
  – Multi-platform support
  – Web-based interface
    • Maintaining list of terms/URLs
    • Submitting PDFs/URLs to URLs
  – Extend to other forms of annotations
• Limitations
  – Certain PDFs cannot be converted to text (scanned image, certain PostScript and DVI conversions)
  – Search is not robust: no hyphenations
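
The hyphenation limitation could be approached as sketched below, assuming the search runs over text already extracted from the PDF. The function is hypothetical and not part of the current system: it builds a regular expression that lets a hyphen plus line break appear between any two characters of the term, so it can also over-match words that genuinely contain a hyphen at a line end.

```python
import re

def find_term(text, term):
    """Find a term in extracted text, tolerating line-break hyphenation.

    Returns (start, end) character offsets into `text` for each match.
    """
    # Spaces in the term match any run of whitespace, including newlines.
    parts = [r"\s+" if ch.isspace() else re.escape(ch) for ch in term]
    # Between any two characters, allow an optional hyphen followed by a
    # line break, which is how a word split across lines is extracted.
    pattern = r"(?:-\s*\n\s*)?".join(parts)
    return [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

text = "the anky-\nrin repeat motif and Ankyrin binding"
hits = find_term(text, "ankyrin")
matched = [text[start:end] for start, end in hits]
```

The returned offsets point at the original text, so an overlay hyperlink could still be placed over both halves of a split word.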