
Introduction to Biomedical Informatics

Text Mining

© Hayes/Smyth


Outline

• Introduction and Motivation

• Techniques
– Document classification
– Document clustering
– Topic discovery from text
– Information extraction

• Additional Resources and Recommended Reading


Motivation for Text Mining

• PubMed
– PubMed adds approximately 1 million new articles per year
– Human annotation cannot keep up – increased demand for automation
– Problem is even greater in other domains (e.g., Web search in general)


From Jensen, Saric, Bork, Nature Reviews Genetics, 2006


Text Mining Problems

• Classification: automatically assign a document to 1 or more categories
– “easy” problem: is an email spam or non-spam?
– “hard” problem: assign MeSH terms to new PubMed articles

• Clustering: group a set of documents into clusters
– e.g., automatically group docs in search results by theme

• Topic Discovery: discover themes in docs and index docs by theme
– e.g., discover new scientific concepts not yet covered by MeSH terms

• Information Extraction: extract mentions of entities from documents
– “easy” problem: extract all gene names mentioned in a document
– “hard” problem: extract a set of facts relating genes in a document, e.g., statements such as “gene A activates gene B”


Classification and Clustering

• We already discussed these methods in the context of general data mining in earlier lectures. Now we want to apply these techniques specifically to text data

• Recall:
– Given a vector of features x, a classifier maps x to a target variable y, where y is categorical, e.g., y = {has cancer, does not have cancer}
– Learning a classifier consists of being given a training data set of pairs of x’s and y’s, and learning how to map x to y
– Clustering is similar, but our training data doesn’t have any target y values – we have to discover the target values (the “clusters”) automatically


Classification and Clustering for Text Documents

• Document representation
– Most classification and clustering algorithms assume that each object (here a document) to be classified can be represented as a fixed-length vector of variable/feature/attribute values
– So how do we convert documents into fixed-length vectors?
– “Bag of Words” representation
• Each vector entry represents whether term j occurred in a document, or how often it occurred
• Same idea as for information retrieval
• Ignores word order and document structure… but found to work well in practice, and has considerable computational advantages over working with the document directly

Once we have our vector (bag of words) representation we can use any classification or clustering method on our documents
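For example, a minimal bag-of-words sketch in Python with scikit-learn (the three example documents here are made up for illustration):

```python
# Convert raw documents into fixed-length term-count vectors.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "gene A activates gene B",
    "the patient was diagnosed with cancer",
    "gene expression measured with a microarray",
]

vectorizer = CountVectorizer()         # use binary=True for 0/1 occurrence features
X = vectorizer.fit_transform(docs)     # sparse matrix: n_docs x n_unique_terms

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # each row = one document's term counts
```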


Document Classification


Document Classification

• Document classification has many applications
– Spam email detection
– Automated tagging of biomedical articles (e.g., in PubMed)
– Automated creation of Web-page taxonomies

• Data Representation
– “Bag of words” most commonly used: either counts or binary
– Can also use “phrases” for commonly occurring combinations of words

• Classification Methods
– Naïve Bayes widely used (e.g., for spam email)
• Fast and reasonably accurate
– Support vector machines (SVMs)
• Often the most accurate method in research studies
• But more computationally complex than other methods
– Logistic Regression (regularized)
• Widely used in industry, often competitive with SVMs


Trimming the Vocabulary

• Stopword removal:
– remove “non-content” words
• very frequent “stop words” such as “the”, “and”, …
– remove very rare words, e.g., words that occur only a few times in 1 million documents
• Often results in removal of 30% or more of the original unique words

• Stemming:
– Reduce all variants of a word to a single term
– e.g., {draw, drawing, drawings} -> “draw”
– Can use the Porter stemming algorithm (1980)

• This still often leaves us with 10,000 to 1 million unique terms => a very high-dimensional classification problem!
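A minimal sketch of both trimming steps with NLTK (assumes the NLTK stopword list has been downloaded via nltk.download("stopwords")):

```python
# Vocabulary trimming: stopword removal followed by Porter stemming.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = "the drawings were drawn while the drawing was framed".split()
trimmed = [stemmer.stem(t) for t in tokens if t not in stop_words]
print(trimmed)  # e.g., ['draw', 'drawn', 'draw', 'frame']
```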


Classifying Term Vectors

• Typically multiple different terms or words may be helpful
– Class = “finance”
– Words = “stocks”, “return”, “interest”, “rate”, etc.
– Thus, classifiers that combine multiple features often do well, e.g., naïve Bayes, logistic regression, SVMs, etc. (compared to decision trees, for example, which branch on 1 word at a time)

• Linear classifiers often perform well in high dimensions
– Typically we have a large number of features/dimensions in text classification
– Theory and experiments tell us linear classifiers do well in high dimensions
– So naïve Bayes, logistic regression, and linear SVMs are all useful

• Main questions in practice are:
– which terms to use in the classifier?
– which linear classifier to select?


Feature Selection

• Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms
– See classification results later in these slides

• Greedy search
– Start from an empty set or the full set and add/delete one term at a time
– Heuristics for adding/deleting:
• Information gain (mutual information of term with class)
• Chi-square
• Other ideas
– Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance
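A sketch of the chi-square variant with scikit-learn (the count matrix X and labels y are assumed to exist, e.g., from the earlier bag-of-words sketch; k is an illustrative choice):

```python
# Keep only the k highest-scoring terms under the chi-square heuristic.
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=1000)       # k is a tuning parameter
X_reduced = selector.fit_transform(X, y)   # n_docs x 1000 selected terms
```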


Example of the Role of Feature Selection (from Chakrabarti, Chapter 5)

9,600 documents from the US Patent database; 20,000 raw features (terms)


Types of Classifiers

Let c be the class label and let x be a vector of features

• Generative/Probabilistic
– Model p(x | c) for each class, then estimate p(c | x)
– e.g., naïve Bayes model

• Conditional Probability/Regression
– Model p(c | x) directly
– e.g., logistic regression

• Discriminative
– Look for decision boundaries in input space x directly
• No probabilities
– e.g., perceptron, linear discriminants, SVMs, etc.


Probabilistic “Generative” Classifiers

• Model p(x | c_k) for each class and perform classification via Bayes’ rule:

c = arg max_k { p(c_k | x) } = arg max_k { p(x | c_k) p(c_k) }

• How to model p(x | c_k)?
– p(x | c_k) = probability of a “bag of words” x given a class c_k

• Two commonly used approaches (for text):
– Naïve Bayes: treat each term x_j as being conditionally independent, given c_k
– Multinomial: model a document with N words as N tosses of a p-sided die


Naïve Bayes Classifier for Text

• Naïve Bayes classifier = conditional independence model
– Assumes terms are conditionally independent given the class:

p(x | c_k) = ∏_j p(x_j | c_k)

– Note that we model each term x_j as a discrete random variable
– Binary terms (Bernoulli):

p(x | c_k) = ∏_{j: x_j=1} p(x_j = 1 | c_k) × ∏_{j: x_j=0} p(x_j = 0 | c_k)


Multinomial Classifier for Text

• Multinomial classification model
– Assume that the words of a document are generated by N tosses of a p-sided die (multinomial model):

p(c_k | x) ∝ p(c_k) ∏_j p(x_j | c_k)^{n_j}

where N = total number of terms and n_j = number of times term j occurs in the document

– Here we have a single random variable per class, and the p(x_j | c_k) probabilities sum to 1 over j (i.e., a multinomial model)
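A minimal sketch of this classifier in Python/numpy (illustrative only; the variable names and the smoothing constant alpha are assumptions, not from the original slides):

```python
import numpy as np

def train_multinomial(X, y, alpha=1.0):
    """X: (n_docs, n_terms) count matrix; y: integer class labels."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    log_cond = np.zeros((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        counts = X[y == c].sum(axis=0) + alpha       # smoothed term counts
        log_cond[i] = np.log(counts / counts.sum())  # p(term j | class c)
    return classes, log_prior, log_cond

def predict(X, classes, log_prior, log_cond):
    # log p(c_k | x) up to a constant = log p(c_k) + sum_j n_j log p(x_j | c_k)
    scores = X @ log_cond.T + log_prior
    return classes[np.argmax(scores, axis=1)]
```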


Highest Probability Terms in Multinomial Distributions


Common Data Sets used for Evaluation

• Reuters
– 10,700 labeled documents
– 10% of documents have multiple class labels

• Yahoo! Science Hierarchy
– 95 disjoint classes with 13,598 pages

• 20 Newsgroups data
– 18,800 labeled USENET postings
– 20 leaf classes, 5 root-level classes

• WebKB
– 8,300 documents in 7 categories such as “faculty”, “course”, “student”

• Industry
– 6,449 home pages of companies partitioned into 71 classes


Comparing Naïve Bayes and Multinomial Models

McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments


Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)


Comparing Bernoulli and Multinomial (slide from Chris Manning, Stanford)

Results from classifying 13,589 Yahoo! Web pages in the Science subtree of the hierarchy into 95 different classes


WebKB Data Set

• Train on ~5,000 hand-labeled web pages
– Cornell, Washington, U. Texas, Wisconsin
• Crawl and classify a new site (CMU)
• Results:

Category:   Student  Faculty  Person  Project  Course  Department
Extracted:    180      66      246      99      28        1
Correct:      130      28      194      72      25        1
Accuracy:     72%      42%     79%      73%     89%      100%


Comparing Bernoulli and Multinomial on Web KB Data


Comments on Generative Models for Text (applicable to both naïve Bayes and multinomial classifiers)

• Simple and fast => popular in practice
– e.g., linear in p, n, M for both training and prediction

• Training = “smoothed” frequency counts, e.g.,

p(x_j = 1 | c_k) = (n_jk + 1) / (n_k + m)

where n_jk = number of class-c_k documents containing term j, n_k = number of class-c_k documents, and m = a smoothing constant

– e.g., easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email)

• Numerical issues
– Typically work with log p(c_k | x), etc., to avoid numerical underflow
– Useful trick:
• when computing ∑_j log p(x_j | c_k) for sparse data, it may be much faster to
– precompute ∑_j log p(x_j = 0 | c_k) over all terms
– and then, for just the terms present in the document, subtract log p(x_j = 0 | c_k) and add log p(x_j = 1 | c_k)

• Note: both models are “wrong”, but for classification they are often sufficient
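A sketch of that sparse-data trick for the Bernoulli model (log_p0 and log_p1 are assumed to be precomputed (n_classes x n_terms) arrays of log p(x_j = 0 | c_k) and log p(x_j = 1 | c_k)):

```python
import numpy as np

def bernoulli_log_likelihood(present_term_ids, log_p0, log_p1):
    """Log-likelihood of a sparse document under each class."""
    base = log_p0.sum(axis=1)  # "all terms absent" baseline; cache this per model
    # correct the baseline only for the (few) terms actually present
    delta = (log_p1[:, present_term_ids] - log_p0[:, present_term_ids]).sum(axis=1)
    return base + delta        # one log-likelihood value per class
```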


Beyond Independence

• Naïve Bayes and multinomial models both assume conditional independence of words given the class

• Alternative approaches try to account for higher-order dependencies
– Bayesian networks:
• p(x | c) = ∏_j p(x_j | parents(x_j), c)
• Equivalent to a directed graph where edges represent direct dependencies
• Various algorithms search for a good network structure
• Useful for improving the quality of the distribution model
• … however, this does not always translate into better classification
– Maximum entropy models:
• p(x | c) = (1/Z) ∏_subsets f(subset(x) | c)
• Equivalent to an undirected graph model
• Estimation is equivalent to a maximum entropy assumption
• Feature selection is crucial (which f terms to include)
• Can provide high-accuracy classification
• … however, tends to be computationally complex to fit (estimating Z is difficult)

optional


Basic Concepts of Support Vector Machines

Circles = support vectors = the points on the convex hull of each class that are closest to the hyperplane

M = margin = distance of the support vectors from the hyperplane

Goal is to find the weight vector that maximizes M


Reuters Data Set

• 21,578 documents, labeled manually
• 9,603 training, 3,299 test articles
• 118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions

• Example “interest rate” article:
2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
FRANKFURT, March 2 – The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.

• Common categories (#train, #test):
– Earn (2877, 1087)
– Acquisitions (1650, 179)
– Money-fx (538, 179)
– Grain (433, 149)
– Crude (389, 189)
– Trade (369, 119)
– Interest (347, 131)
– Ship (197, 89)
– Wheat (212, 71)
– Corn (182, 56)


Dumais et al. 1998: Reuters - Accuracy

Category     NBayes   Trees   LinearSVM
earn         95.9%    97.8%   98.2%
acq          87.8%    89.7%   92.8%
money-fx     56.6%    66.2%   74.0%
grain        78.8%    85.0%   92.4%
crude        79.5%    85.0%   88.3%
trade        63.9%    72.5%   73.5%
interest     64.9%    67.1%   76.3%
ship         85.4%    74.2%   78.0%
wheat        69.7%    92.5%   89.7%
corn         65.3%    91.8%   91.1%

Avg Top 10   81.5%    88.4%   91.4%
Avg All Cat  75.2%    n/a     86.4%


Precision-Recall for SVM (linear), Naïve Bayes, and NN (from Dumais 1998), using the Reuters data set


Comparing Text Classifiers

• Naïve Bayes and multinomial models
– Low time complexity (training = a single linear pass through the data)
– Generally good, but not always the best, performance
– Widely used for spam email filtering

• Linear SVMs, Logistic Regression
– Often produce the best results in research studies
– But more computationally complex to train (particularly SVMs)

• Others
– Decision trees: less widely used, but can be useful
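A sketch comparing the three families with scikit-learn (a bag-of-words train/test split X_train, y_train, X_test, y_test is assumed to exist):

```python
# Fit and score naive Bayes, a linear SVM, and regularized logistic regression.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

for clf in [MultinomialNB(), LinearSVC(), LogisticRegression(max_iter=1000)]:
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(type(clf).__name__, acc)
```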


Learning with Labeled and Unlabeled Documents

• In practice, obtaining labels for documents is time-consuming, expensive, and error-prone
– Typical application: a small number of labeled docs and a very large number of unlabeled docs

• Idea:
– Build a probabilistic model on the labeled docs
– Classify the unlabeled docs, getting p(class | doc) for each class and doc
• This is equivalent to the E-step in the EM algorithm
– Now relearn the probabilistic model using the new “soft labels”
• This is equivalent to the M-step in the EM algorithm
– Continue to iterate until convergence (e.g., class probabilities do not change significantly)
– This EM approach shows that unlabeled data can improve classification performance compared to labeled data alone

optional
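A rough sketch of this EM loop using scikit-learn's naive Bayes (X_lab, y_lab, X_unlab are assumed given; here each unlabeled doc enters the M-step once per class, weighted by its posterior probability):

```python
import numpy as np
import scipy.sparse
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_lab, y_lab)        # initial model: labeled docs only
n_classes = len(clf.classes_)
for _ in range(20):                            # or iterate until convergence
    probs = clf.predict_proba(X_unlab)         # E-step: soft labels p(class | doc)
    # M-step: refit on labeled docs plus class-weighted copies of unlabeled docs
    X_aug = scipy.sparse.vstack([X_lab] + [X_unlab] * n_classes)
    y_aug = np.concatenate(
        [y_lab] + [np.full(X_unlab.shape[0], c) for c in clf.classes_])
    w_aug = np.concatenate(
        [np.ones(X_lab.shape[0])] + [probs[:, i] for i in range(n_classes)])
    clf = MultinomialNB().fit(X_aug, y_aug, sample_weight=w_aug)
```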


Learning with Labeled and Unlabeled Data

(from “Semi-supervised text classification using EM”, Nigam, McCallum, and Mitchell, 2006)


Other issues in text classification

• Real-time constraints
– Being able to update classifiers as new data arrives
– Being able to make predictions very quickly in real time

• Document length
– Varying document length can be a problem for some classifiers
– The multinomial model tends to handle this better than the Bernoulli model, for example

• Multi-labels and multiple classes
– Text documents can have more than one label
– SVMs, for example, are inherently binary classifiers, so multi-class and multi-label problems must be decomposed into multiple binary problems


Other issues in text classification (continued)

• Feature selection
– Experiments have shown that feature selection (e.g., by greedy algorithms using information gain) can often improve results

• Linked documents
– Can view Web documents as nodes in a directed graph
– Classification can then leverage the link structure
• Heuristic: class labels of linked pages are more likely to be the same
– The optimal solution is to classify all documents jointly rather than individually
– The resulting “global classification” problem is typically computationally complex


Background Resources: Document Classification

• S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003
– See Chapter 5 for a discussion of text classification

• C. D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008
– Chapters 13 to 15 on text classification (and Chapters 16 and 17 on text clustering)
– http://nlp.stanford.edu/IR-book/information-retrieval-book.html

• SVMs for text classification
– T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002


Document Clustering


Document Clustering

• For clustering we can use either
– Vectors to represent each document (e.g., bag of words)
• Useful for clustering algorithms such as k-means or probabilistic (model-based) clustering
– or, for N documents, an N x N similarity matrix
• Doc-doc similarity can be defined in different ways (e.g., cosine similarity on TF-IDF vectors)
• Useful for clustering methods such as hierarchical clustering

• Unlike classification, there is typically less concern with selecting the “best” vocabulary for clustering
– just remove common stop words and infrequent words
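A sketch of the similarity-matrix route with scikit-learn and scipy (the list docs of raw document strings, and the distance threshold, are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

X = TfidfVectorizer(stop_words="english", min_df=2).fit_transform(docs)
dist = 1.0 - cosine_similarity(X)          # N x N doc-doc distance matrix
np.fill_diagonal(dist, 0.0)                # guard against floating-point noise
Z = linkage(squareform(dist, checks=False), method="single")  # single-link
labels = fcluster(Z, t=0.8, criterion="distance")  # cut dendrogram into clusters
```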


Case Study: Clustering of 2 Million PubMed Articles

Reference: Boyack et al., Clustering More than Two Million Biomedical Publications…, PLoS ONE, 6(3), March 2011

• Data set: 2.15 million articles in PubMed
– all articles published between 2004 and 2008 with at least 5 MeSH terms

• Data for each document
– MeSH terms
– Words from titles and abstracts

• Preprocessing
– MEDLINE stopword list of 132 words + 300 words commonly used at NIH
– Terms appearing in fewer than 4 documents were removed
– 272,926 unique terms and 175 million document-term pairs


Methodology

• Methods compared
– Data sources: MeSH only versus title/abstract words only
– Similarity metrics:
• TF-IDF cosine (see earlier lectures)
• Latent semantic indexing/analysis (see earlier lectures)
• Topic modeling (discussed later in these slides)
• Self-organizing map (a neural network method)
• Poisson-based model
– 25 million similarity pairs computed
• Approximately the top-12 most similar documents for each document

• 9 sets of clusters compared
– 9 combinations of clustering data source + similarity metric evaluated
– Hierarchical (single-link) clustering applied to each of the 9 similarity sets
– Heuristics used to determine when clusters can no longer be merged


Evaluation Methods and Results

• Evaluation metrics
– Textual coherence within a cluster (see paper)
– Coherence of papers within a cluster in terms of funding source
– (Question: how reliable are these metrics?)

• Conclusions (from the paper)
– PubMed’s own related-article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts
– Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts


Results: Textual Coherence


Two-dimensional map of the highest-scoring cluster solution, representing nearly 29,000 clusters and over two million articles.


Background Resources: Document Clustering

• Papers
– Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6(3): e18029. doi:10.1371/journal.pone.0018029
– Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, Scatter/Gather: a cluster-based approach to browsing large document collections, Proceedings of ACM SIGIR ’92
– Ying Zhao and George Karypis (2005) Hierarchical clustering algorithms for document data sets, Data Mining and Knowledge Discovery, Vol. 10, No. 2, pp. 141-168

• MALLET (Software)
– Java-based package for classification, clustering, topic modeling, and more…
– http://mallet.cs.umass.edu/


Topic Modeling

Some slides courtesy of David Newman, UC Irvine


Topics = Multinomial Distributions


What is Topic Modeling?

• Topic = probability distribution (multinomial) over words
• A document is assumed to be a mixture of topics
– Each document is represented by a probability distribution over topics
– Note that this is different from clustering, which assigns each doc to 1 cluster

• The learning algorithm is completely unsupervised
– No labels required

• Output of the learning algorithm
– T topics, each represented by a probability distribution over words
– For each document, its mixture of topics
– For each word in each document, the topic it is assigned to
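A minimal usage sketch with one topic-modeling library, gensim (texts is an assumed list of tokenized documents; the number of topics is an illustrative choice):

```python
from gensim import corpora, models

dictionary = corpora.Dictionary(texts)               # token -> integer id map
corpus = [dictionary.doc2bow(doc) for doc in texts]  # bag-of-words corpus
lda = models.LdaModel(corpus, num_topics=20, id2word=dictionary)

print(lda.show_topics(num_topics=3, num_words=5))    # top words of some topics
print(lda[corpus[0]])                                # topic mixture of document 0
```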


What is Topic Modeling useful for?

• Summarize large document collections

• Retrieve documents

• Automatically index documents

• Analyze text

• Measure trends

• Generate topic-based networks


Topic Modeling vs. Other Approaches

• Clustering (summarization)
– The topic model typically outperforms clustering
– Clustering: a document belongs to a single cluster
– Topic modeling: a document is composed of multiple topics

• Latent semantic indexing (theme discovery)
– Topics tend to be more interpretable than LSI “vectors”

• TF-IDF (information retrieval)
– Keywords are often too specific; topics (or a combination of topics and keywords) are usually better


Topic Modeling vs. Clustering

The topic model has the advantage over clustering that it can assign multiple topics per document. Example (one cluster vs. multiple topics):

Hidden Markov Models in Molecular Biology: New Algorithms and Applications
Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure

Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.

[cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov

[topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling

[topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes


History of Topic Modeling

1990: Latent Semantic Analysis (Deerwester et al.)
1999: Probabilistic Latent Semantic Analysis (Hofmann)
2003: Latent Dirichlet Allocation (Blei, Ng, Jordan)
2004: Gibbs sampling (Griffiths & Steyvers)
2004+: A variety of extensions and applications…


How are the Topics Learned?

• The most widely used learning algorithm is based on Gibbs sampling
– This is essentially a stochastic search technique from statistics

• Sketch of the algorithm, for learning T topics
– Initialize each word in each document to a topic number from 1 to T
– Cycle through each word and resample its topic assignment, based on
• which other topics are assigned to words in this document
• which other words are assigned to this topic
– Continue to cycle through the data until convergence, or for a fixed number of iterations
– Typically takes 20 to 100 iterations to converge
– Each iteration is linear in the number of word tokens in the corpus (good!)
– The algorithm can easily be parallelized (a parallelization algorithm invented at UCI is now used by Google, Yahoo, etc.)


Equation for Sampling the Topic for each Word

The topic modeling algorithm uses a probabilistic model to stochastically and iteratively assign topics to every word in the corpus.

p(z_i = t | z_{-i}) ∝ (n_{w_i,t} + β) / (n_t + Wβ) × (n_{d_i,t} + α) / (n_{d_i} + Tα)

where
– n_{w_i,t} = count of word w_i assigned to topic t
– n_t = total count of words assigned to topic t
– n_{d_i,t} = count of topic t assigned to words in doc d_i
– n_{d_i} = number of words in doc d_i
– W = vocabulary size, T = number of topics, α and β = smoothing parameters

i.e., the probability that word i is assigned to topic t combines a word-topic term and a document-topic term

optional
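A from-scratch sketch of the sampler implied by this equation (docs is an assumed list of word-id lists; W = vocabulary size, T = number of topics; hyperparameter values are illustrative):

```python
import numpy as np

def gibbs_lda(docs, T, W, alpha=0.1, beta=0.01, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    nwt = np.zeros((W, T))            # count of word w assigned to topic t
    ndt = np.zeros((len(docs), T))    # count of topic t assigned to doc d
    nt = np.zeros(T)                  # total words assigned to topic t
    z = [rng.integers(T, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]           # remove the current assignment
                nwt[w, t] -= 1; ndt[d, t] -= 1; nt[t] -= 1
                # word-topic term x doc-topic term
                # (the per-document denominator is constant, so it is dropped)
                p = (nwt[w] + beta) / (nt + W * beta) * (ndt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())          # resample the topic
                z[d][i] = t
                nwt[w, t] += 1; ndt[d, t] += 1; nt[t] += 1
    return nwt, ndt, z   # counts give topic-word / doc-topic estimates
```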


Word/Document Counts for 16 Artificial Documents

[Figure: a 16 documents x 5 words count matrix over the vocabulary {river, stream, bank, money, loan}]

Can we recover the original topics and topic mixtures from this data?


Example of Gibbs Sampling

• Assign word tokens randomly to 2 topics:

[Figure: the same word/document count matrix, with each word token randomly assigned to one of the 2 topics]


After 1 iteration: [Figure: the token-topic assignments after 1 Gibbs sampling iteration]


After 4 iterations: [Figure: the token-topic assignments after 4 Gibbs sampling iterations]


After 32 iterations: [Figure: the token-topic assignments after 32 Gibbs sampling iterations]

topic 1: stream .40, bank .35, river .25
topic 2: bank .39, money .32, loan .29


Explaining Topic Modeling with New York Times Documents

Topic Modeling begins with the text from a corpus of documents…

About a month after the attacks of Sept. 11, firefighters’ widows began showing up at an obscure investment firm on Long... ¶ To hear Raymond W. Baker tell it, companies in the United States and throughout the West have it exactly backward when i... ¶ You don’t have to be honest to use the Internet. E-mail scams are a growing problem. You’ve almost certainly heard from ... ¶ In a city where banks sprout like mushrooms, and die just as fast, the money-moving network described in the unfolding B... ¶ The federal government’s chief witness against Robert E. Brennan, the former penny-stock promoter, concluded his testimo... ¶ From the well-appointed offices of Lehman Brothers in Manhattan, federal prosecutors say, Consuelo Marquez, a Barnard Co... ¶ Appearing calm and rested, and smiling occasionally at one another and their lawyers, Edwards and Berlin described the o... ¶ Cary Stayner was sentenced Thursday to die for committing three murders near Yosemite National Park in 1999. For FBI age... ¶ As murders go, it was the sort guaranteed to fetch modest attention. Two men were held up at gunpoint in a restaurant pa... ¶ Teachers loaded some of the elementary school children onto buses. Another group was hustled down the hill to the old wo... ¶ It started as a whisper from nowhere. Then the whispers became voices. James Michael Brown had no choice but to listen. ... ¶ A day is worth more than 24 hours to Coral Eugene Watts. It’s worth 72, a three-for-one bargain, a triple-time countdown... ¶ The middle-aged man and the teenager were footloose traveling companions on a fathomless mission of horror. For three we... ¶ A jury convicted Michael C. Skakel on Friday of murder in the 1975 bludgeoning death of Martha Moxley, using more than a... ¶ On July 13, 1990, an expert testified in a Manhattan court that DNA analysis of semen found in the victim in the Central... ¶ As the man suspected of killing as many as 18 women sat in a downtown prison cell, investigators fielded calls from othe... ¶ For Johnson, an ex-Marine, stucco specialist and associate pastor of a small Kansas City church, the next couple of hour... ¶


Initially, topics are randomly assigned to the words in each document

For this example, we’ve set the number of topics to 10.

1975[7] bludgeoning[2] convicted[4] death[10] Friday[6] jury[2] Martha[9] Michael[3] more[3] Moxley[9] murder[6] Skakel[5] using[2] ¶ against[4] Brennan[3] chief[1] concluded[3] federal[8] former[2] governments[4] penny-stock[6] promoter[3] Robert[9] testimony[4] witness[1] …

(the bracketed number after each word is its randomly assigned topic)


Words get assigned probabilistically to topics

1975[7] bludgeoning[2] convicted[4] death[10] Friday[6] jury[2] Martha[9] Michael[3] more[3] Moxley[9] murder[6] Skakel[5] using[2] ¶ against[4] Brennan[3] chief[1] concluded[3] federal[8] former[2] governments[4] penny-stock[6] promoter[3] Robert[9] testimony[4] witness[1] …

turns into

1975[3] bludgeoning[3] convicted[3] death[3] Friday[5] jury[3] Martha[1] Michael[9] more[5] Moxley[3] murder[3] Skakel[3] using[6] ¶ against[6] Brennan[8] chief[8] concluded[5] federal[8] former[8] governments[9] penny-stock[8] promoter[8] Robert[1] testimony[8] witness[3] …


#3: murder police victim killing shot killed death crime violence shooting found suspect killer kill woman told case night scene investigator

#8: money million business dollar account cash paid pay scandal financial trust fraud corruption bank scheme laundering illegal deal profit transfer

1975[3] bludgeoning[3] convicted[3] death[3] Friday[5] jury[3] Martha[1] Michael[9] more[5] Moxley[3] murder[3] Skakel[3] using[6] ¶ against[6] Brennan[8] chief[8] concluded[5] federal[8] former[8] governments[9] penny-stock[8] promoter[8] Robert[1] testimony[8] witness[3] …

Topics are expressed as lists of words likely to co-occur in individual documents


Humans Interpret Topics

After the model produces the topics, a human domain expert interprets the list of most likely words and chooses a topic name.

topic #3 = “Crime”
topic #3: murder police victim killing shot killed death crime violence shooting found suspect killer kill woman …

topic #8 = “Financial Corruption”
topic #8: money million business dollar account cash paid pay scandal financial trust fraud corruption bank scheme laundering illegal …


Topic Modeling can be applied to various types of text …

Collection (# docs): Description
• New York Times (300,000): News articles from the New York Times and other newspapers, 2000-2002
• Austen (1,400): The six Jane Austen novels, broken up into 100-line sections
• Blogs (4,000): Blog entries harvested from Daily Kos
• Bible (1,200): Chapters in the Bible (KJV)
• Police Reports (250,000): Police accident reports from North Carolina
• CiteSeer (750,000): Abstracts from research publications in computer science and engineering
• Search Queries (1,000,000): Queries issued to a web search engine
• Enron (250,000): Enron emails seized by the US Government for the federal case against the company
• Library of Congress (240,000): Metadata records from the Library of Congress American Memory Collection


… sample topics

Collection: Sample Topic
• New York Times: [WMD] IRAQ iraqi weapon war SADDAM_HUSSEIN SADDAM resolution UNITED_STATES military inspector U_N UNITED_NATION BAGHDAD inspection action SECURITY_COUNCIL
• Austen: [SENTIMENT] felt comfort feeling feel spirit mind heart ill evil fear impossible hope poor distress end loss relief suffering concern dreadful misery unhappy
• Blogs: [ELECTIONS] november poll house electoral governor polls account ground republicans trouble
• Bible: [COMMANDS] thou thy thee shalt thine lord god hast unto not shall
• Police Reports: [RAN OFF ROAD] v1 off road ran came rest ditch traveling struck side shoulder tree overturned control lost
• CiteSeer: [GRAPHS] graph edge vertices edges vertex number directed connected degree coloring subgraph set drawing
• Search Queries: [CREDIT] credit card loans bill loan report bad visa debt score
• Enron: [ENERGY CRISIS] state california power electricity utilities davis energy prices generators edison public deregulation billion governor federal consumers commission plants companies electric wholesale crisis summer
• Library of Congress: [ARMY] military camp army war officer personnel british regiment tent crimean crimea soldier guard general infantry


Case Study: Analysis of DNA Microarray Literature

• Government-sponsored study of >50,000 papers from PubMed related to DNA microarrays
– Papers date from the early 1990s to 2012
– 48,872 documents
– 59,408 unique words
– 6.1 million word tokens
– 191,569 unique authors
– 4,455 different journals

• Ran a topic model with T=100 topics to gain insight into what the different sub-areas are within the field of DNA microarrays and how these have changed over time


Examples of Topics Learned

Displayed below are the top 5 highest-probability words for 5 selected topics:

Microarray Chip Technology: detection, surface, fluorescence, hybridization, array
Classification Methods: classification, selection, cancer, algorithm, feature
Databases and Annotation: databases, tool, annotation, data set, web
Regulatory Networks: network, regulatory, pathway, interaction, transcriptional
Patients and Survival: patient, tumor, cancer, survival, prognostic

From basic technology to applications


How Topics Change over Time

[Two plots of topic proportions by year (fraction of words per topic, 1990-2010, with the corpus base rate shown for reference). Left, “Basic Technology” topics: microarray technology, hybridization, probe technology, sequencing. Right, “Applications” topics: patients and treatment, classification, pathways and networks, stem cells.]

Selected topics with largest changes in words assigned to the topic: negative (left plot) and positive (right plot)


New York Times 2000-02: Sample Topics

Basketball: team .028, play .015, game .013, season .012, final .011, games .011, point .011, series .011, player .010, coach .009, playoff .009, championship .007, playing .006, win .006; LAKERS .062, SHAQUILLE-O-NEAL .028, KOBE-BRYANT .028, PHIL-JACKSON .019, NBA .013, SACRAMENTO .007, RICK-FOX .007, PORTLAND .006, ROBERT-HORRY .006, DEREK-FISHER .006

Tour de France: tour .039, rider .029, riding .017, bike .016, team .016, stage .014, race .013, won .012, bicycle .010, road .009, hour .009, scooter .008, mountain .008, place .008; LANCE-ARMSTRONG .021, FRANCE .011, JAN-ULLRICH .003, LANCE .003, U-S-POSTAL-SERVICE .002, MARCO-PANTANI .002, PARIS .002, ALPS .002, PYRENEES .001, SPAIN .001

Holidays: holiday .071, gift .050, toy .023, season .019, doll .014, tree .011, present .008, giving .008, special .007, shopping .007, family .007, celebration .007, card .007, tradition .006; CHRISTMAS .058, THANKSGIVING .018, SANTA-CLAUS .009, BARBIE .004, HANUKKAH .003, MATTEL .003, GRINCH .003, HALLMARK .002, EASTER .002, HASBRO .002


New York Times 2000-02: Sample Topics

September 11 Attacks: attack .033, tower .025, firefighter .020, building .018, worker .013, terrorist .012, victim .012, rescue .012, floor .011, site .009, disaster .008, twin .008, ground .008, center .008, fire .007, plane .007; WORLD-TRADE-CTR .035, NEW-YORK-CITY .020, LOWER-MANHATTAN .005, PENTAGON .005, PORT-AUTHORITY .003, RED-CROSS .002, NEW-JERSEY .002, RUDOLPH-GIULIANI .002, PENNSYLVANIA .002, CANTOR-FITZGERALD .001

FBI Investigation: agent .029, investigator .028, official .027, authorities .021, enforcement .018, investigation .017, suspect .015, found .014, police .014, arrested .012, search .012, law .011, arrest .011, case .010, evidence .009, suspected .008; FBI .034, MOHAMED-ATTA .003, FEDERAL-BUREAU .001, HANI-HANJOUR .001, ASSOCIATED-PRESS .001, SAN-DIEGO .001, U-S .001, FLORIDA .001

DC Sniper: sniper .024, shooting .019, area .010, shot .009, police .007, killer .006, scene .006, white .006, victim .006, attack .005, case .005, left .005, public .005, suspect .005, killed .005, car .005; WASHINGTON .053, VIRGINIA .019, MARYLAND .013, D-C .012, JOHN-MUHAMMAD .008, BALTIMORE .006, RICHMOND .006, MONTGOMERY-CO .005, MALVO .005, ALEXANDRIA .003


Topic Trends (New York Times data set)

[Eight time series, Jan00-Jan03, of the number of words (in thousands) assigned to each topic per month: Basketball, Sept-11-Attacks, Tour-de-France, Anthrax, Oscars, DC-Sniper, Quarterly-Earnings, Enron]


From Dr. David Newman, Computer Science Department, UC Irvine

https://app.nihmaps.org/nih/browser/


Background Resources: Topic Modeling

• D. Blei, Introduction to probabilistic topic models, Communications of the ACM, preprint, 2011
– http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

• D. Blei and J. Lafferty, Topic models, in A. Srivastava and M. Sahami, editors, Text Mining: Classification, Clustering, and Applications, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2009
– http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

• Steyvers, M. & Griffiths, T. (2007) Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum

• General resources (papers, code, newsgroup, etc.)
– http://www.cs.princeton.edu/~blei/topicmodeling.html


Additional Backup Slides on Application of Topic Modeling to Domain-Specific Browsing


Application: Domain-Specific Browsers

• Collect a large set of documents in a particular domain
– Build a domain-specific topic model for the documents
– Build a browser based on the topic model

• Example: Schizophrenia Research
– 40,000 abstracts from PubMed mention schizophrenia
– We built a topic model with 200 topics
– Also automatically extracted:
• Gene names/symbols
• Brain regions
– Developed a browser/search tool that uses the topics to allow a user to “navigate” through the world of schizophrenia research
– Combines a statistical model with real-time search on documents
– Master’s thesis work by Vasanth Kumar, ICS MS 2006


[Screenshots of the schizophrenia research browser]

Topics are discovered automatically, and span the “space” of schizophrenia research. Only topic naming is done manually.



Information Extraction


Information Extraction

• Broad goal:
– Convert unstructured text into structured form (e.g., a database)
– Many applications, e.g.,
• Mining of free-form narratives in clinical diagnoses
• Extraction of scientific information from biomedical papers

• Main categories:
– Named entity recognition (NER)
• e.g., recognize disease names, genes, proteins, drugs in text
• Can be challenging: e.g., genes are referred to in multiple ways
– Relation extraction
• Given named entities, detect how they are related, e.g.,
– Gene A regulates Gene B
– Drug X can cause adverse symptoms Y and Z
• Harder than named entity recognition
– Much variety and subtlety in how we describe relational information


Example of Information Extraction

From Sarawagi, Foundations and Trends in Databases, 2008


Information Extraction Techniques

• Dictionaries
– e.g., lists of names of genes, proteins, diseases, drugs

• Natural Language Processing
– High-quality parsers can parse sentences into parts of speech
– e.g., noun phrases
– Subject-object-verb triples are useful for relation extraction

• Rule-based Algorithms
– e.g., if a word is capitalized and a noun, then it’s an entity

• Machine learning approaches
– e.g., use human-labeled gene-name mentions to train a classifier
– Features can be parts of speech, nearby words, capitalization, etc.
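As a toy illustration of the first and third approaches combined (both the gene dictionary and the symbol-shape rule here are made up for the example, not real resources):

```python
import re

GENE_DICT = {"BRCA1", "TP53", "EGFR"}   # hypothetical mini-dictionary

def tag_entities(sentence):
    """Tag tokens by dictionary lookup, then by a crude symbol-shape rule."""
    entities = []
    for token in re.findall(r"[A-Za-z0-9-]+", sentence):
        if token.upper() in GENE_DICT:                    # dictionary match
            entities.append((token, "GENE"))
        elif re.fullmatch(r"[A-Z][A-Z0-9]{2,}", token):   # rule: gene-like symbol
            entities.append((token, "GENE?"))
    return entities

print(tag_entities("We show that BRCA1 interacts with ABCD1 in tumors"))
# -> [('BRCA1', 'GENE'), ('ABCD1', 'GENE?')]
```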


Example of Part of Speech (POS) Tagging(from Chris Manning, Stanford)

When/RB Mr./NNP Holly/NNP last/RB wrote/VBD ,/, many/JJ years/NNS …
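The same kind of tagging can be reproduced with NLTK (assumes the punkt and tagger resources have been downloaded via nltk.download; exact tags may differ from the slide's tagger):

```python
import nltk

tokens = nltk.word_tokenize("When Mr. Holly last wrote, many years had passed.")
print(nltk.pos_tag(tokens))   # list of (word, Penn Treebank tag) pairs
```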


Example of Phrase Parsing(from Chris Manning, Stanford)

Bills on ports and immigration were submitted by Senator Brownback

[Parse tree: the sentence S splits into a subject NP (“Bills”, modified by the PP “on ports and immigration”) and a VP (“were submitted”, with the PP “by Senator Brownback”)]


From Krallinger, Valencia, Hirschman, Genome Biology, 2008


From Savova et al, J. Am. Med. Inform. Assoc, 2010


From Erhardt, Schneider, Blaschke, Drug Discovery Today, 2006


From Jensen, Saric, Bork, Nature Reviews Genetics, 2006


From Feldman, Regev, Hurvitz, and Finkelstein-Landau, Biosilico, 2003


From Feldman, Regev, Hurvitz, and Finkelstein-Landau, Biosilico, 2003


“Buzzword Hunting” (from Jensen, Saric, Bork, Nature Reviews Genetics, 2006)


Commercial Software, e.g., IBM Medical Record Text Analytics


Challenges in Information Extraction

• Fundamentally hard
– Ambiguity in natural language is difficult for computers to handle
– Current systems are reasonably accurate… but there is room for improvement

• Validation
– Measuring recall, for example, can be difficult

• Calibration
– How do we calibrate a method so that we know how reliable it is?

• Integration
– How should text-based indicators be combined with actual experimental data?


Resources in Information Extraction

• Review paper on the basic technology:
– Information Extraction, by Sunita Sarawagi, in Foundations and Trends in Databases, 2008

• Reviews with a biomedical focus:
– Erhardt, Schneider, and Blaschke, Status of text-mining techniques applied to biomedical text. Drug Discovery Today, 11(7/8), pp. 315-325, 2006
– Jensen, Saric, Bork, Literature mining for the biologist: from information to biological discovery, Nature Reviews Genetics, 7, Feb 2006

• Additional papers on the class Web site


Software for Natural Language Processing (lists from Chris Manning, Stanford)

• NLP Packages
– NLTK (Python)
• http://www.nltk.org/
– OpenNLP
• http://incubator.apache.org/opennlp/
– Stanford NLP
• http://nlp.stanford.edu/software/
– LingPipe
• http://alias-i.com/lingpipe/

• NLP Frameworks
– GATE – General Architecture for Text Engineering (U. Sheffield)
• http://gate.ac.uk/
• Java, quite well maintained (now)
– UIMA – Unstructured Information Management Architecture. Originally IBM; now an Apache project
• http://uima.apache.org/
• Professional, scalable, etc.
– NLTK – Natural Language Toolkit (started by Steven Bird)
• http://www.nltk.org/
• Big community; large Python package; corpora and books about it