Information Extraction from Bioinformatics-Related Documents
- Extracting structured information from unstructured and/or semi-structured machine-readable documents.
- Processing human-language texts by means of NLP methods.
- Sources: text, images, audio, and video.
Introduction
- Computation on previously unstructured data.
- Example: from an online news sentence such as "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp.", an IE system can extract the entities and the acquisition relation.
- Logical reasoning to draw inferences.
- Text simplification.
Goals
- Named entity extraction
- Co-reference resolution
- Relationship extraction
- Language and vocabulary analysis
- Audio extraction
Subtasks
Approaches
- Hand-written regular expressions
- Classifiers
  ◦ Generative: naïve Bayes
  ◦ Discriminative: maximum entropy models
- Sequence models
  ◦ Hidden Markov models
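As a concrete illustration of the hand-written regular expression approach, the sketch below uses a hypothetical pattern (invented for this example) to pull the location, acquirer, and target out of the news sentence quoted earlier:

```python
import re

# Hand-written IE pattern (hypothetical, invented for this illustration):
# "<Location> based <Company> announced their acquisition of <Company>"
pattern = re.compile(
    r"(?P<location>[A-Z][a-z]+(?: [A-Z][a-z]+)*) based "
    r"(?P<acquirer>[A-Z]\w+(?: \w+\.?)*?) announced their acquisition of "
    r"(?P<target>[A-Z]\w+(?: \w+\.?)*)"
)

sentence = ("Yesterday, New York based Foo Inc. "
            "announced their acquisition of Bar Corp.")
m = pattern.search(sentence)
print(m.group("location"), "|", m.group("acquirer"), "|", m.group("target"))
# New York | Foo Inc. | Bar Corp.
```

Patterns like this are precise on the constructions they anticipate but brittle: any rephrasing ("Foo Inc., based in New York, acquired ...") needs a new rule, which is exactly the scalability problem that motivates the classifier and sequence-model approaches.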
A field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and natural languages.
Major focus:
- Human-computer interaction (HCI)
- Natural language understanding (NLU)
- Natural language generation (NLG)
Natural Language Processing (NLP) Introduction
Early systems relied on hand-written rules; modern NLP uses statistical inference algorithms to produce models that are
- robust to unfamiliar input (e.g. containing words or structures that have not been seen before), and
- robust to erroneous input (e.g. with misspelled words or words accidentally omitted).
The methods used are stochastic, probabilistic, and statistical; methods for disambiguation often involve the use of corpora and Markov models.
NLP Methods
- Automatic summarization
- Discourse analysis
- Machine translation
- Morphological segmentation
- Named entity recognition (NER)
- Natural language generation
- Natural language understanding
Major tasks in NLP
- Native language identification
- Stemming
- Text simplification
- Text-to-speech
- Text proofing
- Natural language search
- Query expansion
- Automated essay scoring
- Truecasing
Applications of NLP
Biomedical text mining (BioNLP) refers to text mining applied to texts and literature of the biomedical and molecular biology domain.
It is a rather recent research field on the edge of NLP, bioinformatics, medical informatics and computational linguistics.
NLP Techniques for Bioinformatics
There is an increasing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature due to the increasing number of electronically available publications stored in databases such as PubMed.
Motivation
General Framework of NLP

The standard pipeline: Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation.

Example: "John runs."
- Morphological and lexical processing: John run+s (John: proper noun (P-N); run+s: verb, 3rd-person present, or noun, plural).
- Syntactic analysis: [S [NP [P-N John]] [VP [V run]]].
- Semantic analysis: Pred: RUN, Agent: JOHN.
- Context processing / interpretation: in "John is a student. He runs.", the pronoun "He" is resolved to John.

Morphological and lexical processing includes (cf. the domain analysis framework of Appelt, 1999):
- Tokenization
- Part-of-speech tagging
- Term recognition
- Inflection / derivation
- Compounding
Difficulties of NLP

(1) Robustness: Incomplete Knowledge
- Incomplete lexicons: open-class words and terms (hence term recognition); named entities such as company names, locations, and numerical expressions.
- Incomplete grammar: limited syntactic coverage, domain-specific constructions, ungrammatical constructions.
- Incomplete domain knowledge: interpretation rules mapping analyses onto predefined aspects of information.

(2) Ambiguities: Combinatorial Explosion
- Most words in English are ambiguous in their part of speech, e.g. runs: verb (3rd-person present) or noun (plural); clubs: likewise, and with two meanings besides.
- Structural ambiguities.
- Predicate-argument ambiguities.
There are a number of methods for determining context, such as automatic topic detection and theme extraction: identifying "what" is being discussed. Nouns and noun phrases are used to define context, typically via named entity recognition and the extraction of nouns and verbs from textual documents.
WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with a browser. WordNet is freely and publicly available for download, and its structure makes it a useful tool for computational linguistics and NLP work.
WordNet for synonym finding
WordNet vs. a thesaurus (words and meanings)
- WordNet interlinks not just word forms (strings of letters) but specific senses of words, so words that are found in close proximity to one another in the network are semantically disambiguated.
- WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.
CATEGORIZATION / CLASSIFICATION
Given:
◦ A description of an instance, x ∈ X, where X is the instance language or instance space (e.g. how to represent text documents).
◦ A fixed set of categories C = {c1, c2, …, cn}.
Determine:
◦ The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
A GRAPHICAL VIEW OF TEXT CLASSIFICATION
[Figure: documents assigned to category labels such as NLP, Graphics, AI, Theory, Arch.]
TEXT CLASSIFICATION
An example spam message:
  "This concerns you as a patient.
  Our medical records indicate you have had a history of illness. We are now encouraging all our patients to use this highly effective and safe solution.
  Proven worldwide, feel free to read the many reports on our site from the BBC & ABC News.
  We highly recommend you try this Anti-Microbial Peptide as soon as possible since its world supply is limited. The results will show quickly.
  Regards, http://www.superbiograde.us/bkhog/"
Spam like this makes up roughly 85% of all email!
EXAMPLES OF TEXT CATEGORIZATION
- Labels = binary: "spam" / "not spam"
- Labels = topics: "finance" / "sports" / "asia"
- Labels = opinion: "like" / "hate" / "neutral"
- Labels = author: "Shakespeare" / "Marlowe" / "Ben Jonson" (e.g. the Federalist Papers)
Methods (1)
- Manual classification
  ◦ Used by Yahoo!, Looksmart, about.com, ODP, Medline
  ◦ Very accurate when the job is done by experts
  ◦ Consistent when the problem size and team are small
  ◦ Difficult and expensive to scale
- Automatic document classification
  ◦ Hand-coded rule-based systems
  ◦ Used by Reuters, CIA, Verity, …
  ◦ Commercial systems have complex query languages (everything in IR query languages, plus accumulators)
Methods (2)
- Supervised learning of a document-label assignment function
  ◦ Used by Autonomy, Kana, MSN, Verity, …
  ◦ Naive Bayes (simple, common method)
  ◦ k-Nearest Neighbors (simple, powerful)
  ◦ Support vector machines (newer, more powerful)
  ◦ … plus many other methods
  ◦ No free lunch: requires hand-classified training data
  ◦ But the training data can be built (and refined) by amateurs
Bayesian Methods
- Learning and classification methods based on probability theory (cf. spelling correction / POS tagging).
- Bayes' theorem plays a critical role.
- Build a generative model that approximates how the data is produced.
- Uses the prior probability of each category given no information about an item.
- Categorization produces a posterior probability distribution over the possible categories given a description of an item.
Bayes’ Rule
P(C, X) = P(C|X) P(X) = P(X|C) P(C)

P(C|X) = P(X|C) P(C) / P(X)
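A quick numeric sketch of Bayes' rule, using invented (hypothetical) probabilities for a spam filter:

```python
# Invented numbers: prior P(spam) = 0.85, and likelihoods of seeing the
# word "free": P("free" | spam) = 0.20, P("free" | ham) = 0.01.
p_spam = 0.85
p_word_given_spam = 0.20
p_word_given_ham = 0.01

# P(X) by the law of total probability over the two classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' rule: P(C|X) = P(X|C) P(C) / P(X)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 4))  # 0.9913
```

Note how the high prior and the much higher likelihood under spam combine into a posterior near 1, even though neither factor alone is conclusive.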
Maximum a posteriori Hypothesis
h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)
Maximum likelihood Hypothesis
If all hypotheses are a priori equally likely, we only
need to consider the P(D|h) term:
h_ML = argmax_{h ∈ H} P(D|h)
Naive Bayes Classifiers
Task: Classify a new instance based on a tuple of attribute values
(x1, x2, …, xn):

c_MAP = argmax_{c_j ∈ C} P(c_j | x1, x2, …, xn)
      = argmax_{c_j ∈ C} P(x1, x2, …, xn | c_j) P(c_j) / P(x1, x2, …, xn)
      = argmax_{c_j ∈ C} P(x1, x2, …, xn | c_j) P(c_j)
Naïve Bayes Classifier: Assumptions
- P(cj)
  ◦ Can be estimated from the frequency of classes in the training examples.
- P(x1, x2, …, xn | cj)
  ◦ Would need a very, very large number of training examples to estimate directly.
- Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.
[Figure: naïve Bayes model. Class node Flu with conditionally independent feature nodes X1–X5: fever, sinus, cough, runny nose, muscle ache.]
The Naïve Bayes Classifier
Conditional Independence Assumption: features are independent of each other given the class:

P(X1, …, X5 | C) = P(X1|C) · P(X2|C) · … · P(X5|C)
Learning the Model
Common practice is maximum likelihood, i.e. simply use the frequencies in the data:

P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

P̂(cj) = N(C = cj) / N

[Figure: class node C with feature nodes X1–X6.]
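Frequency estimates of this kind can be assembled into a minimal multinomial naïve Bayes classifier. The sketch below uses an invented toy corpus and add-one (Laplace) smoothing so that unseen words do not zero out a class score:

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration
train = [
    ("buy cheap pills now", "spam"),
    ("limited offer buy now", "spam"),
    ("meeting agenda attached", "ham"),
    ("see you at the meeting", "ham"),
]

class_docs = Counter()                 # N(C = c)
word_counts = defaultdict(Counter)     # N(X = w, C = c)
vocab = set()
for text, c in train:
    class_docs[c] += 1
    for w in text.split():
        word_counts[c][w] += 1
        vocab.add(w)

def classify(text):
    total_docs = sum(class_docs.values())
    scores = {}
    for c in class_docs:
        # log prior: log P(c) = log(N_c / N)
        score = math.log(class_docs[c] / total_docs)
        total_words = sum(word_counts[c].values())
        for w in text.split():
            # log P(w|c) with add-one smoothing over the vocabulary
            score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("cheap pills offer"))       # spam
print(classify("agenda for the meeting"))  # ham
```

Working in log space avoids floating-point underflow from multiplying many small probabilities, which matters as soon as documents have more than a handful of words.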
Feature selection via Mutual Information
- We might not want to use all words, but just reliable, good discriminators.
- From the training set, choose the k words which best discriminate the categories.
- One way is in terms of mutual information, for each word w and each category c:

I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} p(e_w, e_c) log [ p(e_w, e_c) / (p(e_w) p(e_c)) ]
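The mutual information I(w, c) can be computed directly from a 2×2 table of document counts. A sketch with invented counts, where n[(e_w, e_c)] is the number of documents with word presence e_w and class membership e_c:

```python
import math

# Invented counts: e_w = 1 if the word occurs in the document,
# e_c = 1 if the document belongs to the category.
n = {(1, 1): 30, (1, 0): 10, (0, 1): 20, (0, 0): 40}

def mutual_information(n):
    N = sum(n.values())
    mi = 0.0
    for ew in (0, 1):
        for ec in (0, 1):
            p_joint = n[(ew, ec)] / N             # p(e_w, e_c)
            p_w = (n[(ew, 0)] + n[(ew, 1)]) / N   # marginal p(e_w)
            p_c = (n[(0, ec)] + n[(1, ec)]) / N   # marginal p(e_c)
            if p_joint > 0:
                mi += p_joint * math.log2(p_joint / (p_w * p_c))
    return mi

print(round(mutual_information(n), 4))  # 0.1245
```

A word that is independent of the category gives I(w, c) = 0; for feature selection one keeps the k words with the highest scores per category.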
OTHER APPROACHES TO FEATURE SELECTION
- t-test
- Chi-square
- tf-idf (cf. the IR lectures)
- Yang & Pedersen (1997): eliminating features can lead to improved performance
Tf·idf term-document matrix

tfidf(t, d) = tf(t, d) · idf(t)

tf(t, d) = N_{t,d} / Σ_{t'} N_{t',d}, where N_{t,d} is the number of occurrences of a term t in a document d, and the denominator is the sum of occurrences of all terms in that document d.

idf(t) = log(N / W(t)), where N is the total number of documents and W(t) is the number of documents containing the term t.
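These definitions translate directly into code. A sketch over an invented toy corpus, using raw counts normalized by document length for tf and log(N / W(t)) for idf:

```python
import math

# Toy corpus, invented for illustration
docs = [
    "gene expression analysis",
    "gene sequence alignment",
    "protein structure prediction",
]

def tf(t, d):
    # N_{t,d} divided by the total number of term occurrences in d
    words = d.split()
    return words.count(t) / len(words)

def idf(t, docs):
    # log(N / W(t)), with W(t) = number of documents containing t
    w_t = sum(1 for d in docs if t in d.split())
    return math.log(len(docs) / w_t)

def tfidf(t, d, docs):
    return tf(t, d) * idf(t, docs)

# "expression" (1 document) outweighs "gene" (2 documents) in docs[0]
print(round(tfidf("expression", docs[0], docs), 4))  # 0.3662
print(round(tfidf("gene", docs[0], docs), 4))        # 0.1352
```

The idf factor is what damps terms like "gene" that occur across much of a domain corpus, so that document-specific terms dominate the matrix.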
Chi-Square Statistics

The chi-square statistic is used to evaluate the independence between two events. The relevance of a term t in a class c can be estimated from the 2×2 contingency counts:
- F11: number of documents belonging to c and containing t
- F10: number of documents not in c but containing t
- F01: number of documents belonging to c but not containing t
- F00: number of documents not in c and not containing t

χ²(t, c) = N (F11·F00 − F10·F01)² / [(F11 + F10)(F01 + F00)(F11 + F01)(F10 + F00)], where N = F11 + F10 + F01 + F00.

The classification task is to decide which class to choose; chi-square measures the importance of term t for a class c.
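The F counts above plug into the standard chi-square formula for a 2×2 contingency table. A sketch with invented counts:

```python
# Standard 2x2 chi-square for term/class association.
# F11: docs in c containing t; F10: docs not in c containing t;
# F01: docs in c without t;    F00: docs not in c without t.
def chi_square(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    num = n * (f11 * f00 - f10 * f01) ** 2
    den = (f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00)
    return num / den

# Invented counts where term and class are positively associated
print(round(chi_square(40, 10, 20, 30), 4))  # 16.6667
```

When term occurrence and class membership are independent, F11·F00 equals F10·F01 and the statistic is zero; large values flag terms worth keeping as features for that class.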
MAP Estimates

The term probabilities P̂(t|c) are estimated from N_{t|c} and N_t, the numbers of occurrences of term t in the class c and in the entire corpus, respectively; Nc is the number of distinct classes.

The class priors P̂(c) are estimated from N_{d|c}, the number of documents in the class c, and N_d, the total number of documents. Note that α1 and α2 are smoothing parameters that are typically determined empirically.
OTHER CLASSIFICATION METHODS
- k-nearest neighbors (k-NN)
- Decision trees
- Logistic regression
- Support vector machines