Text Entailment Darsh Shah Pratyaksh Sharma

Recognizing Text Entailment - Tutorial


Textual entailment (TE) in natural language processing is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively.


  • Text Entailment: Darsh Shah, Pratyaksh Sharma

  • Introduction to Textual Entailment

    Textual Entailment can be defined as the phenomenon of inferring one text from another.

    A text t entails a hypothesis h if h is true in every possible world in which t is true.

  • Definition Continued

    This definition is very strict: it requires the truthfulness of h in all instances where t is true.

    Example
    T: Sachin received an award for batsmanship, from the ICC.
    H: The God of Cricket received an award.

  • Definition Continued

    T entails H only when "Sachin" refers to Sachin Tendulkar. This is the more likely situation, but not always true.

    So a modified definition is required.
    Applied Definition: A text t entails a hypothesis h if a human reading t will infer that h is most likely true.

  • Mathematical Definition

    Hypothesis h is entailed by text t if P(h is true|t) > P(h is true)

    Where P(h is true | t) is the Entailment Confidence and can be considered as a measure of surety of entailment

  • Entailment Triggers

    Semantic phenomena significant to Textual Entailment:
    T: Sachin achieved the milestone of 100 centuries in his career.
    H: Sachin attained the milestone of 100 centuries in his career.
    The two words are synonyms.

  • Generalizations or specializations of concepts in Text or Hypothesis can affect entailment

    Example
    T: Sachin Tendulkar is a cricketer.
    H: Sachin Tendulkar is a sportsman.
    Here sportsman is a generalization of cricketer.

  • Other triggers include verb entailment and entailment through a change of quantifiers.

    Polarity, factivity, implicative verbs, and iteratives can also trigger entailment.

  • Applications of Textual Entailment

    Textual entailment benefits many natural language processing applications, such as Question Answering (QA), Information Extraction (IE), (multi-document) summarization, and Machine Translation (MT) evaluation.

  • Information Retrieval

    Textual entailment impacts IR in at least two ways

    Notion of relevance bears strong similarity with that of entailment

    Textual entailment can be used to find affinities between words, which can be used to compute an extended similarity between documents and queries.

  • Question Answering

    A given text T is retrieved for question Q.

    All entities in text T are substituted as potential answers to obtain candidate hypotheses H1, H2, ..., Hn.

    We then pick the Hi best entailed by the text T as the answer to question Q.
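The pipeline above can be sketched in a few lines. Everything here is illustrative: the `overlap_score` function is a toy stand-in for a real entailment system, and the entity list is assumed to come from a named-entity recognizer.

```python
# Sketch of entailment-based answer selection.
# overlap_score is a toy stand-in for a real RTE scorer; the entity list
# is assumed to be produced by an upstream NER step.

def overlap_score(text, hypothesis):
    """Toy entailment proxy: fraction of hypothesis words found in the text."""
    t_words = set(text.lower().split())
    h_words = hypothesis.lower().split()
    return sum(w in t_words for w in h_words) / len(h_words)

def answer_question(text, hypothesis_template, entities):
    """Substitute each candidate entity into the template and pick the
    hypothesis best entailed by the text."""
    hypotheses = [(e, hypothesis_template.format(e)) for e in entities]
    best_entity, _ = max(hypotheses, key=lambda p: overlap_score(text, p[1]))
    return best_entity

text = "Sachin received an award for batsmanship from the ICC."
# Q: "Who received an award?" -> template "{} received an award"
answer = answer_question(text, "{} received an award", ["Sachin", "ICC"])
```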

  • Machine Translation Evaluation

    Machine Translation evaluation involves comparing the machine translated sentence with the reference output

    Textual entailment helps in this case as it gives a measure of similarity of the information conveyed by the reference and machine output

  • Miscellaneous

    Equivalence between two texts can be tested by applying textual entailment in both directions. This is useful for novelty detection, copying detection, etc.

    Text simplification: substituting complex phrases with simpler ones, producing sentences that are grammatically correct and convey the meaning in a simpler way.

  • Some basic approaches implemented

    Plain word matching:
    1. Calculate matching words between the text and the hypothesis.
    2. Score = (# matching_words) / (# words in hypothesis)

  • Results

    We need to calculate an entailment threshold, above which we'll declare entailment. We find the threshold giving the best accuracy on the training set. With a threshold of 0.55, we get accuracy 0.6138 on the RTE2 development set.
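The scoring rule and the threshold search can be sketched as follows. This is a minimal illustration: the two-pair development set below is invented, and the real RTE2 data is assumed to be loaded elsewhere.

```python
# Minimal sketch of the plain word-matching baseline plus threshold selection.
# The tiny dev set is invented for illustration; RTE2 data is assumed elsewhere.

def word_match_score(text, hypothesis):
    """Score = (# hypothesis words appearing in the text) / (# hypothesis words)."""
    t_words = set(text.lower().split())
    h_words = hypothesis.lower().split()
    return sum(w in t_words for w in h_words) / len(h_words)

def best_threshold(dev_pairs):
    """Pick the threshold maximizing accuracy over (text, hypothesis, label) pairs."""
    scores = [(word_match_score(t, h), label) for t, h, label in dev_pairs]
    candidates = sorted({s for s, _ in scores})

    def accuracy(th):
        return sum((s >= th) == label for s, label in scores) / len(scores)

    return max(candidates, key=accuracy)

dev = [
    ("The Rolling Stones began a tour in Boston.", "The Stones began a tour.", True),
    ("Conway was fired before Oracle bought PeopleSoft.", "Conway works for Oracle.", False),
]
th = best_threshold(dev)  # threshold tuned on the (toy) development pairs
```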

  • Test Cases
    T: The Rolling Stones kicked off their latest tour on Sunday with a concert at Boston's Fenway Park.

    H: The Rolling Stones have begun their latest tour with a concert in Boston.

    Yes. Correctly identified.

  • Test Cases
    T: Craig Conway, fired as PeopleSoft's chief executive officer before the company was bought by Oracle, was in England last week.

    H: Craig Conway works for Oracle.

    No. The system fails to identify this and incorrectly says yes.

  • Conclusion: too inaccurate a method.

    It can't differentiate between a sentence and even its simplest negation, and might still declare entailment.

  • Some basic approaches

    Plain lemma matching:
    1. Lemmatize the text and the hypothesis.
    2. Calculate matching lemmas between the two.
    3. Score = (# matching_lemmas) / (# lemmas in hypothesis)

  • Results

    We need to calculate an entailment threshold, above which we'll declare entailment.

    With a threshold of 0.63, we get accuracy 0.625 on the RTE2 development set.

  • Test Case
    T: Sunday's earthquake was felt in the southern Indian city of Madras on the mainland, as well as other parts of south India. The Naval meteorological office in Port Blair said it was the second biggest aftershock after the Dec. 26 earthquake.
    H: The city of Madras is located in Southern India.
    Yes. Entails correctly.

  • Test Case
    T: ECB spokeswoman, Regina Schueller, declined to comment on a report in Italy's La Repubblica newspaper that the ECB council will discuss Mr. Fazio's role in the takeover fight at its Sept. 15 meeting.
    H: Regina Schueller works for Italy's La Repubblica newspaper.
    No. Entails incorrectly.

  • Observations: again not dependable for even moderately complicated sentences.

  • Some basic approaches

    Lemma + POS matching:
    1. Lemmatize the text and the hypothesis.
    2. Label with POS tags.
    3. Calculate the number of matching (lemma, POS_tag) pairs between the two.
    4. Score = (# matches) / (# lemmas in hypothesis)

  • Results

    We need to calculate an entailment threshold, above which we'll declare entailment.

    With a threshold of 0.63, we get accuracy 0.6225 on the RTE2 development set.

  • Test Case
    T: It is also an acronym that stands for Islamic Resistance Movement, a militant Islamist Palestinian organization that opposes the existence of the state of Israel and favors the creation of an Islamic state in Palestine.
    H: The Islamic Resistance Movement is also known as the Militant Islamic Palestinian Organization.
    No. Fails to entail correctly.

  • Some basic approaches

    Using the BLEU algorithm: the algorithm looks for n-gram coincidences between a candidate text and a reference. It can be used as a basic lexical-level benchmark for other textual entailment methods.

  • BLEU algorithm

    For several values of N (typically 1 to 4), calculate the percentage of n-grams from the hypothesis which appear in the text.

    Combine the scores obtained for each value of N as a weighted linear average.

  • BLEU algorithm

    Apply a brevity factor to penalise short texts (which may have n-grams in common with the references, but may be incomplete).

    The higher the BLEU score, the more likely the entailment. Learn a threshold for the BLEU score from the development set.
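The two slides above can be condensed into a short sketch. The equal N-gram weights and the exponential brevity penalty (with the hypothesis playing BLEU's "candidate" role against the text as "reference") are assumptions, not details from these slides.

```python
# Rough sketch of a BLEU-style entailment score: average n-gram precision of
# the hypothesis against the text, with a brevity penalty. Equal weights per
# N and the exp-based penalty are assumptions.
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_score(text, hypothesis, max_n=4):
    t = text.lower().split()
    h = hypothesis.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        h_ngrams = ngrams(h, n)
        if not h_ngrams:          # hypothesis shorter than n words
            break
        t_ngrams = set(ngrams(t, n))
        precisions.append(sum(g in t_ngrams for g in h_ngrams) / len(h_ngrams))
    score = sum(precisions) / len(precisions)           # equal-weight linear average
    brevity = min(1.0, math.exp(1 - len(t) / len(h)))   # penalise overly short texts
    return brevity * score

score = bleu_score("The Stones began their tour in Boston", "The Stones began their tour")
```

Entailment is then declared when `bleu_score` exceeds the threshold learned on the development set.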

  • Results from BLEU

    Learned threshold = 0.0585, which means we declare entailment with only 5.85% of n-grams matching!

    Still, accuracy on the RTE2 development set with this parameter is 0.6050.

  • Test Cases
    T: Patricia Amy Messier and Eugene W. Weaver were married May 28 at St. Clare Roman Catholic Church in North Palm Beach.
    H: Eugene W. Weaver is the husband of Patricia Amy.

    Yes. Entails correctly, possibly because of the very low threshold used. Other systems fail to predict this.

  • Conclusion

    It fails to capture deep semantic relations in sentence pairs like the previous ones.

    It can be used as a baseline technique; it is quick to evaluate.

  • A Discourse Commitment-Based Framework for Recognizing Textual Entailment

    A new framework for recognizing textual entailment that depends on the set of publicly held beliefs, known as discourse commitments, that can be ascribed to the author of a text or a hypothesis.

  • Inspiration for the approach

    Shallow approaches had been moderately successful in the previous two RTE challenges.

    These approaches would fail as sentences became longer and more syntactically complex.

  • Formal Definition of the Problem

    Given a commitment set {ct} consisting of the discourse commitments inferable from a text t, and a hypothesis h, define the task of RTE as a search for the commitment c ∈ {ct} which maximizes the likelihood that t textually entails h.

  • System Architecture

  • Extracting Discourse Commitments

    After preprocessing, heuristics are used to extract discourse commitments:

    Sentence Segmentation, Syntactic Decomposition, Supplementary Expressions, Relational Extraction, Coreference Resolution

  • Commitment Selection

    Following commitment extraction, a word alignment technique first introduced in (Taskar et al., 2005b) is used to select the commitment extracted from t (henceforth, ct) which represents the best alignment for each of the commitments extracted from h (henceforth, ch).

  • The alignment of two discourse commitments can be cast as a maximum weighted matching problem in which each pair of words (ti, hj) in a commitment pair (ct, ch) is assigned a score sij(t, h) corresponding to the likelihood that ti is aligned to hj.

  • The model is trained to compute a set of parameters w which maximizes the number of correct alignment predictions (y) in a given training set (x).

  • Features used in the model

    string features (including Levenshtein edit distance, string equality, and stemmed string equality)

    lexico-semantic features (including WordNet similarity and named entity similarity)

    word association features

  • Following alignment, the method uses the sum of the edge scores.

    It searches for the ct that represents the reciprocal best hit:

    that is, it selects a commitment pair (ct, ch) where ct was the top-scoring alignment candidate for ch and ch was the top-scoring alignment candidate for ct.
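The reciprocal-best-hit selection can be sketched as below. The trained alignment model of (Taskar et al., 2005b) is replaced by a toy word-overlap score, so only the selection logic is faithful to the description.

```python
# Sketch of reciprocal-best-hit commitment selection. align_score is a toy
# stand-in for the learned word-alignment model described in the slides.

def align_score(c1, c2):
    """Toy alignment score: normalized word overlap of two commitments."""
    w1, w2 = set(c1.lower().split()), set(c2.lower().split())
    return len(w1 & w2) / max(len(w1), len(w2))

def reciprocal_best_hits(text_commits, hyp_commits):
    """Return (ct, ch) pairs where each is the other's top-scoring candidate."""
    pairs = []
    for ch in hyp_commits:
        ct = max(text_commits, key=lambda c: align_score(c, ch))
        best_ch_for_ct = max(hyp_commits, key=lambda c: align_score(ct, c))
        if best_ch_for_ct == ch:           # the hit must be reciprocal
            pairs.append((ct, ch))
    return pairs
```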

  • Entailment and Results

    Textual entailment selection is done based on the decision tree shown in the system architecture. The following shows the results on the RTE-3 test dataset.

  • IKOMA

    One of the best performing submissions in RTE-7 (Text Analysis Conference 2011)

    Title: A Method for Recognizing Textual Entailment using Lexical-level and Sentence Structure-level features

    Had the highest F-measure (48.00) on the dataset. Next best was 45.13

  • Approach

    First, calculate an entailment score based on lexical-level matching. Combine it with machine-learning-based filtering using various features obtained from lexical-level, chunk-level and predicate-argument-structure-level information.

  • Approach

    Role of filtering: to discard T-H pairs that have a high entailment score but are not actually entailed, using features above the lexical level.

    SENNA is used for analyzing POS of words, word chunks, NER and predicate-argument structures.

  • Knowledge resources used

    Acronyms extracted from the corpus: created for organizational names with more than three words.

    WordNet.

    CatVar: contains categorical variations of English lexemes.

  • Lexical Entailment Score

    R: set of knowledge resources. Tt and Ht: the sets of words in T and H respectively.

    freq(t) is the frequency of t in the corpus.

  • Lexical Entailment Score

    match(t, Tt, R) takes the value 1 if word t corresponds to a word in Tt (also considering synonyms and derived words from R); otherwise match() takes the value 0.

  • Lexical Entailment Score

    The Lexical Entailment Score is calculated for all H-T pairs in the development set, and a threshold is chosen which gives the highest micro-average F-measure. Experiments are also done to find the optimum value of the weighting parameter in equation (1); testing finds 1.8 to be optimal.
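Equation (1) itself is not reproduced in these slides, so the sketch below is only a plausible reading of the ingredients described (match(), freq(), resources R): hypothesis words matched against the text, weighted by inverse corpus frequency so rare words count more. The exact weighting and the toy corpus counts are assumptions.

```python
# Hedged sketch of a lexical entailment score in the spirit of equation (1),
# which the slides do not reproduce. The inverse-frequency weighting and the
# invented CORPUS_FREQ counts are assumptions, not details from the paper.
import math

CORPUS_FREQ = {"the": 1000, "a": 800, "sachin": 3, "award": 10, "received": 40}

def weight(word):
    """Rarer words get larger weights (assumed IDF-style scheme)."""
    return 1.0 / math.log(2 + CORPUS_FREQ.get(word, 1))

def match(word, text_words, resources):
    """1 if the word (or a synonym/derivation from R) appears in the text, else 0."""
    syns = resources.get(word, set()) | {word}
    return 1 if syns & text_words else 0

def lexical_entailment_score(text, hypothesis, resources):
    t_words = set(text.lower().split())
    h_words = hypothesis.lower().split()
    num = sum(weight(w) * match(w, t_words, resources) for w in h_words)
    den = sum(weight(w) for w in h_words)
    return num / den
```

With `resources = {"prize": {"award"}}`, "prize" in H matches "award" in T through the synonym resource, as the match() definition on the previous slide describes.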

  • Filtering stage

    We train a model that classifies T-H pairs having high LES into false-positive or true-positive. If the model predicts a T-H pair as false-positive, then we discard that pair from entailment T-H pair candidates.

  • Features for classifier

    The LIBSVM package is used, with features like:

    Lexical level:
    Entailment score ent_sc
    Cosine similarity
    Entailment score comparing only words with the same POS tag

  • Features for classifier

    Chunk level:
    Matching ratios for each chunk type (e.g. NP and VP) in all corresponding chunk pairs

    PAS level, for all corresponding PAS pairs:
    Matching ratio for each argument type (A0, A1)
    Number of negation mismatches
    Number of modal verb mismatches
    Semantic relation of the two predicates: same-expression, synonym, antonym, entailment or no-relation

  • Computing the features

    For acquiring the above chunk- and PAS-level features, we need to detect the corresponding pairs that should be checked for entailment.

    We also need to detect whether such corresponding pairs are in an entailment relation.

  • For the first problem

    1. Transform all words contained in PAS into a word vector using bag of words representation

    2. Calculate the cosine similarity for all PAS pairs that are generated by combining PAS from each T and H.

    3. We regard the most similar PAS from T for each PAS from H as corresponding pairs.
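Steps 1-3 above amount to a nearest-neighbour search under cosine similarity. A minimal sketch, assuming each PAS is already available as a list of words:

```python
# Sketch of PAS correspondence detection: pair each predicate-argument
# structure (PAS) from H with its most similar PAS from T under
# bag-of-words cosine similarity.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two word lists under bag-of-words counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def corresponding_pairs(t_pas_list, h_pas_list):
    """For each PAS from H, pick the most similar PAS from T (step 3)."""
    return [(max(t_pas_list, key=lambda t: cosine(t, h)), h)
            for h in h_pas_list]
```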

  • For the latter problem

    1. For each corresponding pair, we calculate our lexical entailment score between the words of each argument type of the PAS from H (as H in equation 1) and the words of the same argument type of the PAS from T (as T in equation 1)

    2. Apply a threshold (pre-defined) to identify entailment

  • Results

    Three solvers were submitted:
    1. IKOMA1: lexical entailment score + filtering, with threshold set empirically
    2. IKOMA2: same as IKOMA1 with threshold 0
    3. IKOMA3: lexical entailment score only

  • Results

  • MaxSim: an automatic metric for Machine Translation Evaluation based on maximum similarity

    The metric calculates a similarity score between a pair of English system-reference sentences by comparing information items such as n-grams across the sentence pair.

  • Unlike most metrics, MaxSim computes a similarity score between items,

    then finds a maximum-weight matching between the items such that each item in one sentence is mapped to at most one item in the other sentence.

    Evaluation on the WMT07, WMT08 and MT06 datasets shows that MaxSim achieves good correlation with human judgment.

  • Given a pair of English sentences to be compared, MaxSim performs tokenization, lemmatization using WordNet, and Part of Speech (POS) tagging.

    Next, all non-alphanumeric tokens are removed.

    A set of WordNet synonyms is gathered for each word; these are used for computing similarity.

  • To calculate a similarity score for a pair of system reference translation sentences, MAXSIM extracts and compares n-gram information

    Based on these comparisons or matches across the sentence pairs, MaxSim computes precision and recall

    Matching Using N-gram Information

  • Phases of n-grams

    To match n-grams, MAXSIM goes through a sequence of three phases: lemma and POS matching, lemma matching, and bipartite graph matching

    We will illustrate the matching process using unigrams, then describe the extension to bigrams and trigrams

  • Lemma and POS-tag matching: an exact match on n-gram and POS tag is applied.

    In all n-gram matching, each n-gram in the system translation can match at most one n-gram in the reference translation.

    Lemma matching: for the remaining unmatched n-grams, a relaxed condition of just a lemma match is used.

  • Bipartite graph matching: for the remaining unmatched unigrams, matches are made by constructing a weighted complete bipartite graph.

    The remaining unigrams form the nodes of the graph.

    The weights are the sum of the WordNet similarity between two word nodes and the indicator of whether or not they have the same POS tag.
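The final phase is a maximum-weight matching on that bipartite graph. The sketch below uses a brute-force search over permutations as a stand-in for a proper Hungarian-algorithm solver, which is only practical because few unigrams remain unmatched by this phase; it also assumes nonnegative weights and no more left nodes than right nodes.

```python
# Sketch of maximum-weight bipartite matching via brute force. Assumes
# nonnegative weights and len(weights) <= len(weights[0]); a Hungarian-
# algorithm implementation would replace this in a real system.
from itertools import permutations

def max_weight_matching(weights):
    """weights[i][j] = edge weight between left node i and right node j.
    Returns (best total weight, tuple assigning left i -> right index)."""
    n_left, n_right = len(weights), len(weights[0])
    best, best_assign = 0.0, ()
    for perm in permutations(range(n_right), n_left):
        total = sum(weights[i][j] for i, j in enumerate(perm))
        if total > best:
            best, best_assign = total, perm
    return best, best_assign

# Toy example: two leftover system unigrams vs. two reference unigrams,
# with invented similarity weights.
total, assign = max_weight_matching([[0.9, 0.1], [0.8, 0.2]])
```

Here the greedy pairing (0.9 + 0.8 is impossible, both want the same node) is resolved globally: left node 0 takes right node 0 and left node 1 takes right node 1, for a total of 1.1.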

  • Calculation of F-score

  • Scoring a Sentence Pair and the Whole Corpus

    For a sentence pair s, the MaxSim score is calculated by combining Fs,n over n, where Fs,n is the F score defined previously for n-grams.

  • For the entire corpus, the similarity score is just the arithmetic mean of the individual sentence-pair scores.
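The F-score formula itself appeared only as an image in the original slides. The sketch below therefore assumes the standard alpha-weighted F measure (van Rijsbergen) with the alpha = 0.9 mentioned in the evaluation, and an equal-weight mean over n for the sentence score; both are assumptions.

```python
# Hedged sketch of MaxSim sentence scoring. The alpha-weighted F form and
# the equal-weight mean over n are assumptions (the slide's formula was
# an image); alpha = 0.9 weights precision heavily.

def f_score(precision, recall, alpha=0.9):
    """Standard alpha-weighted F measure (van Rijsbergen form, assumed)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

def maxsim_sentence(f_scores_by_n):
    """Sentence score: mean of the per-n F scores (equal weights assumed)."""
    return sum(f_scores_by_n) / len(f_scores_by_n)

def maxsim_corpus(sentence_scores):
    """Corpus score: arithmetic mean over sentence pairs (per the slide)."""
    return sum(sentence_scores) / len(sentence_scores)
```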

  • Evaluation and Results

    An alpha of 0.9 is used for these evaluations

  • References

    1. Diana Perez and Enrique Alfonseca. Application of the BLEU algorithm for recognising textual entailments.

    2. Dan Roth. Recognizing Textual Entailment.

    3. Yee Seng Chan and Hwee Tou Ng. MAXSIM: An Automatic Metric for Machine Translation Evaluation Based on Maximum Similarity.

  • References

    4. Masaaki Tsuchida and Kai Ishikawa. IKOMA at TAC 2011: A Method for Recognizing Textual Entailment using Lexical-level and Sentence Structure-level features.

    5. Andrew Hickl and Jeremy Bensley. A Discourse Commitment-Based Framework for Recognizing Textual Entailment.