Comparable Corpora
Kashyap Popat (113050023)
Rahul Sharnagat (11305R013)
Outline
- Motivation
- Introduction: Comparable Corpora
- Types of corpora
- Methods to extract information from comparable corpora
  - Bilingual dictionary
  - Parallel sentences
- Conclusion
Motivation
- Corpus: the most basic requirement in statistical NLP
- Large amount of bilingual text on the web
- Bilingual dictionary generation: one-to-one correspondence between words
- Parallel corpus generation: one-to-one correspondence between sentences; a very rare resource (e.g., Hindi-Chinese)
Comparable Corpora[7]
"A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora."
- Definition by EAGLES
Characteristics of comparable corpora:
- No parallel sentences
- No parallel paragraphs
- Few overlapping terms and words
Spectrum of Corpora (from least to most related):
- Unrelated corpora
- Comparable corpora
- Parallel corpora
- Transcription: sentence-by-sentence aligned
A comparable corpus (example figure)
Applications of comparable corpora:
- Generating bilingual lexical entries (dictionary)
- Creating parallel corpora
Generating bilingual lexical entries
Basic postulates[1]
- Words with a productive context in one language translate to words with a productive context in the second language, e.g., table - मेज़
- Words with a rigid context translate into words with a rigid context, e.g., haemoglobin - रक्ताणु
- Correlation between co-occurrence patterns in the two languages
Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus, Fung, 1995
Co-occurrence patterns[4]
If a term A co-occurs with another term B in some text T, then its translation A' also co-occurs with B' (the translation of B) in some other text T'.
Automatic Identification of Word Translations from Unrelated English and German Corpora. R. Rapp, 1999
(Figure: A and B co-occurring in text T; A' and B' co-occurring in text T')
Co-occurrence Histogram[2] for the word 'Debenture'
(Histogram: co-occurring words vs. their counts; figure omitted)
Finding terminology translations from non-parallel corpora. Fung, 1997
Basic Approach[3]
- Calculate the co-occurrence matrix for all the words in the source language L1 and the target language L2
- Permute the word order of the L1 matrix until the resulting pattern is most similar to that of the L2 matrix
Identifying word translations in nonparallel texts, R. Rapp, 1995
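A minimal sketch of the first step, building a word-by-word co-occurrence matrix with a fixed context window; the toy corpus, vocabulary, and window size are illustrative assumptions, not from the papers.

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, window=2):
    """Count how often each vocabulary word appears within `window`
    tokens of each other vocabulary word."""
    index = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)), dtype=int)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            if w not in index:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in index:
                    m[index[w], index[tokens[j]]] += 1
    return m

# Toy usage with the slide's six English words:
vocab = ["book", "garden", "plant", "school", "sky", "teacher"]
corpus = [["the", "teacher", "reads", "a", "book", "in", "school"],
          ["a", "plant", "grows", "in", "the", "garden"]]
print(cooccurrence_matrix(corpus, vocab, window=3))
```

The permutation search over this matrix is what the slides below illustrate, and what makes the basic approach expensive.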
English co-occurrence matrix (L1 matrix), rows and columns indexed 1-6:
1 Book, 2 Garden, 3 Plant, 4 School, 5 Sky, 6 Teacher
(matrix entries omitted)
Hindi co-occurrence matrix (L2 matrix), rows and columns indexed 1-6:
1 आकाश, 2 पाठशाला, 3 शिक्षक, 4 बगीचा, 5 किताब, 6 पौधा
(matrix entries omitted)
Hindi co-occurrence matrix (L2 matrix) after permutation, row/column order 5 4 6 2 1 3:
5 किताब, 4 बगीचा, 6 पौधा, 2 पाठशाला, 1 आकाश, 3 शिक्षक
(matrix entries omitted)
Result
Comparing the order of the words in the L1 matrix and the permuted L2 matrix:

L1 index | Word in L1 | Word in L2 | L2 index
1 | Book | किताब | 5
2 | Garden | बगीचा | 4
3 | Plant | पौधा | 6
4 | School | पाठशाला | 2
5 | Sky | आकाश | 1
6 | Teacher | शिक्षक | 3
Problems
- Permuting the co-occurrence matrix is expensive
- The vector size equals the number of unique terms in the language
A New Method[2]
Dictionary entries are used as seed words to generate correlation matrices.
Algorithm: a bilingual list of known translation pairs (the seed words) is given.
- Step 1: For every word 'e' in L1, find its correlation vector (M1) against the L1 side of the seed word list
- Step 2: For every word 'c' in L2, find its correlation vector (M2) against the L2 side of the seed word list
- Step 3: Compute correlation(M1, M2); if it is high, 'e' and 'c' are considered a translation pair
Finding terminology translations from non-parallel corpora. Fung, 1997
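A minimal sketch of Steps 1-3, assuming raw co-occurrence counts as the correlation vectors and cosine as the correlation measure; both choices are stand-ins for the counts and similarity measures discussed on the following slides.

```python
import numpy as np

def seed_vector(word, sentences, seeds, window=3):
    """Correlation vector of `word` against the seed words of its own
    language (Steps 1 and 2): one co-occurrence count per seed."""
    idx = {s: k for k, s in enumerate(seeds)}
    v = np.zeros(len(seeds))
    for tokens in sentences:
        for i, t in enumerate(tokens):
            if t != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in idx:
                    v[idx[tokens[j]]] += 1
    return v

def cosine(a, b):
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / d) if d else 0.0

def translation_candidates(e_words, c_words, e_sents, c_sents, seed_pairs):
    """Step 3: score every (e, c) pair by the similarity of their
    seed-word correlation vectors; high scores suggest translations."""
    e_seeds = [p[0] for p in seed_pairs]   # L1 side of the seed list
    c_seeds = [p[1] for p in seed_pairs]   # matching L2 side
    e_vecs = {e: seed_vector(e, e_sents, e_seeds) for e in e_words}
    c_vecs = {c: seed_vector(c, c_sents, c_seeds) for c in c_words}
    scores = {(e, c): cosine(e_vecs[e], c_vecs[c])
              for e in e_words for c in c_words}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Because every vector has one coordinate per seed pair, the vectors of the two languages live in the same space and can be compared directly.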
Seed word list (English - Hindi):
Garden - बगीचा
Plant - पौधा
Sky - आकाश
Flower - फूल
... (further pairs omitted)
Crux of the Algorithm
Two main steps:
- Formation of the co-occurrence matrix
- Measuring the similarity between vectors
Different methods are possible for each of these two steps.
Advantage: the vector size reduces to the number of unique words in the seed list.
Improvements
- Window size for co-occurrence calculation[2]: should it be the same for all the words?
- Co-occurrence counts
- Similarity measures
The window size can be made inversely proportional to the word's frequency:

$$\text{window\_size} \propto \frac{1}{\text{frequency}(w_s)}$$
Finding terminology translations from non-parallel corpora. Fung, 1997
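A tiny sketch of this idea, assuming the proportionality above with an illustrative constant C and clamping bounds (all three are assumptions): rare words get wider windows so that enough co-occurrence evidence is collected.

```python
def window_size(word_freq, C=100, lo=2, hi=20):
    """Window width inversely proportional to frequency, clamped to [lo, hi]."""
    return max(lo, min(hi, round(C / word_freq)))

print(window_size(5), window_size(500))  # rare word -> 20, frequent word -> 2
```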
Co-occurrence counts:
- Co-occurrence count
- Mutual Information (Church & Hanks, 1989)
- Conditional Probability (Rapp, 1996)
- Chi-Square Test (Dunning, 1993)
- Log-likelihood Ratio (Dunning, 1993)
- TF-IDF (Fung et al., 1998)
Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp.,1999
Mutual Information[2] (1/2)
- k11 = number of segments where both ws and wt occur
- k12 = number of segments where only ws occurs
- k21 = number of segments where only wt occurs
- k22 = number of segments where neither word occurs
Segments: sentences, paragraphs, or string groups delimited by anchor points.

$$\Pr(w_s = 1) = \frac{k_{11} + k_{12}}{k_{11} + k_{12} + k_{21} + k_{22}}$$
$$\Pr(w_t = 1) = \frac{k_{11} + k_{21}}{k_{11} + k_{12} + k_{21} + k_{22}}$$
$$\Pr(w_s = 1, w_t = 1) = \frac{k_{11}}{k_{11} + k_{12} + k_{21} + k_{22}}$$
Finding terminology translations from non-parallel corpora. Fung, 1997
Mutual Information[2] (2/2)
Weighted mutual information:

$$W(w_s, w_x) = \Pr(w_s = 1, w_x = 1)\,\log_2 \frac{\Pr(w_s = 1, w_x = 1)}{\Pr(w_s = 1)\,\Pr(w_x = 1)}$$
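A small sketch computing the contingency-table probabilities and the weighted mutual information straight from the k11..k22 definitions above; the example counts are hypothetical.

```python
import math

def weighted_mi(k11, k12, k21, k22):
    """W(ws, wt) from the contingency counts defined above."""
    n = k11 + k12 + k21 + k22
    p_s = (k11 + k12) / n        # Pr(ws = 1)
    p_t = (k11 + k21) / n        # Pr(wt = 1)
    p_st = k11 / n               # Pr(ws = 1, wt = 1)
    if p_st == 0:
        return 0.0
    return p_st * math.log2(p_st / (p_s * p_t))

# Hypothetical counts over 1000 segments for a candidate pair (ws, wt):
print(weighted_mi(k11=30, k12=10, k21=5, k22=955))
```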
Similarity Measures (1/2)
- Cosine similarity (Fung and McKeown, 1997)
- Jaccard similarity (Grefenstette, 1994)
- Dice similarity (Rapp, 1999)
- L1 norm / city block distance (Jones & Furnas, 1987)
- L2 norm / Euclidean distance (Fung, 1997)
Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp.,1999
Similarity Measures (2/2)
L1 norm / city block distance:
$$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$$
L2 norm / Euclidean distance:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
Cosine similarity:
$$\cos(A, B) = \frac{A \cdot B}{|A|\,|B|}$$
Jaccard similarity:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
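Direct transcriptions of the four measures for numpy vectors; the min/max form used for Jaccard is one common weighted-vector variant of the set formula above (an assumption, since the slide gives only the set form).

```python
import numpy as np

def l1_distance(p, q):
    """City block distance."""
    return float(np.sum(np.abs(p - q)))

def l2_distance(p, q):
    """Euclidean distance."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_similarity(a, b):
    # Weighted-vector analogue of |A ∩ B| / |A ∪ B|
    return float(np.minimum(a, b).sum() / np.maximum(a, b).sum())
```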
Problems with the approach[5]
- Coverage: only a few corpus words are covered by the dictionary
- Synonymy / polysemy: several entries have the same meaning (synonymy), or one entry has several meanings (polysemy); similarities w.r.t. synonyms should not be independent
Improvement in the form of geometric approaches:
- Project the co-occurrence vectors of the source and target words onto the dictionary entries
- Measure the similarity between the projected vectors
A geometric view on bilingual lexicon extraction from comparable corpora. Gaussier, et al., 2004
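A rough sketch of the geometric idea: map each word to one coordinate per dictionary pair (its similarity to that entry's vector in its own language), then compare across languages. This is a simplification of Gaussier et al. (2004), not their exact formulation.

```python
import numpy as np

def to_dictionary_space(v, entry_vectors):
    """Map a co-occurrence vector to one coordinate per dictionary
    entry: its cosine with that entry's own co-occurrence vector."""
    return np.array([float(v @ d) / (np.linalg.norm(v) * np.linalg.norm(d))
                     for d in entry_vectors])

def projected_similarity(v_src, v_tgt, src_entries, tgt_entries):
    """src_entries[i] and tgt_entries[i] are the two sides of the
    i-th dictionary pair, so the projected coordinates are comparable."""
    a = to_dictionary_space(v_src, src_entries)
    b = to_dictionary_space(v_tgt, tgt_entries)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Because synonymous dictionary entries produce correlated coordinates, similarities with respect to synonyms are no longer treated as independent.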
Results
Paper | Approach | Method | Corpus | Accuracy
Fung et al. 1996 | Word list based | Best candidate | English/Japanese | 29%
Fung et al. 1996 | Word list based | Top 20 candidate output | English/Japanese | 50.9%
Gaussier et al. 2004 | Geometric | Avg. precision | English/French | 44%
R. Rapp et al. 1999 | Word list based | 100 test words | English/French | 72%
Generating parallel corpora
Generating Parallel Corpora
Involves aligning the sentences in the comparable corpora to form a parallel corpus. Two ways to do it:
- Dictionary matching
- Statistical methods

Ways to do alignment
Dictionary matching:
- If the words in two given sentences are translations of each other, the sentences are most likely translations of each other
- The process is very slow; accuracy is high, but it cannot be applied to a large corpus (a toy scoring sketch follows this list)
Statistical methods:
- To predict the alignment, these methods use the distribution of sentence lengths in the corpus, either in terms of words (Brown et al., 1991) or characters (Gale and Church, 1991)
- They make no use of any lexical resources, and are fast and accurate
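A toy sketch of dictionary matching: score a sentence pair by the fraction of source words whose dictionary translation appears in the target sentence. The two-entry dictionary here is purely illustrative.

```python
def dict_match_score(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens with a dictionary translation
    present in the target sentence."""
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens
               if any(t in tgt for t in dictionary.get(w, ())))
    return hits / max(1, len(src_tokens))

d = {"book": ["किताब"], "teacher": ["शिक्षक"]}
print(dict_match_score(["the", "teacher", "book"], ["शिक्षक", "किताब"], d))
```

Scoring every candidate pair this way is what makes the method too slow for large corpora.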
Length-based statistical approach
Preprocessing:
- Segment the text into tokens
- Combine the tokens into groups (i.e., sentences)
- Find anchor points: points in the corpus where we are sure that start and end points in one language align to start and end points in the other language
Finding these points requires analysis of the corpus.
Example
Brown et al. already had anchors in their corpus: they used the Canadian parliament proceedings ('Hansards') as a parallel corpus. Each proceeding starts with a comment, the time of the proceeding, who was giving the speech, and so on. This information provides the anchor points.
Sample text from Aligning sentences in parallel corpora, P. Brown, Jennifer Lai, and Robert Mercer, 1991
Aligning anchor points
Anchor points are not always perfect: some may be missing, some may be garbled. To find the alignment between these anchors, a dynamic programming technique is used: we find an alignment of the major anchors in the two corpora with the least total cost.
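A minimal dynamic-programming sketch of least-total-cost anchor alignment; the match, mismatch, and skip costs here are illustrative assumptions, not the paper's values (which come from its probabilistic model).

```python
def align_anchors(a, b, skip_cost=1.0, mismatch_cost=2.0):
    """Least total cost of aligning anchor sequences a and b, where an
    anchor may pair with one on the other side or be skipped (missing)."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:   # pair a[i] with b[j]; garbled anchors pay a penalty
                match = 0.0 if a[i] == b[j] else mismatch_cost
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1], cost[i][j] + match)
            if i < n:             # a[i] has no counterpart
                cost[i + 1][j] = min(cost[i + 1][j], cost[i][j] + skip_cost)
            if j < m:             # b[j] has no counterpart
                cost[i][j + 1] = min(cost[i][j + 1], cost[i][j] + skip_cost)
    return cost[n][m]

print(align_anchors(["s1", "s2", "s3"], ["s1", "s3"]))  # -> 1.0 (one skip)
```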
Beads
At a higher level, the corpus can be viewed as a sequence of sentence lengths occasionally separated by paragraph markers. Each such grouping of sentences is called a bead; a bead is thus a type of sentence grouping.
Sample text from Aligning sentences in parallel corpora, P. Brown, Jennifer Lai, and Robert Mercer, 1991
Beads
Bead type | Content
e | One English sentence
f | One French sentence
ef | One English and one French sentence
eef | Two English sentences and one French sentence
eff | One English sentence and two French sentences
¶e | One English paragraph marker
¶f | One French paragraph marker
¶e¶f | One English and one French paragraph marker
Example of beads
Problem Formulation
The sentences between the anchor points are generated by two random processes:
1. Producing a sequence of beads
2. Choosing the length of the sentence(s) in each bead
Bead generation can be modeled using a two-state Markov model. One sentence can align to zero, one, or two sentences on the other side, which allows any of the eight beads shown in the previous table.
Assumptions:
$$\Pr(¶e) = \Pr(¶f), \qquad \Pr(e) = \Pr(f), \qquad \Pr(eff) = \Pr(eef)$$
Modeling length of sentence
Model the probability of a sentence's length given its bead. The following assumptions are made:
- e-beads and f-beads: the probability of le or lf is the same as its probability in the whole corpus
- ef-bead: the English sentence has length le with probability Pr(le); the log ratio of the French to the English sentence length is normally distributed with mean µ and variance σ²:

$$\Pr(l_f \mid l_e) \propto \exp\!\left(-\frac{(r - \mu)^2}{2\sigma^2}\right), \qquad r = \log(l_f / l_e)$$
Contd.
- eef-bead: the English sentences are drawn from Pr(le); r is distributed according to the same normal distribution, with
$$r = \log\frac{l_f}{l_{e_1} + l_{e_2}}$$
- eff-bead: the English sentence is drawn from Pr(le); r is distributed according to the same normal distribution, with
$$r = \log\frac{l_{f_1} + l_{f_2}}{l_e}$$
Given the sum of the lengths of the French sentences, the probability of a particular pair lf1 and lf2 is proportional to
$$\Pr(l_{f_1})\,\Pr(l_{f_2})$$
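A sketch of this length model: computing r for ef-, eef-, and eff-beads and scoring it under N(µ, σ²). The µ and σ values below are placeholders; in the paper they come from the EM estimation described next.

```python
import math

def log_ratio_cost(r, mu=0.0, sigma=0.3):
    """Negative log of the Gaussian density of r (lower = more likely)."""
    return 0.5 * ((r - mu) / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))

def ef_cost(le, lf, **kw):          # one English, one French sentence
    return log_ratio_cost(math.log(lf / le), **kw)

def eef_cost(le1, le2, lf, **kw):   # two English, one French
    return log_ratio_cost(math.log(lf / (le1 + le2)), **kw)

def eff_cost(le, lf1, lf2, **kw):   # one English, two French
    return log_ratio_cost(math.log((lf1 + lf2) / le), **kw)

# Hypothetical sentence lengths (in words):
print(ef_cost(20, 24), eef_cost(12, 9, 22), eff_cost(25, 14, 13))
```

These per-bead costs are exactly what the dynamic-programming alignment minimizes over the whole corpus.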
Parameter Estimation
Using the EM algorithm, estimate the parameters of the Markov model. The following results were obtained:
Sample text from Aligning sentences in parallel corpora, P. Brown, Jennifer Lai, and Robert Mercer, 1991
Results
In a random sample of 1000 sentences, only 6 were not translations of each other. Brown et al. also studied the effect of anchor points. According to them:
- with paragraph markers but no anchor points, a 2.0% error rate is expected
- with anchor points but no paragraph markers, a 2.3% error rate is expected
- with neither anchor points nor paragraph markers, a 3.2% error rate is expected
Conclusion
Comparable corpora can be used to generate bilingual dictionaries and parallel corpora.
Generating bilingual dictionaries:
- Polysemy and sense disambiguation remain major challenges
Generating parallel corpora:
- Given the aligned anchor points, the aligner is likely to give good results
- The experiments were very specific to their corpora, so it is hard to generalize the accuracy
- Sentence pairs whose lengths make them very likely to align, but which are not translations of each other, can confuse the aligner
References
1. Fung, P. (1995). Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. Proceedings of the 3rd Annual Workshop on Very Large Corpora, Boston, Massachusetts, 173-183.
2. Fung, P.; McKeown, K. (1997). Finding terminology translations from non-parallel corpora. Proceedings of the 5th Annual Workshop on Very Large Corpora, Hong Kong, 192-202.
3. Rapp, R. (1995). Identifying word translations in nonparallel texts. Proceedings of the 33rd Meeting of the Association for Computational Linguistics, Cambridge, Massachusetts, 320-322.
4. Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. Proceedings of ACL-99, College Park, USA.
5. Gaussier, E.; Renders, J.-M.; Matveeva, I.; Goutte, C.; Dejean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 527-534.
6. Brown, P. F.; Lai, J. C.; Mercer, R. L. (1991). Aligning sentences in parallel corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL '91).
7. http://www.ilc.cnr.it/EAGLES/corpustyp/node21.html