Slides of the presentation of the paper "An Approach to Unsupervised Historical Text Normalisation" by Petar Mitankin, Stefan Gerdjikov and Stoyan Mihov at DATeCH 2014.
An approach to unsupervised historical text normalisation
Petar Mitankin, Sofia University, FMI
Stefan Gerdjikov, Sofia University, FMI
Stoyan Mihov, Bulgarian Academy of Sciences, IICT
DATeCH 2014, Maye 19 - 20, Madrid, Spain
May
Contents
● Supervised Text Normalisation
– CULTURA
– REBELS Translation Model
– Functional Automata
● Unsupervised Text Normalisation
– Unsupervised REBELS
– Experimental Results
– Future Improvements
Co-funded under the 7th Framework Programme of the European Commission
● Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th century Early Modern English
● CULTURA: CULTivating Understanding and Research through Adaptivity
● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI
Supervised Text Normalisation
● Manually created ground truth
– 500 documents from the 1641 Depositions
– All words: 205 291
– Normalised words: 51 133
● Statistical Machine Translation from historical language to modern language combines:
– Translation model
– Language model
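REgularities Based Embedding of Language Structures (REBELS)
[Figure: the historical word "shee" is passed to the REBELS Translation Model, which outputs scored candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]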
Automatic Extraction of Historical Spelling Variations
Training of the REBELS Translation Model
● Training pairs from the ground truth:
(shee, she), (maye, may), (she, she),
(tyme, time), (saith, says), (have, have),
(tho:, thomas), ...
Training of the REBELS Translation Model
● Deterministic structure of all historical/modern subwords
● Each word has several hierarchical decompositions in the DAWG:
[Figure: hierarchical decomposition of each historical word and of each modern word]
Training of the REBELS Translation Model
● For each training pair, e.g. (knowth, knows), we find a mapping between the decompositions of the historical word and the modern word.
● We collect statistics about historical subword -> modern subword
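The idea of collecting historical subword -> modern subword statistics can be illustrated with a much simpler stand-in. The sketch below does not use the DAWG-based hierarchical decompositions from the slides; it assumes a plain character alignment from Python's `difflib.SequenceMatcher` and counts how often each aligned historical substring is rewritten to each modern substring. The function names `subword_rewrites` and `collect_statistics` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def subword_rewrites(historical, modern):
    """Yield (historical_substring, modern_substring) for each aligned block.

    Assumption: a flat difflib alignment stands in for the hierarchical
    decompositions used by REBELS.
    """
    matcher = SequenceMatcher(a=historical, b=modern, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        yield historical[i1:i2], modern[j1:j2]


def collect_statistics(pairs):
    """Count every (historical subword, modern subword) rewrite over all pairs."""
    counts = Counter()
    for historical, modern in pairs:
        counts.update(subword_rewrites(historical, modern))
    return counts


# Training pairs taken from the slides.
training_pairs = [("shee", "she"), ("maye", "may"), ("tyme", "time"), ("knowth", "knows")]
statistics = collect_statistics(training_pairs)
for (hist_sub, mod_sub), count in statistics.most_common():
    print(repr(hist_sub), "->", repr(mod_sub), count)
```

Unchanged blocks are counted as identity rewrites, so the same table also records how often a subword is kept as it is.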
REBELS generates normalisation candidates for unseen historical words
[Figure: each word of the historical phrase "shee knowth me" (shee, knowth, me) is passed to REBELS]
relevance score (he knuth my) =
REBELS TM (he knuth my) * C_tm +
Statistical Language Model (he knuth my) * C_lm
Combination of REBELS with Statistical Bigram Language Model
● Bigram Statistical Model
– Smoothing: Absolute Discounting, Backing-off
– Gutenberg English language corpus
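A minimal sketch of how such a combination could work, under several assumptions that are not in the slides: both scores are treated as log-probabilities, the discount is fixed at 0.5, unigram probabilities use add-one smoothing, and the best sequence is found by brute-force enumeration rather than by the functional-automata search module. The class `BigramLM`, the functions `relevance` and `best_normalisation`, the toy corpus and the candidate scores are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import product


class BigramLM:
    """Bigram model with absolute discounting and back-off to unigrams."""

    def __init__(self, sentences, discount=0.5):
        self.discount = discount          # assumed value, not from the slides
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + list(sentence)
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.total = sum(self.unigrams.values())
        self.context_counts = Counter()   # how often v occurs as a left context
        self.followers = defaultdict(set) # words seen after v
        for (v, w), count in self.bigrams.items():
            self.context_counts[v] += count
            self.followers[v].add(w)

    def unigram_prob(self, w):
        # Add-one smoothing so that unseen words keep a non-zero probability.
        return (self.unigrams[w] + 1) / (self.total + len(self.unigrams) + 1)

    def prob(self, v, w):
        c_v = self.context_counts[v]
        if c_v == 0:                      # unknown context: plain unigram
            return self.unigram_prob(w)
        c_vw = self.bigrams[(v, w)]
        if c_vw > 0:                      # seen bigram: discounted estimate
            return (c_vw - self.discount) / c_v
        # Unseen bigram: share the discounted mass according to unigrams.
        reserved = self.discount * len(self.followers[v]) / c_v
        unseen_mass = 1.0 - sum(self.unigram_prob(x) for x in self.followers[v])
        return reserved * self.unigram_prob(w) / unseen_mass


def relevance(candidate_seq, tm_scores, lm, c_tm, c_lm):
    """Weighted sum of translation-model and language-model log scores."""
    seq = tuple(candidate_seq)
    tm = sum(tm_scores[i][w] for i, w in enumerate(seq))
    lm_score = sum(math.log(lm.prob(v, w)) for v, w in zip(("<s>",) + seq, seq))
    return c_tm * tm + c_lm * lm_score


def best_normalisation(tm_scores, lm, c_tm=1.0, c_lm=1.0):
    """Enumerate every candidate combination and keep the highest scoring one."""
    return max(product(*tm_scores),
               key=lambda seq: relevance(seq, tm_scores, lm, c_tm, c_lm))


# Toy modern corpus and made-up candidate scores for "shee knowth me".
corpus = [["she", "knows", "me"], ["she", "knows", "him"], ["he", "knows", "you"]]
tm_scores = [{"she": -0.5, "he": -0.4},
             {"knows": -0.3, "knuth": -0.2},
             {"me": -0.1, "my": -0.6}]
print(best_normalisation(tm_scores, BigramLM(corpus)))
```

Setting `c_lm=0.0` reduces the choice to the translation model alone, which makes the effect of the language-model weight easy to inspect on small examples.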
Functional Automata
L(C_tm, C_lm) is represented with Functional Automata
Automatic Construction of Functional Automaton for the Partial Derivative w.r.t. x
L(C_tm, C_lm) is optimised with the Conjugate Gradient method
Supervised Text Normalisation
[Diagram components: Ground Truth; Training Module Based on Functional Automata; REBELS Translation Model; Search Module Based on Functional Automata; input: Historical text; output: Normalised text]
Unsupervised Text Normalisation
[Diagram components: Unsupervised Generation of Training Pairs, e.g. (knoweth, knows); REBELS Translation Model; Search Module Based on Functional Automata; input: Historical text; output: Normalised text]
Unsupervised Generation of the Training Pairs
● We use similarity search to generate training pairs:
– For each historical word H:
● If H is a modern word, then generate (H,H), else
● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then
● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then
● Find each modern word M that is at distance 3 from H and generate (H,M).
● If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.
Unsupervised Generation of the Training Pairs
● We use similarity search to generate training pairs:
– For each historical word H:
● If H is a modern word, then generate (H,H), else
● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then
● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then
● Find each modern word M that is at distance 3 from H and generate (H,M).
● If too many (> 5) modern words were generated for H, then do not use the corresponding pairs for training.
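The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the modern lexicon is a plain Python set scanned linearly (a real system would want an indexed similarity search), and the cut-off is exposed as the hypothetical parameter `max_candidates`, because the slides state it both as "more than 6" and as "> 5". The function names are also hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def generate_training_pairs(historical_words, modern_lexicon,
                            max_distance=3, max_candidates=6):
    """Pair each historical word with its nearest modern words.

    The search stops at the smallest distance (1, then 2, then 3) that
    yields at least one modern word. Words with more than `max_candidates`
    matches are treated as too ambiguous and contribute no pairs.
    """
    pairs = []
    for h in historical_words:
        if h in modern_lexicon:
            pairs.append((h, h))
            continue
        for distance in range(1, max_distance + 1):
            matches = sorted(m for m in modern_lexicon
                             if levenshtein(h, m) == distance)
            if matches:
                if len(matches) <= max_candidates:
                    pairs.extend((h, m) for m in matches)
                break
    return pairs


# Small usage example with a made-up modern lexicon.
lexicon = {"she", "may", "time", "knows", "the"}
for pair in generate_training_pairs(["shee", "maye", "tyme", "she"], lexicon):
    print(pair)
```

Passing `max_candidates=5` gives the stricter reading of the threshold.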
Normalisation of the 1641 Depositions. Experimental results
Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model     | Accuracy | BLEU
1      | ----                                | ----                   | ----               | 75.59    | 50.31
2      | Unsupervised                        | NO                     | YES                | 67.84    | 45.52
3      | Unsupervised                        | YES                    | NO                 | 79.18    | 56.55
4      | Unsupervised                        | YES                    | YES                | 81.79    | 61.88
5      | Unsupervised                        | Supervised Trained     | Supervised Trained | 84.82    | 68.78
6      | Supervised                          | Supervised Trained     | Supervised Trained | 93.96    | 87.30
Future Improvement
[Diagram components: Unsupervised Generation of Training Pairs, e.g. (knoweth, knows), with probabilities; MAP Training Module; REBELS Translation Model; Search Module Based on Functional Automata; input: Historical text; output: Normalised text]
Thank You!
Comments / Questions?
ACKNOWLEDGEMENTS
The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme and the project AComIn, grant 316087, funded by the FP7 Programme.