An approach to unsupervised historical text normalisation. Petar Mitankin (Sofia University, FMI), Stefan Gerdjikov (Sofia University, FMI), Stoyan Mihov (Bulgarian Academy of Sciences, IICT). DATeCH 2014, Maye 19 - 20, Madrid, Spain. May

Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

DESCRIPTION

Slides of the presentation of the paper An approach to Unsupervised Historical Text Normalisation by Petar Mitankin, Stefan Gerdjikov and Stoyan Mihov in DATeCH 2014. #digidays

Page 1: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

An approach to unsupervised historical text normalisation

Petar Mitankin, Sofia University, FMI

Stefan Gerdjikov, Sofia University, FMI

Stoyan Mihov, Bulgarian Academy of Sciences, IICT

DATeCH 2014, Maye 19 - 20, Madrid, Spain

May

Page 3: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Contents

● Supervised Text Normalisation
– CULTURA
– REBELS Translation Model
– Functional Automata

● Unsupervised Text Normalisation
– Unsupervised REBELS
– Experimental Results
– Future Improvements

Page 4: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Co-funded under the 7th Framework Programme of the European Commission

● Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th century Early Modern English

● CULTURA: CULTivating Understanding and Research through Adaptivity

● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI

Page 6: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Supervised Text Normalisation

● Manually created ground truth
– 500 documents from the 1641 Depositions
– All words: 205 291
– Normalised words: 51 133

● Statistical Machine Translation from historical language to modern language combines:
– Translation model
– Language model

Page 8: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

REgularities Based Embedding of Language Structures (REBELS)

Automatic Extraction of Historical Spelling Variations

shee -> REBELS Translation Model ->
he / -1.89
se / -1.69
she / -9.75
shea / -10.04

Page 9: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Training of the REBELS Translation Model

● Training pairs from the ground truth:

(shee, she), (maye, may), (she, she),

(tyme, time), (saith, says), (have, have),

(tho:, thomas), ...

Page 10: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Training of the REBELS Translation Model

● Deterministic structure of all historical/modern subwords

● Each word has several hierarchical decompositions in the DAWG (directed acyclic word graph)

[Figure: hierarchical decomposition of each historical word; hierarchical decomposition of each modern word]

Page 11: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Training of the REBELS Translation Model

● For each training pair, e.g. (knowth, knows), we find a mapping between the decompositions.

● We collect statistics about historical subword -> modern subword
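The statistics step can be illustrated with a deliberately simplified sketch. It does not reproduce the DAWG-based hierarchical decompositions of REBELS: as a stand-in, it aligns each training pair with Python's `difflib.SequenceMatcher` and counts the resulting subword rewrites. The function names `subword_rewrites` and `collect_statistics` are hypothetical, and the pairs are taken from the training examples shown on the slides.

```python
from collections import Counter
from difflib import SequenceMatcher


def subword_rewrites(historical, modern):
    """Yield (historical_subword, modern_subword) pairs for one training pair.

    Assumption: a difflib character alignment stands in for the mapping
    between DAWG decompositions used by REBELS.
    """
    matcher = SequenceMatcher(a=historical, b=modern, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        yield historical[i1:i2], modern[j1:j2]


def collect_statistics(training_pairs):
    """Count how often each historical subword maps to each modern subword."""
    counts = Counter()
    for historical, modern in training_pairs:
        counts.update(subword_rewrites(historical, modern))
    return counts


pairs = [("shee", "she"), ("maye", "may"), ("she", "she"),
         ("tyme", "time"), ("knowth", "knows")]
stats = collect_statistics(pairs)
for (hist, mod), n in stats.most_common():
    print(f"{hist!r} -> {mod!r}: {n}")
```

With these pairs the counts include the rewrites "th" -> "s" and "y" -> "i", plus the deletion of a final "e". Turning such counts into translation scores is a separate step that the slides do not spell out.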

Page 12: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

REgularities Based Embedding of Language Structures (REBELS)

REBELS generates normalisation candidates for unseen historical words

shee -> REBELS Translation Model ->
he / -1.89
se / -1.69
she / -9.75
shea / -10.04

Page 13: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

[Figure: each word of the historical text "shee knowth me" (shee, knowth, me) is passed through REBELS]

Page 14: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Combination of REBELS with Statistical Bigram Language Model

relevance score (he knuth my) = REBELS TM (he knuth my) * C_tm + Statistical Language Model (he knuth my) * C_lm

● Bigram Statistical Model
– Smoothing: Absolute Discounting, Backing-off
– Gutenberg English language corpus
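As an illustration of the smoothing named on this slide, here is a minimal sketch of a bigram model with absolute discounting and back-off to unigrams. The slides give neither the exact variant nor the discount value, so the formulation below, the name `train_bigram`, the discount of 0.5 and the toy corpus are all assumptions.

```python
from collections import Counter


def train_bigram(tokens, discount=0.5):
    """Return prob(prev, word) for a bigram model with absolute discounting.

    Assumed variant: seen bigrams lose a fixed `discount` from their count,
    and the freed mass is shared among unseen words in proportion to their
    unigram probability (back-off).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])  # how often each word occurs as a left context
    total = sum(unigrams.values())
    followers = {}
    for prev, word in bigrams:
        followers.setdefault(prev, set()).add(word)

    def p_uni(word):
        return unigrams[word] / total

    def prob(prev, word):
        seen = followers.get(prev)
        if not seen:
            return p_uni(word)  # unknown context: back off completely
        if word in seen:
            return (bigrams[(prev, word)] - discount) / history[prev]
        reserved = discount * len(seen) / history[prev]  # mass freed by discounting
        unseen_mass = 1.0 - sum(p_uni(w) for w in seen)
        if unseen_mass <= 0.0:
            return 0.0
        return reserved * p_uni(word) / unseen_mass

    return prob


# Toy corpus, not the Gutenberg corpus used in the presentation.
tokens = "she may go and she may stay and he may go".split()
prob = train_bigram(tokens)
vocab = set(tokens)
print(prob("may", "go"))   # seen bigram, discounted
print(prob("may", "and"))  # unseen bigram, backed off
```

For every context in the toy vocabulary the probabilities sum to one, which is a quick sanity check for any back-off scheme of this kind.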

Page 15: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Functional Automata

L(C_tm, C_lm) is represented with Functional Automata

Page 16: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Automatic Construction of Functional Automaton for the Partial Derivative w.r.t. x

L(C_tm, C_lm) is optimised with the Conjugate Gradient method

Page 17: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Supervised Text Normalisation

[Diagram components: Ground Truth; Training Module Based on Functional Automata; REBELS Translation Model; Search Module Based on Functional Automata; input: Historical text; output: Normalised text]
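To make the role of the search module concrete, the sketch below picks the candidate sequence with the highest relevance score, i.e. the translation model score times C_tm plus the language model score times C_lm. It uses a plain Viterbi search over per-word candidate lists rather than functional automata, so it is a substitute technique, not the presented implementation. The name `best_normalisation`, all scores and the toy bigram table are made up for illustration, and both scores are assumed to be log-probabilities where higher is better.

```python
def best_normalisation(candidates, lm_logprob, c_tm=1.0, c_lm=1.0, start="<s>"):
    """Viterbi search for the candidate sequence with the highest relevance score.

    candidates: one non-empty list per historical word of (modern_word, tm_score).
    lm_logprob(prev, word): bigram language model log-probability.
    """
    # best maps the last chosen word to (score of the best path so far, path).
    best = {start: (0.0, [])}
    for options in candidates:
        nxt = {}
        for word, tm_score in options:
            score, path = max(
                ((prev_score + c_tm * tm_score + c_lm * lm_logprob(prev, word), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            if word not in nxt or score > nxt[word][0]:
                nxt[word] = (score, path + [word])
        best = nxt
    return max(best.values(), key=lambda item: item[0])


# Hypothetical candidates for "shee knowth me" with invented scores.
candidates = [
    [("she", -0.5), ("he", -1.5)],
    [("knows", -0.7), ("knuth", -0.9)],
    [("me", -0.2), ("my", -1.6)],
]
toy_bigrams = {
    ("<s>", "she"): -1.0, ("<s>", "he"): -1.0,
    ("she", "knows"): -1.0, ("he", "knows"): -1.2,
    ("knows", "me"): -1.5, ("knows", "my"): -2.0,
    ("knuth", "my"): -5.0,
}


def toy_lm(prev, word):
    return toy_bigrams.get((prev, word), -10.0)  # flat penalty for unseen bigrams


score, path = best_normalisation(candidates, toy_lm)
print(path, score)
```

With these invented numbers the search selects "she knows me". In the presented system the weights C_tm and C_lm are trained rather than fixed at 1.0.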

Page 18: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Unsupervised Text Normalisation

[Diagram components: Unsupervised Generation of Training Pairs, e.g. (knoweth, knows); REBELS Translation Model; Search Module Based on Functional Automata; input: Historical text; output: Normalised text]

Page 19: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Unsupervised Generation of the Training Pairs

● We use similarity search to generate training pairs:
– For each historical word H:
  ● If H is a modern word, then generate (H,H), else
  ● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then
  ● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then
  ● Find each modern word M that is at distance 3 from H and generate (H,M).
  ● If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.

Page 23: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Unsupervised Generation of the Training Pairs

● We use similarity search to generate training pairs:
– For each historical word H:
  ● If H is a modern word, then generate (H,H), else
  ● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then
  ● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then
  ● Find each modern word M that is at distance 3 from H and generate (H,M).
  ● If too many (> 5) modern words were generated for H, then do not use the corresponding pairs for training.
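The procedure above can be sketched in a few lines of Python. A brute-force scan over the modern lexicon stands in for the similarity search, which would be too slow for a real lexicon but keeps the logic visible. The names `levenshtein` and `generate_training_pairs`, the tiny lexicon and the word list are illustrative assumptions. The cutoff is left as a parameter because the slides state it both as "more than 6" and as "> 5".

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def generate_training_pairs(historical_words, modern_lexicon,
                            max_distance=3, max_candidates=5):
    """Pair each historical word with its nearest modern words."""
    pairs = []
    for h in historical_words:
        if h in modern_lexicon:
            pairs.append((h, h))
            continue
        distances = {m: levenshtein(h, m) for m in modern_lexicon}
        for d in range(1, max_distance + 1):
            matches = sorted(m for m, dist in distances.items() if dist == d)
            if matches:
                # Too many candidates means the word is ambiguous: skip it.
                if len(matches) <= max_candidates:
                    pairs.extend((h, m) for m in matches)
                break  # never look further than the nearest non-empty distance
    return pairs


# Toy lexicon and word list, assumed for illustration only.
lexicon = {"she", "knows", "may", "time", "the", "says"}
words = ["shee", "maye", "she", "tyme", "knoweth"]
training_pairs = generate_training_pairs(words, lexicon)
print(training_pairs)
```

Here "knoweth" finds no modern word at distance 1 or 2 and is paired with "knows" at distance 3, matching the (knoweth, knows) example on the slides.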

Page 24: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Normalisation of the 1641 Depositions. Experimental results

Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model | Accuracy | BLEU
1 | ---- | ---- | ---- | 75.59 | 50.31
2 | Unsupervised | NO | YES | 67.84 | 45.52
3 | Unsupervised | YES | NO | 79.18 | 56.55
4 | Unsupervised | YES | YES | 81.79 | 61.88
5 | Unsupervised | Supervised Trained | Supervised Trained | 84.82 | 68.78
6 | Supervised | Supervised Trained | Supervised Trained | 93.96 | 87.30

Page 25: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Future Improvement

[Diagram components: Unsupervised Generation of Training Pairs, e.g. (knoweth, knows), with probabilities; MAP Training Module; REBELS Translation Model; Search Module Based on Functional Automata; input: Historical text; output: Normalised text]

Page 26: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Thank You!

Comments / Questions?

ACKNOWLEDGEMENTS

The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme, and the project AComIn, grant 316087, funded by the FP7 Programme.