by Guy De Pauw, Peter Waiganjo Wagacha and Gilles-Maurice de Schryver
Language Technologies for African Languages – AfLaT 2009
The SAWA Corpus: A Parallel Corpus English - Swahili
Guy De Pauw, Peter Waiganjo Wagacha, Gilles-Maurice de Schryver
Resource-scarceness
• Language technology vs the digital divide
• Digital data increasingly important for African languages (web, mobile phone, …)
• But: most research on African languages is rooted in knowledge-based paradigm (↔ LT for Indo-European languages):
  - Hand-crafted expert systems
  - Typically high accuracy for domain
  - Limited portability to other languages and subdomains
  - Costly development phase
  - Limited resources (linguistic, expertise, financial, …)
• Need for a cheaper and faster (language-independent) alternative for developing African language technology
Data-driven approaches
• For Indo-European and Asian languages: the data-driven, corpus-based approach has become the dominant paradigm since the 1990s
• Basic methodology: automatically extract linguistic knowledge from annotated text material (corpus) and bootstrap the development of language technology components
• Advantages:
  - Language independence: portability (!!!!)
  - Knowledge-acquisition bottleneck → data-acquisition bottleneck
  - Robustness
• AfLaT team: explore application of the data-driven paradigm to African languages (Swahili, Gikuyu, Luo, Northern Sotho, …)
Machine Translation
3 paradigms:
- Rule-based MT
- Statistical MT (data-driven)
- Example-based MT (data-driven)
Learn translation from examples: !! Parallel corpus !!
Parallel Corpus
Collection of translated texts in two different languages, aligned on paragraph, sentence, phrase and/or word level
SAWA Corpus: parallel corpus English - Swahili
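A minimal sketch of how sentence-aligned units of such a corpus could be represented in memory. The class and field names (`AlignedUnit`, `en`, `sw`) are illustrative assumptions, not the SAWA Corpus format. The sample data is Article 12 of the Universal Declaration of Human Rights as quoted in this talk, where the first English sentence corresponds to two Swahili sentences.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedUnit:
    """One alignment unit: English and Swahili sentences that translate each other."""
    en: List[str]
    sw: List[str]

    def kind(self) -> str:
        # "n-m" means n English sentences aligned with m Swahili sentences.
        return f"{len(self.en)}-{len(self.sw)}"


# Hypothetical in-memory corpus holding Article 12 as two alignment units.
corpus = [
    AlignedUnit(
        en=["No one shall be subjected to arbitrary interference with his privacy, "
            "family, home or correspondence, nor to attacks upon his honour and reputation."],
        sw=["Kila mtu asiingiliwe bila sheria katika mambo yake ya faragha, "
            "ya jamaa yake, ya nyumbani mwake au ya barua zake.",
            "Wala asivunjiwe heshima na sifa yake."],
    ),
    AlignedUnit(
        en=["Everyone has the right to the protection of the law against such "
            "interference or attacks."],
        sw=["Kila mmoja ana haki ya kulindwa na sheria kutokana na pingamizi "
            "au mambo kama hayo."],
    ),
]

kinds = [unit.kind() for unit in corpus]
print(kinds)
```

Storing lists on both sides, rather than one sentence per side, lets the same structure hold 1-1, 1-2 and other many-to-many units.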
Example

English:
Universal Declaration of Human Rights
Preamble
Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,
Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people,

Swahili:
Katika Disemba 10, 1948, Baraza kuu la Umoja wa Mataifa lilikubali na kutangaza Taarifa ya Ulimwengu juu ya Haki za Binadamu. Maelezo kamili ya Taarifa hiyo yamepigwa chapa katika kurasa zifuatazo. Baada ya kutangaza taarifa hii ya maana Baraza Kuu lilizisihi nchi zote zilizo Wanachama wa Umoja wa Mataifa zitangaze na "zifanye ienezwe ionyeshwe, isomwe na ielezwe mashuleni na katika vyuo vinginevyo bila kujali siasa ya nchi yo yote."
UMOJA WA MATAIFA OFISI YA IDARA YA HABARI TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU
UTANGULIZI
Kwa kuwa kukiri heshima ya asili na haki sawa kwa binadamu wote ndio msingi wa uhuru, haki na amani duniani,
Kwa kuwa kutojali na kudharau haki za binadamu kumeletea vitendo vya kishenzi ambavyo vimeharibu dhamiri ya binadamu na kwa sababu taarifa ya ulimwengu ambayo itawafanya binadamu wafurahie uhuru wao wa kusema, kusadiki na wa kutoogopa cho chote imekwisha kutangazwa kwamba ndio hamu kuu ya watu wote,
3 phases
• Data-collection: finding parallel texts
• Data-constitution: aligning the parallel texts on word level
• Data-exploitation:
  - Statistical Machine Translation
  - Bootstrapping linguistic annotation
Data Collection
• Limited availability of parallel texts English – Kiswahili:
  - Smaller documents: investment reports, political texts, e.g. Universal Declaration of Human Rights
  - Bible, Quran, secular literature
  - New translations
"There is no data like more data"
Data Collection
• Even if the source data is digitally available beforehand, we are often faced with tough alignment problems during data constitution.
e.g. paragraph alignment
Universal Declaration of Human Rights, Preamble (same excerpt as in the Example): the Swahili text opens with an introductory paragraph and a title block (from "Katika Disemba 10, 1948, …" to "… HAKI ZA BINADAMU") that have no counterpart in the English excerpt, so the paragraphs cannot simply be paired in order.
e.g. sentence alignment
Article 12
No one shall be subjected to arbitrary interference
with his privacy, family, home or correspondence,
nor to attacks upon his honour and reputation.
Everyone has the right to the protection of the law
against such interference or attacks.
Kifungu cha 12
Kila mtu asiingiliwe bila sheria katika mambo yake
ya faragha, ya jamaa yake, ya nyumbani mwake au
ya barua zake.
Wala asivunjiwe heshima na sifa yake.
Kila mmoja ana haki ya kulindwa na sheria kutokana
na pingamizi au mambo kama hayo.
Available data in SAWA Corpus

                      English      Kiswahili    English    Kiswahili
                      Sentences    Sentences    Words      Words
New Testament         16.4k        16.3k        189.2k     151.1k
Quran                 14.3k        14.5k        165.5k     124.3k
Declaration of HR     0.2k         0.2k         1.8k       1.8k
Kamusi.org            5.6k         5.6k         35.5k      26.7k
Movie Subtitles       9.0k         9.0k         72.2k      58.4k
Investment Reports    3.2k         3.1k         52.9k      54.9k
Local Translator      1.5k         1.6k         25.0k      25.7k
Total                 50.2k        50.3k        542.1k     442.9k
All manually sentence aligned!
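As a quick consistency check, the totals can be recomputed from the per-source figures. This sketch assumes that the three rows listing a single sentence count (Declaration of HR, Kamusi.org, Movie Subtitles) have the same count for both languages, which agrees with the reported totals. The dictionary layout is illustrative only.

```python
# Figures in thousands, copied from the table:
# (English sentences, Kiswahili sentences, English words, Kiswahili words).
sawa = {
    "New Testament": (16.4, 16.3, 189.2, 151.1),
    "Quran": (14.3, 14.5, 165.5, 124.3),
    "Declaration of HR": (0.2, 0.2, 1.8, 1.8),       # single sentence count assumed for both
    "Kamusi.org": (5.6, 5.6, 35.5, 26.7),            # single sentence count assumed for both
    "Movie Subtitles": (9.0, 9.0, 72.2, 58.4),       # single sentence count assumed for both
    "Investment Reports": (3.2, 3.1, 52.9, 54.9),
    "Local Translator": (1.5, 1.6, 25.0, 25.7),
}

# Column sums, rounded to one decimal to match the precision of the table.
totals = tuple(round(sum(row[i] for row in sawa.values()), 1) for i in range(4))
print(totals)

# Overall ratio of English to Kiswahili word tokens.
ratio = totals[2] / totals[3]
print(f"English/Kiswahili word ratio: {ratio:.2f}")
```

The English side contains noticeably more word tokens than the Kiswahili side for a nearly identical number of sentences.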
Thanks to Mahmoud Shokrollahi-Far, University College of Nabiye Akram (Iran)
Thanks to Dr. James Omboga Zaja, University of Nairobi
Word alignment
Most difficult task: relate words between languages
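English: No she 's uh up north
Swahili: La yuko aa juu kaskazini
[Figure: alignment links between the English and Swahili tokens, including punctuation tokens]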
Word alignment
English: You caught me skiving , I 'm afraid .
Swahili: Samahani , umenidaka nikihepa .
Word alignment
• Can be done automatically using established tools (GIZA++)
• Provide manual reference to evaluate automatic word alignment tools (5000 words)
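GIZA++ trains the IBM translation models and an HMM alignment model. The sketch below is not GIZA++: it is a toy version of the simplest of those models, IBM Model 1, trained with EM and without a NULL word, meant only to show how word links can be learned from sentence pairs. The function names and the two tiny sentence pairs are invented for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(f | e) with EM from (english_tokens, swahili_tokens) pairs."""
    sw_vocab = {f for _, sw in pairs for f in sw}
    # Uniform initialisation of the translation probabilities t[(f, e)].
    t = defaultdict(lambda: 1.0 / len(sw_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (f, e) links
        total = defaultdict(float)  # expected counts of e
        for en, sw in pairs:
            for f in sw:
                # E-step: share one unit of probability mass for f over all e.
                norm = sum(t[(f, e)] for e in en)
                for e in en:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def align(en, sw, t):
    """Link each Swahili token to its most probable English token."""
    return [(max(en, key=lambda e: t[(f, e)]), f) for f in sw]


# Toy training data (invented): the second pair disambiguates the first.
pairs = [
    (["up", "north"], ["juu", "kaskazini"]),
    (["north"], ["kaskazini"]),
]
t = train_ibm1(pairs)
links = align(["up", "north"], ["juu", "kaskazini"], t)
print(links)
```

With only co-occurrence statistics to go on, the model needs the second pair to decide that "kaskazini" belongs with "north"; real tools rely on far more data and richer models.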
Current results
Still a lot of room for improvement
Precision    Recall    F(β=1)
39.4%        44.5%     41.79%
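A minimal sketch of how such scores are typically computed, assuming the evaluation compares the set of automatically produced alignment links with the set of manual reference links. The slides do not spell out the exact procedure, and the toy link sets below are invented. The last lines recompute the F-score from the reported precision and recall.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def evaluate(predicted, reference):
    """Return (precision, recall, F) for two sets of alignment links."""
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall, f_score(precision, recall)


# Toy example: links are (english_index, swahili_index) pairs.
reference = {(0, 0), (1, 1), (2, 1), (3, 2)}
predicted = {(0, 0), (1, 1), (3, 3)}
p, r, f = evaluate(predicted, reference)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")

# Recomputing F from the reported word-level precision and recall.
reported_f = 100 * f_score(0.394, 0.445)
print(f"{reported_f:.2f}")
```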
Word alignment
Some alignment patterns are easy
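English: No she 's uh up north
Swahili: La yuko aa juu kaskazini
[Figure: alignment links between the English and Swahili tokens, including punctuation tokens]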
Alignment problems
nimemkatalia
I have turned him down
Morphological decomposition
I have turned him down
ni+ me+ m+ katalia
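The decomposition above can be imitated with a toy prefix stripper. This is only a sketch: the three prefix inventories are a tiny hand-picked subset of Swahili verbal prefixes chosen for this example, and the slot order (subject, tense, object) is hard-coded. It is not the analyser used for the SAWA Corpus and it will over-segment words that are not verbs of this shape.

```python
# Hypothetical, deliberately incomplete prefix inventories.
SUBJECT = ("ni", "tu", "wa", "u", "a", "m")   # subject markers
TENSE = ("me", "na", "li", "ta")              # tense/aspect markers
OBJECT = ("ni", "ku", "tu", "wa", "m")        # object markers


def strip_prefix(word, prefixes):
    """Remove the first matching prefix, provided some stem material remains."""
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p):
            return p, word[len(p):]
    return None, word


def segment(word):
    """Split a verb form into subject + tense + object + stem where possible."""
    morphemes = []
    rest = word
    for slot in (SUBJECT, TENSE, OBJECT):
        prefix, rest = strip_prefix(rest, slot)
        if prefix is not None:
            morphemes.append(prefix)
    morphemes.append(rest)
    return morphemes


print("+".join(segment("nimemkatalia")))
print("+".join(segment("umenidaka")))
```

After segmentation, each English word can be linked to a single morpheme instead of several English words competing for one long Swahili token.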
Current results
Morpheme/Word alignment
Better alignment, but more complicated decoding
Precision    Recall    F(β=1)
50.2%        64.5%     55.8%
Future work
• Projection of Annotation
• Refine GIZA++ alignment
• Part-of-speech tagger
• No data like more data: web-mining & comparable corpora
• Example-based MT (OmegaT)
• Statistical MT (Moses)
Conclusion
• Modest, but workable parallel corpus English – Swahili
• Bi-directional Machine Translation is now in the cards
• Modest, but encouraging word alignment scores
• Data-driven approach is viable for African languages