DESCRIPTION
This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme. For the latest updates, follow us on Twitter - #MosesCore
TAUS MACHINE TRANSLATION SHOWCASE
MT for Southeast Asian Languages
14:00-14:20, Wednesday, 10 April 2013
Ai Ti Aw, Institute for Infocomm Research, Singapore
Southeast Asian Language Machine Translation
Ms Ai Ti AW
Human Language Technology Department
Institute for Infocomm Research, Singapore
Localization World, Singapore, 10-12 Apr 2013 3 Localization World, Singapore, 10-12 Apr 2013
Agenda
1. Machine Translation
2. Southeast Asian Languages
3. Institute for Infocomm Research (I2R)
4. Challenges for Southeast Asian Language Translation
5. Machine Translation Applications
Pieter Brueghel the Elder (1563) (Wiki)
The Tower of Babel
Languages of the World
Each dot represents the geographic center of the 6,912 living languages in the Ethnologue database. Gordon, Raymond G., Jr. (ed.), 2005. Ethnologue: Languages of the World, Fifteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/.
Father of Translation
Xuanzang (玄奘,602‐664): First Translator in China
Http://baike.baidu.com
St. Jerome (347-420) Translation of Bible into Latin
http://mb-soft.com/believe/txn/jerome.htm
Pioneer of Machine Translation
Warren Weaver (1894-1978): Decoding
When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” (1949)
http://en.wikipedia.org/wiki/Warren_Weaver
Translation Jokes
Machine Translation
• Expert knowledge
• Translation examples
• Translation Model
• Language Model
• Translation Unit: Word, Phrase, Tree
• Linguistic Complexity: Lexical, POS, Syntax
• Decoding Algorithm
The Vauquois Triangle
direct
syntactic transfer
semantic transfer
interlingua
Translation Methodology
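• Word-to-Word Translation
• Phrase-based Translation
• Syntax-based Translation
[Figure: a Chinese-English example with parts labeled Ts, A and Tt. Source words with glosses: 把(NULL), 我(me), 给(give), 钢笔(pen), 。(.). Target: "Give the pen to me ." tagged VBP DT NN TO PRP PUNC, under S, VP, NP and PP nodes.]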
Rule-based Approach
[Figure: rule-based MT pipeline]
Source: Cerita menarik .
Target: The story is interesting .
Stages: analysis, transfer, generation
Levels: lexical, structural
Components: Lingware Interpreter, Parsing Rules, Morphological Rules, Dictionary, Structural Mapping Rules, Bilingual Dictionary, Structure Generation, Morph Generation, Language Model
Statistical-based Approach
TRAINING
• Parallel corpus → Word alignment → Statistical modeling → Translation model (TM), Re-ordering model (RM)
• Target language corpus → Language modeling → Language model (LM)
TESTING
• Source language input f → Statistical decoding → Target language output e
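The testing path above picks an output e for an input f by combining the trained models. A minimal sketch of that idea, assuming the common noisy-channel scoring log P(f|e) + log P(e): the probabilities and candidate sentences below are invented toy values, the names `TM`, `LM` and `decode` are illustrative, and the re-ordering model is left out for brevity.

```python
import math

# Hypothetical toy scores, invented for illustration only.
TM = {  # translation model: P(f | e)
    ("cerita menarik", "the story is interesting"): 0.6,
    ("cerita menarik", "interesting story the"): 0.6,
}
LM = {  # language model: P(e)
    "the story is interesting": 0.05,
    "interesting story the": 0.001,
}

def decode(f, candidates):
    """Return the candidate e that maximizes log P(f|e) + log P(e)."""
    def score(e):
        # Unseen entries get a tiny floor probability instead of zero.
        return math.log(TM.get((f, e), 1e-9)) + math.log(LM.get(e, 1e-9))
    return max(candidates, key=score)

best = decode("cerita menarik",
              ["interesting story the", "the story is interesting"])
print(best)
```

Both candidates get the same translation-model score here, so the language model decides in favor of the fluent word order. A real decoder searches a far larger hypothesis space instead of scoring a fixed candidate list.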
Southeast Asian Languages
Lao
Thai
Khmer
Myanmar
Filipino
Vietnamese
Malay
Indonesian
Chinese
English
Characteristics of Southeast Asian Languages
Language     Tone  Affix  Inflection  Re-duplication  Word Segmentation  Sentence Concept
Chinese      Yes   No     No          No              Yes                Yes
Filipino     No    Yes    Yes         Yes             No                 Yes
Indonesian   No    Yes    No          Yes             No                 Yes
Khmer        No    No     No          Yes             Yes                Yes
Lao          Yes   No     No          No              Yes                Yes
Malay        No    Yes    No          Yes             No                 Yes
Myanmar      Yes   No     Yes         No              Yes                Yes
Thai         Yes   No     No          No              Yes                No
Vietnamese   Yes   No     No          No              Yes                Yes
- Contributed by the ASEAN-MT Project
Language Processing Tools
Language (Country)       Morphological Analysis  Word Segmentation  Sentence Boundary Detection
Chinese (Singapore)      NA                      Available          NA
Filipino (Philippines)   Available               NA                 NA
Indonesian (Indonesia)   Available               NA                 NA
Khmer (Cambodia)         NA                      Available          NA
Lao (Laos)               NA                      Available          NA
Malaysian (Malaysia)     Available               NA                 NA
Myanmar (Myanmar)        Available               Available          NA
Thai (Thailand)          NA                      Available          Available
Vietnamese (Vietnam)     NA                      Available          NA
- Contributed by the ASEAN-MT Project
Research Institutes and Companies
Machine Translation Research
1989: Initiated R&D in English→Chinese MT
1990: Awarded S$2m IBM English→Chinese MT project
1992: Developed in-house English↔Malay MT
1993: Set up MT Service Unit
1997: Spin-off AsiaRain Automated Translation
2000: Commercialized MT technology: Chinese → English MT, Indonesian ↔ English MT, English → Thai MT
2004: Enhanced and constructed lexical resources; applied machine learning techniques to source text analysis
2005: Started Statistical Machine Translation
2007: Vietnamese → English MT
2010: Hybrid MT
2012: Malay→Chinese MT, Vietnamese → Chinese MT
Phrase-based SMT: Learning Heuristics
Deyi Xiong, Min Zhang and Haizhou Li. Learning Translation Boundaries for Phrase-Based Decoding. NAACL-HLT 2010
Xiangyu Duan, Min Zhang and Haizhou Li. Pseudo-word for Phrase-based Machine Translation. ACL-2010
Boxing Chen, Min Zhang and Aiti Aw. Two-Stage Hypotheses Generation for Spoken Language Translation. ACM TALP 8(1) (2009)
1) Source Phrase Segmentation
2) Phrase Translation
3) Target Phrase Reordering
• Discover effective heuristics from a limited dataset
• Phrase Segmentation Model
  - 中国的/经济/发展, 中国的/经济发展, 中国的经济/发展, …..
• From Word to Pseudo-Word and then to Phrase
  - “想” and “would like to”
  - “多少 钱” and “how much is it”
• Hypothesis Regeneration with System Combination
  - Generating new hypotheses from translation results (one or more systems)
  - Combining results and re-scoring
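The phrase segmentation step has to choose among alternative splits of the same word sequence, such as the 中国的/经济/发展 variants listed on this slide. The sketch below only enumerates that candidate space; it does not implement the learned segmentation model from the cited papers, and the function name `segmentations` is illustrative.

```python
from itertools import combinations

def segmentations(words):
    """Yield every split of `words` into contiguous phrases."""
    n = len(words)
    for k in range(n):  # k = number of internal cut points
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            # Join the words between consecutive boundaries into one phrase.
            yield ["".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]

for seg in segmentations(["中国的", "经济", "发展"]):
    print("/".join(seg))
```

A sequence of n words has 2^(n-1) such splits, which is why a decoder needs a model or heuristics to rank them instead of trying every one.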
Linguistic Syntax-based SMT
Tree Sequence-based SMT
Min Zhang, Hongfei Jiang, Aiti Aw and Haizhou Li. A Tree Sequence Alignment-based Tree-to-Tree Translation Model. ACL-2008:HLT
[Figure: BLEU-4 on NIST 05 (trained on FBIS Corpus) for SCFG, Moses, Ours: STSG and Ours: STSSG; y-axis from 0.21 to 0.27]
Forest-based SMT
Hui Zhang, Min Zhang, Haizhou Li, Aiti Aw and Chew Lim Tan. Forest-based Tree Sequence to String Translation Model. ACL-IJCNLP-2009
Hui Zhang, Min Zhang, Haizhou Li and Chew Lim Tan. Fast Translation Rule Matching for Syntax-based Statistical Machine Translation. EMNLP-2009
[Figure: BLEU-4 on NIST 05 (trained on FBIS Corpus) for Moses, Ours: TT2S, Ours: TTS2S, Ours: FT2S and Ours: FTS2S; y-axis from 0.24 to 0.3]
Exploring Semantics in Phrase-based SMT
Predicate Translation & Argument Reordering
Deyi Xiong, Min Zhang, Haizhou Li. Modeling the Translation of Predicate-Argument Structure for SMT. ACL 2012.
Discourse-based SMT (Topic Model)
Xinyan XIAO, Deyi XIONG, Min ZHANG, Qun LIU and Shouxun LIN. A Topic Similarity Model for Hierarchical Phrase-based Translation. ACL-2012
Discourse-based SMT (Document Cache Model)
• Use document-level information to choose translation candidates
Zhengxian GONG and Min ZHANG. Cache-based Document-level Statistical Machine Translation. EMNLP-2011
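One way to picture a document cache is as a memory of translations already chosen earlier in the same document, which then nudges later choices toward consistency. The sketch below illustrates that general idea only and is not the model from the cited paper; the `choose` function, the `bonus` value and the candidate scores are all hypothetical.

```python
def choose(source_phrase, candidates, cache, bonus=0.5):
    """Pick a translation, favoring one already used in this document.

    `candidates` is a list of (target_phrase, base_score) pairs and
    `cache` maps source phrases to the target chosen earlier.
    """
    def score(item):
        target, base = item
        # Add a fixed bonus when the cache already holds this translation.
        return base + (bonus if cache.get(source_phrase) == target else 0.0)
    target, _ = max(candidates, key=score)
    cache[source_phrase] = target  # remember the choice for later sentences
    return target

cache = {}
first = choose("bank", [("bank", 0.6), ("shore", 0.5)], cache)
second = choose("bank", [("bank", 0.4), ("shore", 0.7)], cache)
print(first, second)
```

In the second call the sentence-level score alone would pick "shore", but the cached earlier choice tips the decision back to "bank".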
Challenge: Overcome Low Resources
1. How to build a system with limited language resources?
2. How to leverage human translation knowledge for SMT?
3. How to improve the system when large language resources are available?
Approach
1. Given limited statistics, consider using prior linguistic knowledge to improve the statistical model
2. When we are able to craft rules, consider using a statistical approach to improve productivity
Lexical Pattern: Term Translation
• Term
  - Phrase whose structure as a whole carries a specific meaning
• Term Identification and Translation
  - Domain specific
  - Tedious and time consuming to acquire them manually for a new domain
• Skills Upgrading and Resilience Programme
• SPUR
• Program Kemahiran bagi Peningkatan dan Ketahanan
• SPUR
• 技能提升与应变计划
• 策马扬鞭
Mining Bilingual Terms
Lianhau Lee, Aiti Aw, Thuy Vu, Sharifah Aljunied Mahani, Min Zhang and Haizhou Li. “MARS: Multilingual Access and Retrieval System with Enhanced Query Translation and Document Retrieval”. ACL-IJCNLP 2009.
Lianhau Lee, Aiti Aw, Min Zhang and Haizhou Li. “EM-based Hybrid Model for Bilingual Terminology Extraction from Comparable Corpora”. COLING 2010
[Figure: two Mono Corpus inputs each pass through Monolingual Term Extraction to give Mono Terms; Document Alignment gives AlignDoc; Bilingual Term Alignment & Extraction outputs Bi-Terms]
• Ways of acquiring bilingual terms
  - Alignment on parallel sentences
  - Using web data to search for translation candidates
  - Mining from comparable corpora
  - Manual coding/analysis of new MWEs
• Our approach
  - Automatic mining of bilingual terms from comparable corpora
  - Unavailability of large parallel text
  - Easy accessibility of monolingual corpus
Parallel Sentence Extraction: Document Alignment
Thuy Vu, Ai Ti Aw, Min Zhang. 2009. Feature-based Method for Document Alignment in Comparable News Corpora. In 12th EACL 2009, Athens, Greece
[Figure: two plots with x-axis from 1 to 91. First plot: Bank Dunia, World Bank, 世界银行; y-axis from 0 to 0.2. Second plot: Dunia, World, 世界; y-axis from 0 to 0.03]
Document Alignment: Example
Hospital Changi Baru dibuka mulai bulan depan
§ Author: Nazry Mokhtar, 28/11/1996
§ [Kemudahan $312 juta dijangka jadi …] Selain kemudahan perubatan penuh, ia akan mempunyai wad bersalin dan klinik bagi merawat bayi - sama seperti Hospital Kandang Kerbau. Sebuah hospital masyarakat baru juga akan dibina berdekatan hospital tersebut untuk menjadikan NCH sebagai pusat perubatan terunggul di kawasan timur Singapura yang mampu memenuhi keperluan sekitar 750,000 penduduk di situ. Ini menjadikannya sebagai hospital daerah pertama di sini yang dibangunkan khusus bagi memenuhi pelbagai keperluan perubatan penduduk di sesuatu daerah. § [Menteri Kesihatan, Brigedier-Jeneral (Kerahan) George Yeo, berkata demikian …] Antara kemudahannya termasuk kemudahan bersalin yang dikelolakan oleh Hospital Kandang Kerbau dan kemudahan bagi rawatan psikiatri dan pemulihan. Hospital baru itu menggantikan Hospital Toa Payoh dan Hospital Changi. § BG Yeo, yang juga Menteri Penerangan dan Kesenian, berkata: "Rancangan hospital ini ialah menawarkan kemudahan perubatan lengkap sejajar dengan matlamat menjadikannya sebuah pusat perubatan terunggul di daerah timur Singapura.” Mengenai hospital masyarakat yang bakal dibina berdekatan hospital baru itu, beliau berkata ia akan melengkapi kemudahan NCH. Hospital masyarakat dengan 200 katil pesakit itu akan diuruskan oleh Hospital St Andrew's Mission dan dijangka siap menjelang tahun 2000. [BG Yeo selanjutnya berkata …] § Dalam lawatan semalam, BG Yeo yang ditemani Menteri Negara Kanan (Pendidikan dan Kesihatan), Dr Aline Wong, masing-masing menanam sebatang pokok di luar lobi hospital itu.
New Changi Hospital will be health-care hub for eastern S'pore
§ Author: Allison Lim, 28/11/1996.
§ THE New Changi Hospital will be Singapore's first purpose-built regional hospital, said Health Minister George Yeo. It will cater for up to 750,000 people who live in the east and northeast regions. [To reach out to them, it has been designed to be a meeting place …] § Brigadier-General (NS) Yeo, who is also Minister for Information and the Arts, said that the hospital will have a birthing centre for young couples living in the region. It will be run as a satellite of the Kandang Kerbau Women's and Children's Hospital. In addition, there will be satellite facilities for psychiatry, rehabilitation medicine and other medical specialities. "The whole idea is a whole range of medical facilities in a hospital that will also serve as a health-care hub for the entire region," he said of the $480-million hospital. [The regional hospital concept ….] § The minister, who was accompanied by senior officials from the Health Ministry, later planted a Chengai sapling, near the hospital entrance. Senior Minister of State (Health and Education) Aline Wong planted a Tampines sapling. [Health care will remain affordable …] § BG Yeo said that later on, a community hospital will be built next to the New Changi Hospital, between it and the Pan-Island Expressway. "In fact, plans are already being drawn up and the St Andrew's Mission Hospital will run this new community hospital which will have more than 200 beds. So in this way we will provide, close to the housing estates here, a full range of medical facilities," he said. It should be ready by 2000. [He said that the regional hospital ….] § The new regional hospital will replace Toa Payoh Hospital, which will become a community hospital, and the existing Changi Hospital. [The latter's site will be returned ….]
Document Alignment : Example
MAS profit falls 68% to $1.22b on higher rates, stronger S$
§ Author: Ericia Tay, 21/07/2006.
§ [Central bank's total assets up …] The futures market suggests that oil prices could stay at around US$80 a barrel, and while the world economy has so far been resilient, the risks of a sharper slowdown due to supply disruptions have gone up, noted Mr Heng. § Nevertheless, inflationary pressures at home "should be fairly well contained", even though the indirect effects of higher oil prices on energy-related consumer items and business costs are expected to strengthen. The MAS stuck to its earlier prediction that Singapore's economic growth this year is likely to be between 5 per cent and 7 per cent, barring unexpected shocks in the rest of the year. § "Although global IT demand growth may be capped somewhat by potentially slower growth in the United States in the second half of 2006, the prospects for continued economic growth in the quarters ahead appear intact," said Mr Heng of the outlook for Singapore. The MAS also kept its inflation forecast of between 1 per cent and 2 per cent for the whole of this year. These macroeconomic projections are based on the assumption that crude oil prices average US$68 to US$75 a barrel. § In the first half of this year, Singapore's gross domestic product (GDP) grew by an estimated 9 per cent from the same period last year. Taking into account Singapore's GDP growth and inflation prospects, the central bank said its policy stance on the Singdollar - a modest and gradual strengthening of the currency - remains appropriate. [Unlike many central banks which use interest rates as a policy tool…]
金管局看好未来数季度增长
§ Author: 罗文燕, 21/07/2006
§ [ 在中东紧张局势升温。。。] 金融管理局董事经理王瑞杰昨天在发表常年报告书的记者会上说,高油价转嫁到能源相关消费物品和商业营运成本的程度预料会提高,但整体国内通货膨胀压力应该会受到相当好的控制。尽管油价升高,金管局保持对我国今年的通胀率将介于1%到2%的预测。[ 王瑞杰说 。。。] § 根据贸工部上星期发表的预估数据,我国经济今年上半年强劲增长了9.1%。不过,下半年的。王瑞杰说:“美国经济增长可能在下半年放缓,这或许会抑制全球资讯科技需求的增长,但(新加坡)今后几个季度持续保持经济增长的前景似乎没变。” § 因此,排除地缘政治风险激增等无法预见的外来冲击,金管局预期全年的经济增长率多数会保持在5%到7%。然而,王瑞杰指出:“石油供应被中断以致经济更急速放缓的风险现在增加了。显然的,地缘政治跟油价。。。” § 中东紧张局势最近升温,已导致油价进一步升高。王瑞杰说,从期货市场的走势来看,油价预料会保持在每桶80美元左右的高水平。他说,金管局对通胀和经济的预测,有考虑到平均油价可能处于每桶65美元到78美元的价位。 § 在考虑到我国的增长和通胀前景后,王瑞杰表示,金管局认为当局目前让新元汇率继续适度及逐步增值的政策立场仍然适合。当局下一次将在10月发表半年一次的货币政策声明。
Hybrid System
Source: Beliau juga berterima kasih kepada MAS dan AirAsia kerana menyediakan penerbangan terus ke Macau, yang memudahkan MGTO untuk mempromosikan bandar itu.
SMT: He was also grateful to mas and airasia for providing direct flights to macau, which facilitate promoting the MGTO to.
MEMT: He also is thankful for MAS and Airasia for preparing flight directly to Macau, which facilitates MGTO to promote the town.
SMT+MEMT: He was also grateful to MAS and Airasia for providing direct flights to macau, which facilitates the MGTO to promote the city.
BLEU
SMT 0.4062
MEMT 0.2725
SMT+MEMT 0.4165
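The BLEU figures above compare the three systems. As a rough illustration of what the metric measures, here is a minimal single-reference, sentence-level sketch without smoothing. Reported scores are normally computed at corpus level with standard tooling, so this sketch is not expected to reproduce the numbers on the slide; the function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Single-reference, sentence-level BLEU without smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty: penalize hypotheses no longer than the reference.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)

print(bleu("a b c d x", "a b c d e"))
```

Because there is no smoothing, any hypothesis with no matching 4-gram scores zero, which is one reason sentence-level BLEU is rarely reported on its own.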
Scientific Achievements
• Papers in leading journals
  - IEEE Transactions on Audio, Speech and Language Processing
  - ACM Transactions on Asian Language Information Processing
  - Information Processing and Management
  - Computational Linguistics
• Papers in leading conferences
  - The Annual Meeting of The Association for Computational Linguistics (ACL)
  - Conference on Empirical Methods in Natural Language Processing (EMNLP)
  - International Conference on Computational Linguistics (COLING)
Baidu-I2R Research Centre
Baidu's Box Computing: Beating Google At Its Own Game (Seeking Alpha, March 27, 2012)
“… According to Baidu, 60% of search results are produced by Box Computing, which delivers interactive, relevant, and intuitive search experience that makes Baidu a clear leader in China's online search market. Unfortunately, Google has yet to catch up with Baidu on semantic search.”
“…Recently, Baidu formed a partnership with Agency for Science, Technology and Research (A*STAR) to establish an R&D center in Singapore that focuses on developing South Asian language processing technology. The joint research lab will initially focus on Vietnamese and Thai.”
Network-based Speech to Speech Translation Service
Malay-English S2S Mobile Translation
MALAY ↔ ENGLISH
- No existing commercial Malay speech recognition.
- Small footprint: compact models, can run on small devices.
Usable in many contexts
- Humanitarian and Disaster Relief Efforts
- Tourist travel
Document Translation
Multilingual Chat & Messaging
[Figure: Chat Clients, Chat Server, Normalization Bot, Translation Bot, Web Service Server, Default Dictionary and User defined dictionaries, with arrows numbered 1 to 4]
1. Chat message normalized by normalization bot.
2. Chat message sent to chat server.
3. Chat message sent to the recipient.
4. Chat message translated by the translation bot.
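The four numbered steps on this slide can be sketched as a small message pipeline. Everything below is a hypothetical stand-in: the dictionary entries are invented, `translate` only tags the text where a real system would call the MT service, and attaching the default and user defined dictionaries to the normalization step is an assumption, since the slide does not say which component consults them.

```python
# Hypothetical default entries, invented for illustration.
DEFAULT_DICTIONARY = {"u": "you", "gr8": "great"}

def normalize(message, user_dictionary=None):
    """Step 1: replace informal tokens; user entries override defaults."""
    lookup = {**DEFAULT_DICTIONARY, **(user_dictionary or {})}
    return " ".join(lookup.get(tok.lower(), tok) for tok in message.split())

def translate(message):
    """Step 4: placeholder for the translation bot (a real MT call goes here)."""
    return f"[translated] {message}"

def send(message, user_dictionary=None):
    normalized = normalize(message, user_dictionary)  # 1. normalization bot
    on_server = normalized                            # 2. sent to chat server
    delivered = on_server                             # 3. sent to the recipient
    return translate(delivered)                       # 4. translation bot

print(send("u r gr8"))
```

Normalizing before translation means the translation bot sees standard spellings, which matters because chat shorthand is rarely covered by MT training data.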