Adaptation in Statistical Machine Translation for Low-Resource Domains in English-Vietnamese Language
In this paper, we propose a new method for domain adaptation in Statistical Machine
Translation (SMT) for low-resource domains in the English-Vietnamese language pair.
Specifically, our method uses only monolingual data to adapt the translation phrase-table,
and the adapted system improves over the SMT baseline system. The method has two steps:
(i) classify the phrases on the target side of the translation phrase-table using a
probabilistic classifier model, and (ii) adapt the phrase-table by recomputing the direct
translation probability of its phrases.
Our experiments translate from English to Vietnamese on two very different domains: the
legal domain (out-of-domain) and the general domain (in-domain). The English-Vietnamese
parallel corpus is provided by the IWSLT 2015 organizers. The experimental results show
that our method significantly outperforms the baseline system: it improves translation
quality in the legal domain by up to 0.9 BLEU points over the baseline system.
VNU Journal of Science: Comp. Science & Com. Eng., Vol. 36, No. 1 (2020) 46-56

Original Article

Nghia-Luan Pham1,2,*, Van-Vinh Nguyen2

1Hai Phong University, 171 Phan Dang Luu, Kien An, Haiphong, Vietnam
2Faculty of Information Technology, VNU University of Engineering and Technology, Vietnam National University, Hanoi, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

Received 09 April 2019
Revised 19 May 2019; Accepted 13 December 2019

* Corresponding author. E-mail address: luanpn@dhhp.edu.vn
https://doi.org/10.25073/2588-1086/vnucsce.231

Keywords: Machine Translation, Statistical Machine Translation, Domain Adaptation.

1. Introduction

Statistical Machine Translation (SMT) systems [1] are usually trained on large amounts of bilingual data and monolingual target-language data. In general, these corpora may include quite heterogeneous topics, and these topics usually define a set of terminological lexicons. Terminology needs to be translated taking into account the semantic context in which it appears.

The Neural Machine Translation (NMT) approach [2] has recently been proposed for machine translation. However, NMT requires a large amount of parallel data, is computationally costly and resource-intensive, and requires much more training time than an SMT system [3]. Therefore, SMT systems are still being studied for specific domains in low-resource language pairs.

Monolingual data are usually available in large amounts, whereas parallel data are scarce for most language pairs. Collecting sufficiently large, high-quality parallel data is hard, especially for domain-specific data. For this reason, most languages in the world are low-resource for statistical machine translation, including the English-Vietnamese language pair.

When an SMT system is trained on a small amount of domain-specific data, its lexical coverage is narrow, which in turn results in low translation quality. On the other hand, SMT systems that are trained and tuned on domain-specific data perform well on the corresponding domains, but their performance deteriorates on out-of-domain sentences [4]. Therefore, SMT systems often suffer from domain adaptation problems in practical applications. When the test data and the training data come from the same domain, SMT systems can achieve good quality. Otherwise, the translation quality degrades dramatically. Domain adaptation is therefore important for developing translation systems that can be effectively transferred from one domain to another.
In recent years, the domain adaptation problem in SMT has become more important [5] and is an active field of research, with more and more techniques being proposed and applied in practice [5-12]. Common techniques adapt the two main components of contemporary state-of-the-art SMT systems: the language model and the translation model. In addition, there are also some proposals for adapting Neural Machine Translation (NMT) systems to a new domain [13, 14]. Although NMT has begun to be studied more, domain adaptation for SMT still plays an important role, especially for low-resource languages.

This paper presents a new method to adapt the translation phrase-table of the SMT system. Our experiments were conducted for the English-Vietnamese language pair in the direction from English to Vietnamese. We use a corpus comprising two specific domains: legal and general. The data were collected from documents, dictionaries and the IWSLT 2015 organisers for the English-Vietnamese translation task.

In our work, we train a translation model on a parallel corpus in the general domain and a probabilistic classifier model on a monolingual corpus in the legal domain. We then use the classification probability of each phrase on the target side of the phrase translation table to recompute the direct translation probability in that table. This is the first adaptation method for the phrase translation table of the SMT system, especially for low-resource language pairs such as English-Vietnamese. For comparison, we also train a baseline SMT system and a Neural Machine Translation (NMT) system. Experimental results showed that our method significantly outperforms the baseline system. Our system improved the translation quality of the machine translation system on the out-of-domain d ... 9

Additionally, we use 500 parallel sentences in the legal domain for the English-Vietnamese pair as the test set.
In-domain data: We use the parallel corpora in the general domain to train the SMT system. These data sets are provided by the IWSLT 2015 organisers for the English-Vietnamese translation task and consist of 122132 parallel sentences for the training set, 745 parallel sentences for the development set and 1046 parallel sentences for the test set.

Preprocessing: Data preprocessing is very important in any data-driven method. We carried out preprocessing in two steps:

• Cleaning data: We performed cleaning in two phases. Phase 1 follows the cleaning process described in [18]; phase 2 uses the corpus cleaning scripts in the Moses toolkit [19] with the minimum and maximum number of tokens set to 1 and 80 respectively.

• Word segmentation: In English, whitespaces are used to separate words [20], but Vietnamese does not have morphology [20, 21] and whitespaces are not used to separate words. The smallest meaningful part of Vietnamese orthography is a syllable [22]. Some examples of Vietnamese words are as follows. Single words: “nhà” - house, “nhặt” - pick up, “mua” - buy and “bán” - sell. Compound words: “mua-bán” - buy and sell, “bàn-ghế” - table and chair, “cây-cối” - trees, “đường-xá” - street, “hành-chính” - administration. Thus, a word in Vietnamese may consist of several syllables separated by whitespaces. We used the vntokenizer toolkit [23], a popular segmentation toolkit for Vietnamese, to segment the Vietnamese data sets, and the tokenizer script in Moses to segment the English data sets.

4.2. Experiments

We performed experiments on the Baseline-SMT and Adaptation-SMT systems:

• The Baseline-SMT is the SMT baseline system. It is a phrase-based statistical machine translation system with standard settings in the Moses toolkit [24], a state-of-the-art open-source phrase-based SMT system.
In our systems, the weights of the feature functions were optimized using MERT [25]. The Baseline-SMT system is trained on the general-domain (in-domain) data set and is evaluated in turn on the General-test and Legal-test data sets.

• The Adaptation-SMT system is the Baseline-SMT system after its translation model has been adapted by recomputing the direct translation probability (e|f) of the phrases in the phrase translation table. The Adaptation-SMT system is evaluated on the Legal-test data set.

Table 3. Some examples in our experiments.

We trained a 4-gram language model with Kneser-Ney smoothing for all the experiments, using SRILM [26] as the language model toolkit. To evaluate the translation quality of the Baseline-SMT and Adaptation-SMT systems, we use the BLEU score [27].

For comparison, we also built a Neural Machine Translation (NMT) system using the OpenNMT toolkit [28]. The NMT system is trained with the default model, which consists of a 2-layer LSTM with 500 hidden units in both the encoder and the decoder.

4.2.1. Results

Table 2. The experimental results of the Baseline-SMT system and Adaptation-SMT system

SYSTEM                              BLEU SCORE
Baseline-SMT (on General-test)      31.3
Baseline-SMT (on Legal-test)        28.8
Adaptation-SMT (on Legal-test)      29.7
Baseline-NMT (on General-test)      30.1
Baseline-NMT (on Legal-test)        20.9

Table 2 shows that, when the baseline systems (the SMT and NMT systems) are trained on the general-domain data set and the test data set is in the same domain as the training data (here, the General-test data set), the BLEU score is 31.3 for the Baseline-SMT system and 30.1 for the Baseline-NMT system. If the test data set is in the legal domain (here, the Legal-test data set), the BLEU score is 28.8 for the Baseline-SMT system and 20.9 for the Baseline-NMT system.
Table 2 also shows that, when the SMT system is trained on the general domain and the test domain differs from the training domain, translation quality is reduced. In these experiments, the BLEU score dropped by 2.5 points, from 31.3 to 28.8. Adapting the system with our technique improves the quality of the translation system: the BLEU score of the Adaptation-SMT system improved by 0.9 points, from 28.8 to 29.7. The experimental results also show that the SMT system obtains better results than the NMT system when the translation systems are trained with the same low-resource domains of the English-Vietnamese language pair, such as the legal domain and some other domains.

4.2.2. Analysis and discussion

Figure 6. Examples of the direct translation probability of the phrase in the phrase-table.

Table 3 gives some examples of the systems translating source sentences in the legal domain from English to Vietnamese. In the third sentence, the phrase “renewable” in the context “renewable certificates valid for five years were granted by the construction departments of cities and provinces” (source sentence column) should be translated as “gia-hạn”, as in the reference sentence. However, the Baseline-SMT system translated “renewable” as “tái-tạo”, the NMT system also translated it as “tái-tạo”, and Google Translate translated it as “tái tạo”, whereas the Adaptation-SMT system translated it as “gia-hạn”, like the reference sentence.

First, the Baseline-SMT system translated “renewable” as “tái-tạo” because the direct translation probability of this translation (4th column in Figure 6) is the highest in the phrase-table of the Baseline-SMT system (0.454545), while the direct translation probability of “gia-hạn” is lower (0.0909091). Therefore, when the SMT system combines its component models as in formula 1, “tái-tạo” is more likely to be chosen than “gia-hạn”.
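To make the ranking concrete, the following minimal sketch takes the two probabilities quoted above and shows how weighting them by a domain-classifier score could reverse their order. It is only an illustration under stated assumptions: the multiply-and-renormalise rule in `rescale_by_domain` and the classifier probabilities 0.1 and 0.9 are hypothetical and are not taken from the paper, whose actual recomputation formula is not shown in this excerpt.

```python
from typing import Dict

# Direct translation probabilities for the source phrase "renewable".
# Only the two values quoted in the text are used; any other phrase-table
# candidates are omitted.
baseline_direct_prob: Dict[str, float] = {
    "tái-tạo": 0.454545,
    "gia-hạn": 0.0909091,
}

# ASSUMPTION: made-up legal-domain classifier probabilities for each target
# phrase. The real classifier outputs are not given in this excerpt.
assumed_legal_prob: Dict[str, float] = {
    "tái-tạo": 0.1,
    "gia-hạn": 0.9,
}


def best_target(direct_prob: Dict[str, float]) -> str:
    """Return the target phrase with the highest direct translation probability."""
    return max(direct_prob, key=lambda phrase: direct_prob[phrase])


def rescale_by_domain(
    direct_prob: Dict[str, float], domain_prob: Dict[str, float]
) -> Dict[str, float]:
    """ASSUMED rule: weight each probability by the classifier score,
    then renormalise so that the values sum to 1."""
    weighted = {p: direct_prob[p] * domain_prob[p] for p in direct_prob}
    total = sum(weighted.values())
    return {p: w / total for p, w in weighted.items()}


adapted_direct_prob = rescale_by_domain(baseline_direct_prob, assumed_legal_prob)
print(best_target(baseline_direct_prob))  # tái-tạo
print(best_target(adapted_direct_prob))   # gia-hạn
```

In a real decoder the direct translation probability is only one of several component models, so a higher value makes a translation more likely to be chosen but does not by itself decide the output.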
Next, the phrase classification model is applied to compute the probabilities of the phrases “gia-hạn” and “renewable” in the legal domain. The probability of “gia-hạn” is higher, so this value is written to the phrase-table and the direct translation probabilities (e|f) of the phrases are recomputed. As a result, the Adaptation-SMT system translated “renewable” as “gia-hạn”.

Other examples in Table 4.2 show that the translation quality of the Adaptation-SMT system is better than that of the Baseline-SMT system and that, for low-resource translation domains in the English-Vietnamese language pair, the SMT system has more advantages than the NMT system.

5. Conclusions and future work

In this paper, we presented a new method for domain adaptation in Statistical Machine Translation for low-resource domains in the English-Vietnamese language pair. Our method uses only monolingual out-of-domain data to adapt the phrase-table by recomputing the direct translation probability (e|f) of its phrases. Our system obtained an improvement in machine translation quality in the legal domain of up to 0.9 BLEU points over the baseline. Experimental results show that our method is effective in improving the accuracy of the translation.

In future work, we intend to study this problem in other domains, investigate the benefits of word embeddings for phrase classification, and integrate our technique automatically into the decoder of the SMT system.

References

[1] Philipp Koehn, Franz Josef Och, Daniel Marcu, Statistical phrase-based translation, In Proceedings of HLT-NAACL, Edmonton, Canada, 2003, pp. 127-133.
[2] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean, Google’s neural machine translation system: Bridging the gap between human and machine translation, CoRR, abs/1609.08144, 2016.
[3] Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, Marcello Federico, Neural versus phrase-based machine translation quality: A case study, 2016.
[4] Barry Haddow, Philipp Koehn, Analysing the effect of out-of-domain data on SMT systems, In Proceedings of the Seventh Workshop on Statistical Machine Translation, 2012, pp. 422-432.
[5] Boxing Chen, Roland Kuhn, George Foster, Vector space model for adaptation in statistical machine translation, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013, pp. 1285-1293.
[6] Daniel Dahlmeier, Hwee Tou Ng, Siew Mei Wu, Building a large annotated corpus of learner English: The NUS corpus of learner English, In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications, 2013.
[7] Eva Hasler, Phil Blunsom, Philipp Koehn, Barry Haddow, Dynamic topic adaptation for phrase-based MT, In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 2014, pp. 328-337.
[8] George Foster, Roland Kuhn, Mixture-model adaptation for SMT, Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Association for Computational Linguistics, 2007, pp. 128-135.
[9] George Foster, Boxing Chen, Roland Kuhn, Simulating discriminative training for linear mixture adaptation in statistical machine translation, Proceedings of the MT Summit, 2013.
[10] Hoang Cuong, Khalil Sima’an, Ivan Titov, Adapting to all domains at once: Rewarding domain invariance in SMT, Transactions of the Association for Computational Linguistics (TACL), 2016.
[11] Ryo Masumura, Taichi Asami, Takanobu Oba, Hirokazu Masataki, Sumitaka Sakauchi, Akinori Ito, Hierarchical latent words language models for robust modeling to out-of-domain tasks, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1896-1901.
[12] Chenhui Chu, Raj Dabre, Sadao Kurohashi, An empirical comparison of simple domain adaptation methods for neural machine translation, 2017.
[13] Markus Freitag, Yaser Al-Onaizan, Fast domain adaptation for neural machine translation, 2016.
[14] Jia Xu, Yonggang Deng, Yuqing Gao, Hermann Ney, Domain dependent statistical machine translation, In Proceedings of the MT Summit XI, 2007, pp. 515-520.
[15] Hua Wu, Haifeng Wang, Chengqing Zong, Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora, In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, 2008, pp. 993-1000.
[16] Adam Berger, Stephen Della Pietra, Vincent Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics, 22, 1996.
[17] Santanu Pal, Sudip Naskar, Josef van Genabith, UdS-Sant: English-German hybrid machine translation system, In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, September, Association for Computational Linguistics, 2015, pp. 152-157.
[18] Louis Onrust, Antal van den Bosch, Hugo Van hamme, Improving cross-domain n-gram language modelling with skipgrams, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016, pp. 137-142.
[19] Mark Aronoff, Kirsten Fudeman, What is morphology, Vol. 8, John Wiley and Sons, 2011.
[20] Laurence C. Thompson, The problem of the word in Vietnamese, Journal of the International Linguistic Association 19(1) (1963) 39-52. https://doi.org/10.1080/00437956.1963.11659787.
[21] Binh N. Ngo, The Vietnamese language learning framework, Journal of Southeast Asian Language Teaching 10 (2001) 1-24.
[22] Le Hong Phuong, Nguyen Thi Minh Huyen, Azim Roussanaly, Ho Tuong Vinh, A hybrid approach to word segmentation of Vietnamese texts, 2008.
[23] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, Evan Herbst, Moses: Open source toolkit for statistical machine translation, In ACL-2007: Proceedings of demo and poster sessions, Prague, Czech Republic, 2007, pp. 177-180.
[24] Franz Josef Och, Minimum error rate training in statistical machine translation, In Proceedings of ACL, 2003, pp. 160-167.
[25] Andreas Stolcke, SRILM - an extensible language modeling toolkit, In Proceedings of the International Conference on Spoken Language Processing, 2002.
[26] Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, BLEU: A method for automatic evaluation of machine translation, ACL, 2002.
[27] G. Klein, Y. Kim, Y. Deng, J. Senellart, A.M. Rush, OpenNMT: Open-source toolkit for neural machine translation, ArXiv e-prints.
[28] Pratyush Banerjee, Jinhua Du, Baoli Li, Sudip Kr. Naskar, Andy Way, Josef van Genabith, Combining multi-domain statistical machine translation models using automatic classifiers, In Proceedings of AMTA 2010, 2010.