VNU Journal of Science: Comp. Science & Com. Eng, Vol. 36, No. 1 (2020) 46-56 
Original Article 
Adaptation in Statistical Machine Translation 
for Low-resource Domains in English-Vietnamese Language 
Nghia-Luan Pham1,2,*, Van-Vinh Nguyen2 
1Hai Phong University, 171 Phan Dang Luu, Kien An, Haiphong, Vietnam 
2Faculty of Information Technology, VNU University of Engineering and Technology, 
Vietnam National University, Hanoi, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam 
Received 09 April 2019 
Revised 19 May 2019; Accepted 13 December 2019 
Abstract: In this paper, we propose a new method for domain adaptation in Statistical Machine Translation (SMT) for low-resource domains in the English-Vietnamese language pair. Our method uses only monolingual data to adapt the translation phrase table, and the resulting system improves over the SMT baseline. We propose two steps to improve the quality of the SMT system: (i) classify phrases on the target side of the translation phrase table using a probability classifier model, and (ii) adapt the phrase table by recomputing the direct translation probability of phrases.
Our experiments are conducted in the English-to-Vietnamese translation direction on two very different domains: the legal domain (out-of-domain) and the general domain (in-domain). The English-Vietnamese parallel corpus is provided by the IWSLT 2015 organisers, and the experimental results show that our method significantly outperforms the baseline system. Our system improves the quality of machine translation in the legal domain by up to 0.9 BLEU points over the baseline system.
Keywords: Machine Translation, Statistical Machine Translation, Domain Adaptation. 
1. Introduction
Statistical Machine Translation (SMT) 
systems [1] are usually trained on large 
amounts of bilingual data and monolingual 
target language data. In general, these corpora 
_______ 
* Corresponding author. 
 E-mail address: luanpn@dhhp.edu.vn 
 https://doi.org/10.25073/2588-1086/vnucsce.231 
may include quite heterogeneous topics and 
these topics usually define a set of 
terminological lexicons. Terminologies need to 
be translated taking into account the semantic 
context in which they appear. 
The Neural Machine Translation (NMT) approach [2] has recently been proposed for machine translation. However, NMT requires a large amount of parallel data, is computationally costly and resource-intensive, and needs much more training time than SMT [3]. Therefore, SMT systems are still being studied for specific domains in low-resource language pairs.
Monolingual data are usually available in large amounts, whereas parallel data are scarce for most language pairs. Collecting sufficiently large, high-quality parallel data is hard, especially for domain-specific data. For this reason, most languages in the world are low-resource for statistical machine translation, including the English-Vietnamese language pair.
When an SMT system is trained on a small amount of domain-specific data, the resulting narrow lexical coverage leads to low translation quality. On the other hand, SMT systems that are trained and tuned on domain-specific data perform well on the corresponding domains, but performance deteriorates on out-of-domain sentences [4]. Therefore, SMT systems often suffer from domain adaptation problems in practical applications. When the test data and the training data come from the same domain, an SMT system can achieve good quality; otherwise, translation quality degrades dramatically. Domain adaptation is therefore of significant importance for developing translation systems that can be effectively transferred from one domain to another.
In recent years, the domain adaptation problem in SMT has become more important [5] and is an active field of research, with more and more techniques being proposed and applied in practice [5-12]. The common techniques adapt the two main components of contemporary state-of-the-art SMT systems: the language model and the translation model. In addition, there are also some proposals for adapting Neural Machine Translation (NMT) systems to a new domain [13, 14]. Although NMT has begun to be studied more, domain adaptation for SMT still plays an important role, especially for low-resource languages.
This paper presents a new method to adapt the translation phrase table of an SMT system. Our experiments were conducted for the English-Vietnamese language pair in the English-to-Vietnamese direction. We use domain-specific corpora covering two domains: legal and general. The data have been collected from documents, dictionaries and the IWSLT 2015 organisers for the English-Vietnamese translation task.
In our work, we train a translation model on a general-domain parallel corpus, then train a probability classifier model on a legal-domain monolingual corpus, and use the classification probability of each phrase on the target side of the phrase translation table to recompute the direct translation probability in that table. This is the first adaptation method of this kind for the phrase translation table of an SMT system for a low-resource language pair such as English-Vietnamese. For comparison, we train a baseline SMT system and a Neural Machine Translation (NMT) system. Experimental results showed that our method significantly outperforms the baseline system.
Our system improved the translation quality of 
the machine translation system on the out-of-
domain d ... 9 
Additionally, we use 500 English-Vietnamese parallel sentences in the legal domain as the test set.
In-domain data: We use parallel corpora in the general domain to train the SMT system. These data sets are provided by the IWSLT 2015 organisers for the English-Vietnamese translation task and consist of 122,132 parallel sentences for the training set, 745 parallel sentences for the development set and 1,046 parallel sentences for the test set.
Preprocessing: Data preprocessing is very important in any data-driven method. We carried out preprocessing in two steps:
• Cleaning Data: We performed cleaning in two phases. Phase 1 follows the cleaning process described in [18], and phase 2 uses the corpus cleaning scripts in the Moses toolkit [19] with the minimum and maximum number of tokens set to 1 and 80 respectively (a minimal sketch of this length filter is given after the next item).
• Word Segmentation: In English, whitespaces are used to separate words [20], but Vietnamese is an isolating language with essentially no inflectional morphology [21, 20], and whitespaces do not separate words. The smallest meaningful unit of Vietnamese orthography is the syllable [22]. Some examples of Vietnamese words are: single words such as "nhà" (house), "nhặt" (pick up), "mua" (buy) and "bán" (sell), and compound words such as "mua-bán" (buy and sell), "bàn-ghế" (table and chair), "cây-cối" (trees), "đường-xá" (street) and "hành-chính" (administration). Thus, a word in Vietnamese may consist of several syllables separated by whitespaces.
We used the vntokenizer toolkit [23], a segmentation toolkit that is quite popular for Vietnamese, to segment the Vietnamese data sets, and we used the tokenizer script in Moses to tokenize the English data sets.
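As a concrete illustration of the phase-2 filter described above, the following minimal Python sketch keeps only sentence pairs in which both sides contain between 1 and 80 tokens, which is what the Moses cleaning script is configured to do here. The file names are hypothetical, and phase-1 cleaning and word segmentation are assumed to have already been applied.

```python
# Minimal sketch of the phase-2 length filter (1-80 tokens per side),
# mirroring the behaviour of Moses' corpus cleaning script.
# Input files are assumed to be tokenized/segmented, one sentence per line.

MIN_TOKENS, MAX_TOKENS = 1, 80

def clean_parallel_corpus(src_in, tgt_in, src_out, tgt_out):
    kept = 0
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as gs, \
         open(tgt_out, "w", encoding="utf-8") as gt:
        for src_line, tgt_line in zip(fs, ft):
            src_len = len(src_line.split())
            tgt_len = len(tgt_line.split())
            # Keep the pair only if both sides fall within the length limits.
            if MIN_TOKENS <= src_len <= MAX_TOKENS and MIN_TOKENS <= tgt_len <= MAX_TOKENS:
                gs.write(src_line)
                gt.write(tgt_line)
                kept += 1
    return kept

# Example call with hypothetical file names:
# clean_parallel_corpus("train.en", "train.vi", "train.clean.en", "train.clean.vi")
```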
4.2. Experiments 
We performed experiments with the Baseline-SMT and Adaptation-SMT systems:
• The Baseline-SMT system is a phrase-based statistical machine translation system built with the standard settings of the Moses toolkit [24], a state-of-the-art open-source phrase-based SMT system. In our systems, the weights of the feature functions were optimized using MERT [25]. The Baseline-SMT system is trained on the general-domain (in-domain) data set and is evaluated on both the General-test and Legal-test data sets.
Table 3. Some examples in our experiments.
• The Adaptation-SMT system is the Baseline-SMT system after its translation model has been adapted by recomputing the direct translation probability φ(e|f) of the phrases in the phrase translation table, as sketched below; the Adaptation-SMT system is evaluated on the Legal-test data set.
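The exact recomputation formula appears in a part of the paper not shown in this excerpt, so the following Python sketch only illustrates the general idea under an assumed scheme: every target-side phrase of a Moses-style phrase table is scored by the in-domain classifier, the direct translation probability φ(e|f) is rescaled by that score, and the rescaled values are renormalized over all targets of the same source phrase. The classifier stub in_domain_prob, its toy word list and the score layout are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical stand-in for the trained probability classifier: a real system
# would return P(legal domain | target phrase) from a model trained on the
# monolingual legal corpus.
LEGAL_WORDS = {"gia_hạn", "giấy", "chứng_nhận"}  # toy in-domain vocabulary

def in_domain_prob(target_phrase):
    return 0.9 if any(tok in LEGAL_WORDS for tok in target_phrase.split()) else 0.1

def adapt_phrase_table(in_path, out_path, phi_index=2):
    """Rescale phi(e|f) of every entry by the classifier score of its target
    side, then renormalize over all targets of the same source phrase.
    phi_index=2 assumes the standard Moses score layout:
    phi(f|e) lex(f|e) phi(e|f) lex(e|f)."""
    entries = []
    mass = defaultdict(float)
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            src, tgt = fields[0], fields[1]
            scores = [float(s) for s in fields[2].split()]
            new_phi = scores[phi_index] * in_domain_prob(tgt)
            mass[src] += new_phi
            entries.append((src, tgt, scores, new_phi, fields[3:]))
    with open(out_path, "w", encoding="utf-8") as out:
        for src, tgt, scores, new_phi, rest in entries:
            if mass[src] > 0:
                scores[phi_index] = new_phi / mass[src]
            out.write(" ||| ".join([src, tgt, " ".join("%g" % s for s in scores)] + rest) + "\n")
```

The adapted table would then simply replace the original phrase table when the decoder is run on legal-domain input.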
We train a 4-gram language model with Kneser-Ney smoothing, which is used in all the experiments; we used SRILM [26] as the language model toolkit. To evaluate the translation quality of the Baseline-SMT and Adaptation-SMT systems, we use the BLEU score [27].
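The paper does not state which BLEU implementation is used; as one illustration only, corpus-level BLEU over tokenized output can be computed with NLTK as follows (the sentences are toy examples).

```python
from nltk.translate.bleu_score import corpus_bleu

def corpus_bleu_percent(hypotheses, references):
    """hypotheses: list of tokenized system outputs;
    references: one tokenized reference translation per sentence.
    corpus_bleu expects a list of reference sets per sentence, hence the nesting."""
    return 100.0 * corpus_bleu([[ref] for ref in references], hypotheses)

# Toy usage with already-segmented Vietnamese output (hypothetical sentences):
hyps = [["việc", "gia_hạn", "giấy", "chứng_nhận", "có", "giá_trị"]]
refs = [["việc", "gia_hạn", "giấy", "chứng_nhận", "có", "giá_trị"]]
print(corpus_bleu_percent(hyps, refs))  # 100.0 for an exact match
```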
For comparison, we also built a Neural Machine Translation (NMT) system using the OpenNMT toolkit [28]. The NMT system is trained with the default model, which consists of a 2-layer LSTM with 500 hidden units in both the encoder and the decoder.
4.2.1. Results 
Table 2. The experimental results of the Baseline-SMT and Adaptation-SMT systems
System                          BLEU score
Baseline-SMT (on General-test)  31.3
Baseline-SMT (on Legal-test)    28.8
Adaptation-SMT (on Legal-test)  29.7
Baseline-NMT (on General-test)  30.1
Baseline-NMT (on Legal-test)    20.9
Table 2 shows that when the baseline systems (the SMT and NMT systems) are trained on the general-domain data set and the test set (the General-test data set) is in the same domain as the training data, the BLEU score is 31.3 for the Baseline-SMT system and 30.1 for the Baseline-NMT system. When the test set is in the legal domain (the Legal-test data set), the BLEU score is 28.8 for the Baseline-SMT system and 20.9 for the Baseline-NMT system.
Table 2 also shows that when the SMT system is trained on the general domain and the test domain differs from the training domain, translation quality is reduced: in these experiments, the BLEU score dropped by 2.5 points, from 31.3 to 28.8. The Adaptation-SMT system, adapted with our technique, improves the quality of translation: in these experiments, the BLEU score improved by 0.9 points, from 28.8 to 29.7.
The experimental results also show that the SMT system achieves better results than the NMT system when both translation systems are trained on the same low-resource domains of the English-Vietnamese language pair, such as the legal domain.
4.2.2. Analysis and discussion 
Figure 6. Examples of the direct translation probability of a phrase in the phrase table.
Table 3 shows some examples in which the systems translate legal-domain source sentences from English to Vietnamese. In the third sentence, the phrase "renewable" in the context "renewable certificates valid for five years were granted by the construction departments of cities and provinces" (source sentence column) should have been translated into "gia-hạn", as in the reference sentence, but the Baseline-SMT system translated "renewable" into "tái-tạo", the NMT system also translated it into "tái-tạo", Google Translate translated it into "tái tạo", and the Adaptation-SMT system translated "renewable" into "gia-hạn", matching the reference sentence.
First, the Baseline-SMT system translated the phrase "renewable" into "tái-tạo" because the direct translation probability of this entry in the phrase table of the Baseline-SMT system (4th column in Figure 6) is the highest (0.454545), while the direct translation probability into "gia-hạn" is lower (0.0909091). Therefore, when the SMT system combines its component models as in Formula 1, "tái-tạo" is more likely to be chosen than "gia-hạn".
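Formula 1 itself is on a page not included in this excerpt; in phrase-based SMT it presumably refers to the standard log-linear combination of the component models, which can be written as

```latex
\hat{e} = \arg\max_{e} \sum_{i=1}^{n} \lambda_i \, h_i(e, f)
```

where the feature functions h_i include the direct translation probability φ(e|f), the other translation-model scores and the language model score, and the weights λ_i are those tuned with MERT. With fixed weights, the larger φ(e|f) of "tái-tạo" pushes the decoder toward that choice.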
Then we apply the phrase classification model to compute the in-domain probability of the candidate target phrases in the legal domain; the probability of "gia-hạn" is the higher one, so this value is used to update the phrase table and the direct translation probabilities φ(e|f) of the phrases are recomputed. As a result, the Adaptation-SMT system translated the phrase "renewable" into "gia-hạn".
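To make the effect of this update concrete, the toy calculation below uses the φ(e|f) values quoted from Figure 6; the classifier probabilities are invented for illustration, and the rescale-and-renormalize step follows the assumed scheme sketched in Section 4.2 rather than the paper's exact formula.

```python
# phi(e|f) values quoted from Figure 6 for the source phrase "renewable".
phi = {"tái-tạo": 0.454545, "gia-hạn": 0.0909091}

# Invented classifier scores P(legal domain | target phrase), for illustration only.
p_legal = {"tái-tạo": 0.05, "gia-hạn": 0.90}

rescaled = {t: phi[t] * p_legal[t] for t in phi}
z = sum(rescaled.values())
adapted = {t: round(rescaled[t] / z, 3) for t in rescaled}
print(adapted)  # {'tái-tạo': 0.217, 'gia-hạn': 0.783} -> "gia-hạn" now wins
```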
Other examples in Table 3 show that the translation quality of the Adaptation-SMT system is better than that of the Baseline-SMT system, and that for low-resource translation domains in the English-Vietnamese language pair the SMT system has advantages over the NMT system.
5. Conclusions and future work
In this paper, we presented a new method for domain adaptation in Statistical Machine Translation for low-resource domains in the English-Vietnamese language pair. Our method uses only monolingual out-of-domain data to adapt the phrase table by recomputing the phrases' direct translation probability φ(e|f). Our system improved the quality of machine translation in the legal domain by up to 0.9 BLEU points over the baseline. Experimental results show that our method is effective in improving translation accuracy.
In the future, we intend to study this problem in other domains, to investigate the benefits of word embeddings in phrase classification, and to integrate our technique automatically into the decoder of the SMT system.
References 
[1] Philipp Koehn, Franz Josef Och, Daniel Marcu, 
Statistical phrase-based translation, In 
Proceedings of HLT-NAACL, Edmonton, 
Canada, 2003, pp. 127-133. 
[2] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc 
V. Le, Mohammad Norouzi, Wolfgang Macherey, 
Maxim Krikun, Yuan Cao, Qin Gao, Klaus 
Macherey, Jeff Klingner, Apurva Shah, Melvin 
Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan 
Gouws, Yoshikiyo Kato, Taku Kudo, Hideto 
Kazawa, Keith Stevens, George Kurian, Nishant 
Patil, Wei Wang, Cliff Young, Jason Smith, Jason 
Riesa, Alex Rudnick, Oriol Vinyals, Greg 
Corrado, Macduff Hughes and Jeffrey Dean, 
Google’s neural machine translation system: 
Bridging the gap between human and machine 
translation, CoRR, abs/1609.08144, 2016. 
[3] Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo 
and Marcello Federico, Neural versus phrase-based 
machine translation quality: A case study, 2016. 
[4] Barry Haddow, Philipp Koehn, Analysing the effect 
of out-of-domain data on smt systems, In 
Proceedings of the Seventh Workshop on Statistical 
Machine Translation, 2012, pp. 422-432. 
[5] Boxing Chen, Roland Kuhn and George Foster, 
Vector space model for adaptation in statistical 
machine translation, Proceedings of the 51st 
Annual Meeting of the Association for 
Computational Linguistics, 2013, pp. 1285-1293. 
[6] Daniel Dahlmeier, Hwee Tou Ng, Siew Mei Wu, Building a large annotated corpus of learner English: The NUS Corpus of Learner English, In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications, 2013.
[7] Eva Hasler, Phil Blunsom, Philipp Koehn and 
Barry Haddow, Dynamic topic adaptation for 
phrase-based mt, In Proceedings of the 14th 
Conference of the European Chapter of The 
Association for Computational Linguistics, 2014, 
pp. 328-337. 
[8] George Foster, Roland Kuhn, Mixture-model 
adaptation for smt, Proceedings of the Second 
Workshop on Statistical Machine Translation, 
Prague, Association for Computational 
Linguistics, 2007, pp. 128-135. 
[9] George Foster, Boxing Chen, Roland Kuhn, 
Simulating discriminative training for linear 
mixture adaptation in statistical machine 
translation, Proceedings of the MT Summit, 2013. 
[10] Hoang Cuong, Khalil Sima’an, and Ivan Titov, 
Adapting to all domains at once: Rewarding 
domain invariance in smt, Proceedings of the 
Transactions of the Association for Computational 
Linguistics (TACL), 2016. 
[11] Ryo Masumura, Taichi Asam, Takanobu Oba, 
Hirokazu Masataki, Sumitaka Sakauchi, and 
Akinori Ito, Hierarchical latent words language 
models for robust modeling to out-of domain 
tasks, Proceedings of the 2015 Conference on 
Empirical Methods in Natural Language 
Processing, 2015, pp. 1896-1901 
[12] Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 
An empirical comparison of simple domain 
adaptation methods for neural machine 
translation, 2017. 
[13] Markus Freitag, Yaser Al-Onaizan, Fast domain 
adaptation for neural machine translation, 2016. 
[14] Jia Xu, Yonggang Deng, Yuqing Gao and 
Hermann Ney, Domain dependent statistical 
machine translation, In Proceedings of the MT 
Summit XI, 2007, pp. 515-520. 
[15] Hua Wu, Haifeng Wang Chengqing Zong, 
Domain adaptation for statistical machine 
translation with domain dictionary and 
monolingual corpora, In Proceedings of the 22nd 
International Conference on Computational 
Linguistics (Coling 2008), Manchester, UK, 2008, 
pp. 993-1000. 
[16] Adam Berger, Stephen Della Pietra, and Vincent 
Della Pietra, A maximum entropy approach to 
natural language processing, Computational 
Linguistics, 22, 1996. 
[18] Santanu Pal, Sudip Naskar, Josef van Genabith, UdS-Sant: English-German hybrid machine translation system, In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, Association for Computational Linguistics, 2015, pp. 152-157.
[19] Louis Onrust, Antal van den Bosch, Hugo Van hamme, Improving cross-domain n-gram language modelling with skipgrams, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016, pp. 137-142.
[20] Mark Aronoff, Kirsten Fudeman, What is Morphology, Vol. 8, John Wiley and Sons, 2011.
[21] Laurence C. Thompson, The problem of the word in Vietnamese, Word: Journal of the International Linguistic Association 19(1) (1963) 39-52. https://doi.org/10.1080/00437956.1963.11659787.
[22] Binh N. Ngo, The Vietnamese language learning framework, Journal of Southeast Asian Language Teaching 10 (2001) 1-24.
[23] Le Hong Phuong, Nguyen Thi Minh Huyen, Azim Roussanaly, Ho Tuong Vinh, A hybrid approach to word segmentation of Vietnamese texts, 2008.
[24] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, Evan Herbst, Moses: Open source toolkit for statistical machine translation, In ACL-2007: Proceedings of demo and poster sessions, Prague, Czech Republic, 2007, pp. 177-180.
[25] Franz Josef Och, Minimum error rate training in statistical machine translation, In Proceedings of ACL, 2003, pp. 160-167.
[26] Andreas Stolcke, SRILM - an extensible language modeling toolkit, In Proceedings of the International Conference on Spoken Language Processing, 2002.
[27] Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, BLEU: A method for automatic evaluation of machine translation, ACL, 2002.
[28] G. Klein, Y. Kim, Y. Deng, J. Senellart, A.M. Rush, OpenNMT: Open-source toolkit for neural machine translation, ArXiv e-prints, 2017.
[29] Pratyush Banerjee, Jinhua Du, Baoli Li, Sudip Kr. Naskar, Andy Way, Josef van Genabith, Combining multi-domain statistical machine translation models using automatic classifiers, In Proceedings of AMTA 2010, 2010.