IBM Model 4 Alignment Comparison: An evaluation of how the size of the training data affects the interpretation accuracy and training time of two alignment models that translate natural language
Abstract: In modern society, the amount of information processed by computers is increasing every day. Computer translation has the potential to speed up communication between humans as well as human-computer interaction. For Statistical Machine Translation, word alignment is key. How large does a corpus need to be to align a natural language sentence with a simple, unambiguous language? We investigate this question by running a simple alignment algorithm and comparing its results with those of an industry equivalent. The results show that the simplified model needs a larger corpus when there are more words per sentence. For IBM Model 4, conversely, a greater number of words per sentence decreases the corpus size needed to make better predictions. We therefore conclude that, for both models, the required corpus size depends on the number of terms in each sentence.