AUTOMATIC DETECTION OF UNDERRESOURCED LANGUAGES. Dialectal Arabic Short Texts

University essay from Göteborgs universitet/Institutionen för filosofi, lingvistik och vetenskapsteori

Author: Wafia Adouane; [2016-11-15]


Abstract: Automatic Language Identification (ALI), the identification of the natural language of input content by a machine, is the first necessary step in any language-dependent natural language processing task. A well-established task in computational linguistics since the early 1960s, it has seen various methods applied successfully to a wide range of languages. State-of-the-art automatic language identifiers are based on character n-gram models trained on huge corpora. However, many natural languages are not yet processed automatically, for instance minority languages and informal forms of standard languages (general-purpose languages used only in media and administration and taught at schools). Some of these languages are only spoken and have no written form. The use of social media platforms and new technologies has facilitated the emergence of pronunciation-based written forms for these spoken languages. These new written languages are under-resourced, so current ALI tools fail to recognize them properly. In this study, we revisit the problem of ALI with a focus on discriminating between similar under-resourced languages. We deal with the case of dialectal Arabic (informal Arabic varieties) used in social media, and we treat each Arabic dialect or variety as a stand-alone language. Our main purpose is to investigate the performance of the standard ALI methods, namely machine learning and dictionary-based methods, in distinguishing Arabic varieties. Given that discriminating between Arabic varieties is a nontrivial linguistic task, because there are no clear-cut borderlines between the variants, we conclude that machine learning models are well suited to Arabic dialect identification. A support vector machine, namely the LinearSVC method combining character-based 5-6-grams with dialectal vocabulary as features, outperforms all the other methods. The dictionary-based method suffers mainly from limited vocabulary coverage.
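
The two kinds of evidence named in the abstract, character 5-6-grams and dialectal vocabulary, can be illustrated with a minimal standard-library sketch. It is not the essay's implementation: the function names, the romanized toy lexicons, and the simple hit-count scoring rule are all assumptions made for illustration, and no classifier is trained here.

```python
from collections import Counter


def char_ngrams(text, n_min=5, n_max=6):
    """Count character n-grams of length n_min..n_max, spaces included."""
    # Pad with one space on each side so word boundaries become part of the n-grams.
    padded = f" {text.strip().lower()} "
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


# Hypothetical toy lexicons (romanized, illustrative only, not from the essay).
LEXICONS = {
    "egyptian": {"ezayak", "awi", "delwaati"},
    "levantine": {"kifak", "kteer", "hallaq"},
    "moroccan": {"bzaf", "wakha", "daba"},
}


def identify_by_dictionary(text, lexicons=LEXICONS):
    """Return the label whose lexicon matches the most tokens, or None if no token matches."""
    tokens = text.lower().split()
    scores = {
        label: sum(token in words for token in tokens)
        for label, words in lexicons.items()
    }
    best = max(scores, key=scores.get)
    # A score of zero means no lexicon covers any token in the text.
    return best if scores[best] > 0 else None
```

In this sketch, a text whose tokens appear in none of the lexicons yields `None`, which is the kind of vocabulary-coverage gap the abstract attributes to the dictionary-based method. The n-gram counts returned by `char_ngrams` are the sort of feature that could be passed to a linear classifier.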
