A Novel Low Annotation-Cost Interactive Framework for Named Entity Recognition
Abstract: Named entity recognition (NER) is the task of sequence labeling unstructured data, which is often highly ambiguous. It aims to identify all named entities and assign them to predefined categories. The datasets used in domain-specific NER tasks require manual annotation. Unfortunately, the annotators are usually domain experts, whose time can be extremely expensive. Recent studies have shown that combining active learning with a machine learning algorithm can reduce the annotation effort. However, active learning queries experts for labels dozens of times during training, and the waiting time between iterations for both annotators and data engineers makes the traditional active learning framework impractical. In this thesis project, a novel low annotation-cost framework, two-step active learning, is introduced to solve a real-world NER task in which the unlabeled domain-specific data is provided by Ericsson. The annotating expert available during the thesis work was an Ericsson employee. To evaluate the novel framework, a separate open labeled dataset is used. The NER task on Ericsson’s dataset was solved with an F1 score of 0.81, with only 27.5% of the data manually labeled. On the open dataset, the two-step active learning approach outperformed traditional passive learning trained on the same amount of randomly selected data. Both active learning and two-step active learning, using only 32% of the original data, reached performance similar to passive learning trained on 100% of the data, and two-step active learning queried for labels only twice.