Duplicate Detection and Text Classification on Simplified Technical English

University essay from Linköpings universitet/Institutionen för datavetenskap

Author: Max Lund; [2019]

Keywords: NLP; CNL; transformer models; LSTM; BERT; document embeddings; word embeddings; text classification; text clustering; transfer learning; machine learning

Abstract: This thesis investigates which methods are most effective for classifying texts by label and for clustering duplicate texts in technical documentation written in Simplified Technical English. For the classification task, pre-trained transformer language models (BERT) were tested against traditional methods such as k-nearest neighbours (kNN) over tf-idf vectors with cosine similarity and SVMs. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf vectors, using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods perform comparably to pre-trained models on classification, and that tf-idf vectors clustered with DBSCAN at a low distance threshold are preferable for duplicate detection.
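
The duplicate-detection result can be illustrated with a minimal sketch. It is not the thesis's implementation: the tokenisation, the smoothed idf weighting, the threshold value eps=0.1, and all function names below are assumptions made for illustration. The sketch also does not call a DBSCAN library. It relies on the fact that when the minimum cluster size is two (itself an assumption, since the abstract does not state it), DBSCAN's clusters are exactly the connected components of the graph that links texts whose distance is at most eps, so a union-find pass over pairwise cosine distances gives the same grouping.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised sparse tf-idf vector (term -> weight) per text."""
    tokenised = [doc.lower().split() for doc in docs]  # naive tokeniser (assumption)
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed idf; the weighting used in the thesis is not stated (assumption).
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(term, 0.0) for term, w in a.items())


def duplicate_groups(docs, eps=0.1):
    """Group texts whose tf-idf cosine distance is at most eps.

    Equivalent to DBSCAN with a minimum cluster size of two: texts with no
    neighbour within eps are treated as noise and left out of the result.
    """
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine_distance(vectors[i], vectors[j]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)


# Hypothetical sentences in the style of Simplified Technical English.
EXAMPLE_DOCS = [
    "Remove the access panel.",
    "Remove the access panel.",
    "remove the access panel.",
    "Install the hydraulic pump.",
    "Do a leak test of the fuel system.",
]

print(duplicate_groups(EXAMPLE_DOCS, eps=0.1))
```

With a low threshold, only texts that are identical or nearly identical in wording are grouped; texts that share nothing but common words such as "the" stay far apart. Raising eps would merge texts that are merely similar, which is the trade-off the abstract points to when it recommends a low distance threshold.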
