Detection of insurance fraud using NLP and ML

University essay from Lunds universitet/Matematisk statistik

Abstract: Machine learning can sometimes see things we as humans cannot. In this thesis we evaluated three different Natural Language Processing techniques: BERT, word2vec and linguistic analysis (UDPipe), on their performance in detecting insurance fraud based on transcribed audio from phone calls (referred to as audio data) and written text (referred to as text-form data) related to insurance claims. We also included TF-IDF as a naive baseline model. For every model we applied logistic regression to the extracted features. For word2vec and the linguistic analysis we also applied a KNN classifier to the word embeddings, while for BERT we instead applied an LSTM network to the merged CLS-token embeddings, owing to the sequential nature of BERT's architecture. For the audio data, all models achieved a Macro F1-score above 50% at the 95% confidence level with at least one type of classifier: TF-IDF scored 58.2% ±2.6%, BERT 56.0% ±2.6%, word2vec 54.1% ±3.8% and linguistic analysis 53.6% ±3.0%. For the text-form data, the same held: TF-IDF scored 56.0% ±2.3%, BERT 57.4% ±0.9%, word2vec 56.0% ±2.1% and linguistic analysis 51.4% ±0.5%. Each reported score comes from the best-performing classifier for that model. These findings show that our models manage to learn something from the data, but the rather small data sets, spanning insurance cases from many different areas, make it difficult to draw conclusions with high confidence. The results are not much better than guessing, and the small gain over 50% could be due to something else, such as bias in the data sets. We see potential in using these techniques in a real setting, but the topic needs further exploration. We see particular potential in transformer-based models such as BERT, although they currently lack the ability to analyse longer sequences due to computational limitations. Given the current pace of development of transformer models, it may become possible to use them in the future to get a better representation of what is being said, which would hopefully produce better results.
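
As a concrete illustration of the baseline pipeline, the following is a minimal sketch in Python (using scikit-learn) of TF-IDF features fed to logistic regression and scored with Macro F1. The toy documents, labels and hyperparameters are placeholders, not the thesis's data or code, and the thesis evaluated on held-out data with 95% confidence intervals rather than on the training set.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Placeholder claim texts and labels (1 = suspected fraud, 0 = legitimate).
texts = [
    "caller gives a detailed account of a stolen phone",
    "short vague claim about water damage",
    "written description of a car accident",
    "the same story repeated with new details each time",
]
labels = [0, 1, 0, 1]

# TF-IDF features, then logistic regression, as in the baseline.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Macro F1 on the same data, for illustration only.
pred = clf.predict(X)
print("Macro F1:", f1_score(labels, pred, average="macro"))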
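
The word2vec variant can be sketched similarly, assuming gensim's word2vec implementation and document vectors formed by averaging word vectors (a common choice; the thesis may have merged embeddings differently), followed by a KNN classifier. Tokenisation, vector size and k are illustrative.

import numpy as np
from gensim.models import Word2Vec
from sklearn.neighbors import KNeighborsClassifier

# Placeholder tokenised documents and illustrative fraud labels.
docs = [
    ["caller", "reports", "stolen", "phone", "in", "detail"],
    ["vague", "water", "damage", "claim"],
    ["written", "car", "accident", "description"],
    ["same", "story", "repeated", "with", "new", "details"],
]
labels = [0, 1, 0, 1]

w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=1)

def doc_vector(tokens):
    """Average the word vectors of all in-vocabulary tokens."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.stack([doc_vector(d) for d in docs])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(knn.predict(X))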
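
For the BERT plus LSTM setup, here is a hedged sketch of one plausible reading of "merged CLS-token embeddings": a long transcript is split into chunks, BERT's [CLS] embedding is extracted per chunk, and an LSTM classifies the resulting sequence. The model name, chunk size and layer sizes are assumptions for illustration, not taken from the thesis.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Model choice is an assumption; the thesis does not name one here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def cls_sequence(text, chunk_tokens=510):
    """Encode a long text as a sequence of [CLS] embeddings, one per chunk."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    embeddings = []
    with torch.no_grad():
        for chunk in chunks:
            inp = torch.tensor([[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
            embeddings.append(bert(inp).last_hidden_state[0, 0])  # the [CLS] vector
    return torch.stack(embeddings)  # shape: (num_chunks, hidden_size)

class ClsLstmClassifier(nn.Module):
    """LSTM over per-chunk [CLS] embeddings, ending in a fraud/no-fraud head."""
    def __init__(self, emb_size=768, hidden_size=128):
        super().__init__()
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, seq):      # seq: (batch, num_chunks, emb_size)
        _, (h, _) = self.lstm(seq)
        return self.head(h[-1])  # logits for fraud / not fraud

seq = cls_sequence("transcribed phone call about an insurance claim ...")
logits = ClsLstmClassifier()(seq.unsqueeze(0))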
