Predicting the area of industry: Using machine learning to classify SNI codes based on business descriptions, a degree project at SCB

University essay from Umeå universitet/Statistik

Abstract: This study is part of an experimental project at Statistics Sweden, which aims to use natural language processing and machine learning to predict Swedish businesses’ area-of-industry codes from their business descriptions. The response variable consists of the 30 most frequent of the 88 main groups of Swedish Standard Industrial Classification (SNI) codes, each of which represents a unique area of industry. The business description text was transformed into numerical features using the bag-of-words model. SNI codes are set when companies are founded, and because they are assigned by people, errors can occur. Today these errors are corrected manually. Using data from the Swedish Companies Registration Office, the purpose is to determine whether gradient boosting can provide high enough classification accuracy to automatically correct SNI codes when the predicted code differs from the registered one. The best gradient boosting model correctly classified 52 percent of the observations, which is not considered high enough to implement automatic code correction in a production environment.
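
The two preprocessing steps named in the abstract, the bag-of-words transformation and the restriction to the most frequent classes, can be illustrated with a minimal sketch. This is not the study's code: the function names, the whitespace tokenization, the toy descriptions and the label strings are all assumptions made for illustration, and the gradient boosting classifier itself is left out.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a token-count vector for one text; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(text.lower().split()).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


def most_frequent_classes(labels, k):
    """Return the set of the k most frequent labels."""
    return {label for label, _ in Counter(labels).most_common(k)}


# Toy data invented for illustration (not taken from the study).
descriptions = [
    "retail sale of clothing",
    "construction of residential buildings",
    "retail sale of food",
    "software consulting",
]
codes = ["47", "41", "47", "62"]  # placeholder two-digit labels

vocabulary = build_vocabulary(descriptions)
features = [bag_of_words(d, vocabulary) for d in descriptions]

# The study keeps the 30 most frequent main groups; k=1 is used here
# only because the toy data set has three distinct labels.
kept = most_frequent_classes(codes, k=1)
rows = [(x, y) for x, y in zip(features, codes) if y in kept]
```

Each row of `features` is a count vector with one column per vocabulary token, and `rows` contains only the observations whose label belongs to the retained classes. A classifier such as gradient boosting would then be trained on these rows.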
