Segmentation of companies using DBSCAN and K-Means

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Jacob Dahlkvist; William Tomczak; [2022]


Abstract: Data management and machine learning have become important tools for organizations around the world, for example as a basis for further processing. This work aims to help a company map corporate industries using keywords from companies’ websites, and it does so with machine learning. The essay explains how the model was created by describing the algorithms, theories and methods used, as well as the model's performance. The work examines the clustering methods K-means and DBSCAN together with the vectorization methods TF-IDF and Bag of Words. Evaluation is done using the Silhouette Coefficient (SC) and individual assessment. DBSCAN proves to be the better clustering method on this data set. However, there are problems with the data set, for example how distinct the differences between the companies' keywords are. Because of this, the clustering methods produce too much uncertainty for the model to be used for commercial purposes. The tool could be used in future implementations, but the data set would need to show more distinct differences.
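
As a rough illustration of the evaluation step named in the abstract, the sketch below builds Bag of Words vectors from keyword strings and computes the Silhouette Coefficient for a given cluster assignment. It is not the essay's implementation: the keyword lists, the function names, the use of Euclidean distance and the hand-written labels (standing in for the output of K-means or DBSCAN) are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Turn keyword strings into count vectors over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_coefficient(vectors, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a: mean distance from point i to the other points in its own cluster.
    b: smallest mean distance from point i to the points of another cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(euclidean(vectors[i], vectors[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical keyword lists, one string per company (not data from the essay).
docs = [
    "bank loan mortgage",
    "bank loan savings",
    "pizza pasta restaurant",
    "pizza burger restaurant",
]
vocab, X = bag_of_words(docs)

# Labels are written by hand here; in practice they would come from a
# clustering method such as K-means or DBSCAN.
good = silhouette_coefficient(X, [0, 0, 1, 1])  # groups similar keywords
bad = silhouette_coefficient(X, [0, 1, 0, 1])   # mixes the two groups
print(good > bad)
```

A value close to 1 suggests compact, well-separated clusters, while values near 0 or below suggest overlapping clusters, which is the kind of uncertainty the abstract describes when keywords are not distinct enough.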
