Effectivisation of keywords extraction process : A supervised binary classification approach of scraped words from company websites

University essay from Umeå universitet/Institutionen för matematik och matematisk statistik

Abstract: In today’s digital era, establishing an online presence and maintaining a well-structured website is vitalfor companies to remain competitive in their respective markets. A crucial aspect of online success liesin strategically selecting the right words to optimize customer engagement and search engine visibility.However, this process is often time-consuming, involving extensive analysis of a company’s website aswell as its competitors’. This thesis focuses on developing an efficient binary classification approachto identify key words and phrases extracted from multiple company websites. The data set used forthis solution consists of approximately 92,000 scraped samples, primarily comprising non-key samples.Various features were extracted, and a word embedding model was employed to assess each sample’srelevance to its specific industry and topic. The logistic regression, decision tree and random forestalgorithms were all explored and implemented as different solutions to the classification problem. Theresults indicated that the logistic regression model excelled in retaining keywords but was less effectivein eliminating non-keywords. Conversely, the tree-based methods demonstrated superior classificationof keywords, albeit at the cost of misclassifying a few keywords. Overall, the random forest approachoutperformed the others, achieving a result of 76 percent in recall and 20 percent in precision whenpredicting key samples. In summary, this thesis presents a solution for classifying words and phrasesfrom company websites into key and non-key categories, and the developed methodology could offervaluable insights for companies seeking to enhance their website optimization strategies.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)