Web Scraping using Machine Learning

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Victor Carle; [2020]


Abstract: This thesis explores the possibility of creating a robust web scraping algorithm, designed to continuously scrape a specific website even when its HTML code is altered. The algorithm is intended for websites with a repetitive HTML structure containing data that can be scraped. A repetitive HTML structure often displays items such as news articles, videos, or books. This produces HTML code that is repeated many times, since the only thing that differs between the displayed items is, for example, their titles. A good example would be YouTube. The scraper works by applying text classification to the words in the HTML code, training a Support Vector Machine to recognize the words or variable names. The words surrounding the sought-after data are classified on the assumption that the future HTML of a website will be similar to the current HTML, which in turn allows robust scraping to be performed. To evaluate the algorithm, a web archive is used to back-test it on past versions of the site, to give an idea of what its future performance might look like. The algorithm achieves varying results depending on a wide range of variables within the websites themselves as well as the past versions of the websites. The best performance was achieved on Yahoo News, with an accuracy of 90 % dating back three months from the time the scraper stopped working.
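
The idea of classifying the words that surround the sought-after data can be illustrated with a minimal sketch. It is not taken from the thesis: the class `ContextWordExtractor`, the function `extract_samples`, and the sample markup `PAGE` are all hypothetical, and the choice of tag names and attribute values as "words" is an assumption about what such features might look like. The sketch only covers collecting context words for each text node with Python's standard `html.parser`; training a Support Vector Machine on those words is left out.

```python
from html.parser import HTMLParser


class ContextWordExtractor(HTMLParser):
    """Hypothetical helper: for every non-empty text node, collect the tag
    names and attribute values of its enclosing elements (context words)."""

    # Void elements never get a closing tag, so they are not tracked.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []    # one list of words per currently open element
        self.samples = []  # (text, context_words) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        words = [tag]
        for _name, value in attrs:
            if value:
                # Split multi-valued attributes such as class="a b".
                words.extend(value.split())
        self.stack.append(words)

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Flatten the words of all open elements, outermost first.
            context = [w for words in self.stack for w in words]
            self.samples.append((text, context))


def extract_samples(html):
    """Return (text, context_words) pairs for all text nodes in `html`."""
    parser = ContextWordExtractor()
    parser.feed(html)
    parser.close()
    return parser.samples


# Hypothetical repetitive markup, loosely modelled on a video listing.
PAGE = """
<div class="item"><h3 class="video-title">First clip</h3><span class="views">10 views</span></div>
<div class="item"><h3 class="video-title">Second clip</h3><span class="views">25 views</span></div>
"""

samples = extract_samples(PAGE)
for text, context in samples:
    print(text, context)
```

In this sketch both titles end up with the same context words, which is the kind of repetition the abstract describes. A classifier trained on such words could in principle still recognize a title if the surrounding markup changed only slightly, though how well that works is exactly what the thesis evaluates.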
