Detection of Web API Content Scraping: An Empirical Study of Machine Learning Algorithms

University essay from KTH/Skolan för datavetenskap och kommunikation (CSC)

Abstract: Scraping is known to be difficult to detect and prevent, especially in the context of web APIs. Organisations that rely heavily on the content they provide through their web APIs have an interest in protecting that content from scrapers. This thesis proposes a machine learning approach to detecting web API content scrapers. Three supervised machine learning algorithms were evaluated to see which would perform best on data from Spotify's web API. The data used to evaluate the classifiers consisted of aggregated HTTP request data describing each application that sent HTTP requests to the web API over a span of two weeks. Two experiments were performed for each classifier; in the second, synthetic data for scrapers (the minority class), generated by oversampling with SMOTE, was added to the original dataset. The results show that Random Forest was the best classifier, with a Matthews correlation coefficient (MCC) of 0.692, without the use of synthetic data. For this particular problem, it is crucial that the classifier does not have a high false positive rate, as legitimate usage of the web API should not be blocked. The Random Forest classifier has a low false positive rate and is therefore favoured and considered the strongest of the three classifiers examined.
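
The two quantities the abstract relies on, MCC and the false positive rate, can both be derived from a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the thesis's own code. The function names and the toy labels are hypothetical and chosen only for illustration, with scrapers treated as the positive class (label 1).

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count (tp, fp, tn, fn) for a binary problem; `positive` marks the scraper class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def false_positive_rate(fp, tn):
    """Share of legitimate applications (negatives) wrongly flagged as scrapers."""
    total_negatives = fp + tn
    return fp / total_negatives if total_negatives else 0.0


# Hypothetical toy data: 2 scrapers among 10 applications (imbalanced, as in the abstract).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(mcc(tp, fp, tn, fn))          # 0.375
print(false_positive_rate(fp, tn))  # 0.125
```

MCC uses all four cells of the confusion matrix, which is why it is often preferred over plain accuracy when one class is rare. The false positive rate isolates the cost of blocking legitimate API usage.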
