Evaluating tools and techniques for web scraping

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Emil Persson; [2019]


Abstract: The purpose of this thesis is to evaluate state-of-the-art web scraping tools. To support the process, an evaluation framework for comparing web scraping tools is developed and applied, based on previous work and established software comparison metrics. Twelve tools from different programming languages are initially considered. These twelve are then narrowed to six, based on factors such as similarity and popularity: Nightmare.js, Puppeteer, Selenium, Scrapy, HtmlUnit and rvest are kept and evaluated. The evaluation framework covers performance, features, reliability and ease of use. Performance is measured in terms of run time, CPU usage and memory usage. The feature evaluation is based on implementing and completing tasks with each feature in mind. To reason about reliability, code quality statistics and GitHub repository statistics are used. The ease-of-use evaluation considers the installation process, official tutorials and the documentation. While all tools are useful and viable, the results show that Puppeteer is the most complete tool: it had the best ease-of-use and feature results, while staying among the top tools in terms of performance and reliability. If speed is of the essence, HtmlUnit is the fastest, although it uses the most overall resources. Selenium with Java is the slowest and uses the most memory, but is the second-best performer in terms of features. Selenium with Python uses the least memory and the second-least CPU. If JavaScript-rendered pages are to be accessed, Nightmare.js, Puppeteer, Selenium and HtmlUnit can be used.
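The performance side of such an evaluation framework can be sketched in Python. The snippet below is a minimal, hypothetical harness (the thesis's actual measurement setup is not shown here): it captures the three metrics the abstract names, run time, CPU usage and memory usage, around a single task. A real run would wrap a scraping job driven by one of the evaluated tools.

```python
import time
import tracemalloc

def measure(task):
    """Run `task` once and report wall-clock time, CPU time and
    peak Python heap usage -- rough proxies for the framework's
    run time, CPU usage and memory usage metrics. Note that
    tracemalloc only sees Python-level allocations, so it
    understates the footprint of tools that spawn a browser."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"run_time_s": wall, "cpu_time_s": cpu, "peak_mem_bytes": peak}

# Stand-in for a scraping task; a real benchmark would drive
# e.g. Puppeteer, Selenium or Scrapy against a target page.
def dummy_scrape():
    return [f"<li>item {i}</li>" for i in range(10_000)]

stats = measure(dummy_scrape)
```

For external-process tools (a headless Chromium launched by Puppeteer, say), system-level counters such as those exposed by the OS would be needed instead of `tracemalloc`.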
