The One Spider To Rule Them All: Web Scraping Simplified: Improving Analyst Productivity and Reducing Development Time with A Generalized Spider

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: This thesis describes the development of a generalized spider for web scraping that can be applied to multiple sources, reducing the time and cost of creating and maintaining an individual spider for each website or URL. The project aims to improve analyst productivity, reduce development time, and ensure high-quality, accurate data extraction. The research investigates web scraping techniques and develops a more efficient and scalable approach to report retrieval. The problem statement emphasizes the inefficiency of the current method, which uses one customized spider per source, and the need for a more streamlined approach. The research question focuses on identifying the patterns in the scraping process and the functions required for specific publication websites in order to create a more generalized web scraper. The objective is to reduce manual effort, improve scalability, and maintain high-quality data extraction. The problem is addressed using a quantitative approach: a spider is analyzed and implemented for each data source, which provides a comprehensive understanding of the potential scenarios and the knowledge needed to develop a general spider. These spiders are then grouped by similarity and, through the application of simple logic, consolidated into a single general spider capable of handling all the sources. To construct the general spider, a utility library is created with the essential tools for extracting relevant information such as title, description, date, and PDF links. The source-specific information is then moved into configuration files, which drive the execution of the general spider. The findings demonstrate the successful integration of multiple sources and spiders into a unified general spider. However, because of the project's limited time frame, there is room for further improvement. Enhancements could include better structuring of the configuration files, expansion of the utility library, or the integration of AI capabilities to improve the general spider's performance. Nevertheless, the current solution is deemed suitable for automated article retrieval and ready for use.
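
The abstract does not show the thesis's code, so the following is only a minimal sketch of the general idea: one spider function whose per-source behavior comes entirely from configuration, backed by a few small utility helpers for title, description, date, and PDF link. Every name here (`SOURCE_CONFIGS`, `text_of`, `pdf_link_of`, `run_general_spider`), the example URLs, and the selector layout are assumptions made for illustration, not the author's implementation. For simplicity it parses well-formed markup with Python's standard library; real pages would likely need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical per-source configuration: names, URLs, and selectors are illustrative.
SOURCE_CONFIGS = {
    "example_bank": {
        "base_url": "https://example.com/reports/",
        "item": ".//div[@class='report']",
        "title": "./h2",
        "description": "./p[@class='summary']",
        "date": "./span[@class='date']",
        "pdf_link": "./a",
    },
    "example_agency": {
        "base_url": "https://example.org/",
        "item": ".//li",
        "title": "./a",
        "description": "./em",
        "date": "./time",
        "pdf_link": "./a",
    },
}


def text_of(node, path):
    """Utility: stripped text of the first element matching path, or None."""
    found = node.find(path)
    if found is None:
        return None
    return "".join(found.itertext()).strip()


def pdf_link_of(node, path, base_url):
    """Utility: absolute URL of the first matching link that ends in .pdf."""
    for anchor in node.findall(path):
        href = anchor.get("href", "")
        if href.lower().endswith(".pdf"):
            return urljoin(base_url, href)
    return None


def run_general_spider(page_markup, config):
    """Extract one record per item; all source-specific detail lives in config."""
    root = ET.fromstring(page_markup)
    records = []
    for item in root.findall(config["item"]):
        records.append({
            "title": text_of(item, config["title"]),
            "description": text_of(item, config["description"]),
            "date": text_of(item, config["date"]),
            "pdf_link": pdf_link_of(item, config["pdf_link"], config["base_url"]),
        })
    return records


# Two made-up pages with different layouts, handled by the same function.
BANK_PAGE = """<html><body>
<div class="report"><h2>Q1 Outlook</h2><p class="summary">Quarterly forecast.</p>
<span class="date">2023-03-01</span><a href="files/q1.pdf">Download</a></div>
</body></html>"""

AGENCY_PAGE = """<ul>
<li><a href="/docs/annual.pdf">Annual Review</a> <em>Yearly summary.</em>
<time>2022-12-15</time></li>
</ul>"""

if __name__ == "__main__":
    print(run_general_spider(BANK_PAGE, SOURCE_CONFIGS["example_bank"]))
    print(run_general_spider(AGENCY_PAGE, SOURCE_CONFIGS["example_agency"]))
```

In a design like this, supporting a new source means adding a configuration entry rather than writing a new spider, which is the kind of saving in development and maintenance effort the abstract describes.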
