Identifying Content Blocks on Web Pages using Recursive Neural Networks and DOM-tree Features
Abstract: The internet holds abundant information spread across many web pages. Identifying and extracting that information has long been an active area of research, serving both academic and business-intelligence purposes. However, many existing systems and techniques rely on assumptions that limit their general applicability and degrade their performance as the web evolves. This work explores Recursive Neural Networks (RecNNs), combined with the large number of features available in the DOM trees of web pages, as a technique for identifying information on the internet without strict assumptions about the structure or content of web pages. It also explores Sparse Group LASSO (SGL) as a tool for feature selection in the context of web information extraction. The results show that a RecNN model outperforms a similarly structured feedforward baseline at identifying cookie consent dialogs across various web pages. The results further suggest that SGL can be an effective tool for selecting DOM-tree features.