Analysis of Tabula: A PDF-Table extraction tool

University essay from Uppsala universitet/Institutionen för informationsteknologi

Author: Gustav Rosén; [2019]


Abstract: PDF is a widely used document format in both the private and the public sector. It is designed to lay out text and figures on a virtual page. Research groups often publish reports in this format, including raw data in tables. The content of PDF-tables can be difficult to extract, an issue the National Food Agency often runs into. Building a PDF-interpreter from scratch is a complex and overwhelming task, but plenty of PDF-table extractors are available. While none meets the specific requirements of the National Food Agency, the most effective tool, Tabula, is open source. By analyzing the source code, the feasibility of extending Tabula to meet those requirements in the future can be evaluated. However, the lack of documentation and poor class definitions make the source code arduous to understand. Building a new application using the same library as Tabula appears to be a more promising approach.
