Towards a fully automated extraction and interpretation of tabular data using machine learning

University essay from Uppsala universitet/Avdelningen för systemteknik

Abstract:

Motivation: A challenge for researchers at CBCS is to efficiently manage the different data formats they work with, which change frequently. This handling includes importing data into a single common format, regardless of the output of the various instruments used. A significant amount of time is spent on manual pre-processing, converting from one format to another. There are commercial solutions available for this process but, to our knowledge, all of them require prior generation of templates to which the data must conform. There are currently no solutions that use pattern recognition to locate and automatically recognise data structures in a spreadsheet.

Problem Definition: The desired solution is a self-learning Software-as-a-Service (SaaS) for automated recognition and loading of data stored in arbitrary formats. The aim of this study is threefold: A) Investigate whether unsupervised machine learning methods can be used to label different types of cells in spreadsheets. B) Investigate whether a hypothesis-generating algorithm can be used to label different types of cells in spreadsheets. C) Advise on choices of architecture and technologies for the SaaS solution.

Method: A pre-processing framework is built that can read any type of spreadsheet and pre-process it into a feature matrix. Different datasets are read and clustered. The usefulness of reducing the dimensionality is also investigated. A hypothesis-driven algorithm is built and adapted to two of the data formats that CBCS uses most frequently. Choices of architecture and technologies for the SaaS solution are discussed, including system design patterns, web development framework and database.

Result: The reading and pre-processing framework is in itself a valuable result, due to its general applicability. No satisfactory results are found when using the mini-batch K-means clustering method. When reading data from only one format, the dimensionality can be reduced from 542 to around 40 dimensions. The hypothesis-driven algorithm can consistently interpret the format it is designed for. More work is needed to make it more general.

Implication: The study contributes to the desired solution in the short term through the hypothesis-generating algorithm, and in a more generalisable way through the unsupervised learning approach. The study also contributes by initiating a conversation around the system design choices.
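
As an illustration of what "pre-processing a spreadsheet into a feature matrix" could look like, the following is a minimal sketch and not the framework built in the thesis. The six per-cell features, the function names `cell_features` and `feature_matrix`, and the sample data are all assumptions made for this example; the abstract does not list the 542 features actually used.

```python
import csv
import io


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def cell_features(value, row, col, n_rows, n_cols):
    """Hypothetical per-cell features (assumed for illustration only)."""
    text = value.strip()
    numeric = is_number(text)
    return [
        1.0 if text == "" else 0.0,             # cell is empty
        1.0 if numeric else 0.0,                # cell holds a number
        1.0 if text and not numeric else 0.0,   # cell holds other text
        float(len(text)),                       # length of the content
        row / max(n_rows - 1, 1),               # relative row position
        col / max(n_cols - 1, 1),               # relative column position
    ]


def feature_matrix(csv_text):
    """Turn a CSV-formatted sheet into one feature vector per cell."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    n_rows = len(rows)
    n_cols = max((len(r) for r in rows), default=0)
    matrix = []
    for i, r in enumerate(rows):
        padded = r + [""] * (n_cols - len(r))   # pad ragged rows
        for j, value in enumerate(padded):
            matrix.append(cell_features(value, i, j, n_rows, n_cols))
    return matrix


# Made-up sample sheet: a title row, a header row and two data rows.
sample = "Plate,A1\nWell,Signal\nA1,0.52\nA2,\n"
X = feature_matrix(sample)
```

Each row of `X` describes one cell, so a matrix of this shape could be passed to a clustering method or a dimensionality-reduction step such as those mentioned in the abstract.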
