Analyzing the Influence of Synthetic and Augmented Data on Segmentation Model

University essay from Luleå tekniska universitet/Institutionen för system- och rymdteknik

Abstract: The field of Artificial Intelligence (AI) has experienced unprecedented growth in recent years, thanks to numerous applications in speech recognition, natural language processing, and computer vision. However, one of the challenges facing AI is the requirement for large amounts of energy, time, and data to be effective and accurate. As a result, many researchers are focusing on finding ways to reduce these requirements and make AI more accessible to everyone. One solution that has gained popularity in recent years is the generation of data artificially, either through synthetic or augmented means. Synthetic data is created by first learning how to recreate the structure or knowledge behind real-world data, while augmentation involves artificially changing the original data, for example by flipping, rotating, cropping, or scaling. By generating data in this way, researchers can overcome the challenge of acquiring large amounts of real-world data and obtain a more cost-effective way of training machine learning models. The lack of direct qualitative comparisons between synthetic and augmented data motivates this thesis, which aims to investigate the differences between the two approaches in generating data for an analytical tool that analyzes microorganisms. The thesis seeks to explore the effectiveness of each data type by training a segmentation model with the aim of generating accurate and realistic segmentations of microorganisms. The organisms' growth rate and size are provided by a segmentation mask that serves as the base for the image generation. To produce realistic images that accurately depict the organisms' behavior, the image generation needs to be able to recreate imaging noise, the halo effect, and relational dependencies.
Additionally, a training process is conducted with an augmentation model to compare the performance of the segmentation model across the different data types, offering insights into their respective advantages and their effectiveness in making the most of a limited amount of data. To achieve this goal, the thesis used three main components: the Albumentations augmentation library, the Taming transformer synthesizer built on a VQ-GAN, and the Omnipose segmentation model for evaluation. With the help of a microorganism dataset, the thesis aimed to train Omnipose to generate realistic and accurate segmentations based on different training sets. However, the study is limited to testing a restricted set of models among the wide range that exists. The findings of the thesis suggest that a better method for qualitative comparison is needed, which could involve a less elaborate setup or novel evaluation methods. Nonetheless, the results indicate that the choice between synthetic and augmented input data does not have a significant effect on the initial outcomes when training the segmentation model. This is supported by the Structural similarity index (SSIM) and Peak signal-to-noise ratio (PSNR) averages and curves for the two methods, which are similar despite the differences in data generation. A more considerable difference is observed with the quality of the data, as indicated by the poor performance of the synthetic model and by further testing of specific data distributions. Therefore, the efficiency of training a segmentation model on microorganism data is determined more by the quality and distribution of the data than by the dataset generation method. This finding is of significance to researchers in the field, as it adds further information on how to better train a segmentation model with a limited dataset of microbial images.
This can have significant implications in various fields, such as medicine, environmental science, and biotechnology, where accurate analysis of microorganisms can help diagnose diseases, monitor the health of ecosystems, and develop new biotechnological products. Moving forward, it is recommended that future research aim to establish clear definitions for synthetic and augmented data and evaluate their inherent characteristics in order to better understand their differences. This should include specific studies that generate synthetic and augmented data for microorganisms, as well as direct comparisons between the two data types. Furthermore, additional research should be conducted on well-established datasets, such as ImageNet, with a focus on image-to-image processing and on examining the impact on various computer vision models. This could be followed by a study on a less established dataset of other microorganisms to compare the results. Ultimately, this thesis contributes to the ongoing efforts to overcome the challenges faced in the field of AI by providing insights into the effectiveness of generating data artificially.
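
As a rough illustration of two ideas mentioned in the abstract, the sketch below applies a geometric augmentation (a random flip and a 90-degree rotation) identically to an image and its segmentation mask, and computes the PSNR between two images. It is a minimal NumPy sketch and not the pipeline used in the thesis, which relied on Albumentations; the function names `augment_pair` and `psnr`, the 8-bit value range, and the toy data are assumptions made for this example.

```python
import numpy as np


def augment_pair(image, mask, rng):
    """Apply the same random flip and 90-degree rotation to image and mask.

    Hypothetical helper for illustration; the thesis used Albumentations.
    """
    if rng.random() < 0.5:
        # Horizontal flip, applied to both arrays so they stay aligned.
        image, mask = np.fliplr(image), np.fliplr(mask)
    k = int(rng.integers(0, 4))  # number of 90-degree rotations (0 to 3)
    return np.rot90(image, k), np.rot90(mask, k)


def psnr(reference, test, data_range=255.0):
    """Peak signal-to-noise ratio in dB, assuming values in [0, data_range]."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 8-bit image and a binary mask marking a small square "organism".
    image = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
    mask = np.zeros((64, 64), dtype=np.uint8)
    mask[20:30, 15:25] = 1

    aug_image, aug_mask = augment_pair(image, mask, rng)
    print("augmented shapes:", aug_image.shape, aug_mask.shape)
    print("mask pixels before/after:", int(mask.sum()), int(aug_mask.sum()))

    # Compare the image with a noisy copy of itself.
    noise = rng.normal(0.0, 5.0, size=image.shape)
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255)
    print("PSNR (dB):", psnr(image, noisy))
```

Flips and 90-degree rotations only rearrange pixels, so the mask area is preserved exactly; transformations such as scaling or cropping would need interpolation and boundary handling, which a dedicated augmentation library takes care of.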
