Finding mislabeled data in datasets: A study on finding mislabeled data in datasets by studying loss function

University essay from Uppsala universitet/Datalogi

Abstract: The amount of data keeps growing, which makes handling all of it an extensive task. Most data needs preprocessing of some kind, and methods and techniques are required to ease this process. The aim of the thesis is to find mislabeled data in datasets by studying training-related metrics for individual data points. The metric studied in this thesis is the loss function. Experiments were made on the MNIST and CIFAR10 datasets. First, a CNN was trained as one part of the filtering process. The individual losses were then obtained and stored. The second part consisted of measuring distances between these losses: both the Euclidean and the Manhattan distance were calculated from each data point to the median class loss. The hypothesis of the thesis is that a greater distance to the median class loss is associated with more uncertainty about the given label. The results show that it is possible to find mislabeled data by studying individual loss functions.
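The distance step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that each data point's loss is recorded once per epoch, giving a short loss trajectory per sample, and that the median class loss is taken per epoch over all samples sharing a label. The abstract does not state either detail, so both are assumptions, and the function names and toy numbers below are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def class_median_losses(losses, labels):
    """Per-class, per-epoch median of the loss trajectories (assumed definition)."""
    by_class = defaultdict(list)
    for trajectory, label in zip(losses, labels):
        by_class[label].append(trajectory)
    return {
        label: [median(epoch_values) for epoch_values in zip(*trajectories)]
        for label, trajectories in by_class.items()
    }


def distances_to_class_median(losses, labels):
    """Return (euclidean, manhattan) distance of each sample to its class median."""
    medians = class_median_losses(losses, labels)
    result = []
    for trajectory, label in zip(losses, labels):
        diffs = [abs(x - m) for x, m in zip(trajectory, medians[label])]
        euclidean = math.sqrt(sum(d * d for d in diffs))
        manhattan = sum(diffs)
        result.append((euclidean, manhattan))
    return result


# Toy data: three epochs of recorded loss for five samples (made-up numbers).
losses = [
    [0.9, 0.4, 0.1],
    [1.0, 0.5, 0.1],
    [2.5, 2.4, 2.6],  # loss stays high: a candidate for a wrong label
    [0.8, 0.3, 0.2],
    [0.7, 0.3, 0.1],
]
labels = [0, 0, 0, 1, 1]

distances = distances_to_class_median(losses, labels)
# Rank samples by Euclidean distance, most suspicious first.
ranking = sorted(range(len(losses)), key=lambda i: distances[i][0], reverse=True)
print(ranking[0])
```

If only a single scalar loss is stored per data point, the Euclidean and Manhattan distances to the class median are identical, so the two measures only differ when several loss values per sample are compared.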
