Evaluation of Idempotency & Block Size of Data on the Performance of Normalized Compression Distance Algorithm

University essay from Blekinge Tekniska Högskola/Sektionen för datavetenskap och kommunikation

Abstract: Normalized compression distance (NCD) is a similarity metric that is used here to analyze the type of file fragments. The performance of NCD depends on the underlying compression algorithm. We studied three compressors: bzip2, gzip and ppmd. The compression ratio of ppmd is better than that of bzip2, and the compression ratio of bzip2 is better than that of gzip; we evaluated how the three rank against one another in terms of idempotency. We then applied NCD together with k-nearest neighbour as a classification algorithm to randomly selected public corpus data with different block sizes (512 bytes, 1024 bytes, 1536 bytes, 2048 bytes). The performance of two compressors, bzip2 and gzip, is also compared for the NCD algorithm from the perspective of idempotency. Objectives: In this study we investigated the combined effect of two parameters, namely compression ratio versus idempotency and varying block size of data, on the performance of NCD. The objective is to determine whether, for better NCD performance, a compressor should be selected on the basis of better compression ratio or better idempotency. The purpose of using different block sizes was to evaluate whether the performance of NCD improves when the block size of the data used to build the datasets is varied. Methods: Experiments are performed to test the hypotheses and evaluate the effect of compression ratio versus idempotency and block size of data on the performance of NCD. 
Results: The null hypotheses of the main experiment were retained: there is no statistically significant difference in the performance of NCD when the block size of the data is varied, and no statistically significant difference in NCD’s performance when the compressor is selected on the basis of better compression ratio or better idempotency. Conclusions: Because the experimental results could not reject the null hypotheses of the main experiment, no conclusion could be drawn about the effect of the independent variables on the dependent variable, i.e. no statistically significant effect of compression ratio versus idempotency or of varying block size of data on the performance of NCD was found.
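
For readers unfamiliar with the quantities discussed in the abstract, the following is a minimal Python sketch, not the code used in the essay. It assumes the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length, and treats a compressor as idempotent when C(xx) = C(x), so that NCD(x, x) would ideally be 0. The function names (compressed_size, ncd, idempotency_gap, split_blocks, knn_classify) are hypothetical. Python's zlib module (DEFLATE) is used as a stand-in for gzip, and ppmd is omitted because the standard library has no PPMd implementation.

```python
import bz2
import zlib
from collections import Counter


def compressed_size(data: bytes, compressor: str = "bzip2") -> int:
    """Length in bytes of `data` after compression."""
    if compressor == "bzip2":
        return len(bz2.compress(data))
    if compressor == "gzip":
        # Assumption: zlib (DEFLATE) stands in for the gzip tool.
        return len(zlib.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "bzip2") -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)


def idempotency_gap(x: bytes, compressor: str = "bzip2") -> float:
    """NCD(x, x); closer to 0 means closer to ideal idempotency."""
    return ncd(x, x, compressor)


def split_blocks(data: bytes, block_size: int) -> list:
    """Cut `data` into full blocks of `block_size` bytes (tail dropped)."""
    last_start = len(data) - block_size
    return [data[i:i + block_size] for i in range(0, last_start + 1, block_size)]


def knn_classify(block: bytes, labelled: list, k: int = 3,
                 compressor: str = "bzip2") -> str:
    """Majority label among the k training blocks nearest under NCD.

    `labelled` is a list of (block, label) pairs.
    """
    ranked = sorted(labelled, key=lambda item: ncd(block, item[0], compressor))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Under these assumptions, calling idempotency_gap on the same block with "bzip2" and with "gzip" gives one rough way to compare the two compressors on idempotency, and split_blocks with sizes 512, 1024, 1536 and 2048 produces fragments of the sizes named in the abstract.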
