Feature Selection and Case Selection Methods Based on Mutual Information in Software Cost Estimation

University essay from Uppsala universitet/Institutionen för informationsteknologi

Author: Shihai Shi; [2014]


Abstract: Software cost estimation is one of the most crucial processes in software development management because it underpins management activities such as project planning, resource allocation and risk assessment. Accurate software cost estimation not only supports investment and bidding decisions but also enables a project to be completed within its budget and schedule. This master thesis focuses on feature selection and case selection methods, with the goal of improving the accuracy of software cost estimation models. Case-based reasoning is an active research area in software cost estimation: it predicts the cost of a new software project by constructing an estimation model from historical projects. To construct such a model, case-based reasoning needs to pick out candidate features that are relevant to the estimated feature yet relatively independent of one another. However, many sequential-search feature selection methods in current use cannot measure the redundancy among candidate features precisely. In addition, when the local distances of candidate features are combined into the global distance between two software projects during case selection, the differing impact of each candidate feature is not accounted for. To address these two problems, this thesis explores solutions with support from NSFC. A feature selection algorithm based on hierarchical clustering is proposed: it gathers similar candidate features into the same cluster and selects, from each cluster, the feature most similar to the estimated feature as its representative. These representative features form candidate feature subsets; evaluation metrics are applied to the subsets, and the subset that yields the best performance is taken as the final result of feature selection. Experiments show that the proposed algorithm improves PRED(0.25) by 12.6% and 3.75% over other sequential-search feature selection methods on the ISBSG and Desharnais datasets, respectively. In addition, this thesis defines a candidate feature weight using symmetric uncertainty, a measure that originates from information theory. The feature weight reflects the impact of each candidate feature on the estimated feature. Experiments demonstrate that applying the feature weights improves the estimation model's PRED(0.25) by 8.9% compared with the model without feature weights. Finally, the thesis discusses and analyzes the drawbacks of the proposed ideas and points out directions for improvement.
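The quantitative notions mentioned in the abstract — the symmetric-uncertainty feature weight, the weighted global distance used in case selection, and the PRED(0.25) accuracy metric — can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's actual implementation: the function names are hypothetical, entropy is computed from simple counts, and continuous features are assumed to have been discretized beforehand.

```python
import numpy as np
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a discrete variable given as a sequence."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete variables."""
    joint = list(zip(x, y))
    return entropy(x) + entropy(y) - entropy(joint)


def symmetric_uncertainty(x, y):
    """SU(X,Y) = 2*I(X;Y) / (H(X) + H(Y)), normalized to the range [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * mutual_information(x, y) / (hx + hy)


def feature_weights(features, target):
    """Weight each candidate feature by its symmetric uncertainty with the
    estimated (target) feature; `features` is a list of equal-length sequences."""
    return np.array([symmetric_uncertainty(f, target) for f in features])


def weighted_distance(project_a, project_b, weights):
    """Global distance between two projects as a weighted sum of local
    per-feature distances (absolute differences used here for simplicity)."""
    local = np.abs(np.asarray(project_a, float) - np.asarray(project_b, float))
    return float(np.dot(weights, local))


def pred(actual, predicted, threshold=0.25):
    """PRED(threshold): fraction of projects whose magnitude of relative
    error, MRE = |actual - predicted| / actual, does not exceed the threshold."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mre = np.abs(actual - predicted) / actual
    return float(np.mean(mre <= threshold))
```

In this sketch, a feature's weight is simply its symmetric uncertainty with the estimated feature, so features that share more information with the target contribute more to the global distance during case selection; the clustering step described in the abstract would additionally group mutually similar features and keep only one representative per cluster before these weights are applied.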
