Computational methods to estimate error rates for peptide identifications in mass spectrometry-based proteomics

University essay from KTH/Numerisk analys, NA

Author: Xiao Liang; [2013]


Abstract: In proteomics, tandem mass spectrometry (MS/MS) is the core technology for identifying peptide components within complex mixtures on a large scale. The current bottleneck is reducing error rates and assigning accurate statistical estimates to peptide identifications. In this work, we introduce techniques for identifying chimeric spectra, in which two or more precursor ions with similar mass and retention time are co-fragmented and sequenced together by the MS/MS instrument. On this basis, we analyze the factors that lead to high identification error rates. We show that chimeric spectra correlate strongly with the ranking scores and can reduce the number of positive identifications. Additionally, we address the problem of assigning a posterior error probability (PEP) to the individual peptide-spectrum matches (PSMs) returned by search engines. This problem is computationally more difficult than estimating an error rate for a large collection of PSMs, such as the false discovery rate (FDR). Existing methods assume a parametric or semiparametric model of the underlying score distribution. We provide a kernel logistic regression procedure that makes no explicit assumptions about the score distribution. With an appropriate positive definite Gaussian kernel, the resulting PEP estimate is shown to be robust, achieving a close correspondence between the PEP-derived q-values and the FDR-derived q-values. Furthermore, setting a threshold on PEP-derived q-values rather than FDR-derived q-values accepts at least 200 more significant PSMs. Finally, kernel logistic regression is well established in the statistics literature, and we show that it can produce accurate PEP estimates for different types of PSM score functions and data.
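
The comparison between PEP-derived and FDR-derived q-values mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how either quantity is computed in the essay, so everything below is an assumption: the PEP-derived q-value is taken as the running mean of the sorted PEPs, the FDR-derived q-value uses a simple target-decoy count, and the function names `qvalues_from_pep` and `qvalues_from_decoys` are hypothetical.

```python
def qvalues_from_pep(peps):
    """Assumed estimator: the q-value of a PSM is the mean PEP of all
    PSMs whose PEP is at most as large (running mean over sorted PEPs)."""
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    q = [0.0] * len(peps)
    running_sum = 0.0
    for rank, i in enumerate(order, start=1):
        running_sum += peps[i]
        q[i] = running_sum / rank
    return q


def qvalues_from_decoys(scores, is_decoy):
    """Assumed target-decoy estimator: FDR at a score threshold is
    (#decoys above threshold) / (#targets above threshold); the q-value
    is the minimum FDR over all thresholds that still accept the PSM."""
    # Higher score is assumed to mean a better match.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fdr = []
    targets = 0
    decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdr.append(decoys / max(targets, 1))
    # Running minimum from the worst score upward makes q-values monotone.
    q = [0.0] * len(scores)
    running_min = float("inf")
    for pos in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr[pos])
        q[order[pos]] = running_min
    return q
```

Under these assumptions, accepting all PSMs whose q-value falls below a chosen threshold (for example 0.01) with each function gives the two sets of significant PSMs whose sizes can then be compared.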
