Chinese Journal of Chromatography (色谱) ›› 2012, Vol. 30 ›› Issue (09): 857-863. DOI: 10.3724/SP.J.1123.2012.06021

• Research Article •

A new peptide retention time prediction method for mass spectrometry based proteomic analysis by a serial and parallel support vector machine model

ZHANG Jiyang (张纪阳)*, ZHANG Daibing (张代兵), ZHANG Wei (张伟), XIE Hongwei (谢红卫)

  1. School of Mechatronic Engineering and Automatic Control, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2012-06-18 Revised: 2012-07-26 Online: 2012-09-28 Published: 2012-09-20
  • Corresponding author: ZHANG Jiyang, PhD, lecturer; main research interest: bioinformatics. Tel: (0731)84573369, E-mail: zjjyyang@163.com.
  • Funding:

    National Natural Science Foundation of China, Young Scientists Fund (31000587).

Abstract: Online reversed-phase liquid chromatography (RPLC) plays an important role in large-scale mass spectrometry-based protein identification in proteomics. Retention time (RT) is important information for peptide identification and quantification, and can serve as evidence for distinguishing false-positive from true-positive peptide identifications. Because the organic phase in the mobile phase follows a nonlinear concentration curve over the whole run time, and because the peptides in a sample interact with one another, sequence-based RT prediction has low accuracy and generalizes poorly in practice, which makes it less effective for validating peptide identifications. A serial and parallel support vector machine (SP-SVM) method is proposed to characterize the nonlinear effect of the organic phase concentration and the interactions among peptides. The SP-SVM contains one support vector regression (SVR) model used only for model training (named p-SVR) and four SVM models (named C-SVM, l-SVR, s-SVR and n-SVR) used for RT prediction. After C-SVM distinguishes the chromatographic behavior of a peptide, l-SVR and s-SVR are used to predict the peptide RT specifically, which improves accuracy. The peptide RT is then normalized by n-SVR to characterize the peptide interactions. Applying the method to a complex sample dataset improved prediction accuracy significantly: the coefficient of determination between predicted and experimental RTs reached 0.95, the prediction error range was less than 20% of the total LC run time for more than 95% of the identified peptides, and less than 10% of the total LC run time for more than 70% of them. The performance of the model reaches the best level known to date. More importantly, the SP-SVM method provides a framework for taking the interactions among peptides in chromatographic separation into account, and its performance can be improved further by introducing new data processing and experimental strategies.

Key words: peptide identifications, prediction accuracy, proteomics, retention time, serial and parallel support vector machine (SP-SVM), liquid chromatography-mass spectrometry (LC-MS)
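
The two kinds of accuracy figures quoted in the abstract (a coefficient of determination, and the share of peptides whose prediction error stays within a fraction of the total LC run time) can be illustrated with a minimal Python sketch. Several points are assumptions rather than details given by the authors: the coefficient of determination is computed as 1 - SS_res/SS_tot (the paper may instead use a squared correlation coefficient), the "error range" is read as the absolute difference between predicted and experimental RT, and the function names and toy numbers are hypothetical and not data from the study.

```python
from typing import Sequence


def coefficient_of_determination(observed: Sequence[float],
                                 predicted: Sequence[float]) -> float:
    """Return R^2 = 1 - SS_res / SS_tot (assumed definition, see lead-in)."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: experimental RT minus predicted RT.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares around the mean experimental RT.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are all identical")
    return 1.0 - ss_res / ss_tot


def fraction_within(observed: Sequence[float],
                    predicted: Sequence[float],
                    run_time: float,
                    tolerance_fraction: float) -> float:
    """Share of peptides with |predicted - observed| < tolerance_fraction * run_time.

    Reading the "error range" as an absolute error is an assumption.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two equal-length, non-empty sequences")
    limit = tolerance_fraction * run_time
    hits = sum(1 for o, p in zip(observed, predicted) if abs(p - o) < limit)
    return hits / len(observed)


# Toy retention times in minutes (hypothetical, for illustration only).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
run_time = 100.0

print(coefficient_of_determination(observed, predicted))
print(fraction_within(observed, predicted, run_time, 0.10))
print(fraction_within(observed, predicted, run_time, 0.025))
```

Under these assumptions, the abstract's reported results would correspond to a coefficient of determination of about 0.95, `fraction_within(..., 0.20)` above 0.95 and `fraction_within(..., 0.10)` above 0.70 on the authors' dataset.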
