Journal article details
BMC Bioinformatics
Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes
Julie D. Thompson [1], Anne Jeannin-Girardon [1], Corentin Meyer [1], Pierre Collet [1], Nicolas Scalzitti [1], Olivier Poch [1]
[1] Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
Keywords: Genome annotation; Primates; Gene prediction; Protein sequence errors; Error correction
DOI: 10.1186/s12859-020-03855-1
Source: Springer
【 Abstract 】

Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.

Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins.

Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.

【 License 】

CC BY   
