Journal article

【Abstract】

Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.

Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins.

Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.

【License】

CC BY

【Preview】

Attachments
Files	Size	Format	View
RO202104289318903ZK.pdf	1551KB	PDF	download

BMC Bioinformatics
Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes

Julie D. Thompson¹, Anne Jeannin-Girardon¹, Corentin Meyer¹, Pierre Collet¹, Nicolas Scalzitti¹, Olivier Poch¹
[1] Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
Keywords: Genome annotation; Primates; Gene prediction; Protein sequence errors; Error correction
DOI : 10.1186/s12859-020-03855-1
Source: Springer


