Thesis details
Detection of frameshifts and improving genome annotation
Antonov, Ivan Valentinovich ; Computational Science and Engineering
University: Georgia Institute of Technology
Department: Computational Science and Engineering
Keywords: Programmed frameshifting; Frameshifts; Pseudogenes; Indel mutations; Sequencing errors
Others  :  https://smartech.gatech.edu/bitstream/1853/45923/1/antonov_ivan_v_201212_phd.pdf
United States | English
Source: SMARTech Repository
【 Abstract 】

We developed a new program called GeneTack for ab initio frameshift detection in intronless protein-coding nucleotide sequences. The GeneTack program uses a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path that discriminates between true adjacent genes and a single gene with a frameshift. We tested GeneTack, as well as two earlier developed programs, FrameD and FSFind, on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. We observed that the average frameshift prediction accuracy of GeneTack, in terms of (Sn+Sp)/2 values, was higher by a significant margin than the accuracy of the other two programs.

GeneTack was used to screen 1,106 complete prokaryotic genomes, and 206,991 genes with frameshifts (fs-genes) were identified. Our goal was to determine if a frameshift transition was due to (i) a sequencing error, (ii) an indel mutation or (iii) a recoding event. We grouped 102,731 genes with frameshifts (fs-genes) into 19,430 clusters based on sequence similarity between their protein products (fs-proteins), conservation of predicted frameshift position, and its direction. While fs-genes in 2,810 clusters were classified as conserved pseudogenes and fs-genes in 1,200 clusters were classified as hypothetical pseudogenes, 5,632 fs-genes from 239 clusters possessing conserved motifs near frameshifts were predicted to be recoding candidates. Experiments were performed for sequences derived from 20 out of the 239 clusters; programmed ribosomal frameshifting with efficiency higher than 10% was observed for four clusters.

GeneTack was also applied to 1,165,799 mRNAs from 100 eukaryotic species, and 45,295 frameshifts were identified. A clustering approach similar to the one used for prokaryotic fs-genes allowed us to group 12,103 fs-genes into 4,087 clusters. Known programmed frameshift genes were among the obtained clusters. Several clusters may correspond to new examples of dual coding genes.

We developed a web interface to browse a database containing all the fs-genes predicted by GeneTack in prokaryotic genomes and eukaryotic mRNA sequences. The fs-genes can be retrieved by similarity search to a given query sequence, by fs-gene cluster browsing, etc. Clusters of fs-genes are characterized with respect to their likely origin, such as pseudogenization, phase variation, programmed frameshifts, etc. All the tools and the database of fs-genes are available at the GeneTack web site http://topaz.gatech.edu/GeneTack/
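
The abstract's central idea, decoding a sequence with the Viterbi algorithm and reading frameshifts off the most likely state path, can be illustrated with a toy model. The sketch below is not GeneTack's HMM. It assumes a hypothetical three-state model in which each state is a codon position, each position strongly prefers one nucleotide, and a small probability `P_SHIFT` allows the codon cycle to break. All names and parameter values are illustrative assumptions.

```python
import math

PHASES = (0, 1, 2)  # hidden state: codon position of the current nucleotide

# Hypothetical emission model: each codon position prefers one nucleotide.
PREFERRED = {0: "A", 1: "C", 2: "G"}
P_MATCH, P_OTHER = 0.91, 0.03  # 0.91 + 3 * 0.03 = 1.0 over A, C, G, T

P_SHIFT = 0.01  # hypothetical probability of each kind of frame break


def log_emit(phase, nt):
    """Log-probability that a codon position emits nucleotide nt."""
    return math.log(P_MATCH if PREFERRED[phase] == nt else P_OTHER)


def log_trans(prev, cur):
    """Log-probability of moving from codon position prev to cur."""
    if cur == (prev + 1) % 3:
        return math.log(1.0 - 2 * P_SHIFT)  # normal codon progression
    return math.log(P_SHIFT)  # position repeated or skipped: frame break


def viterbi_phases(seq):
    """Return the most likely codon-position path for seq (log-space Viterbi)."""
    score = {p: math.log(1 / 3) + log_emit(p, seq[0]) for p in PHASES}
    back = []  # back[k][state at k+1] = best predecessor state at k
    for nt in seq[1:]:
        new_score, pointers = {}, {}
        for cur in PHASES:
            best_prev = max(PHASES, key=lambda prev: score[prev] + log_trans(prev, cur))
            new_score[cur] = score[best_prev] + log_trans(best_prev, cur) + log_emit(cur, nt)
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(PHASES, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def frameshift_positions(path):
    """Indices where the decoded path breaks the 0 -> 1 -> 2 -> 0 cycle."""
    return [i for i in range(1, len(path)) if path[i] != (path[i - 1] + 1) % 3]


intact = "ACGACGACG"
shifted = "ACGAC" + "ACGACG"  # one nucleotide deleted after the fifth base
print(frameshift_positions(viterbi_phases(intact)))
print(frameshift_positions(viterbi_phases(shifted)))
```

In this toy setting, explaining the deletion with a single low-probability frame break is cheaper than keeping the original frame and paying for a mismatched emission at every downstream position, so the decoded path changes frame at the deletion. A realistic model would also need non-coding and gene-boundary states to separate a frameshifted gene from two adjacent genes, as the abstract describes.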

【 Preview 】
Attachments
Files | Size | Format | View
Detection of frameshifts and improving genome annotation | 5157KB | PDF | download