Journal article

【Abstract】

Background

Next Generation Sequencing (NGS) is producing enormous corpora of short DNA reads, affecting emerging fields such as metagenomics. Protein similarity search, a key step in annotating the protein-coding genes in these short reads and identifying their biological functions, faces daunting challenges because of the sheer size of short-read datasets.

Results

We developed RAPSearch, a fast protein similarity search tool that uses a reduced amino acid alphabet and a suffix array to detect seeds of flexible length. For the short reads we tested (translated in six frames), RAPSearch achieved a ~20-90-fold speedup over BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, showed a significant loss of sensitivity compared with RAPSearch and BLAST.
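The idea of combining a reduced amino acid alphabet with a suffix array can be illustrated with a minimal Python sketch. This is not the RAPSearch implementation: the 10-group alphabet below, the function names (`reduce_seq`, `build_suffix_array`, `find_seed`, `longest_seed`), the naive suffix array construction, and the `min_len` parameter are all assumptions chosen for illustration. The sketch only shows how similar residues collapse to one symbol, and how a sorted suffix index lets a seed be extended until it stops matching.

```python
# Hypothetical 10-group reduced alphabet (illustrative only; not necessarily
# the grouping used by RAPSearch). Each residue maps to its group's first letter.
GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_seq(seq):
    """Map a protein sequence onto the reduced alphabet ('X' for unknown residues)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq)


def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order.

    Fine for a toy example; real tools use far more efficient constructions.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_seed(text, sa, pattern):
    """Return sorted start positions where `pattern` occurs in `text`.

    Uses two binary searches over the suffix array, comparing only the
    first len(pattern) characters of each suffix.
    """
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def longest_seed(text, sa, query, min_len=3):
    """Longest prefix of `query` (at least `min_len` long) that occurs in `text`.

    Returns (length, positions); (0, []) if even the shortest seed is absent.
    """
    best = (0, [])
    for length in range(min_len, len(query) + 1):
        hits = find_seed(text, sa, query[:length])
        if not hits:
            break
        best = (length, hits)
    return best
```

In this sketch, the database sequence and the query are both passed through `reduce_seq` before indexing and lookup, so two peptides that differ only by substitutions within a group (for example `MKVLAAGIT` and `MRIVAAGLS`) produce identical reduced strings and can still seed an alignment.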

Conclusions

RAPSearch is implemented as open-source software and is available at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search, and its application to metagenomics has also been demonstrated.

【License】

2011 Ye et al; licensee BioMed Central Ltd.

BMC Bioinformatics
RAPSearch: a fast protein similarity search tool for short reads

Yuzhen Ye [2], Jeong-Hyeon Choi [1], Haixu Tang [1]
[1] Center for Genomics and Bioinformatics, Indiana University, Bloomington, IN 47405, USA
[2] School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA
Keywords: metagenomics; reduced amino acid alphabet; suffix array; similarity search; short reads
DOI: 10.1186/1471-2105-12-159

Received 2010-07-27, accepted 2011-05-15, published 2011