期刊论文

【摘要】

Background

Signatures are short sequences that are unique and not similar to any other sequence in a database that can be used as the basis to identify different species. Even though several signature discovery algorithms have been proposed in the past, these algorithms require the entirety of databases to be loaded in the memory, thus restricting the amount of data that they can process. It makes those algorithms unable to process databases with large amounts of data. Also, those algorithms use sequential models and have slower discovery speeds, meaning that the efficiency can be improved.

Results

In this research, we are debuting the utilization of a divide-and-conquer strategy in signature discovery and have proposed a parallel signature discovery algorithm on a computer cluster. The algorithm applies the divide-and-conquer strategy to solve the problem posed to the existing algorithms where they are unable to process large databases and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with just the memory of regular personal computers, the algorithm can still process large databases such as the human whole-genome EST database which were previously unable to be processed by the existing algorithms.

Conclusions

The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large database analysis and processing. The implementation of the proposed algorithm is available athttp://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm webcite.

【授权许可】

2014 Lee and Sheu; licensee BioMed Central Ltd.

【预览】

附件列表
Files	Size	Format	View
20150113174121789.pdf	536KB	PDF	download
Figure 4.	28KB	Image	download
Figure 3.	17KB	Image	download
Figure 2.	75KB	Image	download
Figure 1.	47KB	Image	download

【图表】

Figure 1.

Figure 2.

Figure 3.

Figure 4.

【参考文献】

[1]Kaderali L, Schliep A: Selecting signature oligonucleotides to identify organisms using dna arrays. Bioinformatics 2002, 18(10):1340-1349.
[2]Francois P, Charbonnier Y, Jacquet J, Utinger D, Bento M, Lew D, Kresbach G. M, Ehrat M, Schlegel W, Schrenzel J: Rapid bacterial identification using evanescent-waveguide oligonucleotide microarray classification. J Microbiol Methods 2006, 65(3):390-403.
[3]Kiryu BM, Kiryu CP: Rapid identification of candida albicans and other human pathogenic yeasts by using oligonucleotides in a pcr. J Clin Microbiol 1998, 73:1634-1641.
[4]Li F, Stormo GD: Selection of optimal dna oligos for gene expression arrays. Bioinformatics 2001, 17:1067-1076.
[5]Roten CA, Gamba P, Barblan JL, Karamata D: Comparative genometrics (cg): a database dedicated to biometric comparisons of whole genomes. Nucleic Acids Res 2002, 30(1):142-144.
[6]Hsiao W, Wan I, Jones SJ, Brinkman FS: Islandpath: aiding detection of genomic islands in prokaryotes. Bioinformatics 2003, 19(3):418-420.
[7]Amin HM, Hashem A-GM, Aziz RK: Bioinformatics determination of etec signature genes as potential targets for molecular diagnosis and reverse vaccinology. BMC Bioinformatics 2009, 10:7. BioMed Central Full Text
[8]Duitama J, Kumar DM, Hemphill E, Khan M, Mandoiu II, Nelson CE: Primerhunter: a primer design tool for pcr-based virus subtype identification. Nucleic Acids Res 2009, 37:2483-2492.
[9]Vijaya SR, Zavaljevski N, Kumar K, Reifman J: A high-throughput pipeline for designing microarray-based pathogen diagnostic assays. BMC Bioinformatics 2008, 9:185. BioMed Central Full Text
[10]Tembe W, Zavaljevski N, Bode E, Chase C, Geyer J, Wasieloski L, Benson G, Reifman J: Oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays. Bioinformatics 2007, 23(1):5-13.
[11]Satya RV, Zavaljevski N, Kumar K, Bode E, Padilla S, Wasieloski L, Geyer J, Reifman J: In silico microarray probe design for diagnosis of multiple pathogens. BMC Genomics 2008, 9:496. BioMed Central Full Text
[12]Phillippy AM, Mason JA, Ayanbule K, Sommer DD, Taviani E, Huq A, Colwell RR, Knight IT, Salzberg SL: Comprehensive dna signature discovery and validation. PLoS Comput Biol 2007, 3(5):e98.
[13]Phillippy AM, Ayanbule K, Edwards NJ, Salzberg SL: Insignia: a dna signature search web server for diagnostic assay development. Nucleic Acids Res 2009, 37(2):229-234.
[14]Rozen S, Skaletsky H: Primer3 on the www for general users and for biologist programmers. Methods Mol Biol 2000, 132:365-386.
[15]Satya RV, Kumar K, Zavaljevski N, Reifman J: A high-throughput pipeline for the design of real-time pcr signatures. BMC Bioinformatics 2010, 11:340. BioMed Central Full Text
[16]Bader KC, Grothoff C, Meier H: Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets. Bioinformatics 2011, 27:1546-1554.
[17]Zheng J, Close TJ, Jiang T, Lonardi S: Efficient selection of unique and popular oligos for large est databases. Bioinformatics 2004, 20:2101-2112.
[18]Lee HP, Sheu TF, Tsai YT, Shih CH, Tang. C Y: Efficient discovery of unique signatures on whole-genome est databases. In Proceeding of the 20th Annual ACM Symposium on Applied Computing (SAC2005). Santa Fe: Association for Computing Machinery; 2005:100-104.
[19]Lee HP, Sheu TF, Tang CY: A parallel and incremental algorithm for efficient unique signature discovery on dna databases. BMC Bioinformatics 2010, 11:132. BioMed Central Full Text
[20]Eissler T, Hodges C P Meier H: Ptpan-overcoming memory limitations in oligonucleotide string matching for primer/probe design. Bioinformatics 2011, 27:2797-2805.
[21]Marcais G, Kingsford C: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011, 27:764-770.
[22]Rizk G, Lavenier D, Chikhi R: Dsk: k-mer counting with very low memory usage. Bioinformatics 2013, 29(5):652-653.
[23]Cormen TH, Leiserson CE, Rivest RL: Introduction to Algorithms. Cambridge: MIT Press; 2009.
[24]Grundy WN, Bailey TL, Elkan CP: Parameme: a parallel implementation and a web interface for a dna and protein motif discovery tool. Bioinformatics 1999, 12:303-310.
[25]Ho ES, Jakubowski CD, Gunderson SI: itriplet, a rule-based nucleic acid sequence motif finder. Algorithm Mol Biol 2009, 29:14.
[26]Green JR, Korenberg MJ, Aboul-Magd. M O: Pci-ss: Miso dynamic nonlinear protein secondary structure prediction. BMC Bioinformatics 2009, 10:222. BioMed Central Full Text
[27]Venkatesan A, Gopal J, Candavelou M, Gollapalli S, Karthikeyan K: Computational approach for protein structure prediction. Healthcare Inform Res 2013, 19:137-147.
[28]Chen Y, Wan A, Liu W: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. BMC Bioinformatics 2006, 7(4):4.
[29]Rognes T: Paralign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res 2001, 29:1647-1652.
[30]Ebedes J, Datta. A: Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics 2004, 20(7):1193-1195.
[31]Sun W, Al-Haj S, He J: Parallel computing in protein structure topology determination. In Proceedings of 26th Army Science Conference. Orlando: Assistant Secretary of Army; 2008:-cp8.
[32]Kurtz S, Narechania A, Stein JC, Ware D: A new method to compute k-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics 2008, 9:517. BioMed Central Full Text

BMC Bioinformatics
An algorithm of discovering signatures from DNA databases on a computer cluster

Hsiao Ping Lee¹ Tzu-Fang Sheu²
[1] Department of Medical Research, Chung Shan Medical University Hospital, 110, Sec. 1, Jianguo N. Road, 40201 Taichung, Taiwan
[2] Department of Computer Science and Communication Engineering, Providence University, 200, Sec. 7, Taiwan Boulevard, 43301 Shalu Dist., Taichung, Taiwan
关键词: Divide-and-conquer strategies; Computer clusters; Signature discovery;
Others : 1085519 DOI : 10.1186/1471-2105-15-339

received in 2014-04-22, accepted in 2014-09-29, 发布年份 2014
PDF


	文献评价指标
	下载次数：364次	浏览次数：121次

【 摘 要 】

Background

Results

Conclusions

【 授权许可】

【 预 览 】

【 图 表 】

【 参考文献 】

【摘要】

【授权许可】

【预览】

【图表】

【参考文献】