Journal article details
BMC Research Notes
FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences
Frank Oliver Glöckner [2], Timmy Schweer [3], Wolfgang Hankeln [1], Jan Gerken [2], Jost Waldmann [2]
[1] Mediomix GmbH, Eupener Straße 139, 50933 Köln, Germany; Jacobs University Bremen gGmbH, Campusring 1, 28759 Bremen, Germany; Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Celsiusstrasse 1, 28359 Bremen, Germany
Keywords: High-throughput; Data validation; FASTA
DOI: 10.1186/1756-0500-7-365
Received 2014-01-22; accepted 2014-06-10; published 2014
【 Abstract 】

Background

Advances in sequencing technologies make it harder to import and validate FASTA formatted sequence data efficiently, yet such data remains a prerequisite for most bioinformatic tools and pipelines. A comparative analysis of the commonly used Bio* frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy are limited.

Findings

FastaValidator is a platform-independent, standardized, lightweight software library written in the Java programming language. It targets computer scientists and bioinformaticians writing software that needs to parse large amounts of sequence data quickly and accurately. For end users, FastaValidator includes an interactive, out-of-the-box validation of FASTA formatted files, as well as a non-interactive mode designed for high-throughput validation in software pipelines.
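To make the idea of streaming FASTA validation concrete, here is a minimal sketch in Java. It does not use or reproduce the FastaValidator API: the class `MinimalFastaCheck`, the `Alphabet` enum, the methods `firstError` and `check`, and the exact character sets are all assumptions made for illustration. It reads the input line by line, so memory use does not grow with file size, and it reports the first structural or alphabet error it finds.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; this is NOT the FastaValidator API.
class MinimalFastaCheck {

    // Assumed alphabets: IUPAC nucleotide codes plus gap, and a permissive protein set.
    enum Alphabet {
        DNA("ACGTURYSWKMBDHVN-"),
        PROTEIN("ABCDEFGHIKLMNPQRSTUVWXYZ*-");

        private final String symbols;

        Alphabet(String symbols) {
            this.symbols = symbols;
        }

        boolean allows(char c) {
            return symbols.indexOf(Character.toUpperCase(c)) >= 0;
        }
    }

    // Returns null if the input is valid, otherwise a message describing the first error.
    static String firstError(BufferedReader in, Alphabet alphabet) throws IOException {
        String line;
        int lineNo = 0;
        boolean inRecord = false;     // a header line has been seen
        boolean sawResidues = false;  // the current record has sequence data
        while ((line = in.readLine()) != null) {
            lineNo++;
            if (line.isEmpty()) {
                continue; // assumption: blank lines are tolerated
            }
            if (line.charAt(0) == '>') {
                if (inRecord && !sawResidues) {
                    return "line " + lineNo + ": previous record has no sequence";
                }
                if (line.length() == 1) {
                    return "line " + lineNo + ": empty header";
                }
                inRecord = true;
                sawResidues = false;
            } else {
                if (!inRecord) {
                    return "line " + lineNo + ": sequence data before first header";
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (!alphabet.allows(c)) {
                        return "line " + lineNo + ", column " + (i + 1)
                                + ": invalid character '" + c + "'";
                    }
                }
                sawResidues = true;
            }
        }
        if (!inRecord) {
            return "no FASTA record found";
        }
        if (!sawResidues) {
            return "last record has no sequence";
        }
        return null;
    }

    // Convenience wrapper for in-memory text.
    static String check(String fasta, Alphabet alphabet) {
        try {
            return firstError(new BufferedReader(new StringReader(fasta)), alphabet);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String error = check(">seq1 example\nACGTN\nacgt\n", Alphabet.DNA);
        System.out.println(error == null ? "valid" : error);
    }
}
```

In a pipeline, a check like this would typically wrap a file reader instead of a string, and a non-null result would stop processing before the data reaches downstream tools.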

Conclusions

The accuracy and performance of the FastaValidator library qualify it for large data sets such as those commonly produced by massively parallel (NGS) sequencing technologies. It offers scientists a fast, accurate and standardized method for parsing and validating FASTA formatted sequence data.

【 License 】

   
© 2014 Waldmann et al.; licensee BioMed Central Ltd.

【 Figures 】

Figure 1.
