Journal article details
GigaScience
Investigation into the annotation of protocol sequencing steps in the sequence read archive
Hugh P Shanahan [1], Jamie Alnasir [1]
[1] Department of Computer Science, Royal Holloway, University of London, Egham TW20 0EX, UK
Keywords: Annotation; Experiment; Metadata; Protocol; Enrichment; Fragmentation; Ligation; Next-generation sequencing
Others  :  1204327
DOI  :  10.1186/s13742-015-0064-7
Received 2014-10-15, accepted 2015-04-28, published 2015
【 Abstract 】

Background

The workflow for producing high-throughput sequencing data from nucleic acid samples is complex. Preparing samples for next-generation sequencing involves a series of protocol steps. The bias introduced by several of these steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be quantified.

Results

We examined the experimental metadata of the public Sequence Read Archive (SRA) repository to ascertain how well important sequencing steps are annotated in submissions to the database. We ran SQL queries against the SRAdb SQLite database generated by the Bioconductor consortium, searching for keywords that commonly occur in key preparatory protocol steps. Partitioned over studies, 7.10%, 5.84% and 7.57% of all records had at least one keyword corresponding to fragmentation, ligation and enrichment, respectively. Only 4.06% of all records, partitioned over studies, had keywords for all three protocol steps (5.58% of all SRA records).
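
The kind of keyword query described above can be sketched in Python with the standard sqlite3 module. This is a minimal illustration, not the authors' actual queries: the keyword lists, the table name `experiment`, the column name `library_construction_protocol` and the toy records are all assumptions made for the example, and the sketch counts matches over all records without reproducing the per-study partitioning reported in the abstract.

```python
import sqlite3

# Hypothetical keyword stems for each protocol step; the lists used in the
# paper are not given in the abstract.
KEYWORDS = {
    "fragmentation": ["sonicat", "nebuliz", "fragment", "shear"],
    "ligation": ["ligat", "adapter"],
    "enrichment": ["pcr", "amplif", "enrich"],
}


def step_condition(column, keywords):
    """Return a parameterised '(col LIKE ? OR ...)' clause and its parameters."""
    clause = " OR ".join(f"{column} LIKE ?" for _ in keywords)
    params = [f"%{kw}%" for kw in keywords]
    return f"({clause})", params


def annotation_rates(conn, table="experiment",
                     column="library_construction_protocol"):
    """Percentage of records whose free-text field mentions each step.

    The table and column names are assumptions and must be trusted constants,
    since SQL identifiers cannot be passed as bound parameters.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    rates, clauses, all_params = {}, [], []
    for step, keywords in KEYWORDS.items():
        clause, params = step_condition(column, keywords)
        count = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {clause}", params
        ).fetchone()[0]
        rates[step] = 100.0 * count / total
        clauses.append(clause)
        all_params.extend(params)
    # Records that mention every step: AND the per-step clauses together.
    count_all = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {' AND '.join(clauses)}",
        all_params,
    ).fetchone()[0]
    rates["all_three"] = 100.0 * count_all / total
    return rates


def build_demo_db():
    """Create an in-memory database with four made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE experiment (library_construction_protocol TEXT)"
    )
    conn.executemany(
        "INSERT INTO experiment VALUES (?)",
        [
            ("DNA was sonicated, adapters ligated, then PCR amplified",),
            ("Fragmented by nebulization",),
            (None,),  # unannotated record
            ("Standard Illumina protocol",),
        ],
    )
    return conn


if __name__ == "__main__":
    for step, pct in annotation_rates(build_demo_db()).items():
        print(f"{step}: {pct:.2f}%")
```

SQLite's `LIKE` is case-insensitive for ASCII characters by default, so "PCR" and "pcr" both match, and records whose protocol field is NULL never match. To try the sketch against a real copy of the SRAdb SQLite file, replace `build_demo_db()` with `sqlite3.connect(...)` on that file and check the table and column names against its schema first.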

Conclusions

The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream from this, meta-analyses and comparative studies based on these data will have a source of bias that cannot be quantified at present.

【 License 】

   
2015 Alnasir and Shanahan; licensee BioMed Central.
