Journal article

【Abstract】

Background

Next-generation sequencing (NGS) has advanced the application of high-throughput sequencing technologies in genetic and genomic variation analysis. Because of its large dynamic range of expression levels, RNA-seq readily detects transcripts with low expression. Genes with no mapped reads are clearly not expressed; however, there is ongoing debate about the level of abundance that constitutes biologically meaningful expression, and to date there is no consensus on the definition of low expression. Because random variation is high in regions with low expression and distributions of transcript expression are affected by numerous experimental factors, methods that differentiate lowly and highly expressed data in a sample are critical to interpreting classes of abundance levels in RNA-seq data.

Results

A data-adaptive approach was developed to estimate the lower bound of high expression for RNA-seq data. The Kolmogorov-Smirnov statistic and multivariate adaptive regression splines were used to determine the optimal cutoff value for separating transcripts with high and low expression. Results from the proposed method were compared to results obtained by estimating the theoretical cutoff of a fitted two-component mixture distribution. The robustness of the proposed method was demonstrated by analyzing different RNA-seq datasets that varied by sequencing depth, species, scale of measurement, and empirical density shape.
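The idea of choosing a cutoff with the Kolmogorov-Smirnov statistic can be illustrated with a minimal sketch. This is not the authors' DAFS implementation: it omits the multivariate adaptive regression spline step entirely, and it assumes (as a simplification made here, not stated in the abstract) that highly expressed log-scale values are roughly normal, so the best cutoff is the one whose upper tail is closest to a fitted normal. The function names, the `min_size` parameter, and the simulated data are all hypothetical.

```python
import math
import numpy as np

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, evaluated elementwise."""
    z = (np.asarray(x, dtype=float) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def ks_distance_to_normal(sample):
    """One-sample KS statistic between `sample` and a normal fitted to it."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    cdf = normal_cdf(x, x.mean(), x.std(ddof=1))
    above = np.arange(1, n + 1) / n - cdf  # ECDF just after each point
    below = cdf - np.arange(0, n) / n      # ECDF just before each point
    return float(max(above.max(), below.max()))

def scan_cutoff(log_expr, candidates, min_size=50):
    """Return (cutoff, distance) for the candidate whose upper tail
    has the smallest KS distance to a fitted normal (assumed criterion)."""
    log_expr = np.asarray(log_expr, dtype=float)
    best_cut, best_d = None, math.inf
    for c in candidates:
        upper_tail = log_expr[log_expr > c]
        if upper_tail.size < min_size:
            continue  # too few values to judge the fit
        d = ks_distance_to_normal(upper_tail)
        if d < best_d:
            best_cut, best_d = float(c), d
    return best_cut, best_d

# Simulated log2 expression: a low-expression and a high-expression class.
rng = np.random.default_rng(0)
log_expr = np.concatenate([rng.normal(-3, 1, 3000), rng.normal(5, 1, 7000)])
cutoff, distance = scan_cutoff(log_expr, np.linspace(-2, 4, 25))
print(f"estimated cutoff: {cutoff:.2f} (KS distance {distance:.4f})")
```

Scanning a grid of candidates rather than fixing one threshold is what makes the cutoff depend on the data: a cutoff set too low lets low-expression values contaminate the upper tail, while one set too high truncates the high-expression class, and both increase the KS distance.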

Conclusions

The analysis of real and simulated data presented here illustrates the need to employ data-adaptive methodology in lieu of arbitrary cutoffs to distinguish lowly expressed from highly expressed RNA-seq data. Our results also show the drawbacks of characterizing the data by a two-component mixture distribution when classes of gene expression are not well separated. The ability to identify stably expressed RNA-seq data is essential in the filtering step of data analysis, and methodologies that consider the underlying data structure demonstrate superior performance in preserving most of the interpretable and meaningful data. The proposed algorithm for classifying low and high regions of transcript abundance promises wide-ranging application in the continuing development of RNA-seq analysis.

【License】

2014 George and Chang; licensee BioMed Central Ltd.



BMC Bioinformatics
DAFS: a data-adaptive flag method for RNA-sequencing data to differentiate genes with low and high expression

Nysia I George¹ Ching-Wei Chang¹
[1] Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, FDA, Jefferson, AR 72079, USA
Keywords: Mixture distribution; Flag; Data-adaptive; Low expression; RNA-sequencing
Others: 818697 DOI: 10.1186/1471-2105-15-92

Received 2013-08-30, accepted 2014-03-25, published 2014