Journal article details
Algorithms for Molecular Biology
Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data
Alice Cleynen [2], Michel Koskas [2], Emilie Lebarbier [2], Guillem Rigaill [1], Stéphane Robin [2]
[1] Unité de Recherche en Génomique Végétale (URGV) INRA-CNRS-Université d’Evry Val d’Essonne, 2 Rue Gaston Crémieux, 91057 Evry Cedex, France
[2] INRA, UMR 518, 16 rue Claude Bernard, 75231 Paris Cedex 05, France
Keywords: Data compression; Count data; Genome annotation; RNA-Seq data; Fast algorithm; Exact algorithm; Segmentation algorithm
DOI  :  10.1186/1748-7188-9-6
Received 2013-05-13, accepted 2014-03-03, published 2014
【 Abstract 】

Background

Change point problems arise in many genomic analyses, such as the detection of copy number variations or of transcribed regions. Expanding Next Generation Sequencing technologies now make it possible to locate change points at nucleotide resolution.

Results

The Pruned Dynamic Programming (PDP) algorithm has a complexity that is almost linear in the sequence length when the maximal number of segments is constant, and its performance has been acknowledged for microarrays; we therefore propose to use it for Seq-experiment outputs. This requires adapting the algorithm to the negative binomial distribution with which we model the data. We show that the PDP algorithm can be used if the dispersion in the signal is known, and we provide an estimator for this dispersion. We describe a compression framework that reduces the time complexity without modifying the accuracy of the segmentation. We propose to estimate the number of segments via a penalized likelihood criterion. We illustrate the performance of the proposed methodology on RNA-Seq data.
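
To make the segmentation model concrete, here is a minimal Python sketch of exact segmentation of count data under a negative binomial model with known dispersion. It is an illustration under stated assumptions, not the package's implementation: it uses plain dynamic programming, whose cost is quadratic in the sequence length, rather than the pruned algorithm described above; it omits both the compression step and the penalized choice of the number of segments; and the names `nb_segment_cost` and `segment_counts`, as well as the parameterization with dispersion `phi` and per-segment probability `p`, are hypothetical choices made for this example.

```python
import math


def nb_segment_cost(n_obs, total, phi):
    """Negative log-likelihood of one segment, up to terms that do not depend
    on the segmentation, for NB(phi, p) counts with p set to its maximum
    likelihood value on the segment. phi is assumed known and positive."""
    if total == 0:
        return 0.0  # p_hat = 1, so both likelihood terms vanish
    p_hat = n_obs * phi / (n_obs * phi + total)
    return -(n_obs * phi * math.log(p_hat) + total * math.log(1.0 - p_hat))


def segment_counts(counts, max_segments, phi):
    """Plain (unpruned) dynamic programming over change points.

    Returns a dict mapping k to (cost, segment_ends) for every feasible
    k in 1..max_segments, where segment_ends lists the exclusive end index
    of each segment."""
    n = len(counts)
    prefix = [0]
    for y in counts:
        prefix.append(prefix[-1] + y)  # prefix sums give segment totals in O(1)

    inf = float("inf")
    # cost[k][t]: best cost of splitting counts[:t] into exactly k segments
    cost = [[inf] * (n + 1) for _ in range(max_segments + 1)]
    back = [[0] * (n + 1) for _ in range(max_segments + 1)]
    cost[0][0] = 0.0

    for k in range(1, max_segments + 1):
        for t in range(k, n + 1):
            for s in range(k - 1, t):  # last segment is counts[s:t]
                c = cost[k - 1][s] + nb_segment_cost(
                    t - s, prefix[t] - prefix[s], phi
                )
                if c < cost[k][t]:
                    cost[k][t] = c
                    back[k][t] = s

    results = {}
    for k in range(1, min(max_segments, n) + 1):
        ends, t = [], n
        for j in range(k, 0, -1):  # walk the back-pointers from the end
            ends.append(t)
            t = back[j][t]
        results[k] = (cost[k][n], sorted(ends))
    return results
```

For example, `segment_counts([1, 0, 2, 1, 0, 20, 25, 18, 22, 1, 0, 1], 3, 1.0)[3]` returns the best three-segment cost together with the segment ends `[5, 9, 12]`, which isolates the high-count block in the middle. The costs for increasing k never increase, which is why a penalty is needed to choose the number of segments.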

Conclusions

We apply our approach to a real dataset and show that it performs well. Our algorithm is available as an R package on the CRAN repository.

【 License 】

   
2014 Cleynen et al.; licensee BioMed Central Ltd.
