Algorithms for Molecular Biology | |
A mixed integer linear programming model to reconstruct phylogenies from single nucleotide polymorphism haplotypes under the maximum parsimony criterion | |
Daniele Catanzaro2  Ramamoorthi Ravi1  Russell Schwartz3  | |
[1] Tepper School of Business, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3890 | |
[2] Graphes et Optimisation Mathématique (G.O.M.), Computer Science Department, Université Libre de Bruxelles (U.L.B.), Boulevard du Triomphe, CP 210/01, B-1050, Brussels, Belgium | |
[3] Department of Biological Sciences and Lane Center for Computational Biology, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA, 15213 | |
关键词: Computational biology; Single nucleotide polymorphism; Maximum parsimony; Haplotype estimation; Phylogeny estimation; Mixed integer programming; Exact algorithms; Combinatorial optimization; | |
Others : 809308 DOI : 10.1186/1748-7188-8-3 |
|
received in 2012-06-13, accepted in 2013-01-02, 发布年份 2013 | |
【 摘 要 】
Background
Phylogeny estimation from aligned haplotype sequences has attracted more and more attention in the recent years due to its importance in analysis of many fine-scale genetic data. Its application fields range from medical research, to drug discovery, to epidemiology, to population dynamics. The literature on molecular phylogenetics proposes a number of criteria for selecting a phylogeny from among plausible alternatives. Usually, such criteria can be expressed by means of objective functions, and the phylogenies that optimize them are referred to as optimal. One of the most important estimation criteria is the parsimony which states that the optimal phylogeny T∗for a setView MathML">of n haplotype sequences over a common set of variable loci is the one that satisfies the following requirements: (i) it has the shortest length and (ii) it is such that, for each pair of distinct haplotypesView MathML">, the sum of the edge weights belonging to the path from hi to hj in T∗ is not smaller than the observed number of changes between hi and hj. Finding the most parsimonious phylogeny forView MathML">involves solving an optimization problem, called the Most Parsimonious Phylogeny Estimation Problem (MPPEP), which isView MathML">-hard in many of its versions.
Results
In this article we investigate a recent version of the MPPEP that arises when input data consist of single nucleotide polymorphism haplotypes extracted from a population of individuals on a common genomic region. Specifically, we explore the prospects for improving on the implicit enumeration strategy of implicit enumeration strategy used in previous work using a novel problem formulation and a series of strengthening valid inequalities and preliminary symmetry breaking constraints to more precisely bound the solution space and accelerate implicit enumeration of possible optimal phylogenies. We present the basic formulation and then introduce a series of provable valid constraints to reduce the solution space. We then prove that these constraints can often lead to significant reductions in the gap between the optimal solution and its non-integral linear programming bound relative to the prior art as well as often substantially faster processing of moderately hard problem instances.
Conclusion
We provide an indication of the conditions under which such an optimal enumeration approach is likely to be feasible, suggesting that these strategies are usable for relatively large numbers of taxa, although with stricter limits on numbers of variable sites. The work thus provides methodology suitable for provably optimal solution of some harder instances that resist all prior approaches.
【 授权许可】
2013 Catanzaro et al; licensee BioMed Central Ltd.
【 预 览 】
Files | Size | Format | View |
---|---|---|---|
20140709003836791.pdf | 386KB | download | |
Figure 2. | 17KB | Image | download |
Figure 1. | 18KB | Image | download |
【 图 表 】
Figure 1.
Figure 2.
【 参考文献 】
- [1]Sridhar S, Lam F, Blelloch GE, Ravi R, Schwartz R: Mixed integer linear programming for maximum parsimony phylogeny inference. IEEE/ACM Trans Comput Biol Bioinformatics 2008, 5(3):323-331.
- [2]Misra N, Blelloch GE, Ravi R, Schwartz R: Generalized Buneman pruning for inferring the most parsimonious multi-state phylogeny. J Comput Biol 2011, 18(3):445-457.
- [3]Pachter L, Sturmfels B: The mathematics of phylogenomics. SIAM Rev 2007, 49:3-31.
- [4]Bush RM, Bender CA, Subbarao K, Cox NJ, Fitch WM: Predicting the evolution of human influenza A. Science 1999, 286(5446):1921-1925.
- [5]Ross HA, Rodrigo AG: Immune-mediated positive selection drives human immunodeficency virus type 1 molecular variation and predicts disease duration. J Virol 2002, 76(22):11715-11720.
- [6]Ou CY, Ciesielski CA, Myers G, Bandea CI, Luo CC, Korber BTM, Mullins JI, Schochetman G, Berkelman RL, Economou AN, Witte JJ, Furman LJ, Satten GA, Maclnnes KA, Curran JW, Jaffe HW: Molecular epidemiology of HIV transmission in a dental practice. Science 1992, 256(5060):1165-1171.
- [7]Marra MA, Jones SJ, Astell CR, Holt RA, Brooks-Wilson A, Butterfield YS, Khattra J, Asano JK, Barber SA, Chan SY, Cloutier A, Coughlin SM, Freeman D, Girn N, Griffith OL, Leach SR, Mayo M, McDonald H, Montgomery SB, Pandoh PK, Petrescu AS, Robertson AG, Schein JE, Siddiqui A, Smailus DE, Stott JM, Yang GS, Plummer F, Andonov A, Artsob H, Bastien N, Bernard K, Booth TF, Bowness D, Czub M, Drebot M, Fernando L, Flick R, Garbutt M, Gray M, Grolla A, Jones S, Feldmann H, Meyers A, Kabani A, Li Y, Normand S, Stroher U, Tipples GA, Tyler S, Vogrig R, Ward D, Watson B, Brunham RC, Krajden M, Petric M, Skowronski DM, Upton C, Roper RL: The genome sequence of the SARS-associated coronavirus. Science 2003, 300(5624):1399-1404.
- [8]Chang BSW, Donoghue MJ: Recreating ancestral proteins. Trends Ecol Evol 2000, 15(3):109-114.
- [9]Bader DA, Moret BME, Vawter L: Industrial applications of high-performance computing for phylogeny reconstruction. In SPIE ITCom 4528. SPIE, Denver; 2001:159-168.
- [10]Harvey PH, Brown AJL, Smith JM, Nee S: New Uses for New Phylogenies. Oxford University Press, Oxford; 1996.
- [11]Catanzaro D: Estimating phylogenies from molecular data. In Mathematical Approaches to Polymer Sequence Analysis and Related Problems. Edited by Bruni R.. Springer, New York; 2011.
- [12]Beyer WA, Stein M, Smith T, Ulam S: A molecular sequence metric and evolutionary trees. Math Biosci 1974, 19:9-25.
- [13]Waterman MS, Smith TF, Singh M, Beyer WA: Additive evolutionary trees. J Theor Biol 1977, 64:199-213.
- [14]Albert VA: Parsimony, Phylogeny, and Genomics. Oxford University Press, Oxford; 2005.
- [15]Ding Z, Filkov V, Gusfield D: A linear time algorithm for Perfect Phylogeny Haplotyping (PPH) problem. J Comput Biol 2006, 13(2):522-553.
- [16]Bonizzoni P: A linear time algorithm for the Perfect Phylogeny Haplotype problem. Algorithmica 2007, 48(3):267-285.
- [17]Catanzaro D: The minimum evolution problem: Overview and classification. Networks 2009, 53(2):112-125.
- [18]Felsenstein J: Inferring Phylogenies. Sinauer Associates, Sunderland; 2004.
- [19]Bafna V, Gusfield D, Lancia G, Yooseph S: Haplotyping as perfect phylogeny: A direct approach. J Comput Biol 2003, 10:323-340.
- [20]Pennington G, Smith CA, Shackney S, Schwartz R: Reconstructing tumor phylogenies from heterogeneous single-cell data. J Bioinformatics Comput Biol 2006, 5(2a):407-427.
- [21]Riester M, Attolini CSO, Downey RJ, Singer S, Michor F: A differentiation-based phylogeny of cancer subtypes. PLoS Comput Biol 2010, 6:e100077.
- [22]Sridhar S, Dhamdhere K, Blelloch GE, Halperin E, Ravi R, Schwartz R: Algorithms for efficient near-perfect phylogenetic tree reconstruction in theory and practice. IEEE/ACM Trans Comput Biol Bioinformatics 2007, 4(4):561-571.
- [23]Goemans MX, Myung YS: A catalog of Steiner tree formulations. Networks 1993, 23:19-28.
- [24]Zhang XS, Wang RS, Wu LY, Chen L: Models and algorithms for haplotyping problem. Curr Bioinformatics 2006, 1:105-114.
- [25]Catanzaro D, Labbé M: The pure parsimony haplotyping problem: Overview and computational advances. Int Trans Oper Res 2009, 16(5):561-584.
- [26]Gusfield D: Efficient algorithms for inferring evolutionary trees. Networks 1991, 21:19-28.
- [27]Argawala R, Fernandez-Baca D: A polynomial time algorithm for the perfect phylogeny problem when the number of character states is fixed. SIAM J Comput 1994, 23:1216-1224.
- [28]Kannan S, Warnow T: A fast algorithm for the computation and enumeration of perfect phylogenies. SIAM J Comput 1997, 26:1749-1763.
- [29]Garey MR, Johnson DS: Computers and Intractability. Freeman, New York; 2003.
- [30]Du DZ, Smith JM, Rubinstein JH: Advances in Steiner trees. Kluwer Academic Publisher, Boston; 2000.
- [31]Cheng X, Du DZ: Steiner Trees in Industry. Kluwer Academic Publishers, Boston; 2001.
- [32]Fischetti M, Salazar-Gonzáles JJ, Toth P: A branch and cut algorithm for the symmetric generalized traveling salesman problem. Oper Res 1997, 45(30):378-395.
- [33]Halldórsson BV, Bafna V, Edwards N, Lippert R: Combinatorial problems arising in SNP and haplotype analysis. In Discrete Mathematics and Theoretical Computer Science, Volume 2731 of Lecture Note in Computer Science. Edited by Calude CS. Springer-Verlag, Berlin; 2003:26-47.
- [34]Robins G, Zelikovsky A: Tighter bounds for graph Steiner tree approximation. SIAM J Discrete Math 2005, 19:122-134.
- [35]The International HapMap Consortium: The international hapmap project. Nature 2003, 426(18):789-796.
- [36]Chopra S, Rao MR: The Steiner tree problem I: Formulations, compositions and extension of facets. Math Programming 1994, 64(1-3):209-229.
- [37]Chopra S, Rao MR: The Steiner tree problem II: Properties and classes of facets. Math Programming 1994, 64(1-3):231-246.
- [38]Koch T, Martin A, Voß S: SteinLib: An Updated Library on Steiner Tree Problems in Graphs. Berlin: Tech. Rep. ZIB-Report 00-37, Konrad-Zuse-Zentrum für Informationstechnik Berlin, Takustr. 7; 2000. [http://elib.zib.de/steinlib webcite]