Journal article details
BMC Bioinformatics
OD-seq: outlier detection in multiple sequence alignments
Peter Jehl [1], Fabian Sievers [1], Desmond G. Higgins [1]
[1] UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Dublin 4, Ireland
Keywords: Multiple sequence alignment; Outlier
Others  :  1229494
DOI  :  10.1186/s12859-015-0702-1
Received 2015-05-15, accepted 2015-08-13, published 2015
【 Abstract 】

Background

Multiple sequence alignments (MSAs) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or reduce the accuracy of alignments while they are being constructed. This paper describes a simple method for automatically detecting outliers, with accompanying software called OD-seq. The method finds sequences whose average distance to the rest of the sequences in a dataset is anomalous.
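To make the idea of an "average distance to the rest" concrete, the following is a minimal sketch, not the OD-seq implementation. It assumes a toy distance (the fraction of differing alignment columns, skipping columns where both sequences have a gap) and compares all pairs directly. The function names `pairwise_distance` and `average_distances` and the example alignment are hypothetical.

```python
def pairwise_distance(a: str, b: str) -> float:
    """Fraction of differing columns between two aligned sequences.

    Assumption: columns where both sequences have a gap ('-') are skipped.
    """
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def average_distances(alignment: list[str]) -> list[float]:
    """Mean distance of each sequence to all other sequences."""
    n = len(alignment)
    if n < 2:
        raise ValueError("need at least two sequences")
    totals = [0.0] * n
    # Full pairwise comparison: quadratic in the number of sequences.
    for i in range(n):
        for j in range(i + 1, n):
            d = pairwise_distance(alignment[i], alignment[j])
            totals[i] += d
            totals[j] += d
    return [t / (n - 1) for t in totals]


# Hypothetical toy alignment: the last sequence differs in its first five columns.
alignment = [
    "ACDEFGHIK",
    "ACDEFGHIK",
    "ACDEFGHLK",
    "WWWWWGHIK",
]
avg = average_distances(alignment)
```

In this toy case the last sequence has the largest average distance, which is the property an outlier test would then examine.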

Results

The software can take an MSA, a distance matrix or a set of unaligned sequences as input. Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then identified either from the interquartile range of the distribution of average distances or by bootstrapping them. The complexity of any analysis of a distance matrix is normally at least O(N^2) for N sequences. This is prohibitive for large N, but it is reduced here by using the mBed algorithm from Clustal Omega, which brings the complexity down to O(N log(N)) and makes even very large alignments easy to analyse on a single core. We tested the ability of OD-seq to detect outliers using artificial test cases built from sequences of Pfam families, seeded with sequences from other Pfam families. Using an MSA as input, OD-seq detects outliers with very high sensitivity and specificity.
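The interquartile-range test mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the rule OD-seq applies: it flags only unusually large average distances, and it uses the conventional Tukey multiplier of 1.5, which the abstract does not specify. The function name `iqr_outliers`, the parameter `k` and the sample values are hypothetical.

```python
import statistics


def iqr_outliers(avg_dist: list[float], k: float = 1.5) -> list[int]:
    """Return indices whose average distance exceeds Q3 + k * IQR.

    Assumptions: only the upper tail is flagged, and k = 1.5 is the
    conventional Tukey fence, not a value taken from the paper.
    """
    q1, _, q3 = statistics.quantiles(avg_dist, n=4, method="inclusive")
    threshold = q3 + k * (q3 - q1)
    return [i for i, d in enumerate(avg_dist) if d > threshold]


# Hypothetical average distances for eight sequences; the last is far from the rest.
sample = [0.21, 0.22, 0.20, 0.23, 0.19, 0.22, 0.21, 0.58]
flagged = iqr_outliers(sample)
```

A larger `k` makes the test more conservative, and a smaller one flags more sequences.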

Conclusion

OD-seq is a practical and simple method for detecting outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium-sized alignments of a few thousand sequences, it can detect outliers in a few seconds. The software is available at http://www.bioinf.ucd.ie/download/od-seq.tar.gz.

【 License 】

   
2015 Jehl et al.

Attachments
Files Size Format View
Fig. 8. 217KB Image download
Fig. 7. 26KB Image download
Fig. 6. 69KB Image download
Fig. 5. 46KB Image download
Fig. 4. 35KB Image download
Fig. 3. 59KB Image download
Fig. 2. 54KB Image download
Fig. 1. 43KB Image download
【 Figures 】

Fig. 1.

Fig. 2.

Fig. 3.

Fig. 4.

Fig. 5.

Fig. 6.

Fig. 7.

Fig. 8.
