期刊论文详细信息
BMC Research Notes
An efficient clustering algorithm for partitioning Y-short tandem repeats data
Mohamed Nizam Isa1  Zainab Abu Bakar2  Ali Seman2 
[1] Medical Faculty, Masterskill University College of Health Sciences, No. 6, Jalan Lembah, Bandar Seri Alam, 81750, Johor Bahru, Johor, Malaysia;Center for Computer Sciences, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor, Malaysia
关键词: Data mining;    Optimization;    Clustering;    Bioinformatics;    Algorithms;   
Others  :  1165497
DOI  :  10.1186/1756-0500-5-557
 received in 2012-03-01, accepted in 2012-09-22,  发布年份 2012
PDF
【 摘 要 】

Background

Y-Short Tandem Repeats (Y-STR) data consist of many similar and almost similar objects. This characteristic of Y-STR data causes two problems with partitioning: non-unique centroids and local minima problems. As a result, the existing partitioning algorithms produce poor clustering results.

Results

Our new algorithm, called k-Approximate Modal Haplotypes (k-AMH), obtains the highest clustering accuracy scores for five out of six datasets, and produces an equal performance for the remaining dataset. Furthermore, clustering accuracy scores of 100% are achieved for two of the datasets. The k-AMH algorithm records the highest mean accuracy score of 0.93 overall, compared to that of other algorithms: k-Population (0.91), k-Modes-RVF (0.81), New Fuzzy k-Modes (0.80), k-Modes (0.76), k-Modes-Hybrid 1 (0.76), k-Modes-Hybrid 2 (0.75), Fuzzy k-Modes (0.74), and k-Modes-UAVM (0.70).

Conclusions

The partitioning performance of the k-AMH algorithm for Y-STR data is superior to that of other algorithms, owing to its ability to solve the non-unique centroids and local minima problems. Our algorithm is also efficient in terms of time complexity, which is recorded as O(km(n-k)) and considered to be linear.

【 授权许可】

   
2012 Seman et al.; licensee BioMed Central Ltd.

【 预 览 】
附件列表
Files Size Format View
20150416031323227.pdf 819KB PDF download
Figure 5. 38KB Image download
Figure 4. 28KB Image download
Figure 3. 31KB Image download
Figure 2. 31KB Image download
Figure 1. 35KB Image download
【 图 表 】

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

【 参考文献 】
  • [1]Kayser M, Kittler R, Erler A, Hedman M, Lee AC, Mohyuddin A, Mehdi SQ, Rosser Z, Stoneking M, Jobling MA, Sajantila A, Tyler-Smith C: A comprehensive survey of human Y-chromosomal microsatellites. Am J Hum Genet 2004, 74(6):1183-1197.
  • [2]Perego UA, Turner A, Ekins JE, Woodward SR: The science of molecular genealogy. National Genealogical Society Quarterly 2005, 93(4):245-259.
  • [3]Perego UA: The power of DNA: Discovering lost and hidden relationships. Oslo: World Library and Information Congress: 71st IFLA General Conference and Council Oslo; 2005.
  • [4]Hutchison LAD, Myres NM, Woodward S: Growing the family tree: The power of DNA in reconstructing family relationships. Proceedings of the First Symposium on Bioinformatics and Biotechnology (BIOT-04) 2004, 1:42-49.
  • [5]Dekairelle AF, Hoste B: Application of a Y-STR-pentaplex PCR (DYS19, DYS389I and II, DYS390 and DYS393) to sexual assault cases. Forensic Sci Int 2001, 118:122-125.
  • [6]Rolf B, Keil W, Brinkmann B, Roewer L, Fimmers R: Paternity testing using Y-STR haplotypes: Assigning a probability for paternity in cases of mutations. Int J Legal Med 2001, 115:12-15.
  • [7]Dettlaff-Kakol A, Pawlowski R: First polish DNA “manhunt” - an application of Y-chromosome STRs. Int J Legal Med 2002, 116:289-291.
  • [8]Stix G: Traces of the distant past. Sci Am 2008, 299:56-63.
  • [9]Gerstenberger J, Hummel S, Schultes T, Häck B, Herrmann B: Reconstruction of a historical genealogy by means of STR analysis and Y-haplotyping of ancient DNA. Eur J Hum Genet 1999, 7:469-477.
  • [10]International Society of Genetic Genealogy. http://www.isogg.org webcite
  • [11]The Y Chromosome Consortium. http://ycc.biosci.arizona.edu webcite
  • [12]Schlecht J, Kaplan ME, Barnard K, Karafet T, Hammer MF, Merchant NC: Machine-learning approaches for classifying haplogroup from Y chromosome STR data. PLoS Comput Biol 2008, 4(6):e1000093.
  • [13]Seman A, Abu Bakar Z, Mohd Sapawi A: Centre-based clustering for Y-Short Tandem Repeats (Y-STR) as Numerical and Categorical data. Proc. 2010 Int. Conf. on Information Retrieval and Knowledge Management (CAMP’10) 2010, 1:28-33. Shah Alam, Malaysia
  • [14]Seman A, Abu Bakar Z, Mohd Sapawi A: Centre-Based Hard and Soft Clustering Approaches for Y-STR Data. Journal of Genetic Genealogy 2010, 6(1):1-9. Available online: http://www.jogg.info webcite
  • [15]Seman A, Abu Bakar Z, Mohd Sapawi A: Attribute Value Weighting in K-Modes Clustering for Y-Short Tandem Repeats (Y-STR) Surname. Proc. of Int. Symposium on Information Technology 2010 (ITsim’10) 2010, 3:1531-1536. Kuala Lumpur, Malaysia
  • [16]Seman A, Abu Bakar Z, Mohd Sapawi A: Hard and Soft Updating Centroids for Clustering Y-Short Tandem Repeats (Y-STR) Data. Proc. 2010 IEEE Conference on Open Systems (ICOS 2010) 2010, 1:6-11. Kuala Lumpur, Malaysia
  • [17]Seman A, Abu Bakar Z, Mohd Sapawi A: Modeling Centre-based Hard and Soft Clustering for Y Chromosome Short Tandem Repeats (Y‐STR) Data. Proc. 2010 International Conference on Science and Social Research (CSSR 2010) 2010, 1:73-78. Kuala Lumpur, Malaysia
  • [18]Seman A, Abu Bakar Z, Mohd Sapawi A: Centre-based Hard Clustering Algorithm for Y-STR Data. Malaysia Journal of Computing 2010, 1:62-73.
  • [19]Seman A, Abu Bakar Z, Isa MN: Evaluation of k-Mode-type Algorithms for Clustering Y-Short Tandem Repeats. Journal of Trends in Bioinformatics 2012, 5(2):47-52.
  • [20]Ng M, Jing L: A new fuzzy k-modes clustering algorithm for categorical data. International Journal of Granular Computing, Rough Sets and Intelligent Systems 2009, 1(1):105-119.
  • [21]He Z, Xu X, Deng S: Attribute value weighting in k-Modes clustering. Ithaca, NY, USA: Cornell University Library, Cornell University; 2007:1-15. available online: http://arxiv.org/abs/cs/0701013v1 webcite
  • [22]Ng MK, Junjie M, Joshua L, Huang Z, He Z: On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Trans Pattern Anal Mach Intell 2007, 29(3):503-507.
  • [23]Kim DW, Lee YK, Lee D, Lee KH: k-Populations algorithm for clustering categorical data. Pattern Recogn 2005, 38:1131-1134.
  • [24]Huang Z, Ng M: A Fuzzy k-Modes algorithm for clustering categorical data. IEEE Trans Fuzzy Syst 1999, 7(4):446-452.
  • [25]Huang Z: Extensions to the k-Means algorithm for clustering large datasets with categorical values. Data Min Knowl Discov 1998, 2:283-304.
  • [26]MacQueen JB: Some methods for classification and analysis of multivariate observations. The 5th Berkeley Symposium on Mathematical Statistics and Probability 1967, 1:281-297.
  • [27]Ralambondrainy H: A conceptual version of the k-Means algorithm. Pattern Recogn Lett 1995, 16:1147-1157.
  • [28]Bobrowski L, Bezdek JC: c-Means clustering with the l1 and l∞ norms. IEEE Trans Syst Man Cybern 1989, 21(3):545-554.
  • [29]Salim SZ, Ismail MA: k-Means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 1984, 6:81-87.
  • [30]WorldFamilies.net. http://www.worldfamilies.net webcite
  • [31]Fitzpatrick C: Forensic genealogy. Fountain Valley: Cal.: Rice Book Press; 2005.
  • [32]Ireland yDNA project. http://www.familytreedna.com/public/IrelandHeritage/ webcite
  • [33]Finland DNA Project. http://www.familytreedna.com/public/Finland/ webcite
  • [34]Y-Haplogroup project. http://www.worldfamilies.net/yhapprojects/ webcite
  • [35]Clan Donald Genealogy Project. http://dna-project.clan-donald-usa.org webcite
  • [36]Flannery Clan. http://www.flanneryclan.ie webcite
  • [37]Doug and Joan Mumma’s Home Page. http://www.mumma.org webcite
  • [38]Williams Genealogy. http://williams.genealogy.fm webcite
  • [39]Phillips DNA Project. . http://www.phillipsdnaproject.com webcite
  • [40]Brown Genealogy Society. http://brownsociety.org webcite
  • [41]San OM, Huynh V, Nakamori Y: An alternative extension of the K-Means Algorithm for clustering categorical data. IJAMCS 2004, 14(2):241-247.
  • [42]Blake CL, Merz CJ: UCI repository of machine learning database. 1989.
  文献评价指标  
  下载次数:186次 浏览次数:72次