| BMC Research Notes | |
| An efficient clustering algorithm for partitioning Y-short tandem repeats data | |
| Mohamed Nizam Isa1  Zainab Abu Bakar2  Ali Seman2  | |
| [1] Medical Faculty, Masterskill University College of Health Sciences, No. 6, Jalan Lembah, Bandar Seri Alam, 81750, Johor Bahru, Johor, Malaysia;Center for Computer Sciences, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor, Malaysia | |
| 关键词: Data mining; Optimization; Clustering; Bioinformatics; Algorithms; | |
| Others : 1165497 DOI : 10.1186/1756-0500-5-557 |
|
| received in 2012-03-01, accepted in 2012-09-22, 发布年份 2012 | |
PDF
|
|
【 摘 要 】
Background
Y-Short Tandem Repeats (Y-STR) data consist of many similar and almost similar objects. This characteristic of Y-STR data causes two problems with partitioning: non-unique centroids and local minima problems. As a result, the existing partitioning algorithms produce poor clustering results.
Results
Our new algorithm, called k-Approximate Modal Haplotypes (k-AMH), obtains the highest clustering accuracy scores for five out of six datasets, and produces an equal performance for the remaining dataset. Furthermore, clustering accuracy scores of 100% are achieved for two of the datasets. The k-AMH algorithm records the highest mean accuracy score of 0.93 overall, compared to that of other algorithms: k-Population (0.91), k-Modes-RVF (0.81), New Fuzzy k-Modes (0.80), k-Modes (0.76), k-Modes-Hybrid 1 (0.76), k-Modes-Hybrid 2 (0.75), Fuzzy k-Modes (0.74), and k-Modes-UAVM (0.70).
Conclusions
The partitioning performance of the k-AMH algorithm for Y-STR data is superior to that of other algorithms, owing to its ability to solve the non-unique centroids and local minima problems. Our algorithm is also efficient in terms of time complexity, which is recorded as O(km(n-k)) and considered to be linear.
【 授权许可】
2012 Seman et al.; licensee BioMed Central Ltd.
【 预 览 】
| Files | Size | Format | View |
|---|---|---|---|
| 20150416031323227.pdf | 819KB | ||
| Figure 5. | 38KB | Image | |
| Figure 4. | 28KB | Image | |
| Figure 3. | 31KB | Image | |
| Figure 2. | 31KB | Image | |
| Figure 1. | 35KB | Image |
【 图 表 】
Figure 1.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
【 参考文献 】
- [1]Kayser M, Kittler R, Erler A, Hedman M, Lee AC, Mohyuddin A, Mehdi SQ, Rosser Z, Stoneking M, Jobling MA, Sajantila A, Tyler-Smith C: A comprehensive survey of human Y-chromosomal microsatellites. Am J Hum Genet 2004, 74(6):1183-1197.
- [2]Perego UA, Turner A, Ekins JE, Woodward SR: The science of molecular genealogy. National Genealogical Society Quarterly 2005, 93(4):245-259.
- [3]Perego UA: The power of DNA: Discovering lost and hidden relationships. Oslo: World Library and Information Congress: 71st IFLA General Conference and Council Oslo; 2005.
- [4]Hutchison LAD, Myres NM, Woodward S: Growing the family tree: The power of DNA in reconstructing family relationships. Proceedings of the First Symposium on Bioinformatics and Biotechnology (BIOT-04) 2004, 1:42-49.
- [5]Dekairelle AF, Hoste B: Application of a Y-STR-pentaplex PCR (DYS19, DYS389I and II, DYS390 and DYS393) to sexual assault cases. Forensic Sci Int 2001, 118:122-125.
- [6]Rolf B, Keil W, Brinkmann B, Roewer L, Fimmers R: Paternity testing using Y-STR haplotypes: Assigning a probability for paternity in cases of mutations. Int J Legal Med 2001, 115:12-15.
- [7]Dettlaff-Kakol A, Pawlowski R: First polish DNA “manhunt” - an application of Y-chromosome STRs. Int J Legal Med 2002, 116:289-291.
- [8]Stix G: Traces of the distant past. Sci Am 2008, 299:56-63.
- [9]Gerstenberger J, Hummel S, Schultes T, Häck B, Herrmann B: Reconstruction of a historical genealogy by means of STR analysis and Y-haplotyping of ancient DNA. Eur J Hum Genet 1999, 7:469-477.
- [10]International Society of Genetic Genealogy. http://www.isogg.org webcite
- [11]The Y Chromosome Consortium. http://ycc.biosci.arizona.edu webcite
- [12]Schlecht J, Kaplan ME, Barnard K, Karafet T, Hammer MF, Merchant NC: Machine-learning approaches for classifying haplogroup from Y chromosome STR data. PLoS Comput Biol 2008, 4(6):e1000093.
- [13]Seman A, Abu Bakar Z, Mohd Sapawi A: Centre-based clustering for Y-Short Tandem Repeats (Y-STR) as Numerical and Categorical data. Proc. 2010 Int. Conf. on Information Retrieval and Knowledge Management (CAMP’10) 2010, 1:28-33. Shah Alam, Malaysia
- [14]Seman A, Abu Bakar Z, Mohd Sapawi A: Centre-Based Hard and Soft Clustering Approaches for Y-STR Data. Journal of Genetic Genealogy 2010, 6(1):1-9. Available online: http://www.jogg.info webcite
- [15]Seman A, Abu Bakar Z, Mohd Sapawi A: Attribute Value Weighting in K-Modes Clustering for Y-Short Tandem Repeats (Y-STR) Surname. Proc. of Int. Symposium on Information Technology 2010 (ITsim’10) 2010, 3:1531-1536. Kuala Lumpur, Malaysia
- [16]Seman A, Abu Bakar Z, Mohd Sapawi A: Hard and Soft Updating Centroids for Clustering Y-Short Tandem Repeats (Y-STR) Data. Proc. 2010 IEEE Conference on Open Systems (ICOS 2010) 2010, 1:6-11. Kuala Lumpur, Malaysia
- [17]Seman A, Abu Bakar Z, Mohd Sapawi A: Modeling Centre-based Hard and Soft Clustering for Y Chromosome Short Tandem Repeats (Y‐STR) Data. Proc. 2010 International Conference on Science and Social Research (CSSR 2010) 2010, 1:73-78. Kuala Lumpur, Malaysia
- [18]Seman A, Abu Bakar Z, Mohd Sapawi A: Centre-based Hard Clustering Algorithm for Y-STR Data. Malaysia Journal of Computing 2010, 1:62-73.
- [19]Seman A, Abu Bakar Z, Isa MN: Evaluation of k-Mode-type Algorithms for Clustering Y-Short Tandem Repeats. Journal of Trends in Bioinformatics 2012, 5(2):47-52.
- [20]Ng M, Jing L: A new fuzzy k-modes clustering algorithm for categorical data. International Journal of Granular Computing, Rough Sets and Intelligent Systems 2009, 1(1):105-119.
- [21]He Z, Xu X, Deng S: Attribute value weighting in k-Modes clustering. Ithaca, NY, USA: Cornell University Library, Cornell University; 2007:1-15. available online: http://arxiv.org/abs/cs/0701013v1 webcite
- [22]Ng MK, Junjie M, Joshua L, Huang Z, He Z: On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Trans Pattern Anal Mach Intell 2007, 29(3):503-507.
- [23]Kim DW, Lee YK, Lee D, Lee KH: k-Populations algorithm for clustering categorical data. Pattern Recogn 2005, 38:1131-1134.
- [24]Huang Z, Ng M: A Fuzzy k-Modes algorithm for clustering categorical data. IEEE Trans Fuzzy Syst 1999, 7(4):446-452.
- [25]Huang Z: Extensions to the k-Means algorithm for clustering large datasets with categorical values. Data Min Knowl Discov 1998, 2:283-304.
- [26]MacQueen JB: Some methods for classification and analysis of multivariate observations. The 5th Berkeley Symposium on Mathematical Statistics and Probability 1967, 1:281-297.
- [27]Ralambondrainy H: A conceptual version of the k-Means algorithm. Pattern Recogn Lett 1995, 16:1147-1157.
- [28]Bobrowski L, Bezdek JC: c-Means clustering with the l1 and l∞ norms. IEEE Trans Syst Man Cybern 1989, 21(3):545-554.
- [29]Salim SZ, Ismail MA: k-Means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 1984, 6:81-87.
- [30]WorldFamilies.net. http://www.worldfamilies.net webcite
- [31]Fitzpatrick C: Forensic genealogy. Fountain Valley: Cal.: Rice Book Press; 2005.
- [32]Ireland yDNA project. http://www.familytreedna.com/public/IrelandHeritage/ webcite
- [33]Finland DNA Project. http://www.familytreedna.com/public/Finland/ webcite
- [34]Y-Haplogroup project. http://www.worldfamilies.net/yhapprojects/ webcite
- [35]Clan Donald Genealogy Project. http://dna-project.clan-donald-usa.org webcite
- [36]Flannery Clan. http://www.flanneryclan.ie webcite
- [37]Doug and Joan Mumma’s Home Page. http://www.mumma.org webcite
- [38]Williams Genealogy. http://williams.genealogy.fm webcite
- [39]Phillips DNA Project. . http://www.phillipsdnaproject.com webcite
- [40]Brown Genealogy Society. http://brownsociety.org webcite
- [41]San OM, Huynh V, Nakamori Y: An alternative extension of the K-Means Algorithm for clustering categorical data. IJAMCS 2004, 14(2):241-247.
- [42]Blake CL, Merz CJ: UCI repository of machine learning database. 1989.
PDF