BMC Bioinformatics
Determination of sample size for a multi-class classifier based on single-nucleotide polymorphisms: a volume under the surface approach
Xinyu Liu [2], Yupeng Wang [1], TN Sriram [2]
[1] Computational Biology Service Unit, Cornell University, Ithaca, NY 14853, USA
[2] Department of Statistics, University of Georgia, Athens, GA 30602, USA
Keywords: Sample size determination; Receiver operating characteristic; Probability of correct classification; Heterogeneous stock mice data; HapMap data; Classification; Area under the receiver operating characteristic curve
DOI: 10.1186/1471-2105-15-190
Received 2013-06-25, accepted 2014-06-04, published 2014
【 Abstract 】
Background
Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual’s class membership to his/her risk of developing a disease. In multi-class classification scenarios, clinical samples are often limited due to cost constraints, making it necessary to determine the sample size needed to build an accurate classifier based on SNPs. The performance of such classifiers can be assessed using the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) for two classes and the Volume Under the ROC hyper-Surface (VUS) for three or more classes. Sample size determination based on AUC or VUS would not only guarantee an overall correct classification rate, but also make studies more cost-effective.
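To make the two performance measures concrete, the following is a minimal sketch of their empirical counterparts. It assumes each sample receives a single scalar score and that the three classes have a natural ordering; both are simplifying assumptions made here for illustration, and the function names `empirical_auc` and `empirical_vus3` are hypothetical. It is not the normal-approximation approach developed in the paper.

```python
from itertools import product


def empirical_auc(neg_scores, pos_scores):
    """Fraction of (negative, positive) pairs ranked correctly; ties count 1/2."""
    total = 0.0
    for x, y in product(neg_scores, pos_scores):
        if x < y:
            total += 1.0
        elif x == y:
            total += 0.5
    return total / (len(neg_scores) * len(pos_scores))


def empirical_vus3(s1, s2, s3):
    """Fraction of triples (one score per class) ordered class 1 < class 2 < class 3.

    Assumes the classes are ordered; ties are counted as incorrect.
    """
    hits = sum(1 for a, b, c in product(s1, s2, s3) if a < b < c)
    return hits / (len(s1) * len(s2) * len(s3))


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    print(empirical_auc([1, 3], [2, 4]))
    print(empirical_vus3([1, 2], [2, 3], [3, 4]))
```

Under this ordered-score definition, a classifier that ranks at random attains an AUC of about 1/2 and a three-class VUS of about 1/6, so VUS values should be read against a lower chance level than AUC values.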
Results
For coded SNP data from D (≥ 2) classes, we derive an optimal Bayes classifier and a linear classifier, and obtain a normal approximation to the probability of correct classification for each classifier. These approximations are then used to evaluate the associated AUCs or VUSs, and the accuracy of the approximations is validated using Monte Carlo simulations. We give a sample size determination method, which ensures that the difference between the two approximate AUCs (or VUSs) is below a pre-specified threshold. The performance of our sample size determination method is then illustrated via simulations. For the HapMap data with three and four populations, a linear classifier is built using 92 independent SNPs and the required total sample sizes are determined for a continuum of threshold values. In all, four different sample size determination studies are conducted with the HapMap data, covering cases that range from well-separated to poorly separated populations.
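The stopping rule described above can be sketched as a simple search: return the smallest sample size at which the gap between the optimal classifier's VUS and the linear classifier's VUS falls below the threshold. The function name `required_sample_size`, its parameters, and the toy learning curve are all hypothetical and chosen only for illustration; in the paper the two VUS values come from normal approximations, which are not reproduced here.

```python
def required_sample_size(vus_optimal, vus_linear_at, threshold,
                         n_min=2, n_max=100_000):
    """Smallest n in [n_min, n_max] with vus_optimal - vus_linear_at(n) < threshold.

    vus_optimal   : VUS (or AUC) of the optimal classifier, treated as a constant.
    vus_linear_at : callable mapping a sample size n to the VUS of the
                    classifier trained on n samples (hypothetical learning curve).
    Returns None if no n in the range meets the threshold.
    """
    for n in range(n_min, n_max + 1):
        if vus_optimal - vus_linear_at(n) < threshold:
            return n
    return None


if __name__ == "__main__":
    # Invented learning curve: the gap to the optimal VUS shrinks like 1/n.
    toy_curve = lambda n: 0.9 - 1.0 / n
    print(required_sample_size(0.9, toy_curve, threshold=0.03))
```

Scanning the sample sizes one by one avoids assuming that the learning curve is monotone; if monotonicity can be assumed, a bisection search would reach the same answer with fewer evaluations.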
Conclusion
For multi-class problems, we have developed a sample size determination methodology and illustrated its usefulness in obtaining a required sample size from the estimated learning curve. For classification scenarios, this methodology will help scientists determine whether the sample at hand is adequate or whether more samples are required to achieve a pre-specified accuracy. A PDF manual for the R package “SampleSizeSNP” is given in Additional file 1, and a ZIP file of the package is given in Additional file 2.
【 License 】
2014 Liu et al.; licensee BioMed Central Ltd.