BMC Medical Research Methodology
The utility of multivariate outlier detection techniques for data quality evaluation in large studies: an application within the ONDRI project
Julia Fraser1, Richard H. Swartz2, Demetrios J. Sahlas3, Paula M. McLaughlin4, Donna Kwan4, Alicia J. Peltsch4, Frederico Pieruccini-Faria5, Manuel Montero-Odasso5, Kelly M. Sunderland6, Derek Beaton6, Malcolm A. Binns7, Stephen C. Strother8
[1] 0000 0000 8644 1405, grid.46078.3d, Department of Kinesiology, University of Waterloo, 200 University Ave W, N2L 3G1, Waterloo, Ontario, Canada; 0000 0000 9743 1587, grid.413104.3, Department of Medicine (Neurology), Sunnybrook Health Sciences Centre, 2075 Bayview Ave, M4N 3M5, Toronto, Ontario, Canada; 0000 0001 2157 2938, grid.17063.33, Faculty of Medicine, University of Toronto, 1 King’s College Cir, M5S 1A8, Toronto, Ontario, Canada; 0000 0004 1936 8227, grid.25073.33, Department of Medicine, McMaster University, 1280 Main St W, L8S 4L8, Hamilton, Ontario, Canada; 0000 0004 1936 8884, grid.39381.30, Schulich School of Medicine and Dentistry, University of Western Ontario, 1151 Richmond St, N6A 5C1, London, Ontario, Canada; 0000 0004 1936 8884, grid.39381.30, Schulich School of Medicine and Dentistry, University of Western Ontario, 1151 Richmond St, N6A 5C1, London, Ontario, Canada; 0000 0000 9674 4717, grid.416448.b, Gait and Brain Lab, Parkwood Institute, 550 Wellington Rd, N6C 0A7, London, Ontario, Canada; 0000 0001 0556 2414, grid.415847.b, Lawson Health Research Institute, 750 Base Line Rd E, N6C 2R5, London, Ontario, Canada; Rotman Research Institute, Baycrest Health Sciences, 3560 Bathurst St, M6A 2E1, Toronto, Ontario, Canada; Rotman Research Institute, Baycrest Health Sciences, 3560 Bathurst St, M6A 2E1, Toronto, Ontario, Canada; 0000 0001 2157 2938, grid.17063.33, Dalla Lana School of Public Health, University of Toronto, 155 College St, M5T 3M7, Toronto, Ontario, Canada; Rotman Research Institute, Baycrest Health Sciences, 3560 Bathurst St, M6A 2E1, Toronto, Ontario, Canada; 0000 0001 2157 2938, grid.17063.33, Medical Biophysics Department, University of Toronto, 101 College St, Suite 15-701, M5G 1L7, Toronto, Ontario, Canada
Keywords: Quality control; Multivariate outliers; Minimum covariance determinant; Principal component analysis; Visualization
DOI: 10.1186/s12874-019-0737-5
Source: publisher
【 Abstract 】
Background: Large and complex studies are now routine, and quality assurance and quality control (QC) procedures ensure reliable results and conclusions. Standard procedures may comprise manual verification and double entry, but these labour-intensive methods often leave errors undetected. Outlier detection uses a data-driven approach to identify patterns exhibited by the majority of the data and highlights data points that deviate from these patterns. Univariate methods consider each variable independently, so observations that appear odd only when two or more variables are considered simultaneously remain undetected. We propose a data quality evaluation process that emphasizes the use of multivariate outlier detection for identifying errors, and show that univariate approaches alone are insufficient. Further, we establish an iterative process that uses multiple multivariate approaches, communication between teams, and visualization for other large-scale projects to follow.

Methods: We illustrate this process with preliminary neuropsychology and gait data for the vascular cognitive impairment cohort from the Ontario Neurodegenerative Disease Research Initiative, a multi-cohort observational study that aims to characterize biomarkers within and between five neurodegenerative diseases. Each dataset was evaluated four times: with and without covariate adjustment, using two validated multivariate methods, Minimum Covariance Determinant (MCD) and Candès’ Robust Principal Component Analysis (RPCA). Results were assessed in relation to two univariate methods. Outlying participants identified by multiple multivariate analyses were compiled and communicated to the data teams for verification.

Results: Of 161 and 148 participants in the neuropsychology and gait data, 44 and 43 were flagged by one or both multivariate methods, and errors were identified for 8 and 5 participants, respectively. MCD identified all participants with errors, while RPCA identified 6/8 and 3/5 for the neuropsychology and gait data, respectively. Both outperformed univariate approaches. Adjusting for covariates had a minor effect on the participants identified as outliers, though it did affect error detection.

Conclusions: Manual QC procedures are insufficient for large studies, as many errors remain undetected. In these data, the MCD outperforms the RPCA for identifying errors, and both are more successful than univariate approaches. Therefore, data-driven multivariate outlier techniques are essential tools for QC as data become more complex.
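
The point that a univariate screen can miss an observation that is odd only in combination can be illustrated with a small sketch. It is not the authors' pipeline: it uses a classical (non-robust) sample covariance and squared Mahalanobis distance rather than MCD or RPCA, the data are made-up toy values rather than ONDRI data, and the cutoffs (|z| > 3 for the univariate screen, the 0.975 chi-square quantile with 2 degrees of freedom for the multivariate screen) are assumptions chosen for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def zscores(values):
    """Classical z-scores using the sample standard deviation."""
    m = mean(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the sample
    mean, using the classical 2x2 sample covariance (NOT a robust MCD
    estimate) and its closed-form inverse."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Toy data (hypothetical): 20 strongly correlated pairs plus one pair,
# (6, 15), whose values are each unremarkable but jointly break the pattern.
xs = [float(i) for i in range(1, 21)] + [6.0]
ys = [i + 0.5 * (-1) ** i for i in range(1, 21)] + [15.0]

# Univariate screen: flag any variable value with |z| > 3 (assumed cutoff).
z_x, z_y = zscores(xs), zscores(ys)
univariate_flags = [i for i in range(len(xs))
                    if abs(z_x[i]) > 3 or abs(z_y[i]) > 3]

# Multivariate screen: compare squared distances with the 0.975 quantile
# of a chi-square distribution with 2 degrees of freedom, which has the
# closed form -2 * ln(1 - 0.975).
cutoff = -2 * math.log(0.025)
d2 = mahalanobis_sq_2d(xs, ys)
multivariate_flags = [i for i, d in enumerate(d2) if d > cutoff]

print("univariate flags:  ", univariate_flags)
print("multivariate flags:", multivariate_flags)
```

In this toy setting the univariate screen flags nothing, while the multivariate screen flags only the last pair. A robust estimator such as MCD addresses a weakness this sketch leaves in place: the outlying pair itself inflates the classical covariance estimate, which can mask outliers in less clear-cut data.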
【 License 】
CC BY