Journal Article Details
Journal of Data Science
Automating Data Analysis Methods in Epidemiology
article
George Choueiry [1], Pascale Salameh [1]
[1] Department of Epidemiology & Biostatistics, School of Public Health, Lebanese University; School of Pharmacy, Lebanese University; School of Medicine, Lebanese University
Keywords: automation; computer software; machine learning; normal distribution
DOI  :  10.6339/JDS.201901_17(1).0003
Subject classification: Civil and Structural Engineering
Source: JDS
【 Abstract 】

Advances in software development have taken care of many technical details, which has made life easier for data analysts but has also allowed non-experts in statistics and computer science to analyze data. As a result, medical research suffers from preventable statistical errors, such as errors in choosing a hypothesis test and in checking model assumptions. Our objective is to create an automated data analysis software package that helps practitioners run non-subjective, fast, accurate, and easily interpretable analyses. We used machine learning to predict whether a distribution is normal, as an alternative to normality tests and graphical methods that avoids their downsides. We implemented methods for detecting outliers, imputing missing values, and choosing a threshold at which to cut numerical variables in order to correct for non-linearity before running a linear regression. We showed that data analysis can be automated. In small samples, our normality prediction algorithm outperformed the Shapiro-Wilk test, with a Matthews correlation coefficient of 0.5 vs. 0.16. The biggest drawback was that we did not find alternatives to the statistical tests used to check linear regression assumptions, which are problematic in large datasets. We also applied our work to a dataset on smoking in teenagers. Because our work is open source, these algorithms can be used in future research and projects.
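
For readers unfamiliar with the Matthews correlation coefficient (MCC) used in the comparison above, the following is a minimal sketch of how it is computed from a confusion matrix. It is not taken from the authors' package: the function name, the choice of "normal" as the positive class, and all counts are hypothetical and chosen only for illustration.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC for binary confusion-matrix counts.

    By a common convention (an assumption here), 0.0 is returned when
    the denominator is zero and the coefficient is undefined.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, NOT results from the paper.
# Positive class: "sample comes from a normal distribution".
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)     # every call correct
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)    # no better than guessing
# A classifier that labels almost everything "normal", as a low-power
# test might do on small samples, scores close to zero.
lenient = matthews_corrcoef(tp=48, tn=8, fp=42, fn=2)

print(perfect, chance, round(lenient, 2))
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level agreement, so it penalizes a method that gets one class right only by ignoring the other.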

【 License 】

CC BY   
