Journal article

【Abstract】

Background: Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most widely used data preprocessing techniques. Although many proposals exist for static Big Data preprocessing, little research has been devoted to the continuous Big Data problem. Apache Flink is a recent Big Data framework, following the MapReduce paradigm, focused on distributed stream and batch data processing.

In this paper, we propose a data stream library for Big Data preprocessing, named DPASF, under Apache Flink. The library is composed of six of the most popular and widely used data preprocessing algorithms: three for discretization and three for feature selection.

Results: The algorithms have been tested using two Big Data datasets. Experimental results show that preprocessing can not only reduce the size of the data, but also maintain or even improve the original accuracy in a short period of time.

Conclusion: DPASF contains algorithms that are useful when dealing with Big Data streams. The preprocessing algorithms included in the library are able to handle Big Data datasets efficiently and to correct imperfections in the data.

【License】

CC BY

【Preview】

Attachments
Files	Size	Format	View
RO201910102077909ZK.pdf	1078KB	PDF	download

Big Data Analytics
DPASF: a flink library for streaming data preprocessing

[1] Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain (ISNI 0000000121678994; GRID grid.4489.1)
Keywords: Flink; Big data; Machine learning; Data preprocessing
DOI : 10.1186/s41044-019-0041-8
Source: publisher


	Article metrics
	Downloads: 35	Views: 9
