Technical Report

【Abstract】

Businesses, governments, and individuals leak confidential information, both accidentally and maliciously, at tremendous cost in money, privacy, national security, and reputation. Several security software vendors now offer "data loss prevention" (DLP) solutions that use simple algorithms, such as keyword lists and hashing, which are too coarse to capture the features that make sensitive documents secret. In this paper, we present automatic text classification algorithms for classifying enterprise documents as either sensitive or non-sensitive. We also introduce a novel training strategy, supplement and adjust, to create a classifier that has a low false discovery rate, even when presented with documents unrelated to the enterprise. We evaluated our algorithm on several corpora that we assembled from confidential documents published on WikiLeaks and other archives. Our classifier had a false negative rate of less than 3.0% and a false discovery rate of less than 1.0% on all our tests (i.e., in a real deployment, the classifier can identify more than 97% of information leaks while raising a false alarm at most once in every 100 alerts).
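
The two error rates quoted in the abstract can be made concrete with a small sketch. The counts below are hypothetical and chosen only for illustration; they are not results from the report. The function names are likewise assumptions, not part of any published code.

```python
def false_negative_rate(tp: int, fn: int) -> float:
    """Fraction of truly sensitive documents that the classifier misses."""
    return fn / (tp + fn)


def false_discovery_rate(tp: int, fp: int) -> float:
    """Fraction of documents flagged as sensitive that are not sensitive."""
    return fp / (tp + fp)


# Hypothetical confusion counts (illustrative only, not from the report):
# tp = sensitive documents correctly flagged
# fn = sensitive documents missed
# fp = non-sensitive documents wrongly flagged
tp, fn, fp = 980, 20, 5

fnr = false_negative_rate(tp, fn)
fdr = false_discovery_rate(tp, fp)
detected_share = 1.0 - fnr  # share of leaks the classifier catches

print(f"FNR = {fnr:.3%}, FDR = {fdr:.3%}, detected = {detected_share:.1%}")
```

A false negative rate below 3.0% corresponds to catching more than 97% of sensitive documents, and a false discovery rate below 1.0% means fewer than 1 in 100 alerts is a false alarm.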

【Preview】

Attachments
Files	Size	Format	View
RO201804100002867LZ	460KB	PDF	download


Text Classification for Data Loss Prevention

Hart, Michael ; Manadhata, Pratyusa K. ; Johnson, Rob
HP Development Company
Keywords: Data Loss Prevention; DLP; SVM; Text Classification
RP-ID : HPL-2011-114
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
PDF


	Document metrics
	Downloads: 47	Views: 58
