Technical Report Details
Text Classification for Data Loss Prevention
Hart, Michael ; Manadhata, Pratyusa K. ; Johnson, Rob
HP Development Company
Keywords: Data Loss Prevention; DLP; SVM; Text Classification
RP-ID  :  HPL-2011-114
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

Businesses, governments, and individuals leak confidential information, both accidentally and maliciously, at tremendous cost in money, privacy, national security, and reputation. Several security software vendors now offer "data loss prevention" (DLP) solutions that use simple algorithms, such as keyword lists and hashing, which are too coarse to capture the features that make sensitive documents secret. In this paper, we present automatic text classification algorithms for classifying enterprise documents as either sensitive or non-sensitive. We also introduce a novel training strategy, supplement and adjust, to create a classifier that has a low false discovery rate, even when presented with documents unrelated to the enterprise. We evaluated our algorithm on several corpora that we assembled from confidential documents published on WikiLeaks and other archives. Our classifier had a false negative rate of less than 3.0% and a false discovery rate of less than 1.0% on all our tests (i.e., in a real deployment, the classifier can identify more than 97% of information leaks while raising at most 1 false alarm in every 100 alarms).
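
The two error rates quoted in the abstract can be illustrated with a small calculation. The sketch below is not taken from the report: it uses the standard definitions (false negative rate = missed sensitive documents / all sensitive documents; false discovery rate = false alarms / all alarms raised), and the function names and counts are hypothetical, chosen only to show how the percentages relate to raw classification counts.

```python
def false_negative_rate(tp: int, fn: int) -> float:
    """Fraction of truly sensitive documents that the classifier missed."""
    return fn / (tp + fn)


def false_discovery_rate(tp: int, fp: int) -> float:
    """Fraction of raised alarms that flagged a non-sensitive document."""
    return fp / (tp + fp)


# Hypothetical counts (assumed for illustration, not results from the report):
# tp = sensitive documents correctly flagged
# fn = sensitive documents missed
# fp = non-sensitive documents wrongly flagged
tp, fn, fp = 980, 20, 5

fnr = false_negative_rate(tp, fn)
fdr = false_discovery_rate(tp, fp)

print(f"FNR = {fnr:.3f}, FDR = {fdr:.3f}")
```

With these assumed counts, the classifier catches 98% of the sensitive documents, and about 1 in 200 of its alarms is false, which would fall within the bounds the abstract reports.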
