| Data | Volume: 6 |
| An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing | |
| Mário Antunes [1]  Vitor Beires Nogueira [2]  Gonçalo Carnaz [2]  | |
| [1] Computer Science and Communication Research Centre (CIIC), School of Technology and Management, Polytechnic of Leiria, 2411-901 Leiria, Portugal; | |
| [2] Informatics Department, University of Évora, 7002-554 Évora, Portugal; | |
| Keywords: crime-related documents; cybersecurity; criminal investigation; Portuguese language corpus; natural language processing; 5W1H | |
| DOI : 10.3390/data6070071 | |
| Source: DOAJ | |
【 Abstract 】
Criminal investigation collects and analyzes the facts related to a crime, from which investigators can deduce evidence to be used in court. It is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. Computerized processing of these documents aids criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated and a mean precision of
【 License 】
Unknown