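A Comparative Study of Likelihood Ratio Based Forensic Text Comparison Procedures: Multivariate Kernel Density with Lexical Features vs. Word N-grams vs. Character N-grams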

2014 Fifth Cybercrime and Trustworthy Computing Conference | Pub Date: 2014-11-24 | DOI: 10.1109/CTC.2014.9

S. Ishihara

Citations: 3

Abstract

This comparative study empirically investigates the performance of three procedures for calculating authorship attribution likelihood ratios (LRs): 1) a procedure based on multivariate kernel density (MVKD) with lexical features; 2) a procedure based on word N-grams; and 3) a procedure based on character N-grams. The best-performing LRs from the three procedures are then fused into single combined LRs using logistic-regression fusion, in order to measure how much the fusion improves or degrades performance. The test data are chatlog messages that were presented as evidence in the prosecution of paedophiles. Each message group is modelled with either 500 or 1000 word tokens, so that the effect of sample size on system performance can be examined. System performance is assessed in terms of validity (= accuracy), using the log-likelihood-ratio cost (Cllr), and reliability (= precision), using 95% credible intervals (CI). While describing how the outcomes of the three procedures differ in character, the study demonstrates that the MVKD procedure performed best of the three in terms of Cllr. It also demonstrates that logistic-regression fusion is useful for combining the LRs obtained from the three procedures, yielding a good improvement in performance.
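
The validity metric named in the abstract, the log-likelihood-ratio cost (Cllr), can be illustrated with a minimal sketch. It uses the commonly cited definition of Cllr, which averages a logarithmic penalty over same-author and different-author comparisons. The function name `cllr` and the example LR values are assumptions made for illustration only; they are not taken from the paper, whose own implementation is not shown on this page.

```python
import math
from typing import Sequence


def cllr(same_author_lrs: Sequence[float], diff_author_lrs: Sequence[float]) -> float:
    """Log-likelihood-ratio cost (hypothetical helper, standard definition).

    same_author_lrs: LRs from comparisons where both texts share an author.
    diff_author_lrs: LRs from comparisons where the authors differ.
    Lower is better; a system that always outputs LR = 1 scores exactly 1.0.
    """
    if len(same_author_lrs) == 0 or len(diff_author_lrs) == 0:
        raise ValueError("both LR lists must be non-empty")
    # Same-author comparisons are penalised when the LR is small.
    ss_penalty = sum(math.log2(1.0 + 1.0 / lr) for lr in same_author_lrs)
    ss_penalty /= len(same_author_lrs)
    # Different-author comparisons are penalised when the LR is large.
    ds_penalty = sum(math.log2(1.0 + lr) for lr in diff_author_lrs)
    ds_penalty /= len(diff_author_lrs)
    return 0.5 * (ss_penalty + ds_penalty)


if __name__ == "__main__":
    # Illustrative values only: strong support for the correct hypothesis
    # in every comparison, so the cost should fall well below 1.
    print(cllr([100.0, 50.0], [0.01, 0.02]))
```

A value below 1 means the LRs carry useful information on average, while a value above 1 means they mislead more than an uninformative system would. This is why Cllr works as a single summary figure for ranking procedures such as the three compared here.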
