Effectiveness results for popular e-discovery algorithms

Eugene Yang, D. Grossman, O. Frieder, R. Yurchak
{"title":"Effectiveness results for popular e-discovery algorithms","authors":"Eugene Yang, D. Grossman, O. Frieder, R. Yurchak","doi":"10.1145/3086512.3086540","DOIUrl":null,"url":null,"abstract":"E-Discovery applications rely upon binary text categorization to determine relevance of documents to a particular case. Although many such categorization algorithms exist, at present, vendors often deploy tools that typically include only one text categorization approach. Unlike previous studies that vary many evaluation parameters simultaneously, fail to include common current algorithms, weights, or features, or used small document collections which are no longer meaningful, we systematically evaluate binary text categorization algorithms using modern benchmark e-Discovery queries (topics) on a benchmark e-Discovery data set. We demonstrate the wide variance of performance obtained using the different parameter combinations, motivating this evaluation. Specifically, we compare five text categorization algorithms, three term weighting techniques and two feature types on a large standard dataset and evaluate the results of this test suite (30 variations) using metrics of greatest interest to the e-Discovery community. Our findings systematically demonstrate that an e-Discovery project is better served by a suite of, rather than a single, algorithms since performance varies greatly depending on the topic, and no approach is uniformly superior across the range of conditions and topics. To that end, we developed an open source project called FreeDiscovery that provides e-Discovery projects with simplified access to a suite of algorithms.","PeriodicalId":425187,"journal":{"name":"Proceedings of the 16th edition of the International Conference on Articial Intelligence and Law","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th edition of the International Conference on Articial Intelligence and Law","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3086512.3086540","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14

Abstract

E-Discovery applications rely upon binary text categorization to determine relevance of documents to a particular case. Although many such categorization algorithms exist, at present, vendors often deploy tools that typically include only one text categorization approach. Unlike previous studies that vary many evaluation parameters simultaneously, fail to include common current algorithms, weights, or features, or used small document collections which are no longer meaningful, we systematically evaluate binary text categorization algorithms using modern benchmark e-Discovery queries (topics) on a benchmark e-Discovery data set. We demonstrate the wide variance of performance obtained using the different parameter combinations, motivating this evaluation. Specifically, we compare five text categorization algorithms, three term weighting techniques and two feature types on a large standard dataset and evaluate the results of this test suite (30 variations) using metrics of greatest interest to the e-Discovery community. Our findings systematically demonstrate that an e-Discovery project is better served by a suite of, rather than a single, algorithms since performance varies greatly depending on the topic, and no approach is uniformly superior across the range of conditions and topics. To that end, we developed an open source project called FreeDiscovery that provides e-Discovery projects with simplified access to a suite of algorithms.
流行的电子发现算法的有效性结果
电子发现应用程序依赖二进制文本分类来确定文档与特定案例的相关性。尽管存在许多这样的分类算法,但目前,供应商通常部署的工具通常只包含一种文本分类方法。与之前的研究不同,这些研究同时改变了许多评估参数,没有包括当前常见的算法、权重或特征,或者使用了不再有意义的小文档集合,我们在基准电子发现数据集上使用现代基准电子发现查询(主题)系统地评估了二进制文本分类算法。我们演示了使用不同参数组合获得的性能的广泛差异,从而激发了这种评估。具体来说,我们在一个大型标准数据集上比较了五种文本分类算法、三种术语加权技术和两种特征类型,并使用电子发现社区最感兴趣的指标评估了该测试套件(30种变体)的结果。我们的研究结果系统地表明,一套算法比单一算法更好地服务于电子发现项目,因为性能因主题而异,没有任何方法在各种条件和主题范围内都是统一的优越。为此,我们开发了一个名为FreeDiscovery的开源项目,它为电子发现项目提供了一套简化的算法访问。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信