Effectiveness results for popular e-discovery algorithms

Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law. Pub Date: 2017-06-12. DOI: 10.1145/3086512.3086540

Eugene Yang, D. Grossman, O. Frieder, R. Yurchak


Citations: 14

Abstract

E-Discovery applications rely on binary text categorization to determine the relevance of documents to a particular case. Although many such categorization algorithms exist, vendors at present often deploy tools that include only one text categorization approach. Previous studies varied many evaluation parameters simultaneously, failed to include common current algorithms, weights, or features, or used small document collections that are no longer meaningful. In contrast, we systematically evaluate binary text categorization algorithms using modern benchmark e-Discovery queries (topics) on a benchmark e-Discovery data set. We demonstrate the wide variance in performance obtained with the different parameter combinations, which motivates this evaluation. Specifically, we compare five text categorization algorithms, three term weighting techniques, and two feature types on a large standard dataset, and we evaluate the results of this test suite (30 variations) using the metrics of greatest interest to the e-Discovery community. Our findings systematically demonstrate that an e-Discovery project is better served by a suite of algorithms than by a single algorithm, since performance varies greatly depending on the topic and no approach is uniformly superior across the range of conditions and topics. To that end, we developed an open source project called FreeDiscovery that provides e-Discovery projects with simplified access to a suite of algorithms.
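The count of 30 variations follows from crossing the three factors named in the abstract: 5 algorithms × 3 term weightings × 2 feature types. The sketch below is only an illustration of that arithmetic and of the "no single winner" argument. The abstract does not name the algorithms, weightings, or feature types, so every label here is a hypothetical placeholder, and `best_per_topic` is an assumed helper, not part of FreeDiscovery or of the authors' evaluation code.

```python
from itertools import product

# Placeholder labels only: the abstract gives the counts (5, 3, 2),
# not the names of the algorithms, weightings, or feature types.
ALGORITHMS = [f"algorithm_{i}" for i in range(1, 6)]
WEIGHTINGS = [f"weighting_{i}" for i in range(1, 4)]
FEATURE_TYPES = [f"feature_type_{i}" for i in range(1, 3)]


def build_test_suite():
    """Return every (algorithm, weighting, feature type) combination."""
    return list(product(ALGORITHMS, WEIGHTINGS, FEATURE_TYPES))


def best_per_topic(scores):
    """Map each topic to its highest-scoring configuration.

    `scores` is assumed to have the shape {topic: {configuration: score}}.
    """
    return {
        topic: max(by_config, key=by_config.get)
        for topic, by_config in scores.items()
    }


suite = build_test_suite()
print(len(suite))  # 5 * 3 * 2 = 30
```

If `best_per_topic` returns different configurations for different topics, no single configuration is uniformly best, which is the situation the abstract reports and the reason it recommends a suite of algorithms.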
