Extracting Experiment Statistics, Conditions, and Topics from Scientific Papers with STEREO

J. Data Intell. Pub Date : 2022-05-01 DOI:10.26421/jdi3.2-4

S. Epp, Michael J. Hoffmann, N. Lell, M. Mohr, A. Scherp



Abstract

We address the problem of extracting reports of statistics, along with information about the experiment conditions and experiment topics, from scientific publications. A common writing style for statistical results follows the recommendations of the American Psychological Association (APA). In practice, writing styles vary: reports do not fully follow APA style, or parameters are not reported despite being mandatory. In addition, statistics are not reported in isolation but in the context of the experiment conditions investigated and the general experiment topic. We address these challenges by proposing STEREO, a flexible pipeline based on wrapper induction and unsupervised aspect detection, to extract experiment statistics, conditions, and topics. Thus, in contrast to existing rule-based tools like statcheck, which rely on a pre-defined set of rules, we learn rules via induction. Hierarchical wrapper induction is applied to learn rules that extract the reported statistics. The challenge here is to apply wrapper induction to an information extraction task without the formatting landmarks that can be exploited in HTML pages. The result of step 1 is a set of extracted statistic reports together with the sentences in which the reports were found. This set is the input to step 2 of STEREO, which has two parts. We extract experiment conditions using a grammar-based wrapper. Furthermore, we identify the experiment topic using an unsupervised attention-based aspect extraction approach adapted to our problem domain. We applied our pipeline to the more than 100,000 documents in the CORD-19 dataset. Only 0.25% of the CORD-19 corpus (about 500 documents) was required to learn statistics extraction rules that cover 95% of the sentences in CORD-19. The statistic extraction has 100% precision on APA-conform statistics, which is identical to statcheck. In addition, STEREO can extract statistics in non-APA writing styles with 95% precision, which statcheck does not support. Extracting non-APA-conform statistics is important, as they make up more than 99% of all 113k extracted statistics. We could extract the correct conditions for 46% of the APA-conform reports (30% for non-APA). The best model for topic extraction achieves a precision of 75% on statistics reported in APA style (73% for non-APA-conform). We conclude that STEREO is a good foundation for automatic statistic extraction and for future developments in scientific paper analysis. In particular, the extraction of non-APA-conform reports is important and enables applications such as giving feedback to authors about what is missing and what could be changed. Finally, STEREO complements existing metadata extraction tools and can be integrated into a general scientific paper analysis pipeline.
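To make the contrast between pre-defined and learned rules concrete, the following minimal sketch matches APA-conform reports such as "t(38) = 2.31, p = .03" with a single hand-written pattern. It is an illustration only: it is neither STEREO's induced rule set nor statcheck's implementation. The names APA_PATTERN, StatReport, and extract_apa_statistics are hypothetical, and the pattern assumes only t, F, and r statistics.

```python
import re
from typing import List, NamedTuple, Tuple


class StatReport(NamedTuple):
    """One extracted statistic report (hypothetical structure for illustration)."""
    kind: str                 # test statistic type: "t", "F", or "r"
    df: Tuple[float, ...]     # one or two degrees of freedom
    value: float              # value of the test statistic
    p_relation: str           # "<", ">", or "="
    p_value: float            # reported p-value


# A single pre-defined rule for APA-style reports (assumption: t, F, r only).
APA_PATTERN = re.compile(
    r"""
    \b(?P<kind>t|F|r)\s*
    \(\s*(?P<df>\d+(?:\.\d+)?(?:\s*,\s*\d+(?:\.\d+)?)?)\s*\)\s*
    =\s*(?P<value>-?\d*\.?\d+)\s*,\s*
    p\s*(?P<rel>[<>=])\s*(?P<p>\d*\.\d+)
    """,
    re.VERBOSE,
)


def extract_apa_statistics(sentence: str) -> List[StatReport]:
    """Return every APA-style statistic report found in the sentence."""
    reports = []
    for match in APA_PATTERN.finditer(sentence):
        # Degrees of freedom are comma-separated inside the parentheses.
        df = tuple(float(part) for part in match.group("df").split(","))
        reports.append(
            StatReport(
                kind=match.group("kind"),
                df=df,
                value=float(match.group("value")),
                p_relation=match.group("rel"),
                p_value=float(match.group("p")),
            )
        )
    return reports


apa_sentence = (
    "The groups differed significantly, t(38) = 2.31, p = .03, while the "
    "interaction was not significant, F(1, 76) = 0.42, p > .05."
)
non_apa_sentence = "The groups differed (t=2.31, df 38; P<0.05)."

for report in extract_apa_statistics(apa_sentence):
    print(report)

# The fixed rule finds nothing in the non-APA sentence.
print(extract_apa_statistics(non_apa_sentence))
```

The second sentence carries the same kind of information in a different notation, yet the fixed pattern returns no match for it. This is the gap that rules learned from annotated examples are meant to close, given that the abstract reports non-APA-conform statistics as the large majority of what was extracted.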


