Recent Advances in Text Analysis

IF 7.4 | Zone 1 (Mathematics) | JCR Q1: MATHEMATICS, INTERDISCIPLINARY APPLICATIONS
Zheng Tracy Ke, Pengsheng Ji, Jiashun Jin, Wanshan Li
DOI: 10.1146/annurev-statistics-040522-022138 (https://doi.org/10.1146/annurev-statistics-040522-022138)
Publication date: 2023-11-29
Citations: 0

Abstract

Text analysis is an interesting research area in data science and has various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze the Multi-Attribute Data Set on Statisticians (MADStat), a data set on statistical publications that we collected and cleaned. The application of Topic-SCORE and other methods to MADStat leads to interesting findings. For example, we identified 11 representative topics in statistics. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. In particular, we propose a new statistical model for ranking the citation impacts of 11 topics, and we also build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of the statistical research from 1975 to 2015, from a text analysis perspective.

Expected final online publication date for the Annual Review of Statistics and Its Application, Volume 11 is March 2024. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Source journal
Annual Review of Statistics and Its Application
Categories: MATHEMATICS, INTERDISCIPLINARY APPLICATIONS; STATISTICS & PROBABILITY
CiteScore: 13.40
Self-citation rate: 1.30%
Articles published: 29
About the journal: The Annual Review of Statistics and Its Application publishes comprehensive review articles focusing on methodological advancements in statistics and the utilization of computational tools facilitating these advancements. It is abstracted and indexed in Scopus, Science Citation Index Expanded, and Inspec.