{"title":"识别关键字作为高质量过滤器的有效方法,以获取涉及紧急情况的Twitter西班牙语数据","authors":"Joel Garcia-Arteaga , Jesús Zambrano-Zambrano , Jorge Parraga-Alava , Jorge Rodas-Silva","doi":"10.1016/j.csl.2023.101579","DOIUrl":null,"url":null,"abstract":"<div><p><span>Twitter has become a powerful knowledge source for data extraction for data mining projects due to the amount of data generated by its users, which allows researchers to find content of almost any topic in real time, but this depends on the quality of the keywords used, otherwise the extracted data will have a high percentage of irrelevant content. In this paper, we introduce a time-aware machine-learning-based approach to identify meaningful keywords to maximize the extraction of relevant emergency-related tweets when the Twitter API is used. We follow the </span><em>CRISP-DM</em> methodology. The first stage relies on <em>problem understanding</em>, where we detected the necessity of using meaningful keywords to filter content and extract data with more quality and reduce the percentage of irrelevant tweets. In the second stage, <em>data collection</em>, we used the official Twitter API to extract and label tweets as “<em>emergencia</em>” and “<em>no emergencia</em>”. After that, we analyzed the collected data (<em>data understanding</em><span>) to determine preprocessing techniques and to prepare the data for the model. Finally, in the </span><em>modeling</em> and <em>testing</em><span><span> stages, we trained a restricted Boltzmann machine and four variations of </span>autoencoders<span>, including an architecture proposed by a genetic algorithm, to use them as keyword identifiers and to determine which of them has the best performance to deploy it to production (</span></span><em>deployment</em> stage). 
The results show a slightly better performance of the autoencoder proposed by the genetic algorithm (GADAE), achieving a <span><math><msup><mrow><mi>R</mi></mrow><mrow><mn>2</mn></mrow></msup></math></span> <em>score</em> of 0.97, a <span><em>MAE</em></span> of <span><math><mrow><mn>14</mn><mo>×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mo>−</mo><mn>3</mn></mrow></msup></mrow></math></span>, and a <span><em>MSE</em></span> of <span><math><mrow><mn>4</mn><mo>×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mo>−</mo><mn>4</mn></mrow></msup></mrow></math></span>. GADAE, the best model, managed to extract 110% more relevant tweets than manual filtering in the context of emergency-implicated tweets in Ecuador.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":3.1000,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An effective approach for identifying keywords as high-quality filters to get emergency-implicated Twitter Spanish data\",\"authors\":\"Joel Garcia-Arteaga , Jesús Zambrano-Zambrano , Jorge Parraga-Alava , Jorge Rodas-Silva\",\"doi\":\"10.1016/j.csl.2023.101579\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p><span>Twitter has become a powerful knowledge source for data extraction for data mining projects due to the amount of data generated by its users, which allows researchers to find content of almost any topic in real time, but this depends on the quality of the keywords used, otherwise the extracted data will have a high percentage of irrelevant content. In this paper, we introduce a time-aware machine-learning-based approach to identify meaningful keywords to maximize the extraction of relevant emergency-related tweets when the Twitter API is used. We follow the </span><em>CRISP-DM</em> methodology. 
The first stage relies on <em>problem understanding</em>, where we detected the necessity of using meaningful keywords to filter content and extract data with more quality and reduce the percentage of irrelevant tweets. In the second stage, <em>data collection</em>, we used the official Twitter API to extract and label tweets as “<em>emergencia</em>” and “<em>no emergencia</em>”. After that, we analyzed the collected data (<em>data understanding</em><span>) to determine preprocessing techniques and to prepare the data for the model. Finally, in the </span><em>modeling</em> and <em>testing</em><span><span> stages, we trained a restricted Boltzmann machine and four variations of </span>autoencoders<span>, including an architecture proposed by a genetic algorithm, to use them as keyword identifiers and to determine which of them has the best performance to deploy it to production (</span></span><em>deployment</em> stage). The results show a slightly better performance of the autoencoder proposed by the genetic algorithm (GADAE), achieving a <span><math><msup><mrow><mi>R</mi></mrow><mrow><mn>2</mn></mrow></msup></math></span> <em>score</em> of 0.97, a <span><em>MAE</em></span> of <span><math><mrow><mn>14</mn><mo>×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mo>−</mo><mn>3</mn></mrow></msup></mrow></math></span>, and a <span><em>MSE</em></span> of <span><math><mrow><mn>4</mn><mo>×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mo>−</mo><mn>4</mn></mrow></msup></mrow></math></span>. 
GADAE, the best model, managed to extract 110% more relevant tweets than manual filtering in the context of emergency-implicated tweets in Ecuador.</p></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2023-10-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230823000980\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230823000980","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
An effective approach for identifying keywords as high-quality filters to get emergency-implicated Twitter Spanish data
Twitter has become a powerful knowledge source for data mining projects because of the volume of data its users generate, which allows researchers to find content on almost any topic in real time. The relevance of the extracted data, however, depends on the quality of the keywords used to filter it; poor keywords yield a high percentage of irrelevant content. In this paper, we introduce a time-aware, machine-learning-based approach to identify meaningful keywords that maximize the extraction of relevant emergency-related tweets through the Twitter API. We follow the CRISP-DM methodology. In the first stage, problem understanding, we identified the need for meaningful keywords to filter content, extract higher-quality data, and reduce the percentage of irrelevant tweets. In the second stage, data collection, we used the official Twitter API to extract tweets and label them as “emergencia” or “no emergencia”. We then analyzed the collected data (data understanding) to choose preprocessing techniques and prepare the data for the models. Finally, in the modeling and testing stages, we trained a restricted Boltzmann machine and four variations of autoencoders, including an architecture proposed by a genetic algorithm, to use them as keyword identifiers and to determine which performs best before deploying it to production (deployment stage). The results show slightly better performance for the autoencoder whose architecture was proposed by the genetic algorithm (GADAE), which achieved an R² score of 0.97, a MAE of 14×10⁻³, and a MSE of 4×10⁻⁴. GADAE, the best model, extracted 110% more relevant tweets than manual filtering in the context of emergency-related tweets in Ecuador.
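The abstract evaluates the keyword-identifier models with three standard regression metrics: R² score, MAE, and MSE. As a minimal illustration of how those metrics are defined (this is not the authors' code, and the relevance scores below are hypothetical), they can be computed as:

```python
# Standard regression metrics as reported in the abstract (R^2, MAE, MSE).

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - yhat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average of (y_i - yhat_i)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical target and predicted relevance scores for four keywords.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.1, 0.9, 2.0, 3.2]

print(round(mae(y_true, y_pred), 6))       # 0.1
print(round(mse(y_true, y_pred), 6))       # 0.015
print(round(r2_score(y_true, y_pred), 6))  # 0.988
```

An R² near 1 together with a small MAE and MSE, as reported for GADAE, indicates that the model's predicted keyword relevance closely tracks the target values.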
Journal introduction:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.