Robust Drug Use Detection on X: Ensemble Method with a Transformer Approach

IF 2.9, Zone 4 (Multidisciplinary Journals), JCR Q1 (Multidisciplinary)

Arabian Journal for Science and Engineering Pub Date : 2024-03-14 DOI:10.1007/s13369-024-08845-6

Reem Al-Ghannam, Mourad Ykhlef, Hmood Al-Dossari


Citations: 0

Abstract

There is a growing trend for groups associated with drug use to exploit social media platforms to propagate content that poses a risk to the population, especially those susceptible to drug use and addiction. Detecting drug-related social media content has become important for governments, technology companies, and those responsible for enforcing laws against proscribed drugs. Their efforts have led to the development of various techniques for identifying and efficiently removing drug-related content, as well as for blocking network access for those who create it. This study introduces a manually annotated Twitter dataset consisting of 112,057 tweets from 2008 to 2022, compiled for use in detecting associations connected with drug use. Working in groups, expert annotators classified tweets as either related or unrelated to drug use. The dataset was subjected to exploratory data analysis to identify its defining features. Several classification algorithms, including support vector machines, XGBoost, random forest, Naive Bayes, LSTM, and BERT, were used in experiments with this dataset. Among the baseline models, BERT with textual features achieved the highest F1-score, at 0.9044. However, this performance was surpassed when the BERT base model and its textual features were concatenated with a deep neural network model, incorporating numerical and categorical features in the ensemble method, achieving an F1-score of 0.9112. The Twitter dataset used in this study was made publicly available to promote further research and enhance the accuracy of the online classification of English-language drug-related content.
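The fusion step described in the abstract, where a text representation is concatenated with a network branch over numerical and categorical features, can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the abstract does not give layer sizes, the feature list, or training details, so the random stand-in for the BERT sentence vector, the 16-unit tabular branch, the two numeric inputs, the `VOCAB` posting-client category, and the untrained random weights are all assumptions made for the example.

```python
import math
import random

TEXT_DIM = 768  # hidden size of BERT base; the text vector used below is a random stand-in


def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot list over a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b), one weight row per output unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def random_layer(rng, n_in, n_out):
    """Untrained weights and zero biases, for illustration only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(text_vec, numeric, category, vocabulary, params):
    """Concatenate the text vector with the tabular branch output and score the result."""
    tabular_in = list(numeric) + one_hot(category, vocabulary)
    tabular_out = dense(tabular_in, *params["tabular"], relu)
    fused = list(text_vec) + tabular_out  # the concatenation step
    (prob,) = dense(fused, *params["head"], sigmoid)
    return fused, prob


rng = random.Random(0)
VOCAB = ["web", "android", "iphone"]  # hypothetical categorical feature (posting client)
HIDDEN = 16                            # hypothetical width of the tabular branch
N_NUMERIC = 2                          # hypothetical count of numeric features

params = {
    "tabular": random_layer(rng, N_NUMERIC + len(VOCAB), HIDDEN),
    "head": random_layer(rng, TEXT_DIM + HIDDEN, 1),
}

# Stand-in for a BERT sentence embedding; a real pipeline would take it from the encoder.
text_vec = [rng.uniform(-1.0, 1.0) for _ in range(TEXT_DIM)]

fused, prob = fuse_and_classify(text_vec, [0.3, 1.2], "android", VOCAB, params)
print(len(fused), round(prob, 3))
```

Because the weights are random, the printed probability carries no meaning; the point is only that the classification head sees one vector holding both the text representation and the transformed metadata features.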


Source journal

Arabian Journal for Science and Engineering (Multidisciplinary Journals)

CiteScore

5.20

Self-citation rate

3.40%

Articles published

Review time

4.3 months

About the journal: King Fahd University of Petroleum & Minerals (KFUPM) partnered with Springer to publish the Arabian Journal for Science and Engineering (AJSE). AJSE, which has been published by KFUPM since 1975, is a recognized national, regional and international journal that provides an opportunity for the dissemination of research advances from the Kingdom of Saudi Arabia, the MENA region and the world.