TransVAE-PAM: A Combined Transformer and DAG-based Approach for Enhanced Fake News Detection in Indian Context
Authors: Shivani Tufchi, Tanveer Ahmed, Ashima Yadav, Krishna Kant Agrawal, Ankit Vidyarthi
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing (Journal Article)
Publication date: 2024-03-02
DOI: 10.1145/3651160 (https://doi.org/10.1145/3651160)
Impact factor: 1.8; JCR: Q3 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
In this study, we introduce a novel method, “TransVAE-PAM”, for classifying fake news articles, tailored specifically to the Indian context. The approach uses state-of-the-art contextual and sentence-transformer embedding models to generate article embeddings. To keep the model compact, we employ a Variational Autoencoder (VAE) and a β-VAE to reduce the dimensionality of the embeddings, yielding compact latent representations. To capture the important topics in the news articles, we use the Pachinko Allocation Model (PAM), a Directed Acyclic Graph (DAG) based approach, to generate meaningful topics. These two representations, the reduced-dimension embeddings from the VAE and the topics extracted by PAM, are fused to create a feature set, which is then passed to five different classification methods for fake news detection. We test embedding generation with eight distinct transformer-based architectures. To validate the feasibility of the proposed approach, we conducted extensive experiments on a proprietary dataset sourced from “Times of India” and other online media. Given the size of the dataset, the large-scale experiments were run on an NVIDIA supercomputer. Using the DistilBERT transformer architecture, we achieved an accuracy of 96.2% and an F1 score of 96%. Complementing the method with topic modeling improved performance, with accuracy and F1 score both reaching 97%. These results indicate a promising direction: integrating advanced topic models into existing classification schemes to enhance research on fake news detection.
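The abstract does not give the training objective or the fusion operator, so the following is only a minimal sketch under stated assumptions: the standard β-VAE objective (reconstruction error plus β times the KL divergence to a standard normal prior), squared error as the reconstruction term, and plain concatenation as the fusion step. The function names, the value of `beta`, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def beta_vae_loss(
    x: Sequence[float],
    x_hat: Sequence[float],
    mu: Sequence[float],
    log_var: Sequence[float],
    beta: float = 4.0,  # assumed value; beta = 1.0 gives the plain VAE objective
) -> float:
    """Squared reconstruction error plus beta-weighted KL term for one sample."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + beta * kl_to_standard_normal(mu, log_var)


def fuse_features(latent: Sequence[float], topic_weights: Sequence[float]) -> List[float]:
    """Assumed fusion: concatenate the VAE latent code with the topic weights."""
    return list(latent) + list(topic_weights)


if __name__ == "__main__":
    # Toy values standing in for a reduced embedding and a PAM topic distribution.
    latent_code = [0.12, -0.40, 0.05]
    topic_distribution = [0.7, 0.2, 0.1]
    features = fuse_features(latent_code, topic_distribution)
    print(len(features), "fused features")
```

The fused vector would then be the input to whichever downstream classifier is being evaluated; raising `beta` above 1 weights the KL term more heavily relative to reconstruction.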
About the journal:
The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high-quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
-Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
-Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
-Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
-Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
-Machine Translation involving Asian or low-resource languages.
-Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
-Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
-Speech processing: including text-to-speech synthesis and automatic speech recognition.
-Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
-Cross-lingual information processing involving Asian or low-resource languages.
Papers that deal with theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.