Twitter Data Analysis and Text Normalization in Collecting Standard Word

Q3 Engineering

Journal of Applied Engineering and Technological Science. Pub Date: 2023-06-05. DOI: 10.37385/jaets.v4i2.1991

Arif Ridho Lubis, M. K. Nasution


Citations: 2

Abstract


Twitter is one of the most important data sources in social data analysis. However, the text on Twitter is often unstructured, which makes it difficult to collect standard words. In this research, we therefore analyze Twitter data and normalize the text to produce standard words that can be used in social data analysis. The purpose of this research is to improve the quality of standard-word collection from Twitter and to make social data analysis more accurate and valid. The method combines natural language processing with classification algorithms and text normalization techniques. The result of this study is a set of standard words that can be used for social data analysis, totaling 11,430 words, of which 4,075 are structural or formal words and 7,355 are informal words. Informal words are corrected using trusted sources to create a corpus of formal and informal words obtained from social media tweet data from @fullSenyum. The contribution of this research is twofold: the method developed can improve the quality of social data collection from Twitter by ensuring that the words used are standard and accurate, and the text normalization method can serve as a reference for text normalization in other social data, thus facilitating collection and better-quality social data analysis. This research can assist researchers and practitioners in understanding natural language processing techniques and their application in social data analysis. It is also expected to help in collecting social data more effectively and efficiently.
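The abstract does not specify the classification algorithm or the trusted sources used for correction, so the following is only a minimal sketch of the general idea: split tweets into tokens, label each token as formal or informal, and map informal tokens to a standard form. The word lists `STANDARD_WORDS` and `INFORMAL_TO_STANDARD`, the English toy entries, and the function names are all hypothetical, and the plain dictionary lookup stands in for whatever classifier the authors actually used.

```python
import re

# Hypothetical toy lexicon of standard (formal) words; not taken from the paper.
STANDARD_WORDS = {"you", "are", "great", "thanks", "for", "the", "help"}

# Hypothetical informal -> standard mapping, standing in for a "trusted source".
INFORMAL_TO_STANDARD = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "thx": "thanks",
    "4": "for",
}

# URLs, @mentions, and #hashtags are treated as noise in this sketch.
URL_OR_MENTION = re.compile(r"https?://\S+|[@#]\w+")
TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(tweet: str) -> list[str]:
    """Lowercase the tweet, drop URLs/mentions/hashtags, and split into tokens."""
    cleaned = URL_OR_MENTION.sub(" ", tweet.lower())
    return TOKEN.findall(cleaned)


def classify(token: str) -> str:
    """Label a token as formal, informal, or unknown by simple lexicon lookup."""
    if token in STANDARD_WORDS:
        return "formal"
    if token in INFORMAL_TO_STANDARD:
        return "informal"
    return "unknown"


def normalize(tweet: str) -> list[str]:
    """Replace known informal tokens with their standard form; keep the rest."""
    return [INFORMAL_TO_STANDARD.get(tok, tok) for tok in tokenize(tweet)]


if __name__ == "__main__":
    sample = "@someone thx 4 the help, u r gr8! https://example.com/x"
    for tok in tokenize(sample):
        print(tok, classify(tok))
    print(normalize(sample))
```

Tokens labeled `unknown` are left unchanged here; in a real corpus-building workflow they would presumably be the candidates sent for manual or source-based correction.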


Source journal

Journal of Applied Engineering and Technological Science

CiteScore: 1.50

Self-citation rate: 0.00%

Articles published: (value not shown)

Review time: 4 weeks