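Efficient Dataset Preparation Techniques for Regional/Marathi Language Analysis: Creating Customized Dataset for Regional Language/Marathi Language Text Analysis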

2023 Somaiya International Conference on Technology and Information Management (SICTIM). Pub Date: 2023-03-24. DOI: 10.1109/SICTIM56495.2023.10104666

Sudashan Sirsat, Nitish Zulpe

Citations: 0

Abstract

Regional-language content is key to globalizing any successful internet-based business model. A large population now prefers to access the internet in its mother tongue or regional language, and this has become the new normal. Regional-language content on social media and World Wide Web pages has drawn the attention of many business analysts, data scientists, and social reformers who want to understand regional-language sentiment through this very large volume of opinionated text. Regional-language sentiment analysis, and Marathi sentiment analysis in particular, becomes possible only if one can create a dataset that copes with the text-analytics challenges of a regional language, such as uniformity and syntactic and semantic challenges. This study is a small attempt to create a basic dataset that can support future regional-language or Marathi sentiment analysis using NLP- and SA-based algorithmic approaches. It attempts to generate a Marathi dataset from opinionated social media text and from web scraping of a Marathi-language webpage. All technical issues encountered while generating the regional-language or Marathi dataset are to be recorded, rectified, and refined through rigorous iterations so that the dataset is ready for future Marathi sentiment analysis. The study also tries to understand what regional sentiment analysis requires of a dataset, the most suitable file structure, and an efficient way of creating and customizing the Marathi text dataset so that it is ready for Natural Language Processing (NLP) and Sentiment Analysis (SA) in follow-up studies.
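The abstract does not spell out the cleaning steps or the file structure, so the following is only an illustrative sketch of how scraped or social media snippets might be made uniform and stored. It is not the authors' method. The function names (`clean_marathi`, `build_rows`, `to_csv`), the choice of a three-column CSV, and the decision to keep only Devanagari characters are all assumptions made for this example.

```python
import csv
import io
import re
import unicodedata

# Assumption: keep only the Devanagari block (U+0900-U+097F), the zero-width
# joiner/non-joiner used inside some Marathi conjuncts, and whitespace.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\u200C\u200D\s]")
WHITESPACE = re.compile(r"\s+")


def clean_marathi(text: str) -> str:
    """Normalize one raw snippet to uniform Devanagari-only text."""
    text = unicodedata.normalize("NFC", text)  # consistent Unicode form
    text = NON_DEVANAGARI.sub(" ", text)       # drop Latin text, URLs, emoji, ASCII punctuation
    return WHITESPACE.sub(" ", text).strip()   # collapse runs of whitespace


def build_rows(snippets, source):
    """Return de-duplicated (id, source, text) rows, skipping empty results."""
    seen, rows = set(), []
    for raw in snippets:
        text = clean_marathi(raw)
        if text and text not in seen:
            seen.add(text)
            rows.append((len(rows) + 1, source, text))
    return rows


def to_csv(rows) -> str:
    """Serialize rows as CSV text (hypothetical layout: id, source, text)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "source", "text"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    raw = ["हा चित्रपट खूप छान आहे! http://x.co", "हा  चित्रपट खूप छान आहे", "lol 123"]
    print(to_csv(build_rows(raw, "social")), end="")
```

When saving the result to disk, opening the file with `encoding="utf-8"` keeps the Devanagari text intact across tools. A real pipeline would likely also need sentiment labels and a policy for mixed Marathi and English text, neither of which is covered here.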
