AUTOSTRADA的情感分析。INFO/RU“用户评论”

Q3 Mathematics

SPIIRAS Proceedings Pub Date : 2019-04-12 DOI:10.15622/SP.18.2.354-389

Y. Seliverstov, Viktoriya Chigur, Arseniy Sazanov, S. Seliverstov, A. Svistunova

{"title":"AUTOSTRADA的情感分析。INFO/RU“用户评论”","authors":"Y. Seliverstov, Viktoriya Chigur, Arseniy Sazanov, S. Seliverstov, A. Svistunova","doi":"10.15622/SP.18.2.354-389","DOIUrl":null,"url":null,"abstract":"As a result of the analysis, it was revealed that social networks (Vkontakte, Facebook), thematic communities in microblogging networks (Twitter), resources for travelers (TripAdvisor), transport portals (Autostrada) are a source of up-to-date and operational information about the traffic situation, the quality of transport services and passenger satisfaction with the quality of levels of transport services. However, the existing transport monitoring systems do not contain software tools capable of collecting and analyzing traffic information located in the Internet environment. This paper discusses the task of building a system for automatically retrieving and classifying road traffic information from transport Internet portals and testing the developed system for analyzing the transport networks of Crimea and the city of Sevastopol. To solve this problem, an analysis of open source libraries for thematic data collection and analysis was carried out. An algorithm for extracting and analyzing texts has been developed. A crawler was developed using the Scrapy package in Python3, and user feedback from the portal http://autostrada.info/ru was collected on the state of the transport system of Crimea and the city of Sevastopol. For texts lemmatization and vector text transformation, the tf, idf, tf-idf methods and their implementation in the Scikit-Learn library were considered: CountVectorizer and TF-IDF Vectorizer. For word processing, Bag-of-Words and n-gram methods were considered. During the development of the classifier model, the naive Bayes algorithm (MultinomialNB) and the linear classifier model with optimization of the stochastic gradient descent (SGDClassifier) were used. As a training sample, a corpus of 225,000 labeled texts from the Twitter resource was used. The classifier was trained, during which the cross-validation strategy and the ShuffleSplit method were used. Testing and comparison of the results of the pitch classification were carried out. According to the results of validation, the linear model with the n-gram scheme [1, 3] and the vectorizer TF-IDF turned out to be the best. During the approbation of the developed system, the collection and analysis of reviews related to the quality of transport networks of the Republic of Crimea and the city of Sevastopol were conducted. Conclusions are drawn and prospects for further functional development of the developed tools are defined.","PeriodicalId":53447,"journal":{"name":"SPIIRAS Proceedings","volume":"24 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2019-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Sentiment Analysis of \\\"AUTOSTRADA.INFO/RU\\\" Users’ Comments\",\"authors\":\"Y. Seliverstov, Viktoriya Chigur, Arseniy Sazanov, S. Seliverstov, A. Svistunova\",\"doi\":\"10.15622/SP.18.2.354-389\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As a result of the analysis, it was revealed that social networks (Vkontakte, Facebook), thematic communities in microblogging networks (Twitter), resources for travelers (TripAdvisor), transport portals (Autostrada) are a source of up-to-date and operational information about the traffic situation, the quality of transport services and passenger satisfaction with the quality of levels of transport services. However, the existing transport monitoring systems do not contain software tools capable of collecting and analyzing traffic information located in the Internet environment. This paper discusses the task of building a system for automatically retrieving and classifying road traffic information from transport Internet portals and testing the developed system for analyzing the transport networks of Crimea and the city of Sevastopol. To solve this problem, an analysis of open source libraries for thematic data collection and analysis was carried out. An algorithm for extracting and analyzing texts has been developed. A crawler was developed using the Scrapy package in Python3, and user feedback from the portal http://autostrada.info/ru was collected on the state of the transport system of Crimea and the city of Sevastopol. For texts lemmatization and vector text transformation, the tf, idf, tf-idf methods and their implementation in the Scikit-Learn library were considered: CountVectorizer and TF-IDF Vectorizer. For word processing, Bag-of-Words and n-gram methods were considered. During the development of the classifier model, the naive Bayes algorithm (MultinomialNB) and the linear classifier model with optimization of the stochastic gradient descent (SGDClassifier) were used. As a training sample, a corpus of 225,000 labeled texts from the Twitter resource was used. The classifier was trained, during which the cross-validation strategy and the ShuffleSplit method were used. Testing and comparison of the results of the pitch classification were carried out. According to the results of validation, the linear model with the n-gram scheme [1, 3] and the vectorizer TF-IDF turned out to be the best. During the approbation of the developed system, the collection and analysis of reviews related to the quality of transport networks of the Republic of Crimea and the city of Sevastopol were conducted. Conclusions are drawn and prospects for further functional development of the developed tools are defined.\",\"PeriodicalId\":53447,\"journal\":{\"name\":\"SPIIRAS Proceedings\",\"volume\":\"24 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-04-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SPIIRAS Proceedings\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.15622/SP.18.2.354-389\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Mathematics\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SPIIRAS Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15622/SP.18.2.354-389","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Mathematics","Score":null,"Total":0}

引用次数: 4

摘要

分析结果显示，社交网络(Vkontakte、Facebook)、微博网络中的主题社区(Twitter)、旅行者资源(TripAdvisor)、交通门户网站(Autostrada)是交通状况、交通服务质量和乘客对交通服务质量水平满意度的最新和运营信息来源。然而，现有的交通监控系统不包含能够收集和分析位于互联网环境中的交通信息的软件工具。本文讨论了建立一个自动检索和分类来自交通互联网门户的道路交通信息系统的任务，并测试开发的系统分析克里米亚和塞瓦斯托波尔市的交通网络。为解决这一问题，对开源库进行专题数据收集和分析。本文提出了一种文本提取和分析算法。使用Python3中的Scrapy包开发了一个爬虫，并从门户网站http://autostrada.info/ru收集了有关克里米亚和塞瓦斯托波尔市交通系统状况的用户反馈。对于文本词序化和矢量文本转换，考虑了tf, idf, tf-idf方法及其在Scikit-Learn库中的实现:CountVectorizer和tf-idf Vectorizer。对于文字处理，我们考虑了Bag-of-Words和n-gram方法。在分类器模型的开发过程中，使用了朴素贝叶斯算法(MultinomialNB)和随机梯度下降优化的线性分类器模型(SGDClassifier)。作为训练样本，使用了来自Twitter资源的225,000个标记文本的语料库。对分类器进行训练，训练过程中使用交叉验证策略和ShuffleSplit方法。对音高分类结果进行了测试和比较。根据验证结果，采用n-gram方案的线性模型[1,3]和矢量器TF-IDF是最好的。在批准开发的系统期间，收集和分析了与克里米亚共和国和塞瓦斯托波尔市运输网络质量有关的审查。最后给出了结论，并对所开发工具的进一步功能开发进行了展望。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Sentiment Analysis of "AUTOSTRADA.INFO/RU" Users’ Comments

As a result of the analysis, it was revealed that social networks (Vkontakte, Facebook), thematic communities in microblogging networks (Twitter), resources for travelers (TripAdvisor), transport portals (Autostrada) are a source of up-to-date and operational information about the traffic situation, the quality of transport services and passenger satisfaction with the quality of levels of transport services. However, the existing transport monitoring systems do not contain software tools capable of collecting and analyzing traffic information located in the Internet environment. This paper discusses the task of building a system for automatically retrieving and classifying road traffic information from transport Internet portals and testing the developed system for analyzing the transport networks of Crimea and the city of Sevastopol. To solve this problem, an analysis of open source libraries for thematic data collection and analysis was carried out. An algorithm for extracting and analyzing texts has been developed. A crawler was developed using the Scrapy package in Python3, and user feedback from the portal http://autostrada.info/ru was collected on the state of the transport system of Crimea and the city of Sevastopol. For texts lemmatization and vector text transformation, the tf, idf, tf-idf methods and their implementation in the Scikit-Learn library were considered: CountVectorizer and TF-IDF Vectorizer. For word processing, Bag-of-Words and n-gram methods were considered. During the development of the classifier model, the naive Bayes algorithm (MultinomialNB) and the linear classifier model with optimization of the stochastic gradient descent (SGDClassifier) were used. As a training sample, a corpus of 225,000 labeled texts from the Twitter resource was used. The classifier was trained, during which the cross-validation strategy and the ShuffleSplit method were used. Testing and comparison of the results of the pitch classification were carried out. According to the results of validation, the linear model with the n-gram scheme [1, 3] and the vectorizer TF-IDF turned out to be the best. During the approbation of the developed system, the collection and analysis of reviews related to the quality of transport networks of the Republic of Crimea and the city of Sevastopol were conducted. Conclusions are drawn and prospects for further functional development of the developed tools are defined.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

SPIIRAS Proceedings Mathematics-Applied Mathematics

CiteScore

1.90

自引率

0.00%

发文量

审稿时长

14 weeks

期刊介绍： The SPIIRAS Proceedings journal publishes scientific, scientific-educational, scientific-popular papers relating to computer science, automation, applied mathematics, interdisciplinary research, as well as information technology, the theoretical foundations of computer science (such as mathematical and related to other scientific disciplines), information security and information protection, decision making and artificial intelligence, mathematical modeling, informatization.