Scenic area data analysis based on NLP and ridge regression
Chen Liu
2021 IEEE International Conference on Electronic Technology, Communication and Information (ICETCI), published 2021-08-27
DOI: 10.1109/ICETCI53161.2021.9563582 (https://doi.org/10.1109/ICETCI53161.2021.9563582)
With the rapid development of Internet technology, a large volume of textual reviews of tourist destinations has accumulated online. Mining these reviews with NLP can help improve tourist satisfaction and supports, over the long term, the scientific supervision of tourism enterprises and the optimal allocation of resources. This paper uses Python to preprocess the review data: de-duplication, removal of English text, conversion of traditional Chinese to simplified Chinese, text correction, and compression to remove redundant words. The reviews are divided into five aspects: service, location, facilities, hygiene, and cost-performance. The PaddleHub library is used to compute sentiment scores for all reviews of each scenic spot and hotel in each of the five aspects, and then the percentages of positive, neutral, and negative reviews. Ridge regression with k-fold cross-validation is then used to build a comprehensive evaluation model that yields the total score of each scenic spot and hotel across the five aspects; the model is verified with MSE, RMSE, and MAE. Furthermore, a method for extracting characteristic words for scenic spots and hotels is proposed: first, topic words are mined with LDA; next, the top 50 words are selected through keyword extraction, noun selection, filtering of irrelevant words, and merging of synonyms; finally, the two sets of words are combined to obtain the characteristic words. Lastly, the scenic spots and hotels are divided by total score into three levels (high, medium, and low), and three groups of same-type scenic spots and hotels are selected, each group containing three scenic spots or hotels of different levels. Using the characteristic words and the total scores across the five aspects, the selected three groups are compared and analyzed to produce suggestions.
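The evaluation step (ridge regression, k-fold cross-validation, and verification with MSE, RMSE, and MAE) can be illustrated with a minimal pure-Python sketch. The abstract does not state the regression target, the regularization strength, the number of folds, or the library used, so everything below is an assumption: the function names (`fit_ridge`, `k_fold_ridge`, `error_metrics`) are hypothetical, the five features stand in for per-aspect sentiment scores, and the data are synthetic, not the paper's.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(X, y, alpha):
    """Minimize ||y - Xw - b||^2 + alpha * ||w||^2; the intercept b is not penalized."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations on centered data: (Xc^T Xc + alpha * I) w = Xc^T yc
    gram = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve_linear_system(gram, rhs)
    b = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, b


def predict(X, w, b):
    return [sum(wj * xj for wj, xj in zip(w, row)) + b for row in X]


def error_metrics(y_true, y_pred):
    """Return MSE, RMSE, and MAE for paired true and predicted values."""
    errs = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errs) / len(errs)
    mae = sum(abs(e) for e in errs) / len(errs)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def k_fold_ridge(X, y, alpha=1.0, k=5, seed=0):
    """Pool out-of-fold predictions over k folds and score them."""
    if not 2 <= k <= len(X):
        raise ValueError("k must be between 2 and the number of samples")
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    y_true, y_pred = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w, b = fit_ridge([X[i] for i in train], [y[i] for i in train], alpha)
        y_pred += predict([X[i] for i in held_out], w, b)
        y_true += [y[i] for i in held_out]
    return error_metrics(y_true, y_pred)


if __name__ == "__main__":
    # Synthetic stand-in data: five aspect scores in [0, 1] per venue and a
    # made-up overall rating; none of these numbers come from the paper.
    rng = random.Random(42)
    assumed_weights = [0.3, 0.2, 0.2, 0.15, 0.15]
    X = [[rng.random() for _ in range(5)] for _ in range(60)]
    y = [5 * sum(w * x for w, x in zip(assumed_weights, row)) + rng.gauss(0, 0.05)
         for row in X]
    print(k_fold_ridge(X, y, alpha=0.1, k=5))
```

Scoring pooled out-of-fold predictions is one common way to report cross-validated error; averaging per-fold metrics is an equally reasonable alternative. In practice a library implementation such as scikit-learn's `Ridge` and `KFold` would likely replace the hand-written solver.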