Extracting Turkish tweet topics using LDA

2013 8th International Conference on Electrical and Electronics Engineering (ELECO) Pub Date : 2013-11-01 DOI:10.1109/ELECO.2013.6713899

Fahriye Gemci, K. A. Peker


Citations: 3

Abstract



Social media is a very popular medium for sharing people's activities, opinions, and feelings with others, and Twitter has become one of the most popular of these services. Finding relevant information in the large space of Twitter is a challenge. Various algorithms have been developed to find related tweets, and extracting tweet topics is one of the techniques that can be used for this purpose. Recently, LDA (Latent Dirichlet Allocation) has been successfully used to analyze tweet topics in English. However, LDA has not been tried on Turkish tweets. Turkish is an agglutinative language, which makes applying LDA a new challenge compared with English tweets. Our application uses a series of preprocessing steps, such as stemming, stop word elimination, and cleaning of punctuation and spelling errors, to make the tweet texts suitable for LDA analysis. Using the Twitter4j library, 6,000 tweets and 4,000 control tweets are crawled [2]. The Zemberek library is used for stemming [3]. The language used in tweets, with its own rules, widespread spelling errors, and made-up words, makes text analysis very difficult. Some of these problems are solved by adding new words to the Zemberek library, and some of the problematic words are removed completely. After preprocessing, LDA is performed and 40 topics are extracted. The initial results look promising: some significant topics such as football teams, celebrity names, and hot news topics are detected. Using these results, automatic recommendation of relevant tweets will be performed.
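The pipeline described in the abstract (clean the tweet text, remove stop words, stem, then fit LDA) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper used Twitter4j and Zemberek (Java libraries), whereas this Python sketch substitutes a hypothetical five-entry stop-word list and a placeholder suffix stripper for real Turkish stemming. The abstract does not say which inference method was used for LDA; a collapsed Gibbs sampler is shown here as one common choice, and the hyperparameters `alpha`, `beta`, and `iterations` are illustrative guesses.

```python
import random
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"ve", "bir", "bu", "için", "ile"}


def naive_stem(token):
    """Placeholder stemmer: strips the plural suffixes -ler/-lar only.

    This is NOT Zemberek; real Turkish stemming needs morphological analysis.
    """
    for suffix in ("ler", "lar"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Lowercase, drop URLs/mentions/punctuation/digits, remove stop words, stem."""
    # Turkish-aware lowercasing: dotted and dotless i are distinct letters.
    text = tweet.replace("İ", "i").replace("I", "ı").lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove @mentions
    tokens = re.findall(r"[^\W\d_]+", text)     # keep letter-only tokens
    return [naive_stem(t) for t in tokens if len(t) > 1 and t not in STOP_WORDS]


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling; returns count matrices."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return vocab, topic_word, doc_topic


def top_words(vocab, topic_word, n=5):
    """Return the n highest-count words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]
```

To mirror the setup in the abstract, one would call `lda_gibbs([preprocess(t) for t in tweets], num_topics=40)` on the crawled tweets and inspect `top_words(...)` for each topic. On a corpus of several thousand tweets, a pure-Python sampler like this is slow, so an optimized library implementation would be the practical choice.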
