{"title":"Toward better keywords extraction","authors":"Shihua Xu, Fang Kong","doi":"10.1109/IALP.2015.7451561","DOIUrl":"https://doi.org/10.1109/IALP.2015.7451561","url":null,"abstract":"Automatic keyword extraction is the task to identify a small set of keywords from a given document that can describe the meaning of the document. It plays an important role in information retrieval. In this paper, a clustering-based approach to do this task is proposed. And the impacts of keyword length, the window size of centroid on the performance of AKE system are discussed. Then by introducing keyword length constraint and extending the number of centroid of every cluster, the performance of our AKE system is improved by 7.5% in F-score.","PeriodicalId":256927,"journal":{"name":"2015 International Conference on Asian Language Processing (IALP)","volume":"256 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132044345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mongolian Named Entity Recognition using suffixes segmentation","authors":"Weihua Wang, F. Bao, Guanglai Gao","doi":"10.1109/IALP.2015.7451558","DOIUrl":"https://doi.org/10.1109/IALP.2015.7451558","url":null,"abstract":"Mongolian is an agglutinative language with the complex morphological structures. Building an accurate Named Entity Recognition (NER) system for Mongolian is a challenging and meaningful work. This paper analyzes the characteristic of Mongolian suffixes using Narrow Non-Break Space and investigates Mongolian NER system under three methods in the Condition Random Field framework. The experiment shows that segmenting each suffix into an individual token achieves the best performance than both without segmenting and using the suffixes as a feature. Our approach obtains an F-measure = 82.71. It is appropriate for the Mongolian large scale vocabulary NER. This research also makes sense to other agglutinative languages NER systems.","PeriodicalId":256927,"journal":{"name":"2015 International Conference on Asian Language Processing (IALP)","volume":"26 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120906958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weighted Document Frequency for feature selection in text classification","authors":"Baoli Li, Q. Yan, Zhenqiang Xu, Guicai Wang","doi":"10.1109/IALP.2015.7451549","DOIUrl":"https://doi.org/10.1109/IALP.2015.7451549","url":null,"abstract":"In the past research, Document Frequency (DF) has been validated to be a simple yet quite effective measure for feature selection in text classification. The calculation is based on how many documents in a collection contain a feature, which can be a word, a phrase, a n-gram, or a specially derived attribute. The counting process takes a binary strategy: if a feature appears in a document, its DF will be increased by one. This traditional DF metric concerns only about whether a feature appears in a document, but does not consider how important the feature is in that document. Obviously, thus counted document frequency is very likely to introduce much noise. Therefore, a weighted document frequency (WDF) is proposed and expected to reduce such noise to some extent. Extensive experiments on two text classification datasets demonstrate the effectiveness of the proposed measure.","PeriodicalId":256927,"journal":{"name":"2015 International Conference on Asian Language Processing (IALP)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121975415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Construction of Japanese semantically compatible words resource","authors":"Kazuhide Yamamoto, Kanji Takahashi","doi":"10.1109/IALP.2015.7451532","DOIUrl":"https://doi.org/10.1109/IALP.2015.7451532","url":null,"abstract":"We have constructed a Japanese semantically compatible resource and attached it to a dictionary used in our language analyzer, which segments text into words. We expect that semantically compatible words solve the data sparseness problem of corpus-based Natural Language Processing. By grouping compatible words together, the amount of words to process can be much reduced. In this study, we define hyponymy-and-hypernymy relation groups and synonym groups as semantically compatible words. The semantically compatible resource contains 343 concepts as hyponymy-and-hypernymy relation groups and 21,784 concepts as synonymy groups. We can obtain semantically compatible words from a Japanese word analyzer, SNOWMAN. The constructed resource will be available to the public.","PeriodicalId":256927,"journal":{"name":"2015 International Conference on Asian Language Processing (IALP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129682752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards improving the performance of Vector Space Model for Chinese Frequently Asked Question Answering","authors":"Ridong Jiang, Seokhwan Kim, Rafael E. Banchs, Haizhou Li","doi":"10.1109/IALP.2015.7451550","DOIUrl":"https://doi.org/10.1109/IALP.2015.7451550","url":null,"abstract":"This paper presents a method which improves the performance of Vector Space Model (VSM) when applying it to Chinese Frequently Asked Questions (FAQ). This method combines unigram and bigram models in determining the similarity of document vectors. The performance is further improved by applying shallow lexical semantics and the document length information. Experiments showed that the proposed methods outperform baselines (segmentation and bigram) across different datasets which include FAQs from restricted domains and open domains.","PeriodicalId":256927,"journal":{"name":"2015 International Conference on Asian Language Processing (IALP)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116865583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparative study on collectives of term weighting methods for extractive presentation speech summarization","authors":"Jian Zhang, Huaqiang Yuan","doi":"10.1109/IALP.2015.7451553","DOIUrl":"https://doi.org/10.1109/IALP.2015.7451553","url":null,"abstract":"This paper presents a comparative study of collectives of term weighting methods for extractive speech summarization of Mandarin Presentation Speech. The summarization process can be considered as a binary classification process. The collectives of different term weighting methods can provide better summarization performance than each of them with the same classification algorithm. Several different unsupervised and supervised term weighting methods and their collectives were evaluated with summarizer based on support vector machine (SVM) classifier. The majority vote strategy is used for handling the collectives. We show that the best result is provided with the vote of the collective of all term weighting methods. We also show that Term Relevance Ratio (TRR) gives more contribution for presentation speech summarization than other term weighting methods.","PeriodicalId":256927,"journal":{"name":"2015 International Conference on Asian Language Processing (IALP)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127741359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Japanese sentence compression using Simple English Wikipedia","authors":"Shunsuke Takeno, Kazuhide Yamamoto","doi":"10.1109/IALP.2015.7451533","DOIUrl":"https://doi.org/10.1109/IALP.2015.7451533","url":null,"abstract":"We describe a cross-lingual approach for sentence compression of articles of Japanese Wikipedia using the correspondence of articles of Simple English Wikipedia. Taking advantages of the nature of the corpus, we can find essential parts from encyclopedic description without highly depending on the statistical information which are noisy. We manually explored the correspondences between the articles of Japanese Wikipedia and those of Simple English Wikipedia and then proposed a cross-lingual alignment method using simple matching algorithm. We provide an analysis of the abovementioned correspondence and the preliminary result of sentence compression using Simple English Wikipedia.","PeriodicalId":256927,"journal":{"name":"2015 International Conference on Asian Language Processing (IALP)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115236853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rumor diffusion purpose analysis from social attribute to social content","authors":"Dazhen Lin, Yanping Lv, Donglin Cao","doi":"10.1109/IALP.2015.7451543","DOIUrl":"https://doi.org/10.1109/IALP.2015.7451543","url":null,"abstract":"Rumor is one of the important issues for social media. Previous works mainly focus on using social attribute features in rumor analysis. However, social attribute features don't indicate the purpose of a rumor which is one of the most important aspects of a rumor. To solve that problem, we focus on not only those social attribute features, but also social content features to find out what kind of features are useful for exploring the purpose of a rumor. Finally, we propose 6 kinds of features, where four of them belong to social attribute features and two of them belong to social content features. To uncover the purpose of rumors from proposed features, we choose Sina weibo, the biggest micro-blog platform in China, and crawl 11,676 rumors for analysis. The analysis results show that the diffusion purpose of rumors can be concluded from social content attributes, and proposed two layers KL divergence approach is useful in diffusion purpose words perception.","PeriodicalId":256927,"journal":{"name":"2015 International Conference on Asian Language Processing (IALP)","volume":"105 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114048755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topic2Vec: Learning distributed representations of topics","authors":"Liqiang Niu, Xinyu Dai, Jianbing Zhang, Jiajun Chen","doi":"10.1109/IALP.2015.7451564","DOIUrl":"https://doi.org/10.1109/IALP.2015.7451564","url":null,"abstract":"Latent Dirichlet Allocation (LDA) mining thematic structure of documents plays an important role in nature language processing and machine learning areas. However, the probability distribution from LDA only describes the statistical relationship of occurrences in the corpus and usually in practice, probability is not the best choice for feature representations. Recently, embedding methods have been proposed to represent words and documents by learning essential concepts and representations, such as Word2Vec and Doc2Vec. The embedded representations have shown more effectiveness than LDA-style representations in many tasks. In this paper, we propose the Topic2Vec approach which can learn topic representations in the same semantic vector space with words, as an alternative to probability distribution. The experimental results show that Topic2Vec achieves interesting and meaningful results.","PeriodicalId":256927,"journal":{"name":"2015 International Conference on Asian Language Processing (IALP)","volume":"202 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131994193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}