Zhitao Zhang, Muyun Yang, Sheng Li, Haoliang Qi, Chao Song
{"title":"Sogou Query Log Analysis: A Case Study for Collaborative Recommendation or Personalized IR","authors":"Zhitao Zhang, Muyun Yang, Sheng Li, Haoliang Qi, Chao Song","doi":"10.1109/IALP.2009.72","DOIUrl":"https://doi.org/10.1109/IALP.2009.72","url":null,"abstract":"Through analyzing the search engine logs, we can better understand the law of users’ search behavior, mining users’ personality, so that improving the performances of web information retrieval. This paper analyzes the user, query, clickthrough data of Sogou, a large-scale Chinese search engine. We focus on the relation of user, query and URL, revealing some new characteristic of the Web user. The result shows that the portal websites are visited most frequently. The average user of Sogou clicks 4.82 URL, including 1.72 distinct URL. This paper demonstrates the necessity of personalized information retrieval, which is enlightening for improving the performance of Chinese search engine.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130180507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yong Han, Muyun Yang, Haoliang Qi, Xiaoning He, Sheng Li
{"title":"The Improved Logistic Regression Models for Spam Filtering","authors":"Yong Han, Muyun Yang, Haoliang Qi, Xiaoning He, Sheng Li","doi":"10.1109/IALP.2009.74","DOIUrl":"https://doi.org/10.1109/IALP.2009.74","url":null,"abstract":"The logistic regression model has achieved success in spam filtering. But it is disadvantaged by the equal adjustment of the feature weights appeared in both spam messages and ham ones during training period. This paper presents an improved logistic regression model which reduces the impact of the features appearing in both spam messages and ham ones. Byte level n-grams are employed to extract the features from messages, and TONE (Train On or Near Error) is adopted, which are proved effective in state-of-the-art spam filtering system. The official runs of CEAS (Conference on Email and Anti-Spam) Spam-filter Challenge 2008 show that the proposed model is one of the best methods. Our system achieved competitive results in all tasks and is the winner of active learning on the live stream by 1- ROCA.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128849859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Boriboon, Kanyanut Kriengket, P. Chootrakool, Sitthaa Phaholphinyo, Sumonmas Purodakananda, T. Thanakulwarapas, K. Kosawat
{"title":"BEST Corpus Development and Analysis","authors":"M. Boriboon, Kanyanut Kriengket, P. Chootrakool, Sitthaa Phaholphinyo, Sumonmas Purodakananda, T. Thanakulwarapas, K. Kosawat","doi":"10.1109/IALP.2009.76","DOIUrl":"https://doi.org/10.1109/IALP.2009.76","url":null,"abstract":"This document describes the development process of the BEST 2009 word segmented-corpus. It is the first corpus to benchmark Thai word segmentation software. The corpus is composed of four genres, namely, collection of news, novels, encyclopedia, and academic articles. It contains 509 files. Its length is 64.1 MB. There are 5,036,229 tokens with 83,027 unique tokens. Common tokens appearing in all genres are 4,556 tokens. They covered 85.13% of the corpus. The highest frequency token in the corpus is ที่ /thi2/. The first 50 frequency tokens cover 37.65% of the corpus. About 50% of the corpus compose of the first 119 high frequency tokens. All tokens are grouped into 8 categories. Except for Thai spelling category, the other categories play different major parts in specific genres.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126962311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges in Developing Persian Corpora from Online Resources","authors":"Masood Ghayoomi, S. Momtazi","doi":"10.1109/IALP.2009.31","DOIUrl":"https://doi.org/10.1109/IALP.2009.31","url":null,"abstract":"Persian is one of the Indo-European languages which has borrowed its script from Arabic, a member of Semitic language family. Since Persian and Arabic scripts are so similar, problems arise when we want to process an electronic text. In this paper, some of the common problems faced experimentally in developing a corpus for Persian from on-line materials are discussed. The sources of the problems are the Persian script itself; mixture with the Arabic script; Persian orthography; the typists’ typing styles; and mixing Persian code pages with Arabic code pages in operating systems.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127847605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samar Husain, Phani Gadde, Bharat Ram Ambati, D. Sharma, R. Sangal
{"title":"A Modular Cascaded Approach to Complete Parsing","authors":"Samar Husain, Phani Gadde, Bharat Ram Ambati, D. Sharma, R. Sangal","doi":"10.1109/IALP.2009.37","DOIUrl":"https://doi.org/10.1109/IALP.2009.37","url":null,"abstract":"In this paper, we propose a modular cascaded approach to data driven dependency parsing. Each module or layer leading to the complete parse produces a linguistically valid partial parse. We do this by introducing an artificial root node in the dependency structure of a sentence and by catering to distinct dependency label sets that reflect the function of the set internal labels vis-à-vis a distinct and identifiable linguistic unit, at different layers. The linguistic unit in our approach is a clause. Output (partial parse) from each layer can be accessed independently. We applied this approach to Hindi, a morphologically rich free word order language using MST Parser. We did all our experiments on a part of Hyderabad Dependency Treebank. The final results show an increase of 1.35% in unlabeled attachment and 1.36% in labeled attachment accuracies over state-of-the-art data driven Hindi parser.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127740645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilingual Multimodal Integration of Sketch and Speech: A Generic Speech Representation Model for Spatial Description","authors":"L. Teh, A. Yeo","doi":"10.1109/IALP.2009.13","DOIUrl":"https://doi.org/10.1109/IALP.2009.13","url":null,"abstract":"This paper details how multiple languages are accommodated in the multimodal integration of sketch and speech, specifically, in spatial applications. The study encompasses English, Malay, Mandarin, and two under-resourced languages in Malaysia, i.e. Melanau and Iban. The preliminary study revealed that not all spatial terms (prepositions) appear in all languages. Based on these findings, we propose a method to assist in the design and development of multilingual multimodal applications. This method employs a generic representation model for spatial description.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"609 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127604802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the Effects of Text Clustering on On-Line Military News Based on Quantitative Association Rule","authors":"Liang-Chu Chen, Chyi-Bao Yang, Jih-Hsin Chen, Yen-Hsuan Lien","doi":"10.1109/IALP.2009.48","DOIUrl":"https://doi.org/10.1109/IALP.2009.48","url":null,"abstract":"Text clustering is an automatic technique to group texts using the approach of feature extraction and term connection to calculate the similarities among subject contents of texts. Since the properties of terms in Chinese text (e.g. segmentation and annotation) are not as clear as the other languages, extracting and distinguishing features from Chinese text is therefore much more difficult, which greatly impacts the effects of clustering. From the perspective of military news, this paper applies both quantitative association rule and hierarchical agglomerative algorithm to cluster Chinese news published in Youth Daily News, and the application results are compared with those by the traditional vector space model approach and by the general association rule approach, respectively. F-measure is used as evaluation metric in the experiments. Experimental results show that the quantitative association rule approach performs more accurately than both the vector space model and association rule in text automatic clustering.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133909795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic Genes and the Semantic Composition of Adjectives in Modern Chinese","authors":"Dan Hu, Jinglian Gao","doi":"10.1109/IALP.2009.61","DOIUrl":"https://doi.org/10.1109/IALP.2009.61","url":null,"abstract":"Words cluster semantically together by commonly sharing semantic genes. By inheritance, recombination and variation of semantic genes, new words are produced. The semantics of adjective is composed of core semantic genes and attribute semantic genes. With these genes and the semantic composition formula, we can construct a semantic knowledge-base of adjectives accurately for NLP.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122859959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vietnamese Final Stop Consonants /p, t, k/ Described in Terms of Formant Transition Slopes","authors":"Viet Son Nguyen, E. Castelli, R. Carré","doi":"10.1109/IALP.2009.27","DOIUrl":"https://doi.org/10.1109/IALP.2009.27","url":null,"abstract":"It is well known that bursts and voiced formant transitions serve as separate cues to the place of articulation of initial stop consonants. The Vietnamese presents three final voiceless stop consonants /p, t, k/ without bursts. It is an opportunity to study these final stop consonants and to compare their characteristics with those of the corresponding initial stop consonants. As final consonants were never studied before, this paper analyses the vowel-consonant (VC) and consonant-vowel-consonant (CVC) productions in terms of the transition duration, the starting formant transition values and the slopes of the VC transitions. Measurements have shown that in the same preceding vowel contexts, the three final stop consonants /p, t, k/ are always clearly different by at least one of the three slopes of F1, F2, and F3. These final stop consonants can also be differentiated in the locus equation space. The results also pointed out the effects of the final consonants on either long vowels or short vowels. This explains why Vietnamese could not pronounce the short vowels in isolation.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125792250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Repetition in Mandarin Interaction: A Case Study on TV Shopping Channels in Taiwan","authors":"Fuhui Hsieh, Ying Liang","doi":"10.1109/IALP.2009.18","DOIUrl":"https://doi.org/10.1109/IALP.2009.18","url":null,"abstract":"Repetition is a pervasive type of spontaneous prepatterning in conversation. From an evolutionary perspective, repetition or imitation is a safe way to secure oneself from stepping into any danger caused by uncertainty. By repeating or imitating the behavior of other group members, one may survive in many situations. From a learning or pedagogical perspective, repetition or imitation is a fast way to acquire a skill or a language, including the lexicon and the structures. The main purpose of this paper is to investigate this significantly pervasive yet somewhat neglected phenomenon in Mandarin discourse. In this study, we seek to examine repetitions in social interactions on TV shopping channels in Taiwan. It is hoped that such a study may contribute to natural language processing and information processing by providing a detailed analysis of the patterns and functions of repetition in social interactions.","PeriodicalId":156840,"journal":{"name":"2009 International Conference on Asian Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116110062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}