{"title":"Chinese Main Verb Identification: From Specification to Realization","authors":"Binggong Ding, C. Huang, Degen Huang","doi":"10.30019/IJCLCLP.200503.0004","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200503.0004","url":null,"abstract":"Main verb identification is the task of automatically identifying the predicate-verb in a sentence. It is useful for many applications in Chinese Natural Language Processing. Although most studies have focused on the model used to identify the main verb, the definition of the main verb should not be overlooked. In our specification design, we have found many complicated issues that still need to be resolved since they haven't been well discussed in previous works. Thus, the first novel aspect of our work is that we carefully design a specification for annotating the main verb and investigate various complicated cases. We hope this discussion will help to uncover the difficulties involved in this problem. Secondly, we present an approach to realizing main verb identification based on the use of chunk information, which leads to better results than the approach based on part-of-speech. Finally, based on careful observation of the studied corpus, we propose new local and contextual features for main verb identification. According to our specification, we annotate a corpus and then use a Support Vector Machine (SVM) to integrate all the features we propose. Our model, which was trained on our annotated corpus, achieved a promising F score of 92.8%. Furthermore, we show that main verb identification can improve the performance of the Chinese Sentence Breaker, one of the applications of main verb identification, by 2.4%.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130912052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Similarity Based Chinese Synonym Collocation Extraction","authors":"Wanyin Li, Q. Lu, Ruifeng Xu","doi":"10.30019/IJCLCLP.200503.0006","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200503.0006","url":null,"abstract":"Collocation extraction systems based on pure statistical methods suffer from two major problems. The first problem is their relatively low precision and recall rates. The second problem is their difficulty in dealing with sparse collocations. In order to improve performance, both statistical and lexicographic approaches should be considered. This paper presents a new method to extract synonymous collocations using semantic information. The semantic information is obtained by calculating similarities from HowNet. We have successfully extracted synonymous collocations which normally cannot be extracted using lexical statistics. Our evaluation conducted on a 60MB tagged corpus shows that we can extract synonymous collocations that occur with very low frequency and that the improvement in the recall rate is close to 100%. In addition, compared with a collocation extraction system based on the Xtract system for English, our algorithm can improve the precision rate by about 44%.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2005-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131436098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Emotions in Mandarin Speech","authors":"T. Pao, Yu-Te Chen, Jun-Heng Yeh, Wen-Yuan Liao","doi":"10.30019/IJCLCLP.200509.0004","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200509.0004","url":null,"abstract":"The importance of automatically recognizing emotions in human speech has grown with the increasing role of spoken language interfaces in human-computer interaction applications. In this paper, a Mandarin speech based emotion classification method is presented. Five primary human emotions, including anger, boredom, happiness, neutral and sadness, are investigated. Combining different feature streams to obtain a more accurate result is a well-known statistical technique. For speech emotion recognition, we combined 16 LPC coefficients, 12 LPCC components, 16 LFPC components, 16 PLP coefficients, 20 MFCC components and jitter as the basic features to form the feature vector. Two corpora were employed. The recognizer presented in this paper is based on three classification techniques: LDA, K-NN and HMMs. Results show that the selected features are robust and effective for the emotion recognition in the valence and arousal dimensions of the two corpora. Using the HMMs emotion classification method, an average accuracy of 88.7% was achieved.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132346482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Alignment and Extraction of a Bilingual Ontology for Cross-Language Domain-Specific Applications","authors":"Jui-Feng Yeh, Chung-Hsien Wu, Ming-Jun Chen, Liang-Chih Yu","doi":"10.3115/1220355.1220519","DOIUrl":"https://doi.org/10.3115/1220355.1220519","url":null,"abstract":"In this paper we propose a novel approach for ontology alignment and domain ontology extraction from the existing knowledge bases, WordNet and HowNet. These two knowledge bases are aligned to construct a bilingual ontology based on the co-occurrence of the words in the sentence pairs of a parallel corpus. The bilingual ontology has the merit that it contains more structural and semantic information coverage from these two complementary knowledge bases. For domain-specific applications, the domain-specific ontology is further extracted from the bilingual ontology by the island-driven algorithm and the domain-specific corpus. Finally, the domain-dependent terminologies and some axioms between domain terminologies are integrated into the ontology. For ontology evaluation, experiments were conducted by comparing the benchmark constructed by the ontology engineers or experts. The experimental results show that the proposed approach can extract an aligned bilingual domain-specific ontology.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129906307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Constructing A Multilingual Speech Corpus for Taiwanese (Min-nan), Hakka, and Mandarin Chinese","authors":"Ren-Yuan Lyu, Min-Siong Liang, Yuang-Chin Chiang","doi":"10.30019/IJCLCLP.200408.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200408.0001","url":null,"abstract":"The Formosa speech database (ForSDat) is a multilingual speech corpus collected at Chang Gung University and sponsored by the National Science Council of Taiwan. It is expected that a multilingual speech corpus will be collected, covering the three most frequently used languages in Taiwan: Taiwanese (Min-nan), Hakka, and Mandarin. This 3-year project has the goal of collecting a phonetically abundant speech corpus of more than 1,800 speakers and hundreds of hours of speech. Recently, the first version of this corpus containing speech of 600 speakers of Taiwanese and Mandarin was finished and is ready to be released. It contains about 49 hours of speech and 247,000 utterances.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127234042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple-Translation Spotting for Mandarin-Taiwanese Speech-to-Speech Translation","authors":"Jhing-Fa Wang, Shun-Chieh Lin, Hsueh-Wei Yang, Fan-Min Li","doi":"10.30019/IJCLCLP.200408.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200408.0002","url":null,"abstract":"The critical issues involved in speech-to-speech translation are obtaining proper source segments and synthesizing accurate target speech. Therefore, this article develops a novel multiple-translation spotting method to deal with these issues efficiently. The term multiple-translation spotting refers to the task of extracting target-language synthesis patterns that correspond to a given set of source-language spotted patterns in conditional multiple pairs of speech patterns known to be translation patterns. According to the extracted synthesis patterns, the target speech can be properly synthesized by using a waveform segment concatenation-based synthesis method. Experiments were conducted with the languages of Mandarin and Taiwanese. The results reveal that the proposed approach can achieve translation understanding rates of 80% and 76% on average for Mandarin/Taiwanese translation and Taiwanese/Mandarin translation, respectively.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134486451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Properties and Further Applications of Chinese Frequent Strings","authors":"Yih-Jeng Lin, Ming-Shing Yu","doi":"10.30019/IJCLCLP.200402.0007","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200402.0007","url":null,"abstract":"This paper reveals some important properties of CFSs and applications in Chinese natural language processing (NLP). We have previously proposed a method for extracting Chinese frequent strings that contain unknown words from a Chinese corpus [Lin and Yu 2001]. We found that CFSs contain many 4-character strings, 3-word strings, and longer n-grams. Such information can only be derived from an extremely large corpus using a traditional language model (LM). In contrast to using a traditional LM, we can achieve high precision and efficiency by using CFSs to solve Chinese toneless phoneme-to-character conversion and to correct Chinese spelling errors with a small training corpus. An accuracy rate of 92.86% was achieved for Chinese toneless phoneme-to-character conversion, and an accuracy rate of 87.32% was achieved for Chinese spelling error correction. We also attempted to assign syntactic categories to a CFS. The accuracy rate for assigning syntactic categories to the CFSs was 88.53% for outside testing when the syntactic categories of the highest level were used.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126768406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mencius: A Chinese Named Entity Recognizer Using the Maximum Entropy-based Hybrid Model","authors":"Richard Tzong-Han Tsai, Shih-Hung Wu, Cheng-Wei Lee, Cheng-Wei Shih, W. Hsu","doi":"10.30019/IJCLCLP.200402.0004","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200402.0004","url":null,"abstract":"This paper presents a Chinese named entity recognizer (NER): Mencius. It aims to address Chinese NER problems by combining the advantages of rule-based and machine learning (ML) based NER systems. Rule-based NER systems can explicitly encode human comprehension and can be tuned conveniently, while ML-based systems are robust, portable and inexpensive to develop. Our hybrid system incorporates a rule-based knowledge representation and template-matching tool, called InfoMap [Wu et al. 2002], into a maximum entropy (ME) framework. Named entities are represented in InfoMap as templates, which serve as ME features in Mencius. These features are edited manually, and their weights are estimated by the ME framework according to the training data. To understand how word segmentation might influence Chinese NER and the differences between a pure template-based method and our hybrid method, we configure Mencius using four distinct settings. The F-Measures of person names (PER), location names (LOC) and organization names (ORG) of the best configuration in our experiment were respectively 94.3%, 77.8% and 75.3%. Comparing the experiment results obtained using these configurations reveals that hybrid NER systems always perform better in identifying person names. On the other hand, they have a little difficulty identifying location and organization names. Furthermore, using a word segmentation module improves the performance of pure template-based NER systems, but it has little effect on hybrid NER systems.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126639489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Pronominal Anaphora Resolution in English Texts","authors":"Tyne Liang, Dian-Song Wu","doi":"10.30019/IJCLCLP.200402.0002","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200402.0002","url":null,"abstract":"Anaphora is a common phenomenon in discourses as well as an important research issue in the applications of natural language processing. In this paper, anaphora resolution is achieved by employing WordNet ontology and heuristic rules. The proposed system identifies both intra-sentential and inter-sentential antecedents of anaphors. Information about animacy is obtained by analyzing the hierarchical relations of nouns and verbs in the surrounding context. The identification of animacy entities and pleonastic-it usage in English discourses are employed to promote resolution accuracy. Traditionally, anaphora resolution systems have relied on syntactic, semantic or pragmatic clues to identify the antecedent of an anaphor. Our proposed method makes use of WordNet ontology to identify animate entities as well as essential gender information. In the animacy agreement module, the property is identified by the hypernym relation between entities and their unique beginners defined in WordNet. In addition, the verb of the entity is also an important clue used to reduce the uncertainty. An experiment was conducted using a balanced corpus to resolve the pronominal anaphora phenomenon. The methods proposed in (Lappin and Leass, 94) and (Mitkov, 01) focus on the corpora with only inanimate pronouns such as \"it\" or \"its\". Thus the results of intra-sentential and inter-sentential anaphora distribution are different. In an experiment using Brown corpus, we found that the distribution proportion of intra-sentential anaphora is about 60%. Seven heuristic rules are applied in our system; five of them are preference rules, and two are constraint rules. They are derived from syntactic, semantic, pragmatic conventions and from the analysis of training data. A relative measurement indicates that about 30% of the errors can be eliminated by applying the heuristic module.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133617079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses","authors":"Chien-Cheng Wu, Jason J. S. Chang","doi":"10.30019/IJCLCLP.200402.0001","DOIUrl":"https://doi.org/10.30019/IJCLCLP.200402.0001","url":null,"abstract":"In this paper, we describe an algorithm that employs syntactic and statistical analysis to extract bilingual collocations from a parallel corpus. Collocations are pervasive in all types of writing and can be found in phrases, chunks, proper names, idioms, and terminology. Therefore, automatic extraction of monolingual and bilingual collocations is important for many applications, including natural language generation, word sense disambiguation, machine translation, lexicography, and cross language information retrieval. Collocations can be classified as lexical or grammatical collocations. Lexical collocations exist between content words, while a grammatical collocation exists between a content word and function words or a syntactic structure. In addition, bilingual collocations can be rigid or flexible in both languages. In a rigid collocation, the words must appear next to each other; otherwise, the collocation is flexible (elastic). We focus in this paper on extracting rigid lexical bilingual collocations. In our method, the preferred syntactic patterns are obtained from idioms and collocations in a machine-readable dictionary. Collocations matching the patterns are extracted from aligned sentences in a parallel corpus. We use a new alignment method based on punctuation statistics for sentence alignment. The punctuation-based approach is found to outperform the length-based approach with precision rates approaching 98%. The obtained collocations are subsequently matched up based on cross-linguistic statistical association. Statistical association between the whole collocations as well as words in collocations is used to link a collocation with its counterpart collocation in the other language. We implemented the proposed method on a very large Chinese-English parallel corpus and obtained satisfactory results.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115553036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}