MWUs Extraction Based on Continuous Measurement of Inter-word Association with Frequency Adjustment
Zhifei Wang, Yue Chen, Xiaoyu Jiang
2010 Second International Conference on Computer Research and Development, 2010-05-07
DOI: 10.1109/ICCRD.2010.140 (https://doi.org/10.1109/ICCRD.2010.140)
Citation count: 0
Abstract
Extracting Multi-Word Units (MWUs) from raw text is a significant problem in natural language processing because MWUs describe concepts more accurately than single words. Statistical methods such as Mutual Information, the Log-Likelihood Ratio, and the Chi-Squared test depend heavily on word frequency, because the component words of MWUs tend to co-occur more often and the main components of multi-word phrases are the core terms of the text document. These core terms generally have a very high frequency and strong word-building power, so their frequency is far higher than that of the other component words of MWUs, which reduces the accuracy of these methods. We propose a method to adjust the frequency of the core words. Experimental results show that the method significantly improves the recall of multi-word combinations while preserving precision.
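
To make the frequency dependence concrete, the following minimal sketch scores adjacent word pairs with pointwise mutual information (PMI), one common form of the Mutual Information measure named above. It is an illustration only and not the authors' method: the abstract does not specify the frequency adjustment, so none is implemented here. The function name `bigram_pmi` and the toy token list are assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)       # number of word positions
    n_bi = len(tokens) - 1    # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy corpus: "information" plays the role of a high-frequency
# core term that combines with several different words.
tokens = (
    "information retrieval system information extraction system "
    "information theory neural network neural network"
).split()

scores = bigram_pmi(tokens)
for pair in [("neural", "network"), ("information", "retrieval"),
             ("information", "theory")]:
    print(pair, round(scores[pair], 3))
```

In this toy corpus the pairs built on the frequent word "information" score lower than "neural network", because the high unigram frequency of the core term enlarges the denominator. This is the kind of effect the abstract attributes to high-frequency core terms.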