{"title":"基于用户日志聚类的中文查询扩展","authors":"Shufang Jia, Lei Li","doi":"10.1109/ICNIDC.2009.5360836","DOIUrl":null,"url":null,"abstract":"Most previous query expansion researches are based on pseudo relevant documents. In this study, we present a novel expansion method by clustering the real user log. Because not all of the clicked pages are suitable for query expansion, we de-noised the clicked results by reliability to enhance the performance. After HTML labels removing, the page body contents are clustered and the cluster centers cover various aspects of the original query. The terms used in log queries can provide a better choice of features, from the user's point of view, for summarizing the web pages that were clicked from these queries. Therefore, the associated queries, reverse queries, webpage title and keyword phrases are combined with the cluster centers to attain high-quality expansion terms for new queries. We also propose a new terminology extraction method through Baidu Baike. It can identify and extract the terminology phrase based on the manual edited dictionary online.","PeriodicalId":127306,"journal":{"name":"2009 IEEE International Conference on Network Infrastructure and Digital Content","volume":"35 7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Chinese query expansion based on user log clustering\",\"authors\":\"Shufang Jia, Lei Li\",\"doi\":\"10.1109/ICNIDC.2009.5360836\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Most previous query expansion researches are based on pseudo relevant documents. In this study, we present a novel expansion method by clustering the real user log. Because not all of the clicked pages are suitable for query expansion, we de-noised the clicked results by reliability to enhance the performance. After HTML labels removing, the page body contents are clustered and the cluster centers cover various aspects of the original query. The terms used in log queries can provide a better choice of features, from the user's point of view, for summarizing the web pages that were clicked from these queries. Therefore, the associated queries, reverse queries, webpage title and keyword phrases are combined with the cluster centers to attain high-quality expansion terms for new queries. We also propose a new terminology extraction method through Baidu Baike. It can identify and extract the terminology phrase based on the manual edited dictionary online.\",\"PeriodicalId\":127306,\"journal\":{\"name\":\"2009 IEEE International Conference on Network Infrastructure and Digital Content\",\"volume\":\"35 7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-12-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 IEEE International Conference on Network Infrastructure and Digital Content\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICNIDC.2009.5360836\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 IEEE International Conference on Network Infrastructure and Digital Content","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICNIDC.2009.5360836","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Chinese query expansion based on user log clustering
Most previous query expansion researches are based on pseudo relevant documents. In this study, we present a novel expansion method by clustering the real user log. Because not all of the clicked pages are suitable for query expansion, we de-noised the clicked results by reliability to enhance the performance. After HTML labels removing, the page body contents are clustered and the cluster centers cover various aspects of the original query. The terms used in log queries can provide a better choice of features, from the user's point of view, for summarizing the web pages that were clicked from these queries. Therefore, the associated queries, reverse queries, webpage title and keyword phrases are combined with the cluster centers to attain high-quality expansion terms for new queries. We also propose a new terminology extraction method through Baidu Baike. It can identify and extract the terminology phrase based on the manual edited dictionary online.