{"title":"通过Metacrawler清除灰尘","authors":"Smita Deshmukh, Priti Chittekar","doi":"10.1109/ICOEI48184.2020.9142922","DOIUrl":null,"url":null,"abstract":"Nowadays URLs collected by Search engine contain mirrored data. Some of the pages gathered by the crawler contain duplicated data. Different URLs with Similar Text are generally known as DUST. With the effect of DUST, the disk storage is wasted, quality rankings are degraded and lower user experiences. To avoid such problems, many kind of research has been recommended and the methods which are already available define only URL DUST removal and detection. The system which is going to be implemented can find and erase content DUST and URL DUST. The concept of the Metacrawler is introduced that crawl the documents and gets results from all the three search engines. The comparisons of content of every website with the another to eliminate mirrored data using k- gram paraphrase technique is defined in current method.","PeriodicalId":267795,"journal":{"name":"2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184)","volume":"241 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Removing Dust By Metacrawler\",\"authors\":\"Smita Deshmukh, Priti Chittekar\",\"doi\":\"10.1109/ICOEI48184.2020.9142922\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Nowadays URLs collected by Search engine contain mirrored data. Some of the pages gathered by the crawler contain duplicated data. Different URLs with Similar Text are generally known as DUST. With the effect of DUST, the disk storage is wasted, quality rankings are degraded and lower user experiences. To avoid such problems, many kind of research has been recommended and the methods which are already available define only URL DUST removal and detection. The system which is going to be implemented can find and erase content DUST and URL DUST. The concept of the Metacrawler is introduced that crawl the documents and gets results from all the three search engines. 
The comparisons of content of every website with the another to eliminate mirrored data using k- gram paraphrase technique is defined in current method.\",\"PeriodicalId\":267795,\"journal\":{\"name\":\"2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184)\",\"volume\":\"241 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICOEI48184.2020.9142922\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICOEI48184.2020.9142922","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Nowadays, the URLs collected by search engines often point to mirrored data, and some of the pages gathered by a crawler contain duplicated content. Different URLs with Similar Text are generally known as DUST. DUST wastes disk storage, degrades quality rankings, and lowers the user experience. To avoid these problems, many approaches have been proposed, but the methods already available address only the detection and removal of URL DUST. The system to be implemented can find and erase both content DUST and URL DUST. The concept of a Metacrawler is introduced, which crawls the documents and aggregates results from all three search engines. In the current method, the content of every website is compared with that of the others using a k-gram paraphrase technique in order to eliminate mirrored data.
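The abstract does not spell out the exact k-gram comparison, but a minimal sketch of how near-duplicate page content can be flagged with k-gram shingling and Jaccard similarity might look as follows (the function names, the value of k, and the similarity threshold are illustrative assumptions, not details taken from the paper):

```python
import re

def kgram_shingles(text, k=5):
    """Lowercase and tokenize the text, then return the set of k-gram (k-word) shingles."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard_similarity(a, b):
    """Jaccard similarity between two shingle sets: 0.0 = disjoint, 1.0 = identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two pages reached via different URLs but carrying (nearly) the same text.
page1 = "Different URLs with similar text are generally known as DUST and waste disk storage."
page2 = "Different URLs with similar text are generally known as DUST and waste the disk storage."

score = jaccard_similarity(kgram_shingles(page1), kgram_shingles(page2))
print(score)  # ~0.62 for these short near-duplicates; identical mirrors score 1.0
```

Pages whose similarity score exceeds a chosen threshold would be treated as content DUST, and all but one copy could be dropped from the index; the appropriate threshold depends on the corpus and is an assumption here.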