Multi-label webpage text classification based on feature segmentation and attention mechanism
Authors: Yanan Cheng, Wenling Li, Zhichao Zhang, Hao Chen, Zhaoxin Zhang
Journal: Neurocomputing, Volume 657, Article 131635 (impact factor 6.5; JCR Q1, Computer Science, Artificial Intelligence)
Published: 2025-09-23
DOI: 10.1016/j.neucom.2025.131635
URL: https://www.sciencedirect.com/science/article/pii/S0925231225023070
Because webpage content is naturally unevenly distributed, multi-label webpage text datasets suffer from the long-tailed label problem. Moreover, the length of multi-label webpage text varies widely, making it difficult to choose a sequence length for sequence-based deep learning models. To address these problems, this paper proposes a feature self-segmentation strategy that applies different segmentation strategies to webpage texts of different lengths, based on the sequence length of the deep learning model, so that long webpage texts are preserved without introducing too much noisy data for short ones. In addition, by computing attention between adjacent segments, computing attention between labels and segments, and constructing co-attention networks, the method highlights both the important content in the document and the content related to labels, which effectively extracts features associated with low-frequency labels and addresses the long-tailed label problem. Comparative experiments on the manually annotated Energy Website Multi-Label Webpage Text dataset and three benchmark multi-label text classification datasets show that the proposed method outperforms all baseline methods. The main code is available at https://github.com/sgysgywaityou/MLWT-FSAM/tree/main/MLWT-FSAM.
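The abstract does not spell out the segmentation rules or the attention formulas, so the following is only an illustrative sketch of the two general ideas, not the authors' implementation (their code is in the repository linked above). It assumes that short texts are kept as a single unpadded segment, that long texts are cut into consecutive chunks of at most `max_len` tokens, and that label-to-segment attention is a softmax over dot products. The function names `segment_tokens` and `label_segment_attention` are hypothetical.

```python
import math


def segment_tokens(tokens, max_len):
    """Split a token list into segments of at most max_len tokens.

    Assumed rule (not taken from the paper): a short text stays as one
    unpadded segment, and a long text is cut into consecutive chunks so
    that no content is discarded.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(tokens) <= max_len:
        return [list(tokens)]
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_segment_attention(label_vec, segment_vecs):
    """Weight segment vectors by their dot-product similarity to a label vector.

    Returns the attention weights and the attention-weighted sum of the
    segment vectors. Dot-product scoring is an assumption for illustration.
    """
    scores = [sum(l * s for l, s in zip(label_vec, seg)) for seg in segment_vecs]
    weights = softmax(scores)
    dim = len(label_vec)
    pooled = [
        sum(w * seg[d] for w, seg in zip(weights, segment_vecs))
        for d in range(dim)
    ]
    return weights, pooled


# Example: a 10-token text with a model sequence length of 4.
segments = segment_tokens(list(range(10)), 4)

# Example: one label vector attending over two toy segment vectors.
weights, pooled = label_segment_attention([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```

In this toy setup the segment whose vector aligns with the label vector receives the larger weight, which is the intuition behind using label-aware attention to surface content tied to low-frequency labels.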
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.