Kaiting Lai, Yinong Long, Bowen Wu, Ying Li, Baoxun Wang
Proceedings of the 31st ACM International Conference on Information & Knowledge Management. Published 2022-10-17. DOI: 10.1145/3511808.3557448 (https://doi.org/10.1145/3511808.3557448)
Semorph: A Morphology Semantic Enhanced Pre-trained Model for Chinese Spam Text Detection
Chinese spam text detection is essential for social media, since spam degrades the experience of Chinese-speaking users and pollutes the community. Detection is typically framed as text classification, in which a model learns, from annotated or further augmented data, the distinctive character combinations that signal spam. However, exploiting the glyph diversity of Chinese characters, spammers frequently disguise spam content by substituting visually similar characters, which fools the model while remaining understandable to human readers. This paper proposes to bring the way humans read such adversarial texts into spam text detection models by designing a pre-trained model that learns the morphological semantics of Chinese characters and represents their contextual meanings from scratch. The model is pre-trained in a self-supervised manner on a Chinese corpus and fine-tuned on spam-annotated community texts. In addition, to work with the pre-trained model's ability to capture the morphological features of Chinese, a new data perturbation method is introduced that guides optimization toward recognizing the actual meaning of a text after spammers have replaced some of its characters with visually similar ones. Experimental results show that the proposed methodology notably improves spam text detection performance while maintaining robustness against adversarial samples.
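The abstract does not spell out how the data perturbation works, so the following is only a minimal sketch of the general idea it describes: replacing some characters in a labelled text with visually similar ones and keeping the original label. The lookup table `VISUALLY_CLOSE`, the functions `perturb` and `augment`, and the `rate` parameter are all hypothetical names chosen for illustration, not the authors' implementation.

```python
import random

# Hypothetical table of visually similar characters (illustrative only;
# these pairs are an assumption, not data from the paper).
VISUALLY_CLOSE = {
    "微": ["徽", "薇"],
    "信": ["伩"],
    "加": ["伽", "茄"],
    "免": ["兔"],
}


def perturb(text, rate=0.3, rng=None):
    """Replace roughly `rate` of the substitutable characters with look-alikes."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        candidates = VISUALLY_CLOSE.get(ch)
        # Only characters present in the table can be swapped.
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


def augment(samples, rate=0.3, seed=0):
    """Return each (text, label) pair followed by a perturbed copy with the same label."""
    rng = random.Random(seed)
    augmented = []
    for text, label in samples:
        augmented.append((text, label))
        augmented.append((perturb(text, rate, rng), label))
    return augmented
```

Calling `augment([("加微信", 1)])` would yield the original sample plus one perturbed variant that keeps the spam label, so a classifier fine-tuned on the result sees both the clean and the disguised form of the same message. A real system would presumably derive the similarity table from glyph data rather than a hand-written dictionary.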