Liner Yang;Yujie Wang;Zhixuan Fang;Yaping Huang;Erhong Yang
IEEE Transactions on Network Science and Engineering, vol. 12, no. 4, pp. 3343-3359. Published 2025-04-09. DOI: 10.1109/TNSE.2025.3559342. URL: https://ieeexplore.ieee.org/document/10959726/
Cost-Optimized Crowdsourcing for NLP via Worker Selection and Data Augmentation
This paper presents worker selection and data augmentation algorithms aimed at improving annotation quality and reducing costs in crowdsourcing for Natural Language Processing (NLP). Unlike previous studies targeting simpler tasks like binary classification, which require less contextual understanding, this study aims to provide a unified paradigm for a wider spectrum of NLP tasks, with sequence labeling and text generation as application showcases. Utilizing a Combinatorial Multi-Armed Bandit (CMAB) approach and a cost-effective human feedback mechanism, the proposed worker selection algorithm effectively addresses the challenge of label inter-dependency in NLP tasks. Additionally, our algorithm tackles the issues presented by imbalanced and small-scale datasets through data augmentation methods. Experiments on the CoNLL 2003 NER, Chinese OEI, and YACLC datasets demonstrated the algorithm's efficiency, achieving up to 100.04% of the expert-only baseline ${\text{F}}$-score and 65.97% cost savings. A dataset-independent experiment yielded 97.56% of the expert baseline ${\text{F}}$-score and 59.88% cost savings. We also provide a theoretical analysis proving our worker selection framework achieves sub-linear regret.
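The abstract names a Combinatorial Multi-Armed Bandit (CMAB) approach to worker selection but does not give the algorithm itself. As a rough illustration of the general idea only, the sketch below uses a generic upper-confidence-bound rule to pick the top `m` workers each round and updates their estimated quality from simulated feedback. The function names, the Bernoulli feedback model, the exploration constant 1.5, and the `true_quality` values are all assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def select_workers(counts, means, t, m):
    """Return indices of the m workers with the highest UCB index at round t.

    Hypothetical helper: a generic combinatorial-UCB style top-m rule,
    not the selection rule from the paper.
    """
    def ucb(i):
        if counts[i] == 0:
            return float("inf")  # try every worker at least once
        # Empirical mean plus an exploration bonus that shrinks with use.
        return means[i] + math.sqrt(1.5 * math.log(t) / counts[i])

    return sorted(range(len(counts)), key=ucb, reverse=True)[:m]


def update(counts, means, worker, reward):
    """Incrementally update the running mean quality of one worker."""
    counts[worker] += 1
    means[worker] += (reward - means[worker]) / counts[worker]


def simulate(true_quality, m, rounds, seed=0):
    """Simulate worker selection with assumed Bernoulli quality feedback."""
    rng = random.Random(seed)
    k = len(true_quality)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, rounds + 1):
        for w in select_workers(counts, means, t, m):
            # Assumed feedback: 1 if the annotation is judged correct, else 0.
            reward = 1.0 if rng.random() < true_quality[w] else 0.0
            update(counts, means, w, reward)
    return counts, means


if __name__ == "__main__":
    # Made-up worker qualities, for illustration only.
    counts, means = simulate([0.9, 0.85, 0.3, 0.2, 0.1], m=2, rounds=2000)
    print("times selected:", counts)
    print("estimated quality:", [round(x, 2) for x in means])
```

In this toy setting the two highest-quality workers end up receiving most of the assignments, while the others are sampled only often enough to rule them out. That behavior is what a sub-linear regret guarantee describes informally.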
About the journal:
The IEEE Transactions on Network Science and Engineering (TNSE) is committed to the timely publication of peer-reviewed technical articles on the theory and applications of network science and on the interconnections among the elements of a system that form a network. In particular, it publishes articles on understanding, predicting, and controlling the structures and behaviors of networks at the fundamental level. The types of networks covered include physical or engineered networks, information networks, biological networks, semantic networks, economic networks, social networks, and ecological networks. The journal aims to discover common principles that govern network structures, functionalities, and behaviors. Another trans-disciplinary focus is the interaction between, and co-evolution of, different genres of networks.