{"title":"Identifying Sexism and Misogyny in Pull Request Comments","authors":"Sayma Sultana","doi":"10.1145/3551349.3559515","DOIUrl":null,"url":null,"abstract":"Being extremely dominated by men, software development organizations lack diversity. People from other groups often encounter sexist, misogynistic, and discriminatory (SMD) speech during communication. To identify SMD contents, I aim to build an automatic misogyny identification (AMI) tool for the domain of software developers. On this goal, I built a dataset of 10,138 pull request comments mined from Github based on a keyword-based selection, followed by manual validation. Using ten-fold cross-validation, I evaluated ten machine learning algorithms for automatic identification. The best performing model achieved 80% precision, 67.07% recall, 72.5% f-score, and 95.96% accuracy.","PeriodicalId":197939,"journal":{"name":"Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3551349.3559515","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Being extremely dominated by men, software development organizations lack diversity. People from other groups often encounter sexist, misogynistic, and discriminatory (SMD) speech during communication. To identify SMD contents, I aim to build an automatic misogyny identification (AMI) tool for the domain of software developers. On this goal, I built a dataset of 10,138 pull request comments mined from Github based on a keyword-based selection, followed by manual validation. Using ten-fold cross-validation, I evaluated ten machine learning algorithms for automatic identification. The best performing model achieved 80% precision, 67.07% recall, 72.5% f-score, and 95.96% accuracy.