Building the Pornography Corpus for Bahasa Indonesia Based on TRUST+™ Positif Database

D. Gunawan, Syaiful Anwar Husen Lubis, R. Rahmat, A. Hizriadi
Published in: 2019 International Conference on ICT for Smart Society (ICISS), 2019-11-01
DOI: 10.1109/ICISS48059.2019.8969831 (https://doi.org/10.1109/ICISS48059.2019.8969831)
Citation count: 2

Abstract

The Indonesian government has developed a database called TRUST+™ Positif, which contains a list of blacklisted URLs. These URLs are blacklisted because they host prohibited material or negative content such as pornography, radicalism, fraud, racism, violence, gambling, and security threats. The government requires all Internet Service Providers (ISPs) in Indonesia to block access to websites listed in the TRUST+™ Positif database, and expects this measure to help reduce the spread of negative content, especially pornography. One of the many ways pornographic content is disseminated is through articles published on websites. The purpose of this research is to provide a pornography corpus that can serve as a reference for pornography identification research. The corpus is built from 1,000 articles taken from 150 websites listed in the TRUST+™ Positif database. Only sentences related to pornography are extracted, and the extraction is driven by a set of selected keywords. Candidate keywords are drawn from the most frequent words in the articles and must be related to pornography; 447 keywords were selected manually. The result of this research is a pornography corpus in Bahasa Indonesia consisting of 35,753 sentences.
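The extraction procedure summarized in the abstract (rank words by frequency, pick keywords by hand, keep only sentences that contain a keyword) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the naive punctuation-based sentence splitter, the alphabetic tokenizer, and the neutral placeholder keywords are all assumptions introduced here, and the manual keyword selection step is represented only by a hand-written set.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", sentence.lower())


def top_words(articles, n):
    """Return the n most frequent words across all articles (keyword candidates)."""
    counts = Counter()
    for article in articles:
        counts.update(tokenize(article))
    return [word for word, _ in counts.most_common(n)]


def extract_sentences(articles, keywords):
    """Keep every sentence that contains at least one selected keyword."""
    keywords = set(keywords)
    kept = []
    for article in articles:
        for sentence in split_sentences(article):
            if keywords & set(tokenize(sentence)):
                kept.append(sentence)
    return kept


# Placeholder data: "alpha" and "beta" stand in for manually selected keywords.
articles = [
    "Kata alpha muncul di sini. Kalimat ini netral.",
    "Ini juga netral! Ada beta dan alpha di sini.",
]
candidates = top_words(articles, 5)   # frequency-ranked candidates for manual review
selected = {"alpha", "beta"}          # hypothetical result of the manual selection
corpus = extract_sentences(articles, selected)
```

In this sketch the frequency list only proposes candidates; which candidates count as relevant is still a human decision, as the abstract describes. A real pipeline for Bahasa Indonesia would likely need a more careful sentence splitter and tokenizer than the ones assumed here.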