Son Huynh, Khiem H. Le, Nhi Dang, Bao Le, Dang T. Huynh, Binh T. Nguyen, T. T. Nguyen, N. Ho
Named Entity Recognition for Vietnamese Real Estate Advertisements
2021 8th NAFOSTED Conference on Information and Computer Science (NICS), published 2021-12-21
DOI: 10.1109/NICS54270.2021.9701519 (https://doi.org/10.1109/NICS54270.2021.9701519)
Citations: 4
Abstract
With the rapid growth of the Internet and e-commerce, advertising now appears in almost every area of life, especially in real estate. Understanding these advertisements is necessary to track the status of real estate transactions and the rent and sale prices across different areas and property types. Motivated by this, we present the first manually annotated Vietnamese dataset in the real estate domain. The dataset is annotated for the named entity recognition (NER) task with many entity types and, compared with other Vietnamese NER datasets, contains the largest number of entities. We empirically investigate a strong baseline on the dataset using the API provided by the spaCy library, with a pipeline of four main components: tokenization, embedding, encoding, and parsing. For encoding, we experiment with several encoders: convolutions with Maxout activation (MaxoutWindowEncoder), convolutions with Mish activation (MishWindowEncoder), and bidirectional long short-term memory (BiLSTMEncoder). The experimental results show that the MishWindowEncoder performs best, with a micro F1-score of 90.72%. Finally, we plan to release the dataset to support the research community working on named entity recognition.
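The abstract reports results as a micro F1-score. As a minimal sketch of how such a score is typically computed for NER, the Python below pools true positives, false positives, and false negatives over all entity types and documents before computing precision, recall, and F1. It assumes exact-match scoring on (start, end, label) spans, which the abstract does not specify, and the labels `PRICE`, `AREA`, and `DISTRICT` and the function name `micro_f1` are hypothetical and not taken from the paper's dataset.

```python
def micro_f1(gold_docs, pred_docs):
    """Micro-averaged precision, recall, and F1 over entity spans.

    gold_docs and pred_docs are parallel lists, one set per document,
    each set holding (start, end, label) tuples. A predicted span counts
    as correct only if it matches a gold span exactly (an assumption).
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # spans found with the right boundaries and label
        fp += len(pred - gold)   # predicted spans with no exact gold match
        fn += len(gold - pred)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data with hypothetical labels; not drawn from the paper's dataset.
gold = [
    {(0, 2, "PRICE"), (5, 7, "AREA")},
    {(1, 3, "DISTRICT")},
]
pred = [
    {(0, 2, "PRICE"), (5, 8, "AREA")},  # AREA boundary is off by one token
    {(1, 3, "DISTRICT")},
]

precision, recall, f1 = micro_f1(gold, pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because the counts are pooled before averaging, frequent entity types weigh more heavily in a micro score than rare ones, which matters for a dataset with many entity types.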