Yusuke Tozaki, Takahiko Suzuki, Tsunenori Mine, S. Hirokawa
2019 8th International Congress on Advanced Applied Informatics (IIAI-AAI), published 2019-07-01
DOI: 10.1109/IIAI-AAI.2019.00207 (https://doi.org/10.1109/IIAI-AAI.2019.00207)
Extracting Irregular Datasets in University Admission Statistics using Text Mining and Benford's Law
Benford's law states that the first digits of naturally occurring numerical datasets follow a specific distribution. Deviation from Benford's distribution indicates that a dataset is irregular, but it gives no clue as to why. This paper constructs a search engine over the cells that appear in tables, associating each cell with the words in the title of its row or column or in the explanation of the table. By enumerating the search conditions, we generate an exhaustive collection of cell datasets to test for irregularity. We applied the method to the number of applicants, the number of candidates, and the number of successful applicants in each department of 565 private universities in Japan. We confirmed the effectiveness of the proposed method by extracting the characteristics of the irregular datasets.
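As a rough illustration of the idea, the sketch below scores groups of table cells against Benford's expected first-digit probability, log10(1 + 1/d) for d = 1..9. It is not the authors' implementation: the abstract does not say which deviation measure or index structure the paper uses, so the mean absolute deviation, the function names, and the (value, words) cell representation are all assumptions made for this example, and the sample cells are invented.

```python
import math
from collections import Counter, defaultdict

# Benford's expected probability for leading digit d (1..9): log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(n: int) -> int:
    """Return the leading decimal digit of a non-zero integer."""
    return int(str(abs(n))[0])


def benford_deviation(values):
    """Mean absolute deviation between observed first-digit shares and Benford's.

    Assumption: the paper's actual irregularity measure is not given in the
    abstract; mean absolute deviation is used here only as a simple stand-in.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("need at least one positive value")
    counts = Counter(first_digit(v) for v in positives)
    total = len(positives)
    return sum(abs(counts.get(d, 0) / total - BENFORD[d]) for d in range(1, 10)) / 9


def deviation_by_word(cells):
    """Group cell values by each associated word and score every group.

    `cells` is a list of (value, words) pairs, where `words` stands for the
    terms taken from the row title, column title, or table explanation.
    Each word acts as one search condition (hypothetical representation).
    """
    groups = defaultdict(list)
    for value, words in cells:
        if value > 0:  # zero and negative counts have no usable first digit
            for word in words:
                groups[word].append(value)
    return {word: benford_deviation(vals) for word, vals in groups.items()}


if __name__ == "__main__":
    # Invented sample cells, for illustration only.
    sample = [
        (120, ["applicants", "engineering"]),
        (95, ["applicants", "literature"]),
        (31, ["successful", "engineering"]),
        (18, ["successful", "literature"]),
    ]
    for word, score in sorted(deviation_by_word(sample).items()):
        print(f"{word}: {score:.3f}")
```

A larger score means the group's first digits sit further from Benford's distribution. Because every group is labeled by the word that selected it, an irregular group comes with a hint about which rows, columns, or tables it came from. With groups as small as the sample above the scores are not meaningful; a real test would need many cells per condition.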