An interpretable method for automated classification of spoken transcripts and written text.

Impact factor: 2.6; JCR quartile: Q3; Category: Computer Science, Artificial Intelligence

Evolutionary Intelligence. Pub Date: 2023-05-04. DOI: 10.1007/s12065-023-00851-1

Mattias Wahde, Marco L Della Vedova, Marco Virgolin, Minerva Suvanto

Publication type: Journal Article. Pages: 1-13. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10157555/pdf/

Citations: 0

Abstract


An interpretable method for automated classification of spoken transcripts and written text.


We investigate the differences between spoken language (in the form of radio show transcripts) and written language (Wikipedia articles) in the context of text classification. We present a novel, interpretable method for text classification, involving a linear classifier using a large set of $n$-gram features, and apply it to a newly generated data set with sentences originating either from spoken transcripts or written text. Our classifier reaches an accuracy less than 0.02 below that of a commonly used classifier (DistilBERT) based on deep neural networks (DNNs). Moreover, our classifier has an integrated measure of confidence, for assessing the reliability of a given classification. An online tool is provided for demonstrating our classifier, particularly its interpretable nature, which is a crucial feature in classification tasks involving high-stakes decision-making. We also study the capability of DistilBERT to carry out fill-in-the-blank tasks in either spoken or written text, and find it to perform similarly in both cases. Our main conclusion is that, with careful improvements, the performance gap between classical methods and DNN-based methods may be reduced significantly, such that the choice of classification method comes down to the need (if any) for interpretability.
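The abstract gives no implementation details, so the following is only a minimal Python sketch of the general idea it names: a linear classifier over word n-gram features whose per-feature weights can be inspected. The weighting scheme (add-one smoothed log-odds), the confidence formula (a logistic function of the score magnitude), the toy sentences, and all function names are assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def ngrams(sentence, n_max=2):
    """Return all word n-grams of length 1..n_max as tuples."""
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def train(spoken, written, n_max=2):
    """Give each n-gram a weight: smoothed log-odds of spoken vs. written.

    Assumption for illustration only; the paper's training procedure
    is not described in the abstract.
    """
    spoken_counts = Counter(g for s in spoken for g in ngrams(s, n_max))
    written_counts = Counter(g for s in written for g in ngrams(s, n_max))
    vocab = set(spoken_counts) | set(written_counts)
    spoken_total = sum(spoken_counts.values()) + len(vocab)
    written_total = sum(written_counts.values()) + len(vocab)
    return {
        g: math.log((spoken_counts[g] + 1) / spoken_total)
        - math.log((written_counts[g] + 1) / written_total)
        for g in vocab
    }


def classify(sentence, weights, n_max=2):
    """Return (label, confidence, contributions) for one sentence."""
    contributions = [(g, weights[g])
                     for g in ngrams(sentence, n_max) if g in weights]
    # Linear model: the score is the sum of the weights of matched n-grams.
    score = sum(w for _, w in contributions)
    label = "spoken" if score > 0 else "written"
    # Hypothetical confidence: logistic function of |score|, at least 0.5.
    confidence = 1.0 / (1.0 + math.exp(-abs(score)))
    # Sorting by absolute weight shows which n-grams drove the decision.
    contributions.sort(key=lambda item: abs(item[1]), reverse=True)
    return label, confidence, contributions


# Toy data, invented for this sketch.
spoken = ["well you know i think that is right",
          "yeah i mean it was great you know"]
written = ["the city is the capital of the region",
           "the river was first described in 1850"]

weights = train(spoken, written)
label, confidence, contributions = classify("you know i think it is great", weights)
print(label, round(confidence, 3), contributions[:3])
```

Because the score is a plain sum, each entry in `contributions` states exactly how much one n-gram pushed the sentence towards "spoken" (positive) or "written" (negative), which is the kind of inspection a linear n-gram model permits.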


Source journal

Evolutionary Intelligence (Computer Science, Artificial Intelligence)

CiteScore: 6.80

Self-citation rate: 0.00%

Articles published: 108

About the journal: This journal provides an international forum for the timely publication and dissemination of foundational and applied research in the domain of Evolutionary Intelligence. Its focus is the spectrum of emerging fields in contemporary artificial intelligence, including Big Data, Deep Learning, and Computational Neuroscience, bridged with evolutionary computing and other population-based search methods. Topics of interest cover different aspects of evolutionary models of computation empowered with intelligence-based approaches, including but not limited to architectures, model optimization and tuning, machine learning algorithms, life-inspired adaptive algorithms, swarm-oriented strategies, high-performance computing, and massive data processing, with applications to domains such as computer vision, image processing, simulation, robotics, computational finance, media, internet of things, medicine, bioinformatics, and smart cities. Surveys outlining the state of the art in specific subfields and applications are welcome.