V. Lomas-Barrie, Michelle Reyes-Camacho, Antonio Neme
{"title":"通过异常检测和自然语言启发的可解释嵌入识别水平基因转移","authors":"V. Lomas-Barrie, Michelle Reyes-Camacho, Antonio Neme","doi":"10.3233/jifs-219337","DOIUrl":null,"url":null,"abstract":"Horizontal gene transference is a biological process that involves the donation of DNA or RNA from an organism to a second, unrelated organism. This process is different from the more common one, vertical transference, which is present whenever an organism or pair of organisms reproduce and transmit their genetic material to the descendants. The identification of segments of genetic material that are the result of horizontal transference is relevant to construct accurate phylogenetic trees, on one hand, and to detect possible drug-resistance mechanisms, on the other, since this movement of genetic material is the main cause behind antibiotic resistance in bacteria. Here, we describe a novel algorithm able to detect sequences of foreign origin, and thus, possible acquired via horizontal transference. The general idea of our method is that within the genome of an organism, there might be sequences that are different from the vast majority of the remaining sequences from the same organism. The former are candidate anomalies, and thus, their origin may be explained by horizontal transference. This approach is equivalent to a particular instance of the authorship attribution problem, that in which from a set of texts or paragraphs, almost all of them were written by the same author, whereas a minority has a different authorship. The constraint is that the author of each text is not known, so the algorithm has to attribute the authorship of each one of the texts. The texts detected to be written by a different author are the equivalent of the sequences of foreign origin for the case of genetic material. We describe here a novel method to detect anomalous sequences, based on interpretable embeddings derived from a common attention mechanism in humans, that of identifying novel tokens within a given sequence. Our proposal achieves novel and consistent results over the genome of a well known organism.","PeriodicalId":509313,"journal":{"name":"Journal of Intelligent & Fuzzy Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Identification of horizontal gene transference by means of anomaly detection and natural language-inspired interpretable embeddings\",\"authors\":\"V. Lomas-Barrie, Michelle Reyes-Camacho, Antonio Neme\",\"doi\":\"10.3233/jifs-219337\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Horizontal gene transference is a biological process that involves the donation of DNA or RNA from an organism to a second, unrelated organism. This process is different from the more common one, vertical transference, which is present whenever an organism or pair of organisms reproduce and transmit their genetic material to the descendants. The identification of segments of genetic material that are the result of horizontal transference is relevant to construct accurate phylogenetic trees, on one hand, and to detect possible drug-resistance mechanisms, on the other, since this movement of genetic material is the main cause behind antibiotic resistance in bacteria. Here, we describe a novel algorithm able to detect sequences of foreign origin, and thus, possible acquired via horizontal transference. The general idea of our method is that within the genome of an organism, there might be sequences that are different from the vast majority of the remaining sequences from the same organism. The former are candidate anomalies, and thus, their origin may be explained by horizontal transference. This approach is equivalent to a particular instance of the authorship attribution problem, that in which from a set of texts or paragraphs, almost all of them were written by the same author, whereas a minority has a different authorship. The constraint is that the author of each text is not known, so the algorithm has to attribute the authorship of each one of the texts. The texts detected to be written by a different author are the equivalent of the sequences of foreign origin for the case of genetic material. We describe here a novel method to detect anomalous sequences, based on interpretable embeddings derived from a common attention mechanism in humans, that of identifying novel tokens within a given sequence. Our proposal achieves novel and consistent results over the genome of a well known organism.\",\"PeriodicalId\":509313,\"journal\":{\"name\":\"Journal of Intelligent & Fuzzy Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-03-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Intelligent & Fuzzy Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3233/jifs-219337\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Intelligent & Fuzzy Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3233/jifs-219337","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
横向基因转移是一个生物过程,涉及从一个生物体向另一个不相关的生物体捐赠 DNA 或 RNA。这一过程与更常见的垂直转移不同,后者是指当一种生物或一对生物进行繁殖并将其遗传物质传给后代时,就会发生基因水平转移。识别水平转移过程中的遗传物质片段,一方面有助于构建准确的系统发生树,另一方面也有助于检测可能的耐药性机制,因为这种遗传物质的移动是细菌产生抗生素耐药性的主要原因。在此,我们介绍一种新颖的算法,该算法能够检测外源序列,从而检测可能通过水平转移获得的序列。我们的方法的总体思路是,在一个生物体的基因组中,可能有一些序列与同一生物体的绝大多数其余序列不同。前者是候选异常,因此,它们的起源可以用水平转移来解释。这种方法等同于作者归属问题的一个特殊实例,即在一组文本或段落中,几乎所有文本或段落都是由同一作者所写,而少数文本或段落的作者则不同。问题的限制条件是不知道每篇文本的作者,因此算法必须确定每篇文本的作者归属。被检测出作者不同的文本就相当于遗传物质中的异源序列。我们在此介绍一种检测异常序列的新方法,该方法基于人类常见的注意力机制(即识别给定序列中的新标记)衍生出的可解释嵌入。我们的建议在已知生物的基因组上取得了新颖而一致的结果。
Identification of horizontal gene transference by means of anomaly detection and natural language-inspired interpretable embeddings
Horizontal gene transference is a biological process that involves the donation of DNA or RNA from an organism to a second, unrelated organism. This process is different from the more common one, vertical transference, which is present whenever an organism or pair of organisms reproduce and transmit their genetic material to the descendants. The identification of segments of genetic material that are the result of horizontal transference is relevant to construct accurate phylogenetic trees, on one hand, and to detect possible drug-resistance mechanisms, on the other, since this movement of genetic material is the main cause behind antibiotic resistance in bacteria. Here, we describe a novel algorithm able to detect sequences of foreign origin, and thus, possible acquired via horizontal transference. The general idea of our method is that within the genome of an organism, there might be sequences that are different from the vast majority of the remaining sequences from the same organism. The former are candidate anomalies, and thus, their origin may be explained by horizontal transference. This approach is equivalent to a particular instance of the authorship attribution problem, that in which from a set of texts or paragraphs, almost all of them were written by the same author, whereas a minority has a different authorship. The constraint is that the author of each text is not known, so the algorithm has to attribute the authorship of each one of the texts. The texts detected to be written by a different author are the equivalent of the sequences of foreign origin for the case of genetic material. We describe here a novel method to detect anomalous sequences, based on interpretable embeddings derived from a common attention mechanism in humans, that of identifying novel tokens within a given sequence. Our proposal achieves novel and consistent results over the genome of a well known organism.