AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS: Latest Articles

Word Search in Handwritten Text Based on Stroke Segmentation
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-05 DOI: 10.3103/S0005105525701535
I. D. Morozov, L. M. Mestetskiy
{"title":"Word Search in Handwritten Text Based on Stroke Segmentation","authors":"I. D. Morozov,&nbsp;L. M. Mestetskiy","doi":"10.3103/S0005105525701535","DOIUrl":"10.3103/S0005105525701535","url":null,"abstract":"<p>Handwritten archival documents form a fundamental part of humanity’s cultural heritage. However, their analysis remains a labor-intensive task for professional researchers, including historians, philologists, and linguists. Working with historical manuscripts requires a fundamentally different approach from commercial OCR applications due to the extreme diversity of handwriting, the presence of corrections, and material degradation. This paper proposes a method for searching within handwritten texts based on stroke segmentation. Instead of performing full text recognition, which is often unattainable for historical documents, this method allows for efficiently answering researcher search queries. The key idea involves decomposing the text into elementary strokes, forming semantic vector representations using contrastive learning, followed by clustering and classification to create an adaptive handwriting dictionary. It is experimentally shown that search by comparing tuples of reduced sequences of the most informative strokes using the Levenshtein distance provides sufficient quality for the task at hand. This method demonstrates resilience to individual handwriting characteristics and writing variations, which is particularly important for working with authors’ archives and historical documents. 
The proposed approach opens up new possibilities for accelerating scientific research in the humanities, reducing the time required to find relevant information from weeks to minutes, thereby qualitatively transforming research capabilities when working with large archives of handwritten documents.</p>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 6","pages":"S549 - S556"},"PeriodicalIF":0.5,"publicationDate":"2026-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
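The search step described above compares reduced stroke sequences using the Levenshtein distance. A minimal, dependency-free sketch of that comparison; the stroke labels are hypothetical stand-ins for the paper’s stroke vocabulary:

```python
def levenshtein(a, b):
    """Edit distance between two sequences of stroke labels."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def search(query, words, max_dist=1):
    """Return words whose stroke sequence lies within max_dist of the query."""
    return [w for w in words if levenshtein(query, w) <= max_dist]
```

Tolerance to writing variation comes from `max_dist`: a higher value matches more handwriting variants at the cost of more false positives.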
Detection of Hallucinations Based on the Internal States of Large Language Models
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-05 DOI: 10.3103/S0005105525701456
T. R. Aisin, T. V. Shamardina
{"title":"Detection of Hallucinations Based on the Internal States of Large Language Models","authors":"T. R. Aisin,&nbsp;T. V. Shamardina","doi":"10.3103/S0005105525701456","DOIUrl":"10.3103/S0005105525701456","url":null,"abstract":"<div><p>In recent years, large language models (LLMs) have achieved substantial progress in natural language processing tasks and have become key instruments for addressing a wide range of applied and research problems. However, as their scale and capabilities grow, the issue of hallucinations or the generation of false, unreliable, or nonexistent information presented in a credible manner has become increasingly acute. Consequently, analyzing the nature of hallucinations and developing methods for their detection has acquired both scientific and practical significance. This study examines the phenomenon of hallucinations in LLMs, reviews their existing classification, and investigates potential causes. Using the Flan-T5 model, we analyze differences in the model’s internal states when generating hallucinations versus correct responses. Based on these discrepancies, we propose two approaches for hallucination detection: one leveraging attention maps and the other utilizing the model’s hidden states. These methods are evaluated on data from HaluEval and Shroom 2024 benchmarks in tasks such as summarization, question answering, paraphrasing, machine translation, and the generation of definitions. 
Additionally, we assess the transferability of the trained detectors across different hallucination types to evaluate the robustness of the proposed methods.</p></div>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 6","pages":"S489 - S497"},"PeriodicalIF":0.5,"publicationDate":"2026-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
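As a toy illustration of the general idea of hidden-state detection (not the authors’ method), a response can be flagged when its hidden-state vector lies far from the centroid of hidden states collected from known-truthful responses; the vectors and threshold below are invented:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

class CentroidDetector:
    """Flag a response as a likely hallucination when its hidden state is
    farther than `threshold` from the centroid of truthful hidden states."""
    def __init__(self, truthful_states, threshold):
        self.center = centroid(truthful_states)
        self.threshold = threshold

    def is_hallucination(self, state):
        return euclidean(state, self.center) > self.threshold
```

A trained classifier over the same features (as in the paper) would replace the fixed threshold with learned decision boundaries.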
Some Approaches to Improving Prediction Accuracy Using Ensemble Methods
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-05 DOI: 10.3103/S0005105525701523
X. Ma, O. V. Senko
{"title":"Some Approaches to Improving Prediction Accuracy Using Ensemble Methods","authors":"X. Ma,&nbsp;O. V. Senko","doi":"10.3103/S0005105525701523","DOIUrl":"10.3103/S0005105525701523","url":null,"abstract":"<p>This study presents the results of an experimental analysis evaluating the effectiveness of Extra Trees within gradient boosting models, as well as in a newly proposed ensemble framework where the forest is generated under conditions of enhanced internal divergence. Additionally, the paper explores the performance of extra trees when applied to novel feature representations computed as IDO distances to a selected set of reference examples. It has been shown that the use of extra randomized trees in gradient boosting and divergent forest models improves generalization ability. The use of expanded feature sets leads to even greater generalization ability.</p>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 6","pages":"S542 - S548"},"PeriodicalIF":0.5,"publicationDate":"2026-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
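The defining trait of Extra Trees (extremely randomized trees) is that split thresholds are drawn at random rather than optimized. A minimal regression-split sketch of that idea (illustrative only, not the paper’s implementation): one random threshold is drawn per feature, and the candidate with the lowest weighted child variance is kept.

```python
import random

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def extra_split(X, y, rng):
    """Pick a split the Extra Trees way: one *random* threshold per feature,
    then keep the candidate with the lowest weighted child variance.
    Returns (weighted variance, feature index, threshold) or None."""
    best = None
    for f in range(len(X[0])):
        col = [row[f] for row in X]
        lo, hi = min(col), max(col)
        if lo == hi:
            continue  # constant feature, nothing to split on
        t = rng.uniform(lo, hi)
        left = [yi for row, yi in zip(X, y) if row[f] < t]
        right = [yi for row, yi in zip(X, y) if row[f] >= t]
        if not left or not right:
            continue
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, f, t)
    return best
```

The extra randomness increases divergence between trees, which is exactly the property the divergent-forest framework above amplifies.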
Postcorrection of Weak Transcriptions by Large Language Models in the Iterative Process of Handwritten Text Recognition
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-05 DOI: 10.3103/S0005105525701511
V. P. Zykov, L. M. Mestetskiy
{"title":"Postcorrection of Weak Transcriptions by Large Language Models in the Iterative Process of Handwritten Text Recognition","authors":"V. P. Zykov,&nbsp;L. M. Mestetskiy","doi":"10.3103/S0005105525701511","DOIUrl":"10.3103/S0005105525701511","url":null,"abstract":"<p>The problem of accelerating the construction of accurate editorial annotations for handwritten archival texts within an incremental training cycle based on weak transcription is considered. Unlike previously published results, this work is focused on integrating automatic postcorrection of weak transcriptions using large language models (LLMs). A protocol for applying LLMs at the line level is proposed and implemented in a few-shot setup with carefully designed prompts and strict output format control (preservation of prereform orthography, protection of proper names and numerals, prohibition of structural changes to lines). Experiments have been conducted on the corpus of diaries of A.V. Sukhovo-Kobylin. As the base recognition model, we use the line-level version of the vertical attention network (VAN). The results show that LLM postcorrection (exemplified by the ChatGPT-4o service) substantially improves the readability of weak transcriptions and significantly reduces the word error rate (in our experiments, by about –12 percentage points), without degrading the character error rate. Another service tested, DeepSeek-R1, has demonstrated less stable behavior. 
Practical prompt engineering and limitations (context length limits, risk of “hallucinations”) are discussed, and recommendations are provided for the safe integration of LLM postcorrection into an iterative annotation pipeline to reduce expert annotators’ workload and speed up the digitization of historical archives.</p>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 6","pages":"S529 - S541"},"PeriodicalIF":0.5,"publicationDate":"2026-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
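A line-level few-shot setup with strict output-format control of the kind described above can be sketched as a prompt template. The wording, rule list, and helper below are illustrative assumptions, not the authors’ actual prompts:

```python
def build_prompt(line, examples):
    """Assemble a few-shot prompt for line-level postcorrection of a weak
    transcription. The rules mirror the constraints described above: keep
    prereform orthography, protect proper names and numerals, and return
    exactly one corrected line with no structural changes."""
    rules = (
        "Correct recognition errors in the handwritten-text line below.\n"
        "Rules: preserve prereform orthography; do not alter proper names "
        "or numerals; do not merge, split, or reorder lines; "
        "output exactly one corrected line and nothing else.\n"
    )
    shots = "".join(f"Input: {src}\nOutput: {tgt}\n" for src, tgt in examples)
    return rules + shots + f"Input: {line}\nOutput:"
```

Ending the prompt at `Output:` constrains the model to continue with the corrected line only, which simplifies automatic validation of the response format.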
A System for Testing Controllers Based on On-Screen Text Recognition
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-05 DOI: 10.3103/S000510552570150X
A. A. Dokukin
{"title":"A System for Testing Controllers Based on On-Screen Text Recognition","authors":"A. A. Dokukin","doi":"10.3103/S000510552570150X","DOIUrl":"10.3103/S000510552570150X","url":null,"abstract":"<p>A solution for the problem of testing controllers based on reading information from their screens is described. A hardware and software system has been developed for this purpose, which consists of a camera and software modules implementing the necessary algorithms and methods: an image preprocessing module; a menu type detection module; a font character processing module; a text reading module, including one written in various fonts; and the testing module itself. The system has been developed for a specific type of controller with a monochrome 128 × 64 pixel display. All methods are implemented in Python with commonly used libraries. The system has been launched into test operation and currently automates several of the most labor-intensive tests. The test set can be expanded using plugins.</p>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 6","pages":"S521 - S528"},"PeriodicalIF":0.5,"publicationDate":"2026-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
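On a monochrome fixed-font display, text reading can reduce to exact matching of glyph bitmaps against the binarized screen. A toy sketch of that step; the 3×3 font and the two glyphs are invented for illustration (the real system works on a 128 × 64 display after camera preprocessing):

```python
# Hypothetical 3x3 bitmap font: pixel grids mapped to characters.
GLYPHS = {
    ((0, 1, 0), (0, 1, 0), (0, 1, 0)): "I",
    ((1, 1, 1), (1, 0, 1), (1, 1, 1)): "O",
}

def read_row(screen, glyphs, width):
    """Read one character row of a monochrome display by exact matching of
    fixed-width glyph bitmaps; unknown cells are rendered as '?'."""
    text = []
    for x in range(0, len(screen[0]), width):
        cell = tuple(tuple(row[x:x + width]) for row in screen)
        text.append(glyphs.get(cell, "?"))
    return "".join(text)
```

Exact matching suffices only after preprocessing has aligned and binarized the camera image; tolerant matching (e.g. nearest glyph by pixel distance) would handle residual noise.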
Formation of Structured Representations of Scientific Journals for Integration into a Knowledge Graph and Semantic Search
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-05 DOI: 10.3103/S0005105525701468
O. M. Ataeva, M. G. Kobuk
{"title":"Formation of Structured Representations of Scientific Journals for Integration into a Knowledge Graph and Semantic Search","authors":"O. M. Ataeva,&nbsp;M. G. Kobuk","doi":"10.3103/S0005105525701468","DOIUrl":"10.3103/S0005105525701468","url":null,"abstract":"<p>This paper examines the development of the SciLibRu library of scientific subject areas as a continuation of the semantic description of scientific works from across the library LibMeta project. This library is based on a conceptual data model, the structure and semantics of which are formed from the principles of ontological modeling. This approach ensures the strict description of the subject area, a formalization of the relationships between entities, and the possibility of additional automated data analysis. The goal of the study is to develop and experimentally apply methods for structuring scientific journal data in LaTeX format for their integration into a library ontology and to support semantic search. An algorithm for translating data that are represented by multiple files into XML format is proposed for integration into the library ontology. A vector search module that is based on embedding calculation using language models is implemented. Patterns in the distribution of embeddings and factors influencing the accuracy of search results ranking are identified. Testing of the two components is conducted. The developed method forms the basis for automatically incorporating scientific journal data into the SciLibRu knowledge graph and creating training corpora for language models limited to scientific subject areas. 
The results contribute to the development of journal knowledge graph navigation systems, recommendation engines, and intelligent search tools for Russian-language scientific texts.</p>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 6","pages":"S498 - S504"},"PeriodicalIF":0.5,"publicationDate":"2026-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
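Vector search over precomputed embeddings, as in the module above, typically ranks documents by cosine similarity to the query embedding. A dependency-free sketch of the ranking step (the embedding model itself is out of scope; the vectors below are toy values):

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 for zero-length inputs."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, docs):
    """docs: list of (doc_id, embedding). Returns ids, most similar first."""
    scored = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in scored]
```

In production the linear scan is replaced by an approximate nearest-neighbor index, but the similarity function and ranking order are the same.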
Automatic and Semiautomatic Methods for Domain Knowledge-Graph Construction and Ontology Expansion
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-05 DOI: 10.3103/S0005105525701547
A. P. Khalov, O. M. Ataeva
{"title":"Automatic and Semiautomatic Methods for Domain Knowledge-Graph Construction and Ontology Expansion","authors":"A. P. Khalov,&nbsp;O. M. Ataeva","doi":"10.3103/S0005105525701547","DOIUrl":"10.3103/S0005105525701547","url":null,"abstract":"<p>We present a combined pipeline for knowledge-graph construction and ontology expansion. This approach creates a BIO-tagged corpus via fully automatic LLM-based pseudoannotation and introduces dedicated UNK reserve categories to capture previously unseen classes and relations. A specialized NER/RE model is trained on a 3-million-token dataset with 92 labels. This model exhibits a conservative quality profile—high precision with moderate recall—suited for safe graph enrichment: integrating the extracted facts expands the graph to ~0.98 million triples, while the expansion ratio (total inferred facts to explicit triples) increases from 2.65 to 3.52, with logical consistency preserved. UNK label pools are converted into stable synsets, enabling semiautomatic ontology expansion; 12 new classes derived from unstructured texts were added. We also demonstrate practical value for querying and analytics using an LLM + SPARQL setup.</p>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 6","pages":"S571 - S590"},"PeriodicalIF":0.5,"publicationDate":"2026-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
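The expansion ratio reported above, read as the number of facts inferred by the reasoner relative to the explicitly asserted triples, is straightforward to compute; the triple collections below are placeholders:

```python
def expansion_ratio(explicit_triples, inferred_facts):
    """Ratio of inferred facts to explicit triples, used above as a
    measure of how much reasoning enriches the knowledge graph."""
    return len(inferred_facts) / len(explicit_triples)
```

A rising ratio after fact integration (2.65 to 3.52 in the paper) indicates that the newly added explicit facts enable disproportionately more inferences.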
Exploring Posttraining Quantization of Large Language Models: An Efficiency Evaluation with a Focus on Russian-Language Tasks
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-04 DOI: 10.3103/S0005105525701389
D. R. Poimanov, M. S. Shutov
{"title":"Exploring Posttraining Quantization of Large Language Models: An Efficiency Evaluation with a Focus on Russian-Language Tasks","authors":"D. R. Poimanov,&nbsp;M. S. Shutov","doi":"10.3103/S0005105525701389","DOIUrl":"10.3103/S0005105525701389","url":null,"abstract":"<p>Quantization has become a key technique for the compression and acceleration of large language models (LLMs). Although research into low-bit quantization is actively advancing for English-language LLMs, its impact on morphologically rich and resource-diverse languages, including Russian, remains far less studied. Therefore, additional research into this problem is required, driven by the development of high-performance Russian-language and multilingual LLMs. We have conducted a systematic study of quantizing pretrained models to 2.0–4.25 bits per parameter for modern Russian-language LLMs at various scales, ranging from 4 to 32 billion parameters (4B and 32B). Our experimental setup covers both standard uniform quantization and specialized low-bit formats. Our findings highlight several key trends: (i) the tolerance of Russian-language LLMs to quantization varies across model architectures and sizes; (ii) 4-bit quantization demonstrates high robustness, particularly when advanced formats are employed; (iii) 3-bit and 2-bit quantizations prove to be the most sensitive to calibration data and scaling strategies. 
Empirical results show that the model’s domain must be considered when employing different quantization techniques.</p>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 5","pages":"S437 - S446"},"PeriodicalIF":0.5,"publicationDate":"2026-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
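Standard symmetric uniform quantization, one of the baselines mentioned above, maps each weight to a signed integer code plus a shared scale. A minimal per-tensor sketch (real pipelines quantize per group or per channel and choose the scale from calibration data):

```python
def quantize(weights, bits):
    """Symmetric uniform post-training quantization of a weight list to the
    given bit width; returns (integer codes, scale) for later dequantization."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for 4-bit signed codes
    m = max(abs(w) for w in weights)
    scale = m / qmax if m else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights from integer codes and the scale."""
    return [c * scale for c in codes]
```

The reconstruction error is bounded by half the quantization step, which is why 4-bit codes are robust while 2–3-bit codes become sensitive to how the scale is calibrated.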
Automatic Extraction of Argumentative Relations from Scientific Communication Texts
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-04 DOI: 10.3103/S0005105525701341
Yu. A. Zagorulko, E. A. Sidorova, I. R. Akhmadeeva
{"title":"Automatic Extraction of Argumentative Relations from Scientific Communication Texts","authors":"Yu. A. Zagorulko,&nbsp;E. A. Sidorova,&nbsp;I. R. Akhmadeeva","doi":"10.3103/S0005105525701341","DOIUrl":"10.3103/S0005105525701341","url":null,"abstract":"<p>The complexity of the problem of extracting argumentative structures is associated with such problems as selecting argumentative segments, predicting long-range connections between noncontact segments, and training on data labeled with a low degree of interannotator consistency. In this paper, we consider an approach to extracting argumentative relations from fairly large texts related to scientific communication. A comparative analysis was performed of fine-tuning methods using a pretrained Longformer-type language model that takes into account long contexts and two methods that take into account annotator discrepancies in argument labeling by using the so-called soft labels obtained by uniformly smoothing labels and averaging expert assessments. The experiments were conducted on four datasets containing positive and negative examples of statement pairs (premise, conclusion) and differing in segmentation methods and average text size. The best results were obtained using the model with averaging expert assessments. 
At the same time, it is noted that the model using smoothed labels also increases the accuracy of classifiers, but worsens the recall.</p>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 5","pages":"S410 - S414"},"PeriodicalIF":0.5,"publicationDate":"2026-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
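The two soft-label constructions compared above can be written down in a few lines; this is a minimal sketch of the general techniques, not the paper’s exact hyperparameters:

```python
def smooth_label(hard, eps=0.1, classes=2):
    """Uniform label smoothing: move eps of the probability mass off the
    gold class and spread it evenly over the remaining classes."""
    return [1 - eps if c == hard else eps / (classes - 1)
            for c in range(classes)]

def average_experts(votes, classes=2):
    """Soft label from annotator votes: the empirical class distribution,
    so a 2-to-1 disagreement becomes [1/3, 2/3] rather than a hard label."""
    return [votes.count(c) / len(votes) for c in range(classes)]
```

Both produce probability vectors usable as cross-entropy targets; expert averaging keeps the per-example disagreement signal that uniform smoothing discards, which is consistent with it performing best above.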
Verified Explainability Core: A GD–ANFIS/SHAP Hybrid Architecture for XAI 2.0
IF 0.5
AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS Pub Date: 2026-04-04 DOI: 10.3103/S0005105525701420
Y. V. Trofimov, A. D. Lebedev, A. S. Ilin, A. N. Averkin
{"title":"Verified Explainability Core: A GD–ANFIS/SHAP Hybrid Architecture for XAI 2.0","authors":"Y. V. Trofimov,&nbsp;A. D. Lebedev,&nbsp;A. S. Ilin,&nbsp;A. N. Averkin","doi":"10.3103/S0005105525701420","DOIUrl":"10.3103/S0005105525701420","url":null,"abstract":"<p>This paper proposes a hybrid Explainable AI architecture that fuses a fully differentiable neuro-fuzzy GD–ANFIS model with the post-hoc SHAP method. The integration is designed to meet XAI 2.0 principles, which call for explanations that are transparent, verifiable, and adaptable at the same time. GD–ANFIS produces human-readable Takagi–Sugeno rules, ensuring structural interpretability, whereas SHAP delivers quantitative feature contributions that are derived from Shapley theory. To merge these layers, we introduce a comparative-audit mechanism that automatically matches the sets of key features identified by both methods, checks whether the directions of influence coincide, and assesses the consistency between SHAP numerical scores and GD–ANFIS linguistic rules. In regression tests on the Boston Housing dataset and surface-water-quality monitoring, RMSE values of 2.30 and 2.36 were obtained, respectively, all with full interpretability preserved. In every case, top-feature overlap between the two explanation layers exceeded 60%, demonstrating strong agreement between structural and numerical interpretations. 
The proposed architecture therefore offers a practical foundation for responsible XAI 2.0 deployment in critical domains ranging from medicine and ecology to geoinformation systems and finance.</p>","PeriodicalId":42995,"journal":{"name":"AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS","volume":"59 5","pages":"S469 - S478"},"PeriodicalIF":0.5,"publicationDate":"2026-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147614794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
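The top-feature overlap used in the comparative audit above can be computed by intersecting the k highest-magnitude SHAP features with the features appearing in the fuzzy-rule antecedents. A minimal sketch with invented feature names:

```python
def top_overlap(shap_scores, rule_features, k=5):
    """Share of the k highest-|SHAP| features that also appear among the
    features used by the fuzzy rules (values in [0, 1])."""
    top = sorted(shap_scores, key=lambda f: abs(shap_scores[f]),
                 reverse=True)[:k]
    return len(set(top) & set(rule_features)) / k
```

An overlap above a chosen bar (60% in the paper’s experiments) signals that the numerical and structural explanation layers agree on what drives the prediction.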