Towards building Urdu language document retrieval framework

IF 3.4 | Zone 3 (Computer Science) | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language | Pub Date: 2025-03-18 | DOI: 10.1016/j.csl.2025.101797

Samreen Kazi, Shakeel Khoja

Computer Speech and Language, volume 95, Article 101797. Full text: https://www.sciencedirect.com/science/article/pii/S0885230825000221

Citations: 0

Abstract

Research on document retrieval has mainly focused on high-resource languages like English, with limited attention to low-resource languages such as Urdu. Urdu, the 10th most spoken language globally, with 230 million speakers, poses significant information retrieval (IR) challenges due to its complex script, word variations, and lack of standardization. This study introduces U-RR² (Urdu Document Retrieval & Ranking Framework), a two-stage retrieval framework that integrates traditional IR models with advanced feature representation techniques to improve Urdu document retrieval. In the first stage, documents and queries are represented using a combination of TF-IDF and embedding methods such as Word2Vec, FastText, and multilingual BERT (mBERT). Next, multiple retrieval models (Vector Space Model, BM25, DFR, and Jelinek-Mercer smoothing) are evaluated, and the best-performing model is selected for initial retrieval. The second stage refines the rankings using SVMrank with a feature ensemble technique for improved relevance. Evaluation on the CURE, ROSHNI, and UIR_21 benchmarks demonstrates our framework's effectiveness. Our weighted Word2Vec approach achieves MAP scores of 0.78, 0.81, and 0.80, and P@10 scores of 0.79, 0.82, and 0.81. The feature ensemble further improves performance, reaching F₁ scores of 0.89, 0.90, and 0.81, significantly outperforming baselines and establishing a new state of the art for Urdu document retrieval.
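To make the first-stage retrieval and the reported metrics concrete, the following is a minimal sketch of BM25 scoring together with P@k and average precision (MAP is the mean of average precision over all queries). It is not the authors' implementation: the toy corpus, the whitespace-style token lists, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and all function names are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are common default values, assumed here; the paper's settings
    are not given in the abstract.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (the "+1" variant, which is never negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked document ids that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical toy corpus (already tokenized) and relevance judgments.
docs = [
    ["cricket", "match", "lahore"],
    ["cricket", "team", "cricket", "win"],
    ["weather", "lahore", "rain"],
    ["election", "result"],
]
query = ["cricket", "lahore"]
relevant = {0, 2}

scores = bm25_scores(query, docs)
# Rank document ids by descending score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

print("ranking:", ranking)
print("P@2:", precision_at_k(ranking, relevant, 2))
print("AP:", round(average_precision(ranking, relevant), 4))
```

On this toy data the document matching both query terms ranks first, which gives a P@2 of 0.5 and an average precision of about 0.83. A second-stage re-ranker such as the SVMrank step described above would then reorder this initial list using additional features.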


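Source journal

Computer Speech and Language (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore

11.30

Self-citation rate

4.70%

Articles published

Review time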

22.9 weeks

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.