Towards building Urdu language document retrieval framework
Authors: Samreen Kazi, Shakeel Khoja
Journal: Computer Speech and Language, volume 95, Article 101797
Published: 2025-03-18
DOI: 10.1016/j.csl.2025.101797
URL: https://www.sciencedirect.com/science/article/pii/S0885230825000221
Impact factor: 3.4 (JCR Q2, Computer Science, Artificial Intelligence)
Citation count: 0
Abstract
Research on document retrieval has focused mainly on high-resource languages such as English, with limited attention to low-resource languages such as Urdu. Urdu, the tenth most spoken language globally with 230 million speakers, poses significant information retrieval (IR) challenges because of its complex script, word variations, and lack of standardization. This study introduces U-RR² (Urdu Document Retrieval & Ranking Framework), a two-stage retrieval framework that integrates traditional IR models with advanced feature representation techniques to improve Urdu document retrieval. In the first stage, documents and queries are represented using a combination of TF-IDF and embedding methods such as Word2Vec, FastText, and multilingual BERT (mBERT). Multiple retrieval models (Vector Space Model, BM25, DFR, and Jelinek-Mercer smoothing) are then evaluated, and the best-performing model is selected for initial retrieval. The second stage refines the rankings using SVMrank with a feature ensemble technique for improved relevance. Evaluation on the CURE, ROSHNI, and UIR_21 benchmarks demonstrates the framework's effectiveness. Our weighted Word2Vec approach achieves MAP scores of 0.78, 0.81, and 0.80 and P@10 scores of 0.79, 0.82, and 0.81. The feature ensemble further improves performance, reaching F1 scores of 0.89, 0.90, and 0.81, significantly outperforming baselines and establishing a new state of the art for Urdu document retrieval.
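To make the first-stage retrieval and the reported metrics more concrete, the following is a minimal sketch of Okapi BM25 scoring together with P@k and average precision (MAP is the mean of average precision over all queries). It is an illustration only, not the authors' implementation: the function names, the parameter defaults k1=1.2 and b=0.75, and the toy transliterated corpus are all assumptions introduced here, and the abstract does not state which BM25 variant or settings were used.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25).

    k1 and b are commonly used defaults, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def precision_at_k(ranked_ids, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


def average_precision(ranked_ids, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical toy corpus of transliterated, whitespace-tokenized documents.
docs = [
    ["urdu", "zaban", "adab"],
    ["cricket", "match", "lahore"],
    ["urdu", "shairi", "urdu", "adab"],
]
query = ["urdu", "adab"]

scores = bm25_scores(query, docs)
# Rank document indices by descending BM25 score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
relevant = {0, 2}  # assumed relevance judgments for this toy query

print(ranking, precision_at_k(ranking, relevant, k=2), average_precision(ranking, relevant))
```

In a two-stage setup like the one described, a ranking of this kind would only supply the candidate list; the second-stage re-ranker (SVMrank in the paper) would then reorder those candidates using additional features.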
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.