{"title":"SOLAR: Illuminating LLM performance in API discovery and service ranking for edge AI and IoT","authors":"Eyhab Al-Masri, Ishwarya Narayana Subramanian","doi":"10.1016/j.iot.2025.101630","DOIUrl":null,"url":null,"abstract":"<div><div>The growing complexity of web service and API discovery calls for robust methods to evaluate how well Large Language Models (LLMs) retrieve, rank, and assess APIs. However, current LLMs often produce inconsistent results, highlighting the need for structured, multi-dimensional evaluation. This paper introduces SOLAR (Systematic Observability of LLM API Retrieval), a framework that assesses LLM performance across three key dimensions: functional capability, implementation feasibility, and service sustainability. We evaluate four leading LLMs—GPT-4 Turbo (OpenAI), Claude 3.5 Sonnet (Anthropic), LLaMA 3.2 (Meta), and Gemini 2.0 Flash (Google)—on their ability to identify, prioritize, and evaluate APIs across varying query complexities. Results show GPT-4 Turbo and Claude 3.5 Sonnet achieve high functional alignment (FCA ≥ 0.75 for simple queries) and strong ranking consistency (Spearman’s ρ ≈ 0.95). However, all models struggle with implementation feasibility and long-term sustainability, with feasibility scores declining as complexity increases and sustainability scores remaining low (SSI ≈ 0.40), limiting deployment potential. Despite retrieving overlapping APIs, models often rank them inconsistently, raising concerns for AI-driven service selection. SOLAR identifies strong correlations between functional accuracy and ranking stability but weaker links to real-world feasibility and longevity. These findings are particularly relevant for Edge AI environments, where real-time processing, distributed intelligence, and reliable API integration are critical. 
SOLAR offers a comprehensive lens for evaluating LLM effectiveness in service discovery, providing actionable insights to advance robust, intelligent API integration across IoT and AI-driven systems. Our work aims to inform both future model development and deployment practices in high-stakes computing environments.</div></div>","PeriodicalId":29968,"journal":{"name":"Internet of Things","volume":"32 ","pages":"Article 101630"},"PeriodicalIF":6.0000,"publicationDate":"2025-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Internet of Things","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2542660525001441","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The growing complexity of web service and API discovery calls for robust methods to evaluate how well Large Language Models (LLMs) retrieve, rank, and assess APIs. However, current LLMs often produce inconsistent results, highlighting the need for structured, multi-dimensional evaluation. This paper introduces SOLAR (Systematic Observability of LLM API Retrieval), a framework that assesses LLM performance across three key dimensions: functional capability, implementation feasibility, and service sustainability. We evaluate four leading LLMs—GPT-4 Turbo (OpenAI), Claude 3.5 Sonnet (Anthropic), LLaMA 3.2 (Meta), and Gemini 2.0 Flash (Google)—on their ability to identify, prioritize, and evaluate APIs across varying query complexities. Results show GPT-4 Turbo and Claude 3.5 Sonnet achieve high functional alignment (FCA ≥ 0.75 for simple queries) and strong ranking consistency (Spearman’s ρ ≈ 0.95). However, all models struggle with implementation feasibility and long-term sustainability, with feasibility scores declining as complexity increases and sustainability scores remaining low (SSI ≈ 0.40), limiting deployment potential. Despite retrieving overlapping APIs, models often rank them inconsistently, raising concerns for AI-driven service selection. SOLAR identifies strong correlations between functional accuracy and ranking stability but weaker links to real-world feasibility and longevity. These findings are particularly relevant for Edge AI environments, where real-time processing, distributed intelligence, and reliable API integration are critical. SOLAR offers a comprehensive lens for evaluating LLM effectiveness in service discovery, providing actionable insights to advance robust, intelligent API integration across IoT and AI-driven systems. Our work aims to inform both future model development and deployment practices in high-stakes computing environments.
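The abstract reports ranking consistency via Spearman's ρ ≈ 0.95 between models' API rankings. As a minimal sketch of how that metric is computed, the snippet below implements the standard tie-free Spearman formula, ρ = 1 − 6Σd²/(n(n²−1)); the model names and rank lists are hypothetical illustrations, not data from the paper.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings.

    Each argument is a list of ranks (a permutation of 1..n) assigned
    to the same set of retrieved APIs by two different models.
    """
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must cover the same set of APIs")
    n = len(rank_a)
    # Sum of squared rank differences across the shared APIs.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Hypothetical example: two models rank the same five APIs,
# swapping only the top two positions.
model_x = [1, 2, 3, 4, 5]
model_y = [2, 1, 3, 4, 5]
print(round(spearman_rho(model_x, model_y), 2))  # 0.9
```

A ρ near 1 indicates the models agree on API ordering even when their raw scores differ; values well below 1 signal the inconsistent rankings the paper flags as a risk for AI-driven service selection.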
Journal description:
Internet of Things: Engineering Cyber Physical Human Systems is a comprehensive journal encouraging cross-collaboration between researchers, engineers, and practitioners in the field of IoT and Cyber Physical Human Systems. The journal offers a unique platform to exchange scientific information on the entire breadth of technology, science, and societal applications of the IoT.
The journal places a high priority on timely publication and provides a home for high-quality work.
Furthermore, the journal is interested in publishing topical Special Issues on any aspect of the IoT.