Llama 3.1 405B Is Comparable to GPT-4 for Extraction of Data from Thrombectomy Reports - A Step Towards Secure Data Extraction
Nils C Lehnen, Johannes Kürsch, Barbara D Wichtmann, Moritz Wolter, Zeynep Bendella, Felix J Bode, Hanna Zimmermann, Alexander Radbruch, Philipp Vollmuth, Franziska Dorn
Clinical Neuroradiology. Published 2025-02-25. DOI: 10.1007/s00062-025-01500-z (https://doi.org/10.1007/s00062-025-01500-z)
Abstract
Purpose: GPT-4 has been shown to correctly extract procedural details from free-text reports on mechanical thrombectomy. However, GPT may not be suitable for analyzing reports that contain personal data. This study evaluated the ability of four large language models (LLMs) that can be operated offline, Llama 3.1 405B, Llama 3 70B, Llama 3 8B, and Mixtral 8x7B, to extract procedural details from free-text reports on mechanical thrombectomy.
Methods: Free-text reports on mechanical thrombectomy from two institutions were included. A detailed prompt was provided in both German and English. The LLMs' ability to extract procedural data was compared with that of GPT-4 using McNemar's test. Manual data entries made by an interventional neuroradiologist served as the reference standard.
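As an illustration of the paired comparison described in the Methods, the following is a minimal sketch of an exact McNemar test on per-data-point correctness flags for two models. It assumes the exact (binomial) form of the test; the abstract does not state which variant was used. The function names `discordant_counts` and `mcnemar_exact` and the toy data are hypothetical and are not taken from the study.

```python
from math import comb


def discordant_counts(model_a, model_b):
    """Count data points where exactly one of the two models is correct.

    Each argument is a sequence of booleans (True = matches the reference
    standard), aligned so that index i refers to the same data point.
    """
    if len(model_a) != len(model_b):
        raise ValueError("both models must be scored on the same data points")
    b = sum(1 for x, y in zip(model_a, model_b) if x and not y)  # only A correct
    c = sum(1 for x, y in zip(model_a, model_b) if y and not x)  # only B correct
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant pair is equally likely to
    favour either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, hence no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy correctness flags, NOT data from the study.
local_llm = [True, True, False, True, False, True, True, True]
gpt4 = [True, False, False, True, True, True, False, True]

b, c = discordant_counts(local_llm, gpt4)
print(f"discordant pairs: b={b}, c={c}, p={mcnemar_exact(b, c):.3f}")
```

Only the discordant pairs enter the test: data points that both models get right, or both get wrong, carry no information about which model is better.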
Results: A total of 100 reports from institution 1 (mean age 74.7 ± 13.2 years; 53 females) and 30 reports from institution 2 (mean age 72.7 ± 13.5 years; 18 males) were included. Llama 3.1 405B extracted 2619 of 2800 data points correctly (93.5% [95% CI: 92.6%, 94.4%], p = 0.39 vs. GPT-4). Llama 3 70B extracted 2537 data points correctly with the English prompt (90.6% [95% CI: 89.5%, 91.7%], p < 0.001 vs. GPT-4) and 2471 with the German prompt (88.2% [95% CI: 87.0%, 89.4%], p < 0.001 vs. GPT-4). Llama 3 8B extracted 2314 data points correctly (86.1% [95% CI: 84.8%, 87.4%], p < 0.001 vs. GPT-4), and Mixtral 8x7B extracted 2411 data points correctly (86.1% [95% CI: 84.8%, 87.4%], p < 0.001 vs. GPT-4).
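The reported percentages and intervals can be recomputed from the counts. The sketch below assumes a normal-approximation (Wald) 95% interval and a denominator of 2800 data points for every model. The abstract names neither the interval method nor the denominator for any model other than Llama 3.1 405B, so both are assumptions, and `wald_ci` is a hypothetical helper name. Under these assumptions the count of 2314 for Llama 3 8B corresponds to 82.6%, not the reported 86.1%, which is identical to the Mixtral 8x7B figure. The abstract alone does not show which value is intended.

```python
import math


def wald_ci(correct, total, z=1.96):
    """Return (proportion, lower, upper) for a normal-approximation interval."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


TOTAL = 2800  # assumed to apply to every model

# Counts of correctly extracted data points, as reported in the abstract.
reported_correct = {
    "Llama 3.1 405B": 2619,
    "Llama 3 70B (English prompt)": 2537,
    "Llama 3 70B (German prompt)": 2471,
    "Llama 3 8B": 2314,
    "Mixtral 8x7B": 2411,
}

for name, correct in reported_correct.items():
    p, lo, hi = wald_ci(correct, TOTAL)
    print(f"{name}: {100 * p:.1f}% [95% CI: {100 * lo:.1f}%, {100 * hi:.1f}%]")
```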
Conclusion: Llama 3.1 405B was comparable to GPT-4 for data extraction from free-text reports on mechanical thrombectomy and may represent a data-secure alternative when operated locally.
About the journal:
Clinical Neuroradiology provides current information, original contributions, and reviews in the field of neuroradiology. An interdisciplinary approach is accomplished by diagnostic and therapeutic contributions related to associated subjects.
The journal's international coverage and relevance are underlined by its status as the official journal of the German, Swiss, and Austrian Societies of Neuroradiology.