A Controlled Study on Long Context Extension and Generalization in LLMs
Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
arXiv:2409.12181 (arXiv - CS - Computation and Language, 2024-09-18)

Abstract: Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.

You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL
Hideo Kobayashi, Wuwei Lan, Peng Shi, Shuaichen Chang, Jiang Guo, Henghui Zhu, Zhiguo Wang, Patrick Ng
arXiv:2409.12172 (arXiv - CS - Computation and Language, 2024-09-18)

Abstract: While significant progress has been made on the text-to-SQL task, recent solutions repeatedly encode the same database schema for every question, resulting in unnecessarily high inference costs and often overlooking crucial database knowledge. To address these issues, we propose You Only Read Once (YORO), a novel paradigm that directly internalizes database knowledge into the parametric knowledge of a text-to-SQL model during training and eliminates the need for schema encoding during inference. YORO significantly reduces input token length by 66%-98%. Despite its shorter inputs, our empirical results demonstrate YORO's competitive performance with traditional systems on three benchmarks, as well as its significant outperformance on large databases. Furthermore, YORO excels in handling questions with challenging value retrievals such as abbreviations.

Using Large Language Models to Generate Clinical Trial Tables and Figures
Yumeng Yang, Peter Krusche, Kristyn Pantoja, Cheng Shi, Ethan Ludmir, Kirk Roberts, Gen Zhu
arXiv:2409.12046 (arXiv - CS - Computation and Language, 2024-09-18)

Abstract: Tables, figures, and listings (TFLs) are essential tools for summarizing clinical trial data. Creating TFLs for reporting activities is a time-consuming task encountered routinely during the execution of clinical trials. This study explored the use of large language models (LLMs) to automate the generation of TFLs through prompt engineering and few-shot transfer learning. Using public clinical trial data in ADaM format, our results demonstrate that LLMs can efficiently generate TFLs from prompt instructions, showcasing their potential in this domain. Furthermore, we developed a conversational agent, the Clinical Trial TFL Generation Agent: an app that matches user queries to predefined prompts, which in turn produce customized programs that generate specific predefined TFLs.

{"title":"From Lists to Emojis: How Format Bias Affects Model Alignment","authors":"Xuanchang Zhang, Wei Xiong, Lichang Chen, Tianyi Zhou, Heng Huang, Tong Zhang","doi":"arxiv-2409.11704","DOIUrl":"https://doi.org/arxiv-2409.11704","url":null,"abstract":"In this paper, we study format biases in reinforcement learning from human\u0000feedback (RLHF). We observe that many widely-used preference models, including\u0000human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark,\u0000exhibit strong biases towards specific format patterns, such as lists, links,\u0000bold text, and emojis. Furthermore, large language models (LLMs) can exploit\u0000these biases to achieve higher rankings on popular benchmarks like AlpacaEval\u0000and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where\u0000current preference models favor longer responses that appear more\u0000comprehensive, even when their quality is equal to or lower than shorter,\u0000competing responses. However, format biases beyond verbosity remain largely\u0000underexplored in the literature. In this work, we extend the study of biases in\u0000preference learning beyond the commonly recognized length bias, offering a\u0000comprehensive analysis of a wider range of format biases. Additionally, we show\u0000that with a small amount of biased data (less than 1%), we can inject\u0000significant bias into the reward model. Moreover, these format biases can also\u0000be easily exploited by downstream alignment algorithms, such as best-of-n\u0000sampling and online iterative DPO, as it is usually easier to manipulate the\u0000format than to improve the quality of responses. Our findings emphasize the\u0000need to disentangle format and content both for designing alignment algorithms\u0000and evaluating models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Complex Formula Recognition with Hierarchical Detail-Focused Network
Jiale Wang, Junhui Yu, Huanyong Liu, Chenanran Kong
arXiv:2409.11677 (arXiv - CS - Computation and Language, 2024-09-18)

Abstract: Hierarchical and complex Mathematical Expression Recognition (MER) is challenging due to the multiple possible interpretations of a formula, which complicate both parsing and evaluation. In this paper, we introduce the Hierarchical Detail-Focused Recognition dataset (HDR), the first dataset specifically designed to address these issues. It consists of a large-scale training set, HDR-100M, offering unprecedented scale and diversity with one hundred million training instances, and a test set, HDR-Test, that includes multiple interpretations of complex hierarchical formulas for comprehensive model evaluation. Additionally, the parsing of complex formulas often suffers from errors in fine-grained details. To address this, we propose the Hierarchical Detail-Focused Recognition Network (HDNet), an innovative framework that incorporates a hierarchical sub-formula module focused on the precise handling of formula details, thereby significantly enhancing MER performance. Experimental results demonstrate that HDNet outperforms existing MER models across various datasets.

Linguini: A benchmark for language-agnostic linguistic reasoning
Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà
arXiv:2409.12126 (arXiv - CS - Computation and Language, 2024-09-18)

Abstract: We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped into 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models do not need prior knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.

DocMamba: Efficient Document Pre-training with State Space Model
Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Shuhang Liu, Jun Du, Jianshu Zhang
arXiv:2409.11887 (arXiv - CS - Computation and Language, 2024-09-18)

Abstract: In recent years, visually rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism's quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on HRDoc confirm DocMamba's potential for length extrapolation. The code will be available online.

{"title":"Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources","authors":"Issey Sukeda","doi":"arxiv-2409.11783","DOIUrl":"https://doi.org/arxiv-2409.11783","url":null,"abstract":"The recent success of large language models (LLMs) and the scaling law has\u0000led to a widespread adoption of larger models. Particularly in the healthcare\u0000industry, there is an increasing demand for locally operated LLMs due to\u0000security concerns. However, the majority of high quality open-source LLMs have\u0000a size of 70B parameters, imposing significant financial burdens on users for\u0000GPU preparation and operation. To overcome these issues, we present a medical\u0000adaptation based on the recent 7B models, which enables the operation in low\u0000computational resources. We compare the performance on medical\u0000question-answering benchmarks in two languages (Japanese and English),\u0000demonstrating that its scores reach parity with or surpass those of currently\u0000existing medical LLMs that are ten times larger. We find that fine-tuning an\u0000English-centric base model on Japanese medical dataset improves the score in\u0000both language, supporting the effect of cross-lingual knowledge transfer. We\u0000hope that this study will alleviate financial challenges, serving as a stepping\u0000stone for clinical institutions to practically utilize LLMs locally. Our\u0000evaluation code is available at\u0000https://huggingface.co/stardust-coder/jmedllm-7b-v1.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A Woman is More Culturally Knowledgeable than A Man?": The Effect of Personas on Cultural Norm Interpretation in LLMs
Mahammed Kamruzzaman, Hieu Nguyen, Nazmul Hassan, Gene Louis Kim
arXiv:2409.11636 (arXiv - CS - Computation and Language, 2024-09-18)

Abstract: As the deployment of large language models (LLMs) expands, there is an increasing demand for personalized LLMs. One method to personalize and guide the outputs of these models is to assign a persona -- a role that describes the expected behavior of the LLM (e.g., a man, a woman, an engineer). This study investigates whether an LLM's understanding of social norms varies across assigned personas. Ideally, the perception of a social norm should remain consistent regardless of the persona, since the acceptability of a social norm should be determined by the region the norm originates from rather than by individual characteristics such as gender, body size, or race; a norm is universal within its cultural context. In our research, we tested 36 distinct personas from 12 sociodemographic categories (e.g., age, gender, beauty) across four different LLMs. We find that LLMs' cultural norm interpretation varies depending on the persona used, and that norm interpretation also varies within a sociodemographic category (e.g., a fat person and a thin person within the physical-appearance group), where an LLM given the more socially desirable persona (e.g., a thin person) interprets social norms more accurately than with the less socially desirable persona (e.g., a fat person). We also discuss how different types of social biases may contribute to the results we observe.

{"title":"BERT-VBD: Vietnamese Multi-Document Summarization Framework","authors":"Tuan-Cuong Vuong, Trang Mai Xuan, Thien Van Luong","doi":"arxiv-2409.12134","DOIUrl":"https://doi.org/arxiv-2409.12134","url":null,"abstract":"In tackling the challenge of Multi-Document Summarization (MDS), numerous\u0000methods have been proposed, spanning both extractive and abstractive\u0000summarization techniques. However, each approach has its own limitations,\u0000making it less effective to rely solely on either one. An emerging and\u0000promising strategy involves a synergistic fusion of extractive and abstractive\u0000summarization methods. Despite the plethora of studies in this domain, research\u0000on the combined methodology remains scarce, particularly in the context of\u0000Vietnamese language processing. This paper presents a novel Vietnamese MDS\u0000framework leveraging a two-component pipeline architecture that integrates\u0000extractive and abstractive techniques. The first component employs an\u0000extractive approach to identify key sentences within each document. This is\u0000achieved by a modification of the pre-trained BERT network, which derives\u0000semantically meaningful phrase embeddings using siamese and triplet network\u0000structures. The second component utilizes the VBD-LLaMA2-7B-50b model for\u0000abstractive summarization, ultimately generating the final summary document.\u0000Our proposed framework demonstrates a positive performance, attaining ROUGE-2\u0000scores of 39.6% on the VN-MDS dataset and outperforming the state-of-the-art\u0000baselines.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}