{"id":675759,"date":"2025-06-18T08:34:11","date_gmt":"2025-06-18T05:34:11","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/how-llm-architecture-and-training-data-shape-ais-position-bias\/"},"modified":"2025-06-18T08:34:11","modified_gmt":"2025-06-18T05:34:11","slug":"how-llm-architecture-and-training-data-shape-ais-position-bias","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/how-llm-architecture-and-training-data-shape-ais-position-bias\/","title":{"rendered":"How LLM architecture and training data shape AI&#8217;s position bias"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_84 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a22becb5f979\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #dd3333;color:#dd3333\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #dd3333;color:#dd3333\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a22becb5f979\" checked aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 
ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/buradabiliyorum.com\/en\/how-llm-architecture-and-training-data-shape-ais-position-bias\/#Analyzing_attention\" >Analyzing attention<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/buradabiliyorum.com\/en\/how-llm-architecture-and-training-data-shape-ais-position-bias\/#Lost_in_the_middle\" >Lost in the middle<\/a><\/li><\/ul><\/nav><\/div>\n<div>\n<div class=\"article-gallery lightGallery\">\n<div data-thumb=\"https:\/\/scx1.b-cdn.net\/csz\/news\/tmb\/2025\/unpacking-the-bias-of.jpg\" data-src=\"https:\/\/scx2.b-cdn.net\/gfx\/news\/2025\/unpacking-the-bias-of.jpg\" data-sub-html=\"Three types of attention masks and their corresponding directed graphs G used in the analysis (self-loops are omitted for clarity). A directed edge from token j to i indicates that i attends to j. The center node(s) (Definition 3.1), highlighted in yellow, represent tokens that can be directly or indirectly attended to by all other tokens in the sequence. As depicted in the top row, the graph-theoretic formulation captures both direct and indirect contributions of tokens to the overall context, providing a comprehensive view of the token interactions under multi-layer attention. Credit: &lt;i&gt;arXiv&lt;\/i&gt; (2025). DOI: 10.48550\/arxiv.2502.01951\">\n<figure class=\"article-img\">\n            <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/scx1.b-cdn.net\/csz\/news\/800a\/2025\/unpacking-the-bias-of.jpg\" alt=\"Unpacking the bias of large language models\" title=\"Three types of attention masks and their corresponding directed graphs G used in the analysis (self-loops are omitted for clarity). A directed edge from token j to i indicates that i attends to j. The center node(s) (Definition 3.1), highlighted in yellow, represent tokens that can be directly or indirectly attended to by all other tokens in the sequence. 
As depicted in the top row, the graph-theoretic formulation captures both direct and indirect contributions of tokens to the overall context, providing a comprehensive view of the token interactions under multi-layer attention. Credit: arXiv (2025). DOI: 10.48550\/arxiv.2502.01951\" width=\"800\" height=\"530\"\/><figcaption class=\"text-darken text-low-up text-truncate-js text-truncate mt-3\">\n                Three types of attention masks and their corresponding directed graphs G used in the analysis (self-loops are omitted for clarity). A directed edge from token j to i indicates that i attends to j. The center node(s) (Definition 3.1), highlighted in yellow, represent tokens that can be directly or indirectly attended to by all other tokens in the sequence. As depicted in the top row, the graph-theoretic formulation captures both direct and indirect contributions of tokens to the overall context, providing a comprehensive view of the token interactions under multi-layer attention. Credit: <i>arXiv<\/i> (2025). DOI: 10.48550\/arxiv.2502.01951<br \/>\n            <\/figcaption><\/figure>\n<\/p><\/div>\n<\/div>\n<p>Research has shown that large language models (LLMs) tend to overemphasize information at the beginning and end of a document or conversation, while neglecting the middle.<\/p>\n<p>This &#8220;position bias&#8221; means that if a lawyer is using an LLM-powered virtual assistant to retrieve a certain phrase in a 30-page affidavit, the LLM is more likely to find the right text if it is on the initial or final pages.<\/p>\n<p>MIT researchers have discovered the mechanism behind this phenomenon.<\/p>\n<p>They created a theoretical framework to study how information flows through the machine-learning architecture that forms the backbone of LLMs. 
They found that certain design choices which control how the model processes input data can cause position bias.<\/p>\n<p>Their experiments revealed that model architectures, particularly those affecting how information is spread across input words within the model, can give rise to or intensify position bias, and that training data also contribute to the problem.<\/p>\n<p>The work is <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2502.01951\" target=\"_blank\">published<\/a> on the <i>arXiv<\/i> preprint server.<\/p>\n<p>In addition to pinpointing the origins of position bias, their framework can be used to diagnose and correct it in future model designs.<\/p>\n<p>This could lead to more reliable chatbots that stay on topic during long conversations, medical AI systems that reason more fairly when handling a trove of patient data, and code assistants that pay closer attention to all parts of a program.<\/p>\n<p>&#8220;These models are black boxes, so as an LLM user, you probably don&#8217;t know that position bias can cause your model to be inconsistent. You just feed it your documents in whatever order you want and expect it to work. 
But by understanding the underlying mechanism of these black-box models better, we can improve them by addressing these limitations,&#8221; says Xinyi Wu, a graduate student in the MIT Institute for Data, Systems, and Society (IDSS) and the Laboratory for Information and Decision Systems (LIDS), and first author of the paper.<\/p>\n<p>Her co-authors include Yifei Wang, an MIT postdoc; and senior authors Stefanie Jegelka, an associate professor of electrical engineering and computer science (EECS) and a member of IDSS and the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Ali Jadbabaie, professor and head of the Department of Civil and Environmental Engineering, a core faculty member of IDSS, and a principal investigator in LIDS. The research will be presented at the International Conference on Machine Learning.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Analyzing_attention\"><\/span>Analyzing attention<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>LLMs like Claude, Llama, and GPT-4 are powered by a type of neural network architecture known as a transformer. Transformers are designed to process sequential data, encoding a sentence into chunks called tokens and then learning the relationships between tokens to predict which words come next.<\/p>\n<p>These models have gotten very good at this because of the attention mechanism, which uses interconnected layers of data processing nodes to make sense of context by allowing tokens to selectively focus on or attend to related tokens.<\/p>\n<p>But if every token can attend to every other token in a 30-page document, that quickly becomes computationally intractable. 
So, when engineers build transformer models, they often employ attention-masking techniques that limit the words to which a token can attend. For instance, a causal mask allows each word to attend only to the words that came before it.<\/p>\n<p>Engineers also use positional encodings to help the model understand the location of each word in a sentence, improving performance.<\/p>\n<p>The MIT researchers built a graph-based theoretical framework to explore how these modeling choices, attention masks and positional encodings, could affect position bias.<\/p>\n<p>&#8220;Everything is coupled and tangled within the attention mechanism, so it is very hard to study. Graphs are a flexible language to describe the dependent relationship among words within the attention mechanism and trace them across multiple layers,&#8221; Wu says.<\/p>\n<p>Their theoretical analysis suggested that causal masking gives the model an inherent bias toward the beginning of an input, even when that bias doesn&#8217;t exist in the data.<\/p>\n<p>If the earlier words are relatively unimportant for a sentence&#8217;s meaning, causal masking can cause the transformer to pay more attention to its beginning anyway.<\/p>\n<p>&#8220;While it is often true that earlier words and later words in a sentence are more important, if an LLM is used on a task that is not natural language generation, like ranking or information retrieval, these biases can be extremely harmful,&#8221; Wu says.<\/p>\n<p>As a model grows, with additional layers of attention mechanism, this bias is amplified because earlier parts of the input are used more frequently in the model&#8217;s reasoning process.<\/p>\n<p>They also found that using positional encodings to link words more strongly to nearby words can mitigate position bias. The technique refocuses the model&#8217;s attention in the right place, but its effect can be diluted in models with more attention layers. 
These design choices are only one cause of position bias\u2014some can come from training data the model uses to learn how to prioritize words in a sequence.<\/p>\n<p>&#8220;If you know your data is biased in a certain way, then you should also finetune your model on top of adjusting your modeling choices,&#8221; Wu says.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Lost_in_the_middle\"><\/span>Lost in the middle<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>After they&#8217;d established a theoretical framework, the researchers performed experiments in which they systematically varied the position of the correct answer in text sequences for an information retrieval task.<\/p>\n<p>The experiments showed a &#8220;lost-in-the-middle&#8221; phenomenon, where retrieval accuracy followed a U-shaped pattern. Models performed best if the right answer was located at the beginning of the sequence. Performance declined the closer it got to the middle before rebounding a bit if the correct answer was near the end.<\/p>\n<p>Ultimately, their work suggests that using a different masking technique, removing extra layers from the attention mechanism, or strategically employing positional encodings could reduce position bias and improve a model&#8217;s accuracy.<\/p>\n<p>&#8220;By doing a combination of theory and experiments, we were able to look at the consequences of model design choices that weren&#8217;t clear at the time. 
If you want to use a model in high-stakes applications, you must know when it will work, when it won&#8217;t, and why,&#8221; Jadbabaie says.<\/p>\n<p>In the future, the researchers want to further explore the effects of positional encodings and study how position bias could be strategically exploited in certain applications.<\/p>\n<p>&#8220;These researchers offer a rare theoretical lens into the attention mechanism at the heart of the transformer model. They provide a compelling analysis that clarifies longstanding quirks in transformer behavior, showing that attention mechanisms, especially with causal masks, inherently bias models toward the beginning of sequences. The paper achieves the best of both worlds\u2014mathematical clarity paired with insights that reach into the guts of real-world systems,&#8221; says Amin Saberi, professor and director of the Stanford University Center for Computational Market Design, who was not involved with this work.<\/p>\n<div class=\"article-main__more p-4\">\n<p><strong>More information:<\/strong><br \/>\n\t\t\t\t\t\t\t\t\t\t\t\tXinyi Wu et al, On the Emergence of Position Bias in Transformers, <i>arXiv<\/i> (2025). 
<a rel=\"nofollow\" target=\"_blank\" data-doi=\"1\" href=\"https:\/\/dx.doi.org\/10.48550\/arxiv.2502.01951\" target=\"_blank\">DOI: 10.48550\/arxiv.2502.01951<\/a><\/p>\n<div class=\"mt-3\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<strong>Journal information:<\/strong><br \/>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<cite>arXiv<\/cite><br \/>\n                                                        <a rel=\"nofollow\" target=\"_blank\" class=\"icon_open\" href=\"http:\/\/arxiv.org\/\" target=\"_blank\" rel=\"nofollow\"><br \/>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<svg>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<use href=\"https:\/\/techx.b-cdn.net\/tmpl\/v2\/img\/svg\/sprite.svg#icon_open\" x=\"0\" y=\"0\"\/>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/svg><br \/>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/a>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n<\/p><\/div>\n<div class=\"d-inline-block text-medium my-4\">\n                                                Provided by<br \/>\n                                                                                                    Massachusetts Institute of <a href=\"https:\/\/buradabiliyorum.com\/en\/category\/technology\/\" data-internallinksmanager029f6b8e52c=\"4\" title=\"Technology\" target=\"_blank\" rel=\"noopener\">Technology<\/a><br \/>\n                                                    \t\t\t\t\t\t\t\t\t\t\t\t\t<a rel=\"nofollow\" target=\"_blank\" class=\"icon_open\" href=\"http:\/\/web.mit.edu\/\" target=\"_blank\" rel=\"nofollow\"><br \/>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<svg>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<use href=\"https:\/\/techx.b-cdn.net\/tmpl\/v2\/img\/svg\/sprite.svg#icon_open\" x=\"0\" y=\"0\"\/>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/svg><br \/>\n\t\t\t\t\t\t\t\t\t\t\t\t\t<\/a><\/p><\/div>\n<p class=\"article-main__note mt-4\">\n                                                <i>This story is republished courtesy of MIT <a href=\"https:\/\/buradabiliyorum.com\/en\/category\/news\/\" 
data-internallinksmanager029f6b8e52c=\"2\" title=\"News\" target=\"_blank\" rel=\"noopener\">News<\/a> (<a rel=\"nofollow\" target=\"_blank\" href=\"http:\/\/web.mit.edu\/newsoffice\/\" target=\"_blank\">web.mit.edu\/newsoffice\/<\/a>), a popular site that covers news about MIT research, innovation and teaching.<\/i>\n                                            <\/p>\n<p>                                        <!-- print only --><\/p>\n<div class=\"d-none d-print-block\">\n<p>\n                                                <strong>Citation<\/strong>:<br \/>\n                                                Lost in the middle: How LLM architecture and training data shape AI&#8217;s position bias (2025, June 17)<br \/>\n                                                retrieved 18 June 2025<br \/>\n                                                from https:\/\/techxplore.com\/news\/2025-06-lost-middle-llm-architecture-ai.html\n                                            <\/p>\n<p>\n                                            This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no<br \/>\n                                            part may be reproduced without the written permission. The content is provided for information purposes only.\n                                            <\/p>\n<\/p><\/div>\n<\/p><\/div>\n
<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/techxplore.com\/news\/2025-06-lost-middle-llm-architecture-ai.html\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Three types of attention masks and their corresponding directed graphs G used in the analysis (self-loops are omitted for clarity). A directed edge from token j to i indicates that i attends to j. 
The center node(s) (Definition 3.1), highlighted in yellow, represent tokens that can be directly or indirectly attended to by all other&#8230;<\/p>\n","protected":false},"author":1,"featured_media":675760,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/scx2.b-cdn.net\/gfx\/news\/2025\/unpacking-the-bias-of.jpg","fifu_image_alt":"","footnotes":""},"categories":[16],"tags":[],"class_list":["post-675759","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sciencee"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/675759","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=675759"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/675759\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/675760"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=675759"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=675759"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=675759"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}