{"id":685900,"date":"2025-08-19T10:40:38","date_gmt":"2025-08-19T07:40:38","guid":{"rendered":"https:\/\/buradabiliyorum.com\/en\/researchers-glimpse-the-inner-workings-of-protein-language-models\/"},"modified":"2025-08-19T10:40:38","modified_gmt":"2025-08-19T07:40:38","slug":"researchers-glimpse-the-inner-workings-of-protein-language-models","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/researchers-glimpse-the-inner-workings-of-protein-language-models\/","title":{"rendered":"Researchers glimpse the inner workings of protein language models"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a4143f2e935b\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #dd3333;color:#dd3333\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #dd3333;color:#dd3333\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a4143f2e935b\" checked aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li 
class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/buradabiliyorum.com\/en\/researchers-glimpse-the-inner-workings-of-protein-language-models\/#Opening_the_black_box\" >Opening the black box<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/buradabiliyorum.com\/en\/researchers-glimpse-the-inner-workings-of-protein-language-models\/#Interpretable_models\" >Interpretable models<\/a><\/li><\/ul><\/nav><\/div>\n<div>\n<div class=\"article-gallery lightGallery\">\n<div data-thumb=\"https:\/\/scx1.b-cdn.net\/csz\/news\/tmb\/2019\/10-dna.jpg\" data-src=\"https:\/\/scx2.b-cdn.net\/gfx\/news\/hires\/2019\/10-dna.jpg\" data-sub-html=\"Credit: CC0 Public Domain\">\n<figure class=\"article-img\">\n            <img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/scx1.b-cdn.net\/csz\/news\/800a\/2019\/10-dna.jpg\" alt=\"dna\" title=\"Credit: CC0 Public Domain\" width=\"800\" height=\"480\"\/><figcaption class=\"text-darken text-low-up text-truncate-js text-truncate mt-3\">\n                Credit: CC0 Public Domain<br \/>\n            <\/figcaption><\/figure>\n<\/p><\/div>\n<\/div>\n<p>Within the past few years, models that can predict the structure or function of proteins have been widely used for a variety of biological applications, such as identifying drug targets and designing new therapeutic antibodies.<\/p>\n<p>These models, which are based on large language models (LLMs), can make very accurate predictions of a protein&#8217;s suitability for a given application. 
However, until now there has been no way to determine how these models make their predictions or which protein features play the most important role in those decisions.<\/p>\n<p>In a new study, MIT researchers have used a novel technique to open up that &#8220;black box&#8221; and determine what features a protein language model takes into account when making predictions. Understanding what is happening inside that black box could help researchers to choose better models for a particular task, helping to streamline the process of identifying new drugs or vaccine targets.<\/p>\n<p>&#8220;Our work has broad implications for enhanced explainability in downstream tasks that rely on these representations,&#8221; says Bonnie Berger, the Simons Professor of Mathematics, head of the Computation and Biology group in MIT&#8217;s Computer Science and Artificial Intelligence Laboratory, and the senior author of the study. &#8220;Additionally, identifying features that protein language models track has the potential to reveal novel biological insights from these representations.&#8221;<\/p>\n<p>Onkar Gujral, an MIT graduate student, is the lead author of the study, which appears this week in the <i>Proceedings of the National Academy of Sciences<\/i>. Mihir Bafna, an MIT graduate student, and Eric Alm, an MIT professor of biological engineering, are also authors of the paper.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Opening_the_black_box\"><\/span>Opening the black box<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In 2018, Berger and former MIT graduate student Tristan Bepler, Ph.D., introduced the first protein language model. Their model, like subsequent protein models that accelerated the development of AlphaFold, such as ESM2 and OmegaFold, was based on LLMs. 
These models, which include ChatGPT, can analyze huge amounts of text and figure out which words are most likely to appear together.<\/p>\n<p>Protein language models use a similar approach, but instead of analyzing words, they analyze amino acid sequences. Researchers have used these models to predict the structure and function of proteins, and for applications such as identifying proteins that might bind to particular drugs.<\/p>\n<p>In a 2021 study, Berger and colleagues used a protein language model to predict which sections of viral surface proteins are less likely to mutate in a way that enables viral escape. This allowed them to identify possible targets for vaccines against influenza, HIV, and SARS-CoV-2.<\/p>\n<p>However, in all of these studies, it has been impossible to know how the models were making their predictions.<\/p>\n<p>&#8220;We would get out some prediction at the end, but we had absolutely no idea what was happening in the individual components of this black box,&#8221; Berger says.<\/p>\n<p>In the new study, the researchers wanted to dig into how protein language models make their predictions. Just like LLMs, protein language models encode information as representations that consist of a pattern of activation of different &#8220;nodes&#8221; within a neural network. 
These nodes are analogous to the networks of neurons that store memories and other information within the brain.<\/p>\n<p>The inner workings of LLMs are not easy to interpret, but within the past couple of years, researchers have begun using a type of algorithm known as a sparse autoencoder to help shed some light on how those models make their predictions. The new study from Berger&#8217;s lab is the first to use this algorithm on protein language models.<\/p>\n<p>Sparse autoencoders work by adjusting how a protein is represented within a neural network. Typically, a given protein will be represented by a pattern of activation of a constrained number of neurons, for example, 480. A sparse autoencoder will expand that representation into a much larger number of nodes, say 20,000.<\/p>\n<p>When information about a protein is encoded by only 480 neurons, each node lights up for multiple features, making it very difficult to know what features each node is encoding. However, when the neural network is expanded to 20,000 nodes, this extra space along with a sparsity constraint gives the information room to &#8220;spread out.&#8221; Now, a feature of the protein that was previously encoded by multiple nodes can occupy a single node.<\/p>\n<p>&#8220;In a sparse representation, the neurons lighting up are doing so in a more meaningful manner,&#8221; Gujral says. 
&#8220;Before the sparse representations are created, the networks pack information so tightly together that it&#8217;s hard to interpret the neurons.&#8221;<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Interpretable_models\"><\/span>Interpretable models<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Once the researchers obtained sparse representations of many proteins, they used an AI assistant called Claude (related to the popular Anthropic chatbot of the same name) to analyze the representations. In this case, they asked Claude to compare the sparse representations with the known features of each protein, such as molecular function, protein family, or location within a cell.<\/p>\n<p>By analyzing thousands of representations, Claude can determine which nodes correspond to specific protein features, then describe them in plain English. For example, the algorithm might say, &#8220;This neuron appears to be detecting proteins involved in transmembrane transport of ions or amino acids, particularly those located in the plasma membrane.&#8221;<\/p>\n<p>This process makes the nodes far more &#8220;interpretable,&#8221; meaning the researchers can tell what each node is encoding. They found that the features most likely to be encoded by these nodes were protein family and certain functions, including several different metabolic and biosynthetic processes.<\/p>\n<p>&#8220;When you train a sparse autoencoder, you aren&#8217;t training it to be interpretable, but it turns out that by incentivizing the representation to be really sparse, that ends up resulting in interpretability,&#8221; Gujral says.<\/p>\n<p>Understanding what features a particular protein model is encoding could help researchers choose the right model for a particular task, or tweak the type of input they give the model, to generate the best results. 
Additionally, analyzing the features that a model encodes could one day help biologists to learn more about the proteins that they are studying.<\/p>\n<p>&#8220;At some point when the models get a lot more powerful, you could learn more biology than you already know, from opening up the models,&#8221; Gujral says.<\/p>\n<div class=\"article-main__more p-4\">\n<p><strong>More information:<\/strong><br \/>\nBerger, Bonnie, Sparse autoencoders uncover biologically interpretable features in protein language model representations, <i>Proceedings of the National Academy of Sciences<\/i> (2025). <a rel=\"nofollow\" target=\"_blank\" data-doi=\"1\" href=\"https:\/\/dx.doi.org\/10.1073\/pnas.2506316122\" target=\"_blank\">DOI: 10.1073\/pnas.2506316122<\/a><\/p>\n<\/p><\/div>\n<div class=\"d-inline-block text-medium mt-4\">\n<p>Massachusetts Institute of Technology<\/p>\n<\/p><\/div>\n<p class=\"article-main__note mt-4\">\n<i>This story is republished courtesy of MIT <a href=\"https:\/\/buradabiliyorum.com\/en\/category\/news\/\" 
data-internallinksmanager029f6b8e52c=\"2\" title=\"News\" target=\"_blank\" rel=\"noopener\">News<\/a> (<a rel=\"nofollow\" target=\"_blank\" href=\"http:\/\/web.mit.edu\/newsoffice\/\" target=\"_blank\">web.mit.edu\/newsoffice\/<\/a>), a popular site that covers news about MIT research, innovation and teaching.<\/i>\n<\/p>\n<p><!-- print only --><\/p>\n<div class=\"d-none d-print-block\">\n<p>\n<strong>Citation<\/strong>:<br \/>\nResearchers glimpse the inner workings of protein language models (2025, August 19)<br \/>\nretrieved 19 August 2025<br \/>\nfrom https:\/\/phys.org\/news\/2025-08-glimpse-protein-language.html\n<\/p>\n<p>\nThis document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no<br \/>\npart may be reproduced without the written permission. The content is provided for information purposes only.\n<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/phys.org\/news\/2025-08-glimpse-protein-language.html\" target=\"_blank\" >Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Credit: CC0 Public Domain Within the past few years, models that can predict the structure or function of proteins have been widely used for a variety of biological applications, such as identifying drug targets and designing new therapeutic antibodies. 
These models, which are based on large language models (LLMs), can make very accurate predictions of&#8230;<\/p>\n","protected":false},"author":1,"featured_media":685901,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/scx2.b-cdn.net\/gfx\/news\/hires\/2019\/10-dna.jpg","fifu_image_alt":"","footnotes":""},"categories":[16],"tags":[],"class_list":["post-685900","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sciencee"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/685900","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=685900"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/685900\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/685901"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=685900"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=685900"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=685900"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}