{"id":97973,"date":"2020-10-26T21:00:22","date_gmt":"2020-10-26T18:00:22","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/how-knowledge-distillation-compresses-neural-networks\/"},"modified":"2020-10-26T21:00:22","modified_gmt":"2020-10-26T18:00:22","slug":"how-knowledge-distillation-compresses-neural-networks","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/how-knowledge-distillation-compresses-neural-networks\/","title":{"rendered":"How knowledge distillation compresses neural networks"},"content":{"rendered":"<div>\n                                <span style=\"font-weight: 400;\">If you\u2019ve ever used a neural network to solve a complex problem, you know they can be enormous in size, containing millions of parameters. 
For instance, the famous BERT model has about 110 million parameters.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To illustrate the point, the figure below shows the number of parameters for the most common architectures in natural language processing (NLP), as summarized in the recent <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.stateof.ai\/\"><span style=\"font-weight: 400;\">State of AI Report 2020<\/span><\/a><span style=\"font-weight: 400;\"> by <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.twitter.com\/nathanbenaich\"><span style=\"font-weight: 400;\">Nathan Benaich<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.twitter.com\/soundboy\"><span style=\"font-weight: 400;\">Ian Hogarth<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1325378 size-full lazy\" alt=\"\" width=\"683\" height=\"382\" sizes=\"auto, (max-width: 683px) 100vw, 683px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.41.34-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.41.34-PM.png 683w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.41.34-PM-280x157.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.41.34-PM-483x270.png 483w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.41.34-PM-241x135.png 241w\"\/><figcaption>The number of parameters in given architectures. Source: State of AI Report 2020 by Nathan Benaich and Ian Hogarth<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">In Kaggle competitions, the winning models are often ensembles composed of several predictors. 
Although they can beat simple models by a large margin in terms of accuracy, their enormous computational costs often make them impractical to use in production.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Is there any way to leverage these powerful but massive models to train state-of-the-art models without scaling up the hardware?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Currently, there are three main methods for compressing a neural network while preserving its predictive performance: weight pruning, quantization, and knowledge distillation.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this post, my goal is to introduce you to the fundamentals of <\/span><i><span style=\"font-weight: 400;\">knowledge distillation<\/span><\/i><span style=\"font-weight: 400;\">, an incredibly exciting idea built on training a smaller network to approximate the large one.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"So_what_is_knowledge_distillation\"><\/span><strong>So, what is knowledge distillation?<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Let\u2019s imagine a very complex task, such as image classification for thousands of classes. Often, you can\u2019t just slap on a ResNet50 and expect it to achieve 99% accuracy. So, you build an ensemble of models, balancing out the flaws of each one. 
Now you have a huge model that performs excellently, but there is no way to deploy it into production and get predictions in a reasonable time.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, the model generalizes pretty well to unseen data, so it is safe to trust its predictions. (I know, this might not be the case, but let\u2019s just roll with the thought experiment for now.)<\/span><\/p>\n<p><span style=\"font-weight: 400;\">What if we use the predictions from the large and <\/span><i><span style=\"font-weight: 400;\">cumbersome<\/span><\/i><span style=\"font-weight: 400;\"> model to train a smaller, so-called <\/span><i><span style=\"font-weight: 400;\">student<\/span><\/i><span style=\"font-weight: 400;\"> model to approximate the big one?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is knowledge distillation in essence, which was introduced in the paper <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1503.02531\"><span style=\"font-weight: 400;\">Distilling the Knowledge in a Neural Network<\/span><\/a><span style=\"font-weight: 400;\"> by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In broad strokes, the process is the following.<\/span><\/p>\n<ul>\n<li>\n<span style=\"font-weight: 400;\">Train a large model that performs and generalizes very well. This is called the <\/span><i><span style=\"font-weight: 400;\">teacher model<\/span><\/i><span style=\"font-weight: 400;\">.<\/span>\n<\/li>\n<li>\n<span style=\"font-weight: 400;\">Take all the data you have, and compute the predictions of the teacher model. 
The total dataset with these predictions is called the <\/span><i><span style=\"font-weight: 400;\">knowledge, <\/span><\/i><span style=\"font-weight: 400;\">and the predictions themselves are often referred to as <\/span><i><span style=\"font-weight: 400;\">soft targets<\/span><\/i><span style=\"font-weight: 400;\">. This is the <\/span><i><span style=\"font-weight: 400;\">knowledge distillation<\/span><\/i><span style=\"font-weight: 400;\"> step.<\/span>\n<\/li>\n<li>\n<span style=\"font-weight: 400;\">Use the previously obtained knowledge to train the smaller network, called the <\/span><i><span style=\"font-weight: 400;\">student model<\/span><\/i><span style=\"font-weight: 400;\">.<\/span>\n<\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">The following figure visualizes the process.<\/span><\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1325379 size-full lazy\" alt=\"\" width=\"675\" height=\"378\" sizes=\"auto, (max-width: 675px) 100vw, 675px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.42.19-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.42.19-PM.png 675w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.42.19-PM-280x157.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.42.19-PM-482x270.png 482w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.42.19-PM-241x135.png 241w\"\/><figcaption>Knowledge distillation (Image by the author)<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">Let\u2019s focus on the details a bit. How is the knowledge obtained?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In classifier models, the class probabilities are given by a <\/span><i><span style=\"font-weight: 400;\">softmax<\/span><\/i><span style=\"font-weight: 400;\"> layer, which converts the <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Logit\"><i><span style=\"font-weight: 400;\">logits<\/span><\/i><\/a><span style=\"font-weight: 400;\"> to probabilities:<\/span><\/p>\n<figure class=\"post-image post-mediaBleed alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1325380 lazy\" alt=\"\" width=\"336\" height=\"378\" sizes=\"auto, (max-width: 336px) 100vw, 336px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.43.47-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.43.47-PM.png 336w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.43.47-PM-187x210.png 187w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.43.47-PM-240x270.png 240w, 
https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.43.47-PM-120x135.png 120w\"\/><\/figure>\n<p><span style=\"font-weight: 400;\">The logits are produced by the last layer of the network. Instead of the standard softmax, a slightly modified version is used:<\/span><\/p>\n<figure class=\"post-image post-mediaBleed alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1325381 lazy\" alt=\"\" width=\"339\" height=\"146\" sizes=\"auto, (max-width: 339px) 100vw, 339px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.44.37-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.44.37-PM.png 339w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.44.37-PM-280x121.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.44.37-PM-270x116.png 270w\"\/><\/figure>\n<p><span style=\"font-weight: 400;\">Here, <\/span><i><span style=\"font-weight: 400;\">T<\/span><\/i><span style=\"font-weight: 400;\"> is a hyperparameter called <\/span><i><span style=\"font-weight: 400;\">temperature<\/span><\/i><span style=\"font-weight: 400;\">. The resulting probabilities are called <\/span><i><span style=\"font-weight: 400;\">soft targets<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If <\/span><i><span style=\"font-weight: 400;\">T<\/span><\/i><span style=\"font-weight: 400;\"> is large, the class probabilities are \u201csofter\u201d, that is, they will be closer to each other. 
In the extreme case, when <\/span><i><span style=\"font-weight: 400;\">T<\/span><\/i><span style=\"font-weight: 400;\"> approaches infinity,<\/span><\/p>\n<figure class=\"post-image post-mediaBleed alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1325382 lazy\" alt=\"\" width=\"380\" height=\"97\" sizes=\"auto, (max-width: 380px) 100vw, 380px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.44.55-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.44.55-PM.png 380w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.44.55-PM-280x71.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.44.55-PM-270x69.png 270w\"\/><\/figure>\n<p><span style=\"font-weight: 400;\">If <\/span><i><span style=\"font-weight: 400;\">T = 1<\/span><\/i><span style=\"font-weight: 400;\">, we obtain the softmax function. 
For our purposes, the temperature is set higher than 1, hence the name <\/span><i><span style=\"font-weight: 400;\">distillation<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Hinton, Vinyals, and Dean showed that a distilled model can perform nearly as well as an ensemble of 10 large models.<\/span><\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1325383 size-full lazy\" alt=\"\" width=\"721\" height=\"165\" sizes=\"auto, (max-width: 721px) 100vw, 721px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.09-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.09-PM.png 721w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.09-PM-280x64.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.09-PM-540x124.png 540w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.09-PM-270x62.png 270w\"\/><figcaption>Knowledge distillation results on a speech recognition problem from the paper Distilling the Knowledge in a Neural Network by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean<\/figcaption><\/figure>\n<h2><span class=\"ez-toc-section\" id=\"Why_not_train_a_small_network_from_the_start\"><\/span><strong>Why not train a small network from the start?<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">You might ask, why not train a smaller network from the start? Wouldn\u2019t it be easier? Sure, but it <\/span><i><span style=\"font-weight: 400;\">wouldn\u2019t necessarily work<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Empirical evidence suggests that more parameters result in better generalization and faster convergence. 
For instance, this was studied by <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1802.06509\"><span style=\"font-weight: 400;\">Sanjeev Arora, Nadav Cohen, and Elad Hazan in their paper On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1325384 size-full lazy\" alt=\"\" width=\"702\" height=\"264\" sizes=\"auto, (max-width: 702px) 100vw, 702px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.42-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.42-PM.png 702w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.42-PM-280x105.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.42-PM-540x203.png 540w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.45.42-PM-270x102.png 270w\"\/><figcaption>Left: single-layer network vs. linear networks with 4 and 8 layers. Right: overparametrized vs. baseline model for MNIST classification using the TensorFlow tutorial. Source: On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization by Sanjeev Arora, Nadav Cohen, and Elad Hazan<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">For complex problems, simple models have trouble learning to generalize well on the given training data. However, we have much more than the training data: the teacher model\u2019s predictions for all the available data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This benefits us in two ways.<\/span><\/p>\n<p>First, the teacher model\u2019s knowledge can teach the student model how to generalize via available predictions outside the training dataset. 
Recall that we use the teacher model\u2019s predictions for all available data to train the student model, instead of the original training dataset.<\/p>\n<p>Second, the soft targets provide more useful information than class labels: they indicate whether two classes are similar to each other. For instance, if the task is to classify dog breeds, information like <i>\u201cShiba Inu and Akita are very similar\u201d<\/i> is extremely valuable for model generalization.<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1325385 size-full lazy\" alt=\"\" width=\"749\" height=\"254\" sizes=\"auto, (max-width: 749px) 100vw, 749px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.11-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.11-PM.png 749w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.11-PM-280x95.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.11-PM-540x183.png 540w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.11-PM-270x92.png 270w\"\/><figcaption>Left: Akita dog. Right: Shiba Inu dog. Source: Wikipedia<\/figcaption><\/figure>\n<h2><span class=\"ez-toc-section\" id=\"The_difference_between_transfer_learning\"><\/span><strong>The difference from transfer learning<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">As noted by <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1503.02531.pdf\"><span style=\"font-weight: 400;\">Hinton et al.<\/span><\/a><span style=\"font-weight: 400;\">, one of the earliest attempts to compress models by transferring knowledge was to reuse some layers of a trained ensemble, as done by <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/dl.acm.org\/doi\/10.1145\/1150402.1150464\"><span style=\"font-weight: 400;\">Cristian Bucilu\u01ce, Rich Caruana, and Alexandru Niculescu-Mizil in their 2006 paper titled Model compression<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In the words of Hinton et al.,<\/span><\/p>\n<p><i><span style=\"font-weight: 400;\">\u201c\u2026we tend to identify the knowledge in a trained model with the learned parameter values and this makes it hard to see how we can change the form of the model but keep the same knowledge. 
A more abstract view of the knowledge, that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors.\u201d \u2014<\/span><\/i><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1503.02531.pdf\"><i><span style=\"font-weight: 400;\"> Distilling the Knowledge in a Neural Network<\/span><\/i><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Thus, unlike transfer learning, knowledge distillation doesn\u2019t use the learned weights directly.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Using_decision_trees\"><\/span><strong>Using decision trees<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">If you want to compress the model even further, you can try using even simpler models like decision trees. Although they are not as expressive as neural networks, their predictions can be explained by looking at the nodes individually.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This approach was studied by Nicholas Frosst and Geoffrey Hinton in their paper <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1711.09784\"><span style=\"font-weight: 400;\">Distilling a Neural Network Into a Soft Decision Tree<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1325386 size-full lazy\" alt=\"\" width=\"701\" height=\"538\" sizes=\"auto, (max-width: 701px) 100vw, 701px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.45-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.45-PM.png 701w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.45-PM-274x210.png 274w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.45-PM-352x270.png 352w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.46.45-PM-176x135.png 176w\"\/><figcaption>Source: Distilling a Neural Network Into a Soft Decision Tree<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">They showed that distilling indeed helped a little, although even simple neural networks outperformed the distilled trees. On the MNIST dataset, the distilled decision tree model achieved 96.76% test accuracy, which was an improvement from the baseline 94.34% model. However, a straightforward two-layer deep convolutional network still reached 99.21% accuracy. 
Thus, there is a trade-off between performance and explainability.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Distilling_BERT\"><\/span><strong>Distilling BERT<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">So far, we have only seen theoretical results rather than practical examples. To change this, let\u2019s consider one of the most popular and useful models in recent years: BERT.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Originally published in the paper <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\"><span style=\"font-weight: 400;\">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding<\/span><\/a><span style=\"font-weight: 400;\"> by Jacob Devlin et al. from Google, it soon became widely used for various NLP tasks like document retrieval or sentiment analysis. It was a real breakthrough, pushing the state of the art in several fields.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There is one issue, however. BERT contains about 110 million parameters and takes a lot of time to train. The authors reported that training BERT-base took 4 days on 4 Cloud TPUs in Pod configuration (16 TPU chips in total). 
Based on the <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/cloud.google.com\/tpu\/pricing#pod-pricing\"><span style=\"font-weight: 400;\">currently available TPU pod pricing per hour<\/span><\/a><span style=\"font-weight: 400;\">, training would cost around 10,000 USD, <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.technologyreview.com\/2019\/06\/06\/239031\/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes\/\"><span style=\"font-weight: 400;\">not to mention the environmental costs such as carbon emissions<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">One successful attempt to reduce the size and computational cost of BERT was made by <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/huggingface.co\/\"><span style=\"font-weight: 400;\">Hugging Face<\/span><\/a><span style=\"font-weight: 400;\">. 
They used knowledge distillation to train <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1910.01108\"><span style=\"font-weight: 400;\">DistilBERT<\/span><\/a><span style=\"font-weight: 400;\">, which is 60% of the original model\u2019s size while being 60% faster and keeping 97% of its language understanding capabilities.<\/span><\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1325387 size-full lazy\" alt=\"\" width=\"714\" height=\"432\" sizes=\"auto, (max-width: 714px) 100vw, 714px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.47.30-PM.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.47.30-PM.png 714w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.47.30-PM-280x169.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.47.30-PM-446x270.png 446w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/Screen-Shot-2020-10-26-at-12.47.30-PM-223x135.png 223w\"\/><figcaption>Performance of DistilBERT. 
Source: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">The smaller architecture requires much less time and far fewer computational resources to train: 90 hours on eight 16GB V100 GPUs.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you are interested in more details, you can read the original paper <\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1910.01108\"><span style=\"font-weight: 400;\">DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter<\/span><\/a><span style=\"font-weight: 400;\">, or the summarizing article written by one of the authors. It is a fantastic read, so I strongly recommend it.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span><strong>Conclusion<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Knowledge distillation is one of the three main methods to compress neural networks and make them suitable for less powerful hardware.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Unlike weight pruning and quantization, the other two powerful compression methods, knowledge distillation does not shrink the original network directly. 
Rather, it uses the original model to train a smaller one called the <\/span><i><span style=\"font-weight: 400;\">student model<\/span><\/i><span style=\"font-weight: 400;\">. Since the teacher model can provide its predictions even on unlabelled data, the student model can learn how to generalize like the teacher.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here, we have looked at two key results: the original paper, which introduced the idea, and a follow-up, showing that simple models such as decision trees can be used as student models.<\/span><\/p>\n<hr\/>\n<p class=\"c-post-pubDate\">\n                                    Published October 26, 2020 \u2014 18:00 UTC\n                                <\/p>\n<\/p><\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/thenextweb.com\/neural\/2020\/10\/26\/how-knowledge-distillation-compresses-neural-networks-syndication\/\" target=\"_blank\" rel=\"noopener 
noreferrer\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#How knowledge distillation compresses neural networks&#8221; If you\u2019ve ever used a neural network to solve a complex problem, you know they can be enormous in size, containing millions of parameters. For instance, the famous BERT model has about ~110 million. To illustrate the point, this is the number of parameters for the most common architectures&#8230;<\/p>\n","protected":false},"author":1,"featured_media":97974,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/img-cdn.tnwcdn.com\/image\/neural?filter_last=1&fit=1280,640&url=https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/image-7-5.png&signature=3ee4810115dd3661d98c3dfe16bcbb62","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-97973","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/97973","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=97973"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/97973\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/97974"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=97973"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/catego
ries?post=97973"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=97973"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}