{"id":91909,"date":"2020-10-18T12:00:25","date_gmt":"2020-10-18T09:00:25","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/why-training-neural-networks-comes-with-a-hefty-price-tag\/"},"modified":"2020-10-18T12:00:25","modified_gmt":"2020-10-18T09:00:25","slug":"why-training-neural-networks-comes-with-a-hefty-price-tag","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/why-training-neural-networks-comes-with-a-hefty-price-tag\/","title":{"rendered":"#Why training neural networks comes with a hefty price tag"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a301fdabd29b\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #dd3333;color:#dd3333\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #dd3333;color:#dd3333\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a301fdabd29b\" checked aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a 
class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/buradabiliyorum.com\/en\/why-training-neural-networks-comes-with-a-hefty-price-tag\/#Pruning_deep_neural_networks_after_training\" >Pruning deep neural networks after training<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/buradabiliyorum.com\/en\/why-training-neural-networks-comes-with-a-hefty-price-tag\/#Pruning_neural_networks_early\" >Pruning neural networks early<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/buradabiliyorum.com\/en\/why-training-neural-networks-comes-with-a-hefty-price-tag\/#How_does_early_neural_network_pruning_perform\" >How does early neural network pruning perform?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/buradabiliyorum.com\/en\/why-training-neural-networks-comes-with-a-hefty-price-tag\/#Investigating_early_pruning_methods\" >Investigating early pruning methods<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/buradabiliyorum.com\/en\/why-training-neural-networks-comes-with-a-hefty-price-tag\/#Future_directions_for_research\" >Future directions for research<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/buradabiliyorum.com\/en\/why-training-neural-networks-comes-with-a-hefty-price-tag\/#Making_deep_learning_research_more_accessible\" >Making deep learning research more accessible<\/a><\/li><\/ul><\/nav><\/div>\n<p>&#8220;<strong>#Why training neural networks comes with a hefty price tag<\/strong>&#8221;<\/p>\n<div>\n                                In recent years, deep learning has proven to be an effective solution to many of the hard problems of<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" 
href=\"https:\/\/bdtechtalks.com\/2020\/04\/09\/what-is-narrow-artificial-intelligence-ani\/\">artificial intelligence<\/a>. But deep learning is also becoming increasingly expensive. Running deep neural networks requires a lot of compute resources, and training them requires even more.<\/p>\n<p>The costs of deep learning are creating several challenges for the artificial intelligence community, including a<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.pcmag.com\/news\/ai-could-save-the-world-if-it-doesnt-ruin-the-environment-first\">large carbon footprint<\/a><span>\u00a0<\/span>and the<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/08\/26\/deepmind-mustafa-suleyman-commercial-ai\/\">commercialization of AI research<\/a>. And with more demand for AI capabilities away from cloud servers and on \u201c<a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/03\/06\/artificial-intelligence-edge-ai\/\">edge devices<\/a>,\u201d there\u2019s a growing need for neural networks that are cost-effective.<\/p>\n<p>While AI researchers have made progress in reducing the costs of running\u00a0<a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/02\/15\/what-is-deep-learning-neural-networks\/\">deep learning models<\/a>, the larger problem of reducing the costs of training deep neural networks remains unsolved.<\/p>\n<p>Recent work by AI researchers at MIT Computer Science and Artificial Intelligence Lab (MIT CSAIL), University of Toronto Vector Institute, and Element AI explores the progress made in the field. 
In a paper titled, \u201c<a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2009.08576\">Pruning Neural Networks at Initialization: Why are We Missing the Mark?<\/a>,\u201d the researchers discuss why current state-of-the-art methods fail to reduce the costs of neural network training without having a considerable impact on the networks\u2019 performance. They also suggest directions for future research.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Pruning_deep_neural_networks_after_training\"><\/span>Pruning deep neural networks after training<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The past decade has shown that in general,<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/11\/25\/ai-research-neural-networks-compute-costs\/\">large neural networks provide better results<\/a>. But large deep learning models come at an enormous cost. For instance, to train OpenAI\u2019s<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2020\/09\/21\/gpt-3-economy-business-model\/\">GPT-3<\/a>, which has 175 billion parameters, you\u2019ll need access to huge server clusters with very strong graphics cards, and the costs can soar to several million dollars. Furthermore, you need hundreds of gigabytes worth of VRAM and a strong server to run the model.<\/p>\n<p>There\u2019s a body of work that shows neural networks can be \u201cpruned.\u201d This means that given a very large neural network, there\u2019s a much smaller subset of its parameters that can nearly match the accuracy of the original AI model. 
For instance, earlier this year, a pair of AI researchers showed that while a large deep learning model could learn to predict future steps in<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2020\/09\/16\/deep-learning-game-of-life\/\">John Conway\u2019s Game of Life<\/a>, there almost always exists a much smaller neural network that can be trained to perform the same task with perfect accuracy.<\/p>\n<p>Post-training pruning has already seen much progress. After a deep learning model goes through the entire training process, you can throw away many of its parameters, sometimes shrinking it to 10 percent of its original size. You do this by scoring the parameters based on the impact their weights have on the final output of the network and removing the lowest-scoring ones.<\/p>\n<p>Many tech companies are already using this method to<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/05\/13\/google-assistant-on-device-machine-learning\/\">compress their AI models<\/a><span>\u00a0<\/span>and fit them on smartphones, laptops, and smart-home devices. Aside from slashing inference costs, this provides many benefits, such as obviating the need to send user data to cloud servers and enabling real-time inference. 
In many areas, small neural networks make it possible to employ deep learning on devices that are powered by solar batteries or button cells.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Pruning_neural_networks_early\"><\/span>Pruning neural networks early<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<figure class=\"wp-block-image size-large\">\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"jetpack-lazy-image jetpack-lazy-image--handled wp-image-6889 lazy\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" alt=\"gradient descent deep learning\" width=\"696\" height=\"392\" data-attachment-id=\"6889\" data-permalink=\"https:\/\/bdtechtalks.com\/2020\/03\/23\/yann-lecun-self-supervised-learning\/gradient-descent-deep-learning\/\" data-orig-file=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?fit=3840%2C2160&amp;ssl=1\" data-orig-size=\"3840,2160\" data-comments-opened=\"1\" data-image-meta=\"{\" aperture=\"\" data-image-title=\"gradient descent deep learning\" data-image-description=\"\" data-medium-file=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?fit=300%2C169&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?fit=696%2C392&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\" src=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=696%2C392&amp;ssl=1\" data-lazy=\"true\" srcset=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=1024%2C576&amp;ssl=1 1024w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=300%2C169&amp;ssl=1 300w, 
https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=768%2C432&amp;ssl=1 768w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=1536%2C864&amp;ssl=1 1536w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=2048%2C1152&amp;ssl=1 2048w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=696%2C392&amp;ssl=1 696w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=1068%2C601&amp;ssl=1 1068w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=747%2C420&amp;ssl=1 747w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?resize=1920%2C1080&amp;ssl=1 1920w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/03\/gradient-descent-deep-learning.jpg?w=1392&amp;ssl=1 1392w\"\/><figcaption>Image credit: Depositphotos<\/figcaption><\/figure>\n<\/p>\n<\/figure>\n<p>The problem with pruning of neural networks after training is that it doesn\u2019t cut 
the costs of tuning all the excess parameters. Even if you can compress a trained neural network into a fraction of its original size, you\u2019ll still need to pay the full costs of training it.<\/p>\n<p>The question is, can you find the optimal subnetwork without training the full neural network?<\/p>\n<p>In 2018, Jonathan Frankle and Michael Carbin, two AI researchers at MIT CSAIL and co-authors of the new paper, published a paper titled, \u201c<a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1803.03635\">The Lottery Ticket Hypothesis<\/a>,\u201d which showed that for many deep learning models, there exist small subnetworks that can be trained to full accuracy.<br \/>\n<iframe loading=\"lazy\" title=\"J. Frankle &amp; M. Carbin: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks\" width=\"640\" height=\"360\" src=\"https:\/\/www.youtube.com\/embed\/s7DqRZVvRiQ?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><br \/>\nFinding those subnetworks can considerably reduce the time and cost to train deep learning models. 
The publication of the Lottery Ticket Hypothesis led to research on methods to prune neural networks at initialization or early in training.<\/p>\n<p>In their new paper, the AI researchers examine some of the better known early pruning methods: Single-shot Network Pruning (<a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1810.02340\">SNIP<\/a>), presented at ICLR 2019; Gradient Signal Preservation (<a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2002.07376\">GraSP<\/a>), presented at ICLR 2020, and Iterative Synaptic Flow Pruning (<a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2006.05467\">SynFlow<\/a>).<\/p>\n<p>\u201cSNIP aims to prune weights that are least salient for the loss. GraSP aims to prune weights that harm or have the smallest benefit for gradient flow. SynFlow iteratively prunes weights, aiming to avoid<span>\u00a0<\/span><em>layer collapse<\/em>, where pruning concentrates on certain layers of the network and degrades performance prematurely,\u201d the authors write.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"How_does_early_neural_network_pruning_perform\"><\/span>How does early neural network pruning perform?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<figure class=\"wp-block-image size-large\">\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"jetpack-lazy-image jetpack-lazy-image--handled wp-image-8501 lazy\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" alt=\"Techniques for pruning neural networks\" width=\"696\" height=\"332\" data-attachment-id=\"8501\" data-permalink=\"https:\/\/bdtechtalks.com\/2020\/10\/12\/deep-learning-neural-network-pruning\/techniques-for-pruning-neural-networks\/\" data-orig-file=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?fit=2328%2C1110&amp;ssl=1\" 
data-orig-size=\"2328,1110\" data-comments-opened=\"1\" data-image-meta=\"{\" aperture=\"\" data-image-title=\"Techniques for pruning neural networks\" data-image-description=\"\" data-medium-file=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?fit=300%2C143&amp;ssl=1\" data-large-file=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?fit=696%2C332&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\" src=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=696%2C332&amp;ssl=1\" data-lazy=\"true\" srcset=\"https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=1024%2C488&amp;ssl=1 1024w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=300%2C143&amp;ssl=1 300w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=768%2C366&amp;ssl=1 768w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=1536%2C732&amp;ssl=1 1536w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=2048%2C976&amp;ssl=1 2048w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=696%2C332&amp;ssl=1 696w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=1068%2C509&amp;ssl=1 1068w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=881%2C420&amp;ssl=1 881w, 
https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?resize=1920%2C915&amp;ssl=1 1920w, https:\/\/i2.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/Techniques-for-pruning-neural-networks.jpg?w=1392&amp;ssl=1 1392w\"\/><figcaption>Several new techniques enable the pruning of deep neural networks during the initialization phase. While they perform better than random pruning, they still fall short of the post-training benchmarks.<\/figcaption><\/figure>\n<\/p>\n<\/figure>\n<p>In their work, the AI researchers compared the performance of the early pruning methods against two baselines: magnitude pruning after training and lottery-ticket rewinding (LTR). 
Magnitude pruning is the standard method that removes excess parameters after the neural network is fully trained. Lottery-ticket rewinding uses the technique Frankle and Carbin developed in their earlier work to retrain the optimal subnetwork. As mentioned earlier, these methods show that well-performing subnetworks exist, but they only find them after the full network is trained. The early pruning methods, by contrast, are supposed to find the minimal networks at the initialization phase, before training the neural network.<\/p>\n<p>The researchers also compared the early pruning methods against two simple techniques. One of them randomly removes weights from the neural network. Checking against random performance is important to validate whether a method is providing significant results or not. \u201cRandom pruning is a naive method for early pruning whose performance any new proposal should surpass,\u201d the AI researchers write.<\/p>\n<p>The other technique scores parameters by the absolute values of their weights and removes the smallest ones. \u201cMagnitude pruning is a standard way to prune for inference and is an additional naive point of comparison for early pruning,\u201d the authors write.<\/p>\n<p>The experiments were performed on VGG-16 and three variations of ResNet, two popular\u00a0<a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2020\/01\/06\/convolutional-neural-networks-cnn-convnets\/\">convolutional neural networks<\/a><span>\u00a0<\/span>(CNNs).<\/p>\n<p>No single method stands out among the early pruning techniques the AI researchers evaluated, and performance varies with the chosen neural network architecture and the percentage of weights pruned. 
But their findings show that these state-of-the-art methods outperform crude random pruning by a considerable margin in most cases.<\/p>\n<p>None of the methods, however, match the accuracy of the benchmark post-training pruning.<\/p>\n<p>\u201cOverall, the methods make some progress, generally outperforming random pruning. However, this progress remains far short of magnitude pruning after training in terms of both overall accuracy and the sparsities at which it is possible to match full accuracy,\u201d the authors write.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Investigating_early_pruning_methods\"><\/span>Investigating early pruning methods<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<figure class=\"wp-block-image size-large\">\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"jetpack-lazy-image jetpack-lazy-image--handled wp-image-8504 lazy\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" alt=\"testing early pruning methods\" width=\"696\" height=\"396\" data-attachment-id=\"8504\" data-permalink=\"https:\/\/bdtechtalks.com\/2020\/10\/12\/deep-learning-neural-network-pruning\/testing-early-pruning-methods\/\" data-orig-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?fit=2598%2C1476&amp;ssl=1\" data-orig-size=\"2598,1476\" data-comments-opened=\"1\" data-image-meta=\"{\" aperture=\"\" data-image-title=\"testing early pruning methods\" data-image-description=\"\" data-medium-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?fit=300%2C170&amp;ssl=1\" data-large-file=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?fit=696%2C396&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\" src=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=696%2C396&amp;ssl=1\" 
data-lazy=\"true\" srcset=\"https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=1024%2C582&amp;ssl=1 1024w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=300%2C170&amp;ssl=1 300w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=768%2C436&amp;ssl=1 768w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=1536%2C873&amp;ssl=1 1536w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=2048%2C1164&amp;ssl=1 2048w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=696%2C395&amp;ssl=1 696w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=1068%2C607&amp;ssl=1 1068w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=739%2C420&amp;ssl=1 739w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?resize=1920%2C1091&amp;ssl=1 1920w, https:\/\/i1.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2020\/10\/testing-early-pruning-methods.jpg?w=1392&amp;ssl=1 1392w\"\/><figcaption><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/thenextweb.com\/neural\/2020\/10\/18\/why-training-neural-networks-comes-with-a-hefty-price-tag-syndication\/#\" data-url=\"https:\/\/twitter.com\/intent\/tweet?url=https%3A%2F%2Fthenextweb.com%2Fneural%2F2020%2F10%2F18%2Fwhy-training-neural-networks-comes-with-a-hefty-price-tag-syndication%2F&amp;via=thenextweb&amp;related=thenextweb&amp;text=Check out this picture on: Tests on early pruning methods showed that they were robust against random shuffling and reinitiliazation, which suggests they are not finding 
specific weights to prune in the target neural network.\" data-title=\"Share Tests on early pruning methods showed that they were robust against random shuffling and reinitialization, which suggests they are not finding specific weights to prune in the target neural network. on Twitter\" data-width=\"685\" data-height=\"500\" class=\"post-image-share popitup\" title=\"Share Tests on early pruning methods showed that they were robust against random shuffling and reinitialization, which suggests they are not finding specific weights to prune in the target neural network. on Twitter\"><i class=\"icon icon--inline icon--twitter--dark\"\/><\/a>Tests on early pruning methods showed that they were robust against random shuffling and reinitialization, which suggests they are not finding specific weights to prune in the target neural network.<\/figcaption><\/figure>\n<\/p>\n<\/figure>\n<p>To investigate why the pruning methods underperform, the AI researchers carried out several tests. First, they tested \u201crandom shuffling.\u201d For each method, they randomly shuffled which parameters were removed within each layer of the neural network, keeping the number removed from each layer unchanged, to see if it had an impact on the performance. If the pruning methods really select parameters based on their relevance and impact, as they are designed to, then random shuffling should severely degrade the performance.<\/p>\n<p>Surprisingly, the researchers found that random shuffling did not have a severe impact on the outcome. Instead, what really determined the result was the number of weights removed from each layer.<\/p>\n<p>\u201cAll methods maintain accuracy or improve when randomly shuffled. In other words, the useful information these techniques extract is not which individual weights to remove, but rather the layerwise proportions in which to prune the network,\u201d the authors write, adding that while layer-wise pruning proportions are important, they\u2019re not enough. 
The evidence is that post-training pruning methods reach full accuracy by choosing specific weights, and randomly shuffling those choices causes a sudden drop in the accuracy of the pruned network.<\/p>\n<p>Next, the researchers checked whether reinitializing the network would change the performance of the pruning methods. Before training, all parameters in a neural network are initialized with random values from a chosen distribution. Previous work, including by Frankle and Carbin, as well as the Game of Life research mentioned earlier in this article, shows that these initial values often have considerable impact on the final outcome of the training. In fact, the term \u201clottery ticket\u201d was coined based on the fact that there are lucky initial values that enable a small neural network to reach high accuracy in training.<\/p>\n<p>Therefore, if a pruning method chooses parameters based on their initial values, changing those values should severely impact the performance of the pruned network. Again, the tests didn\u2019t show significant changes.<\/p>\n<p>\u201cAll early pruning techniques are robust to reinitialization: accuracy is the same whether the network is trained with the original initialization or a newly sampled initialization. As with random shuffling, this insensitivity to initialization may reflect a limitation in the information that these methods use for pruning that restricts performance,\u201d the AI researchers write.<\/p>\n<p>Finally, they tried inverting the pruned weights. This means that for each method, they kept the weights marked as removable and instead removed the ones that were supposed to remain. This final test would check the effectiveness of the scoring method used to select the pruned weights. 
Two of the methods, SNIP and SynFlow, showed extreme sensitivity to the inversion and their accuracy declined, which is a good thing. But GraSP\u2019s performance did not degrade after inverting the pruned weights, and in some cases, it even performed better.<\/p>\n<p>The key takeaway from these tests is that current early pruning methods fail to detect the specific connections that define the optimal subnetwork in a deep learning model.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Future_directions_for_research\"><\/span>Future directions for research<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-5312 jetpack-lazy-image jetpack-lazy-image--handled\" style=\"box-sizing: border-box; border: 0px; max-width: 100%; height: auto; margin-bottom: 0px; display: block;\" src=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=696%2C464&amp;ssl=1\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" srcset=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=1024%2C683&amp;ssl=1 1024w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=300%2C200&amp;ssl=1 300w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=768%2C512&amp;ssl=1 768w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=696%2C464&amp;ssl=1 696w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=1068%2C712&amp;ssl=1 1068w, 
https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=630%2C420&amp;ssl=1 630w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?resize=1920%2C1280&amp;ssl=1 1920w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?w=1392&amp;ssl=1 1392w, https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?w=2088&amp;ssl=1 2088w\" alt=\"neural networks deep learning stochastic gradient descent\" width=\"696\" height=\"464\" data-attachment-id=\"5312\" data-permalink=\"https:\/\/bdtechtalks.com\/2019\/08\/20\/ai-adversarial-examples-hierarchical-random-switching\/neural-networks-deep-learning-stochastic-gradient-descent\/\" data-orig-file=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?fit=3600%2C2400&amp;ssl=1\" data-orig-size=\"3600,2400\" data-comments-opened=\"1\" data-image-meta=\"{\" aperture=\"\" data-image-title=\"neural networks deep learning stochastic gradient descent\" data-image-description=\"&lt;p&gt;neural networks deep learning stochastic gradient descent&lt;\/p&gt; \" data-medium-file=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?fit=300%2C200&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/bdtechtalks.com\/wp-content\/uploads\/2019\/08\/neural-networks-deep-learning-stochastic-gradient-descent.jpg?fit=696%2C464&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\"\/><\/figure>\n<p>Another solution is to perform pruning in early training instead of initialization. In this case, the neural network is trained for a specific number of epochs before being pruned. 
The benefit is that instead of choosing among randomly initialized weights, you\u2019ll be pruning a network that has partially converged. Tests run by the AI researchers showed that the performance of most pruning methods improved as the target network went through more training iterations, but it remained below the baseline benchmarks.<\/p>\n<p>The tradeoff of pruning in early training is that you\u2019ll have to spend resources on those initial epochs, even though the costs are much smaller than those of full training, and you\u2019ll have to strike the right balance between performance gains and training costs.<\/p>\n<p>In their paper, the AI researchers suggest future targets for research on pruning neural networks. One direction is to improve current methods or develop new ones that find specific weights to prune rather than only the proportion of weights to prune in each layer. A second area is to find better methods for early-training pruning. And finally, magnitudes and gradients may not be the best signals for early pruning. \u201cAre there different signals we should use early in training? Should we expect signals that work early in training to work late in training (or vice versa)?\u201d the authors write.<\/p>\n<p>Some of the claims made in the paper are contested by the creators of the pruning methods. 
\u201cWhile we\u2019re truly excited about our work (SNIP) attracting lots of interests these days and being addressed in the suggested paper by Jonathan et al., we\u2019ve found some of the claims in the paper a bit troublesome,\u201d Namhoon Lee, AI researcher at the University of Oxford and co-author of the SNIP paper, told<span>\u00a0<\/span><em>TechTalks<\/em>.<\/p>\n<p>Contrary to the findings of the paper, Lee said that random shuffling<span>\u00a0<\/span><em>will<\/em><span>\u00a0<\/span>affect the results, and potentially by a lot, when tested on fully-connected networks as opposed to convolutional neural networks.<\/p>\n<p>Lee also questioned the validity of comparing early-pruning methods to post-training magnitude pruning. \u201cMagnitude based pruning undergoes<span>\u00a0<\/span><em>training steps<\/em><span>\u00a0<\/span>before it starts the pruning process, whereas pruning-at-initialization methods do not (by definition),\u201d Lee said. \u201cThis indicates that they are not standing at the same start line\u2014the former is far ahead of others\u2014and therefore, this could intrinsically and unfairly favor the former. In fact, the saliency of magnitude is not likely a driving force that yields good performance for magnitude based pruning; it\u2019s rather the algorithm (e.g., how long it trains first, how much it prunes, etc.) that is well-tuned.\u201d<\/p>\n<p>Lee added that if magnitude-based pruning starts at the same stage as with pruning-at-initialization methods, it will be the same as random pruning because the initial weights of neural networks are random values.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Making_deep_learning_research_more_accessible\"><\/span>Making deep learning research more accessible<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>It would be interesting to see how research in this area unfolds. 
I\u2019m also curious to see how these and future methods would perform on other neural network architectures such as Transformers, which are far more computationally expensive to train than CNNs. Also worth noting is that these methods have been developed for and tested on<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2020\/02\/10\/unsupervised-learning-vs-supervised-learning\/\">supervised learning problems<\/a>. Hopefully, we\u2019ll see research on similar techniques for more costly branches of AI such as deep<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/05\/28\/what-is-reinforcement-learning\/\">reinforcement learning<\/a>.<\/p>\n<p>Progress in this field could have a huge impact on the future of AI research and applications. 
With the costs of training deep neural networks constantly growing, some areas of research are becoming increasingly centralized in wealthy tech companies that have vast financial and computational resources.<\/p>\n<p>Effective ways to prune neural networks before training them could create new opportunities for a wider group of AI researchers and labs that don\u2019t have access to very large computational resources.<\/p>\n<hr\/>\n<p><i><span style=\"font-weight: 400;\">This article was originally published by Ben Dickson on <\/span><\/i><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/\"><i><span style=\"font-weight: 400;\">TechTalks<\/span><\/i><\/a><i><span style=\"font-weight: 400;\">, a publication that examines trends in <a href=\"https:\/\/buradabiliyorum.com\/en\/category\/technology\/\" data-internallinksmanager029f6b8e52c=\"4\" title=\"Technology\" target=\"_blank\" rel=\"noopener\">technology<\/a>, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech and what we need to look out for. 
You can read the original article <a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2020\/10\/12\/deep-learning-neural-network-pruning\/\">here<\/a>.\u00a0<\/span><\/i><\/p>\n<p class=\"c-post-pubDate\">\n                                    Published October 18, 2020 \u2014 09:00 UTC<\/p><\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/thenextweb.com\/neural\/2020\/10\/18\/why-training-neural-networks-comes-with-a-hefty-price-tag-syndication\/\" target=\"_blank\" rel=\"noopener noreferrer\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#Why training neural networks comes with a hefty price tag&#8221; In recent years, deep learning has proven to be an effective solution to many of the hard problems of\u00a0artificial intelligence. But deep learning is also becoming increasingly expensive. Running deep neural networks requires a lot of compute resources, training them even more. 
The costs of&#8230;<\/p>\n","protected":false},"author":1,"featured_media":91910,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/img-cdn.tnwcdn.com\/image\/neural?filter_last=1&fit=1280,640&url=https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/10\/image-4-4.png&signature=6e57ffe0c70e62dba3c1c78dd2d786a0","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-91909","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/91909","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=91909"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/91909\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/91910"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=91909"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=91909"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=91909"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}