{"id":411414,"date":"2022-03-02T19:11:45","date_gmt":"2022-03-02T16:11:45","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/how-rewards-teach-reinforcement-learning-agents-to-behave\/"},"modified":"2022-03-02T19:11:45","modified_gmt":"2022-03-02T16:11:45","slug":"how-rewards-teach-reinforcement-learning-agents-to-behave","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/how-rewards-teach-reinforcement-learning-agents-to-behave\/","title":{"rendered":"#How rewards teach reinforcement learning agents to behave"},"content":{"rendered":"<p>&#8220;<strong>#How rewards teach reinforcement learning agents to behave<\/strong>&#8221;<br \/>\n<img decoding=\"async\" src=\"https:\/\/img-cdn.tnwcdn.com\/image?fit=796%2C417&amp;url=https%3A%2F%2Fcdn0.tnwcdn.com%2Fwp-content%2Fblogs.dir%2F1%2Ffiles%2F2022%2F03%2FUntitled-design-1.jpg&amp;signature=df350bf547f81701a1cdd9631b069e20\" \/><\/p>\n<div>\n                            <span style=\"font-weight: 400;\">In June 2021, scientists at the AI lab DeepMind made a controversial claim. The researchers suggested that we could reach artificial general intelligence (AGI) using one single approach: <\/span><span style=\"font-weight: 400;\">reinforcement learning<\/span><span style=\"font-weight: 400;\">. 
They titled their paper on the subject: \u201c<\/span><a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/deepmind.com\/research\/publications\/2021\/Reward-is-Enough\"><span style=\"font-weight: 400;\">Reward is Enough<\/span><\/a><span style=\"font-weight: 400;\">.\u201d<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The team argued that AGI could emerge through an incentive mechanism known as a reward function.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">\u201cWe hypothesize that intelligence, and its associated abilities, can be understood as subserving the maximization of reward,\u201d the study authors wrote.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Their claims have been dismissed by some scientists, but they nonetheless shine a spotlight on a powerful technique.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"What_is_reinforcement_learning\"><\/span><b>What is reinforcement learning?<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In reinforcement learning (RL), a software agent learns through trial and error. When it takes a desired action, the model receives a reward.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Over time, the agent works out how to execute the task to optimize its reward.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The technique can be applied to a vast array of tasks, from <\/span><a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/www.zdnet.com\/article\/uc-berkeley-robot-navigation-could-chart-a-new-course-for-self-driving-systems\/\"><span style=\"font-weight: 400;\">controlling autonomous vehicles<\/span><\/a><span style=\"font-weight: 400;\"> to <\/span><a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/sustainability.google\/progress\/projects\/machine-learning\/\"><span style=\"font-weight: 400;\">improving energy efficiency<\/span><\/a><span style=\"font-weight: 400;\">. 
But its most celebrated achievements have come in the world of games.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In March 2016, the technique had a landmark moment.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A DeepMind system called AlphaGo became the first computer program to defeat a world champion in Go, a famously complex board game.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The victory <\/span><a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/books.google.co.uk\/books?id=Z1FfDwAAQBAJ&amp;pg=PT98&amp;lpg=PT98&amp;dq=200+million+people+watched+alphago&amp;source=bl&amp;ots=If6iGPZcLY&amp;sig=ACfU3U0v0_xt6NDhKqlvREz5-njZwFs3wA&amp;hl=en&amp;sa=X&amp;ved=2ahUKEwjNzP6PkqX2AhWJEMAKHQLhB1kQ6AF6BAgsEAM#v=onepage&amp;q=200%20million%20people%20watched%20alphago&amp;f=false\"><span style=\"font-weight: 400;\">was reportedly watched<\/span><\/a><span style=\"font-weight: 400;\"> by over 200 million people.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">During the match, the AI played 
unconventional moves that baffled its opponent.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">\u201cThe final version of AlphaGo does not use any rules,\u201d said Demis Hassabis, DeepMind co-founder and CEO.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">\u201cInstead, it learns the game from scratch by playing against different versions of itself thousands of times, incrementally learning through a process of trial and error, known as reinforcement learning. This means it is free to learn the game for itself, unconstrained by orthodox thinking.\u201d<\/span><\/p>\n<p><span style=\"font-weight: 400;\">These constraints were replaced by reward maximization.<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"How_a_reward_function_works\"><\/span><b>How a reward function works<\/b><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Rewards are common learning incentives for animals. A squirrel, for instance, develops intellectual abilities in its search for nuts. A child, meanwhile, may get a chocolate for tidying their room \u2014 or a spank for bad behavior. (<\/span><i><span style=\"font-weight: 400;\">Don\u2019t worry, I don\u2019t have kids<\/span><\/i><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In AI systems, the rewards and punishments are calculated mathematically. A self-driving system could receive a -1 when the model hits a wall, and a +1 if it safely passes another car. 
These signals allow the agent to evaluate its performance.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The algorithm then learns through trial and error to maximize the reward \u2014 and ultimately, complete the task in the most desirable manner.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">\u201cBecause it\u2019s learning from interaction in an incremental way, it feels very much like what biological intelligence systems do,\u201d Doina Precup, who leads DeepMind\u2019s Montreal office, <\/span><span style=\"font-weight: 400;\">told TNW<\/span><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Precup\u2019s colleagues are now developing multi-purpose RL agents.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In 2020, DeepMind unveiled MuZero, a program that figures out the rules of a game it\u2019s never seen before. Eventually, the lab believes such agents could solve multiple problems in the real world.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There are still major challenges to overcome. RL agents struggle to maximize rewards in complex environments and to assess the long-term repercussions of their actions. Nonetheless, the reward-is-enough proponents believe the algorithms\u2019 adaptability could pave a path to AGI.<\/span>\n                        <\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/thenextweb.com\/news\/how-rewards-work-in-reinforcement-learning-deepmind\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#How rewards teach reinforcement learning agents to behave&#8221; In June 2021, scientists at the AI lab DeepMind made a controversial claim. The researchers suggested that we could reach artificial general intelligence (AGI) using one single approach: reinforcement learning. 
They titled their paper on the subject: \u201cReward is Enough.\u201d The team argued that AGI could emerge&#8230;<\/p>\n","protected":false},"author":1,"featured_media":411415,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/img-cdn.tnwcdn.com\/image\/neural?filter_last=1&fit=1280,640&url=https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2022\/03\/Untitled-design-1.jpg&signature=840670943ed2cc84f53f7265c443186a","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-411414","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/411414","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=411414"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/411414\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/411415"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=411414"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=411414"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=411414"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}