{"id":335682,"date":"2021-09-06T17:34:54","date_gmt":"2021-09-06T14:34:54","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/debunking-the-mysteries-of-deep-reinforcement-learning\/"},"modified":"2021-09-06T17:34:54","modified_gmt":"2021-09-06T14:34:54","slug":"debunking-the-mysteries-of-deep-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/debunking-the-mysteries-of-deep-reinforcement-learning\/","title":{"rendered":"#Debunking the mysteries of deep reinforcement learning"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a3391016df23\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #dd3333;color:#dd3333\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #dd3333;color:#dd3333\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a3391016df23\" checked aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a 
class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/buradabiliyorum.com\/en\/debunking-the-mysteries-of-deep-reinforcement-learning\/#States_rewards_and_actions\" >States, rewards, and actions<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/buradabiliyorum.com\/en\/debunking-the-mysteries-of-deep-reinforcement-learning\/#Reinforcement_learning_applications\" >Reinforcement learning applications<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/buradabiliyorum.com\/en\/debunking-the-mysteries-of-deep-reinforcement-learning\/#Reinforcement_learning_functions\" >Reinforcement learning functions<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/buradabiliyorum.com\/en\/debunking-the-mysteries-of-deep-reinforcement-learning\/#Why_deep_reinforcement_learning\" >Why deep reinforcement learning?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/buradabiliyorum.com\/en\/debunking-the-mysteries-of-deep-reinforcement-learning\/#Deep_reinforcement_learning_and_general_AI\" >Deep reinforcement learning and general AI<\/a><\/li><\/ul><\/nav><\/div>\n<p>&#8220;<strong>#Debunking the mysteries of deep reinforcement learning<\/strong>&#8221;<\/p>\n<div>Deep reinforcement learning is one of the most interesting branches of artificial intelligence. 
It is behind some of the most remarkable achievements of the AI community, including <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2018\/07\/02\/ai-plays-chess-go-poker-video-games\/\">beating human champions at board and video games<\/a>, self-driving cars, robotics, and <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2021\/06\/14\/google-reinforcement-learning-ai-chip-design\/\">AI hardware design<\/a>.<\/p>\n<p>Deep reinforcement learning leverages the learning capacity of deep neural networks to tackle problems that were too complex for classic RL techniques. <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2021\/01\/28\/deep-learning-explainer\/\">Deep reinforcement learning<\/a> is much more complicated than the other branches of machine learning. But in this post, I\u2019ll try to demystify it without going into the technical details.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"States_rewards_and_actions\"><\/span>States, rewards, and actions<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>At the heart of every <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/05\/28\/what-is-reinforcement-learning\/\">reinforcement learning<\/a> problem are an agent and an environment. The environment provides information about the state of the system. The agent observes these states and interacts with the environment by taking actions. Actions can be discrete (e.g., flipping a switch) or continuous (e.g., turning a knob). These actions cause the environment to transition to a new state. 
Depending on whether the new state brings the agent closer to its goal, the agent receives a reward (the reward can also be zero, or negative if the action moves the agent away from its goal).<\/p>\n<figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366159 js-lazy\" alt=\"Reinforcement-learning\" width=\"696\" height=\"392\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning.jpeg\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-280x158.jpeg 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-479x270.jpeg 479w\"\/><noscript><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366159\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning.jpeg\" alt=\"Reinforcement-learning\" width=\"696\" height=\"392\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-280x158.jpeg 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-479x270.jpeg 479w\"\/><\/noscript><\/figure>\n<p>Each state-action-reward cycle is called a step. The reinforcement learning system keeps iterating through steps until it reaches the desired state or the maximum number of steps is reached. 
This series of steps is called an episode. At the beginning of each episode, the environment is set to an initial state and the agent\u2019s reward is reset to zero.<\/p>\n<p>The goal of reinforcement learning is to train the agent to take actions that maximize its rewards. The agent\u2019s decision-making function is called a policy. An agent usually requires many episodes to learn a good policy. For simpler problems, a few hundred episodes might be enough for the agent to learn a decent policy. For more complex problems, the agent might need <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/04\/17\/openai-five-neural-networks-dota-2\/\">millions of episodes of training<\/a>.<\/p>\n<p>Reinforcement learning systems also have subtler nuances. For example, an RL environment can be deterministic or non-deterministic. In deterministic environments, running a sequence of state-action pairs multiple times always yields the same result. 
In contrast, in non-deterministic RL problems, the state of the environment can change from things other than the agent\u2019s actions (e.g., the passage of time, weather, other agents in the environment).<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Reinforcement_learning_applications\"><\/span>Reinforcement learning applications<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366160 js-lazy\" alt=\"deep-reinforcement-learning-applications\" width=\"696\" height=\"392\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications.jpeg\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications-280x158.jpeg 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications-479x270.jpeg 479w\"\/><noscript><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366160\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications.jpeg\" alt=\"deep-reinforcement-learning-applications\" width=\"696\" height=\"392\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications-280x158.jpeg 280w, 
https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-reinforcement-learning-applications-479x270.jpeg 479w\"\/><\/noscript><\/figure>\n<p>To better understand the components of reinforcement learning, let\u2019s consider a few examples.<\/p>\n<p><strong>Chess:<\/strong> Here, the environment is the chessboard and the state of the environment is the location of chess pieces on the board. The RL agent can be one of the players (alternatively, both players can be RL agents separately training in the same environment). Each game of chess is an episode. The episode starts at an initial state, with the black and white pieces lined up on opposite edges of the board. At each step, the agent observes the board (the state) and moves one of its pieces (takes an action), which transitions the environment to a new state. The agent receives a reward for reaching the checkmate state and zero reward otherwise. One of the key challenges of chess is that the agent doesn\u2019t receive any rewards before it checkmates the opponent, which makes it hard to learn.<\/p>\n<p><strong>Atari Breakout:<\/strong> Breakout is a game where the player controls a paddle. There\u2019s a ball moving across the screen. Every time it hits the paddle, it bounces toward the top of the screen, where rows of bricks have been arrayed. Every time the ball hits a brick, the brick gets destroyed and the ball bounces back. In Breakout, the environment is the game screen. The state is the location of the paddle and the bricks, and the location and velocity of the ball. The actions that the agent can take are move left, move right, or not move at all. 
The agent receives a positive reward every time the ball hits a brick and a negative reward if the ball moves past the paddle and reaches the bottom of the screen.<\/p>\n<p><strong>Self-driving cars:<\/strong> In <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2018\/09\/17\/self-driving-cars-ai-computer-vision\/\">autonomous driving<\/a>, the agent is the car, and the environment is the world that the car is navigating. The RL agent observes the state of the environment through cameras, lidars, and other sensors. The agent can take navigation actions such as accelerate, hit the brake, turn left or right, or do nothing. The RL agent is rewarded for staying on the road, avoiding collisions, conforming to driving regulations, and staying on course.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Reinforcement_learning_functions\"><\/span>Reinforcement learning functions<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366161 js-lazy\" alt=\"maze-reinforcement-learning\" width=\"696\" height=\"392\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning.jpeg\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning-280x158.jpeg 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning-479x270.jpeg 479w\"\/><noscript><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366161\" 
src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning.jpeg\" alt=\"maze-reinforcement-learning\" width=\"696\" height=\"392\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning-280x158.jpeg 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/maze-reinforcement-learning-479x270.jpeg 479w\"\/><\/noscript><\/figure>\n<p>Basically, the goal of reinforcement learning is to map states to actions in a way that maximizes rewards. But what exactly does the RL agent learn?<\/p>\n<p>There are three categories of learning algorithms for RL systems:<\/p>\n<p><strong>Policy-based algorithms:<\/strong> This is the most general type of optimization. A policy maps states to actions. An RL agent that learns a policy can create a trajectory of actions that lead from the current state to the objective.<\/p>\n<p>For example, consider an agent that is optimizing a policy to navigate through a maze and reach the exit. First, it starts by making random moves, for which it receives no rewards. In one of the episodes, it finally reaches the exit and receives the exit reward. It retraces its trajectory and readjusts the reward of each state-action pair based on how close it got the agent to the final goal. In the next episode, the RL agent has a better understanding of which actions to take given each state. It gradually adjusts the policy until it converges to an optimal solution.<\/p>\n<p>REINFORCE is a popular policy-based algorithm. 
The advantage of policy-based algorithms is that they can be applied to all kinds of reinforcement learning problems. The tradeoff is that they are sample-inefficient and require a lot of training before converging on optimal solutions.<\/p>\n<p><strong>Value-based algorithms:<\/strong> Value-based functions learn to evaluate the value of states and actions. Value-based functions help the RL agent evaluate the possible future return on the current state and actions.<\/p>\n<p>There are two kinds of value functions: Q functions and V functions. Q functions estimate the expected return on state-action pairs. V functions only estimate the value of states. Q functions are more common because it is easier to transform state-action pairs into an RL policy.<\/p>\n<p>Two popular value-based algorithms are SARSA and DQN. Value-based algorithms are more sample-efficient than policy-based RL. Their limitation is that they are only applicable to discrete action spaces (unless you make some changes to them).<\/p>\n<p><strong>Model-based algorithms:<\/strong> Model-based algorithms take a different approach to reinforcement learning. Instead of evaluating the value of states and actions, they try to predict the state of the environment given the current state and action. Model-based reinforcement learning allows the agent to simulate different trajectories before taking any action.<\/p>\n<p>Model-based approaches provide the agent with foresight and reduce the need for manually gathering data. 
This can be very advantageous in applications where gathering training data and experience is expensive and slow (e.g., robotics and self-driving cars).<\/p>\n<p>But the key challenge of model-based reinforcement learning is that creating a realistic model of the environment<span>\u00a0<\/span><a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2021\/06\/17\/evolution-rewards-artificial-intelligence\/\">can be very difficult<\/a>. Non-deterministic environments, such as the real world, are very hard to model. In some cases, developers manage to create simulations<span>\u00a0<\/span><a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2021\/04\/26\/reinforcement-learning-embodied-ai\/\">that approximate the real environment<\/a>. But even learning models of these simulated environments ends up being very difficult.<\/p>\n<p>Nonetheless, model-based algorithms have become popular in deterministic problems such as chess and Go. Monte Carlo Tree Search (MCTS) is a popular model-based method that can be applied to deterministic environments.<\/p>\n<p><strong>Combined methods<\/strong>: To overcome the shortcomings of each category of reinforcement learning algorithms, scientists have developed algorithms that combine elements of different types of learning functions. For example, Actor-Critic algorithms combine the strengths of policy-based and value-based functions. 
These algorithms use feedback from a value function (the critic) to steer the policy learner (the actor) in the right direction, which results in a more sample-efficient system.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Why_deep_reinforcement_learning\"><\/span>Why deep reinforcement learning?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366163 js-lazy\" alt=\"deep-neural-network-AI\" width=\"696\" height=\"392\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI.jpeg\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI-280x158.jpeg 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI-479x270.jpeg 479w\"\/><noscript><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366163\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI.jpeg\" alt=\"deep-neural-network-AI\" width=\"696\" height=\"392\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI-280x158.jpeg 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/deep-neural-network-AI-479x270.jpeg 479w\"\/><\/noscript><\/figure>\n<p>Until now, we\u2019ve said nothing about deep neural networks. 
In fact, you can implement all the above-mentioned algorithms in any way you want. For example, <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Q-learning\">Q-learning<\/a>, a classic type of reinforcement learning algorithm, creates a table that stores an estimated value for each state-action pair as the agent interacts with the environment. Such methods work fine when you\u2019re dealing with a very simple environment where the number of states and actions is very small.<\/p>\n<p>But when you\u2019re dealing with a complex environment, where the combined number of actions and states can reach huge numbers, or where the environment is non-deterministic and can have virtually limitless states, evaluating every possible state-action pair becomes impossible.<\/p>\n<p>In these cases, you\u2019ll need an approximation function that can learn optimal policies based on limited data. And this is what <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/08\/05\/what-is-artificial-neural-network-ann\/\">artificial neural networks<\/a> do. Given the right architecture and optimization function, a deep neural network can learn an optimal policy without going through all the possible states of a system. Deep reinforcement learning agents still need huge amounts of data (e.g., thousands of hours of gameplay in <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2019\/04\/17\/openai-five-neural-networks-dota-2\/\">Dota<\/a> and StarCraft), but they can tackle problems that were impossible to solve with classic reinforcement learning systems.<\/p>\n<p>For example, a deep RL model can use <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2020\/01\/06\/convolutional-neural-networks-cnn-convnets\/\">convolutional neural networks<\/a> to extract state information from visual data such as camera feeds and video game graphics. 
And <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2020\/06\/08\/what-is-recurrent-neural-network-rnn\/\">recurrent neural networks<\/a> can extract useful information from sequences of frames, such as where a ball is headed or if a car is parked or moving. This complex learning capacity can help RL agents to understand more complex environments and map their states to actions.<\/p>\n<p>Deep reinforcement learning is comparable to <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2020\/02\/10\/unsupervised-learning-vs-supervised-learning\/\">supervised machine learning<\/a>. The model generates actions, and based on the feedback from the environment, it adjusts its parameters. However, deep reinforcement learning also has a few unique challenges that make it different from traditional supervised learning.<\/p>\n<p>Unlike supervised learning problems, where the model has a set of labeled data, the RL agent only has access to the outcome of its own experiences. It might be able to learn an optimal policy based on the experiences it gathers across different training episodes. But it might also miss many other optimal trajectories that could have led to better policies. Reinforcement learning also needs to evaluate trajectories of state-action pairs, which is much harder to learn than supervised learning problems where every training example is paired with its expected outcome.<\/p>\n<p>This added complexity increases the data requirements of deep reinforcement learning models. But unlike supervised learning, where training data can be curated and prepared in advance, deep reinforcement learning models gather their data during training. 
In some types of RL algorithms, the data gathered in an episode must be discarded afterward and can\u2019t be used to further speed up the model tuning process in future episodes.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Deep_reinforcement_learning_and_general_AI\"><\/span>Deep reinforcement learning and general AI<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><figure class=\"post-image post-mediaBleed aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366166 js-lazy\" alt=\"Reinforcement-learning-artificial-intelligence\" width=\"696\" height=\"392\" sizes=\"auto, (max-width: 696px) 100vw, 696px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence.jpeg\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence-280x158.jpeg 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence-479x270.jpeg 479w\"\/><noscript><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-1366166\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence.jpeg\" alt=\"Reinforcement-learning-artificial-intelligence\" width=\"696\" height=\"392\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence.jpeg 696w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence-280x158.jpeg 280w, 
https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence-240x135.jpeg 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/Reinforcement-learning-artificial-intelligence-479x270.jpeg 479w\"\/><\/noscript><\/figure>\n<p>The AI community is divided on how far you can push deep reinforcement learning. Some scientists believe that with the right RL architecture, you can tackle any kind of problem, including artificial general intelligence. Reinforcement learning is the same algorithm that gave rise to natural intelligence, these scientists believe, and given enough time and energy and the right rewards, we can recreate human-level intelligence.<\/p>\n<p>Others think that reinforcement learning doesn\u2019t address some of the most fundamental problems of artificial intelligence. Despite all their benefits, deep reinforcement learning agents need problems to be well-defined and can\u2019t discover new problems and solutions by themselves, this second group believes.<\/p>\n<p>In any case, what can\u2019t be denied is that deep reinforcement learning has helped solve some very complicated challenges and will remain an important field of interest and research for the AI community for the time being.<\/p>\n<p><i><span>This article was originally published by Ben Dickson on\u00a0<\/span><\/i><a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/\"><i><span>TechTalks<\/span><\/i><\/a><i><span>, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech, and what we need to look out for. 
You can read the original article <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/bdtechtalks.com\/2021\/09\/02\/deep-reinforcement-learning-explainer\/\">here<\/a>.<\/span><\/i><\/p>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/thenextweb.com\/news\/mysteries-deep-reinforcement-learning-syndication\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#Debunking the mysteries of deep reinforcement learning&#8221; Deep reinforcement learning is one of the most interesting branches of artificial intelligence. It is behind some of the most remarkable achievements of the AI community, including beating human champions at board and video games, self-driving cars, robotics, and AI hardware design. 
Deep reinforcement learning leverages the learning capacity&#8230;<\/p>\n","protected":false},"author":1,"featured_media":335683,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/img-cdn.tnwcdn.com\/image\/neural?filter_last=1&fit=1280,640&url=https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2021\/09\/BDHed2.jpg&signature=d3e4a632a33ad107bb0d3274e74a856d","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-335682","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/335682","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=335682"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/335682\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/335683"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=335682"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=335682"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=335682"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}