{"id":299713,"date":"2021-07-15T15:00:40","date_gmt":"2021-07-15T12:00:40","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/how-chaos-engineering-helps-you-avoid-unplanned-downtime-cloudsavvy-it\/"},"modified":"2021-07-15T15:00:40","modified_gmt":"2021-07-15T12:00:40","slug":"how-chaos-engineering-helps-you-avoid-unplanned-downtime-cloudsavvy-it","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/how-chaos-engineering-helps-you-avoid-unplanned-downtime-cloudsavvy-it\/","title":{"rendered":"#How \u201cChaos Engineering\u201d Helps You Avoid Unplanned Downtime \u2013 CloudSavvy IT"},"content":{"rendered":"<div id=\"article-content-area\">\n<figure style=\"width: 7360px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"type:primaryImage wp-image-12670 size-full\" data-pagespeed-lazy-srcset=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2021\/07\/09c0a66a.jpg?width=398&amp;trim=1,1&amp;bg-color=000&amp;pad=1,1 400w, 
https:\/\/www.cloudsavvyit.com\/p\/uploads\/2021\/07\/09c0a66a.jpg?width=1198&amp;trim=1,1&amp;bg-color=000&amp;pad=1,1 1200w\" sizes=\"auto, 400w, 1200w\" data-pagespeed-lazy-src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2021\/07\/09c0a66a.jpg?width=398&amp;trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"network switch panel\" width=\"7360\" height=\"4180\" src=\"https:\/\/www.shutterstock.com\/image-photo\/network-panel-switch-cable-data-center-1172940130\" data-credittext=\"asharkyu\/Shutterstock.com\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><figcaption class=\"wp-caption-text\"><span class=\"type:primaryImage imagecredit\"><a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/www.shutterstock.com\/image-photo\/network-panel-switch-cable-data-center-1172940130\">asharkyu\/Shutterstock.com<\/a><\/span><\/figcaption><\/figure>\n<p>Chaos engineering is an approach to software fault tolerance testing that intentionally provokes errors in live deployments. It incorporates an element of randomness to mimic the unpredictability of most real-world outages.<\/p>\n<p>The idea of adding chaos to a system is generally credited to Netflix. In 2011, the company <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/netflixtechblog.com\/the-netflix-simian-army-16e57fbab116\">published Chaos Monkey<\/a>, a tool that it built to disable parts of its production infrastructure. 
By inducing random failures in monitored environments, Netflix found that it could discover hidden problems that went unnoticed during regular tests.<\/p>\n<p>Chaos engineering provides an immune response effect. It\u2019s similar to how we vaccinate healthy people. You purposefully introduce a threat, potentially causing brief but observable problems, in order to develop stronger long-term resistance.<\/p>\n<h2 id=\"building-resilience\"><span class=\"ez-toc-section\" id=\"Building_Resilience\"><\/span>Building Resilience<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>It\u2019s safe to assume that any sufficiently large system contains bugs that you don\u2019t know about. Despite all your automated tests and day-to-day real-world usage, you can\u2019t catch everything. Some issues only surface in very specific scenarios, such as loss of connectivity to a third-party service.<\/p>\n<p>Chaos engineering accepts that unforeseen operating issues will always be a fact of life, even in supposedly watertight production environments. Whereas many organizations end up taking a \u201cwait and see\u201d approach, playing whack-a-mole as real reports come in, chaos engineering works on the principle that a brief outage that you invoke is always better than one that the customer sees first.<\/p>\n<p>Breaking things on purpose gives you a way of determining your system\u2019s overall resilience. What happens if the database goes down? How about an outage at your third-party email-sending service? Chaos engineering\u2019s greatest strength is its ability to reproduce events that unit tests and real-world use alone won\u2019t usually cover.<\/p>\n<p>Chaos testing tools are often run against real deployments to eliminate discrepancies between dev and production environments. 
You don\u2019t need to apply this much risk, though: As long as you\u2019re confident that you can accurately replicate your infrastructure, you could use the technique against a sandboxed staging environment.<\/p>\n<h2 id=\"adding-chaos-to-your-systems\"><span class=\"ez-toc-section\" id=\"Adding_Chaos_to_Your_Systems\"><\/span>Adding Chaos to Your Systems<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>You have multiple options if you\u2019d like to add some chaos to your infrastructure. <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/github.com\/dastergon\/awesome-chaos-engineering#notable-tools\">Automated tools<\/a> built for this purpose provide a starting point but can be tricky to incorporate into your own infrastructure. You normally need to integrate with VM or container management platforms so that the tool can interact with your own instances.<\/p>\n<p>In the case of Chaos Monkey, you need to be using <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/spinnaker.io\">Spinnaker<\/a>, Netflix\u2019s continuous delivery platform. While it has broad compatibility with popular public cloud providers, it\u2019s also another dependency that you\u2019re adding to your stack.<\/p>\n<p>If you\u2019re using Kubernetes, <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/github.com\/asobti\/kube-monkey\">kube-monkey<\/a> takes the original Netflix principles and packages them for use in your cluster. It works on an opt-in basis, so Kubernetes resources with the <code>kube-monkey\/enabled<\/code> label will be eligible for random termination.<\/p>\n<p><a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/github.com\/alexei-led\/pumba\">Pumba<\/a> provides similar capabilities for regular Docker containers. 
It can provoke container crashes, stress resource allowances such as CPU and memory, and cause network failures.<\/p>\n<p>A tool that specifically targets networking errors is Shopify\u2019s <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/github.com\/Shopify\/toxiproxy\">Toxiproxy<\/a>. This provides a TCP proxy that simulates a wide range of network conditions. You can route your application\u2019s traffic through Toxiproxy to see how the system performs with severe latency or reduced bandwidth.<\/p>\n<p>For advanced control, <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/vmware.github.io\/mangle\/#what-is-mangle\">VMware\u2019s Mangle<\/a> is a \u201cchaos engineering orchestrator\u201d that targets several different deployment mechanisms. It works with Kubernetes, Docker, VMware vCenter, and generic SSH connections. Mangle lets you define custom faults for application and infrastructure components. Application faults should affect a single service. Infrastructure faults target shared components that could take down multiple services.<\/p>\n<p>While chaos engineering is most commonly associated with backend development and DevOps, there\u2019s growing interest among frontend engineers, too. <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/vmware-1.gitbook.io\/mangle\/sre-developers-and-users\/injecting-faults\">React Chaos<\/a> is a library that will throw random errors from React components, letting you identify flaky UI sections that could crash your whole app.<\/p>\n<h2 id=\"designing-your-own-chaos-experiments\"><span class=\"ez-toc-section\" id=\"Designing_Your_Own_Chaos_Experiments\"><\/span>Designing Your Own Chaos Experiments<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>If none of the open-source chaos tools fits your stack, design your own experiments instead. Make a list of the assumptions within your application\u2019s environment. 
Identify the connections between services and think about what would happen if one dropped out.<\/p>\n<p>Each of those predictions is a hypothesis that you then need to test. Break the system and observe the consequences. Next, determine whether the effect was acceptable. Did the app crash and display a stack trace to the user? Or did it show an outage status page and email the stack trace to your on-call staff?<\/p>\n<p>It\u2019s important to keep each test small and focused. This limits the impact in the event of a production outage and helps you be sure that the issue arises from the tested assumption, not from another part of the system.<\/p>\n<p>Always ensure that you have a clear recovery procedure before manually conducting a chaos experiment. Escalating a provoked outage into a live, unplanned one is the last thing that you want. If you\u2019re terminating a service, be mindful of the time that you\u2019ll need to get it started again. There could be knock-on impacts on your application during longer outages: If your email distribution service drops out, there could be a backlog to work through when it comes back online. These aspects need to be incorporated into your action plan before you start work.<\/p>\n<p>After your experiment completes, you might need to update your system before re-running the test. Re-testing confirms that your fix actually improves the situation and lets you be confident that your system is now resilient to that specific scenario.<\/p>\n<p>Here\u2019s a summary of the chaos experiment process:<\/p>\n<ol type=\"1\">\n<li><strong>Develop a hypothesis:\u00a0<\/strong>\u201cThe system is resilient to increased network latency.\u201d<\/li>\n<li><strong>Design a focused experiment:<\/strong>\u00a0\u201cWe will artificially increase latency to 500ms on 70% of requests.\u201d Make sure that you have a clear rollback and recovery strategy.<\/li>\n<li><strong>Run the experiment:<\/strong>\u00a0Observe the impact on your application. 
Revert detrimental changes to production environments as soon as possible.<\/li>\n<li><strong>Analyze the results:<\/strong>\u00a0If you decide that your system wasn\u2019t resilient enough, implement improvements and repeat the process.<\/li>\n<\/ol>\n<h2 id=\"the-non-technical-side-of-chaos-engineering\"><span class=\"ez-toc-section\" id=\"The_Non-Technical_Side_of_Chaos_Engineering\"><\/span>The Non-Technical Side of Chaos Engineering<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Chaos engineering is normally viewed as a technical task for development and operations teams; after all, \u201cengineering\u201d is in the name. Besides the nuts and bolts of networks and services, it\u2019s important to look at the human side, too. It\u2019s easy to think that your system only depends on a database, a few app servers, and a stable network. That\u2019s not usually the case.<\/p>\n<p>Think about how your system would respond if team members were unavailable. Is knowledge readily accessible if an admin needs to step back unexpectedly? Especially in smaller organizations, it\u2019s common for a \u201cteam\u201d to be a single person. What happens if your only network specialist is ill during a live outage?<\/p>\n<p>In the same way that you test the technical aspects by taking services offline, you can rehearse human scenarios, too. Try purposefully excluding key individuals as you rehearse an outage. Was the remainder of the team able to restore service to an acceptable state? If they weren\u2019t, you might benefit from documenting more of the system and its dependencies.<\/p>\n<h2 id=\"summary\"><span class=\"ez-toc-section\" id=\"Summary\"><\/span>Summary<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The term \u201cchaos engineering\u201d refers to the practice of purposefully breaking things in production to uncover previously hidden issues. 
Although the approach can seem daunting to start with, dedicated tools like <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/netflix.github.io\/chaosmonkey\/How-to-deploy\">Chaos Monkey<\/a> can help you get started with minimal risk.<\/p>\n<p>Adding chaos is a useful technique, as it uncovers both transient and systemic problems. You might find that peaking memory use causes knock-on impacts across your infrastructure, but that increased network latency has a sporadic effect on specific parts of your stack.<\/p>\n<p>Effective use of chaos engineering can help you find bugs faster, before your customers notice them. It helps you build up resiliency in your system by encouraging anticipation of issues. Most teams still address problems reactively, leading to an increased cycle time that impedes efficiency.<\/p>\n<p>Chaos engineering is best treated as a mindset rather than a specific procedure or software product. If you acknowledge that systems tend toward chaos, you\u2019ll naturally start baking support for more \u201cwhat-if\u201d scenarios into your code.<\/p>\n<p>It\u2019s always worth thinking about the \u201cimpossible\u201d events, like a data center outage or severe network congestion. In reality, they\u2019re not impossible, just extremely rare. When they do strike, they\u2019re likely to be the most destructive events that your system encounters, unless your infrastructure is prepared to handle them with fallback routines.\n<\/p><\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/www.cloudsavvyit.com\/12638\/how-chaos-engineering-helps-you-avoid-unplanned-downtime\/\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#How \u201cChaos Engineering\u201d Helps You Avoid Unplanned Downtime \u2013 CloudSavvy IT&#8221; asharkyu\/Shutterstock.com Chaos engineering is an approach to software fault tolerance testing that intentionally provokes errors in live deployments. It incorporates an element of randomness to mimic the unpredictability of most real-world outages. 
The idea of adding chaos to a system is generally credited to&#8230;<\/p>\n","protected":false},"author":1,"featured_media":299714,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2021\/07\/09c0a66a.jpg","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-299713","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/299713","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=299713"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/299713\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/299714"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=299713"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=299713"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=299713"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}