{"id":496275,"date":"2022-09-26T10:48:12","date_gmt":"2022-09-26T07:48:12","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/synthetic-data-is-the-safe-low-cost-alternative-to-real-data-that-we-need\/"},"modified":"2022-09-26T10:48:12","modified_gmt":"2022-09-26T07:48:12","slug":"synthetic-data-is-the-safe-low-cost-alternative-to-real-data-that-we-need","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/synthetic-data-is-the-safe-low-cost-alternative-to-real-data-that-we-need\/","title":{"rendered":"#Synthetic data is the safe, low-cost alternative to real data that we need"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a3701f228574\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #dd3333;color:#dd3333\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #dd3333;color:#dd3333\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a3701f228574\" checked aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list 
ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-1'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/buradabiliyorum.com\/en\/synthetic-data-is-the-safe-low-cost-alternative-to-real-data-that-we-need\/#%E2%80%9CSynthetic_data_is_the_safe_low-cost_alternative_to_real_data_that_we_need%E2%80%9D\" >&#8220;Synthetic data is the safe, low-cost alternative to real data that we need&#8221;<\/a><ul class='ez-toc-list-level-2' ><li class='ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/buradabiliyorum.com\/en\/synthetic-data-is-the-safe-low-cost-alternative-to-real-data-that-we-need\/#Greetings_humanoids\" >Greetings, humanoids<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/buradabiliyorum.com\/en\/synthetic-data-is-the-safe-low-cost-alternative-to-real-data-that-we-need\/#Fake_data_can_help_AIs_deal_with_real_data\" >Fake data can help AIs deal with real data<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/buradabiliyorum.com\/en\/synthetic-data-is-the-safe-low-cost-alternative-to-real-data-that-we-need\/#How_to_make_really_fake_data\" >How to make really fake data<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/buradabiliyorum.com\/en\/synthetic-data-is-the-safe-low-cost-alternative-to-real-data-that-we-need\/#Fake_data_is_like_real_data_without_well_the_realness\" >Fake data is like real data without, well, the realness<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<h1><span class=\"ez-toc-section\" id=\"%E2%80%9CSynthetic_data_is_the_safe_low-cost_alternative_to_real_data_that_we_need%E2%80%9D\"><\/span>&#8220;Synthetic data is the safe, low-cost alternative to real data that we need&#8221;<span class=\"ez-toc-section-end\"><\/span><\/h1>\n<div id=\"article-main-content\">\n                            
<p><em>Content provided by <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/www.ibm.com\/nl-en\">IBM<\/a> and TNW.<\/em><\/p>\n<p>Babies learn to talk from hearing other humans \u2014 mostly their parents \u2014 repeatedly produce sounds. Slowly, through repetition and discovering patterns, infants start connecting those sounds to meaning. Through a lot of practice, they eventually manage to produce similar sounds that humans around them can understand.<\/p>\n<p>Machine learning algorithms work in much the same way, but instead of having a couple of parents to copy from, they use data, painstakingly categorized by thousands of humans who have to manually review the data and tell the machine what it means.<\/p>\n<div class=\"inarticle-wrapper neural channel-cta hs-embed-tnw\">\n<div id=\"hs-embed-tnw\" class=\"channel-cta-wrapper\">\n<div class=\"channel-cta-input\">\n<h2 class=\"channel-cta-title\"><span class=\"ez-toc-section\" id=\"Greetings_humanoids\"><\/span>Greetings, humanoids<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p class=\"channel-cta-tagline\">Subscribe to our newsletter now for a weekly recap of our favorite AI stories in your inbox.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p>However, this tedious and time-consuming process isn\u2019t the only problem with real-world data used to train machine learning algorithms.<\/p>\n<p>Take fraud detection in insurance claims. For an algorithm to accurately tell a case of fraud apart from legit claims, it needs to see both. Thousands upon thousands of both. 
And because AI systems are often supplied by third parties \u2014 so not run by the insurance company itself \u2014 those third parties have to be given access to all that sensitive data. You get where this is going, because the same applies to healthcare records and financial data.<\/p>\n<p>More esoteric but just as worrying are all the algorithms trained on text, pictures, and videos. Aside from <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/kotaku.com\/ai-art-dall-e-midjourney-stable-diffusion-copyright-1849388060\">questions of copyright<\/a>, many <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/www.engadget.com\/dall-e-generative-ai-tracking-data-privacy-160034656.html\">creators have voiced disagreement<\/a> with their work being sucked into a data set to train <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/newsletters.theatlantic.com\/galaxy-brain\/62fc502abcbd490021afea1e\/twitter-viral-outrage-ai-art\/\">a machine that might eventually take (part of) their job<\/a>. And that\u2019s assuming their creations aren\u2019t racist or problematic in other ways, which in turn could lead to problematic outputs.<\/p>\n<p>Also, what if there\u2019s simply not enough data available to train an AI on all eventualities? 
In a <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/www.rand.org\/content\/dam\/rand\/pubs\/research_reports\/RR1400\/RR1478\/RAND_RR1478.pdf\">2016 RAND Corporation report<\/a>, the authors calculated how many miles \u201ca fleet of 100 autonomous vehicles driving 24 hours a day, 365 days a year, at an average speed of 25 miles per hour\u201d would have to drive to show that their failure rate (resulting in fatalities or injuries) was reliably lower than that of humans. Their answer? 500 years and 11 billion miles.<\/p>\n<p>You don\u2019t have to be a super-brained genius to figure out that the current process is not ideal. So what can we do? How can we create enough privacy-respecting, non-problematic, all-eventuality-covering, accurately labeled data? You guessed it: more AI.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Fake_data_can_help_AIs_deal_with_real_data\"><\/span><strong>Fake data can help AIs deal with real data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Even before the RAND report, it was clear to companies working on autonomous driving that they were woefully underequipped to gather enough data to reliably train algorithms to drive safely under any condition or circumstance.<\/p>\n<p>Take Waymo, Alphabet\u2019s autonomous driving company. Instead of relying solely on its real-world vehicles, it created a totally simulated world, in which simulated cars with simulated sensors could drive around endlessly, collecting real data on their simulated way. 
<a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/www.theverge.com\/2021\/7\/6\/22565448\/waymo-simulation-city-autonomous-vehicle-testing-virtual\">According to the company<\/a>, by 2020 it had collected data on 15 billion miles of simulated driving \u2014 compared to a measly 20 million miles of real-world driving.<\/p>\n<blockquote class=\"c-richText__pullQuote\">\n<div class=\"c-richText__pullQuoteGradient\">\n<p class=\"c-richText__pullQuoteQuote\">More methods for producing synthetic data are gaining ground.<\/p>\n<\/div>\n<\/blockquote>\n<p>In the parlance of AI, this is called synthetic data, or \u201cdata applicable to a given situation that is not obtained by direct measurement,\u201d if you want to get technical. Or less technically: AIs are producing fake data so other AIs can learn about the real world at a speedier pace.<\/p>\n<p>One example is <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/research.ibm.com\/blog\/what-is-synthetic-data\">Task2Sim<\/a>, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. 
The <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/unrealdata.ai\/\">scalability of this type of model<\/a> makes collecting data less time-consuming and less expensive for data-hungry businesses.<\/p>\n<p>Adding to this, Rogerio Feris, an IBM researcher who co-authored the paper on Task2Sim, said:<\/p>\n<blockquote><p>The beauty of synthetic images is that you can control their parameters \u2014 the background, lighting, and the way objects are posed.<\/p>\n<\/blockquote>\n<p>Thanks to all of the concerns listed above, the production of all kinds of synthetic data has ballooned over the past few years, with <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/elise-deux.medium.com\/the-list-of-synthetic-data-companies-2021-5aa246265b42\">dozens of startups in the field blooming<\/a> and picking up hundreds of millions of dollars in investment.<\/p>\n<p>The synthetic data generated ranges from \u2018human data\u2019 like health or financial records, through synthesized pictures of a diverse range of human faces, to more abstract data sets like genomic data that mimics the structure of DNA.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"How_to_make_really_fake_data\"><\/span><strong>How to make really fake data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>There are a couple of ways this synthetic data generation happens, the most common and well-established of which is the generative adversarial network, or GAN.<\/p>\n<p>In a GAN, two AIs are pitted against each other. One AI produces a synthetic data set, while the other tries to establish if the generated data is genuine. The feedback from the latter loops back into the former, \u2018training\u2019 it to become more accurate in producing convincing fake data. 
You\u2019ve probably seen one of the many <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/thispersondoesnotexist.com\/\">this-X-does-not-exist<\/a> websites \u2014 ranging from people to cats to buildings \u2014 which generate their images using GANs.<\/p>\n<blockquote class=\"c-richText__pullQuote\">\n<div class=\"c-richText__pullQuoteGradient\">\n<p class=\"c-richText__pullQuoteQuote\">Synthetic data can give smaller players the opportunity to turn the tables.<\/p>\n            <\/div>\n<\/blockquote>\n<p>Lately, more methods for producing synthetic data have been gaining ground. Among them are <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/ai.googleblog.com\/2021\/07\/high-fidelity-image-generation-using.html\">diffusion models<\/a>, in which AIs are trained to reconstruct certain types of data while more and more noise \u2014 data that gradually corrupts the training data \u2014 is added to the real-world data. Eventually, the AI can be fed random data, which it works back into the kind of data it was originally trained on.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Fake_data_is_like_real_data_without_well_the_realness\"><\/span><strong>Fake data is like real data without, well, the realness<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Synthetic data, however it is produced, offers a number of very concrete advantages over using real-world data. First of all, it\u2019s easier to collect way more of it, because you don\u2019t have to rely on humans creating it. Second, the synthetic data comes perfectly labeled, so there\u2019s no need to rely on labor-intensive data centers to (sometimes incorrectly) label data. Third, it can protect privacy and copyright, as the data is, well, synthetic. 
And finally, and perhaps most importantly, it can reduce biased outcomes.<\/p>\n<p>With AI playing an increasingly large role in technology and society, expectations around synthetic data are pretty optimistic. Gartner has famously estimated that <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/blogs.gartner.com\/andrew_white\/2021\/07\/24\/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated\/\">60% of training data will be synthetic data by 2024<\/a>. Market analyst <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/venturebeat.com\/business\/why-synthetic-data-makes-real-ai-better\/\">Cognilytica valued the market<\/a> of synthetic data generation at $110 million in 2021, with projected growth to $1.15 billion by 2027.<\/p>\n<p>Data has been called the most valuable commodity in the digital age. Big tech has sat on mountains of user data that gave it an advantage over smaller contenders in the AI space. Synthetic data can give smaller players the opportunity to turn the tables.<\/p>\n<p>As you might suspect, the big question regarding synthetic data is its so-called fidelity \u2014 or how closely it matches real-world data. The jury is still out on this, but research <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/venturebeat.com\/business\/why-synthetic-data-makes-real-ai-better\/\">seems to show<\/a> that combining synthetic data with real data gives statistically sound results. 
This year, researchers from MIT and the MIT-IBM Watson AI Lab showed that an image classifier pretrained on synthetic data in combination with real data <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/computing.mit.edu\/news\/when-it-comes-to-ai-can-we-ditch-the-datasets\/\">performed as well as an image classifier trained exclusively on real data<\/a>.<\/p>\n<p>All in all, synthetic and real-world stop lights appear to be green for the near-future dominance of synthetic data in training privacy-friendly and safer AI models, and with that, a possible future of smarter AIs for us is just over the horizon.\n                        <\/p><\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/thenextweb.com\/news\/synthetic-data-safe-low-cost-alternative-data\" target=\"_blank\" 
rel=\"noopener\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;Synthetic data is the safe, low-cost alternative to real data that we need&#8221; Content provided by IBM and TNW. Babies learn to talk from hearing other humans \u2014 mostly their parents \u2014 repeatedly produce sounds. Slowly, through repetition and discovering patterns, infants start connecting those sounds to meaning. Through a lot of practice, they eventually&#8230;<\/p>\n","protected":false},"author":1,"featured_media":496276,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/img-cdn.tnwcdn.com\/image\/neural?filter_last=1&fit=1280,640&url=https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2022\/09\/OL3.jpg&signature=f202291f59e13836ed9cda600bf5a22c","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-496275","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/496275","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=496275"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/496275\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/496276"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=496275"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-jso
n\/wp\/v2\/categories?post=496275"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=496275"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}