{"id":131161,"date":"2020-12-11T16:00:43","date_gmt":"2020-12-11T13:00:43","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/how-to-use-aws-textract-ocr-to-pull-text-and-data-from-documents-cloudsavvy-it\/"},"modified":"2020-12-11T16:00:43","modified_gmt":"2020-12-11T13:00:43","slug":"how-to-use-aws-textract-ocr-to-pull-text-and-data-from-documents-cloudsavvy-it","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/how-to-use-aws-textract-ocr-to-pull-text-and-data-from-documents-cloudsavvy-it\/","title":{"rendered":"#How To Use AWS Textract OCR To Pull Text and Data From Documents \u2013 CloudSavvy IT"},"content":{"rendered":"<p><strong>&#8220;#How To Use AWS Textract OCR To Pull Text and Data From Documents \u2013 CloudSavvy IT&#8221;<\/strong><\/p>\n<div id=\"article-content-area\">\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5269\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/0eb3564906a864c93706b30eaca199af\/p\/uploads\/2020\/06\/e601b806.png\" alt=\"AWS Logo\" width=\"700\" height=\"300\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Many companies use human workers to do manual data entry on forms, applications, and other physical documents. While manual entry is generally accurate, it\u2019s slow and costly. AWS Textract uses machine learning to automate this process.<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"Why_Use_AWS_Textract\"><\/span>Why Use AWS Textract?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Textract certainly isn\u2019t the only Optical Character Recognition (OCR) tool; there are plenty of open source solutions available for free, such as Tesseract OCR. 
You can read our guide to Tesseract to learn more.<\/p>\n<p>Textract, however, is a lot more than simple OCR: it\u2019s designed to analyze and extract data from forms, tables, and other documents. It can pull out key-value pairs, tables, and other important strings, which makes it usable as an interface between scanned documents and a database (though you\u2019ll need to set that automation up yourself).<\/p>\n<p>The other draw is that Textract makes OCR available as a fully managed cloud service. You don\u2019t need to set up your own application servers to run OCR and interpret the output; just configure Textract and send it some documents, and it will output the results.<\/p>\n<p>For companies still doing manual data entry, Textract can save a <em>lot<\/em> of money, both by reducing the hours spent typing on a keyboard and by batch processing many items at once, which speeds up data entry immensely.<\/p>\n<p>In terms of price, Textract is cheapest for plain text, like scanned pages of books. For that, it only costs\u00a0$1.50 per 1,000 pages. For analyzing tables, it costs\u00a0$15.00 per 1,000 pages. For key-value pairs, it costs\u00a0$50.00 per 1,000 pages. While that\u2019s not exactly free, it sure beats paying a human to do it manually.<\/p>\n<p>Textract is pretty accurate, but if you\u2019re worried about the machine getting something wrong, AWS has a solution for that as well. 
You can set up Textract to use <a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aws.amazon.com\/augmented-ai\/?tag=reviewgeek-20\">Amazon\u2019s Augmented AI workflow<\/a>, which will automatically refer low-confidence results to humans for review.<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"Using_Textract\"><\/span>Using Textract<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Head over to the Textract Management Console, and click \u201cget started.\u201d In the console, you can upload documents manually using the button here:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8499\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/2cc3a98453ba9acf3f0c61a207ae3991\/p\/uploads\/2020\/12\/2dceab79.png\" alt=\"\" width=\"700\" height=\"317\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Textract will process it immediately. 
You\u2019ll quickly see what makes Textract so useful; it knew which pieces of text on this W-2 form were important, which ones were part of key-value pairs, which ones were part of tables, and which ones it could throw out.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8500\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/60ad336fc3c1641eeae4342628431463\/p\/uploads\/2020\/12\/bd8b9e9f.png\" alt=\"\" width=\"700\" height=\"436\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>On the right, you\u2019ll find the output, which displays all the raw strings it found, the key-value pairs, and any tables of data. Note that these aren\u2019t mutually exclusive; in this case, it found key-value pairs that were also parts of tables.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8501\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/dc93cc0f4d59d11744d0ff3535d0947c\/p\/uploads\/2020\/12\/763d9202.png\" alt=\"\" width=\"700\" height=\"260\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>You can download the results, and you\u2019ll find a CSV file of all tables and key-value pairs, as well as a text file of the raw text output.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8502\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/b22a415a70f49c2e801bb1b7b7e37ab3\/p\/uploads\/2020\/12\/7df351fe.png\" alt=\"\" width=\"700\" height=\"252\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>If you want to automate Textract, you\u2019ll need to use the AWS CLI or API. 
Textract has <a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/awscli.amazonaws.com\/v2\/documentation\/api\/latest\/reference\/textract\/analyze-document.html?tag=reviewgeek-20\">its own set of commands for working with it from the command line<\/a>.<\/p>\n<p>You can either <a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/awscli.amazonaws.com\/v2\/documentation\/api\/latest\/reference\/textract\/analyze-document.html#options?tag=reviewgeek-20\">serialize the document to\u00a0base64-encoded document bytes<\/a>, or upload it to S3 and give Textract the bucket and object name where it can find it. Then, you can use <code>analyze-document<\/code>\u00a0to analyze it:<\/p>\n<pre>aws textract analyze-document --document '{\"S3Object\":{\"Bucket\":\"bucket\",\"Name\":\"document\"}}' --feature-types '[\"TABLES\",\"FORMS\"]'<\/pre>\n<p>This is a synchronous operation, so the results come back in the response. To analyze asynchronously instead, start a job with <code>start-document-analysis<\/code>, then fetch the results manually using the job ID it returns:<\/p>\n<pre>aws textract start-document-analysis --document-location '{\"S3Object\":{\"Bucket\":\"bucket\",\"Name\":\"document\"}}' --feature-types '[\"TABLES\",\"FORMS\"]'\naws textract get-document-analysis --job-id df7cf32ebbd2a5de113535fcf4d921926a701b09b4e7d089f3aebadb41e0712b --max-results 1000<\/pre>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/www.cloudsavvyit.com\/8498\/how-to-use-aws-textract-to-pull-text-from-documents\/\" target=\"_blank\" rel=\"noopener noreferrer\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#How To Use AWS Textract OCR To Pull Text and Data From Documents \u2013 CloudSavvy IT&#8221; Many companies use human workers to do manual data entry on forms, applications, and other physical documents. While this is very accurate, it\u2019s slow and costly. AWS Textract uses machine learning to automate this process. 
Why Use AWS Textract?&#8230;<\/p>\n","protected":false},"author":1,"featured_media":131162,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2020\/06\/e601b806.png","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-131161","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/131161","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=131161"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/131161\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/131162"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=131161"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=131161"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=131161"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}