{"id":119237,"date":"2020-11-24T16:00:13","date_gmt":"2020-11-24T13:00:13","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/how-to-convert-images-to-text-on-the-linux-command-line-with-ocr-cloudsavvy-it\/"},"modified":"2020-11-24T16:00:13","modified_gmt":"2020-11-24T13:00:13","slug":"how-to-convert-images-to-text-on-the-linux-command-line-with-ocr-cloudsavvy-it","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/how-to-convert-images-to-text-on-the-linux-command-line-with-ocr-cloudsavvy-it\/","title":{"rendered":"#How To Convert Images To Text On The Linux Command Line With OCR \u2013 CloudSavvy IT"},"content":{"rendered":"<p><strong>&#8220;#How To Convert Images To Text On The Linux Command Line With OCR \u2013 CloudSavvy IT&#8221;<\/strong><\/p>\n<div id=\"article-content-area\">\n<figure id=\"attachment_8162\" style=\"width: 700px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-8162 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/1ddf0035b4a836ab90db9409c4cba6e8\/p\/uploads\/2020\/11\/d70c487e.png\" alt=\"\" width=\"700\" height=\"300\" data-crediturl=\"https:\/\/www.shutterstock.com\/image-photo\/ocr-cube-letters-words-computer-software-541377586\" 
data-credittext=\"Shutterstock\/ Dominik Bruhn\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><figcaption class=\"wp-caption-text\"><span class=\"imagecredit\"><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.shutterstock.com\/image-photo\/ocr-cube-letters-words-computer-software-541377586\">Shutterstock\/ Dominik Bruhn<\/a><\/span><\/figcaption><\/figure>\n<p>Top-quality Optical Character Recognition (OCR) software may have been expensive in the past, but it is now available free of charge, directly from your Linux terminal command line! This article will help you get set up and started with OCR.<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"What_Is_OCR\"><\/span>What Is OCR?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The OCR acronym stands for <i>Optical Character Recognition<\/i>: software that lets a computer read the text inside images. Imagine taking a photo of your favorite passage from one of the <i>Lord of the Rings<\/i> books.<\/p>\n<p>You\u2019d like to quote it elsewhere, but all you have is a photo. OCR software can help by parsing that photo or image and finding all the text within it.<\/p>\n<p>For each character it finds, the OCR software analyzes the pixels in the image and translates them into actual text that a computer can use, for example in a word processor.<\/p>\n<p>While there are many OCR packages available, some paid and some free, they are not all of the same quality. 
Some packages will provide poorer quality results, while others will closely match the text seen in the photo or image.<\/p>\n<p>Generally speaking, standard books (or printed web pages) work very well and should produce reasonable results, as the fonts are straight, uniform, and at a single angle, provided that the original photo or scan is of reasonable quality.<\/p>\n<p>Also keep in mind that even advanced software packages may struggle with poor-quality or blurred images, and most packages struggle with varying handwriting styles. Other challenges include text mixed with images or photos, and text running in different directions (for example left-to-right as well as top-down, or angled text) within the same page.<\/p>\n<p>This can make choosing, and potentially paying for, an OCR package a long-winded process, especially if you want to test and evaluate each package.<\/p>\n<p>For those who are using Linux, there is a great alternative route: a free, top-quality OCR engine based on an LSTM neural network, with Unicode (UTF-8) support, which can recognize more than 100 languages out of the box. 
It also supports several output formats, such as HTML (hOCR), PDF, and plain text.<\/p>\n<p>Without further ado, welcome to Tesseract OCR!<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"Installing_Tesseract_OCR\"><\/span>Installing <i>Tesseract OCR<\/i><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>To install <i>Tesseract OCR<\/i> on a Debian\/Apt-based Linux distribution (like Ubuntu and Mint), run:<\/p>\n<p><code>sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-eng<\/code><\/p>\n<p>To install <i>Tesseract OCR<\/i> on RHEL and CentOS, run:<\/p>\n<p><code>sudo yum install epel-release<\/code><br \/><code>sudo yum install tesseract-devel leptonica-devel<\/code><\/p>\n<p>To install <i>Tesseract OCR<\/i> on Fedora, run:<\/p>\n<p><code>sudo dnf install tesseract-devel leptonica-devel<\/code><\/p>\n<p>To install <i>Tesseract OCR<\/i> on macOS (using Homebrew), run:<\/p>\n<p><code>brew install tesseract<\/code><\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"Lets_OCR\"><\/span>Let\u2019s OCR!<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We will use a simple image which contains the following text:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8152\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/ad3a617fb4f6ad49abe6be789f949c22\/p\/uploads\/2020\/11\/bf910db7.png\" alt=\"Sample image ready for OCR via Tesseract\" width=\"539\" height=\"76\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>To convert this image, open your terminal, change to the directory that contains your images using the <code>cd your_directory_with_images<\/code> command (for example, if you have made a directory <i>images<\/i> in your home directory (<code>~\/images<\/code>), you can simply use <code>cd ~\/images<\/code>), and OCR the 
files:<\/p>\n<pre>tesseract -l eng input_for_ocr.png output_from_ocr\ncat output_from_ocr.txt\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8153\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/564e1ec0c1e828bf2b0949eb220dd6a4\/p\/uploads\/2020\/11\/b64e51a5.png\" alt=\"Using Tesseract OCR via the Linux command line\" width=\"506\" height=\"173\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Very simple and straightforward. And as we can see, the output is perfect.<\/p>\n<p>We specify the English language by using the <code>-l eng<\/code> option. You can check the tesseract manual (<code>man tesseract<\/code>) for other language codes, or run <code>tesseract --list-langs<\/code> to see which languages are installed.<\/p>\n<p>We also specified the input image (<i>input_for_ocr.png<\/i>) as well as the output base name <code>output_from_ocr<\/code> without any file extension; Tesseract adds the extension itself, which is <code>.txt<\/code> for the default plain-text format.<\/p>\n<p>We can also change the output format to PDF by using a slightly longer command which simply specifies the output format at the end:<\/p>\n<pre>tesseract -l eng input_for_ocr.png output_from_ocr pdf\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8154\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/79a10173b9c8fb8fe2d83e2394d7bde3\/p\/uploads\/2020\/11\/0bcec6f4.png\" alt=\"Tesseract PDF output format\" width=\"547\" height=\"100\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>By adding the <code>pdf<\/code> argument at the end, we told Tesseract to produce PDF output. 
When we open the PDF file (<i>output_from_ocr.pdf<\/i>), we can see that the text can be selected and copied, as was done with the word <i>Readers!<\/i> here:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8155\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/a6c3a07b4e524b5649ba71038cb56cc2\/p\/uploads\/2020\/11\/997ff80f.png\" alt=\"PDF file generated with Tesseract contains text based data\" width=\"1163\" height=\"371\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>In other words, the PDF file contains text-based, selectable data, not just graphical (and therefore unselectable) information. Great!<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"What_if_I_Want_to_OCR_a_PDF_file\"><\/span>What if I Want to OCR a PDF file?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Sometimes you may receive a PDF file which \u2013 though the PDF format supports actual text inside pages \u2013 contains only images of text. This can be frustrating, as copy and paste will not be available. You can OCR these pages too, with a small workaround.<\/p>\n<p>You will first want to convert your PDF file to images \u2013 one image per page \u2013 and then OCR the individual pages into text. 
A little more work, but still a great time saver over retyping text manually.<\/p>\n<p>For simple steps to convert a PDF file to images, or even to script and automate the conversion of multiple PDF files, you can read our article <i>Convert PDF to Images From the Linux Command Line<\/i>!<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"Wrapping_Up\"><\/span>Wrapping Up<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In this article, we explored Tesseract, a top-quality free command-line OCR engine for Linux. We saw how easily images can be converted to text using a simple command.<\/p>\n<p>We also looked at converting images to text-based PDF files, and referred to an article where you can find information on how to pre-convert image-based PDF files to images so they can subsequently be converted to text using the OCR method shown here.<\/p>\n<p><strong>Enjoy!<\/strong><\/p>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/www.cloudsavvyit.com\/8151\/how-to-convert-images-to-text-on-the-linux-command-line-with-ocr\/\" target=\"_blank\" rel=\"noopener noreferrer\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#How To Convert Images To Text On The Linux Command Line With OCR \u2013 CloudSavvy IT&#8221; Shutterstock\/ Dominik Bruhn Top quality Optical Character Recognition (OCR) software may have been expensive in the past, but now it is available, free of charge, directly from your Linux Terminal command line! 
This article will help you get set up&#8230;<\/p>\n","protected":false},"author":1,"featured_media":119238,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2020\/11\/d70c487e.png","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-119237","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/119237","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=119237"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/119237\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/119238"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=119237"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=119237"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=119237"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}