{"id":128140,"date":"2020-12-07T17:00:29","date_gmt":"2020-12-07T14:00:29","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/how-to-correctly-parse-file-names-in-bash-cloudsavvy-it\/"},"modified":"2020-12-07T17:00:29","modified_gmt":"2020-12-07T14:00:29","slug":"how-to-correctly-parse-file-names-in-bash-cloudsavvy-it","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/how-to-correctly-parse-file-names-in-bash-cloudsavvy-it\/","title":{"rendered":"#How to Correctly Parse File Names in Bash \u2013 CloudSavvy IT"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a4200a5042d4\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #dd3333;color:#dd3333\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #dd3333;color:#dd3333\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a4200a5042d4\" checked aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a 
class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/buradabiliyorum.com\/en\/how-to-correctly-parse-file-names-in-bash-cloudsavvy-it\/#The_Problem_With_Correctly_Parsing_File_Names_in_Bash\" >The Problem With Correctly Parsing File Names in Bash<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/buradabiliyorum.com\/en\/how-to-correctly-parse-file-names-in-bash-cloudsavvy-it\/#The_Secret_Recipe_NULL_Termination\" >The Secret Recipe: NULL Termination<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/buradabiliyorum.com\/en\/how-to-correctly-parse-file-names-in-bash-cloudsavvy-it\/#Wrapping_up\" >Wrapping up<\/a><\/li><\/ul><\/nav><\/div>\n<div id=\"article-content-area\">\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-4038\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/f1fee0a0a83b16d260ba2e862cb46eec\/p\/uploads\/2017\/07\/add8ac45.png\" alt=\"Bash Shell\" width=\"1400\" height=\"600\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Bash file naming conventions are very rich, and it is easy to create a script or one-liner which incorrectly parses file names. 
Learn to parse file names correctly, and thereby ensure your scripts work as intended!<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"The_Problem_With_Correctly_Parsing_File_Names_in_Bash\"><\/span>The Problem With Correctly Parsing File Names in Bash<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>If you have been using Bash for a while, and have been scripting in its rich language, you will likely have run into some file name parsing issues. Let\u2019s take a look at a simple example of what can go wrong:<\/p>\n<pre>touch 'a\nb'\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8368\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/e8132f8a57d741b2282f553ac2e2971a\/p\/uploads\/2020\/12\/d90a6452.png\" alt=\"Setting up a file with a CR character in the filename\" width=\"301\" height=\"85\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Here we created a file which has an actual newline (line feed, <code>LF<\/code>) in its name, introduced by pressing enter after the <code>a<\/code>. The <code>&gt;<\/code> that Bash prints at the start of the second line is its continuation prompt, not part of the command. Linux file names can contain almost any character, and whilst it is in some ways cool that we can use special characters like these in a filename, let\u2019s see how this file fares when we try to take some actions on it:<\/p>\n<pre>ls | xargs rm\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8369\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/8e0a6c99b6387d744322b64c10dd87ae\/p\/uploads\/2020\/12\/8bc8a069.png\" alt=\"The problem trying to handle a filename which includes CR\" width=\"402\" height=\"102\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>That did not work. 
<code>xargs<\/code> will take the input from <code>ls<\/code> (via the <code>|<\/code> pipe), and pass it to <code>rm<\/code>, but something went amiss in the process!<\/p>\n<p>What went amiss is that <code>xargs<\/code> splits its input on whitespace by default, so the newline within the filename (inserted when we pressed enter) is seen by <code>xargs<\/code> as a separator between two arguments, not as a character to be passed on to <code>rm<\/code> as part of a single filename.<\/p>\n<p>Let\u2019s demonstrate this in another way:<\/p>\n<pre>ls | xargs -I{} echo '{}|'\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8370\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/6e9c054458dd6c4dd1268c0ff64119e9\/p\/uploads\/2020\/12\/ac00a86f.png\" alt=\"Showing how xargs will see the CR character as a newline and split data upon it\" width=\"432\" height=\"98\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>It is clear: <code>xargs<\/code> is processing the input as two individual lines, splitting the original filename in two! 
Even if we were to fix the newline issue with some fancy parsing using sed, we would soon run into other issues when we start using other special characters like spaces, backslashes, quotes and more!<\/p>\n<pre>touch 'a\nb'\ntouch 'a b'\ntouch 'a\\b'\ntouch 'a\"b'\ntouch \"a'b\"\nls\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8371\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/2c434ed9443776c3af6ed959626051af\/p\/uploads\/2020\/12\/64267db5.png\" alt=\"All sorts of special characters in filenames\" width=\"410\" height=\"119\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Even if you are a seasoned Bash developer, you may shiver at seeing filenames like these, as it would be very complex for most common Bash tools to parse them correctly. You would have to do all sorts of string modifications to make this work. That is, unless you have the secret recipe.<\/p>\n<p>Before we dive into that, there is one more must-know issue which you can run into when parsing <code>ls<\/code> output. If you use color coding for directory listings, which is enabled by default on Ubuntu, it is easy to run into another set of <code>ls<\/code> parsing issues.<\/p>\n<p>These are not really related to how files are named, but rather to how the files are presented in the output of <code>ls<\/code>. The <code>ls<\/code> output may contain ANSI escape sequences which tell your terminal which color to use.<\/p>\n<p>To avoid running into these, simply use <code>--color=never<\/code> as an option to <code>ls<\/code>:<br \/><code>ls --color=never<\/code>.<\/p>\n<p>In Mint 20 (a great Ubuntu derivative operating system) this issue seems fixed, though it may still be present in many other or older versions of Ubuntu etc. 
I have seen this issue as recently as mid-August 2020 on Ubuntu.<\/p>\n<p>Even if you do not use color coding for your directory listings, it is possible that your script will run on other systems not owned or managed by you. In such a case, you will also want to use this option to prevent users of such machines from running into the issue described.<\/p>\n<p>Returning to our secret recipe, let\u2019s look at how we can make sure that we won\u2019t have any issues with special characters in Bash filenames. The solution provided does not use <code>ls<\/code> at all (parsing its output is best avoided in general), so the color coding issues are not applicable either.<\/p>\n<p>There are still times where <code>ls<\/code> parsing is quick and handy, but it will always be tricky and likely \u2018dirty\u2019 as soon as special characters are introduced \u2013 not to mention insecure (special characters can be used to introduce all sorts of issues).<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"The_Secret_Recipe_NULL_Termination\"><\/span>The Secret Recipe: NULL Termination<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The developers of these tools recognized this problem many years ago, and provided us with a solution: <code>NULL<\/code> termination!<\/p>\n<p>What is <code>NULL<\/code> termination, you ask? Consider how in the examples above, the newline (or literally <i>enter<\/i>) was the main termination character. With <code>NULL<\/code> termination, each item instead ends with the null character, the one byte that can never be part of a filename.<\/p>\n<p>We also saw how special characters like quotes, white space and backslashes can be used in filenames, even though they have special functions when it comes to other Bash text parsing and modification tools like sed. 
Now compare this with the <code>-0<\/code> option to <code>xargs<\/code>, from <code>man xargs<\/code>:<\/p>\n<p><strong>-0, --null<\/strong> <em>Input items are terminated by a null character instead of by white space, and the quotes and backslash are not special (every character is taken literally). Disables the end of file string, which is treated like any other argument. Useful when input items might contain white space, quote marks, or backslashes. The GNU find -print0 option produces input suitable for this mode.<\/em><\/p>\n<p>And the <code>-print0<\/code> option to <code>find<\/code>, from <code>man find<\/code>:<\/p>\n<p><strong>-print0<\/strong> <em>True; print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output. This option corresponds to the -0 option of xargs.<\/em><\/p>\n<p>The <i>True;<\/i> here means that <code>-print0<\/code>, like every <code>find<\/code> expression, yields a truth value, and that this one always evaluates to true. Also interesting are the two clear warnings given elsewhere in the same manual page:<\/p>\n<ul>\n<li>If you are piping the output of find into another program and there is the faintest possibility that the files which you are searching for might contain a newline, then you should seriously consider using the -print0 option instead of -print. See the UNUSUAL FILENAMES section for information about how unusual characters in filenames are handled.<\/li>\n<li>If you are using find in a script or in a situation where the matched files might have arbitrary names, you should consider using -print0 instead of -print.<\/li>\n<\/ul>\n<p>These clear warnings remind us that parsing filenames in Bash can be, and is, tricky business. 
However, with the right options to <code>find<\/code>, namely <code>-print0<\/code>, and <code>xargs<\/code>, namely <code>-0<\/code>, all our filenames containing special characters can be parsed correctly:<\/p>\n<pre>ls\nfind . -name 'a*' -print0\nfind . -name 'a*' -print0 | xargs -0 ls\nfind . -name 'a*' -print0 | xargs -0 rm\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-8372\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/9ad4a73c57956f0a508aba7c8c093126\/p\/uploads\/2020\/12\/a0e78a6b.png\" alt=\"The solution: find -print0 and xargs -0\" width=\"450\" height=\"172\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>First we check our directory listing. All our filenames containing special characters are there. We next do a simple <code>find ... -print0<\/code> to see the output. We note that the strings are <code>NULL<\/code> terminated (with the <code>NULL<\/code> or <code>\\0<\/code> \u2013 the same character \u2013 not visible).<\/p>\n<p>We also note that there is a single newline in the output, which matches the single newline we had introduced into the first filename, consisting of <i>a<\/i> followed by <i>enter<\/i> followed by <i>b<\/i>.<\/p>\n<p>Finally, the output does not end with a newline before returning the <code>$<\/code> terminal prompt, as the strings were <code>NULL<\/code> terminated rather than newline terminated. We press enter at the <code>$<\/code> terminal prompt to make things a bit clearer.<\/p>\n<p>Next we add <code>xargs<\/code> with the <code>-0<\/code> option, which enables <code>xargs<\/code> to handle the <code>NULL<\/code> terminated input correctly. 
We see that the input passed to and received from <code>ls<\/code> looks clean, and there is no mangling or transformation of text happening.<\/p>\n<p>Finally we re-attempt our <code>rm<\/code> command, this time for all the files, including the original one containing the newline which we had issues with. The <code>rm<\/code> works perfectly, and no errors or parsing issues are observed. Great!<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"Wrapping_up\"><\/span>Wrapping up<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We have seen how it is important, in many instances, to correctly parse and handle file names in Bash. Whereas learning how to use <code>find<\/code> correctly is a bit more challenging than simply using <code>ls<\/code>, the benefits it provides pay off in the end: increased security, and no issues with special characters.<\/p>\n<p>If you enjoyed this article, you may also want to read <i>How to Bulk Rename Files to Numeric File Names in Linux<\/i>, which shows an interesting and somewhat complex <code>find -print0 | xargs -0<\/code> statement. <strong>Enjoy!<\/strong>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/www.cloudsavvyit.com\/8367\/how-to-correctly-parse-file-names-in-bash\/\" target=\"_blank\" rel=\"noopener noreferrer\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#How to Correctly Parse File Names in Bash \u2013 CloudSavvy IT&#8221; Bash file naming conventions are very rich, and it is easy to create a script or one-liner which incorrectly parses file names. Learn to parse file names correctly, and thereby ensure your scripts work as intended! 
The Problem With Correctly Parsing File Names in&#8230;<\/p>\n","protected":false},"author":1,"featured_media":128141,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2017\/07\/add8ac45.png","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-128140","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/128140","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=128140"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/128140\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/128141"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=128140"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=128140"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=128140"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}