{"id":125475,"date":"2020-12-03T16:00:57","date_gmt":"2020-12-03T13:00:57","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/parsing-html-in-bash-cloudsavvy-it\/"},"modified":"2020-12-03T16:00:57","modified_gmt":"2020-12-03T13:00:57","slug":"parsing-html-in-bash-cloudsavvy-it","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/parsing-html-in-bash-cloudsavvy-it\/","title":{"rendered":"#Parsing HTML in Bash \u2013 CloudSavvy IT"},"content":{"rendered":"<div id=\"article-content-area\">\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-4038\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/f1fee0a0a83b16d260ba2e862cb46eec\/p\/uploads\/2017\/07\/add8ac45.png\" alt=\"Bash Shell\" width=\"1400\" height=\"600\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>I have a process where I need to copy all the images from a web page. I used to run this process with <code>xmllint<\/code>, which processes an XML or HTML file and prints out the entries you specify. But when my hosting provider upgraded their systems, they didn\u2019t include <code>xmllint<\/code>, so I had to find another way to extract a list of images from an HTML page. It turns out you can do this in Bash.<\/p>\n<p>You may not think of Bash as a tool for parsing data files, but with a little care it can do the job. Bash, like other UNIX shells before it, can read a file one line at a time with the built-in <code>read<\/code> command.<\/p>\n<p>By default, the <code>read<\/code> command scans a line of data and splits it into fields. 
Usually, <code>read<\/code> splits fields using spaces and tabs, with newlines ending each line, but you can change this behavior by setting the Internal Field Separator (<code>IFS<\/code>) value and the end-of-line delimiter (the <code>-d<\/code> option).<\/p>\n<p>To parse an HTML file using <code>read<\/code>, set <code>IFS<\/code> to a greater-than symbol (<code>&gt;<\/code>) and the delimiter to a less-than symbol (<code>&lt;<\/code>). Each time Bash reads a record, it reads up to the next <code>&lt;<\/code> (the start of an HTML tag), then splits that data at the <code>&gt;<\/code> (the end of an HTML tag). This sample code takes one record of input and splits the data into the <code>TAG<\/code> and <code>VALUE<\/code> variables (<code>local<\/code> is only valid inside a function, so the snippet belongs in a function body):<\/p>\n<pre>local IFS='&gt;'\nread -d '&lt;' TAG VALUE<\/pre>\n<p>Let\u2019s explore how this works. Consider this simple HTML file:<\/p>\n<pre>&lt;img src=\"https:\/\/www.cloudsavvyit.com\/8315\/parsing-html-in-bash\/logo.png\"\nalt=\"My logo\" \/&gt;\n&lt;p&gt;some text&lt;\/p&gt;<\/pre>\n<p>The first time <code>read<\/code> parses this file, it stops at the first <code>&lt;<\/code> symbol. Since <code>&lt;<\/code> is the first character of this sample input, Bash finds an empty string, so the resulting <code>TAG<\/code> and <code>VALUE<\/code> strings are also empty. That\u2019s fine for my use case.<\/p>\n<p>The next time Bash reads the input, it gets <code>img src=\"https:\/\/www.cloudsavvyit.com\/8315\/parsing-html-in-bash\/logo.png\"\u21b2alt=\"My logo\" \/&gt;\u21b2<\/code>, with a newline right before <code>alt<\/code>, and stops before the <code>&lt;<\/code> symbol on the next line. 
Then <code>read<\/code> splits the line at the <code>&gt;<\/code> symbol, which leaves <code>TAG<\/code> with <code>img src=\"https:\/\/www.cloudsavvyit.com\/8315\/parsing-html-in-bash\/logo.png\"\u21b2alt=\"My logo\" \/<\/code> and <code>VALUE<\/code> with only a newline.<\/p>\n<p>The third time <code>read<\/code> parses the HTML file, it gets <code>p&gt;some text<\/code>. Bash splits the string at the <code>&gt;<\/code>, resulting in <code>TAG<\/code> containing <code>p<\/code> and <code>VALUE<\/code> containing <code>some text<\/code>.<\/p>\n<p>Now that you understand how to use <code>read<\/code>, it\u2019s easy to parse a longer HTML file with Bash. Start with a Bash function called <code>xmlgetnext<\/code> to parse the data using <code>read<\/code>, since you\u2019ll be doing this again and again in the script. I named my function <code>xmlgetnext<\/code> to remind me this is a replacement for the Linux <code>xmllint<\/code> program, but I could have just as easily named it <code>htmlgetnext<\/code>.<\/p>\n<pre>xmlgetnext () {\n    local IFS='&gt;'\n    read -d '&lt;' TAG VALUE\n}<\/pre>\n<p>Now call that <code>xmlgetnext<\/code> function to parse the HTML file. This is my complete <code>htmltags<\/code> script. Note that <code>read -d<\/code> is a Bash feature, so the script needs to run under Bash rather than a plain POSIX <code>sh<\/code>:<\/p>\n<pre>#!\/bin\/bash\n# print a list of all html tags\n\nxmlgetnext () {\n    # end each record at '&lt;' and split it at '&gt;'\n    local IFS='&gt;'\n    read -d '&lt;' TAG VALUE\n}\n\n# $TAG is left unquoted on purpose so multi-line tags print on one line\ncat \"$1\" | while xmlgetnext ; do echo $TAG ; done<\/pre>\n<p>The last line is the key. It loops through the file using <code>xmlgetnext<\/code> to parse the HTML, and prints out only the <code>TAG<\/code> entries. 
Because <code>$TAG<\/code> is unquoted, the shell splits it on the standard field separators (spaces, tabs, and newlines) and <code>echo<\/code> rejoins the pieces with single spaces, so any tag like <code>img src=\"https:\/\/www.cloudsavvyit.com\/8315\/parsing-html-in-bash\/logo.png\"\u21b2alt=\"My logo\" \/<\/code> that contains a newline gets printed on a single line, as <code>img src=\"https:\/\/www.cloudsavvyit.com\/8315\/parsing-html-in-bash\/logo.png\" alt=\"My logo\" \/<\/code>.<\/p>\n<figure id=\"attachment_8316\" style=\"width: 948px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-8316\" src=\"https:\/\/www.cloudsavvyit.com\/thumbcache\/0\/0\/a4a6e6c0796110b43dde1d0b57592651\/p\/uploads\/2020\/12\/c5811d09.png\" alt=\"Screenshot showing parsing HTML in Bash\" width=\"948\" height=\"198\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><figcaption class=\"wp-caption-text\">Parsing HTML in Bash<\/figcaption><\/figure>\n<p>To fetch just the list of images, I pipe the output of this script through <code>grep<\/code> to print only the lines that start with an <code>img<\/code> tag (for example, with a pattern such as <code>^img<\/code>).<\/p>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/www.cloudsavvyit.com\/8315\/parsing-html-in-bash\/\" target=\"_blank\" rel=\"noopener noreferrer\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#Parsing HTML in Bash \u2013 CloudSavvy IT&#8221; I have a process where I need to copy all the images from a web page. I used to run this process with xmllint, which will process an XML or HTML file and print out the entries you specify. 
But when my server host provider upgraded their systems,&#8230;<\/p>\n","protected":false},"author":1,"featured_media":125476,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2017\/07\/add8ac45.png","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-125475","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/125475","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=125475"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/125475\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/125476"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=125475"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=125475"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=125475"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}