{"id":321233,"date":"2021-08-10T20:00:00","date_gmt":"2021-08-10T17:00:00","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/how-to-convert-csv-and-log-files-to-a-columnar-format-cloudsavvy-it\/"},"modified":"2021-08-10T20:00:00","modified_gmt":"2021-08-10T17:00:00","slug":"how-to-convert-csv-and-log-files-to-a-columnar-format-cloudsavvy-it","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/how-to-convert-csv-and-log-files-to-a-columnar-format-cloudsavvy-it\/","title":{"rendered":"#How to Convert CSV and Log Files to a Columnar Format \u2013 CloudSavvy IT"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_84 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a2719f2a1c60\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #dd3333;color:#dd3333\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #dd3333;color:#dd3333\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a2719f2a1c60\" checked aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li 
class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/buradabiliyorum.com\/en\/how-to-convert-csv-and-log-files-to-a-columnar-format-cloudsavvy-it\/#What_Is_A_Columnar_Format\" >What Is A Columnar Format?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/buradabiliyorum.com\/en\/how-to-convert-csv-and-log-files-to-a-columnar-format-cloudsavvy-it\/#Convert_Automatically_Using_AWS_Glue\" >Convert Automatically Using AWS Glue<\/a><\/li><\/ul><\/nav><\/div>\n<div id=\"article-content-area\">\n<p><img loading=\"lazy\" decoding=\"async\" class=\"type:primaryImage alignnone wp-image-2282 size-full\" srcset=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/10c02a35.png?width=398&amp;trim=1,1&amp;bg-color=000&amp;pad=1,1 400w, https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/10c02a35.png?width=1198&amp;trim=1,1&amp;bg-color=000&amp;pad=1,1 1200w\" sizes=\"auto, 400w, 1200w\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/10c02a35.png?width=1198&amp;trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"AWS Glue Hero Image\" width=\"700\" height=\"300\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Columnar formats, such as Apache Parquet, offer great compression savings and are much easier to scan, process, and analyze than other formats such as CSV. 
In this article, we show you how to convert your CSV data to Parquet using AWS Glue.<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"What_Is_A_Columnar_Format\"><\/span>What Is A Columnar Format?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>CSV files, log files, and any other character-delimited files all effectively store data in columns. Each row of data has a certain number of columns, all separated by a delimiter such as a comma or a space. But under the hood, these formats are still just lines of strings. There\u2019s no easy way to scan just a single column of a CSV file.<\/p>\n<p>This can be a problem with services like AWS Athena, which can run SQL queries on data stored in CSV and other delimited files. Even if you\u2019re only querying a single column, Athena has to scan the <em>entire<\/em> file\u2019s contents. Athena charges only by the amount of data scanned, so running up the bill by processing unnecessary data isn\u2019t the best idea.<\/p>\n<p>The solution is a true columnar format. Columnar formats store data column by column rather than row by row. Because the values of each column are stored together, the data is much more homogeneous, which makes it easier to compress. These files are <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/parquet.apache.org\/documentation\/latest\/\">not exactly human readable<\/a>, but they\u2019re understood by the application processing them just fine. In fact, because there\u2019s less data to scan, they\u2019re much easier to process.<\/p>\n<p>Because Athena only has to scan one column to do a selection by column, it drastically cuts down on costs, especially for larger datasets. 
If you have 10 columns in each file and only scan one, that\u2019s roughly a 90% cost savings just from switching to Parquet, assuming the columns are similar in size.<\/p>\n<h2 role=\"heading\" aria-level=\"2\"><span class=\"ez-toc-section\" id=\"Convert_Automatically_Using_AWS_Glue\"><\/span>Convert Automatically Using AWS Glue<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>AWS Glue is Amazon\u2019s managed ETL (extract, transform, load) service, and it can convert datasets between formats. It\u2019s primarily used as part of a pipeline that processes data stored in delimited and other formats and loads it into databases for use in Athena. While it can be set up to run automatically, you can also run it manually, and with a bit of tweaking, it can be used to convert CSV files to the Parquet format.<\/p>\n<p>Head over to the AWS Glue Console and select \u201cGet Started\u201d. From the sidebar, click on \u201cAdd Crawler\u201d and create a new crawler. The crawler is configured to scan for data from <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/s3.console.aws.amazon.com\/s3\/?tag=reviewgeek-20\">S3 buckets<\/a> and import the data into a database for use in the conversion.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"imgchk9 alignnone wp-image-2276 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/9349343f.png?trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"Creating a crawler.\" width=\"700\" height=\"300\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Give your crawler a name, and choose to import data from a data store. Select S3 (though DynamoDB is another option), and enter the path to a folder containing your files. 
If you just have one file you want to convert, put it in its own folder.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"imgchk9 alignnone wp-image-2273 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/32414e17.png?trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"Choosing the data store for your crawler to import data from.\" width=\"700\" height=\"308\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Next, you\u2019re asked to create an IAM role for your crawler to operate as. Create the role, then choose it from the list. You may have to hit the refresh button next to the list for the new role to appear.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"imgchk9 alignnone wp-image-2274 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/08143fe5.png?trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"Choosing an IAM role for your crawler.\" width=\"700\" height=\"292\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Choose a database for the crawler to output to. If you\u2019ve used Athena before, you can use your custom database; if not, the default one should work fine.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"imgchk9 alignnone wp-image-2275 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/fe911f17.png?trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"Configuring your crawler's output database.\" width=\"700\" height=\"302\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>If you want to automate the process, you can give your crawler a schedule so that it runs on a regular basis. 
If not, choose manual mode and execute it yourself from the console.<\/p>\n<p>Once it\u2019s created, go ahead and run the crawler to import the data into the database you chose. If everything worked, you should see your file imported with the proper schema. The data types for each column are assigned automatically based on the source input.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"imgchk9 alignnone wp-image-2277 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/79f5da1b.png?trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"Files imported with the proper schema.\" width=\"700\" height=\"265\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Once your data is in the AWS system, you can convert it. From the Glue Console, switch over to the \u201cJobs\u201d tab, and create a new job. Give it a name, add your IAM role, and select \u201cA Proposed Script Generated By AWS Glue\u201d as what the job runs.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"imgchk9 alignnone wp-image-2278 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/9349343f-1.png?trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"Name your new job, add the IAM role, and select &quot;A Proposed Script Generated By AWS Glue&quot;.\" width=\"700\" height=\"270\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Select your table on the next screen, then choose \u201cChange Schema\u201d to specify that this job runs a conversion.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"imgchk9 alignnone wp-image-2279 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/32414e17-1.png?trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"Choose &quot;Change Schema&quot; to specify that your job 
runs a conversion.\" width=\"700\" height=\"185\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Next, you have to select \u201cCreate Tables In Your Data Target\u201d, specify Parquet as the format, and enter a new target path. Make sure this is an empty location without any other files.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"imgchk9 alignnone wp-image-2280 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/fe911f17-1.png?trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"Choose a data target by selecting &quot;Create Tables In Your Data Target&quot;, specifying Parquet as the format, and entering a new target path.\" width=\"700\" height=\"268\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Next, you can edit the schema of your file. This defaults to a one-to-one mapping of CSV columns to Parquet columns, which is likely what you want, but you can modify it if you need to.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"imgchk9 alignnone wp-image-2281 size-full\" src=\"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/08143fe5-1.png?trim=1,1&amp;bg-color=000&amp;pad=1,1\" alt=\"Editing the schema of your file.\" width=\"700\" height=\"305\" onload=\"pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\" onerror=\"this.onerror=null;pagespeed.lazyLoadImages.loadIfVisibleAndMaybeBeacon(this);\"\/><\/p>\n<p>Create the job, and you\u2019ll be brought to a page that enables you to edit the Python script it runs. The default script should work fine, so hit \u201cSave\u201d and exit back to the jobs tab.<\/p>\n<p>In our testing, the script always failed unless the IAM role was given specific permission to write to the location we specified the output to go to. 
You may have to manually edit the permissions from the <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/console.aws.amazon.com\/iam\/home?&amp;tag=reviewgeek-20\">IAM Management Console<\/a> if you run into the same issue.<\/p>\n<p>Otherwise, click \u201cRun\u201d and your script should start up. It may take a minute or two to process, but you should see the status in the info panel. When it\u2019s done, you\u2019ll see a new file created in S3.<\/p>\n<p>This job can be configured to run off of triggers set by the crawler that imports the data, so the whole process can be automated from start to finish. If you\u2019re importing server logs to S3 this way, this can be an easy method to convert them to a more usable format.\n<\/p><\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" 
href=\"https:\/\/www.cloudsavvyit.com\/2272\/how-to-convert-csv-and-log-files-to-a-columnar-format\/\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#How to Convert CSV and Log Files to a Columnar Format \u2013 CloudSavvy IT&#8221; Columnar formats, such as Apache Parquet, offer great compression savings and are much easier to scan, process, and analyze than other formats such as CSV. In this article, we show you how to convert your CSV data to Parquet using AWS&#8230;<\/p>\n","protected":false},"author":1,"featured_media":321234,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/www.cloudsavvyit.com\/p\/uploads\/2019\/10\/10c02a35.png","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[],"class_list":["post-321233","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/321233","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=321233"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/321233\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/321234"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=321233"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=321233"},{"taxonomy":"post_tag","
embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=321233"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}