{"id":289395,"date":"2021-07-01T20:29:11","date_gmt":"2021-07-01T17:29:11","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/new-data-science-platform-speeds-up-python-queries\/"},"modified":"2021-07-01T20:29:11","modified_gmt":"2021-07-01T17:29:11","slug":"new-data-science-platform-speeds-up-python-queries","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/new-data-science-platform-speeds-up-python-queries\/","title":{"rendered":"#New data science platform speeds up Python queries"},"content":{"rendered":"<p>&#8220;<strong>#New data <a href=\"https:\/\/buradabiliyorum.com\/en\/category\/sciencee\/\" data-internallinksmanager029f6b8e52c=\"5\" title=\"Science\" target=\"_blank\" rel=\"noopener\">science<\/a> platform speeds up Python queries<\/strong>&#8221;<\/p>\n<div>\n<div class=\"article-gallery lightGallery\">\n<div data-thumb=\"https:\/\/scx1.b-cdn.net\/csz\/news\/tmb\/2020\/coding.jpg\" data-src=\"https:\/\/scx2.b-cdn.net\/gfx\/news\/hires\/2020\/coding.jpg\" data-sub-html=\"Credit: CC0 Public Domain\">\n<figure class=\"article-img\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/scx1.b-cdn.net\/csz\/news\/800a\/2020\/coding.jpg\" alt=\"coding\" title=\"Credit: CC0 Public Domain\" width=\"800\" height=\"529\"\/><figcaption class=\"text-darken text-low-up text-truncate-js text-truncate mt-3\">\n                Credit: CC0 Public Domain<br \/>\n            <\/figcaption><\/figure>\n<\/div>\n<\/div>\n<p>Researchers from Brown University and MIT have developed a new data science framework that allows users to process data with the programming language Python\u2014without paying the &#8216;performance tax&#8217; normally associated with a user-friendly language.<\/p>\n<p>                                                                                The new framework, called Tuplex, is able to process data queries written in Python up to 90 times faster than industry-standard data systems like Apache Spark or Dask. 
The research team unveiled the system in research presented at SIGMOD 2021, a premier data processing conference, and have made the software freely available to all.<\/p>\n<p>&#8220;Python is the primary programming language used by people doing data science,&#8221; said Malte Schwarzkopf, an assistant professor of computer science at Brown and one of the developers of Tuplex. &#8220;That makes a lot of sense. Python is widely taught in universities, and it&#8217;s an easy language to get started with. But when it comes to data science, there&#8217;s a huge performance tax associated with Python because platforms can&#8217;t process Python efficiently on the back end.&#8221;<\/p>\n<p>Platforms like Spark perform data analytics by distributing tasks across multiple processor cores or machines in a data center. That parallel processing allows users to deal with giant data sets that would choke a single computer to death. Users interact with these platforms by inputting their own queries, which contain custom logic written as &#8220;user-defined functions&#8221; or UDFs. UDFs specify custom logic, like extracting the number of bedrooms from the text of a real estate listing for a query that searches all of the real estate listings in the U.S. and selects all the ones with three bedrooms.<\/p>\n<p>Because of its simplicity, Python is the language of choice for creating UDFs in the data science community. In fact, the Tuplex team cites a recent poll showing that 66% of data platform users utilize Python as their primary language. 
The problem is that analytics platforms have trouble dealing with those bits of Python code efficiently.<\/p>\n<figure class=\"mb-4\" itemscope=\"\" itemtype=\"http:\/\/schema.org\/VideoObject\"><meta itemprop=\"name\" content=\"New data science platform speeds up Python queries\"\/><meta itemprop=\"url\" content=\"https:\/\/www.youtube.com\/watch\/?v=Hz4v89THlJY\"\/><meta itemprop=\"description\" content=\"New data science platform speeds up Python queries\"\/><meta itemprop=\"uploadDate\" content=\"2021-07-01T13:02:26-04:00\"\/><meta itemprop=\"embedUrl\" content=\"https:\/\/www.youtube.com\/embed\/Hz4v89THlJY\"\/><meta itemprop=\"thumbnailUrl\" content=\"https:\/\/img.youtube.com\/vi\/Hz4v89THlJY\/maxresdefault.jpg\"\/><br \/>\n             <iframe loading=\"lazy\" title=\"259 Tuplex: Data Science in Python at Native Code Speed\" width=\"640\" height=\"360\" src=\"https:\/\/www.youtube.com\/embed\/Hz4v89THlJY?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><\/p>\n<\/figure>\n<p>Data platforms are written in high-level computer languages that are compiled before running. Compilers are programs that take computer language and turn it into machine code\u2014sets of instructions that a computer processor can quickly execute. Python, however, is not compiled beforehand. Instead, computers interpret Python code line by line while the program runs, which can mean far slower performance.<\/p>\n<p>&#8220;These frameworks have to break out of their efficient execution of compiled code and jump into a Python interpreter to execute Python UDFs,&#8221; Schwarzkopf said. 
&#8220;That process can be a factor of 100 less efficient than executing compiled code.&#8221;<\/p>\n<p>If Python code could be compiled, it would speed things up greatly. But researchers have tried for years to develop a <a href=\"https:\/\/buradabiliyorum.com\/en\/category\/general\/\" data-internallinksmanager029f6b8e52c=\"3\" title=\"General\" target=\"_blank\" rel=\"noopener\">general<\/a>-purpose Python compiler, Schwarzkopf says, with little success. So instead of trying to make a general Python compiler, the researchers designed Tuplex to compile a highly specialized program for the specific query and common-case input data. Uncommon input data, which account for only a small percentage of instances, are separated out and referred to an interpreter.<\/p>\n<p>&#8220;We refer to this process as dual-case processing, as it splits that data into two cases,&#8221; said Leonhard Spiegelberg, co-author of the research describing Tuplex. &#8220;This allows us to simplify the compilation problem as we only need to care about a single set of data types and common-case assumptions. This way, you get the best of two worlds: high productivity and fast execution speed.&#8221;<\/p>\n<p>And the runtime benefit can be substantial.<\/p>\n<p>&#8220;We show in our research that a wait time of 10 minutes for an output can be reduced to a second,&#8221; Schwarzkopf said. &#8220;So it really is a substantial improvement in performance.&#8221;<\/p>\n<p>In addition to speeding things up, Tuplex also has an innovative way of dealing with anomalous data, the researchers say. Large datasets are often messy, full of corrupted records or data fields that don&#8217;t follow convention. In real estate data, for example, the number of bedrooms could either be a numeral or a spelled-out number. Inconsistencies like that can be enough to crash some data platforms. But Tuplex extracts those anomalies and sets them aside to avoid a crash. 
Once the program has run, the user then has the option of repairing those anomalies.<\/p>\n<p>&#8220;We think this could have a major productivity impact for data scientists,&#8221; Schwarzkopf said. &#8220;To not have to run out to get a cup of coffee while waiting for an output, and to not have a program run for an hour only to crash before it&#8217;s done would be a really big deal.&#8221;\n                                                                                                                        <\/p>\n<hr\/>\n<div class=\"article-main__explore my-4 d-print-none\">\n<p>                                            <a rel=\"nofollow noopener\" target=\"_blank\" class=\"text-medium text-info mt-2 d-inline-block\" href=\"https:\/\/phys.org\/news\/2018-08-ai-code-collaborative-scientific-discovery.html\">AI for code encourages collaborative, open scientific discovery<\/a>\n                                        <\/div>\n<hr class=\"mb-4\"\/>\n<div class=\"article-main__more p-4\">\n                                                                                                <strong>More information:<\/strong><br \/>\n                                                Paper: <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/cs.brown.edu\/people\/malte\/pub\/papers\/2021-sigmod-tuplex.pdf\">cs.brown.edu\/people\/malte\/pub\/ \u2026 21-sigmod-tuplex.pdf<\/a><br \/>\nSoftware: <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/tuplex.cs.brown.edu\/\">tuplex.cs.brown.edu\/<\/a><\/p>\n<\/div>\n<div class=\"d-inline-block text-medium my-4\">\n                                                Provided by<br \/>\n                                                                                                    Brown University<br \/>\n                                                                                                        <a rel=\"nofollow noopener\" target=\"_blank\" class=\"icon_open\" 
href=\"http:\/\/www.brown.edu\/\"><br \/>\n                                                        <svg><use href=\"https:\/\/techx.b-cdn.net\/tmpl\/v2\/img\/svg\/sprite.svg#icon_open\" x=\"0\" y=\"0\"\/><\/svg><\/a><\/p><\/div>\n<p>                                        <!-- print only --><\/p>\n<div class=\"d-none d-print-block\">\n<p>\n                                                 <strong>Citation<\/strong>:<br \/>\n                                                 New data science platform speeds up Python queries (2021, July  1)<br \/>\n                                                 retrieved  2 July 2021<br \/>\n                                                 from https:\/\/techxplore.com\/<a href=\"https:\/\/buradabiliyorum.com\/en\/category\/news\/\" data-internallinksmanager029f6b8e52c=\"2\" title=\"News\" target=\"_blank\" rel=\"noopener\">news<\/a>\/2021-07-science-platform-python-queries.html<\/p>\n<p>                                            This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no<br \/>\n                                            part may be reproduced without the written permission. The content is provided for information purposes only.<\/p><\/div>\n<\/p><\/div>\n<p><script id=\"facebook-jssdk\" async=\"\" src=\"https:\/\/connect.facebook.net\/en_US\/sdk.js\"><\/script><\/p>\n<blockquote><p><strong><span style=\"color: #ff6600;\">If you liked the article, do not forget to share it with your friends. 
Follow us on\u00a0<span style=\"color: #ff0000;\"><a style=\"color: #ff0000;\" href=\"https:\/\/news.google.com\/publications\/CAAqBwgKMLG0nwswvr63Aw\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Google News<\/a><\/span>\u00a0too: click the star to add us to your favorites.<\/span><\/strong><\/p><\/blockquote>\n<blockquote>\n<p style=\"text-align: center;\"><strong>For forum sites, go to <span style=\"color: #ff9900;\"><a style=\"color: #ff9900;\" href=\"https:\/\/forum.buradabiliyorum.com\/\" target=\"_blank\" rel=\"noopener\">Forum.BuradaBiliyorum.Com<\/a><\/span><\/strong>\n<\/p><\/blockquote>\n<blockquote>\n<p style=\"text-align: center;\"><strong>If you want to read more articles like this, you can visit our <span style=\"color: #ff9900;\"><a style=\"color: #ff9900;\" href=\"https:\/\/en.buradabiliyorum.com\/science\/\" target=\"_blank\" rel=\"noopener\">Science category.<\/a><\/span><\/strong><\/p>\n<\/blockquote>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" href=\"https:\/\/techxplore.com\/news\/2021-07-science-platform-python-queries.html\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#New data science platform speeds up Python queries&#8221; Credit: CC0 Public Domain Researchers from Brown University and MIT have developed a new data science framework that allows users to process data with the programming language Python\u2014without paying the &#8216;performance tax&#8217; normally associated with a user-friendly language. 
The new framework, called Tuplex, is able to process&#8230;<\/p>\n","protected":false},"author":1,"featured_media":289396,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/scx2.b-cdn.net\/gfx\/news\/hires\/2020\/coding.jpg","fifu_image_alt":"","footnotes":""},"categories":[16],"tags":[],"class_list":["post-289395","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sciencee"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/289395","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=289395"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/289395\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/289396"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=289395"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=289395"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=289395"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}