{"id":118167,"date":"2020-11-23T13:27:54","date_gmt":"2020-11-23T10:27:54","guid":{"rendered":"https:\/\/en.buradabiliyorum.com\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/"},"modified":"2020-11-23T13:27:54","modified_gmt":"2020-11-23T10:27:54","slug":"a-beginners-guide-to-web-scraping-with-python-and-scrapy","status":"publish","type":"post","link":"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/","title":{"rendered":"#A beginner\u2019s guide to web scraping with Python and Scrapy"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a42edf13314c\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #dd3333;color:#dd3333\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #dd3333;color:#dd3333\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a42edf13314c\" checked aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a 
class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#Getting_Started\" >Getting Started<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#Building_our_first_Spider_with_XPath_queries\" >Building our first Spider with XPath queries<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#Another_Spider_with_CSS_query_selectors\" >Another Spider with CSS query selectors<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#The_code\" >The code<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#Code_explanation\" >Code explanation:<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#A_more_advanced_use_case\" >A more advanced use case<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#Code_explanation-2\" >Code explanation:<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#Run_the_example\" >Run the example:<\/a><\/li><li 
class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#Results\" >Results:<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/buradabiliyorum.com\/en\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/#Conclusion\" >Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n<p>&#8220;<strong>#A beginner\u2019s guide to web scraping with Python and Scrapy<\/strong>&#8221;<\/p>\n<div>\n<p class=\"p1\">Since their inception,\u00a0websites\u00a0have been used to share information. Whether it is a Wikipedia article, a YouTube channel, an Instagram account, or a Twitter handle, they are all packed with interesting data that is available to everyone with access to the\u00a0internet\u00a0and a\u00a0web browser.<\/p>\n<p class=\"p1\">But what if we want to get specific data programmatically?<\/p>\n<p class=\"p1\">There are two ways to do that:<\/p>\n<ol>\n<li class=\"p1\">Using an official API<\/li>\n<li class=\"p1\">Web scraping<\/li>\n<\/ol>\n<p class=\"p1\">The concept of an\u00a0API (Application Programming Interface)\u00a0was introduced to exchange data between different systems in a standard way. But most of the time, website owners don\u2019t provide an API. 
In that case, the only option left is to extract the data using\u00a0web scraping.<\/p>\n<p class=\"p1\">Every web page is returned from the server as\u00a0HTML, meaning that the data we want is packed inside HTML elements. That structure makes retrieving specific data fairly straightforward.<\/p>\n<p class=\"p1\">This tutorial will guide you through\u00a0web scraping with the Python programming language. First, I\u2019ll walk you through some basic examples to get you familiar with web scraping. Later on, we\u2019ll use that knowledge to extract football match data from\u00a0Livescore.cz.<\/p>\n<h2 id=\"getting-started\"><span class=\"ez-toc-section\" id=\"Getting_Started\"><\/span>Getting Started<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p class=\"p1\">To get started, you will need to create a new Python 3 project and install\u00a0Scrapy\u00a0(a web scraping and web crawling library for Python). 
I\u2019m using\u00a0pipenv\u00a0for this tutorial, but you can use pip and venv, or conda.<\/p>\n<p class=\"p1\"><em>pipenv install scrapy<\/em><\/p>\n<p class=\"p1\">At this point, you have Scrapy, but you still need to create a new web scraping project, and for that scrapy provides us with a command line that does the work for us.<\/p>\n<p class=\"p1\">Let\u2019s now create a new project named\u00a0web_scraper\u00a0by using the scrapy cli.<\/p>\n<p class=\"p1\">If you are using\u00a0pipenv\u00a0like me, use:<\/p>\n<p class=\"p1\"><em>pipenv run scrapy startproject web_scraper .<\/em><\/p>\n<p class=\"p1\">Otherwise, from your virtual environment, use:<\/p>\n<p class=\"p1\"><em>scrapy startproject web_scraper .<\/em><\/p>\n<p class=\"p1\">This will create a basic project in the current directory with the following structure:<\/p>\n<div class=\"highlight\">\n<pre><figure class=\"post-image post-mediaBleed alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1329039 lazy\" alt=\"\" width=\"811\" height=\"418\" sizes=\"auto, (max-width: 811px) 100vw, 811px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.39.37.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.39.37.png 1398w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.39.37-280x144.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.39.37-524x270.png 524w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.39.37-262x135.png 262w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.39.37-796x410.png 796w\"\/><\/figure><\/pre>\n<\/div>\n<h2 id=\"building-our-first-spider-with-xpath-queries\"><span class=\"ez-toc-section\" 
id=\"Building_our_first_Spider_with_XPath_queries\"><\/span>Building our first Spider with XPath queries<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We will start our web scraping tutorial with a very simple example. First, we\u2019ll locate the logo of the<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/livecodestream.dev\/\">Live Code Stream<\/a><span>\u00a0<\/span>website inside its HTML. The logo is just text and not an image, so we\u2019ll simply extract this text.<\/p>\n<p id=\"the-code\"><strong>The code<\/strong><\/p>\n<p>To get started, we need to create a new spider for this project. We can do that either by creating a new file or by using the CLI.<\/p>\n<p>Since we already know the code we need, we will create a new Python file at this path:<span>\u00a0<\/span><strong>\/web_scraper\/spiders\/live_code_stream.py<\/strong><\/p>\n<p>Here are the contents of this file.<\/p>\n<div class=\"highlight\">\n<pre><figure class=\"post-image post-mediaBleed alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1329040 lazy\" alt=\"\" width=\"825\" height=\"578\" sizes=\"auto, (max-width: 825px) 100vw, 825px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.40.43.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.40.43.png 1384w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.40.43-280x196.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.40.43-385x270.png 385w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.40.43-193x135.png 193w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.40.43-796x558.png 796w\"\/><\/figure><\/pre>\n<\/div>\n<p 
id=\"code-explanation\"><strong>Code explanation:<\/strong><\/p>\n<ul>\n<li class=\"p1\">First of all, we imported the Scrapy library because we need its functionality to create a Python web spider. This spider will then be used to crawl the specified website and extract useful information from it.<\/li>\n<li class=\"p1\">We created a class and named it\u00a0LiveCodeStreamSpider. It inherits from\u00a0scrapy.Spider, which is why scrapy.Spider appears in parentheses after the class name.<\/li>\n<li class=\"p1\">Now, an important step is to define a unique name for your spider using a class attribute called\u00a0name. No two spiders in the same project may share a name, so you cannot reuse the name of an existing spider.<\/li>\n<li class=\"p1\">After that, we passed the website URL using the\u00a0start_urls\u00a0list.<\/li>\n<li class=\"p1\">Finally, we created a method called\u00a0parse()\u00a0that locates the logo inside the HTML code and extracts its text. Scrapy has two built-in ways to find HTML elements inside the source code:\n<ul>\n<li class=\"p1\">CSS<\/li>\n<li class=\"p1\">XPath<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p class=\"p1\">You can even use external libraries like\u00a0BeautifulSoup\u00a0and\u00a0lxml. But for this example, we\u2019ve used XPath.<br \/>A quick way to determine the XPath of any HTML element is to inspect it in\u00a0Chrome DevTools. Then right-click the HTML code of that element and hover the mouse cursor over \u201cCopy\u201d in the popup menu that appears. 
Finally, click the \u201cCopy XPath\u201d menu item.<\/p>\n<p class=\"p1\">Have a look at the screenshot below to understand it better.<\/p>\n<figure class=\"\" data-src=\"https:\/\/thenextweb.com\/post\/2020-11-18-how-to-turn-the-web-into-data-with-python-and-scrapy\/find-xpath_hub7e3e64a73ee4298452ddd712fc8bae5_469803_700x0_resize_q75_box.jpg\">\n<figure class=\"post-image post-mediaBleed aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"lazy loaded lazy\" alt=\"\" width=\"700\" height=\"356\" src=\"https:\/\/livecodestream.dev\/post\/2020-11-18-how-to-turn-the-web-into-data-with-python-and-scrapy\/find-xpath_hub7e3e64a73ee4298452ddd712fc8bae5_469803_700x0_resize_q75_box.jpg\" data-lazy=\"true\"\/><figcaption>Find XPath using Chrome Dev Tools<\/figcaption><\/figure>\n<\/figure>\n<p>By the way, I appended<span>\u00a0<\/span><code>\/text()<\/code><span>\u00a0<\/span>to the element\u2019s XPath to retrieve only the text inside that element instead of the full element markup.<\/p>\n<p><strong>Note:<\/strong><span>\u00a0<\/span>You cannot rename the variable, list, or function mentioned above (name, start_urls, and parse()). These names are predefined by the Scrapy library, so you must use them exactly as they are. 
Otherwise, the program will not work as intended.<\/p>\n<p id=\"run-the-spider\"><strong>Run the Spider:<\/strong><\/p>\n<p>Since we are already inside the<span>\u00a0<\/span><strong>web_scraper<\/strong><span>\u00a0<\/span>folder in the command prompt, let\u2019s execute our spider and write the result to a new file,<span>\u00a0<\/span><strong>lcs.json<\/strong>, using the command below. The result will be structured as JSON.<\/p>\n<div class=\"highlight\">\n<pre><code class=\"language-shell\" data-lang=\"shell\">pipenv run scrapy crawl lcs -o lcs.json\n<\/code><\/pre>\n<\/div>\n<p>Or, if you are not using pipenv:<\/p>\n<div class=\"highlight\">\n<pre><code class=\"language-shell\" data-lang=\"shell\">scrapy crawl lcs -o lcs.json\n<\/code><\/pre>\n<\/div>\n<p id=\"results\"><strong>Results:<\/strong><\/p>\n<p>When the above command finishes, we\u2019ll see a new file<span>\u00a0<\/span><strong>lcs.json<\/strong><span>\u00a0<\/span>in our project folder.<\/p>\n<p>Here are the contents of this file.<\/p>\n<div class=\"highlight\">\n<pre><code class=\"language-json\" data-lang=\"json\">[\n{<span>\"logo\"<\/span>: <span>\"Live Code Stream\"<\/span>}\n]\n<\/code><\/pre>\n<\/div>\n<h2 id=\"another-spider-with-css-query-selectors\"><span class=\"ez-toc-section\" id=\"Another_Spider_with_CSS_query_selectors\"><\/span>Another Spider with CSS query selectors<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Most of us love sports, and football is my personal favorite.<\/p>\n<p>Football tournaments are organized frequently throughout the world. Several websites provide a live feed of match results while the matches are being played. 
But, most of these websites don\u2019t offer any official API.<\/p>\n<p>In turn, it creates an opportunity for us to use our web scraping skills and extract meaningful information by directly scraping their website.<\/p>\n<p>For example, let\u2019s have a look at<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.livescore.cz\/\">Livescore.cz<\/a><span>\u00a0<\/span>website.<\/p>\n<p>On their home page, they have nicely displayed tournaments and their matches that will be played today (the date when you visit the website).<\/p>\n<p>We can retrieve information like:<\/p>\n<ul>\n<li>Tournament Name<\/li>\n<li>Match Time<\/li>\n<li>Team 1 Name (e.g. Country, Football Club, etc.)<\/li>\n<li>Team 1 Goals<\/li>\n<li>Team 2 Name (e.g. Country, Football Club, etc.)<\/li>\n<li>Team 2 Goals<\/li>\n<li>etc.<\/li>\n<\/ul>\n<p>In our code example, we will be extracting tournament names that have matches today.<\/p>\n<h2 id=\"the-code-1\"><span class=\"ez-toc-section\" id=\"The_code\"><\/span>The code<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Let\u2019s create a new spider in our project to retrieve the tournament names. 
I\u2019ll name this file as<span>\u00a0<\/span><strong>livescore_t.py<\/strong><\/p>\n<p>Here is the code that you need to enter inside<span>\u00a0<\/span><strong>\/web_scraper\/web_scraper\/spiders\/livescore_t.py<\/strong><\/p>\n<figure class=\"post-image post-mediaBleed alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1329041 lazy\" alt=\"\" width=\"809\" height=\"603\" sizes=\"auto, (max-width: 809px) 100vw, 809px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.42.05.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.42.05.png 1400w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.42.05-280x210.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.42.05-362x270.png 362w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.42.05-181x135.png 181w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.42.05-796x594.png 796w\"\/><\/figure>\n<h2><span class=\"ez-toc-section\" id=\"Code_explanation\"><\/span><span style=\"font-family: Georgia, 'Times New Roman', 'Bitstream Charter', Times, serif;\">Code explanation:<\/span><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li>As usual, import Scrapy.<\/li>\n<li>Create a class that inherits the properties and functionality of<span>\u00a0<\/span><strong>scrapy.Spider<\/strong>.<\/li>\n<li>Give a unique name to our spider. 
Here, I used<span>\u00a0<\/span><code>LiveScoreT<\/code><span>\u00a0<\/span>as we will only be extracting the tournament names.<\/li>\n<li>The next step is to provide the URL of Livescore.cz.<\/li>\n<li>Finally, the<span>\u00a0<\/span><code>parse()<\/code><span>\u00a0<\/span>function loops through all the matched elements that contain a<span>\u00a0<\/span><strong>tournament name<\/strong><span>\u00a0<\/span>and emits each one with<span>\u00a0<\/span><code>yield<\/code>; Scrapy collects every yielded item into the output. That way, we receive all the tournament names that have matches today. Note that this time I used a<span>\u00a0<\/span><strong>CSS<\/strong><span>\u00a0<\/span>selector instead of<span>\u00a0<\/span><strong>XPath<\/strong>.<\/li>\n<\/ul>\n<p id=\"run-the-newly-created-spider\"><strong>Run the newly created spider:<\/strong><\/p>\n<p>It\u2019s time to see our spider in action. Run the command below to let the spider crawl the home page of the Livescore.cz website. The scraped result will then be written to a new file called<span>\u00a0<\/span><strong>ls_t.json<\/strong><span>\u00a0<\/span>in JSON format.<\/p>\n<div class=\"highlight\">\n<pre><code class=\"language-shell\" data-lang=\"shell\">pipenv run scrapy crawl LiveScoreT -o ls_t.json\n<\/code><\/pre>\n<\/div>\n<p>By now you know the drill.<\/p>\n<p id=\"results-1\"><strong>Results:<\/strong><\/p>\n<p>This is what our web spider extracted on 18 November 2020 from<span>\u00a0<\/span><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.livescore.cz\/\">Livescore.cz<\/a>. 
Remember that the output may change every day.<\/p>\n<figure class=\"post-image post-mediaBleed alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1329046 lazy\" alt=\"\" width=\"795\" height=\"448\" sizes=\"auto, (max-width: 795px) 100vw, 795px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.43.49.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.43.49.png 1392w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.43.49-280x158.png 280w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.43.49-479x270.png 479w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.43.49-240x135.png 240w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.43.49-796x448.png 796w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.43.49-1200x675.png 1200w\"\/><\/figure>\n<h2 id=\"a-more-advanced-use-case\"><span class=\"ez-toc-section\" id=\"A_more_advanced_use_case\"><\/span>A more advanced use case<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In this section, instead of just retrieving the tournament name, we will go the extra mile and get complete details of tournaments and their matches.<\/p>\n<p>Create a new file inside<span>\u00a0<\/span><strong>\/web_scraper\/web_scraper\/spiders\/<\/strong><span>\u00a0<\/span>and name it<span>\u00a0<\/span><strong>livescore.py<\/strong>. 
Now, enter the code below in it.<\/p>\n<div class=\"highlight\">\n<pre><figure class=\"post-image post-mediaBleed alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1329048 lazy\" alt=\"\" width=\"706\" height=\"1536\" sizes=\"auto, (max-width: 706px) 100vw, 706px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.44.47.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.44.47.png 706w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.44.47-97x210.png 97w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.44.47-124x270.png 124w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.44.47-62x135.png 62w\"\/><\/figure><\/pre>\n<\/div>\n<h3 id=\"code-explanation-2\"><span class=\"ez-toc-section\" id=\"Code_explanation-2\"><\/span>Code explanation:<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The code structure of this file is the same as in our previous examples. Here, we just updated the<span>\u00a0<\/span><code>parse()<\/code><span>\u00a0<\/span>method with new functionality.<\/p>\n<p>We extracted all the HTML<span>\u00a0<\/span><code>&lt;tr&gt;&lt;\/tr&gt;<\/code><span>\u00a0<\/span>elements from the page. Then, we looped through them to find out whether each one is a tournament or a match. If it is a tournament, we extracted its name. 
In the case of a match, we extracted its time, its state, and the name and score of both teams.<\/p>\n<h3 id=\"run-the-example\"><span class=\"ez-toc-section\" id=\"Run_the_example\"><\/span>Run the example:<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Type the following command inside the console and execute it.<\/p>\n<div class=\"highlight\">\n<pre><code class=\"language-shell\" data-lang=\"shell\">pipenv run scrapy crawl LiveScore -o ls.json\n<\/code><\/pre>\n<\/div>\n<h3 id=\"results-2\"><span class=\"ez-toc-section\" id=\"Results\"><\/span>Results:<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Here is a sample of what has been retrieved:<\/p>\n<div class=\"highlight\">\n<pre><figure class=\"post-image post-mediaBleed alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1329049 lazy\" alt=\"\" width=\"702\" height=\"990\" sizes=\"auto, (max-width: 702px) 100vw, 702px\" src=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.45.33.png\" data-lazy=\"true\" srcset=\"https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.45.33.png 702w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.45.33-149x210.png 149w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.45.33-191x270.png 191w, https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Screenshot-2020-11-23-at-10.45.33-96x135.png 96w\"\/><\/figure><\/pre>\n<\/div>\n<p>With this data, we can do many things, such as training our own neural network to predict future games.<\/p>\n<h2 id=\"conclusion\"><span class=\"ez-toc-section\" 
id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Data analysts often use<span>\u00a0<\/span><strong>web scraping<\/strong><span>\u00a0<\/span>because it helps them collect the data they need to make predictions. Similarly, businesses use it to extract emails from web pages as a way of generating leads. We can even use it to monitor product prices.<\/p>\n<p>In other words, web scraping has many use cases, and<span>\u00a0<\/span><strong>Python<\/strong><span>\u00a0<\/span>is fully capable of handling them.<\/p>\n<p>So, what are you waiting for? Try scraping your favorite websites now.<\/p>\n<p><i><span style=\"font-weight: 400;\">This <\/span><\/i><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/livecodestream.dev\/post\/2020-11-18-how-to-turn-the-web-into-data-with-python-and-scrapy\/\"><i><span style=\"font-weight: 400;\">article<\/span><\/i><\/a><i><span style=\"font-weight: 400;\"> was originally published on <\/span><\/i><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/livecodestream.dev\/\"><i><span style=\"font-weight: 400;\">Live Code Stream<\/span><\/i><\/a><i><span style=\"font-weight: 400;\"> by <\/span><\/i><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/bajcmartinez\/\"><i><span style=\"font-weight: 400;\">Juan Cruz Martinez<\/span><\/i><\/a><i><span style=\"font-weight: 400;\"> (twitter: <\/span><\/i><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/twitter.com\/bajcmartinez\"><i><span style=\"font-weight: 400;\">@bajcmartinez<\/span><\/i><\/a><i><span style=\"font-weight: 400;\">), founder and publisher of Live Code Stream, entrepreneur, developer, author, speaker, and doer of things.<\/span><\/i><\/p>\n<p><a rel=\"nofollow noopener noreferrer\" target=\"_blank\" href=\"https:\/\/livecodestream.dev\/subscribe\"><i><span style=\"font-weight: 400;\">Live Code 
Stream<\/span><\/i><\/a><i><span style=\"font-weight: 400;\"> is also available as a free weekly newsletter. Sign up for updates on everything related to programming, AI, and computer science in general.<\/span><\/i><\/p>\n<\/div>\n<p><span style=\"color: black;\"><a style=\"color: #ff9900;\" 
href=\"https:\/\/thenextweb.com\/syndication\/2020\/11\/23\/a-beginners-guide-to-web-scraping-with-python-and-scrapy\/\" target=\"_blank\" rel=\"noopener noreferrer\">Source<\/a><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;#A beginner\u2019s guide to web scraping with Python and Scrapy&#8221; Since their inception,\u00a0websites\u00a0are used to share information. Whether it is a Wikipedia article, YouTube channel, Instagram account, or a Twitter handle. They all are packed with interesting data that is available for everyone with access to the\u00a0internet\u00a0and a\u00a0web browser. But, what if we want to&#8230;<\/p>\n","protected":false},"author":1,"featured_media":118168,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"fifu_image_url":"https:\/\/img-cdn.tnwcdn.com\/image\/tnw?filter_last=1&fit=1280,640&url=https:\/\/cdn0.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2020\/11\/Untitled-1.jpg&signature=c2f8e69f1a243f5ab6f00765abe7f882","fifu_image_alt":"","footnotes":""},"categories":[18],"tags":[72366,73706,70759,81112,81113,73708],"class_list":["post-118167","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology","tag-data","tag-python-programming-language","tag-tech","tag-web-browser","tag-web-crawler","tag-web-scraping"],"_links":{"self":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/118167","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/comments?post=118167"}],"version-history":[{"count":0,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/posts\/118167\/revisi
ons"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media\/118168"}],"wp:attachment":[{"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/media?parent=118167"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/categories?post=118167"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/buradabiliyorum.com\/en\/wp-json\/wp\/v2\/tags?post=118167"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}