{"id":612,"date":"2017-12-30T20:03:44","date_gmt":"2017-12-30T20:03:44","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=612"},"modified":"2018-01-23T05:37:34","modified_gmt":"2018-01-23T05:37:34","slug":"web-scraping-python","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/web-scraping-python\/","title":{"rendered":"Tutorial: How to do web scraping in Python?"},"content":{"rendered":"
When we work on data science projects, like the Titanic Survivors and Iowa House Prices projects, we need data sets to drive our predictions. In the above cases, those data sets had already been collected and prepared for us; we only needed to download the files and start our projects. But when we want to build our own data science projects, we need to prepare the data sets ourselves. That is easy when we can find free, public data sets in the UCI Machine Learning Repository or Kaggle Data Sets. But what if there is no suitable data set to be found? Don't worry, let's create one for ourselves, by web scraping.

### Tools for Web Scraping: Scrapy vs Beautiful Soup

There are plenty of web scraping tools to choose from on the internet. Since we use Python for most of our projects here, we will focus on a Python one: Scrapy. That brings up another debate topic: "Why don't you use Beautiful Soup, when Beautiful Soup can do the web scraping task as well?"

Yes, both Scrapy and Beautiful Soup can do the web scraping job. It all depends on **how you want to scrape data from the internet**. Scrapy is a web scraping **framework**, while Beautiful Soup is a **library**. With Scrapy you create bots (spiders) that crawl web content on their own, while Beautiful Soup is imported into your code and works alongside other libraries (e.g. requests) for web scraping. Scrapy provides a complete solution; Beautiful Soup, on the other hand, is quick and handy. When you need to scrape massive data or multiple pages from a web site, Scrapy is your choice. If you just want to scrape certain elements from a page, Beautiful Soup brings you what you want.

We can visualize the differences between Scrapy and Beautiful Soup in the following pictures:
*(Pictures: Scrapy vs Beautiful Soup comparison)*
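To make the contrast concrete, here is roughly what the quick-and-handy Beautiful Soup route looks like. This is a minimal sketch, not code from this project: the URL and the CSS class are made up for illustration.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a single page and pick out certain elements.
# Fine for one page; crawling, throttling and saving results are all on us.
page = requests.get("https://example.com/deals")   # hypothetical URL
soup = BeautifulSoup(page.text, "html.parser")
for title in soup.select(".item-title"):           # hypothetical CSS class
    print(title.get_text(strip=True))
```

Scrapy folds all of that housekeeping (scheduling, retries, pipelines, exporting) into the framework, which is exactly what we want for a crawl of thousands of items.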
### Web Scraping in Action

In this post, we are going to do a web scraping demonstration on eBay Daily Deals. We can expect to scrape around 3,000 eBay items at a time from the daily deals main page, plus its linked category pages.

Since we are scraping 3,000 items from eBay Daily Deals, we will use Scrapy as our scraping tool. First things first, let's get Scrapy into our environment with our good old pip command.
```
pip install Scrapy
```

Once Scrapy is installed, we can run the following command to get our scraping files framework (or, *spider egg sac*!):
```
scrapy startproject ebaybd
```

"ebaybd" is our project/spider name, and the *startproject* keyword will create our ~~spider egg sac~~ files framework with the following content:

```
ebaybd/              # our project folder
    scrapy.cfg       # scrapy configuration file (just leave it there, we won't touch it)
    ebaybd/          # the project's Python module (where we code our spider)
        items.py     # project items definition file (the items we ask our spider to scrape)
        pipelines.py # project pipelines file (what our spider does after getting an item)
        settings.py  # project settings file
        spiders/     # our spiders folder (where we code our core logic)
```
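Everything from here on happens inside that new folder. As a quick sanity check (assuming the scaffold was generated cleanly), you can ask Scrapy to list the spiders it knows about; it prints nothing yet, since we have not written one:

```
cd ebaybd
scrapy list
```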
### eBay Daily Deals spider hatching

Are you ready? Let's hatch a spider!

First, we edit the *items.py* file, as we need to tell our spider what to scrape for us. We create an *EBayItem* class and add our desired eBay fields there.
```python
import scrapy

class EBayItem(scrapy.Item):
    # the fields we want our spider to collect for each eBay deal
    name = scrapy.Field()
    category = scrapy.Field()
    link = scrapy.Field()
    img_path = scrapy.Field()
    currency = scrapy.Field()
    price = scrapy.Field()
    original_price = scrapy.Field()
```

Second, we need to tell our spider what to do once it has scraped the data we wanted. So we edit the *pipelines.py* file with the following content:
```python
import csv

class EBayBDPipeline(object):

    def open_spider(self, spider):
        # create the output CSV file and write the header row
        # (file_name will be defined on our spider)
        self.csv_file = open(spider.file_name, 'w', newline='', encoding='utf8')
        self.writer = csv.writer(self.csv_file)
        self.writer.writerow(['Item_name', 'Category', 'Link', 'Image_path',
                              'Currency', 'Price', 'Original_price'])

    def process_item(self, item, spider):
        # write one scraped item per CSV row
        self.writer.writerow([item['name'], item['category'], item['link'],
                              item['img_path'], item['currency'], item['price'],
                              item['original_price']])
        return item

    def close_spider(self, spider):
        # release the file handle once the spider finishes
        self.csv_file.close()
```
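One detail the snippet above does not show: Scrapy only runs a pipeline that is enabled in *settings.py*. A minimal sketch, assuming the module path from our *startproject* scaffold (the number sets the pipeline's run order, from 0 to 1000):

```python
# settings.py
ITEM_PIPELINES = {
    'ebaybd.pipelines.EBayBDPipeline': 300,
}
```

With the items and the pipeline in place, what remains is the spider itself in the *spiders/* folder; once it is written, running `scrapy crawl` with its name will send every scraped item through EBayBDPipeline into our CSV file.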