{"id":612,"date":"2017-12-30T20:03:44","date_gmt":"2017-12-30T20:03:44","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=612"},"modified":"2018-01-23T05:37:34","modified_gmt":"2018-01-23T05:37:34","slug":"web-scraping-python","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/web-scraping-python\/","title":{"rendered":"Tutorial: How to do web scraping in Python?"},"content":{"rendered":"

When we work on data science projects, like the Titanic Survivors and Iowa House Prices projects, we need data sets to process our predictions. In the above cases, those data sets have already been collected and prepared; we only need to download the data set files and start our projects. But when we want to work on our own data science projects, we need to prepare the data sets ourselves. That is easy when we can find free and public data sets in the UCI Machine Learning Repository or Kaggle Data Sets. But what if there is no suitable data set to be found? Don't worry, let's create one for ourselves, by web scraping.


Tools for Web Scraping: Scrapy vs Beautiful Soup

There are plenty of choices of web scraping tools on the internet. Since we have used Python for most of our projects here, we will focus on a Python one: Scrapy. That invites another debate topic: "Why don't you use Beautiful Soup, when Beautiful Soup can do the web scraping task as well?"

Yes, both Scrapy and Beautiful Soup can do the web scraping job. It all depends on how you want to scrape the data from the internet. Scrapy is a web scraping framework while Beautiful Soup is a library. You can use Scrapy to create bots (spiders) that crawl web content on their own, and you can import Beautiful Soup in your code to work with other libraries (e.g. requests) for web scraping. Scrapy provides a complete solution, while Beautiful Soup can be quick and handy. When you need to scrape massive data or multiple pages from a web site, Scrapy would be your choice. If you just want to scrape certain elements from a page, Beautiful Soup can bring you what you want.
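To make the contrast concrete, here is a minimal Beautiful Soup sketch for the "certain elements from a page" case (the URL and the "item-title" CSS class are hypothetical placeholders, not part of this project):

# A quick one-page scrape with requests + Beautiful Soup.
# The URL and the "item-title" CSS class are hypothetical examples.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/deals")
soup = BeautifulSoup(response.text, "html.parser")

# Print the text of every element carrying the (assumed) "item-title" class
for title in soup.select(".item-title"):
    print(title.get_text(strip=True))

A dozen lines and we are done. But for crawling thousands of items across many pages, with request scheduling, retries and item pipelines, a framework like Scrapy pays off.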

We can visualize the differences between Scrapy and Beautiful Soup in the following pictures:

\"\"\"\"\"\"<\/p>\n

Web Scraping in Action

In this post, we are going to do a web scraping demonstration on eBay Daily Deals. We can expect to scrape around 3000 eBay items per run from the daily deals main page, plus its linked category pages.

Since we are scraping 3000 items from eBay Daily Deals, we will use Scrapy as our scraping tool. First things first, let's get Scrapy into our environment with our good old pip command.

pip install Scrapy

Once Scrapy is installed, we can run the following command to get our scraping files framework (or, spider egg sac!)

scrapy startproject ebaybd

“ebaybd” is our project/spider name, and the startproject keyword creates our files framework with the following content:

ebaybd/                   # our project folder
    scrapy.cfg            # Scrapy configuration file (just leave it there, we won't touch it)
    ebaybd/               # project's Python module (where we code our spider)
        items.py          # project items definition file (the items we ask our spider to scrape)
        pipelines.py      # project pipelines file (what our spider does after getting an item)
        settings.py       # project settings file
        spiders/          # our spider folder (where we code our core logic)

Are you ready? Let's hatch a spider!

eBay Daily Deals spider hatching

First, we edit the items.py file, as we need to tell our spider what to scrape for us. We create an EBayItem class and add our desired eBay fields there.

import scrapy

class EBayItem(scrapy.Item):
    name = scrapy.Field()
    category = scrapy.Field()
    link = scrapy.Field()
    img_path = scrapy.Field()
    currency = scrapy.Field()
    price = scrapy.Field()
    original_price = scrapy.Field()

Second, we need to tell our spider what to do once it has scraped the data we want. So we edit the pipelines.py file with the following content:

import csv

class EBayBDPipeline(object):

    def open_spider(self, spider):
        self.file = csv.writer(open(spider.file_name, 'w', newline='', encoding='utf8'))
        fieldnames = ['Item_name', 'Category', 'Link', 'Image_path', 'Currency', 'Price', 'Original_price']
        self.file.writerow(fieldnames)

    def process_item(self, item, spider):
        self.file.writerow([item['name'], item['category'], item['link'],
                            item['img_path'], item['currency'], item['price'],
                            item['original_price']])
        return item

We create an EBayBDPipeline class to make the spider save scraped data into a CSV file. Although Scrapy has its built-in CSV exporter, making our own exporter provides better customization.
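For comparison, when no customization is needed, Scrapy's built-in feed export can produce the CSV without any pipeline code at all, via the -o option (the output file name here is just an example):

scrapy crawl ebaybd -o deals.csv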

Now we have our scraped item class and the scraper pipeline class; next we need to connect the two classes together. So we work on the settings.py file by adding:

ITEM_PIPELINES = {
    'ebaybd.pipelines.EBayBDPipeline': 300
}

This tells our spider to run the EBayBDPipeline class after scraping an item. The number 300 after the pipeline class determines the processing order in a multiple-pipeline environment. The value can range from 0 to 1000; since we only have one pipeline class in this project, the value does not matter here.
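For illustration, if we later added a second, hypothetical pipeline (say, a price validator), the numbers would decide the order, with lower values running first:

ITEM_PIPELINES = {
    'ebaybd.pipelines.PriceValidationPipeline': 100,   # hypothetical pipeline, would run first
    'ebaybd.pipelines.EBayBDPipeline': 300,            # our CSV writer runs second
}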

Build a Spider

After setting up the item and pipeline classes, it is time for our main event: building a spider. We create a spider file, ebay_deals_spider.py, in our "spiders" folder:

ebaybd/
    ebaybd/
       spiders/
          ebay_deals_spider.py            # our newly created spider core logic file

Inside the spider file, we import the required classes/modules.

import scrapy
from scrapy.http import HtmlResponse   # Scrapy's HTML response class
import json, re, datetime              # for JSON, regular expression and date/time functions
from ebaybd.items import EBayItem      # the item class we created in items.py

Then we add a function to remove the currency symbol and thousands separators from a scraped item price.

def formatPrice(price, currency):
    if price is None:
        return None

    price = price.replace(currency, "")
    price = price.replace(",", "")
    price = price.strip()
    return price
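A quick sanity check of the function, assuming a sample eBay price string like "US $1,299.00":

print(formatPrice("US $1,299.00", "US $"))   # -> "1299.00"
print(formatPrice(None, "US $"))             # -> None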

And a function to parse the HTML content of a deal into our EBayItem class.

def getItemInfo(htmlResponse, category):
    eBayItem = EBayItem()

    name = htmlResponse.css(".ebayui-ellipsis-2::text").extract_first()
    if name is None:
        name = htmlResponse.css(".ebayui-ellipsis-3::text").extract_first()
    link = htmlResponse.css("h3.dne-itemtile-title.ellipse-2 a::attr(href)").extract_first()
    if link is None:
        link = htmlResponse.css("h3.dne-itemtile-title.ellipse-3 a::attr(href)").extract_first()
    eBayItem['name'] = name
    eBayItem['category'] = category
    eBayItem['link'] = link
    eBayItem['img_path'] = htmlResponse.css("div.slashui-image-cntr img::attr(src)").extract_first()
    currency = htmlResponse.css(".dne-itemtile-price meta::attr(content)").extract_first()
    if currency is None:
        currency = htmlResponse.css(".dne-itemtile-original-price span::text").extract_first()[:3]
    eBayItem['currency'] = currency
    eBayItem['price'] = formatPrice(htmlResponse.css(".dne-itemtile-price span::text").extract_first(), currency)
    eBayItem['original_price'] = formatPrice(htmlResponse.css(".dne-itemtile-original-price span::text").extract_first(), currency)

    return eBayItem

We use Scrapy's selectors to extract data with CSS expressions. For example, we scrape an item name with:

name = htmlResponse.css(".ebayui-ellipsis-2::text").extract_first()

That means we scrape the text of the first element carrying the "ebayui-ellipsis-2" CSS class, as the item name.
(You can use your browser: right-click an eBay daily deals page and select "Inspect" to get the following screen)
[Screenshot: browser developer tools inspecting an eBay daily deals item]

Inside the spider's brain

For our spider class, where we code the core logic, we have two major types of functions: "start request" (start_requests) and "parse response" (parse and parse_cat_listing).

class BDSpider(scrapy.Spider):
    name = "ebaybd"
    # use the current date as the file name ("%Y-%m-%d" is more portable than "%F")
    file_name = datetime.datetime.now().strftime("%Y-%m-%d") + ".csv"

    def start_requests(self):
        ...

    def parse(self, response):
        ...

    def parse_cat_listing(self, response):
        ...

The start_requests function is pretty straightforward: we tell our spider where (which URLs) to start scraping.

def start_requests(self):
    urls = [
        'https://www.ebay.com/globaldeals'
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

In our case, the global deals URL (https://www.ebay.com/globaldeals) is used, so people from anywhere (the US, Germany, India, South Korea, etc.) can all get their eBay daily deals. After getting the URL request, we ask the spider to parse the URL content in the parse function. You may notice the keyword yield is used instead of return in the spider. Unlike the return keyword, which sends back an entire list in memory at once, the yield keyword returns a generator object, which lets the spider handle the parse requests one by one.
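A tiny standalone example (not part of the spider) shows the difference:

# return builds and hands back the whole list in memory at once
def all_squares(n):
    return [i * i for i in range(n)]

# yield produces a generator that hands back one value at a time, on demand
def gen_squares(n):
    for i in range(n):
        yield i * i

print(all_squares(3))        # [0, 1, 4]
print(list(gen_squares(3)))  # the generator lazily produces 0, 1, 4

We will see more of the yield keyword in our parse function: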

def parse(self, response):
    # spotlight deals
    spl_deal = response.css(".ebayui-dne-summary-card.card.ebayui-dne-item-featured-card--topDeals")
    spl_title = spl_deal.css("h2 span::text").extract_first()
    eBayItem = getItemInfo(spl_deal, spl_title)
    yield eBayItem

    # featured deals
    feature_deal_title = response.css(".ebayui-dne-banner-text h2 span::text").extract_first()
    feature_deals_card = response.css(".ebayui-dne-item-featured-card")
    feature_deals = feature_deals_card.css(".col")
    for feature in feature_deals:
        eBayItem = getItemInfo(feature, feature_deal_title)
        yield eBayItem

    # card deals
    cards = response.css(".ebayui-dne-item-pattern-card.ebayui-dne-item-pattern-card-no-padding")
    for card in cards:
        title = card.css("h2 span::text").extract_first()
        more_link = card.css(".dne-show-more-link a::attr(href)").extract_first()
        if more_link is not None:
            cat_id = re.sub(r"^https://www.ebay.com/globaldeals/|featured/|/all$", "", more_link)
            cat_id = re.sub("/", ",", cat_id)
            cat_listing = "https://www.ebay.com/globaldeals/spoke/ajax/listings?_ofs=0&category_path_seo={}&deal_type=featured".format(cat_id)

            request = scrapy.Request(cat_listing, callback=self.parse_cat_listing)
            request.meta['category'] = title
            request.meta['page_index'] = 1
            request.meta['cat_id'] = cat_id
            yield request
        else:
            self.log("Get item on page for {}".format(title))
            category_deals = card.css(".item")
            for c_item in category_deals:
                eBayItem = getItemInfo(c_item, title)
                yield eBayItem

When there is a spotlight deal, our spider scrapes the item and runs the item pipeline task. The same workflow applies to featured deals; the difference is that there is more than one item among the featured deals. Since we are using the yield keyword, we can invoke the item pipeline for each item without stopping the iteration.

Then come the card deals: if a card contains a link to a categorized daily deals page, our spider scrapes the daily deals items from that categorized page instead. And on those categorized daily deals pages, a tricky situation arises: infinite scrolling.

Infinite Scroll Pagination Handling

The eBay categorized deals pages use infinite scrolling to display daily deals items. There is no "next" button for showing more items; the categorized deals page displays more items once a user has scrolled down the page a little. Since there is no pagination element on the page, we cannot point our spider at one with a CSS or XPath selector. But items do not come from nowhere; they must be loaded from somewhere. Infinite scrolling uses Ajax calls to scroll endlessly, so we should inspect the page's network activity.
(Right-click on your browser, select "Inspect" and click the "Network" tab)

\"\"<\/p>\n

Try scrolling the page and see whether any JavaScript call is processed. And ding! An Ajax call to https://www.ebay.com/globaldeals/spoke/ajax/listings is found. It returns 24 eBay items for a given category per call. We then request this URL ourselves and handle the response in our other parse function, parse_cat_listing.
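Since each call returns 24 items, the _ofs offset advances in steps of 24. A small sketch (cat_id here is a placeholder) of how the listing URLs line up:

# Sketch: the first few Ajax listing URLs for one category.
AJAX_URL = "https://www.ebay.com/globaldeals/spoke/ajax/listings?_ofs={}&category_path_seo={}&deal_type=featured"
cat_id = "some-category-id"   # hypothetical category path
for page_index in range(3):
    print(AJAX_URL.format(page_index * 24, cat_id))   # _ofs=0, 24, 48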

Scrapy and JavaScript

The parse_cat_listing function is where we handle the response from the JavaScript call and transform it into EBayItem objects.

def parse_cat_listing(self, response):
    category = response.meta['category']
    page_index = response.meta['page_index']
    cat_id = response.meta['cat_id']

    data = json.loads(response.body)
    fulfillment_value = data.get('fulfillmentValue')
    listing_html = fulfillment_value['listingsHtml']
    is_last_page = fulfillment_value['pagination']['isLastPage']
    json_response = HtmlResponse(url="json response", body=listing_html, encoding='utf-8')
    items_on_cat = json_response.css(".col")

    for item in items_on_cat:
        eBayItem = getItemInfo(item, category)
        yield eBayItem

    if not is_last_page:
        item_starting_index = page_index * 24
        cat_listing = "https://www.ebay.com/globaldeals/spoke/ajax/listings?_ofs={}&category_path_seo={}&deal_type=featured".format(item_starting_index, cat_id)
        request = scrapy.Request(cat_listing, callback=self.parse_cat_listing)
        request.meta['category'] = category
        request.meta['page_index'] = page_index + 1
        request.meta['cat_id'] = cat_id
        yield request

Since the JavaScript call returns a JSON object containing the item listing HTML content, we first obtain that HTML content as text.

data = json.loads(response.body)
fulfillment_value = data.get('fulfillmentValue')
listing_html = fulfillment_value['listingsHtml']

Then we use Scrapy's HtmlResponse class to parse the text as HTML. We apply UTF-8 encoding for cases where special characters appear in an item name.

json_response = HtmlResponse(url="json response", body=listing_html, encoding='utf-8')

After that, we can apply the same getItemInfo logic we used in the parse function.

As the listing scrolls infinitely, we keep requesting the Ajax call recursively until the JSON object returns the last-page flag.

The Scraping Result

With all our spider files set, it is time to let the spider scrape by running the following command:

scrapy crawl ebaybd

Our "ebaybd" spider then scrapes and saves the results in a "YYYY-MM-DD.csv" file. Below is the result of my run: about 3000 records in one minute of processing time.

\"\"<\/p>\n

Our spider is done here. You may extend its functionality by scraping other categories or item attributes, or by storing the results in MongoDB, a cloud service, etc. Happy scraping!
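As a follow-up sketch, the CSV feeds straight into the usual data science workflow (assuming pandas is installed; the file name depends on your run date):

import pandas as pd

# Load the scraped deals into a DataFrame; adjust the file name to your run date.
deals = pd.read_csv("2018-01-23.csv")
print(deals[["Item_name", "Category", "Price"]].head())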

What have we learnt in this post?
1. Differences between Scrapy and Beautiful Soup
2. Building our own Scrapy spider
3. Using our own item pipeline processor
4. Handling infinite scrolling


The source code package can be found at https://github.com/codeastar/ebay-daily-deals-scraper.
