{"id":612,"date":"2017-12-30T20:03:44","date_gmt":"2017-12-30T20:03:44","guid":{"rendered":"http:\/\/www.codeastar.com\/?p=612"},"modified":"2018-01-23T05:37:34","modified_gmt":"2018-01-23T05:37:34","slug":"web-scraping-python","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/web-scraping-python\/","title":{"rendered":"Tutorial: How to do web scraping in Python?"},"content":{"rendered":"
When we work on data science projects, like the Titanic Survivors and Iowa House Prices projects, we need data sets to drive our predictions. In the above cases, those data sets had already been collected and prepared for us; we only needed to download the files and start our projects. But when we want to build our own data science projects, we need to prepare the data sets ourselves. That is easy when we can find free, public data sets in the UCI Machine Learning Repository or Kaggle Data Sets. But what if there is no suitable data set to be found? Don't worry, let's create one for ourselves, by web scraping.

### Tools for Web Scraping: Scrapy vs Beautiful Soup

There are plenty of web scraping tools to choose from on the internet. Since we use Python for most of our projects here, we will focus on a Python one: Scrapy. That brings up another debate topic: "Why don't you use Beautiful Soup, when Beautiful Soup can do the web scraping task as well?"

Yes, both Scrapy and Beautiful Soup can do the web scraping job. It all depends on **how you want to scrape data from the internet**. Scrapy is a web scraping **framework**, while Beautiful Soup is a **library**. With Scrapy you create bots (spiders) that crawl web content on their own, while Beautiful Soup is imported into your code and works alongside other libraries (e.g. requests) for web scraping. Scrapy provides a complete solution; Beautiful Soup, on the other hand, is quick and handy. When you need to scrape massive data or multiple pages from a web site, Scrapy is your choice. If you just want to scrape certain elements from a page, Beautiful Soup brings you what you want.

We can visualize the differences between Scrapy and Beautiful Soup in the following pictures:
*(Pictures: Scrapy vs Beautiful Soup comparison)*
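To make the contrast concrete, here is roughly what the quick-and-handy Beautiful Soup route looks like. This is a minimal sketch, not code from this project: the URL and the CSS class are made up for illustration.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a single page and pick out certain elements.
# Fine for one page; crawling, throttling and saving results are all on us.
page = requests.get("https://example.com/deals")   # hypothetical URL
soup = BeautifulSoup(page.text, "html.parser")
for title in soup.select(".item-title"):           # hypothetical CSS class
    print(title.get_text(strip=True))
```

Scrapy folds all of that housekeeping (scheduling, retries, pipelines, exporting) into the framework, which is exactly what we want for a crawl of thousands of items.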
### Web Scraping in Action

In this post, we are going to do a web scraping demonstration on eBay Daily Deals. We can expect to scrape around 3,000 eBay items at a time from the daily deals main page, plus its linked category pages.

Since we are scraping 3,000 items from eBay Daily Deals, we will use Scrapy as our scraping tool. First things first, let's get Scrapy into our environment with our good old pip command.
```
pip install Scrapy
```

Once Scrapy is installed, we can run the following command to get our scraping files framework (or, *spider egg sac*!):
```
scrapy startproject ebaybd
```

"ebaybd" is our project/spider name, and the *startproject* keyword will create our ~~spider egg sac~~ files framework with the following content:

```
ebaybd/              # our project folder
    scrapy.cfg       # scrapy configuration file (just leave it there, we won't touch it)
    ebaybd/          # the project's Python module (where we code our spider)
        items.py     # project items definition file (the items we ask our spider to scrape)
        pipelines.py # project pipelines file (what our spider does after getting an item)
        settings.py  # project settings file
        spiders/     # our spiders folder (where we code our core logic)
```
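Everything from here on happens inside that new folder. As a quick sanity check (assuming the scaffold was generated cleanly), you can ask Scrapy to list the spiders it knows about; it prints nothing yet, since we have not written one:

```
cd ebaybd
scrapy list
```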
### eBay Daily Deals spider hatching

Are you ready? Let's hatch a spider!

First, we edit the *items.py* file, as we need to tell our spider what to scrape for us. We create an *EBayItem* class and add our desired eBay fields there.
```python
import scrapy

class EBayItem(scrapy.Item):
    # the fields we want our spider to collect for each eBay deal
    name = scrapy.Field()
    category = scrapy.Field()
    link = scrapy.Field()
    img_path = scrapy.Field()
    currency = scrapy.Field()
    price = scrapy.Field()
    original_price = scrapy.Field()
```

Second, we need to tell our spider what to do once it has scraped the data we wanted. So we edit the *pipelines.py* file with the following content:
```python
import csv

class EBayBDPipeline(object):

    def open_spider(self, spider):
        # create the output CSV file and write the header row
        # (file_name will be defined on our spider)
        self.csv_file = open(spider.file_name, 'w', newline='', encoding='utf8')
        self.writer = csv.writer(self.csv_file)
        self.writer.writerow(['Item_name', 'Category', 'Link', 'Image_path',
                              'Currency', 'Price', 'Original_price'])

    def process_item(self, item, spider):
        # write one scraped item per CSV row
        self.writer.writerow([item['name'], item['category'], item['link'],
                              item['img_path'], item['currency'], item['price'],
                              item['original_price']])
        return item

    def close_spider(self, spider):
        # release the file handle once the spider finishes
        self.csv_file.close()
```
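One detail the snippet above does not show: Scrapy only runs a pipeline that is enabled in *settings.py*. A minimal sketch, assuming the module path from our *startproject* scaffold (the number sets the pipeline's run order, from 0 to 1000):

```python
# settings.py
ITEM_PIPELINES = {
    'ebaybd.pipelines.EBayBDPipeline': 300,
}
```

With the items and the pipeline in place, what remains is the spider itself in the *spiders/* folder; once it is written, running `scrapy crawl` with its name will send every scraped item through EBayBDPipeline into our CSV file.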