{"id":1540,"date":"2018-12-16T19:12:19","date_gmt":"2018-12-16T19:12:19","guid":{"rendered":"https:\/\/www.codeastar.com\/?p=1540"},"modified":"2018-12-22T15:37:48","modified_gmt":"2018-12-22T15:37:48","slug":"word-cloud-easy-python-job-seekers","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/word-cloud-easy-python-job-seekers\/","title":{"rendered":"Word Cloud for Job Seekers in Python"},"content":{"rendered":"\n
We tried a Python web scraping project<\/a> using scrapy<\/a> and a text mining project<\/a> using TFIDF<\/a> in the past. This time, we are going to raise the bar by combining the two projects into one: a text mining web scraper. As in our earlier post<\/a> on the CodeAStar blog, it is always good to build something useful with fewer lines of code. We will then use our text mining web scraper to make a Word Cloud for Job Seekers.<\/p>\n\n\n\n\n\n\n “Cloud” Funding<\/h3>\n\n\n\n The cloud we are going to build is a word cloud containing key information for job seekers, so they can shape better job-hunting strategies based on what they see in the cloud. We will use web scraping techniques to get job information from indeed.com<\/a>. Indeed is chosen for its popularity, its clean layout and its support for a wide range of countries (we will use the default US site, indeed.com, in this project).<\/span><\/p>\n So we are going to: scrape job posting links from Indeed, collect the job descriptions behind those links, weight the important terms with TFIDF, and plot those terms as a word cloud. <\/span><\/p>\n And our output should look like: <\/p>\n This project is straightforward and easy. What are we waiting for? Let’s code it!<\/p>\n\n\n
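 Before we start, here is a minimal sketch of the core idea we are working towards, using two made-up job description snippets instead of real Indeed data: weight the words with a TFIDF vectorizer, then feed the weights to the wordcloud package to draw the cloud. <\/p>\n\n\n\n
from sklearn.feature_extraction.text import TfidfVectorizer\nfrom wordcloud import WordCloud\nimport matplotlib.pyplot as plt\n\n# two made-up job description snippets, for illustration only\ndocs = ['manage tenant relations and property maintenance requests',\n        'prepare annual budgets and negotiate vendor contracts']\n\n# weight each term by TFIDF, ignoring common English stop words\nvectorizer = TfidfVectorizer(stop_words='english')\ntfidf = vectorizer.fit_transform(docs)\n\n# sum each term's weight over all documents to get one score per word\nweights = {term: tfidf[:, idx].sum() for term, idx in vectorizer.vocabulary_.items()}\n\n# turn the scores into a word cloud and display it\ncloud = WordCloud(background_color='white').generate_from_frequencies(weights)\nplt.imshow(cloud, interpolation='bilinear')\nplt.axis('off')\nplt.show()<\/pre>\n\n\n\n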
\n
Code A Cloud<\/h3>\n\n\n\n Now we not only code a star, we code a cloud as well :]] . As we did in the EZ Weather Flask app<\/a> project, we use pipenv<\/a> to set up our development environment. <\/p>\n\n\n\n
$pipenv --three\n$pipenv shell<\/code><\/pre>\n\n\n\n Get the package list file, Pipfile, from here<\/a> and put it in your development folder. Then we can install all the required modules with just one command. <\/p>\n\n\n\n
$pipenv install<\/code><\/pre>\n\n\n\n We name our file “indeedminer.py”, and the file name says it all. Inside the file, we first import the required packages and code the “gather job search query” part.<\/p>\n\n\n
from bs4 import BeautifulSoup\nfrom urllib.request import urlopen\nfrom urllib.parse import urlencode\nfrom tqdm import tqdm\nimport nltk\nfrom nltk.corpus import stopwords\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nimport matplotlib.pyplot as plt\nfrom wordcloud import WordCloud\nimport sys, re, string, datetime\n\n# show the usage and quit when the search keywords or location are missing\nif (len(sys.argv) < 3):\n    print(\"\\n\\tUsage: indeedminer.py [search keywords] [location] [optional: search page count]\")\n    print('\\te.g. $pythonw indeedminer.py \"HR Manager\" \"New York\"\\n')\n    sys.exit()\n\n# number of search result pages to mine, default is 1\nsearch_page = 1\nif (len(sys.argv) > 3):\n    search_page = int(sys.argv[3])\n\n# build the indeed.com query parameters: q = search keywords, l = location\nsearch_keyword = sys.argv[1]\nlocation = sys.argv[2]\nparams = {\n    'q': search_keyword,\n    'l': location\n}<\/pre>\n\n\n The above code snippet is pretty straightforward. We accept three arguments from the command line: search keywords, a location and an optional page count. The search keywords can be a job title, an industry or a company name, and the page count is the number of search result pages we use to build our word cloud. So a usage example can be: <\/p>\n\n\n\n
$pythonw indeedminer.py \"property manager\" \"Phoenix, AZ\" 3<\/code><\/pre>\n\n\n\n i.e. we build a word cloud for “property manager” jobs in “Phoenix, AZ” using 3 search result pages. <\/p>\n\n\n\n Please note that “pythonw” is used instead of “python” in the above command. As the word cloud is a graphical component, we need a GUI-capable interpreter to open a window and display it. <\/p>\n\n\n\n In our past project, we used scrapy to build a scraping platform and used it to scrape daily deals from eBay. Since we are building a simple scraper this time, not a scraping platform, Beautiful Soup<\/a> is the right tool for us. <\/p>\n\n\n
Scraping with “Soup”<\/h3>\n\n\n\n
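 Before the full scraper, here is a tiny warm-up sketch of the Beautiful Soup pattern we rely on below: open a page with urlopen, parse it with the lxml parser, and collect the links. The example.com address is only a placeholder for illustration. <\/p>\n\n\n\n
from urllib.request import urlopen\nfrom bs4 import BeautifulSoup\n\n# fetch and parse a page (example.com is just a stand-in URL)\nhtml = urlopen('https:\/\/example.com')\nsoup = BeautifulSoup(html, 'lxml')\n\n# list every anchor tag and its link target\nfor anchor in soup.find_all('a'):\n    print(anchor.get('href'))<\/pre>\n\n\n\n
 Now we apply the same pattern to the Indeed search result pages to collect the job posting links: <\/p>\n\n\n\n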
url_prefix = \"https:\/\/www.indeed.com\" #replace url_prefix with your favorite country from https:\/\/www.indeed.com\/worldwide\nurl = url_prefix + \"\/jobs?\"+urlencode(params)\n\ndef getJobInfoLinks(url, next_page_count, url_prefix):\n job_links_arr = [] \n while True: \n if (next_page_count < 1):\n break \n next_page_count -= 1 \n html = urlopen(url)\n soup = BeautifulSoup(html, 'lxml') \n job_links_arr += getJobLinksFromIndexPage(soup)\n pagination = soup.find('div', {'class':'pagination'}) \n next_link = \"\"\n\n for page_link in reversed(pagination.find_all('a')):\n next_link_idx = page_link.get_text().find(\"Next\")\n if (next_link_idx >= 0):\n next_link = page_link.get('href') \n break \n if (next_link == \"\"):\n break \n url = url_prefix+next_link \n return job_links_arr\n \ndef getJobLinksFromIndexPage(soup): \n jobcards = soup.find_all('div', {'class':'jobsearch-SerpJobCard row result'}) \n job_links_arr = [] \n for jobcard in tqdm(jobcards): \n job_title_obj = jobcard.find('a', {'class':'turnstileLink'})\n job_title_link = job_title_obj.get('href')\n job_links_arr.append(job_title_link) \n return job_links_arr\n\ncurrent_datetime = datetime.datetime.today().strftime('%Y-%m-%d %H:%M:%S')\nprint(\"Getting job links in {} page(s)...\".format(search_page))\njob_links_arr = getJobInfoLinks(url, search_page, url_prefix)\n<\/pre>\n