Last time, we coded a Python word cloud generator for Indeed.com users. This time, since Christmas is here, I would like to code a job seeking word cloud for my hometown — Hong Kong! So whether you are living in Hong Kong or looking for a job there, I hope this Hong Kong edition word cloud can help you find what you want.
The Easy Way
Since our project this time is about the Hong Kong job market, we can still use the same Indeed web miner from our last post. Simply change the url_prefix variable from “https://www.indeed.com” to “https://www.indeed.hk”, and we can build a word cloud for the Hong Kong job market.
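In code, it is literally a one-line change:
# the only change needed in the last post's Indeed web miner
url_prefix = "https://www.indeed.hk"  # was: "https://www.indeed.com"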
Really? Really that simple? Indeed. But we are going to do it in a better and more localized way.
The Hong Kong Way 🇭🇰
Besides Indeed.com, there is another popular job search site in Hong Kong — jobsdb.com. JobsDB is the No. 1 recruitment website in Hong Kong, used by many local companies and by multinational corporations’ regional offices there.
So this time, we build the JobsDB web miner on the Indeed web miner’s skeleton: we capture the user’s input, analyze the JobsDB results using TF-IDF, and generate a word cloud.
First, we import the required modules (yes, they look similar to the Indeed web miner’s :]] ):
from bs4 import BeautifulSoup
import urllib.request
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import sys, re, string, datetime, getopt
from os import path
from PIL import Image
import numpy as np
And capture the user’s search keyword, then assign url_prefix to the Hong Kong JobsDB website:
search_keyword = sys.argv[1]  # the user's search input, e.g. "programmer" (variable name assumed)
url_prefix = "https://hk.jobsdb.com"
params = "/hk/search-jobs/{}/".format(search_keyword)
location = "Hong Kong"
Scraping with “Soup” (again)
As we did in the Indeed web miner, we use Beautiful Soup to scrape the search results from JobsDB:
def getJobLinksFromIndexPage(soup, url_prefix):
    job_links_arr = []
    for link in tqdm(soup.find_all('a', attrs={'href': re.compile("^"+url_prefix+"/hk/en/job/")})):
        job_title_link = link.get('href')
        job_links_arr.append(job_title_link)
    return job_links_arr
def getJobInfoLinks(url, next_page_count, url_prefix,
                    params, start_page):
    job_links_arr = []
    while True:
        # define a user agent, as it is a required header for browsing JobsDB
        req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
        html = urllib.request.urlopen(req)
        soup = BeautifulSoup(html, 'lxml')
        job_links_arr += getJobLinksFromIndexPage(soup, url_prefix)
        start_page += 1
        if start_page > next_page_count:
            break
        next_page_tag = "{}{}".format(params, start_page)
        next_link = soup.find('a', attrs={'href': re.compile("^"+next_page_tag)})
        if next_link is None:
            break
        url = url_prefix + next_link.get('href')
    return job_links_arr
start_page = 1
url = "{}{}{}".format(url_prefix, params, start_page)
current_datetime = datetime.datetime.today().strftime('%Y-%m-%d %H:%M:%S')
print("Getting job links in {} page(s)...".format(search_page))
job_links_arr = getJobInfoLinks(url, search_page, url_prefix,
                                params, start_page)
Please note that there are two minor differences in our JobsDB web miner:
- we find job links by hyperlink pattern instead of div class
- we assign “Magic Browser” as the user agent to pass JobsDB’s request check
Once we have the job links, we can scrape the job details:
punctuation = string.punctuation
job_desc_arr = []
print("Getting job details in {} post(s)...".format(len(job_links_arr)))
for job_link in tqdm(job_links_arr):
    req = urllib.request.Request(job_link, headers={'User-Agent': "Magic Browser"})
    job_html = urllib.request.urlopen(req)
    job_soup = BeautifulSoup(job_html, 'lxml')
    job_desc = job_soup.find('div', {'class': 'jobad-primary-details'})
    for li_tag in job_desc.findAll('li'):
        li_tag.insert(0, " ")  # add a space before each list item so words do not stick together
    job_desc = job_desc.get_text()
    job_desc = re.sub(r'https?:\/\/.*[\r\n]*', '', job_desc, flags=re.MULTILINE)
    job_desc = job_desc.translate(job_desc.maketrans(punctuation, ' ' * len(punctuation)))
    job_desc_arr.append(job_desc)
Other than the “Magic Browser” user agent and the JobsDB div class name, we are using the same code snippet as the Indeed web miner.
After getting the job details, we use the same TF-IDF code to obtain the top 500 weighted words in transformed_job_desc.
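For completeness, that step boils down to something like this (a sketch; the vectorizer settings and the freqs_dict name are assumptions based on how the word cloud is fitted below):
# a sketch of the TF-IDF step from the last post; settings are assumptions
stop_words = stopwords.words('english')  # nltk.download('stopwords') may be needed once
vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=500)
transformed_job_desc = vectorizer.fit_transform(job_desc_arr)
# sum each word's TF-IDF weight across all job posts
weights = np.asarray(transformed_job_desc.sum(axis=0)).ravel()
# freqs_dict feeds WordCloud.fit_words() later
# (on newer scikit-learn, use get_feature_names_out())
freqs_dict = dict(zip(vectorizer.get_feature_names(), weights))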
Visualizing the Word Cloud again, with a little twist…
Finally, we are in the word cloud generation stage. We can use the same code from the last post to generate a word cloud, so I use “programmer” as the search input with 10 search result pages.
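For reference, the plain word cloud step from the last post boils down to something like this (a sketch; the size and colour parameters are assumed to match the junk boat version shown later):
# a sketch of the plain word cloud from the last post
w = WordCloud(width=800, height=600, background_color='white',
              max_words=500).fit_words(freqs_dict)
plt.imshow(w, interpolation="bilinear")
plt.axis("off")
plt.show()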
Here we go:
We can find “web”, “java”, “sql”, “.net”, “c” and other keywords in Hong Kong’s word cloud for “programmer”. Compared with Indeed’s word cloud, the Hong Kong one focuses more on “web”, “java”, “sql” and “.net”, while Indeed’s is heavily weighted toward “C”.
Then what is the “little twist”? As mentioned earlier, we are doing it in a more localized way, the Hong Kong way. Thus, we bring you the Hong Kong Junk Boat!
It is basically the same as our original word cloud, but presented in a junk boat mask. To pull off this twist, we need a mask image file like this:
We then load the image and apply the image mask to our word cloud.
junkboat_mask = np.array(Image.open("images/junk_hk.png"))
w = WordCloud(width=800, height=600, background_color='white', max_words=500,
              mask=junkboat_mask, contour_width=2,
              contour_color='#E50110').fit_words(freqs_dict)
image_colors = ImageColorGenerator(junkboat_mask)
plt.imshow(w.recolor(color_func=image_colors), interpolation="bilinear")
Also, we need to add an optional command-line parameter to trigger the “junk boat” mode.
try:
    opts, args = getopt.getopt(sys.argv[2:], "jp:")
except getopt.GetoptError as err:
    print('ERROR: {}'.format(err))
    sys.exit(1)

junk_boat_mode = False
search_page = 1
for opt, arg in opts:
    if opt == '-j':
        junk_boat_mode = True
    elif opt == '-p':
        search_page = int(arg)
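The flag then just decides whether we apply the mask before rendering. A hypothetical sketch of the wiring (the full source on GitHub has the exact flow):
# hypothetical wiring: apply the junk boat mask only when -j is given
if junk_boat_mode:
    junkboat_mask = np.array(Image.open("images/junk_hk.png"))
    w = WordCloud(width=800, height=600, background_color='white', max_words=500,
                  mask=junkboat_mask, contour_width=2,
                  contour_color='#E50110').fit_words(freqs_dict)
    w = w.recolor(color_func=ImageColorGenerator(junkboat_mask))
else:
    w = WordCloud(width=800, height=600, background_color='white',
                  max_words=500).fit_words(freqs_dict)
plt.imshow(w, interpolation="bilinear")  # then render as before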
Thus we can display the junk boat word cloud when we add “-j” on the command line:
$pythonw jobsdbminer.py "programmer" -p 10 -j #display a junk boat word cloud using 10 search result pages
$pythonw jobsdbminer.py "programmer" -p 10 #display an ordinary word cloud using 10 search result pages
What have we learnt in this post?
- Using JobsDB, the number one and most localized job search site in Hong Kong
- Assigning a user agent on a URL open request
- Using an image mask when generating a word cloud
- Adding optional command-line arguments in Python
(the complete source can be found at https://github.com/codeastar/webminer_jobsdb or https://gitlab.com/codeastar/webminer_jobsdb)