{"id":1895,"date":"2019-04-30T21:36:47","date_gmt":"2019-04-30T21:36:47","guid":{"rendered":"https:\/\/www.codeastar.com\/?p=1895"},"modified":"2019-05-15T19:30:45","modified_gmt":"2019-05-15T19:30:45","slug":"word-embedding-in-nlp-and-python-part-1","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/word-embedding-in-nlp-and-python-part-1\/","title":{"rendered":"Word Embedding in NLP and Python – Part 1"},"content":{"rendered":"\n

We have handled text in machine learning using TFIDF, and we can use it to build a word cloud for analytic purposes. But is that all a machine can do with text? Definitely not, as we just haven't let the machine "learn" about text yet. TFIDF is a statistical process, while we want a machine to learn the correlations between words. When a machine can read and understand text the way a human does, we call it Natural Language Processing (NLP). Think of it this way: we taught TFIDF to a machine in a Mathematics class. This time, we teach the machine in a Linguistics class. And our first lesson will be word embedding.

Word in machine's POV

When we read the word "apple", we may think of a reddish fruit, a pie ingredient or even a technology company. We think this way because we have learnt it in class or experienced it in our daily life. For a machine, "apple" is just a five-character word. In fact, it is a bunch of 0s and 1s from a machine's point of view.

\"machine's<\/figure><\/div>\n\n\n\n

Our job is to help a machine learn the correlations between words. We do this by embedding each word among related words, and that is why we call it "word embedding".

How word embedding works in a machine

Now we know what we want machine learning to do with text; it is time to understand how we achieve that goal. As mentioned earlier, a machine treats words as a bunch of 0s and 1s. The first thing we do is encode each word as a vector, i.e. give each word a geometric location. The next thing we do is place words with similar contexts at nearby geometric locations. Mathematically, the cosine of the angle between the vectors of similar words should be close to 1, i.e. the angle itself should be close to 0. Let's look at the following example with 3 different words: "Lorry", "Truck" and "Plane". Our objective is for the machine to put the similar words ("Lorry" and "Truck") in close positions and the unrelated word ("Plane") in a far position.

\"word<\/figure><\/div>\n\n\n\n

Word Embedding technology #1 – Word2Vec

To do word embedding, we can use Word2Vec, a neural-network-based technique. Word2Vec was developed by Tomas Mikolov and his teammates at Google. It consists of two methods, Continuous Bag-Of-Words (CBOW) and Skip-Gram.

CBOW predicts a target word from its surrounding words. For example, "day" is the predicted word when our bag-of-words inputs are "Have", "a" and "nice", and "nice" is the predicted word when our inputs are "Have", "a" and "day".

On the other hand, Skip-Gram is the reverse of CBOW: we start from the target word and predict the probability of its surrounding words. The sketch below shows both modes in code.
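As a quick illustration, Gensim's Word2Vec class can train either mode on a toy corpus; the sg parameter switches between CBOW (sg=0, the default) and Skip-Gram (sg=1). The tiny sentences below are made up just to show the API, not to produce meaningful vectors.

from gensim.models import Word2Vec

# a tiny made-up corpus, just to demonstrate the API
sentences = [
    ["have", "a", "nice", "day"],
    ["have", "a", "good", "day"],
    ["what", "a", "nice", "trip"],
]

# sg=0 trains CBOW (predict a word from its context), sg=1 trains Skip-Gram
# note: gensim 4.x uses vector_size; older 3.x versions call this parameter size
cbow_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)

print(cbow_model.wv["nice"])                    # the learnt vector of "nice"
print(skipgram_model.wv.most_similar("nice"))   # its closest words in the toy space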

Word Embedding technology #2 – fastText

After the release of Word2Vec, Facebook's AI Research (FAIR) lab built its own word embedding library based on Tomas Mikolov's paper, giving us the fastText library. The major difference between fastText and Word2Vec is the use of character n-grams. We can think of an n-gram as a sub-word: fastText breaks a word into several short n-grams (3-character n-grams in the example below). For the word "action", fastText handles it as "<ac", "act", "cti", "tio", "ion", "on>". Note that "<" and ">" are added as boundary markers at the start and end, so a machine can distinguish short words from the n-grams of longer words. For example, the word "act" is "<act>" from a machine's POV, not the n-gram "act".
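Here is a minimal sketch of how such boundary-marked character n-grams can be generated. It illustrates the idea only and is not fastText's actual implementation.

def char_ngrams(word, n=3):
    # wrap the word with boundary markers, as fastText does
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("action"))  # ['<ac', 'act', 'cti', 'tio', 'ion', 'on>']
print(char_ngrams("act"))     # ['<ac', 'act', 'ct>'], while the whole word is "<act>"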

NLP in action

I believe practicing with an example is a good way of learning, and it is always good to code something here at CodeAStar :]] . This time, we use a Kaggle competition, Toxicity Classification, as our NLP example. In the competition, our goal is to find toxic comments, i.e. comments with vulgar or insulting language.

We know that we must train a machine on text first, so it can learn which words correlate with toxic and non-toxic comments. We can pick one of the embedding technologies above and start training a machine with our training data.

\"wait\"<\/figure><\/div>\n\n\n\n

When we learn a common language, in this case English, we all learn that some words are related and some are not, and our understanding should not differ much in general, otherwise we could not communicate with each other. The same applies to a machine: every machine should end up with similar word embedding vectors when trained with the same technology and the same training data.

That's it: we can use a pre-trained word embedding model instead of training one ourselves. From the fastText official website, we can download a pre-trained model in which fastText used 600 billion tokens ("words") from Common Crawl to build 2 million 300-dimensional word vectors ("unique words"). It is a good way to save time and effort, and most importantly, I don't think I have a machine powerful enough to train on 600 billion tokens. :]]

Load the Pre-Trained Model

Okay, we first load the Kaggle training and testing data, which contain toxic/non-toxic comments. Then we merge them together into one data frame, df_merge.

import pandas as pd

# load the competition's training and testing sets
df_train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
df_test = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')

# stack the comments of both sets into one data frame
df_merge = pd.concat([df_train[['id','comment_text']], df_test], axis=0)

Let's see what's inside the data frame:

\"comments\"<\/figure><\/div>\n\n\n\n

Our objective is to turn the words in our comments into word vectors. So we load the word vectors from the fastText pre-trained model using Gensim's KeyedVectors class.

from gensim.models import KeyedVectors

# path to the 2-million-word, 300-dimension fastText vectors (Kaggle dataset)
fasttext_300d_2m_model = '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec'

# loading the .vec file takes a while, it holds 2 million 300-d vectors
wordvectors_index = KeyedVectors.load_word2vec_format(fasttext_300d_2m_model)
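Once loaded, we can look up any word's vector directly. A quick check, assuming "apple" is in the pre-trained vocabulary (it should be):

vec = wordvectors_index['apple']   # a 300-dimension NumPy array
print(vec.shape)                   # (300,)
print(wordvectors_index.most_similar('apple', topn=3))  # nearest words in the embedding space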

Now we have 2 million word vectors, but are they enough to cover the words in our comments? And which words can we not find among our word vectors? Don't worry, let's find out. First, we collect the unique words (the vocabulary) from our comments.

from tqdm import tqdm

def build_vocab(texts):
    # count how many times each unique word appears in the comments
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in tqdm(sentences):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

vocab = build_vocab(df_merge['comment_text'])
print(f"The length of vocab: {len(vocab)}")
100%|██████████| 1902194/1902194 [00:32<00:00, 58340.85it/s]
The length of vocab: 1731089

Wow, we have 1.73 million unique words against 2 million word vectors. The next steps are to check our word vector coverage and find the "out of vocabulary" (oov) words.

import operator

def check_coverage(vocab, embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]          # occurrences of words we can embed
        except KeyError:
            oov[word] = vocab[word]   # occurrences of out-of-vocabulary words
            i += vocab[word]
    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for {:.2%} of all text'.format(k / (k + i)))
    # sort the oov words by how often they appear, most frequent first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]
    return sorted_x

oov = check_coverage(vocab, wordvectors_index)
100%|██████████| 1731089/1731089 [00:06<00:00, 286005.84it/s]
Found embeddings for 16.91% of vocab
Found embeddings for 91.37% of all text

Oh well, something doesn't feel right. Only 16.91% of our vocabulary has word vectors.
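Since check_coverage returns the oov words sorted by frequency, a quick peek at the top entries (a simple inspection, using the oov list from above) hints at what is going wrong:

# the most frequent words that have no vector in the pre-trained model
print(oov[:10])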