{"id":1895,"date":"2019-04-30T21:36:47","date_gmt":"2019-04-30T21:36:47","guid":{"rendered":"https:\/\/www.codeastar.com\/?p=1895"},"modified":"2019-05-15T19:30:45","modified_gmt":"2019-05-15T19:30:45","slug":"word-embedding-in-nlp-and-python-part-1","status":"publish","type":"post","link":"https:\/\/www.codeastar.com\/word-embedding-in-nlp-and-python-part-1\/","title":{"rendered":"Word Embedding in NLP and Python – Part 1"},"content":{"rendered":"\n

We have handled text in machine learning using TFIDF, and we can use it to build a word cloud for analytic purposes. But is that all a machine can do with text? Definitely not, as we just haven't let the machine "learn" about text yet. TFIDF is a statistical process, while we want a machine to learn the correlations between words. When a machine can read and understand text the way a human does, we call it Natural Language Processing (NLP). Think of it this way: we taught TFIDF to a machine in a Mathematics class. This time, we teach the machine in a Linguistics class. And our first lesson will be word embedding.

Word in machine's POV

When we read the word "apple", we may think of a reddish fruit, a pie ingredient or even a technology company. We think this way because we have learnt it in class or experienced it in our daily life. For a machine, "apple" is just a five-character word. In fact, it is a bunch of 0s and 1s from a machine's point of view.

\"machine's<\/figure><\/div>\n\n\n\n

Our job is to help a machine learn the correlations between words. We do this by embedding each word among related words, and that is why we call it "word embedding".

How word embedding works in a machine

Now we know what we want machine learning to do with text; it is time to understand how we achieve that goal. As mentioned earlier, a machine treats words as a bunch of 0s and 1s. The first thing we do is encode each word as a vector, i.e. give each word a geometric location. The next thing we do is place words with similar contexts at nearby geometric locations. Mathematically, the cosine of the angle between the vectors of similar words should be close to 1, i.e. the angle itself should be close to 0. Let's look at the following example with 3 different words: "Lorry", "Truck" and "Plane". Our objective is for the machine to put the similar words ("Lorry" and "Truck") in close positions and the unrelated word ("Plane") in a far position.

\"word<\/figure><\/div>\n\n\n\n

Word Embedding technology #1 – Word2Vec

To do word embedding, we can use Word2Vec, a neural-network-based technique. Word2Vec was developed by Tomas Mikolov and his teammates at Google. It consists of two methods, Continuous Bag-Of-Words (CBOW) and Skip-Gram.

CBOW predicts a target word from its surrounding words. For example, "day" is the predicted word when our bag-of-words inputs are "Have", "a" and "nice", and "nice" is the predicted word when our inputs are "Have", "a" and "day".

On the other hand, Skip-Gram is the reverse of CBOW: we start from the target word and predict the probability of its surrounding words. The sketch below shows both modes in code.
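As a quick illustration, Gensim's Word2Vec class can train either mode on a toy corpus; the sg parameter switches between CBOW (sg=0, the default) and Skip-Gram (sg=1). The tiny sentences below are made up just to show the API, not to produce meaningful vectors.

from gensim.models import Word2Vec

# a tiny made-up corpus, just to demonstrate the API
sentences = [
    ["have", "a", "nice", "day"],
    ["have", "a", "good", "day"],
    ["what", "a", "nice", "trip"],
]

# sg=0 trains CBOW (predict a word from its context), sg=1 trains Skip-Gram
# note: gensim 4.x uses vector_size; older 3.x versions call this parameter size
cbow_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)

print(cbow_model.wv["nice"])                    # the learnt vector of "nice"
print(skipgram_model.wv.most_similar("nice"))   # its closest words in the toy space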

Word Embedding technology #2 – fastText

After the release of Word2Vec, Facebook's AI Research (FAIR) lab built its own word embedding library based on Tomas Mikolov's paper, giving us the fastText library. The major difference between fastText and Word2Vec is the use of character n-grams. We can think of an n-gram as a sub-word: fastText breaks a word into several short n-grams (3-character n-grams in the example below). For the word "action", fastText handles it as "<ac", "act", "cti", "tio", "ion", "on>". Note that "<" and ">" are added as boundary markers at the start and end, so a machine can distinguish short words from the n-grams of longer words. For example, the word "act" is "<act>" from a machine's POV, not the n-gram "act".
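Here is a minimal sketch of how such boundary-marked character n-grams can be generated. It illustrates the idea only and is not fastText's actual implementation.

def char_ngrams(word, n=3):
    # wrap the word with boundary markers, as fastText does
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("action"))  # ['<ac', 'act', 'cti', 'tio', 'ion', 'on>']
print(char_ngrams("act"))     # ['<ac', 'act', 'ct>'], while the whole word is "<act>"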

NLP in action

I believe practicing with an example is a good way of learning, and it is always good to code something here at CodeAStar :]] . This time, we use a Kaggle competition, Toxicity Classification, as our NLP example. In the competition, our goal is to find toxic comments, i.e. comments with vulgar or insulting language.

We know that we must train a machine on text first, so it can learn which words correlate with toxic and non-toxic comments. We can pick one of the embedding technologies above and start training a machine with our training data.

\"wait\"<\/figure><\/div>\n\n\n\n

When we learn a common language, in this case English, we all learn that some words are related and some are not, and our understanding should not differ much in general, otherwise we could not communicate with each other. The same applies to a machine: every machine should end up with similar word embedding vectors when trained with the same technology and the same training data.

That's it: we can use a pre-trained word embedding model instead of training one ourselves. From the fastText official website, we can download a pre-trained model in which fastText used 600 billion tokens ("words") from Common Crawl to build 2 million 300-dimensional word vectors ("unique words"). It is a good way to save time and effort, and most importantly, I don't think I have a machine powerful enough to train on 600 billion tokens. :]]

Load the Pre-Trained Model

Okay, we first load the Kaggle training and testing data, which contain toxic/non-toxic comments. Then we merge them together into one data frame, df_merge.

import pandas as pd

# load the competition's training and testing sets
df_train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
df_test = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')

# stack the comments of both sets into one data frame
df_merge = pd.concat([df_train[['id','comment_text']], df_test], axis=0)

Let's see what's inside the data frame:

\"comments\"<\/figure><\/div>\n\n\n\n

Our objective is to turn the words in our comments into word vectors. So we load the word vectors from the fastText pre-trained model using Gensim's KeyedVectors class.

from gensim.models import KeyedVectors

# path to the 2-million-word, 300-dimension fastText vectors (Kaggle dataset)
fasttext_300d_2m_model = '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec'

# loading the .vec file takes a while, it holds 2 million 300-d vectors
wordvectors_index = KeyedVectors.load_word2vec_format(fasttext_300d_2m_model)
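Once loaded, we can look up any word's vector directly. A quick check, assuming "apple" is in the pre-trained vocabulary (it should be):

vec = wordvectors_index['apple']   # a 300-dimension NumPy array
print(vec.shape)                   # (300,)
print(wordvectors_index.most_similar('apple', topn=3))  # nearest words in the embedding space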

Now we have 2 million word vectors, but are they enough to cover the words in our comments? And which words can we not find among our word vectors? Don't worry, let's find out. First, we collect the unique words (the vocabulary) from our comments.

from tqdm import tqdm

def build_vocab(texts):
    # count how many times each unique word appears in the comments
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in tqdm(sentences):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

vocab = build_vocab(df_merge['comment_text'])
print(f"The length of vocab: {len(vocab)}")
100%|██████████| 1902194/1902194 [00:32<00:00, 58340.85it/s]
The length of vocab: 1731089

Wow, we have 1.73 million unique words against 2 million word vectors. The next steps are to check our word vector coverage and find the "out of vocabulary" (oov) words.

import operator

def check_coverage(vocab, embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]          # occurrences of words we can embed
        except KeyError:
            oov[word] = vocab[word]   # occurrences of out-of-vocabulary words
            i += vocab[word]
    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for {:.2%} of all text'.format(k / (k + i)))
    # sort the oov words by how often they appear, most frequent first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]
    return sorted_x

oov = check_coverage(vocab, wordvectors_index)
100%|██████████| 1731089/1731089 [00:06<00:00, 286005.84it/s]
Found embeddings for 16.91% of vocab
Found embeddings for 91.37% of all text

Oh well, something doesn't feel right. Only 16.91% of our vocabulary has word vectors.
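Since check_coverage returns the oov words sorted by frequency, a quick peek at the top entries (a simple inspection, using the oov list from above) hints at what is going wrong:

# the most frequent words that have no vector in the pre-trained model
print(oov[:10])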