For word2vec, instead of using the frequency of the words, we can use a one-hot approach to turn words into vectors. Given the sentence "May the force be with you", we separate it into may, the, force, be, with, you, then simplify it to may, force, with, you by dropping the stopwords the and be. Now, for each word, we make a new array with a 1 at that word's position: may becomes [1, 0, 0, 0], force becomes [0, 1, 0, 0], with becomes [0, 0, 1, 0], and you becomes [0, 0, 0, 1]. In the end, the sentence becomes the combination of all the word vectors: [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]. While our original sentence was small, these arrays become massive for longer texts, since each vector is as long as the entire vocabulary.
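The one-hot encoding above can be sketched in a few lines of Python. The vocabulary and sentence come from the example; the helper function name is just illustrative.

```python
# Vocabulary from the simplified sentence above
vocab = ["may", "force", "with", "you"]

def one_hot(word, vocab):
    """Return a one-hot vector: 1 at the word's index, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# Encode the whole sentence as a list of one-hot vectors
sentence = [one_hot(w, vocab) for w in vocab]
print(sentence)
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Note that each vector's length equals the vocabulary size, which is why this representation grows so quickly with real text.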
One major hurdle for NLP research is meaning: how does a computer understand what a word actually means, rather than treating it as just a 1 or a 0? To clear this hurdle, we turn to Word2Vec, a model that can estimate a word's meaning based on its context, a major breakthrough for AI research. Its ability to infer the meaning of words comes from its ability to group together words that have similar meanings and connotations. For example, it groups king with queen and car with train, since each pair belongs to the same broader category. To do this, Word2Vec uses one of two different models, depending on how it's used. While these models and how they work are beyond the scope of this class, you can read up on them here: https://towardsdatascience.com/word2vec-explained-49c52b4ccb71.
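The "grouping" described above is usually measured with cosine similarity between word vectors. Here is a minimal sketch of that idea using invented toy vectors (real Word2Vec embeddings are learned and typically have 100+ dimensions):

```python
import numpy as np

# Toy 3-dimensional "embeddings" invented purely for illustration
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "car":   np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: near 1 = similar direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1
print(cosine_similarity(vectors["king"], vectors["car"]))    # noticeably lower
```

Word2Vec learns vectors so that words appearing in similar contexts end up with high cosine similarity, which is exactly what `most_similar` in the code below reports.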
Here, you will learn what it takes to implement Word2Vec. The following is a high-level explanation of the code below. To implement Word2Vec, we first use a library called nltk, which gives us stopwords and punkt, a tokenizer that divides text into a series of sentences. Next, we split the data up into words, and then we filter the stopwords out with a simple function. Finally, we apply Word2Vec and receive a treasure trove of NLP information, such as word similarities and visualizations.
#large imports
import pandas as pd
import nltk
import string
import matplotlib.pyplot as plt

#specialized imports
from nltk.corpus import stopwords #stopwords from nltk
from nltk import word_tokenize #word_tokenize splits sentences into words
from gensim.models import Word2Vec as w2v #imports word2vec

# constants
PATH = 'data/shakespeare.txt' #any data works here, it doesn't matter
nltk.download('punkt') #punkt splits text into a series of sentences
nltk.download('stopwords')
sw = stopwords.words('english')

# fully import the text file
lines = []
with open(PATH, 'r') as f:
    for l in f:
        lines.append(l)

#basic preprocessing to remove new lines, punctuation, and make all words lowercase
lines = [line.rstrip('\n') for line in lines]
lines = [line.lower() for line in lines]
lines = [line.translate(str.maketrans('', '', string.punctuation)) for line in lines]

# split sentences into words
lines = [word_tokenize(line) for line in lines]

#remove the stopwords
def remove_stopwords(lines, sw = sw):
    res = []
    for line in lines:
        original = line
        line = [w for w in line if w not in sw] #keeps all words that aren't stopwords
        if len(line) < 1:
            line = original #if entire line is stopwords, keep the line the same
        res.append(line)
    return res

filtered_lines = remove_stopwords(lines = lines, sw = sw)

# train a skip-gram model (sg=1) on the filtered lines
w = w2v(
    filtered_lines,
    min_count=3,
    sg = 1,
    window=7
)

print(w.wv.most_similar('thou'))

# collect the learned embeddings into a dataframe, one row per word
emb_df = (
    pd.DataFrame(
        [w.wv.get_vector(str(n)) for n in w.wv.key_to_index],
        index = w.wv.key_to_index
    )
)
print(emb_df.shape)
emb_df.head()
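To get the visualizations mentioned earlier, a common next step is to project the embeddings down to 2D and scatter-plot them. The sketch below uses a plain SVD projection (the same idea as PCA) on a random stand-in matrix, since it needs to run without the trained model; in practice you would replace `emb` and `words` with `emb_df.values` and `emb_df.index` from the code above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to view interactively
import matplotlib.pyplot as plt

# Stand-in for emb_df: 5 words with random 100-dimensional vectors
rng = np.random.default_rng(0)
words = ["thou", "thee", "king", "queen", "love"]
emb = rng.normal(size=(len(words), 100))

# Center the vectors, factor with SVD, and keep the top 2 components
centered = emb - emb.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T  # shape (len(words), 2)

# Scatter-plot each word at its 2D coordinates
fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    ax.annotate(word, (x, y))
fig.savefig("embeddings_2d.png")
```

With real embeddings, words that Word2Vec considers similar (like thou and thee) tend to land near each other in the plot.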