# 2.3 Word2Vec

### Embedding

For word2vec, instead of using word frequencies, we can use a one-hot approach to turn words into vectors. Given the sentence “May the force be with you”, we first separate it into may, the, force, be, with, you, then simplify it to may, force, with, you by dropping the stopwords the and be. Now, for each word, we make a new array, meaning may becomes [1, 0, 0, 0], force becomes [0, 1, 0, 0], with becomes [0, 0, 1, 0], and you becomes [0, 0, 0, 1]. In the end, the sentence becomes a combination of all the words: [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]. While our original sentence is short, each vector is as long as the vocabulary, so for a large corpus these arrays become massive and almost entirely zeros.
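The one-hot scheme above can be sketched in a few lines of plain Python, using the simplified four-word vocabulary from the example:

```python
# Vocabulary from the simplified sentence "may force with you"
words = ['may', 'force', 'with', 'you']
index = {w: i for i, w in enumerate(words)}  # word -> position in the vector

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(words)
    vec[index[word]] = 1
    return vec

sentence = [one_hot(w) for w in words]
print(sentence)
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Note that the vector length is tied to the vocabulary size, which is exactly why this representation blows up on real corpora.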

### Word2Vec

One major hurdle for NLP research is meaning - how does a computer understand what a word actually means, rather than treating it as just a 1 or a 0? To clear this hurdle, we turn to Word2Vec, a model that estimates a word’s meaning from its context, a major breakthrough for AI research. Its ability to capture meaning comes from grouping together words that have similar meanings and connotations - for example, it places king near queen and car near train, since each pair belongs to the same broader category. To do this, Word2Vec uses one of two model architectures, continuous bag-of-words (CBOW) or skip-gram, depending on how it’s used. While these models and how they work are beyond the scope of this class, you can read up on them here: https://towardsdatascience.com/word2vec-explained-49c52b4ccb71.
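The idea of "grouping" words can be made concrete with cosine similarity, the standard way to compare embedding vectors. The tiny 3-dimensional vectors below are made up purely for illustration (real Word2Vec vectors have hundreds of learned dimensions), but they show the pattern: similar words score near 1, unrelated words score much lower.

```python
import math

# Hypothetical toy embeddings, invented for this example only.
emb = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.85, 0.75, 0.2],
    'car':   [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(emb['king'], emb['queen']))  # close to 1: similar words
print(cosine(emb['king'], emb['car']))    # much smaller: unrelated words
```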

### Implementation (Conceptual)

Here, you will see what it takes to implement Word2Vec. The following is a high-level explanation of the code below. We first use a library called nltk, which gives us a stopword list and punkt, a tokenizer that divides text into a series of sentences. Next, we split the data up into words, and then we filter the stopwords out using a simple function. Finally, we apply Word2Vec and receive a treasure trove of NLP information, such as word similarities and embedding vectors we can visualize.
```python
# large imports
import pandas as pd
import nltk
import string
import matplotlib.pyplot as plt

# download the nltk resources used below (only needed once)
nltk.download('stopwords')
nltk.download('punkt')

# specialized imports
from nltk.corpus import stopwords          # stopword list from nltk
from nltk import word_tokenize             # word_tokenize splits sentences into words
from gensim.models import Word2Vec as w2v  # gensim's Word2Vec implementation

# constants
PATH = 'data/shakespeare.txt'  # path to any plain-text corpus
sw = stopwords.words('english')

# fully import the text file
lines = []
with open(PATH, 'r') as f:
    for l in f:
        lines.append(l)

# basic preprocessing: remove newlines and punctuation, lowercase all words
lines = [line.rstrip('\n') for line in lines]
lines = [line.lower() for line in lines]
lines = [line.translate(str.maketrans('', '', string.punctuation)) for line in lines]

# split sentences into words
lines = [word_tokenize(line) for line in lines]

# remove the stopwords
def remove_stopwords(lines, sw=sw):
    res = []
    for line in lines:
        original = line
        line = [w for w in line if w not in sw]  # keep all words that aren't stopwords
        if len(line) < 1:
            line = original  # if the entire line is stopwords, keep the line the same
        res.append(line)
    return res

filtered_lines = remove_stopwords(lines=lines, sw=sw)

# train the model: sg=1 selects skip-gram, window sets the context size,
# and min_count drops words appearing fewer than 3 times
w = w2v(
    filtered_lines,
    min_count=3,
    sg=1,
    window=7,
)

print(w.wv.most_similar('thou'))

# collect the learned embedding vectors into a DataFrame, one row per word
emb_df = pd.DataFrame(
    [w.wv.get_vector(n) for n in w.wv.key_to_index],
    index=w.wv.key_to_index,
)
print(emb_df.shape)
```
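The embedding table built above can be visualized by projecting it down to two dimensions. Below is a minimal sketch of PCA via SVD; as an assumption for a self-contained example, it uses a random matrix as a stand-in for the real `emb_df.values`, and matplotlib's non-interactive Agg backend so the plot is written straight to a file.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: save the figure to disk
import matplotlib.pyplot as plt

# Stand-in for the trained embedding table (replace with emb_df.values):
# rows = words, columns = embedding dimensions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 100))
words = [f'word{i}' for i in range(50)]  # stand-in for emb_df.index

# PCA via SVD: center the data, decompose, keep the top 2 components.
centered = emb - emb.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T  # shape: (50, 2)

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(coords[:, 0], coords[:, 1], s=10)
for word, (x, y) in zip(words[:10], coords[:10]):  # label a few points
    ax.annotate(word, (x, y))
fig.savefig('embeddings.png')
```

With the real model, words that Word2Vec considers similar should land near each other in this scatter plot.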