👝

# 2.2 Bag of Words

### Converting Text to Numbers

To convert text to numbers, we can use a process called ‘bag of words’. We start with two sentences, “Hello how are you” and “I am an engineer”. We can split these two sentences into the words hello, how, are, you, I, am, an, and engineer; this list of distinct words is our word bag. Now, for each sentence, we create a vector with one entry per word in the bag. For sentence 1, we get [1, 1, 1, 1, 0, 0, 0, 0]. For sentence 2, we get [0, 0, 0, 0, 1, 1, 1, 1]. Each number is how many times that word from the word bag appears in the sentence. Here every word appears at most once, so the entries are all 0 or 1.
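The steps above can be sketched by hand in plain Python. This is only an illustration of the idea, not how any library does it; the names `vocabulary` and `to_vector` are made up for this example.

```python
# Build a bag of words and count vectors by hand for the two example sentences.
sentences = ["Hello how are you", "I am an engineer"]

# Collect each distinct word, in order of first appearance (the "word bag").
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def to_vector(sentence, vocabulary):
    """Count how many times each word in the bag occurs in the sentence."""
    words = sentence.split()
    return [words.count(term) for term in vocabulary]

vectors = [to_vector(s, vocabulary) for s in sentences]

print(vocabulary)  # ['Hello', 'how', 'are', 'you', 'I', 'am', 'an', 'engineer']
print(vectors)     # [[1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]]
```

A sentence that repeats a word, such as "how how you", would get a 2 in the position for how, which is why the entries are counts rather than just yes/no flags.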

### Streamlining

This process is inefficient because it includes words like are, am, and an, which neither humans nor a computer need in order to understand the sentence. Words like these are often called stop words, and we can remove them. We also make all the letters lowercase, because otherwise Hello and hello, for example, would be counted as separate words. Both steps shrink our bag of words, which reduces processing time when training. Our new streamlined bag of words is hello, how, you, i, and engineer. Now, sentence 1 is [1, 1, 1, 0, 0] and sentence 2 is [0, 0, 0, 1, 1].
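Here is a minimal sketch of the streamlined version in plain Python. The `STOP_WORDS` set contains only the three words from this example and is an assumption for illustration; real stop-word lists are much longer. The helper name `tokenize` is also made up here.

```python
# Streamlined bag of words: lowercase everything and drop stop words.
STOP_WORDS = {"are", "am", "an"}  # illustrative list, only covers this example

def tokenize(sentence):
    """Lowercase the sentence, split on whitespace, and remove stop words."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]

sentences = ["Hello how are you", "I am an engineer"]

# Build the word bag from the cleaned-up tokens.
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

# One count vector per sentence, one entry per word in the bag.
vectors = [[tokenize(s).count(term) for term in vocabulary] for s in sentences]

print(vocabulary)  # ['hello', 'how', 'you', 'i', 'engineer']
print(vectors)     # [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]
```

The vectors shrink from eight entries to five, which is the saving the paragraph above describes.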

### Implementation

We can use the CountVectorizer class from scikit-learn to implement bag of words (more info on CountVectorizer here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
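```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentence_1 = "I'm gonna make him an offer he can't refuse."
sentence_2 = "Toto, I've a feeling we're not in Kansas anymore."

# Lowercases the text by default and drops scikit-learn's built-in English stop words
vector = CountVectorizer(stop_words='english')

# Learn the vocabulary and return a sparse matrix with one row of word counts per sentence
data = vector.fit_transform([sentence_1, sentence_2])
```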
