To convert text to numbers, we can use a process called ‘bag of words’. We start with two sentences, “Hello how are you” and “I am an engineer”. We can split these 2 sentences into the words hello, how, are, you, I, am, an, and engineer. Together, these 8 words form our word bag. Now, for each sentence, we can create a vector with one entry per word in the bag. For sentence 1, we make [1, 1, 1, 1, 0, 0, 0, 0]. For sentence 2, we get [0, 0, 0, 0, 1, 1, 1, 1]. The number assigned to each word is how many times that word appears in the sentence. Here, every word appears at most once, so each entry is either 0 or 1.
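The steps above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: words are split on whitespace, and the helper name `to_vector` is made up for this example.

```python
from collections import Counter

sentences = ["Hello how are you", "I am an engineer"]

# Build the word bag in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def to_vector(sentence, vocabulary):
    """Count how many times each word in the bag appears in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(sentence, vocabulary) for sentence in sentences]

print(vocabulary)  # ['Hello', 'how', 'are', 'you', 'I', 'am', 'an', 'engineer']
print(vectors)     # [[1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]]
```

Note that this version keeps the original capitalization, so Hello and hello would count as two different words.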
This process is inefficient because it includes words like are, am, and an (often called stop words), which neither humans nor a computer need in order to understand the sentence, so we remove them. We also make all the letters lowercase, because otherwise Hello and hello would be treated as separate words. Both steps shrink our bag of words, which reduces processing time when training. Our new streamlined bag of words is hello, how, you, i, and engineer. Now, sentence 1 is [1, 1, 1, 0, 0] and sentence 2 is [0, 0, 0, 1, 1].
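A minimal sketch of this streamlined version might look like the following. The `STOP_WORDS` set and the `tokenize` helper are assumptions made for this example. They contain only the three words removed above, whereas real stop-word lists are much longer.

```python
# Hypothetical stop-word list for this example only.
STOP_WORDS = {"are", "am", "an"}

def tokenize(sentence):
    """Lowercase the sentence, split on whitespace, and drop stop words."""
    return [word for word in sentence.lower().split() if word not in STOP_WORDS]

sentences = ["Hello how are you", "I am an engineer"]

# Build the streamlined word bag in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

# One count per word in the bag, for each sentence.
vectors = [[tokenize(sentence).count(word) for word in vocabulary]
           for sentence in sentences]

print(vocabulary)  # ['hello', 'how', 'you', 'i', 'engineer']
print(vectors)     # [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]
```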
We can use the CountVectorizer class from scikit-learn to implement bag of words (more info on count vectorization here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentence_1 = "I'm gonna make him an offer he can't refuse."
sentence_2 = "Toto, I've a feeling we're not in Kansas anymore."

# Lowercase the text (the default) and drop common English stop words
vector = CountVectorizer(stop_words='english')
# Learn the word bag from both sentences and count the words in each one
data = vector.fit_transform([sentence_1, sentence_2])