2.2 Bag of Words

Converting Text to Numbers

To convert text to numbers, we can use a process called ‘bag of words’. We start with two sentences, “Hello how are you” and “I am an engineer”. We can split these two sentences into the words hello, how, are, you, I, am, an, and engineer. This list of unique words is our word bag. Now, for each sentence, we can create a vector with one position per word in the bag. For sentence 1, we get [1, 1, 1, 1, 0, 0, 0, 0]. For sentence 2, we get [0, 0, 0, 0, 1, 1, 1, 1]. The number we assign to each word is how many times that word appears in the sentence. Here every word appears at most once, so each entry is either 0 or 1.
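The steps above can be sketched in plain Python. This is a minimal illustration rather than a library implementation; the names `vocabulary` and `to_vector` are made up for this example.

```python
sentences = ["Hello how are you", "I am an engineer"]

# Build the word bag: every unique word, in the order it is first seen.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def to_vector(sentence, vocabulary):
    """Count how many times each word in the bag appears in the sentence."""
    words = sentence.split()
    return [words.count(term) for term in vocabulary]

vectors = [to_vector(sentence, vocabulary) for sentence in sentences]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has the same length as the word bag, so the two sentences can be compared position by position.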


This process is inefficient because it includes words like are, am, and an, which neither humans nor a computer need in order to understand the sentence. Words like these are called stop words, and we can remove them. We also make all the letters lowercase, because otherwise Hello and hello would be counted as separate words. Both steps shrink our bag of words, which reduces processing time when training. Our new streamlined bag of words is hello, how, you, i, and engineer. Now, sentence 1 is [1, 1, 1, 0, 0] and sentence 2 is [0, 0, 0, 1, 1].
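Here is a small sketch of the streamlined version in plain Python. The stop word set below is an assumption: it contains only the three words named in this example, while real stop word lists are much longer. The helper names `tokenize` and `clean_vocabulary` are also hypothetical.

```python
sentences = ["Hello how are you", "I am an engineer"]

# Assumed stop word list for this example only.
stop_words = {"are", "am", "an"}

def tokenize(sentence):
    """Lowercase the sentence, split it into words, and drop stop words."""
    return [word for word in sentence.lower().split() if word not in stop_words]

# Build the smaller word bag from the cleaned-up words.
clean_vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in clean_vocabulary:
            clean_vocabulary.append(word)

clean_vectors = [
    [tokenize(sentence).count(term) for term in clean_vocabulary]
    for sentence in sentences
]

print(clean_vocabulary)
for vector in clean_vectors:
    print(vector)
```

The word bag shrinks from eight words to five, so every sentence vector is shorter too.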


We can use the CountVectorizer class from scikit-learn to implement bag of words (more info on CountVectorizer here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentence_1 = "I'm gonna make him an offer he can't refuse."
sentence_2 = "Toto, I've a feeling we're not in Kansas anymore."

# stop_words='english' removes common English words; text is lowercased by default
vector = CountVectorizer(stop_words='english')
# Learn the word bag and return one row of word counts per sentence
data = vector.fit_transform([sentence_1, sentence_2])
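To see what CountVectorizer produced, we can look at the learned word bag and the count matrix. The sketch below assumes scikit-learn is installed. It reads the word bag from the `vocabulary_` attribute, which maps each kept word to its column index, and the variable names are chosen for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I'm gonna make him an offer he can't refuse.",
    "Toto, I've a feeling we're not in Kansas anymore.",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(sentences)  # sparse matrix, one row per sentence

# Sort the words by column index so they line up with the matrix columns.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Convert the sparse matrix to a regular array so it is easy to print.
dense = counts.toarray()

print(words)
for sentence, row in zip(sentences, dense):
    print(row.tolist(), "<-", sentence)
```

The result of fit_transform is a sparse matrix, which only stores the nonzero counts. Calling toarray() is fine for two sentences, but for a large dataset it is usually better to keep the sparse form.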


Copyright © 2021 Code 4 Tomorrow. All rights reserved. The code in this course is licensed under the MIT License. If you would like to use content from any of our courses, you must obtain our explicit written permission and provide credit. Please contact classes@code4tomorrow.org for inquiries.