What is Bag of Words?
We are going shopping for a new concept to learn. Keep your hands free, because we are going to have a lot of bags to deal with. You guessed it: the topic for today is Bag of Words. Bag of Words is a feature extraction technique to convert text into numbers, and it's exactly what it sounds like.
A collection of different words. A great use case for Bag of Words is the spam filter in your email. For example, you might be receiving different emails about the latest news, maybe some interesting messages from your friends, and perhaps a few spammy ones saying that you have won a lottery and you're about to become a millionaire. Bag of Words looks at the different words present, and their frequency, in each of these emails, and tries to predict which of them would be spam. So today we are going to be looking at what Bag of Words means, as well as some examples.
We will be looking at the pros and cons of Bag of Words, certain applications, and also modifications that we can use to improve our Bag of Words algorithm. Like I said, Bag of Words is a feature extraction technique, which means that all of your different texts or different words are converted into numbers. After all, numbers are what our machine learning models understand. I like to think of Bag of Words as a bag of popcorn. Let's think of the different words as kernels of popcorn.
And each word represents a kernel. Or rather, each kernel represents a different word. The cool thing about Bag of Words is that it's not just limited to words, but it can also be applied to visual elements, which is bag of visual words. Let's say, for instance, you have an image of a cat. And yes, this is how I draw a cat, but you can break down this image of a cat into multiple different key features.
You could have an ear, you could have whiskers, a body, legs, and a tail. And each of these different elements helps in multiple computer vision techniques, such as object detection. So you can use Bag of Words not just on words, but also on visual words, which is images. Next, let's take a look at what Bag of Words looks like for different sentences, and see the pros and cons. A common NLP task where Bag of Words comes in handy is text classification.
Let's say, for example, spam or not spam: you could have your email, and depending on what words are in it, you could identify which one it is. So this is an example of text classification. Another example could be that of document similarity, where perhaps you want to compare two different documents and check how similar they are to each other. Or maybe you have a particular query, like the kind you put in a search engine, and you want to find the most relevant documents.
Both of these examples, text classification and document similarity, use Bag of Words in the back end. Now let's take an example of two sentences and see how we can convert the text, or the words, into features or numbers for our machine learning model to understand.
Consider two sentences. Sentence number one: "I think, therefore I am." And sentence number two: "I love learning Python." Now that we have our two example sentences, what we are going to begin with is creating our vocabulary, or dictionary, which is the set of unique words across all of the given documents. In our case, there are only two sentences that we are looking at. But let's take a look at all the unique words present here.
So we have "I" as a unique word, then "think", then "therefore". "I" has already been covered over here, so we move on to the next one, "am". Moving to the second sentence, "I" is also already included, so we add "love", "learning", and "Python". That's 1, 2, 3, 4, 5, 6, 7. Seven unique words are what makes up our dictionary, or our vocabulary, based on these two sentences. Now let's look at what the bag of words representation for each of those sentences would be. What we are constructing over here is called a document-term matrix. So here are our documents: we consider our first document, and these are the different terms, or the vocabulary, present here.
So, going over the first sentence, "I" occurs a total of two times. You look at the count of each particular word and see how many times it occurs in that particular sentence. So "I" is used a total of two times, "think" once, "therefore" once, and "am" once. And in our first sentence, "love", "learning", and "Python" do not appear, which is why they get a score of zero. Doing the same for our second sentence, "I" appears a total of one time. "Think", "therefore", and "am" are absent in that sentence, which is why they get zero, and "love", "learning", and "Python" each appear once, which is why they get one. So what you're seeing over here is a vector of numbers that represents the first sentence. We have now taken words and converted them into a feature representation, that is, numbers, which is what our machine learning models understand. And similarly, this is the feature representation for our second sentence.
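If you want to try this yourself, here's a rough sketch of how you could build the same document-term matrix in Python, assuming scikit-learn is installed (the column order may differ from the walkthrough, because the vectorizer sorts its vocabulary alphabetically):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I think therefore I am",   # sentence 1
    "I love learning Python",   # sentence 2
]

# Widen token_pattern so one-letter words like "I" are kept;
# by default CountVectorizer drops single-character tokens.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
dtm = vectorizer.fit_transform(docs)  # the document-term matrix

print(vectorizer.get_feature_names_out())  # the 7 unique vocabulary words
print(dtm.toarray())                       # one count vector per sentence
```

Each row of the printed matrix is the feature vector for one sentence, just like the two vectors we walked through above.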
Now that we've seen what Bag of Words looks like and how to calculate it, the pros are kind of obvious. It's simple, as you just saw: you count the number of times a particular word occurs, and you assign that count to that word's position for that sentence.
It's easy, which is what we did over here. And it's explainable, as opposed to certain other algorithms that maybe are not as intuitive. Unfortunately, as with all things in life, there are going to be pros and there are going to be cons. Next, we'll take a look at the cons of this simplistic algorithm and see if we can modify it to make it work better for us. Let's look at some of the drawbacks associated with Bag of Words. The first one is compound words. Think about terms like "artificial intelligence" or "New York". In a simplistic Bag of Words approach, you break down "artificial" and "intelligence", and now they are treated as two separate words with no correlation or meaning between the two. That would apply to "New York" as well, where "New" is one word and "York" is another word. In this case, we are losing the semantics, or the meaning, that exists between the two words, which is a drawback. Let's look at another example: perhaps "cake" and "baking".
Maybe "racing" as well. Given these three words, cake, baking, and racing, cake and baking are more likely to co-occur, that is, to appear in the same context and the same documents, as opposed to cake and racing. Well, of course, if tomorrow somebody invents a new sport called cake racing, that's going to change.
But let's hope it doesn't. In this case, our Bag of Words model is not able to capture the correlations that exist among the words, which might pose a problem for our machine learning models. Let's look at another drawback of bag of words: words that have multiple meanings.
Consider the word "Python". Looking at just this word, it's hard to tell if I'm talking about Python the programming language or Python the animal. Maybe there's another word like "content", which could be the noun or the adjective. It could mean either of the two, but just looking at the spelling, it's hard to see which is which. Another drawback is that we lose the order of the words.
Like I mentioned, Bag of Words is nothing but a bag of popcorn, with each of the kernels being a specific word. And when you shake that bag, you lose all of the relationships that exist as far as the order of the words is concerned. Let's say, for example, I have a sentence whose bag of words is "flight", "San Francisco", "Mumbai", "from", "to". What does this mean? Am I trying to fly from San Francisco to Mumbai? Am I trying to fly the other way around, from Mumbai to San Francisco? It's hard to tell when we have only the bag of words available. Last but not least is the problem of sparsity in our Bag of Words approach. We look at each of the unique words, which make up our vocabulary, and denote the presence of that particular word in a sentence. Given a large number of documents,
you could have a very, very large vocabulary of words. Yet in each sentence, there could be maybe only three words, or a very, very small proportion of the vocabulary, that are actually present, with most of the other positions being zeros. This leads to the problem of sparsity. Our matrix, or our vectors over here, is sparse in the sense that most of the elements are unoccupied, because they're denoted by zeros, and very few of them are actually filled in. This could also pose a challenge for our models. Fear not, though. Despite these drawbacks, we do have certain modifications in mind. Let's take a look at some of the modifications that can help improve our Bag of Words approach.
Our first modification is n-grams. Instead of looking at each individual word, you can now look at a combination of words that occur together. For example, with "artificial intelligence" being the phrase, we don't break it into "artificial" and "intelligence"; instead, we look at the presence of "artificial" and "intelligence" together and denote how many times that pair occurs in a particular document. Similarly, for "New York", we look at the presence of "New" and "York" right after each other and count the number of times that occurs in that document.
In this case, since our phrases are made up of two words, n is equal to two. You could extend this with n equal to three, n equal to five, and so on, in which case, for example with n equal to three, you would look at three words that occur right next to each other. So maybe it is "Python artificial intelligence", and any time these three words occur together in your document, you would count the number of times that happens and note the occurrences there.
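Here's a small sketch of what that might look like in code. It reuses scikit-learn's CountVectorizer, this time with an ngram_range so that pairs of words are counted together (the example sentences are just made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "artificial intelligence is transforming new york",
    "new york loves artificial intelligence",
]

# ngram_range=(1, 2) keeps single words and adds two-word phrases (bigrams);
# (1, 3) would add three-word phrases on top of that, and so on.
vectorizer = CountVectorizer(ngram_range=(1, 2))
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # includes "artificial intelligence" and "new york"
print(dtm.toarray())
```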
Another modification that we can do is text normalization. Text normalization refers to certain preprocessing activities that you can do before you pass the text on to your Bag of Words model. A good example of this is the process of stemming, in which case you try to remove the ends of the words in the hope of getting back to the base word, or base stem.
Consider the words coding, coded, codes, and code. When you start removing the ends of the words, you can get back to the base word, which is "code" in this case. This is a way to reduce the size of your vocabulary, or your dictionary, and hopefully that will help with the sparsity issue.
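As a quick sketch, here's what stemming could look like using NLTK's Porter stemmer (assuming the nltk package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["coding", "coded", "codes", "code"]

# All four surface forms collapse to the same stem, shrinking the vocabulary.
print([stemmer.stem(w) for w in words])  # ['code', 'code', 'code', 'code']
```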
An important concept that builds upon Bag of Words is tf-idf, or term frequency-inverse document frequency. You can think of tf-idf as a weight or a score associated with words, or perhaps even a feature scaling technique. TF is the term frequency, or the number of times a particular word occurs in your document. Let's say the words "votes", "president", and "government" occur a lot of times in your document. It probably has something to do with elections or some other government matter. So the higher the term frequency, the higher the score or the weight associated with that word. That makes sense. With inverse document frequency, however, you look at the number of documents that a particular word occurs in. And if that word occurs in many documents, or a huge proportion of the documents, you actually give it a lower score.
So the more documents the word occurs in, the lower the IDF score, and the lower the whole tf-idf score becomes. This may seem a little counterintuitive, right? It's the opposite of the term frequency. But consider words like "the", "an", and "some". These words don't really have any meaning on their own, but they're used to create grammatically correct sentences. As you can imagine, in the English language, or in a lot of documents written in English, these words would occur a lot of times, perhaps even being the most frequently occurring words.
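To make that concrete, here's a small sketch using scikit-learn's TfidfVectorizer. The three example documents are made up, and the point is simply that a word appearing in every document ends up with the lowest IDF:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the votes for the president were counted",
    "the new government announced the budget",
    "the cat sat on the mat",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)  # one tf-idf vector per document

# idf_ is lowest for words that show up in every document ("the" here),
# so their high raw counts get pulled back down in the final score.
for word, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(word, round(idf, 2))
```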
In cases like these, we do not want such words to have a high tf-idf score, because they are not representative of the topics or the content of the documents, which is where the IDF component lowers their score. Let's take a look at some applications of tf-idf. Consider document classification as an example. Perhaps you have a company and a product that you're selling to your customers, and you have a support channel for them to come and raise certain concerns, complaints, or questions about your product.
Maybe you have a chat with your customers or some support tickets, and you could use the Bag of Words approach to understand which of the teams is associated with the problem in the ticket. Maybe you have a billing team, an onboarding team, or a trial team, or maybe it's a documentation issue. Looking at the vocabulary that is present, that is, looking at the bag of words representation of what is in the customer chat or the support ticket, you will then be able to identify which of these teams is the right one and route the ticket to the appropriate team to deal with and resolve the customer's issue. Another example that builds on bag of words is word2vec. You might have heard of word2vec: these are word embeddings that exist in an n-dimensional space. Your words are represented as vectors in this n-dimensional space.
For example, "king" and "queen" are two words, and the closer two words are in this n-dimensional space, the more related they are to each other. In this case, king and queen would be fairly close to each other, as you would find documents or sentences where king and queen appear together. Maybe you have another word, "swim", that comes up in those documents as well, but you wouldn't really associate swim with king or queen as much as you would associate king and queen with each other. So swim would be further away from the vectors of king and queen. This is word2vec, or word embeddings, and it does use a bag of words approach in the back end to create this n-dimensional space. Another example where bag of words comes in handy is sentiment analysis.
You could look at the collection of words in a given text and understand if a lot of those words are positive, maybe words like happy, joy, or excited, or if they are negative, like frustrated, angry, hate, or terrible. And depending on the bag of words representation, you would be able to identify whether the sentiment is positive or negative.
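As a toy illustration, here's a sketch that scores sentiment by counting words from small, hand-made positive and negative word lists. A real system would usually train a classifier on bag of words features instead of hard-coding lists like these:

```python
from collections import Counter

# Tiny hand-made word lists; a real system would learn these from labeled data.
POSITIVE = {"happy", "joy", "excited", "love"}
NEGATIVE = {"frustrated", "angry", "hate", "terrible"}

def sentiment(text: str) -> str:
    counts = Counter(text.lower().split())  # a tiny bag of words for this text
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(sentiment("I am so happy and excited about this"))  # positive
print(sentiment("I hate this terrible product"))           # negative
```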
You could even take this further and try to create a model that helps detect hate speech. So you would look at the negative sentiment or the negative words present in the text, maybe extend that with other words related to racism or other forms of discrimination, and try to create a model that helps you flag these hateful or unwanted texts on the internet. Now that you have this concept in the bag, I hope this helps you understand a little more about natural language processing and encourages you to continue your journey into the field of artificial intelligence. If you like this video and want to see more like it, please like and subscribe. If you have any questions or want to share your thoughts about this topic, please leave a comment below.