Implementation Of Bag Of Words Using Python
“Language is a tool for communication”
Humans have developed many languages for communication. Today we also have machines that make our work easier and faster. But machines can’t understand our languages as they are; text has to be converted into numeric form before a machine can process it.
The process of converting text into numbers, or a vector, is known as Bag of Words. It counts how often each word occurs in the documents. Picture a table in which each word is paired with its count. Put simply, it is a way to extract features from text documents so that they can be used in machine learning algorithms. It builds a vocabulary of the unique words found in the training documents.
Bag of Words is applied in natural language processing, information retrieval from documents, and document classification.
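As a minimal sketch of the counting idea, the snippet below turns one sentence into a table of word counts using only the standard library. The function name `bag_of_words` and the sample sentence are made up for illustration, and punctuation handling is left out to keep it short.

```python
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping for one document."""
    # Lowercase and split on whitespace; punctuation is not handled here.
    return Counter(text.lower().split())


doc = "the cat sat on the mat"  # illustrative sentence, not from the reviews
counts = bag_of_words(doc)
print(counts["the"])  # 2
print(counts["dog"])  # 0, because a Counter reports absent words as zero
```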
The steps this technique follows are easiest to see with a very common example. We all shop, and these days much of that shopping is online because of the several advantages it offers. The most appealing one is being able to read other people’s reviews of the product you are interested in, so that is what we will look at here.
Consider these few reviews of a hair straightener.
- Review no. 1: This is very easy to use and affordable too.
- Review no. 2: This product is fancy to use but doesn’t give much accuracy.
- Review no. 3: This is good but was expected more trendy.
There can be any number of such reviews. Some may be good, some bad, and some average. After going through these different opinions, we are able to decide whether or not to buy the product.
Bag of Words plays an important role here by simply converting these words into numbers: each review is represented as a string of numbers, one per word. We take the following 10 words as our vocabulary (note that some of them do not appear in the three reviews above):
‘this’, ‘product’, ‘is’, ‘useful’, ‘and’, ‘fancy’, ‘but’, ‘not’, ‘trending’, ‘awesome’.
We need to pre-process the reviews before forming the table: first convert the sentences to lowercase, then apply stemming and lemmatization, and lastly remove stop words.
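The pre-processing can be sketched roughly as follows. This is a simplified illustration using only the standard library: the `STOP_WORDS` set is a tiny made-up list, and stemming and lemmatization are left out because they normally rely on a library such as NLTK.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real projects use a fuller one.
STOP_WORDS = {"this", "is", "to", "and", "too", "but", "was", "very"}


def preprocess(sentence):
    """Lowercase a sentence, split it into tokens, and drop stop words."""
    lowered = sentence.lower()
    # Keep runs of letters and apostrophes so a word like "doesn't" stays whole.
    tokens = re.findall(r"[a-z'’]+", lowered)
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("This is very easy to use and affordable too."))
# ['easy', 'use', 'affordable']
```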
Now mark each word’s occurrence in each review with 1s and 0s, forming a table with one row per review and one column per word.
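One way to build such a table for the three reviews and the ten-word vocabulary is sketched below. The helper names `tokenize` and `to_binary_vector` are made up for this example. Stop-word removal and stemming are skipped here on purpose, since the vocabulary itself contains words like 'this' and 'is'.

```python
import re

VOCABULARY = ["this", "product", "is", "useful", "and",
              "fancy", "but", "not", "trending", "awesome"]

REVIEWS = [
    "This is very easy to use and affordable too.",
    "This product is fancy to use but doesn't give much accuracy",
    "This is good but was expected more trendy.",
]


def tokenize(sentence):
    """Return the set of lowercase words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def to_binary_vector(sentence, vocabulary):
    """Mark 1 for each vocabulary word present in the sentence, else 0."""
    tokens = tokenize(sentence)
    return [1 if word in tokens else 0 for word in vocabulary]


table = [to_binary_vector(review, VOCABULARY) for review in REVIEWS]
for number, row in enumerate(table, start=1):
    print(f"Review {number}: {row}")
# Review 1: [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
# Review 2: [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
# Review 3: [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```

Note that 'trendy' in review 3 does not match 'trending' in the vocabulary; this is the kind of mismatch that stemming is meant to reduce. In practice, scikit-learn's `CountVectorizer` with `binary=True` produces the same kind of matrix.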
The main idea of this table is that a word is marked 1 if it is present in the review and 0 if it is not. From this example you can see that the model only checks whether a word is present in the document; it does not consider meaning, word order, or context. Put simply, the more words two documents share, the more similar they are taken to be.
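To make "more shared words means more similar" concrete, here is a small sketch that compares the 1/0 vectors of the three reviews over the ten-word vocabulary, written out by hand. Cosine similarity is used as the measure; that is an assumption on our part, chosen because it is a common option, not something this technique requires.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares nothing with any other
    return dot / (norm_a * norm_b)


# 1/0 vectors for reviews 1 to 3 over the vocabulary:
# this, product, is, useful, and, fancy, but, not, trending, awesome
review_1 = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
review_2 = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
review_3 = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]

print(round(cosine_similarity(review_1, review_3), 3))  # 0.667
print(round(cosine_similarity(review_2, review_3), 3))  # 0.775
```

Reviews 2 and 3 share three vocabulary words ('this', 'is', 'but'), while reviews 1 and 3 share only two, so the first pair scores higher.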
Even though it is a very easy technique, it still has some drawbacks, and because of them developers often prefer TF-IDF or word2vec when they have to deal with a large amount of data.
Some of these drawbacks are listed below:
- The first issue arises when new sentences containing new words arrive. In such cases the vocabulary grows, which directly increases the vector size.
- Second, the vectors contain many 0s, which results in a sparse matrix, something we don’t want.
- Third, this method gives us no information about the grammar of the text, and it ignores the order of the words.
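The three drawbacks above can be seen in a few lines of Python. This is only an illustrative sketch: the sentences and the helper names `build_vocabulary` and `vectorize` are made up for the example.

```python
def build_vocabulary(documents):
    """Collect unique words in order of first appearance."""
    vocabulary = []
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def vectorize(doc, vocabulary):
    """Mark 1 for each vocabulary word present in the document, else 0."""
    words = set(doc.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


docs = ["dog bites man", "man bites dog"]
vocab = build_vocabulary(docs)

# Word order is lost: two different sentences get the same vector.
print(vectorize(docs[0], vocab) == vectorize(docs[1], vocab))  # True

# A new sentence with new words enlarges the vocabulary and every vector.
docs.append("cat chases mouse quickly")
bigger_vocab = build_vocabulary(docs)
print(len(vocab), len(bigger_vocab))  # 3 7

# Most entries of the longer vector are now zero: a sparse vector.
vector = vectorize("dog bites man", bigger_vocab)
print(f"{vector.count(0)} of {len(vector)} entries are zero")  # 4 of 7 entries are zero
```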
We hope you got a detailed idea of Bag of Words through this blog. This blog is inspired by Excelr Solution.