Bag of Words and TF-IDF

Vijay Anaparthi
2 min read · Aug 19, 2020

Bag of words (BOW) and term frequency-inverse document frequency (TF-IDF) are techniques used to convert text data into vector (numerical) form.

For a clear understanding, let's take a simple example. Suppose you have a dataset that contains a text feature (column). If you try to train machine learning models such as logistic regression or SVM directly on this dataset, you will get an error, because these models only understand numerical data, not text. So we first have to convert the text data into vector form.

To convert text data into numeric form we use simple techniques like BOW, TF-IDF, Word2Vec and TF-IDF-weighted Word2Vec. In this blog I will discuss BOW and TF-IDF.

Bag of Words (BOW)

Let's take a simple example. For this, I am taking 2 reviews from an e-commerce website.

Step 1: Take the unique words from both reviews, i.e.

It, is, very, good, mobile, phone, for, android, users, but, expensive, cheap, not are the unique words.
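
The original review texts are not shown here, so the snippet below uses two made-up reviews chosen to be consistent with the unique-word list above. It is only a minimal sketch of how Step 1 can be done in Python.

```python
# Two hypothetical reviews (assumption: the real reviews are not given in the text).
review_1 = "It is very good mobile phone for android users but expensive"
review_2 = "It is not cheap mobile phone"

# Lowercase and split each review into words, then take the union of the two sets.
unique_words = sorted(set(review_1.lower().split()) | set(review_2.lower().split()))
print(unique_words)
# ['android', 'but', 'cheap', 'expensive', 'for', 'good', 'is', 'it',
#  'mobile', 'not', 'phone', 'users', 'very']
```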

Step 2: Notice that in the above step we took only the unique words, not all the words. Now every unique word becomes one feature/column in the dataset, as shown below.

Here is how the 0s and 1s are filled in for each word feature: if the word occurs in that review, insert 1; otherwise insert 0. If the word occurs more than once in the review, insert that count instead of 1.
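
Here is a minimal sketch of this step using scikit-learn's CountVectorizer, again with the same made-up reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reviews, same as in the Step 1 sketch.
reviews = [
    "It is very good mobile phone for android users but expensive",
    "It is not cheap mobile phone",
]

vectorizer = CountVectorizer()           # counts each word per review
bow = vectorizer.fit_transform(reviews)  # sparse matrix: 2 reviews x 13 unique words

print(vectorizer.get_feature_names_out())
print(bow.toarray())
# Each row is one review, each column is one unique word; the value is the
# number of times that word occurs in the review (0 if it does not occur).
```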

Now you can train your machine learning models on this dataset because it is in numerical form.
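
For example, here is a quick sketch of fitting a logistic regression model on the BOW matrix. The sentiment labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical reviews and hypothetical labels (1 = positive, 0 = negative).
reviews = [
    "It is very good mobile phone for android users but expensive",
    "It is not cheap mobile phone",
]
labels = [1, 0]

bow = CountVectorizer().fit_transform(reviews)   # text -> numerical BOW features
model = LogisticRegression().fit(bow, labels)    # train on the BOW matrix
print(model.predict(bow))
```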

Term Frequency and Inverse Document Frequency (TF-IDF)

TF-IDF is often more useful than BOW because it balances two things: term frequency gives importance to words that occur often within a review, while inverse document frequency gives more weight to words that are rare across the corpus. The collection of all reviews in the dataset is called the corpus, and each individual review is a document.

First we calculate the term frequency, then the IDF.

Term Frequency (TF) = (number of times the word occurs in the review) / (total number of words in the review)

IDF = log(total number of reviews in the corpus / number of reviews in which the word is present)

TF-IDF = TF * IDF

Here also we take the unique words as features, but instead of filling in 0s and counts, we replace each word with its TF-IDF value for that review.
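
Below is a minimal sketch that computes these values directly from the formulas above, using the same made-up reviews. Note that scikit-learn's TfidfVectorizer uses a slightly different (smoothed and normalized) IDF formula, so its output will not match these raw values exactly.

```python
import math

# Hypothetical reviews, same as in the earlier sketches.
reviews = [
    "It is very good mobile phone for android users but expensive",
    "It is not cheap mobile phone",
]
docs = [r.lower().split() for r in reviews]
vocab = sorted(set(word for doc in docs for word in doc))

def tfidf(word, doc, docs):
    tf = doc.count(word) / len(doc)                  # term frequency in this review
    n_docs_with_word = sum(word in d for d in docs)  # how many reviews contain the word
    idf = math.log(len(docs) / n_docs_with_word)     # inverse document frequency
    return tf * idf

# TF-IDF vector for each review: one value per unique word.
vectors = [[tfidf(w, doc, docs) for w in vocab] for doc in docs]
for row in vectors:
    print([round(v, 3) for v in row])
# Words that appear in every review (e.g. "it", "is") get IDF = log(1) = 0,
# so their TF-IDF is 0; rarer words get higher weights.
```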

Thanks for reading. If you want, you can use the Word2Vec technique as well.
