Bag-of-Words (BoW): From Text to Frequency Vectors (Simple Word Counts)

Written by
Ala GARBAA
Full Stack AI Developer & Software Engineer
Bag-of-words (BoW) turns each document into a vector of word counts.
That means we represent a document as a list of numbers.
It works by counting how many times each word from a vocabulary appears in that document.
It's called a "bag" because it ignores word order and sentence structure, just focusing on the word counts.
In other words, it treats text like a bag of loose items. It builds a vocabulary from the corpus: each column is a word, and each row is a document.
Similar documents tend to have similar frequency patterns. Because BoW keeps frequency information, it helps spot similar documents (e.g., documents that share many words likely mean similar things).
Example: take the sentence "Python is easy to learn and fun to use". BoW counts "to" 2 times, and "python", "is", "easy", "learn", "and", "fun", "use" 1 time each. No order is preserved.
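The counting step above can be sketched with Python's `collections.Counter`:

```python
from collections import Counter

# Lowercase, split on whitespace, then count each token
counts = Counter("Python is easy to learn and fun to use".lower().split())
print(counts["to"])      # 2
print(counts["python"])  # 1
```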
We can code this as follows. First, the function bag_of_words:
import numpy as np

def bag_of_words(sentences):
    # Tokenize: lowercase each sentence and split on whitespace
    tokenized_sentences = [sentence.lower().split() for sentence in sentences]
    # Build a sorted vocabulary of unique words and a word-to-index map
    flat_words = [word for sublist in tokenized_sentences for word in sublist]
    vocabulary = sorted(set(flat_words))
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    # Count matrix: rows = documents, columns = vocabulary words
    bow_matrix = np.zeros((len(sentences), len(vocabulary)), dtype=int)
    for i, sentence in enumerate(tokenized_sentences):
        for word in sentence:
            if word in word_to_index:
                bow_matrix[i, word_to_index[word]] += 1
    return vocabulary, bow_matrix
Then we apply it to a small corpus:
corpus = [
    "Python is easy to learn and fun to use",
    "I do not like bugs in my code",
    "Learning programming takes practice and patience",
    "Debugging code can be frustrating but rewarding",
    "Writing clean code makes projects easier to maintain",
]
vocabulary, bow_matrix = bag_of_words(corpus)
print("Vocabulary:", vocabulary)
print("Bag of Words Matrix:\n", bow_matrix)
But this has problems: vectors become huge and sparse (mostly zeros) as the vocabulary grows (e.g., 200k words).
This eats memory, slows computation, and hits the curse of dimensionality: as the number of features grows, distances between points lose meaning, so comparisons become less informative.
It also raises the risk of overfitting: the model latches onto the training data and fails to generalize and make accurate predictions on new data.
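One common way to sidestep the dense-matrix blowup is to store only the nonzero counts per document, e.g., one word-to-count dict per document instead of a full row. A minimal sketch:

```python
from collections import Counter

def sparse_bow(sentences):
    # One dict per document: only words that actually occur are stored
    return [Counter(s.lower().split()) for s in sentences]

docs = sparse_bow([
    "Python is easy to learn and fun to use",
    "I do not like bugs in my code",
])
# Each dict stays small regardless of how large the shared vocabulary grows;
# missing words implicitly count as 0
print(docs[0]["to"])    # 2
print(docs[1]["code"])  # 1
print(docs[0]["code"])  # 0
```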
Fixes:
Skip punctuation, stem words, and drop common stop words like "the" or "and."
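Those fixes can be sketched with a small preprocessing step. The stop-word list below is a tiny illustrative subset, not a real one (libraries such as NLTK ship fuller lists and stemmers):

```python
import re

# Tiny illustrative stop-word list; real pipelines use much larger ones
STOP_WORDS = {"the", "and", "a", "an", "is", "to", "in", "of"}

def preprocess(sentence):
    # Lowercase, keep only alphabetic tokens (drops punctuation), remove stop words
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Python is easy to learn, and fun to use!"))
# ['python', 'easy', 'learn', 'fun', 'use']
```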
Process:
- Split (Tokenize) documents into words
- Build vocabulary of unique words (word-to-index map)
- Create matrix: rows = documents, columns = words, values = counts
Important limitation:
It loses context, and context matters. The word "awesome" could be positive, neutral, or negative depending on the sentence; frequency alone doesn't tell you sentiment.
BoW is an improvement on one-hot encoding because it tracks word frequency, but it still fails to capture the meaning and order of words.
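To see the difference, here is a sketch contrasting a binary (presence/absence) document encoding, the document-level analogue of one-hot, with BoW counts over a toy vocabulary:

```python
import numpy as np

vocab = ["fun", "python", "to", "use"]
tokens = "to use to".split()

# Binary encoding: 1 if the word occurs at all; repeats are lost
one_hot = np.array([1 if w in tokens else 0 for w in vocab])
# BoW: actual occurrence counts; "to" appearing twice is preserved
bow = np.array([tokens.count(w) for w in vocab])

print(one_hot)  # [0 0 1 1]
print(bow)      # [0 0 2 1]
```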
Step-by-Step Execution
1. Tokenization & Lowercasing
Each sentence → split into lowercase words:
[
['python', 'is', 'easy', 'to', 'learn', 'and', 'fun', 'to', 'use'],
['i', 'do', 'not', 'like', 'bugs', 'in', 'my', 'code'],
['learning', 'programming', 'takes', 'practice', 'and', 'patience'],
['debugging', 'code', 'can', 'be', 'frustrating', 'but', 'rewarding'],
['writing', 'clean', 'code', 'makes', 'projects', 'easier', 'to', 'maintain']
]
2. Build Vocabulary
Flatten all words → unique → sort:
vocabulary = [
'and', 'be', 'bugs', 'but', 'can', 'clean', 'code', 'debugging',
'do', 'easier', 'easy', 'frustrating', 'fun', 'i', 'in', 'is',
'learn', 'learning', 'like', 'makes', 'maintain', 'my', 'not',
'patience', 'practice', 'programming', 'projects', 'python',
'rewarding', 'takes', 'to', 'use', 'writing'
]
→ 33 unique words
3. Build BoW Matrix (5 docs × 33 words)
Now count occurrences per document.
Final Output:
Vocabulary: ['and', 'be', 'bugs', 'but', 'can', 'clean', 'code', 'debugging', 'do', 'easier', 'easy', 'frustrating', 'fun', 'i', 'in', 'is', 'learn', 'learning', 'like', 'makes', 'maintain', 'my', 'not', 'patience', 'practice', 'programming', 'projects', 'python', 'rewarding', 'takes', 'to', 'use', 'writing']
Bag of Words Matrix:
[[1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 2 1 0]
[0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0]
[0 1 0 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1]]
Matrix Explained (Row = Document, Column = Word)
| Word Index | Word | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 |
|---|---|---|---|---|---|---|
| 0 | and | 1 | 0 | 1 | 0 | 0 |
| 10 | easy | 1 | 0 | 0 | 0 | 0 |
| 16 | learn | 1 | 0 | 0 | 0 | 0 |
| 30 | to | 2 | 0 | 0 | 0 | 1 |
| 2 | bugs | 0 | 1 | 0 | 0 | 0 |
| 6 | code | 0 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
Example, first row (Doc 1): "to" appears 2 times → column 30 = 2; "python", "is", "easy", "learn", "and", "fun", "use" → 1 each.
Key Observations
- Sparse matrix: most entries are 0 → typical of BoW
- No punctuation handling or stemming: e.g., "learn" ≠ "learning", and stray punctuation would create spurious tokens
- Order ignored: "fun to use" → same representation as "use to fun"
- High dimensionality: 33 words from just 5 short sentences → scales poorly
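The earlier claim that similar documents have similar frequency patterns is usually checked with cosine similarity between BoW rows. A minimal sketch on hand-built vectors over a hypothetical 4-word vocabulary ['bugs', 'code', 'debugging', 'python']:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 = same direction, 0.0 = no shared words
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy BoW rows over the vocabulary ['bugs', 'code', 'debugging', 'python']
doc_bugs      = np.array([1, 1, 0, 0])  # mentions "bugs" and "code"
doc_debugging = np.array([0, 1, 1, 0])  # mentions "debugging" and "code"
doc_python    = np.array([0, 0, 0, 1])  # mentions only "python"

print(cosine(doc_bugs, doc_debugging))  # 0.5 -- they share "code"
print(cosine(doc_bugs, doc_python))     # 0.0 -- no overlap
```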