The final BM25 improvement is document length normalization: making sure longer documents don't get an unfair advantage over shorter, more focused ones. Longer documents contain more words, so a query term has more chances to appear, which can artificially boost their scores:
Query: "bear"
Document A: "Boots is a silly bear wizard"
Document B: "Ted is a wonderful, amazing, fantastic human who has a stuffed bear that loves honey, salmon, picnics, and hanging out with other bears in the woods. Ted's bear is so nice to hang out with Ted all day long."
Document B has a higher term frequency for "bear" simply because it's longer, not because it's more relevant!
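To make the problem concrete, here is a small sketch that counts "bear" in both documents. The tokenize helper is a hypothetical stand-in for whatever tokenizer the index uses; it lowercases and splits on non-letters with no stemming, so "bears" is not counted as "bear" here.

```python
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters and apostrophes.
    # No stemming is applied, so "bears" and "bear" are different tokens.
    return re.findall(r"[a-z']+", text.lower())


doc_a = "Boots is a silly bear wizard"
doc_b = (
    "Ted is a wonderful, amazing, fantastic human who has a stuffed bear "
    "that loves honey, salmon, picnics, and hanging out with other bears "
    "in the woods. Ted's bear is so nice to hang out with Ted all day long."
)

tokens_a = tokenize(doc_a)
tokens_b = tokenize(doc_b)

# Raw term frequency of the query term in each document
tf_a = tokens_a.count("bear")
tf_b = tokens_b.count("bear")

print(f"A: length={len(tokens_a)}, tf(bear)={tf_a}")
print(f"B: length={len(tokens_b)}, tf(bear)={tf_b}")
```

Document B ends up with the larger raw count even though Document A is entirely about a bear, which is the imbalance length normalization is meant to correct.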
The Length Normalization Solution
BM25 adjusts term frequency based on document length:
# Length normalization factor
length_norm = 1 - b + b * (doc_length / avg_doc_length)
# Apply to term frequency
tf_component = (tf * (k1 + 1)) / (tf + k1 * length_norm)
Let's break it down.
The Core Ratio: doc_length / avg_doc_length
This ratio tells us how this document's length compares to the average document length in the dataset:
| Ratio | Meaning | Effect |
|-------|----------------------|-----------|
| = 1.0 | Average length | No change |
| > 1.0 | Longer than average | Penalized |
| < 1.0 | Shorter than average | Boosted |
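The three cases above can be checked numerically. This is a minimal sketch: the bm25_tf function name, the sample lengths, and the tf value of 2 are all assumptions chosen for illustration, and the formula is the one given earlier with k1 = 1.5 and b = 0.75.

```python
def bm25_tf(tf, doc_length, avg_doc_length, k1=1.5, b=0.75):
    # Length normalization factor: 1.0 for an average-length document
    length_norm = 1 - b + b * (doc_length / avg_doc_length)
    # Saturating term frequency, adjusted by document length
    return (tf * (k1 + 1)) / (tf + k1 * length_norm)


avg_len = 20  # assumed average document length
samples = [("short", 10), ("average", 20), ("long", 40)]

# Same raw tf in every document; only the length differs
scores = {name: bm25_tf(2, length, avg_len) for name, length in samples}

for name, length in samples:
    print(f"{name:8s} ratio={length / avg_len:.1f} tf_component={scores[name]:.3f}")
```

With the raw term frequency held constant, the short document scores highest and the long document lowest, matching the table.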
b (Normalization Strength)
b is a tunable parameter that controls how much we care about document length.
If b=0 then length_norm is always 1, so document length is ignored entirely.
If b=1 then length_norm equals doc_length / avg_doc_length, so full normalization is applied.
The key insight is:
Long documents get a higher length_norm, which grows the denominator of tf_component, so they are penalized (lower scores)
Short documents get a lower length_norm, which shrinks the denominator, so they are boosted (higher scores)
A common value is 0.75, which tends to work well in most scenarios.
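To see how b changes the penalty, the sketch below sweeps a few values of b for a document twice the average length. The length_norm function name and the lengths 40 and 20 are assumptions for illustration only.

```python
def length_norm(doc_length, avg_doc_length, b):
    # b = 0 ignores length; b = 1 uses the raw length ratio
    return 1 - b + b * (doc_length / avg_doc_length)


doc_length = 40      # assumed: a document twice as long as average
avg_doc_length = 20  # assumed average length

norms = {b: length_norm(doc_length, avg_doc_length, b) for b in (0.0, 0.5, 0.75, 1.0)}

for b, norm in norms.items():
    print(f"b={b:<4} length_norm={norm}")
```

As b moves from 0 to 1, length_norm for this document moves from 1 (no penalty) to 2 (the full length ratio), with 0.75 sitting between the two extremes.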
Assignment
Implement the BM25 document length normalization formula in our InvertedIndex class. Use a b value of 0.75 to set the normalization strength and a k1 value of 1.5 to control the saturation effect.
BM25_B = 0.75
bm25_tf_parser.add_argument("b", type=float, nargs='?', default=BM25_B, help="Tunable BM25 b parameter")
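For context, here is one way the optional positional argument could fit into an argparse subcommand. This is only a sketch: the "bm25tf" subcommand name, the doc_id, term, and k1 arguments, and the BM25_K1 constant are assumptions and may not match the actual CLI.

```python
import argparse

BM25_K1 = 1.5  # assumed constant name, mirroring BM25_B
BM25_B = 0.75

parser = argparse.ArgumentParser(description="Illustrative search CLI")
subparsers = parser.add_subparsers(dest="command")

# Hypothetical subcommand; the real command name may differ
bm25_tf_parser = subparsers.add_parser("bm25tf", help="Show the BM25 TF score for a term")
bm25_tf_parser.add_argument("doc_id", type=int, help="Document ID")
bm25_tf_parser.add_argument("term", type=str, help="Term to score")
# nargs='?' makes a positional optional, falling back to the default when omitted
bm25_tf_parser.add_argument("k1", type=float, nargs="?", default=BM25_K1, help="Tunable BM25 k1 parameter")
bm25_tf_parser.add_argument("b", type=float, nargs="?", default=BM25_B, help="Tunable BM25 b parameter")

# Parse explicit argument lists instead of sys.argv so the sketch runs anywhere
defaults = parser.parse_args(["bm25tf", "1", "bear"])
custom = parser.parse_args(["bm25tf", "1", "bear", "1.2", "0.5"])

print(defaults.k1, defaults.b)
print(custom.k1, custom.b)
```

Because both k1 and b are optional positionals, b can only be overridden on the command line when k1 is also supplied before it.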