Document Length Normalization

The final BM25 improvement is document length normalization – ensuring longer documents don't get unfair advantages over shorter, more focused ones. Longer documents contain more words, which can artificially boost their scores:

Query: "bear"

  • Document A: "Boots is a silly bear wizard"
  • Document B: "Ted is a wonderful, amazing, fantastic human who has a stuffed bear that loves honey, salmon, picnics, and hanging out with other bears in the woods. Ted's bear is so nice to hang out with Ted all day long."

Document B has a higher term frequency for "bear" because it's longer, not because it's more relevant!
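To make this concrete, here is a minimal sketch that counts how often "bear" appears in each document. The `tokenize` and `term_frequency` helpers are assumptions for illustration: they lowercase the text and match tokens exactly, with no stemming, so "bears" is not counted as "bear". Your own tokenizer may behave differently.

```python
import re
from collections import Counter

DOC_A = "Boots is a silly bear wizard"
DOC_B = (
    "Ted is a wonderful, amazing, fantastic human who has a stuffed bear "
    "that loves honey, salmon, picnics, and hanging out with other bears "
    "in the woods. Ted's bear is so nice to hang out with Ted all day long."
)


def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequency(text: str, term: str) -> int:
    """Count exact occurrences of `term` among the document's tokens."""
    return Counter(tokenize(text))[term]


for name, doc in (("A", DOC_A), ("B", DOC_B)):
    print(name, "length:", len(tokenize(doc)), "tf(bear):", term_frequency(doc, "bear"))
```

Under these assumptions, Document B has the larger raw count even though Document A is entirely about a bear.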

The Length Normalization Solution

BM25 adjusts term frequency based on document length:

# Length normalization factor
length_norm = 1 - b + b * (doc_length / avg_doc_length)

# Apply to term frequency
tf_component = (tf * (k1 + 1)) / (tf + k1 * length_norm)
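As a rough sketch, the two lines above can be wrapped in a standalone function. The name `bm25_tf` and the example numbers are assumptions for illustration, not part of the course code: a toy corpus containing only Documents A and B, with lengths of 6 and 39 tokens and raw counts of 1 and 2 for "bear".

```python
def bm25_tf(tf: float, doc_length: float, avg_doc_length: float,
            k1: float = 1.5, b: float = 0.75) -> float:
    """Saturated, length-normalized term-frequency component of BM25."""
    length_norm = 1 - b + b * (doc_length / avg_doc_length)
    return (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Assumed toy corpus: Document A (6 tokens, tf=1), Document B (39 tokens, tf=2).
avg_len = (6 + 39) / 2

score_a = bm25_tf(tf=1, doc_length=6, avg_doc_length=avg_len)
score_b = bm25_tf(tf=2, doc_length=39, avg_doc_length=avg_len)
print(f"normalized:   A={score_a:.3f}  B={score_b:.3f}")

# With b=0 the length is ignored, so the longer document wins on raw count.
raw_a = bm25_tf(1, 6, avg_len, b=0)
raw_b = bm25_tf(2, 39, avg_len, b=0)
print(f"unnormalized: A={raw_a:.3f}  B={raw_b:.3f}")
```

With normalization, the short, focused Document A comes out ahead even though Document B mentions the term more often. With b=0, the order flips.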

Let's break it down.

The Core Ratio: doc_length / avg_doc_length

This ratio tells us how this document's length compares to the average document length in the dataset:

Ratio | Meaning              | Effect
= 1.0 | Average length       | No change
> 1.0 | Longer than average  | Penalized
< 1.0 | Shorter than average | Boosted
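These three cases can be checked numerically. This sketch assumes b = 0.75 and an average length of 100 tokens. The helper name `length_norm` is hypothetical.

```python
def length_norm(doc_length: float, avg_doc_length: float, b: float = 0.75) -> float:
    """BM25 length normalization factor for one document."""
    return 1 - b + b * (doc_length / avg_doc_length)


avg = 100  # assumed average document length, in tokens
for doc_length in (50, 100, 200):
    ratio = doc_length / avg
    print(f"ratio={ratio:.1f}  length_norm={length_norm(doc_length, avg):.3f}")
```

A factor below 1 shrinks the denominator of tf_component (a boost), and a factor above 1 enlarges it (a penalty).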

b (Normalization Strength)

b is a tunable parameter that controls how much we care about document length.

  • If b=0, length_norm is always 1, so document length is ignored.
  • If b=1, full normalization is applied: length_norm equals doc_length / avg_doc_length.

The key insight is that length_norm sits in the denominator of tf_component:

  • Long documents get a higher length_norm and are penalized (lower scores)
  • Short documents get a lower length_norm and are boosted (higher scores)

A common value for b is 0.75, which tends to work well in most scenarios.
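To see how b changes the outcome, the sketch below sweeps b for one short and one long document with the same raw term frequency. The helper name `tf_component`, the document lengths, and the average length are illustrative assumptions.

```python
K1 = 1.5  # saturation parameter, the value used in this lesson


def tf_component(tf: float, doc_length: float, avg_doc_length: float, b: float) -> float:
    """BM25 term-frequency component for a given normalization strength b."""
    length_norm = 1 - b + b * (doc_length / avg_doc_length)
    return (tf * (K1 + 1)) / (tf + K1 * length_norm)


AVG = 100              # assumed average document length
SHORT, LONG = 50, 200  # assumed document lengths
for b in (0.0, 0.25, 0.5, 0.75, 1.0):
    short_score = tf_component(2, SHORT, AVG, b)
    long_score = tf_component(2, LONG, AVG, b)
    print(f"b={b:.2f}  short={short_score:.3f}  long={long_score:.3f}")
```

At b=0 both documents score the same. As b grows, the gap between the short and the long document widens.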

Assignment

Implement the BM25 document length normalization formula in our InvertedIndex class. Use a b value of 0.75 for length normalization and a k1 value of 1.5 to control the saturation effect.

  1. BM25_B = 0.75

    1. bm25_tf_parser.add_argument("b", type=float, nargs='?', default=BM25_B, help="Tunable BM25 b parameter")

Run and submit the CLI tests.
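For reference, here is a small standalone sketch of how an optional positional argument with nargs='?' behaves in argparse. Only the `b` argument line and `BM25_B` come from the assignment. The subcommand name `bm25tf`, the `doc_id` and `term` arguments, and the top-level parser setup are hypothetical stand-ins for whatever your CLI already defines.

```python
import argparse

BM25_B = 0.75

parser = argparse.ArgumentParser(description="Hypothetical search CLI")
subparsers = parser.add_subparsers(dest="command", required=True)

# Hypothetical subcommand and leading arguments, for illustration only.
bm25_tf_parser = subparsers.add_parser("bm25tf")
bm25_tf_parser.add_argument("doc_id", type=int)
bm25_tf_parser.add_argument("term", type=str)
# nargs='?' makes the positional optional; the default is used when omitted.
bm25_tf_parser.add_argument("b", type=float, nargs='?', default=BM25_B,
                            help="Tunable BM25 b parameter")

default_args = parser.parse_args(["bm25tf", "1", "bear"])         # b omitted
custom_args = parser.parse_args(["bm25tf", "1", "bear", "0.5"])   # b supplied
print(default_args.b, custom_args.b)
```

When `b` is left off the command line, argparse falls back to `BM25_B`. Otherwise the supplied value is parsed as a float.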