TL;DR: Putting values in order usually doesn't tell us much about the values themselves, but in the case of word frequencies, Zipf found that it does.
Let W be a set of words in a large text. Let f(w) be the frequency of the word w (say, as a ratio between w's count in the text and the most popular word's count in that same text). Assume for simplicity that no two words have precisely the same frequency.
Then we can take all n words of W and put them in descending order by their frequency, thus giving each word a unique index:
w_1, w_2, ..., w_n
where
f(w_1) > f(w_2) > ... > f(w_n).
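To make this concrete, here is a minimal Python sketch of how one might build that ranked, normalized frequency list. The crude regex tokenizer and the file name book.txt are placeholder assumptions for illustration, not anything prescribed above.

```python
from collections import Counter
import re

def rank_frequencies(text):
    """Rank words by descending count and normalize so the most
    frequent word has f = 1.0, matching the definition of f(w) above."""
    # Crude tokenizer (an assumption for illustration): lowercase words only.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()   # already sorted by descending count
    top = counts[0][1]                      # count of the most frequent word
    return [(w, c / top) for w, c in counts]

# Hypothetical usage with any plain-text file of your choosing:
ranked = rank_frequencies(open("book.txt", encoding="utf-8").read())
for i, (w, f) in enumerate(ranked[:5], start=1):
    print(i, w, round(f, 3))
```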
The only information we've organized is an ordering of words by their frequency. From that alone we can't decide, in any generalizable way, by how much the frequency decreases, at what rate, or anything else. (We could, of course, attempt to characterize these things for a single, specific text.) For all we know, in your favorite book the frequency decreases by 0.0000001% for each index in the list, or it could be that in all Hacker News posts of 2021,
f(w_(i+1)) = f(w_i) / 2,
that is, each word in the list occurs half as often as the word just before it. It seems reasonable to believe that if there is some relationship, it should depend on f (which depends on the text being analyzed).
What is surprising is that it doesn't really! For all practical purposes, regardless of text, f(w_i) can be written in terms of a simple function of i, specifically as (roughly) 1/i. That means w_1 occurs the most, w_2 occurs half as often as w_1, w_3 occurs a third as often as w_1, w_4 a fourth as often as w_1 (and thus half as often as w_2), etc.
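You can check this claim on a corpus of your own by comparing the observed normalized frequencies against the 1/i prediction. The sketch below assumes a plain-text file at corpus.txt (a hypothetical path) and the same crude tokenization as the earlier sketch.

```python
from collections import Counter
import re

# Hypothetical path; any sufficiently large plain-text corpus works.
text = open("corpus.txt", encoding="utf-8").read().lower()
counts = [c for _, c in Counter(re.findall(r"[a-z']+", text)).most_common()]
top = counts[0]

# Observed f(w_i) versus the Zipf prediction 1/i for the first few ranks.
for i, c in enumerate(counts[:10], start=1):
    print(f"rank {i:>2}: observed {c / top:.3f}   predicted 1/{i} = {1 / i:.3f}")
```

On real text the match won't be exact, but the observed values should track 1/i closely enough for the pattern to be unmistakable.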