"In spite of decades of theorizing, the origins of Zipf’s law remain elusive. I propose that a Zipfian distribution straightforwardly follows from the interaction of syntax (word classes differing in class size) and semantics (words having to be sufficiently specific to be distinctive and sufficiently general to be reusable). These factors are independently motivated and well-established ingredients of a natural-language system. Using a computational model, it is shown that neither of these ingredients suffices to produce a Zipfian distribution on its own and that the results deviate from the Zipfian ideal only in the same way as natural language itself does."
I remember reading that article when it was on the front page of HN, and felt underwhelmed by the conclusion.
Reading the quoted conclusions again, I almost fail to see what the discovery is.
Natural language follows a slightly modified version of Zipf's Law / power-law distribution, but from what I remember from college, so does randomly generated text (if you draw characters uniformly at random from [a-z ] and split on spaces, you get a similar distribution)
EDIT: From Chris Manning's book "Foundations of Statistical NLP" (MIT Press, 1999):
> As a final remark on Zipf’s law, we note that there is a debate on how surprising and interesting Zipf’s law and ‘power laws’ in general are as a description of natural phenomena. It has been argued that randomly generated text exhibits Zipf’s law (Li 1992).
> To show this, we construct a generator that randomly [generates] characters from the 26 letters of the alphabet and the blank (that is, each of these 27 symbols has an equal chance of being generated next). Simplifying slightly, the probability of a word of length n being generated is the probability of generating a non-blank character n times and the blank after that. One can show that the words generated by such a generator obey a power law of the form Mandelbrot suggested. The key insights are (i) that there are 26 times more words of length n + 1 than length n, and (ii) that there is a constant ratio by which words of length n are more frequent than words of length n + 1. These two opposing trends combine into the regularity of Mandelbrot’s law. See exercise 1.4.
Where "Mandelbrot’s law" is the "slightly modified version of Zipf's Law / power distribution" I mentioned above; it was published in 1954.
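To make the book's argument concrete, here is a rough back-of-the-envelope check (my own sketch, not from the book): with 27 equiprobable symbols, any specific word of length n has probability (1/27)^n * (1/27), and there are 26^n such words, so log-frequency against log-rank has a slope of about -log(27)/log(26) ≈ -1.01, i.e. very close to Zipf's exponent of 1.

import math

# Treat all words of length n as one block of ranks and check the slope
# of log-frequency vs. log-rank for the uniform random-character generator.
log_rank, log_freq = [], []
ranks_so_far = 0
for n in range(1, 12):
    count_n = 26 ** n                # distinct words of length n
    prob_n = (1 / 27) ** (n + 1)     # probability of any one specific such word
    mid_rank = ranks_so_far + count_n / 2
    log_rank.append(math.log(mid_rank))
    log_freq.append(math.log(prob_n))
    ranks_so_far += count_n

slope = (log_freq[-1] - log_freq[0]) / (log_rank[-1] - log_rank[0])
print(slope)  # comes out near -1.01, i.e. -log(27)/log(26)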
As a non-mathematician working in finance w access to illiquid alternative returns (VC $s), I had some smarter mathematician interns look at power laws in the 2000s. We could find them everywhere - but at the time we interpreted it as our inability to use the appropriate methods. If our methods were right, we didn’t have a way to explain how such findings were relevant. My interpretation then was that the finding of PLs was an indicator of randomness.
Exactly this! Meaning intelligent language has high entropy. And it has high entropy because it's compressed. We don't notice this compression, and it seems counterintuitive that wordy natural language is actually compressed, but it has evolved to be optimal and compressed.
Would it be possible to test the law against a corpus of a creole or pidgin? Since such a language is by definition new and may be rapidly evolving, it might have less entropy, and therefore might not fit the law. That might provide some evidence for this idea.
Is there a way to compare power-laws between two series and take something from the data?
For example the slopes of public market returns are different from those of illiquid markets. Is a power law with a steeper slope more random?
I was really looking for a t-test or chi-squared type takeaway. My smarter math undergrad interns pointed out that we were at the edge of interpretive capabilities - but this was back in the 2000s.
I never understood why this is such a surprise? The frequency of a word is inversely proportional to its rank in the frequency table... what else would it be? It always struck me as circular logic, or a truism? But I'm not a math person so I'm probably oversimplifying it.
TL;DR: Putting stuff in order usually doesn't tell us much of anything about the values being ordered, but in the case of word frequencies, Zipf found that it does.
Let W be a set of words in a large text. Let f(w) be the frequency of the word w (say, as a ratio between w's count in the text and the most popular word's count in that same text). Assume for simplicity that no two words have precisely the same frequency.
Then we can take all n words of W and put them in descending order by their frequency, thus giving each word a unique index:
w_1, w_2, ..., w_n
where
f(w_1) > f(w_2) > ... > f(w_n).
The only information we've organized is an ordering of words by their frequency. We can't really say, in any generalizable way, by how much the frequency decreases, at what rate, or anything else. (We could, of course, attempt to characterize these things for a single, specific text.) For all we know, in your favorite book the frequency decreases by 0.0000001% for each index in the list, or it could be that in all Hacker News posts of 2021,
f(w_(i+1)) = f(w_i) / 2,
that is, each word in the list occurs half as often as the word just before it. It seems reasonable to believe that if there is some relationship, it should depend on f (which depends on the text being analyzed).
What is surprising is that it doesn't really! For all practical purposes, regardless of text, f(w_i) can be written in terms of a simple function of i, specifically as (roughly) 1/i. That means w_1 occurs the most, w_2 occurs half as often as w_1, w_3 occurs a third as often as w_1, w_4 a fourth as often as w_1 (and thus half as often as w_2), etc.
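If you want to see this for yourself, here's a minimal sketch (my own; 'corpus.txt' is just a placeholder for any large plain-text file, e.g. a Project Gutenberg book). It prints rank, word, relative frequency, and rank times relative frequency; the last column should hover around 1 if the ~1/i pattern holds.

from collections import Counter
import re

with open('corpus.txt', encoding='utf-8') as fh:
    words = re.findall(r"[a-z']+", fh.read().lower())

counts = Counter(words).most_common()
top = counts[0][1]
for rank, (word, count) in enumerate(counts[:20], start=1):
    # Under Zipf's law count/top ~ 1/rank, so rank * count / top should stay near 1.
    print(rank, word, round(count / top, 3), round(rank * count / top, 2))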
Fairly random. There is no obvious reason to think that word frequencies across different languages would be so similar when ranked. Why is it not the case that in language A the 2nd most popular word is used 90% as often as the first, and in language B the 2nd most popular word is used 70% as often as the first?
One would think with something like languages which are so complex and developed independently that there wouldn't be such a surprisingly consistent ratio.
And beyond language, it occurs in stuff like city populations, the amount of traffic websites receive, last names, ingredients used in cookbooks, etc...
As others have noted, the surprising part is that it follows a particular ~1/<ranking> type pattern rather than being, say, linear or something.
An interesting application is looking at untranslated works -- for example while they can't translate the Voynich Manuscript, it does follow the distribution, so it is probably not just random scribbling (of course, this doesn't rule out the likely options of cypher or constructed language).
The surprise I guess is that, when plotted on a log-log chart, the prevalence of each word forms a straight line with respect to its rank.
Of course the frequency is going to be proportional in some way to the rank. But there are many ways that could happen. #1 could occur 10% more than #2. Or twice as much.
And for the law to hold true no matter how deep you go is also surprising. Language seems like it should be a little more chaotic than that, with the top, say, 50 words following one distribution, then the longer tail kinda bumping around at different slopes.
This is my lay understanding. Corrections welcome :)
The power law is surprising. We're taught that the Gaussian distribution shows up everywhere in nature (aka the "Normal" distribution) yet here is a distribution that converges to a power law tail.
For comparison, Zipf has a power law tail (x^{-\alpha}) whereas a Gaussian distribution falls off much faster, like the exponential of a square (e^{-x^2}).
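A quick numerical illustration of just how different those tails are (my own toy numbers, with alpha = 2 picked arbitrarily):

import math

for x in (2, 5, 10):
    power_tail = x ** -2.0                 # power-law decay x^(-alpha), alpha = 2
    gaussian_tail = math.exp(-x ** 2 / 2)  # Gaussian-type decay e^(-x^2 / 2)
    print(x, power_tail, gaussian_tail)

At x = 10 the power law is still at 0.01 while the Gaussian-type term has dropped to roughly 2e-22; that is what "heavy tail" means in practice.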
Minor nitpick about being surprised that a power law shows up: you need to be really careful when you say something follows a power law. The appearance of a nice power law may actually be an illusion sometimes, because log scale tends to 'squash' data with quite a lot of 'pressure'.
That's because it's quite possible to have two fitted lines that look like they both very nicely follow a power law, but then once you go from a log-scale plot back to a linear plot by doing the inverse of the log, i.e. an exponential transform, one of the two fits may turn out to be really crappy, actually.
So, if the error bars on the log-scale plot are not in some sense really small (e.g. logarithmic themselves) to begin with, then you may actually be committing ~great crimes of statistics~ unknowingly.
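As a concrete (and admittedly contrived) illustration of that squashing: a rank-size plot of lognormal samples, which are definitely not power-law distributed, can still look roughly straight across much of a log-log plot.

import numpy as np
import matplotlib.pyplot as plt

# 10,000 lognormal samples, sorted into a rank-size plot.
rng = np.random.default_rng(0)
samples = np.sort(rng.lognormal(mean=0.0, sigma=2.0, size=10_000))[::-1]
ranks = np.arange(1, samples.size + 1)

plt.loglog(ranks, samples, '.')
plt.xlabel('rank')
plt.ylabel('value')
plt.show()  # deceptively close to a straight line over a wide stretch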
The problem with Zipf’s law in language is there are dozens and dozens of different explanations for it, each plausible and well-motivated. For any particular explanation to be convincing, it will have to make novel testable predictions, something which is rarely done.
The idea is that if the growth rate of a city is drawn from the same distribution, regardless of the actual city size, we get Zipf's law. To me this is a reasonable explanation why we see Zipf's law in many places in nature.
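For anyone who wants to poke at that mechanism, here is a toy version of the proportional-random-growth story (my own sketch; the city count, step count, shock size and lower bound are all made up): every city gets a multiplicative shock drawn from the same distribution regardless of its size, with a floor so cities can't shrink to nothing, and the resulting rank-size plot comes out roughly straight with slope near -1.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_cities, n_steps, floor, sigma = 5_000, 5_000, 1.0, 0.3
sizes = np.ones(n_cities)

for _ in range(n_steps):
    # Multiplicative growth shocks with mean 1, independent of current size.
    shocks = rng.lognormal(mean=-0.5 * sigma ** 2, sigma=sigma, size=n_cities)
    sizes = np.maximum(sizes * shocks, floor)  # reflecting lower bound

ranked = np.sort(sizes)[::-1]
plt.loglog(np.arange(1, n_cities + 1), ranked, '.')
plt.xlabel('rank')
plt.ylabel('size')
plt.show()  # roughly Zipf: slope near -1, apart from the sparsely sampled far tail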
The article mentions looking at randomly generated words and suggests that the power law being verified in that case may have to do with the generated lengths.
Has someone tried to generate random words whose length follow the distribution of natural languages?
> Has someone tried to generate random words whose length follow the distribution of natural languages?
Yes, this is an exercise in chapter one of Chris Manning et al.'s "Foundations of Statistical NLP" (MIT Press, 1999).
You can do it in about 2 lines of very code-golfed Python; 5 if you are pedantic and count the imports (and a million if you count the lines in matplotlib).
import random
from collections import Counter
import matplotlib.pyplot as plt

# Draw 10 million uniform random characters from [a-z ], split on spaces,
# and count the resulting "words" (dropping the unterminated last one).
corpus = Counter(''.join(["qwertyuiopasdfghjklzxcvbnm "[random.randint(0, 26)] for i in range(10_000_000)]).split(' ')[:-1])
# Rank-frequency plot on log-log axes; ranks start at 1 and x and y have equal length.
plt.loglog(range(1, len(corpus) + 1), [count for word, count in corpus.most_common()], '.')
plt.show()
I remove the last generated word as it is not terminated by a space (I'm a bit unsure what the statistical implication is if the last generated character is a space)
I do not get anywhere near the power-law distribution that the textbook suggests I will get.
https://journals.plos.org/plosone/article?id=10.1371/journal...
"In spite of decades of theorizing, the origins of Zipf’s law remain elusive. I propose that a Zipfian distribution straightforwardly follows from the interaction of syntax (word classes differing in class size) and semantics (words having to be sufficiently specific to be distinctive and sufficiently general to be reusable). These factors are independently motivated and well-established ingredients of a natural-language system. Using a computational model, it is shown that neither of these ingredients suffices to produce a Zipfian distribution on its own and that the results deviate from the Zipfian ideal only in the same way as natural language itself does."