"In spite of decades of theorizing, the origins of Zipf’s law remain elusive. I propose that a Zipfian distribution straightforwardly follows from the interaction of syntax (word classes differing in class size) and semantics (words having to be sufficiently specific to be distinctive and sufficiently general to be reusable). These factors are independently motivated and well-established ingredients of a natural-language system. Using a computational model, it is shown that neither of these ingredients suffices to produce a Zipfian distribution on its own and that the results deviate from the Zipfian ideal only in the same way as natural language itself does."
I remember reading that article when it was on the front page of HN, and felt underwhelmed by the conclusion.
Reading the quoted conclusions again, I almost fail to see what the discovery is.
Natural language follows a slightly modified version of Zipf's Law / power-law distribution, but from what I remember from college, so does randomly generated text (if you draw characters uniformly at random from [a-z ] and split on spaces, you get a similar distribution)
EDIT: From Chris Manning's book "Foundations of Statistical NLP" (MIT Press, 1999):
> As a final remark on Zipf’s law, we note that there is a debate on how surprising and interesting Zipf’s law and ‘power laws’ in general are as a description of natural phenomena. It has been argued that randomly generated text exhibits Zipf’s law (Li 1992).
> To show this, we construct a generator that randomly [generates] characters from the 26 letters of the alphabet and the blank (that is, each of these 27 symbols has an equal chance of being generated next). Simplifying slightly, the probability of a word of length n being generated is the probability of generating a non-blank character n times and the blank after that. One can show that the words generated by such a generator obey a power law of the form Mandelbrot suggested. The key insights are (i) that there are 26 times more words of length n + 1 than length n, and (ii) that there is a constant ratio by which words of length n are more frequent than words of length n + 1. These two opposing trends combine into the regularity of Mandelbrot’s law. See exercise 1.4.
Where "Mandelbrot’s law" is the "slightly modified version of Zipf's Law / power distribution" I mentioned above; it was published in 1954.
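To make the book's argument concrete, here is a rough back-of-the-envelope check (my own sketch, not from the book): with 27 equiprobable symbols, any specific word of length n has probability (1/27)^n * (1/27), and there are 26^n such words, so log-frequency against log-rank has a slope of about -log(27)/log(26) ≈ -1.01, i.e. very close to Zipf's exponent of 1.

import math

# Treat all words of length n as one block of ranks and check the slope
# of log-frequency vs. log-rank for the uniform random-character generator.
log_rank, log_freq = [], []
ranks_so_far = 0
for n in range(1, 12):
    count_n = 26 ** n                # distinct words of length n
    prob_n = (1 / 27) ** (n + 1)     # probability of any one specific such word
    mid_rank = ranks_so_far + count_n / 2
    log_rank.append(math.log(mid_rank))
    log_freq.append(math.log(prob_n))
    ranks_so_far += count_n

slope = (log_freq[-1] - log_freq[0]) / (log_rank[-1] - log_rank[0])
print(slope)  # comes out near -1.01, i.e. -log(27)/log(26)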
As a non-mathematician working in finance w access to illiquid alternative returns (VC $s), I had some smarter mathematician interns look at power laws in the 2000s. We could find them everywhere - but at the time we interpreted it as our inability to use the appropriate methods. If our methods were right, we didn’t have a way to explain how such findings were relevant. My interpretation then was that the finding of PLs was an indicator of randomness.
Exactly this! Meaning intelligent language has high entropy. And it has high entropy because it's compressed. We don't notice this compression, and it seems counterintuitive that wordy natural language is actually compressed, but it has evolved to be optimal and compressed.
Would it be possible to test the law against a corpus of a creole or pidgin? Since such a language is by definition new and may be rapidly evolving, it might have less entropy, and therefore might not fit the law. That might provide some evidence for this idea.
Is there a way to compare power-laws between two series and take something from the data?
For example the slopes of public market returns are different from those of illiquid markets. Is a power law with a steeper slope more random?
I was really looking for a t-test or chi-squared type takeaway. My smarter math undergrad interns pointed out that we were at the edge of interpretive capabilities - but this was back in the 2000s.
I never understood why this is such a surprise? The frequency of a word is inversely proportional to its rank in the frequency table... what else would it be? It always struck me as circular logic, or a truism? But I'm not a math person so I'm probably oversimplifying it.
TL;DR: Putting stuff in order usually doesn't tell us much of anything about the values being ordered, but in the case of word frequencies, Zipf found that it does.
Let W be a set of words in a large text. Let f(w) be the frequency of the word w (say, as a ratio between w's count in the text and the most popular word's count in that same text). Assume for simplicity that no two words have precisely the same frequency.
Then we can take all n words of W and put them in descending order by their frequency, thus giving each word a unique index:
w_1, w_2, ..., w_n
where
f(w_1) > f(w_2) > ... > f(w_n).
The only information we've organized is an ordering of words by their frequency. We can't really say, in any generalizable way, by how much the frequency decreases, at what rate, or anything else. (We could, of course, attempt to characterize these things for a single, specific text.) For all we know, in your favorite book the frequency decreases by 0.0000001% for each index in the list, or it could be that in all Hacker News posts of 2021,
f(w_(i+1)) = f(w_i) / 2,
that is, each word in the list occurs half as often as the word just before it. It seems reasonable to believe that if there is some relationship, it should depend on f (which depends on the text being analyzed).
What is surprising is that it doesn't really! For all practical purposes, regardless of text, f(w_i) can be written in terms of a simple function of i, specifically as (roughly) 1/i. That means w_1 occurs the most, w_2 occurs half as often as w_1, w_3 occurs a third as often as w_1, w_4 a fourth as often as w_1 (and thus half as often as w_2), etc.
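If you want to see this for yourself, here's a minimal sketch (my own; 'corpus.txt' is just a placeholder for any large plain-text file, e.g. a Project Gutenberg book). It prints rank, word, relative frequency, and rank times relative frequency; the last column should hover around 1 if the ~1/i pattern holds.

from collections import Counter
import re

with open('corpus.txt', encoding='utf-8') as fh:
    words = re.findall(r"[a-z']+", fh.read().lower())

counts = Counter(words).most_common()
top = counts[0][1]
for rank, (word, count) in enumerate(counts[:20], start=1):
    # Under Zipf's law count/top ~ 1/rank, so rank * count / top should stay near 1.
    print(rank, word, round(count / top, 3), round(rank * count / top, 2))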
Fairly random. There is no obvious reason to think that word frequencies across different languages would be so similar when ranked. Why is it not the case that in language A the 2nd most popular word is used 90% as often as the first, and in language B the 2nd most popular word is used 70% as often as the first?
One would think with something like languages which are so complex and developed independently that there wouldn't be such a surprisingly consistent ratio.
And beyond language, it occurs in stuff like city populations, the amount of traffic websites receive, last names, ingredients used in cookbooks, etc...
As others have noted, the surprising part is that it follows a particular ~1/<ranking> type pattern rather than being, say, linear or something.
An interesting application is looking at untranslated works -- for example while they can't translate the Voynich Manuscript, it does follow the distribution, so it is probably not just random scribbling (of course, this doesn't rule out the likely options of cypher or constructed language).
The surprise I guess is that, when plotted on a log-log chart, the prevalence of each word forms a straight line with respect to its rank.
Of course the frequency is going to be proportional in some way to the rank. But there are many ways that could happen. #1 could occur 10% more than #2. Or twice as much.
And for the law to hold true no matter how deep you go is also surprising. Language seems like it should be a little more chaotic than that, with the top, say, 50 words following one distribution, then the longer tail kinda bumping around at different slopes.
This is my lay understanding. Corrections welcome :)
The power law is surprising. We're taught that the Gaussian distribution shows up everywhere in nature (aka the "Normal" distribution) yet here is a distribution that converges to a power law tail.
For comparison, Zipf has a power law tail (x^{-\alpha}) whereas a Gaussian distribution falls off much faster, like the exponential of a square (e^{-x^2}).
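A quick numerical illustration of just how different those tails are (my own toy numbers, with alpha = 2 picked arbitrarily):

import math

for x in (2, 5, 10):
    power_tail = x ** -2.0                 # power-law decay x^(-alpha), alpha = 2
    gaussian_tail = math.exp(-x ** 2 / 2)  # Gaussian-type decay e^(-x^2 / 2)
    print(x, power_tail, gaussian_tail)

At x = 10 the power law is still at 0.01 while the Gaussian-type term has dropped to roughly 2e-22; that is what "heavy tail" means in practice.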
Minor nitpick about being surprised that a power law shows up: you need to be really careful when you say something follows a power law. The appearance of a nice power law may actually be an illusion sometimes, because log scale tends to 'squash' data with quite a lot of 'pressure'.
That's because it's quite possible to have two fitted lines that look like they both very nicely follow a power law, but then once you go from a log-scale plot back to a linear plot by doing the inverse of the log, i.e. an exponential transform, one of the two fits may turn out to be really crappy, actually.
So, if the error bars on the log-scale plot are not in some sense really small (e.g. logarithmic themselves) to begin with, then you may actually be committing ~great crimes of statistics~ unknowingly.
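As a concrete (and admittedly contrived) illustration of that squashing: a rank-size plot of lognormal samples, which are definitely not power-law distributed, can still look roughly straight across much of a log-log plot.

import numpy as np
import matplotlib.pyplot as plt

# 10,000 lognormal samples, sorted into a rank-size plot.
rng = np.random.default_rng(0)
samples = np.sort(rng.lognormal(mean=0.0, sigma=2.0, size=10_000))[::-1]
ranks = np.arange(1, samples.size + 1)

plt.loglog(ranks, samples, '.')
plt.xlabel('rank')
plt.ylabel('value')
plt.show()  # deceptively close to a straight line over a wide stretch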
The problem with Zipf’s law in language is there are dozens and dozens of different explanations for it, each plausible and well-motivated. For any particular explanation to be convincing, it will have to make novel testable predictions, something which is rarely done.
The idea is that if the growth rate of a city is drawn from the same distribution, regardless of the actual city size, we get Zipf's law. To me this is a reasonable explanation why we see Zipf's law in many places in nature.
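For anyone who wants to poke at that mechanism, here is a toy version of the proportional-random-growth story (my own sketch; the city count, step count, shock size and lower bound are all made up): every city gets a multiplicative shock drawn from the same distribution regardless of its size, with a floor so cities can't shrink to nothing, and the resulting rank-size plot comes out roughly straight with slope near -1.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_cities, n_steps, floor, sigma = 5_000, 5_000, 1.0, 0.3
sizes = np.ones(n_cities)

for _ in range(n_steps):
    # Multiplicative growth shocks with mean 1, independent of current size.
    shocks = rng.lognormal(mean=-0.5 * sigma ** 2, sigma=sigma, size=n_cities)
    sizes = np.maximum(sizes * shocks, floor)  # reflecting lower bound

ranked = np.sort(sizes)[::-1]
plt.loglog(np.arange(1, n_cities + 1), ranked, '.')
plt.xlabel('rank')
plt.ylabel('size')
plt.show()  # roughly Zipf: slope near -1, apart from the sparsely sampled far tail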
The article mentions looking at randomly generated words and suggests that the power law being verified in that case may have to do with the generated lengths.
Has someone tried to generate random words whose length follow the distribution of natural languages?
> Has someone tried to generate random words whose length follow the distribution of natural languages?
Yes, this is an exercise in chapter one of Chris Manning et al.'s "Foundations of Statistical NLP" (MIT Press, 1999).
You can do it in about 2 lines of very code-golfed Python; 5 if you are pedantic and count the imports (and a million if you count the lines in matplotlib).
import random
from collections import Counter
import matplotlib.pyplot as plt

# Draw 10 million uniform random characters from [a-z ], split on spaces,
# and count the resulting "words" (dropping the unterminated last one).
corpus = Counter(''.join(["qwertyuiopasdfghjklzxcvbnm "[random.randint(0, 26)] for i in range(10_000_000)]).split(' ')[:-1])
# Rank-frequency plot on log-log axes; ranks start at 1 and x and y have equal length.
plt.loglog(range(1, len(corpus) + 1), [count for word, count in corpus.most_common()], '.')
plt.show()
I remove the last generated word as it is not terminated by a space (I'm a bit unsure what the statistical implication is if the last generated character is a space)
I do not get anywhere near the power-law distribution that the textbook suggests I will get.
https://journals.plos.org/plosone/article?id=10.1371/journal...
"In spite of decades of theorizing, the origins of Zipf’s law remain elusive. I propose that a Zipfian distribution straightforwardly follows from the interaction of syntax (word classes differing in class size) and semantics (words having to be sufficiently specific to be distinctive and sufficiently general to be reusable). These factors are independently motivated and well-established ingredients of a natural-language system. Using a computational model, it is shown that neither of these ingredients suffices to produce a Zipfian distribution on its own and that the results deviate from the Zipfian ideal only in the same way as natural language itself does."