You're treating this sample-is-the-population issue as if it's resolved in the statistics literature. It is not. Gelman has written on this [1][2], as the issue comes up frequently in political science data. As Gelman points out, the 50 states are not a sample of states; they are the entire population. Similarly, the Correlates of War [3] data is every militarized international dispute between 1816 and 2007 that fits certain criteria; it too is not a sample but the entire population.
Treating his population as a large sample from an uncertain or noisy process, and then applying frequentist statistics, is not inherently wrong in the way you say it is. It may be that there's a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point than the one you make.
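To make that concrete, here's a toy sketch (my own illustration, not from the article; the model and numbers are made up): even when we observe all 50 states, a regression standard error can be read as uncertainty about the underlying process, not about unobserved states.

```python
# Toy illustration: treat the 50 observed states as one noisy
# realization of an underlying process, not a sample of more states.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_slope = 2.0   # parameter of the "ideal" process (assumed)
n_states = 50      # we observe the entire population of states

x = rng.uniform(0.0, 1.0, n_states)
# One realization of the process: the history we actually observed.
y = 1.0 + true_slope * x + rng.normal(0.0, 0.5, n_states)

fit = stats.linregress(x, y)
print(f"slope = {fit.slope:.3f} +/- {fit.stderr:.3f}")
# The standard error here is not about "unsampled states" (there are
# none); under this reading it quantifies uncertainty about the
# process parameter, i.e., what the slope would be across re-runs
# of history.
```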
> Similarly, the Correlates of War [3] data is every militarized international dispute between 1816 and 2007 that fits certain criteria; it too is not a sample but the entire population.
It's the entire population of wars meeting certain criteria in that time frame. If that is the topic of interest, then it is indeed the whole population. OTOH, datasets like that are often used in analyses intended to apply to, for instance, "what-if" scenarios about hypothetical wars that could have happened in that time frame. In that case the studied population is clearly not the population of interest; it is instead taken to be a representative sample of a broader population. There may be specific reasons to criticize that move in specific cases, but those reasons are something other than "it's the whole population, not a sample."
Exactly. There is an interpretation under which the "population" is a mathematical ideal process (with potentially infinite information content) and any real, physical manifestation of it is a "sample".

The old-school interpretation is stricter and considers both the "population" and the "sample" to be physical, real things. That's understandable, because these methods were developed for statistics about human populations, medical studies, etc.; note the origin of the terminology (the word "statistics" itself derives from "state").
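To illustrate the "ideal process" reading with a toy sketch of my own (the Poisson model and its rate are assumptions for illustration, not a claim about the actual dispute data): the observed 1816-2007 record is one draw from the process, and even a "whole population" summary varies across hypothetical re-runs of history.

```python
# Toy sketch: suppose disputes per year follow a Poisson process with
# unknown rate lam. The observed historical record is one draw from
# that process; a summary like the average count per year would vary
# across alternative histories the same process could have produced.
import numpy as np

rng = np.random.default_rng(1)

lam = 3.0                      # rate of the ideal process (assumed)
n_years = 2007 - 1816 + 1

# Many alternative "histories" generated by the same process.
histories = rng.poisson(lam, size=(10_000, n_years))
means = histories.mean(axis=1)

print(f"spread of the per-year mean across histories: {means.std():.3f}")
# Close to the theoretical sqrt(lam / n_years) -- the uncertainty a
# frequentist interval computed on the observed "population" is
# implicitly estimating under this interpretation.
print(f"theory: {np.sqrt(lam / n_years):.3f}")
```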
For some reason, frequentist statisticians are usually very conservative and set in one way of thinking, and do not even like to entertain an alternative interpretation or paradigm... I'm not sure why that is.
As an economist, I am also aware of the logical contortions we have to go through to be able to run regressions on historical data (i.e. pretty much all economic data). None of that applies here: the data-generating process consists of the minds of the writers.
For your reasoning to apply here, you would have to put together a model of the data-generating process from which valid inference can be derived. What exactly are the assumptions on P(word_i | character_j) that make the data compatible with these particular tests' assumptions?
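To spell out what such a model would have to assume, here is a sketch (my own, with made-up counts; not the article's actual test or data): a standard chi-squared contingency test on word-by-character counts implicitly treats every word token as an independent draw from a character-specific multinomial P(word | character), and that independence is exactly the assumption that needs defending for text.

```python
# Sketch of the model a standard test implicitly assumes: each word
# token is an independent multinomial draw given the character. The
# chi-squared contingency test below is only calibrated under that
# assumption. (Counts are made up for illustration.)
import numpy as np
from scipy.stats import chi2_contingency

# Rows: characters; columns: counts of a few words in their dialogue.
counts = np.array([
    [120, 30, 15],   # character A
    [ 90, 45, 10],   # character B
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
# If word tokens are not independent given the character -- and in
# real prose they are not, since words cluster within sentences,
# scenes, and topics -- the nominal p-value does not mean what the
# test claims it means.
```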
[1]: http://andrewgelman.com/2009/07/03/how_does_statis/
[2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)