Statistics is a messy field. It has too many introductory texts that oversimplify the subject and not enough well-done intermediate or advanced texts.
Also the subject teaches essentially a lie: The texts explain that a random variable has a distribution. Right, it does. Then they mention some common distributions, especially Gaussian, exponential, Poisson, multinomial, and uniform. Then the lie: The suggestion is that in practice we collect data and find the distribution. Nope: Mostly not. Mostly in practice we can't find the distribution, not even of one random variable, and much less likely the joint distribution of several random variables (that is, of a vector-valued random variable). Estimating the distribution of a vector-valued random variable commonly runs into the curse of dimensionality and would require really big data. Instead, usually we use limit theorems, techniques that don't need the distribution, or in some cases make, say, a Gaussian assumption and get a first-cut approximation.
Early in my career I did a lot in applied statistics but later concluded I'd done a lot of slogging through a muddy swamp of low-grade material.
A clean and powerful first-cut approach to statistics is just via a good background in probability: With this approach, you take some data, regard it as values of some random variables with some useful properties, stuff the data into some computations, and get out data that you regard as the values of some more random variables -- the statistics. The big deal is what properties the output random variables have: maybe they are unbiased, minimum-variance, Gaussian, maximum-likelihood estimates of something, etc.
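Here is a minimal sketch of that view (Python, with made-up numbers; the Gaussian input is just illustrative): the sample mean and sample variance are computations on data regarded as random, the outputs are themselves random variables, and one can check a property -- unbiasedness -- by simulation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Input data: regarded as n draws of a random variable X.
    n, trials = 30, 100_000
    true_mean, true_var = 3.0, 4.0

    # Each trial: stuff the data into some computations, get statistics out.
    samples = rng.normal(true_mean, np.sqrt(true_var), size=(trials, n))
    xbar = samples.mean(axis=1)          # sample mean, one per trial
    s2 = samples.var(axis=1, ddof=1)     # sample variance with the n-1 divisor

    # The statistics are random variables; check unbiasedness:
    print(xbar.mean())   # close to 3.0: the sample mean is unbiased
    print(s2.mean())     # close to 4.0: the n-1 divisor makes s2 unbiased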
For this work you will want to know the classic limit theorems of probability theory -- the weak and strong laws of large numbers, the elementary and advanced (Lindeberg-Feller) versions of the central limit theorem, the law of the iterated logarithm (and its astounding application to an envelope of Brownian motion), and martingales and the martingale convergence theorem ("the most powerful limit theorem in mathematics" -- one can build much of a successful academic career just on applications of that result). And, generally beyond the elementary statistics books, you will want to understand sufficient statistics (and the astounding fact that, for the Gaussian, the sample mean and variance are sufficient, with generalizations to the exponential family) and also U-statistics, where the order of the input data makes no difference (and the order statistics are always sufficient). Sufficient statistics really rests on (a classic paper by Halmos and Savage and) the Radon-Nikodym theorem (with a famous, very clever, cute proof by von Neumann), and that result is in, say, the first half of W. Rudin, Real and Complex Analysis (with von Neumann's proof).
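To see the central limit theorem in action, here is a small simulation sketch (Python; the exponential input is an arbitrary non-Gaussian choice):

    import numpy as np

    rng = np.random.default_rng(1)

    # Draws from a decidedly non-Gaussian distribution:
    # exponential with mean 1 and standard deviation 1.
    n, trials = 500, 20_000
    x = rng.exponential(1.0, size=(trials, n))

    # Standardize the sample mean: sqrt(n) * (xbar - mu) / sigma.
    z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0

    # By the central limit theorem, z should look standard Gaussian:
    # about 68% within 1, about 95% within 2.
    print((np.abs(z) < 1).mean())   # ~0.68
    print((np.abs(z) < 2).mean())   # ~0.95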
Also, with the Radon-Nikodym theorem, one can quickly do the Hahn decomposition and then knock off a very general proof of the Neyman-Pearson result in statistics. How 'bout that!
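A concrete toy version of Neyman-Pearson (a sketch in Python with scipy; the two Gaussian hypotheses are purely illustrative): for a simple null versus a simple alternative, the likelihood-ratio test is the most powerful test at its level, and here the ratio is monotone in x, so the test reduces to a threshold on x.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    # Simple null H0: N(0,1) vs. simple alternative H1: N(1,1),
    # one observation. Neyman-Pearson: reject H0 when the likelihood
    # ratio f1(x)/f0(x) is large, threshold set by the level alpha.
    alpha = 0.05
    # The ratio is increasing in x, so the test is just x > c.
    c = norm.ppf(1 - alpha, loc=0, scale=1)

    x0 = rng.normal(0, 1, 100_000)   # data under H0
    x1 = rng.normal(1, 1, 100_000)   # data under H1
    print((x0 > c).mean())           # ~0.05, the level
    print((x1 > c).mean())           # the power, maximal at this level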
Thus, to do well with statistics, both for now and for the future, and especially if you want to do some original work, you will need much of the rest of a good undergraduate major in math and the courses of a Master's in selected topics in pure/applied math.
So, for such study: sure, at one time Harvard's famous Math 55 used the Halmos text above along with W. Rudin, Principles of Mathematical Analysis (calculus done very carefully and a good foundation for more), and Spivak, Calculus on Manifolds, e.g., for people interested in more modern approaches to relativity theory (though Cartan's book is available in English now). It may be that you are not interested in relativity theory or the rest of mathematical physics -- fine, and that can help you set aside some topics.
Then Royden, Real Analysis, and the first half of Rudin's R&CA as above, along with any of several alternatives, cover measure theory and the beginnings of functional analysis. Measure theory does calculus again and in a more powerful way -- in freshman calculus you want to integrate a continuous function defined on a closed interval of finite length, but in measure theory you get much more generality.
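A standard worked example of that generality (in LaTeX notation): the indicator of the rationals on [0, 1] has no Riemann integral, since every upper sum is 1 and every lower sum is 0, but the rationals have Lebesgue measure zero, so the Lebesgue integral exists and is 0:

    f(x) =
    \begin{cases}
    1, & x \in \mathbb{Q} \cap [0,1] \\
    0, & \text{otherwise}
    \end{cases}
    \qquad\Longrightarrow\qquad
    \int_{[0,1]} f \, d\lambda = \lambda(\mathbb{Q} \cap [0,1]) = 0 .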
And measure theory also provides the axiomatic foundation for modern probability theory and for random variables. Seeing that definition of a random variable is a real eye-opener, for me a life-changing event: you get a level of understanding of randomness that cuts out and tosses into the dumpster or bit bucket nearly all the elementary and popular (horribly confused) treatments of randomness.
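For reference, that definition (the standard Kolmogorov setup, in LaTeX notation): a random variable is nothing but a measurable function on a probability space,

    X : (\Omega, \mathcal{F}, P) \to \mathbb{R},
    \qquad
    X^{-1}(B) \in \mathcal{F} \ \text{for every Borel set } B \subseteq \mathbb{R},

and its distribution is just the pushforward measure B |-> P(X^{-1}(B)). No mysterious "randomness" anywhere, which is exactly what cleans up the confused popular treatments.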
Functional analysis? Well, in linear algebra you get comfortable with vector spaces. So, for a positive integer n and the set of real numbers R, you get happy in the n-dimensional vector space R^n. But also be sure to see the axioms of a vector space, where R^n is just the leading example. You want the axioms right away for, say, the (affine) subspace of R^n that is the set of all solutions of a system of linear equations. How 'bout that!
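A small sketch of that solution-set example (Python with numpy; the 2x3 system is made up): every solution of Ax = b is one particular solution plus something from the null space of A -- a translated subspace, which is why you want the axioms and not just R^n.

    import numpy as np

    # An underdetermined system: 2 equations, 3 unknowns.
    A = np.array([[1.0, 2.0, 3.0],
                  [0.0, 1.0, 4.0]])
    b = np.array([7.0, 8.0])

    # One particular solution (the minimum-norm one) via least squares.
    x_p, *_ = np.linalg.lstsq(A, b, rcond=None)

    # A basis for the null space of A from the SVD: the rows of Vt
    # past the rank span {x : Ax = 0}.
    _, s, Vt = np.linalg.svd(A)
    rank = np.sum(s > 1e-10)
    null_basis = Vt[rank:]

    # Any particular solution plus any null-space vector solves Ax = b.
    t = 2.5   # arbitrary coefficient
    x = x_p + t * null_basis[0]
    print(np.allclose(A @ x, b))   # True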
Then in functional analysis, you work with functions, where each function is regarded as a point in a vector space. The nicest such vector space is Hilbert space, which has an inner product (essentially the same thing as angle, or in probability covariance, or in statistics correlation) and gives a metric in which the space is complete -- that is, as in the real numbers but not in the rationals, a sequence that appears to converge really has something to converge to. Then, wonder of wonders, the set of all real-valued random variables X such that the expectation (measure theory integral) E[X^2] is finite is a Hilbert space -- the triangle inequality is just the Minkowski inequality, and the completeness is the Riesz-Fischer theorem. Amazing, but true.
Then in Hilbert space you get to see how to approximate one function by others. So, in particular, you get to see how to approximate a random variable you don't have by ones you do have -- you might call that statistical estimation, and you would be correct.
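A tiny sketch of that idea (an illustrative Python simulation): the best approximation of Y by a linear function of X, in the sense of minimizing E[(Y - aX - b)^2], is a projection, with a = Cov(X,Y)/Var(X).

    import numpy as np

    rng = np.random.default_rng(3)

    # Y is a noisy function of X; we "have" X and want to approximate Y.
    x = rng.normal(0, 1, 200_000)
    y = 2.0 * x + rng.normal(0, 1, 200_000)

    # Projection coefficients in the L^2 inner product <U, V> = E[U V]:
    c = np.cov(x, y)
    a = c[0, 1] / c[0, 0]
    b = y.mean() - a * x.mean()

    resid = y - (a * x + b)
    print(a, b)                      # ~2.0 and ~0.0
    print(np.cov(resid, x)[0, 1])    # ~0: the error is orthogonal to x

The residual being orthogonal (uncorrelated) to x is exactly the Hilbert-space projection picture.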
Then one can drag out the Hahn-Banach result and do projections, that is, least squares, that is, in an important sense (from a classic random-variable convergence result you should be sure to learn), best possible linear approximations. And maybe such an approximation is the ad targeting that makes you the most money.
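A minimal least-squares-as-projection sketch (Python with numpy; the design matrix here is made up):

    import numpy as np

    rng = np.random.default_rng(4)

    # n observations of p explanatory variables, plus a column of ones.
    n, p = 500, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ beta_true + rng.normal(0, 1, n)

    # Least squares = orthogonal projection of y onto the column space of X.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat

    # The defining property of a projection: the residual is orthogonal
    # to every column of X.
    print(np.abs(X.T @ (y - y_hat)).max())   # ~0
    print(beta_hat)                           # near beta_true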
So, that projection is a baby version of regression analysis. There's a problem here: The usual treatments of regression analysis make a long list of assumptions that look essentially impossible to verify or satisfy in practice and, thus, leave one with what look like unjustified applications.
Nope: Just do the derivations yourself with fewer assumptions; you get fewer results, but still often enough for practice. And they are still solid results.
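For instance, the core derivation needs no distributional assumptions at all (in LaTeX notation, with design matrix X and data y as in any regression text): minimize the squared error, set the gradient to zero, and get the normal equations,

    \hat{\beta} = \arg\min_b \, \lVert y - X b \rVert^2
    \quad\Longrightarrow\quad
    X^\top X \hat{\beta} = X^\top y ,

which is pure linear algebra, i.e., the projection above; the long list of assumptions (Gaussian errors, independence, constant variance) enters only when one wants sampling distributions for beta-hat.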
The usual text derivations, by assuming so much, get much more, especially lots of confidence intervals. In practice you can often use those confidence-interval results as first-cut, rough measures of goodness of fit or some such.
But the idea of just a projection can give you a lot. In particular, there is an easy, sweetheart way around the onerous, hideous, hated overfitting -- it seems silly that having too much data hurts, and it shouldn't hurt and doesn't have to!
And the now-popular practice in machine learning of just fitting with training data and then verifying with test data, with some more considerations that are also appropriate, can also be solid with even fewer assumptions.
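A bare-bones version of that fit-then-verify practice (an illustrative Python sketch with scikit-learn; the data are made up):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(5)

    # Made-up data: 1000 cases, 10 features, linear signal plus noise.
    X = rng.normal(size=(1000, 10))
    y = X @ rng.normal(size=10) + rng.normal(0, 1, 1000)

    # Fit on one split, verify on data the fit never saw.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)

    # If test performance is close to training performance, the fit
    # is not just memorizing the training data.
    print(model.score(X_tr, y_tr))   # R^2 on training data
    print(model.score(X_te, y_te))   # R^2 on held-out data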
Go for it!