Statistics is a messy field. It has too many introductory texts that oversimplify the subject and not enough well-done intermediate or advanced texts.
Also the subject teaches essentially a lie: The texts explain that a random variable has a distribution. Right, it does. Then they mention some common distributions, especially Gaussian, exponential, Poisson, multinomial, and uniform. Then the lie: The suggestion is that in practice we collect data and find the distribution. Nope: Mostly not. Mostly in practice we can't find the distribution, not even of one random variable, and much less likely the joint distribution of several random variables (that is, of a vector-valued random variable). Estimating the distribution of a vector-valued random variable commonly runs into the curse of dimensionality and would require really big data. Instead, usually we use limit theorems, techniques that don't need the distribution, or in some cases make, say, a Gaussian assumption and get a first-cut approximation.
Early in my career I did a lot in applied statistics but later concluded I'd done a lot of slogging through a muddy swamp of low-grade material.
A clean and powerful first-cut approach to statistics is just via a good background in probability: With this approach, you take some data, regard it as values of some random variables with some useful properties, stuff the data into some computations, and get out data that you regard as the values of some more random variables -- the statistics. The big deal is what properties the output random variables have: maybe they are unbiased, minimum-variance, Gaussian, maximum-likelihood estimates of something, etc.
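Here is a minimal sketch of that view (Python, with made-up numbers; the Gaussian input is just illustrative): the sample mean and sample variance are computations on data regarded as random, the outputs are themselves random variables, and one can check a property -- unbiasedness -- by simulation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Input data: regarded as n draws of a random variable X.
    n, trials = 30, 100_000
    true_mean, true_var = 3.0, 4.0

    # Each trial: stuff the data into some computations, get statistics out.
    samples = rng.normal(true_mean, np.sqrt(true_var), size=(trials, n))
    xbar = samples.mean(axis=1)          # sample mean, one per trial
    s2 = samples.var(axis=1, ddof=1)     # sample variance with the n-1 divisor

    # The statistics are random variables; check unbiasedness:
    print(xbar.mean())   # close to 3.0: the sample mean is unbiased
    print(s2.mean())     # close to 4.0: the n-1 divisor makes s2 unbiased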
For this work you will want to know the classic limit theorems of probability theory -- the weak and strong laws of large numbers, the elementary and advanced (Lindeberg-Feller) versions of the central limit theorem, the law of the iterated logarithm (and its astounding application to an envelope of Brownian motion), and martingales and the martingale convergence theorem ("the most powerful limit theorem in mathematics" -- one can build much of a successful academic career just on applications of that result). And, generally beyond the elementary statistics books, you will want to understand sufficient statistics (and the astounding fact that, for the Gaussian, the sample mean and variance are sufficient, with generalizations to the exponential family) and also U-statistics, where the order of the input data makes no difference (and the order statistics are always sufficient). Sufficient statistics really rests on (a classic paper by Halmos and Savage and) the Radon-Nikodym theorem (with a famous, very clever, cute proof by von Neumann), and that result is in, say, the first half of W. Rudin, Real and Complex Analysis (with von Neumann's proof).
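To see the central limit theorem in action, here is a small simulation sketch (Python; the exponential input is an arbitrary non-Gaussian choice):

    import numpy as np

    rng = np.random.default_rng(1)

    # Draws from a decidedly non-Gaussian distribution:
    # exponential with mean 1 and standard deviation 1.
    n, trials = 500, 20_000
    x = rng.exponential(1.0, size=(trials, n))

    # Standardize the sample mean: sqrt(n) * (xbar - mu) / sigma.
    z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0

    # By the central limit theorem, z should look standard Gaussian:
    # about 68% within 1, about 95% within 2.
    print((np.abs(z) < 1).mean())   # ~0.68
    print((np.abs(z) < 2).mean())   # ~0.95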
Also, with the Radon-Nikodym theorem, one can quickly do the Hahn decomposition and then knock off a very general proof of the Neyman-Pearson result in statistics. How 'bout that!
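A concrete toy version of Neyman-Pearson (a sketch in Python with scipy; the two Gaussian hypotheses are purely illustrative): for a simple null versus a simple alternative, the likelihood-ratio test is the most powerful test at its level, and here the ratio is monotone in x, so the test reduces to a threshold on x.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    # Simple null H0: N(0,1) vs. simple alternative H1: N(1,1),
    # one observation. Neyman-Pearson: reject H0 when the likelihood
    # ratio f1(x)/f0(x) is large, threshold set by the level alpha.
    alpha = 0.05
    # The ratio is increasing in x, so the test is just x > c.
    c = norm.ppf(1 - alpha, loc=0, scale=1)

    x0 = rng.normal(0, 1, 100_000)   # data under H0
    x1 = rng.normal(1, 1, 100_000)   # data under H1
    print((x0 > c).mean())           # ~0.05, the level
    print((x1 > c).mean())           # the power, maximal at this level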
Thus, to do well with statistics, both for now and for the future, and especially if you want to do some original work, you will need much of the rest of a good undergraduate major in math and the courses of a Master's in selected topics in pure/applied math.
So, for such study: sure, at one time Harvard's famous Math 55 used the Halmos text above along with W. Rudin, Principles of Mathematical Analysis (calculus done very carefully and a good foundation for more), and Spivak, Calculus on Manifolds, e.g., for people interested in more modern approaches to relativity theory (though Cartan's book is available in English now). It may be that you are not interested in relativity theory or the rest of mathematical physics -- fine, and that can help you set aside some topics.
Then Royden, Real Analysis, and the first half of Rudin's R&CA as above, along with any of several alternatives, cover measure theory and the beginnings of functional analysis. Measure theory does calculus again and in a more powerful way -- in freshman calculus you want to integrate a continuous function defined on a closed interval of finite length, but in measure theory you get much more generality.
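A standard worked example of that generality (in LaTeX notation): the indicator of the rationals on [0, 1] has no Riemann integral, since every upper sum is 1 and every lower sum is 0, but the rationals have Lebesgue measure zero, so the Lebesgue integral exists and is 0:

    f(x) =
    \begin{cases}
    1, & x \in \mathbb{Q} \cap [0,1] \\
    0, & \text{otherwise}
    \end{cases}
    \qquad\Longrightarrow\qquad
    \int_{[0,1]} f \, d\lambda = \lambda(\mathbb{Q} \cap [0,1]) = 0 .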
And measure theory also provides the axiomatic foundation for modern probability theory and for random variables. Seeing that definition of a random variable is a real eye-opener, for me a life-changing event: you get a level of understanding of randomness that cuts out and tosses into the dumpster or bit bucket nearly all the elementary and popular (horribly confused) treatments of randomness.
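For reference, that definition (the standard Kolmogorov setup, in LaTeX notation): a random variable is nothing but a measurable function on a probability space,

    X : (\Omega, \mathcal{F}, P) \to \mathbb{R},
    \qquad
    X^{-1}(B) \in \mathcal{F} \ \text{for every Borel set } B \subseteq \mathbb{R},

and its distribution is just the pushforward measure B |-> P(X^{-1}(B)). No mysterious "randomness" anywhere, which is exactly what cleans up the confused popular treatments.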
Functional analysis? Well, in linear algebra you get comfortable with vector spaces. So, for a positive integer n and the set of real numbers R, you get happy in the n-dimensional vector space R^n. But also be sure to see the axioms of a vector space, where R^n is just the leading example. You want the axioms right away for, say, the (affine) subspace of R^n that is the set of all solutions of a system of linear equations. How 'bout that!
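A small sketch of that solution-set example (Python with numpy; the 2x3 system is made up): every solution of Ax = b is one particular solution plus something from the null space of A -- a translated subspace, which is why you want the axioms and not just R^n.

    import numpy as np

    # An underdetermined system: 2 equations, 3 unknowns.
    A = np.array([[1.0, 2.0, 3.0],
                  [0.0, 1.0, 4.0]])
    b = np.array([7.0, 8.0])

    # One particular solution (the minimum-norm one) via least squares.
    x_p, *_ = np.linalg.lstsq(A, b, rcond=None)

    # A basis for the null space of A from the SVD: the rows of Vt
    # past the rank span {x : Ax = 0}.
    _, s, Vt = np.linalg.svd(A)
    rank = np.sum(s > 1e-10)
    null_basis = Vt[rank:]

    # Any particular solution plus any null-space vector solves Ax = b.
    t = 2.5   # arbitrary coefficient
    x = x_p + t * null_basis[0]
    print(np.allclose(A @ x, b))   # True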
Then in functional analysis, you work with functions, where each function is regarded as a point in a vector space. The nicest such vector space is Hilbert space, which has an inner product (essentially the same thing as angle, or in probability covariance, or in statistics correlation) and gives a metric in which the space is complete -- that is, as in the real numbers but not in the rationals, a sequence that appears to converge really has something to converge to. Then, wonder of wonders, the set of all real-valued random variables X such that the expectation (measure theory integral) E[X^2] is finite is a Hilbert space -- the triangle inequality is just the Minkowski inequality, and the completeness is the Riesz-Fischer theorem. Amazing, but true.
Then in Hilbert space you get to see how to approximate one function by others. So, in particular, you get to see how to approximate a random variable you don't have by ones you do have -- you might call that statistical estimation, and you would be correct.
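A tiny sketch of that idea (an illustrative Python simulation): the best approximation of Y by a linear function of X, in the sense of minimizing E[(Y - aX - b)^2], is a projection, with a = Cov(X,Y)/Var(X).

    import numpy as np

    rng = np.random.default_rng(3)

    # Y is a noisy function of X; we "have" X and want to approximate Y.
    x = rng.normal(0, 1, 200_000)
    y = 2.0 * x + rng.normal(0, 1, 200_000)

    # Projection coefficients in the L^2 inner product <U, V> = E[U V]:
    c = np.cov(x, y)
    a = c[0, 1] / c[0, 0]
    b = y.mean() - a * x.mean()

    resid = y - (a * x + b)
    print(a, b)                      # ~2.0 and ~0.0
    print(np.cov(resid, x)[0, 1])    # ~0: the error is orthogonal to x

The residual being orthogonal (uncorrelated) to x is exactly the Hilbert-space projection picture.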
Then one can drag out the Hahn-Banach result and do projections, that is, least squares, that is, in an important sense (from a classic random-variable convergence result you should be sure to learn), best possible linear approximations. And maybe such an approximation is the ad targeting that makes you the most money.
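A minimal least-squares-as-projection sketch (Python with numpy; the design matrix here is made up):

    import numpy as np

    rng = np.random.default_rng(4)

    # n observations of p explanatory variables, plus a column of ones.
    n, p = 500, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ beta_true + rng.normal(0, 1, n)

    # Least squares = orthogonal projection of y onto the column space of X.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat

    # The defining property of a projection: the residual is orthogonal
    # to every column of X.
    print(np.abs(X.T @ (y - y_hat)).max())   # ~0
    print(beta_hat)                           # near beta_true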
So, that projection is a baby version of regression analysis. There's a problem here: The usual treatments of regression analysis make a long list of assumptions that look essentially impossible to verify or satisfy in practice and, thus, leave one with what look like unjustified applications.
Nope: Just do the derivations yourself with fewer assumptions; you get fewer results, but still often enough for practice. And they are still solid results.
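For instance, the core derivation needs no distributional assumptions at all (in LaTeX notation, with design matrix X and data y as in any regression text): minimize the squared error, set the gradient to zero, and get the normal equations,

    \hat{\beta} = \arg\min_b \, \lVert y - X b \rVert^2
    \quad\Longrightarrow\quad
    X^\top X \hat{\beta} = X^\top y ,

which is pure linear algebra, i.e., the projection above; the long list of assumptions (Gaussian errors, independence, constant variance) enters only when one wants sampling distributions for beta-hat.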
The usual text derivations, by assuming so much, get much more, especially lots of confidence intervals. In practice you can often use those confidence-interval results as first-cut, rough measures of goodness of fit or some such.
But the idea of just a projection can give you a lot. In particular, there is an easy, sweetheart way around the onerous, hideous, hated overfitting -- it seems silly that having too much data hurts, and it shouldn't hurt and doesn't have to!
And the now-popular practice in machine learning of just fitting with training data and then verifying with test data, with some more considerations that are also appropriate, can also be solid with even fewer assumptions.
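A bare-bones version of that fit-then-verify practice (an illustrative Python sketch with scikit-learn; the data are made up):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(5)

    # Made-up data: 1000 cases, 10 features, linear signal plus noise.
    X = rng.normal(size=(1000, 10))
    y = X @ rng.normal(size=10) + rng.normal(0, 1, 1000)

    # Fit on one split, verify on data the fit never saw.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)

    # If test performance is close to training performance, the fit
    # is not just memorizing the training data.
    print(model.score(X_tr, y_tr))   # R^2 on training data
    print(model.score(X_te, y_te))   # R^2 on held-out data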
Go for it!