For some areas of research, truly understanding causality is essentially impossible: well-controlled experiments are infeasible, and the list of possible colliders and confounders is unknowable.
The key problem is that any causal relation can be an illusion caused by some other, unobserved relation!
This means that in order to show fully valid causal effect estimates, we need to
- measure precisely
- measure all relevant variables
- actively NOT control for harmful variables (e.g. colliders, whose inclusion creates spurious correlations)
I heartily recommend The Book of Why [1] by Pearl and Mackenzie for deeper reading, as well as the "haunted DAG" example in McElreath's wonderful Statistical Rethinking.
Pearl's Causality is very high on my "re-read while making flashcards" list. It is depressing how hard it is to establish causality, but also inspiring how causality can be teased out of observational statistics, provided one dares assume a model of which variables and correlations are meaningful.
"provided one dares assume ..." - that's a great quote which I'll steal in the future if you allow!
Most things we learn about DAGs and causality are frustrating, but simulating a DAG (e.g. with lavaan in R) is a technique that actually helps in understanding when and how those assumptions make sense. That's (to me) a key part of making causality productive.
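For a feel of what such a simulation looks like, here is a minimal sketch in Python rather than lavaan, with made-up coefficients: a chain X -> M -> Y, where regressing Y on X recovers the total effect, and additionally adjusting for the mediator M controls it away.

```python
# Minimal sketch (made-up coefficients): simulate the chain X -> M -> Y,
# then see what the regression recovers with and without the mediator M.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

x = rng.normal(size=n)
m = 2.0 * x + rng.normal(size=n)   # X -> M
y = 1.5 * m + rng.normal(size=n)   # M -> Y, so the total effect of X on Y is 3.0

def slope_of_x(*covariates):
    """OLS coefficient on x when regressing y on x plus the given covariates."""
    design = np.column_stack([x, *covariates, np.ones(n)])
    return np.linalg.lstsq(design, y, rcond=None)[0][0]

print(f"Y ~ X     : {slope_of_x():.2f}")   # ~3.0, the total causal effect
print(f"Y ~ X + M : {slope_of_x(m):.2f}")  # ~0.0, adjusting for M blocks the path
```

A few lines of simulation make the adjustment assumption concrete: "controlling for M" is not a neutral act here, it changes which effect you estimate.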
Even if you hit all the assumptions you need to make Pearl/Rubin causality work, and there is no unobserved factor to cause problems, there is still a philosophical problem.
It all assumes you can divide the world cleanly into variables that can be the nodes of your DAG. The philosopher Nancy Cartwright talks about this a lot, but it's also a practical problem.
And this is even before we get into the philosophical / epistemological questions about "cause."
You can make the argument, from correlative data, that bridges and train tracks cause truck accidents. And more importantly, if you act like they do when designing roadways, you actually will decrease truck accidents. But it's a common-sense-odd meaning of causality to claim a stationary object is acting upon a mobile object...
And even if you do know there's causality (e.g. the input variable X is part of software that produces some output Y), the exact nature of the causality can be too complex to analyze due to emergent and chaotic effects. It's seldom as simple as "an increase in X will result in an increase in Y".
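A toy illustration of that point (hypothetical, in Python): below, Y is entirely determined by X, yet the relationship is useless as a monotone summary.

```python
# Y is entirely caused by X (a deterministic function), yet
# "increase X => increase Y" fails badly. Iterating the logistic map
# in its chaotic regime (r = 3.9) scatters nearby inputs.
def y_of(x: float, r: float = 3.9, steps: int = 50) -> float:
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

for x0 in (0.100, 0.101, 0.102, 0.103):
    print(f"X = {x0:.3f} -> Y = {y_of(x0):.3f}")  # tiny input changes, jumpy outputs
```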
Can we nevertheless extract causality from correlation?
I would argue that, theoretically, we cannot. Practically speaking, however, we frequently settle for “very, very convincing correlations” as indicative of causation. A correlation may be persuasively described as causation if at least the following conditions are met:
Completeness: The association is perfect (R² = 1). When we observe X, we always observe Y.
No bias: The association between X and Y is not affected by a third, omitted variable, Z.
I feel like you have this backwards. In the assignment Y := 2X, each unit of Y is caused by half a unit of X. In the game where we flip a coin at fair odds, if you have increased your wealth by 8× in 3 tosses (2 × 2 × 2), that was caused by you getting heads every toss. Theoretically, establishing causality is trivial.
The problem comes when we try to do so practically, because reality is full of surprising detail.
> No bias: The association between X and Y is not affected by a third, omitted variable, Z.
This is, practically speaking, the difficult condition. I'm not so convinced the others are necessary (practically speaking, anyway) but you should read Pearl if you're into this!
You probably also need at least:
- Y does not appear when X does not
- We need an overwhelming sample size containing examples of both X and not X
- The experiment and data collection are trivially repeatable (so that we don't need to rely on trust)
- The experiment, data collection, and analysis must be easy to understand and sensible in every way, leaving no room for error
And as another commenter already pointed out: you can't really rule out the existence of an unknown Z.
In general you assume DAGs, i.e. acyclic causality. Cyclic relations must be resolved into distinct temporal steps, i.e. u_t0 causes v_t1 and v_t1 causes u_t2. When your measurement resolution only captures simultaneous effects of both u on v and v on u, you have a problem.
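A quick sketch of that failure mode (assumed linear dynamics, invented coefficients): at the true time scale each arrow is recoverable by regressing t+1 values on t values, but measured only every 20th step, all that survives is an undirected correlation between u and v.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200_000
u, v = np.zeros(T), np.zeros(T)
e, f = rng.normal(size=T), rng.normal(size=T)
for t in range(T - 1):
    u[t + 1] = 0.5 * u[t] + 0.4 * v[t] + e[t]  # v_t causes u_{t+1} (strongly)
    v[t + 1] = 0.5 * v[t] + 0.1 * u[t] + f[t]  # u_t causes v_{t+1} (weakly)

def cross_coeff(a, b):
    """Coefficient of b[t] when regressing a[t+1] on (a[t], b[t])."""
    design = np.column_stack([a[:-1], b[:-1]])
    return np.linalg.lstsq(design, a[1:], rcond=None)[0][1]

print(f"v -> u, full resolution: {cross_coeff(u, v):.2f}")  # ~0.4
print(f"u -> v, full resolution: {cross_coeff(v, u):.2f}")  # ~0.1

u20, v20 = u[::20], v[::20]  # coarse measurement: every 20th step
print(f"v -> u, subsampled: {cross_coeff(u20, v20):.2f}")   # ~0: arrow washed out
print(f"u -> v, subsampled: {cross_coeff(v20, u20):.2f}")   # ~0: arrow washed out
print(f"corr(u, v), subsampled: {np.corrcoef(u20, v20)[0, 1]:.2f}")  # still ~0.3
```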
Not everyone knows that colliders and confounders have precise technical definitions:
------------------
Confounders
------------------
A variable that affects both the exposure and the outcome. It is a common cause of both variables.
Role: Confounders can create a spurious association between the exposure and outcome if not properly controlled for. They are typically addressed by controlling for them in statistical models, such as regression analysis, to reduce bias and estimate the true causal effect.
Example: Age is a common confounder in many studies because it can affect both the exposure (e.g., smoking) and the outcome (e.g., lung cancer).
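A minimal simulation of this example (all coefficients invented for illustration): the unadjusted smoking coefficient is inflated by the backdoor path through age, and controlling for the confounder recovers the assumed effect.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

age = rng.normal(size=n)                                 # the confounder
smoking = 0.7 * age + rng.normal(size=n)                 # age -> exposure
cancer = 0.3 * smoking + 0.9 * age + rng.normal(size=n)  # assumed true effect: 0.3

def slope(covariates, target):
    design = np.column_stack(covariates + [np.ones(n)])
    return np.linalg.lstsq(design, target, rcond=None)[0][0]

print(f"unadjusted:   {slope([smoking], cancer):.2f}")       # ~0.72, biased upward
print(f"age-adjusted: {slope([smoking, age], cancer):.2f}")  # ~0.30, backdoor closed
```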
------------------
Colliders
------------------
A variable that is causally influenced by two or more other variables. In graphical models, it is represented as a node where the arrowheads from these variables "collide."
Role: Colliders do not inherently create an association between the variables that influence them. However, conditioning on a collider (e.g., through stratification or regression) can introduce a non-causal association between these variables, leading to collider bias.
Example: If both smoking and lung cancer affect quality of life, quality of life is a collider. Conditioning on quality of life could create a biased association between smoking and lung cancer.
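A minimal simulation of that example. For clarity, smoking and cancer risk are generated as independent here, so any within-stratum association is pure collider bias.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

smoking = rng.normal(size=n)
cancer_risk = rng.normal(size=n)                   # independent of smoking
qol = -smoking - cancer_risk + rng.normal(size=n)  # the collider: both arrows point in

print(f"overall corr:         {np.corrcoef(smoking, cancer_risk)[0, 1]:+.2f}")  # ~0

low_qol = qol < np.percentile(qol, 25)             # condition on the collider
print(f"within low-QoL group: {np.corrcoef(smoking[low_qol], cancer_risk[low_qol])[0, 1]:+.2f}")
# ~ -0.3: a spurious negative association appears inside the stratum
```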
------------------
Differences
------------------
Direction of Causality: Confounders cause both the exposure and the outcome, while colliders are caused by both the exposure and the outcome.
Statistical Handling: Confounders should be controlled for to reduce bias, whereas controlling for colliders can introduce bias.
Graphical Representation: In Directed Acyclic Graphs (DAGs), confounders have arrows pointing away from them to both the exposure and outcome, while colliders have arrows pointing towards them from both the exposure and outcome.
------------------
Managing
------------------
Directed Acyclic Graphs (DAGs): These are useful tools for identifying and distinguishing between confounders and colliders. They help in understanding the causal structure of the variables involved.
Statistical Methods: For confounders, methods like regression analysis are effective for controlling their effects. For colliders, avoiding conditioning on them is crucial to prevent collider bias.
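As a sketch of using DAGs programmatically: networkx (>= 2.8) can encode the two structures above and answer d-separation queries directly (note that in networkx >= 3.3 the call is nx.is_d_separator).

```python
import networkx as nx

confounding = nx.DiGraph([("Z", "X"), ("Z", "Y")])  # Z is a common cause of X and Y
collider = nx.DiGraph([("X", "C"), ("Y", "C")])     # C is a common effect of X and Y

print(nx.d_separated(confounding, {"X"}, {"Y"}, set()))  # False: backdoor path open
print(nx.d_separated(confounding, {"X"}, {"Y"}, {"Z"}))  # True: adjusting for Z closes it
print(nx.d_separated(collider, {"X"}, {"Y"}, set()))     # True: the causes are independent
print(nx.d_separated(collider, {"X"}, {"Y"}, {"C"}))     # False: conditioning opens the path
```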
Sure, but someone else once did this for me using AI, and I found it useful to scan in the moment. I appreciated it and upvoted it.
Like that experience, this was meant as a scannable introduction to the topic, not an exact reference. Happy to hear alternative views, or downvote to give herding-style feedback.
Had I done a short AI-generated summary, it would have been a bit less helpful, but there wouldn't have been downvotes.
Had I linked to the same AI explanation instead of posting it, there would have been fewer or no downvotes, because many wouldn't click, and some of those who did would have found it helpful.
Had I linked to something else, many would not have clicked and read without a summary - and both the summary and the linked content could have been AI-created.
I chose to move on and accept a few downvotes. The votes matter less to me than the helpfulness; votes don't tell you whether something actually helps. Many people accept confusion without seeking clarification, and appreciate a little help.
Although I personally do tend to downvote content-free unhelpful Reddit-style comments, I'm not overly fond of trying to massage things to help people manage their feelings when posts are only information, with no framing or opinion content. I understand that there is value in downvotes as herding-style feedback (as PG has pointed out). Yes, I've read the HN guidelines.
I think that, beyond herding-style feedback, the downvotes reflect that AI info has become a bit socially unacceptable - okay to talk about, but not to share. But I find AI particularly useful as an initial look at a domain, though not trustworthy as a detailed source. I appreciate the footnotes that Perplexity provides for this kind of usage, which let me begin checking the details for accuracy.
[1] https://en.wikipedia.org/wiki/The_Book_of_Why