Establishing causality from data is difficult. Originally, the only way people had of getting at causality was to conflate it with correlation. This approach was crystallized in the OLS (Ordinary Least Squares) regression model, which astronomers used to predict the movements of celestial bodies. However, while correlation may have worked to some extent for astronomers, it is ineffective for determining causality in most fields. The biggest reason is that people who choose something may already have been more likely to have a given outcome than the broader population, even before they made that choice. Later, randomization was introduced as a major milestone towards causal inference. Understanding why randomization works, and why it matters, requires the main model for reasoning about causality: potential outcomes.
Potential outcomes is a model that compares the outcome that actually happened with a hypothetical outcome that did not. For example, to determine the effect of heart surgery on a given person, one must compare that person's outcome after having heart surgery to the outcome they would have had without it. In concrete terms, one may define the observed outcome as Yᵢ, and the outcomes that would occur with and without a given action, or treatment, as Yᵢ¹ and Yᵢ⁰ respectively. Lastly, if one defines whether or not the treatment was applied as a binary Dᵢ (1 if treated, 0 if not), one can write an equation that expresses the observed outcome in terms of potential outcomes:
Yᵢ = DᵢYᵢ¹ + (1-Dᵢ)Yᵢ⁰
To find the causal effect of the treatment on an individual, one simply needs to subtract Yᵢ⁰ from Yᵢ¹:
δᵢ = Yᵢ¹ - Yᵢ⁰
While this equation does provide a simple way to understand causality, it also makes clear why causality is so difficult to determine. We only observe the actual outcome and whether or not the treatment was applied. In terms of the equation, we know Yᵢ and Dᵢ for certain. For any individual, however, we observe Yᵢ¹ if they were treated or Yᵢ⁰ if they were not, but never both. As such, simply understanding the equation is insufficient for determining causality, and it is impossible to know the causal effect for an individual.
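The point can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration: the five simulated people, the distributions their potential outcomes are drawn from, and the helper name `observed_outcome` are all hypothetical. Only inside a simulation can both potential outcomes be seen at once; in real data the column marked "unobserved" is simply missing.

```python
import random

random.seed(0)

# Hypothetical population: each person has both potential outcomes,
# which only a simulation can see at the same time.
people = []
for _ in range(5):
    y0 = random.gauss(50, 10)      # outcome without treatment (Y0)
    y1 = y0 + random.gauss(5, 2)   # outcome with treatment (Y1)
    d = random.randint(0, 1)       # treatment indicator (D)
    people.append((y1, y0, d))


def observed_outcome(y1, y0, d):
    """Observed outcome from the equation Y = D*Y1 + (1 - D)*Y0."""
    return d * y1 + (1 - d) * y0


for y1, y0, d in people:
    y = observed_outcome(y1, y0, d)
    effect = y1 - y0                      # individual effect, never observable in real data
    missing = "Y0" if d == 1 else "Y1"    # the potential outcome we never get to see
    print(f"D={d}  Y={y:6.2f}  true effect={effect:5.2f}  unobserved: {missing}")
```

For a treated person the observed outcome equals Y1 and Y0 is hidden; for an untreated person it is the other way around, so the individual effect can never be computed from observed data alone.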
By moving from individuals to populations and population means, one can attempt to estimate causal effects. Rather than the effect for a specific individual, one may instead consider the average effect across the population, written as an expectation:
E[δᵢ] = E[Yᵢ¹] - E[Yᵢ⁰]
This equation, by itself, appears sufficient to estimate the Average Treatment Effect, or ATE. However, it is not enough. As with the equation for the effect on individuals, we do not observe every Yᵢ¹ and Yᵢ⁰. We only know E[Yᵢ¹|Dᵢ=1] and E[Yᵢ⁰|Dᵢ=0], the average outcome of those who were treated and the average outcome of those who were not. With some creative algebra, though, one can write the difference between these two observable averages as the ATE plus two additional terms:
E[Yᵢ¹|Dᵢ=1] - E[Yᵢ⁰|Dᵢ=0] = ATE + E[Yᵢ⁰|Dᵢ=1] - E[Yᵢ⁰|Dᵢ=0] + (1−π)(ATT−ATU)
In this new equation, π is the share of participants who were treated, ATT is the average treatment effect on those who were treated, and ATU is the average treatment effect on those who were not treated. At first glance, this new equation appears to be insufficient to determine the ATE; after all, there are two other terms on the right-hand side. However, once one understands what those terms mean, it becomes clear that it is possible to minimize or outright eliminate them.
The first part, E[Yᵢ⁰|Dᵢ=1] - E[Yᵢ⁰|Dᵢ=0], can be understood as selection bias. In other words, it is the difference between the treated and untreated groups that would have been present even if no one had been treated. To use the heart surgery example, doctors only give heart surgery to people who need it. The outcome of a person who needs heart surgery and doesn’t get it is probably very different from that of someone who doesn’t need it.
The second bias term in the equation is (1−π)(ATT−ATU). This can be interpreted as heterogeneous treatment effect bias. In other words, the effect of being treated may differ between the treated and untreated groups. Continuing with heart surgery, if someone needs heart surgery and gets it, the outcome will be better than the outcome had they not gotten it. On the other hand, if someone does not need heart surgery, getting it will probably be harmful.
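The decomposition can be checked numerically. The sketch below is a simulation under assumed numbers, not real medical data: the share of sick people, the outcome scores, the size of the treatment effects, and the probabilities of receiving surgery are all hypothetical. It builds a population where sick people benefit from surgery and are more likely to receive it, then computes each term of the equation and confirms that the right-hand side adds up to the simple difference in group means.

```python
import random
from statistics import mean

random.seed(42)

# Hypothetical population (all numbers are illustrative assumptions):
# sick people benefit from surgery and are far more likely to receive it,
# while healthy people are harmed by it and rarely receive it.
rows = []
for _ in range(10_000):
    sick = random.random() < 0.3
    y0 = random.gauss(40 if sick else 80, 5)   # outcome without surgery
    y1 = y0 + (20 if sick else -5)             # outcome with surgery
    d = 1 if random.random() < (0.8 if sick else 0.1) else 0
    rows.append((y1, y0, d))

treated = [(y1, y0) for y1, y0, d in rows if d == 1]
untreated = [(y1, y0) for y1, y0, d in rows if d == 0]
pi = len(treated) / len(rows)   # share of the population that was treated

# Observable in real data: E[Y1 | D=1] - E[Y0 | D=0]
sdo = mean(y1 for y1, _ in treated) - mean(y0 for _, y0 in untreated)

# Visible only because this is a simulation with both potential outcomes
ate = mean(y1 - y0 for y1, y0, _ in rows)
att = mean(y1 - y0 for y1, y0 in treated)
atu = mean(y1 - y0 for y1, y0 in untreated)
selection_bias = mean(y0 for _, y0 in treated) - mean(y0 for _, y0 in untreated)
hte_bias = (1 - pi) * (att - atu)

print(f"simple difference in means: {sdo:7.2f}")
print(f"ATE:                        {ate:7.2f}")
print(f"selection bias:             {selection_bias:7.2f}")
print(f"heterogeneity bias:         {hte_bias:7.2f}")
print(f"ATE + both bias terms:      {ate + selection_bias + hte_bias:7.2f}")
```

In this setup the treated group starts out sicker, so the selection bias is negative, while the treated group also gains more from surgery, so the heterogeneous treatment effect bias is positive. The simple difference in means mixes both with the ATE and can even have the opposite sign from the true average effect.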
Fortunately, both selection bias and heterogeneous treatment effect bias have the same solution. Essentially, one must ensure that whether a person is treated is independent of their potential outcomes, that is, of both their prior state and how they would respond to the treatment. The easiest method of doing this, as hinted at before, is randomization. When who gets treated is random, the treated and untreated groups have the same expected Yᵢ⁰ and the same expected treatment effect, so both bias terms are zero on average and the simple difference in group means estimates the ATE.
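A small simulation sketch can show the effect of randomization. As before, every number here is an illustrative assumption, and the function names `simulate`, `doctor_chooses`, and `coin_flip` are hypothetical. The same kind of population is generated twice: once with treatment depending on how sick a person is, and once with treatment decided by a fair coin flip.

```python
import random
from statistics import mean


def simulate(assign, n=20_000, seed=7):
    """Return (simple difference, ATE, selection bias, heterogeneity bias)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sick = rng.random() < 0.3
        y0 = rng.gauss(40 if sick else 80, 5)   # outcome without treatment
        y1 = y0 + (20 if sick else -5)          # outcome with treatment
        rows.append((y1, y0, assign(sick, rng)))
    treated = [(y1, y0) for y1, y0, d in rows if d == 1]
    untreated = [(y1, y0) for y1, y0, d in rows if d == 0]
    pi = len(treated) / n
    sdo = mean(y1 for y1, _ in treated) - mean(y0 for _, y0 in untreated)
    ate = mean(y1 - y0 for y1, y0, _ in rows)
    att = mean(y1 - y0 for y1, y0 in treated)
    atu = mean(y1 - y0 for y1, y0 in untreated)
    selection = mean(y0 for _, y0 in treated) - mean(y0 for _, y0 in untreated)
    heterogeneity = (1 - pi) * (att - atu)
    return sdo, ate, selection, heterogeneity


def doctor_chooses(sick, rng):
    # Treatment depends on need: sick people are much more likely to be treated.
    return 1 if rng.random() < (0.8 if sick else 0.1) else 0


def coin_flip(sick, rng):
    # Randomized treatment: need plays no role in who is treated.
    return 1 if rng.random() < 0.5 else 0


for name, rule in [("doctor chooses", doctor_chooses), ("coin flip", coin_flip)]:
    sdo, ate, selection, heterogeneity = simulate(rule)
    print(f"{name:15s} difference={sdo:7.2f}  ATE={ate:6.2f}  "
          f"selection={selection:7.2f}  heterogeneity={heterogeneity:6.2f}")
```

Under the coin flip, both bias terms should come out close to zero and the simple difference in means should land close to the ATE, up to sampling noise. When treatment depends on need, the two can be far apart.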