HowTo: Propensity Score Matching
I give you this not because it is of great use or importance (or, to be fair, even interest), but because I’m looking into it for purposes of my own. And misery loves company. Suckers. So this is not a ‘how to’. I’m not even the first person to write about it on a blog. Consider it more of a ‘what is’. Without the maths, even.
I’ve mentioned the evaluation problem previously. I’ll illustrate with an example upon which I’ve worked previously: hysterectomy (NB: I’m not one of those authors). Simplifying that problem a bit, suppose there are two types of hysterectomy: Abdominal, and Vaginal (fellows: a hysterectomy is the removal of the uterus). The evaluation problem is this. You cannot give the same woman both types. Once you’ve given her an Abdominal hysterectomy, that’s it (fellows: women only have one uterus. What?). She cannot have a Vaginal one. There is no perfect counter-factual information with which to compare the factual.
Ergo you cannot compare the effectiveness of two types of hysterectomy on exactly the same person. This is also known as the effect of treatment on the untreated. This is where clinical trials come in. A Randomised Controlled Trial is one where all participants have been randomly allocated treatment or non-treatment (or Treatment A, Treatment B, etc. You get the picture). The idea is that the pool of treated patients is exactly the same as the pool of untreated patients. More importantly the probability of being treated should be exactly the same whatever personal characteristics you might have (and vice versa: the probability that you’re a male, for example, given you were treated, should be exactly the same as the probability that you’re a male given you weren’t treated – assuming treatment was not a sex-change operation. That really was beneath me).
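For the sceptical, here is a minimal sketch of what random allocation buys you. Everything in it is made up for illustration: the population size, the 50/50 sex split and the coin-flip allocation are assumptions, not data from any real trial.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical trial population: each person is male with probability 0.5.
people = [{"male": random.random() < 0.5} for _ in range(10_000)]

# Random allocation: a coin flip that ignores every personal characteristic.
for person in people:
    person["treated"] = random.random() < 0.5

def share_male(group):
    """Fraction of a group that is male."""
    return sum(p["male"] for p in group) / len(group)

treated = [p for p in people if p["treated"]]
untreated = [p for p in people if not p["treated"]]

print(f"share male, treated:   {share_male(treated):.3f}")
print(f"share male, untreated: {share_male(untreated):.3f}")
```

The two shares should land close to one another, because the coin flip never looked at sex when deciding who got treated.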
Now it gets interesting. This is all well and good for clinical trials, but something else exists – something called a natural experiment. A natural experiment for our purposes is one without a properly-constructed control group, and usually arises with retrospective data analysis. If you wanted to examine something pertaining to an outbreak of Ebola Zaire, you can hardly randomise villages in central Africa, giving some of them Ebola and others not. If it so happens that you do have delusions of Dr. Mengele, kindly keep them to yourself. Nor, if any outbreak happens to occur, can you easily find a control group – no other region will be exactly alike in terms of culture, climate, etc. and there’s no guarantee that if you use the same region at an earlier period you will be successful (you could have missed the famine or war that led to young men eating a gorilla that was found dead rather than shot, or however the outbreak started).
An example I’ve also used in class is so-called 9/11 (no, I’m not disputing it happened, Americans just have a thoroughly backwards approach to dates. The American 9/11 was actually our 11/9 and it is the Americans who are wrong. I’ve also been to ‘ground zero’ and discovered a truly disgusting entrepreneurial spirit at work. Perhaps more on that another day, but I doubt it. It makes me want to throw up). It is a perfect natural experiment for emergency services responsiveness – but not one that can be carried out under RCT conditions, for obvious reasons.
Now, non-experimental data is likely to contain biases of one kind or another. In health care, for example, we cannot simply look at the health care demand of insured vs. uninsured people – the very fact that an insured person is insured means they will/may
(i) use more health care because they face only a co-pay, or some other deductible, rather than the full price, and/or
(ii) use more health care because they’re sicker, which is why they bought the bloody health insurance in the first place, or
(iii) use more health care because they are more educated and make more money, hence can afford health insurance and understand the benefits of investments in their health stock (this last category consists entirely of Michael Grossman).
This is known as selection bias. And it can be passive or active, if you like. If you conduct a survey in only, say, the north of Italy, you get selection bias, because the north is healthier, wealthier, more educated and more industrialised than the south. If you conduct a survey on Fox News, you get self-selection bias, because only viewers of Fox News will respond. So you can’t compare teaching in public/private schools using test scores, because there will be systematic (non-random) differences. You can’t compare an outbreak of Ebola in Central Africa with no outbreak in West Africa. You can’t compare emergency response to a terrorist attack in New York to one in London or Madrid for the same reason.
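To see how badly the insured-vs-uninsured comparison can go wrong, here is a rough simulation of case (ii), where sicker people are the ones who buy insurance. Every number in it is invented: the `TRUE_EFFECT` of one extra visit, the sickness scale and the noise are all assumptions chosen only to make the point visible.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

TRUE_EFFECT = 1.0  # assumed extra visits actually caused by being insured

people = []
for _ in range(2_000):
    sickness = random.random()            # 0 = healthy, 1 = very sick
    insured = random.random() < sickness  # sicker people insure more often
    visits = 10 * sickness + TRUE_EFFECT * insured + random.gauss(0, 1)
    people.append((insured, visits))

def mean(values):
    return sum(values) / len(values)

insured_visits = [v for ins, v in people if ins]
uninsured_visits = [v for ins, v in people if not ins]

# The naive comparison mixes the insurance effect with the sickness gap.
naive_gap = mean(insured_visits) - mean(uninsured_visits)
print(f"true effect of insurance:       {TRUE_EFFECT:.2f}")
print(f"naive insured-vs-uninsured gap: {naive_gap:.2f}")
```

The naive gap comes out well above the effect that was built in, because it also counts the extra visits the insured would have made anyway for being sicker.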
There is also simultaneity bias, or reverse causation. That is to wit (I’m old-fashioned), consider this: does greater litigiousness generate more litigation lawyers, or do more litigation lawyers generate more litigation? Some of both, no doubt, but a regression model needs to be built to pick up causation in the right direction.
So, Propensity Score Matching. This is an econometric technique for use with non-experimental data. It is designed to overcome the bias that you will face if and when you decide to compare the effects of treatment on the treated with those of non-treatment on the non-treated, if you get your data from outside an RCT setting: that is, if your treated and control groups are from different areas, different datasets, different time periods, and so forth. Propensity Score Matching is also not the only approach. Follow the link for non-experimental data, and you’ll find a handful more. The elsblog discusses some, as well.
The seminal paper for this stuff, by the by, is Propensity Score-Matching Methods For Nonexperimental Causal Studies by Rajeev Dehejia and Sadek Wahba (he’s from Morgan Stanley – no webpage).
So the trick is matching. Using covariates (Age, Gender, Income, Education, Marital Status, Hair Colour – anything of economic and/or statistical importance to the outcome of interest), we can theoretically break our sample of treated people into groups, or bins. All single males 28 years old, college-educated, non-smoking, living in urban environments, and so forth can then be compared, in terms of treatment, only with other single males 28 years old, college-educated, non-smoking, living in urban environments, and so forth. Repeat. However, see the problem? Every bin needs a corresponding bin in the non-treated group (this could be another sample). If there isn’t one, you have to exclude all of those people. If there is, but the bin is not sufficiently populated, so too out it goes.
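A toy version of the bin approach, to make the problem concrete. The records, the four covariates and the `MIN_PER_BIN` threshold are all hypothetical; a real application would have many more of each.

```python
from collections import defaultdict

# Hypothetical records: (sex, age, education, smoker)
treated = [
    ("M", 28, "college", False),
    ("M", 28, "college", False),
    ("F", 35, "school", True),
    ("F", 52, "college", False),
]
untreated = [
    ("M", 28, "college", False),
    ("F", 35, "school", True),
    ("F", 35, "school", True),
    ("M", 61, "school", True),
]

def to_bins(records):
    """Group records by their full covariate tuple."""
    bins = defaultdict(list)
    for record in records:
        bins[record].append(record)
    return bins

treated_bins = to_bins(treated)
untreated_bins = to_bins(untreated)

MIN_PER_BIN = 1  # assumed threshold for "sufficiently populated"

# A treated bin is usable only if the matching untreated bin is big enough.
usable = [
    key for key in treated_bins
    if len(untreated_bins.get(key, [])) >= MIN_PER_BIN
]
dropped = [key for key in treated_bins if key not in usable]

print("usable bins:", usable)
print("treated people discarded:", sum(len(treated_bins[k]) for k in dropped))
```

With only four covariates one treated person already falls out for want of a counterpart. Add more covariates and the bins multiply while the people in each one thin out.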
The trick, the goal, the purpose of Propensity Score Matching is overcoming the problem of dimensionality. Comparing treated individuals to non-treated with non-experimental data is an enterprise entirely victim to the covariates at hand. Suppose you have 15 econometrically relevant explanatory variables. To employ them all is to render the econometric problem practically insurmountable, but to use only, say, 5 of the variables will render the estimation and explanation of the response variable practically useless. Propensity score matching does this: it is a matching method that, instead of using every Xi, uses p(Xi), where p(Xi) is the probability of having been treated, given the covariates. The probability p(Xi) is the propensity score.
Bingo! Our dimensionality problem is gone. Instead of n covariates, we have a single score.
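One common way to get that single score is a logistic regression of treatment on the covariates, though nothing above commits to that choice, so treat it as an assumption. The sketch below fits one by plain gradient ascent in pure Python; the ten rows of data, the two covariates, the step size and the iteration count are all invented for illustration.

```python
import math

# Hypothetical data: (age, years_of_education, treated)
rows = [
    (25, 10, 0), (30, 12, 0), (35, 16, 0), (40, 12, 0), (45, 16, 1),
    (50, 12, 1), (55, 16, 1), (60, 18, 1), (32, 16, 1), (58, 10, 0),
]

def standardise(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]

ages = standardise([r[0] for r in rows])
educ = standardise([r[1] for r in rows])
X = [(1.0, a, e) for a, e in zip(ages, educ)]  # leading 1.0 is the intercept
t = [r[2] for r in rows]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain gradient ascent on the logistic log-likelihood.
weights = [0.0, 0.0, 0.0]
for _ in range(2_000):
    gradient = [0.0, 0.0, 0.0]
    for x, ti in zip(X, t):
        error = ti - sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for j in range(3):
            gradient[j] += error * x[j]
    weights = [w + 0.1 * g / len(X) for w, g in zip(weights, gradient)]

# One number per person: the estimated probability of having been treated.
scores = [sigmoid(sum(w * xi for w, xi in zip(weights, x))) for x in X]
for (age, edu, ti), p in zip(rows, scores):
    print(f"age={age} educ={edu} treated={ti} propensity={p:.2f}")
```

In practice you would reach for a statistics package rather than hand-rolling the fit. The point is only that two covariates (or fifteen) collapse into one column of scores.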
Then, each treated individual (assuming for the sake of argument that it is an individual with whose treatment we are dealing) is compared to non-treated individuals according to their propensity score. There are a few approaches to this, too. First, the treated individuals have to be ranked. They can be ranked in ascending order, descending order (this is with respect to their propensity score) or randomly. The ranking determines the order in which they are matched.
Once ranked, they are matched. Individuals can be matched with replacement or without replacement. Without replacement, once a non-treated individual has been matched, or ‘paired’ with a treated individual, they are removed from the pool. This can be a problem if you don’t have loads of non-treated individuals, as each subsequent match may involve greater distances (also in terms of the propensity score). Moreover it may not make sense. If you do use RCT data, comparing means (or using regression), you still compare with replacement, effectively. It should be case-by-case, but for me the arguments for matching with replacement are convincing enough.
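A small sketch of how ordering and replacement interact. The propensity scores are made-up numbers, and the greedy one-to-one `match` function is my own illustration rather than the procedure from any particular paper.

```python
import random

# Hypothetical propensity scores, keyed by person id.
treated = {"T1": 0.30, "T2": 0.33, "T3": 0.80}
untreated = {"U1": 0.32, "U2": 0.50, "U3": 0.75}

def match(treated, untreated, order="ascending", replacement=True):
    """Pair each treated id with its nearest untreated id by propensity score."""
    ids = list(treated)
    if order == "random":
        random.shuffle(ids)
    else:
        ids.sort(key=treated.get, reverse=(order == "descending"))
    pool = dict(untreated)  # copy, so the caller's pool is left alone
    pairs = {}
    for tid in ids:
        if not pool:
            break  # ran out of untreated people to match against
        nearest = min(pool, key=lambda uid: abs(pool[uid] - treated[tid]))
        pairs[tid] = nearest
        if not replacement:
            del pool[nearest]  # a matched control leaves the pool
    return pairs

def total_distance(pairs):
    """Sum of propensity-score gaps across all matched pairs."""
    return sum(abs(treated[t] - untreated[u]) for t, u in pairs.items())

with_repl = match(treated, untreated, replacement=True)
without_asc = match(treated, untreated, order="ascending", replacement=False)
without_desc = match(treated, untreated, order="descending", replacement=False)

for label, pairs in [("with replacement", with_repl),
                     ("without, ascending", without_asc),
                     ("without, descending", without_desc)]:
    print(f"{label}: {pairs} (total distance {total_distance(pairs):.2f})")
```

With replacement, T1 and T2 both take U1 and the order is irrelevant. Without replacement, whoever is matched first gets U1 and the other is pushed out to U2, so the pairs depend on the ranking and the total distance goes up.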
Next: how many matches? If you match strictly one-to-one, you guarantee minimum bias, and minimum ‘distance’ between matches. But if you use more matches, you should get more precise estimates, albeit at the risk of greater bias (as with the with/without replacement choice: as this propensity score ‘distance’ increases, so does the likelihood that you are comparing a treated individual with a systematically different non-treated individual). There are a couple of algorithms for this. One is the nearest-neighbour method, which automatically takes the m nearest non-treated propensity scores (but you pick the m – although that too can be optimised), and another is the ‘caliper’ method, which picks however many non-treated individuals are within a pre-specified ‘distance’ (that too could be optimised). This is also a case-by-case concern. There is no set rule for applying the Propensity Score Matching method to non-experimental data.
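The two selection rules side by side, as a minimal sketch. The scores, the choices of m and the caliper width of 0.05 are all invented for the example.

```python
# Hypothetical propensity scores for one treated person and a pool of untreated.
treated_score = 0.40
untreated = {"U1": 0.38, "U2": 0.43, "U3": 0.47, "U4": 0.60, "U5": 0.10}

def nearest_neighbours(score, pool, m):
    """The m untreated ids whose scores are closest to `score`."""
    ranked = sorted(pool, key=lambda uid: abs(pool[uid] - score))
    return ranked[:m]

def within_caliper(score, pool, caliper):
    """Every untreated id whose score lies within `caliper` of `score`."""
    return [uid for uid in pool if abs(pool[uid] - score) <= caliper]

print(nearest_neighbours(treated_score, untreated, m=1))      # ['U1']
print(nearest_neighbours(treated_score, untreated, m=3))      # ['U1', 'U2', 'U3']
print(within_caliper(treated_score, untreated, caliper=0.05)) # ['U1', 'U2']
print(within_caliper(0.85, untreated, caliper=0.05))          # []
```

Nearest-neighbour always hands back m matches, however far away the last one is. The caliper refuses to match beyond its distance, which protects against bad matches but can leave a treated individual with none at all.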
The Dehejia and Wahba (2002) paper uses data from a preceding paper by Robert LaLonde, comparing training programmes. This gave them the advantage of having on hand experimental data with which to compare results from a constructed non-experimental problem. Their results were pretty good, and Propensity Score Matching has entered the methodological world.
Why of interest to me? I intend, along with a colleague, to apply some Bayesian value-of-information standards to the method, to look at some of the preliminary testing that goes on to assess the suitability of comparison groups, as well as to evaluate the likelihood that the ultimately-estimated Treatment Effect is correct for a given individual. Look out for future posts containing discussion of value-of-information analysis. Now that stuff is fun.