Bill Easterly sent me a link to this post. I posted The Vortex of Vacuousness the other day, but I like this one better:
Maybe we should put rats in charge of foreign aid research, by William
Easterly: Laboratory experiments show that rats outperform humans in
interpreting data... The amazing
finding on rats is described in an equally amazing book by
Leonard Mlodinow. The experiment consists of drawing green and red balls at
random, with the probabilities rigged so that greens occur 75 percent of the
time. The subject is asked to watch for a while and then predict whether the
next ball will be green or red. The rats followed the optimal strategy of always
predicting green (I am a little unclear how the rats communicated, but never
mind). But the human subjects did not always predict green; they usually wanted to do better and also predict when red would come up, engaging in reasoning like
“after three straight greens, we are due for a red.” As Mlodinow says, “humans
usually try to guess the pattern, and in the process we allow ourselves to be
outperformed by a rat.”
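For the curious, here is a minimal simulation of that setup (my own sketch in Python, not the original experiment), comparing the rat's always-green strategy with the human habit of matching the 75/25 frequencies:

```python
# Quick simulation of the green/red experiment: the rat strategy
# ("maximizing": always guess the majority color) versus the human
# strategy ("probability matching": guess green 75% of the time and
# red 25% of the time).
import random

random.seed(0)
N = 100_000
draws = ["green" if random.random() < 0.75 else "red" for _ in range(N)]

# Rat strategy: always predict green.
rat_correct = sum(d == "green" for d in draws)

# Human-style strategy: guess green with the same 75% frequency.
human_correct = sum(
    d == ("green" if random.random() < 0.75 else "red") for d in draws
)

print(f"always-predict-green accuracy: {rat_correct / N:.3f}")    # about 0.75
print(f"probability-matching accuracy: {human_correct / N:.3f}")  # about 0.625
```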
Unfortunately, spurious patterns show up in some important real-world settings, like research on the effect of foreign aid on growth. Without going into unnecessary technical detail, this research looks for an association between economic
growth and some measure of foreign aid, controlling for other likely
determinants of economic growth. Of course, since there is some random variation
in both growth and aid, there is always the possibility that an association
appears by pure chance. The usual statistical procedures are designed to keep
this possibility small. The convention is that we believe a result if there is
only a 1 in 20 chance that the result arose at random. So if a researcher does a
study that finds a positive effect of aid on growth and it passes this “1 in 20”
test (referred to as a “statistically significant” result), we are fine, right?
Alas, not so
fast. A researcher is very eager to find a result, and such eagerness usually
involves running many statistical exercises (known as “regressions”). But the 1 in 20 safeguard applies only if you ran just ONE regression. What if you did 20
regressions? Even if there is no relationship between growth and aid whatsoever,
on average you will get one “significant result” out of 20 by design. Suppose
you only report the one significant result and don’t mention the other 19
unsuccessful attempts. You can do twenty different regressions by varying the
definition of aid, the time periods, and the control variables. In aid research,
the aid variable has been tried, among other ways, as aid per capita, logarithm
of aid per capita, aid/GDP, logarithm of aid/GDP, aid/GDP squared, [log(aid/GDP)
- aid loan repayments], aid/GDP*[average of indexes of budget deficit/GDP,
inflation, and free trade], aid/GDP squared *[average of indexes of budget
deficit/GDP, inflation, and free trade], aid/GDP*[quality of institutions], etc. Time periods have varied from averages over 24 years to 12 years to 8
years to 4 years. The list of possible control variables is endless. One of the
most exotic I ever saw was the probability that two individuals in a country
belonged to different ethnic groups TIMES the number of political assassinations
in that country. So it’s not so hard to run many different aid and growth
regressions and report only the one that is “significant.”
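To see the “1 in 20” arithmetic in action, here is a small Monte Carlo sketch (my own illustration with made-up noise data; real specification searches reuse the same data, so the 20 independently drawn regressions below are a simplification):

```python
# Monte Carlo sketch of the "1 in 20" problem. Growth is pure noise and
# none of the 20 candidate "aid" specifications has any true effect, yet
# on average about one of the 20 regressions comes out "significant" at
# the 5% level, and most researchers end up with something to report.
import numpy as np

rng = np.random.default_rng(0)
n_countries, n_specs, n_trials = 100, 20, 1000

hits_per_trial = []
for _ in range(n_trials):
    growth = rng.normal(size=n_countries)
    hits = 0
    for _ in range(n_specs):
        aid = rng.normal(size=n_countries)  # unrelated to growth by construction
        r = np.corrcoef(aid, growth)[0, 1]
        # t-statistic on the slope of a simple regression of growth on aid
        t = r * np.sqrt((n_countries - 2) / (1 - r**2))
        if abs(t) > 1.98:                   # ~5% two-sided critical value, 98 df
            hits += 1
    hits_per_trial.append(hits)

hits_per_trial = np.array(hits_per_trial)
print("average 'significant' results per 20 regressions:", hits_per_trial.mean())
print("share of trials with at least one to report:", (hits_per_trial > 0).mean())
```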
This practice
is known as “data mining.” It is NOT acceptable practice, but the prohibition is very hard to enforce since nobody is watching when a researcher runs multiple regressions.
It is seldom intentional dishonesty by the researcher. Because of our
non-rat-like propensity to see patterns everywhere, it is easy for researchers
to convince themselves that the failed exercises were just done incorrectly, and
that they have finally found the “real result” when they get the “significant” one.
Even more insidiously, the 20 regressions could be spread across 20 different researchers. Each of them obediently runs only one pre-specified regression; the 19 who get no significant results never publish a paper, while the 20th publishes the spuriously “significant” finding (this is known as “publication bias”).
But don’t
give up on all damned lies and statistics; there ARE ways to catch data mining.
A “significant result” that is really spurious will hold only in the original data sample, with the original time periods, and with the original specification. If new data become available as time passes, you can retest the result; if it was spurious “data mining,” it will vanish. You can also try
different time periods, or slightly different but equally plausible definitions
of aid and the control variables.
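Here is a rough sketch of the “new data” test on simulated data with no true aid-growth relationship; the “winning” specification is simply the one with the largest t-statistic in the original sample:

```python
# Sketch of the "new data" test. Twenty candidate specifications are
# mined in an original sample where there is no true aid-growth
# relationship; the most "significant" one is then re-estimated on a
# later sample. The mined result does not survive.
import numpy as np

rng = np.random.default_rng(1)
n, n_specs = 100, 20

def t_stat(x, y):
    """t-statistic on the slope of a simple regression of y on x."""
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt((len(x) - 2) / (1 - r**2))

# Original period: growth plus 20 candidate "aid" variables, all noise.
growth_old = rng.normal(size=n)
aid_old = rng.normal(size=(n_specs, n))
t_old = np.array([t_stat(a, growth_old) for a in aid_old])
best = int(np.argmax(np.abs(t_old)))  # the specification a data miner would report

# Later period: the same "winning" specification, fresh observations.
growth_new = rng.normal(size=n)
aid_new = rng.normal(size=(n_specs, n))

print(f"mined t-statistic, original data: {t_old[best]:.2f}")   # often beyond 1.98
print(f"same specification, new data:     {t_stat(aid_new[best], growth_new):.2f}")  # usually near zero
```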
So a few
years ago, some World Bank research found
that “aid works [i.e., it raises economic growth] in a good policy environment.” This study was published in a premier journal, received huge publicity, and eventually led
President George W. Bush (in his only known use of econometric research) to
create the Millennium Challenge Corporation, which he set up precisely to direct
aid to countries with “good policy environments.”
Unfortunately, this result later turned out to fail the data mining tests. Subsequent published studies found
that it failed the “new data” test, the different time periods test, and the
slightly different specifications test.
The original
result that “aid works in a good policy environment” was a spurious association.
Of course, the MCC is still operating; it may be good or bad for other reasons.
Moral of the
story: beware of these kinds of statistical “results” that are used to determine
aid policy! Unfortunately, the media and policy community don’t really get this,
and they take the original studies at face value (not only on aid and growth, but also in work on the determinants of civil war, fixing failed states, peacekeeping, democracy, and so on). At the very least, make sure the finding is
replicated by other researchers and passes the “data mining” tests. ...
I saw Milton Friedman provide an interesting example of avoiding data mining.
I was at an SF Fed conference where he was a speaker, and his talk was about a paper he
had written 20 years earlier on "The Plucking Model." From a post in January
2006,
New Support for Friedman's Plucking Model:
Friedman found evidence for the Plucking Model of
aggregate fluctuations in a 1993
paper in Economic Inquiry. One reason I've always liked this paper is that
Friedman first wrote it in 1964. He then waited for more than twenty years for
new data to arrive and retested his model using only the new data. In
macroeconomics, we often encounter a problem in testing theoretical models. We
know what the data look like and what facts need to be explained by our models.
Is it sensible to build a model to fit the data and then use that data to test
it to see if it fits? Of course the model will fit the data; it was built to do
so. Friedman avoided this problem since he had no way of knowing if the next
twenty years of data would fit the model or not. It did.
The other thing I'll note is that there is a literature on how test
statistics are affected by pretesting, but it is mostly ignored (e.g., if you run a regression and then throw out an insignificant variable, anything you do later must take account of the fact that you could have made a type I or type
II error during the pretesting phase). The bottom line is that the test statistics from the final version of the model are almost
always non-normal, and the distribution of the test statistics is not generally
known.
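A stylized simulation makes the point (a made-up two-regressor example, not any particular empirical model): pretest by dropping an insignificant variable, report the usual t-statistic from whichever model survives, and the procedure rejects a true null noticeably more often than the nominal 5 percent.

```python
# Stylized pretesting example. The true model is y = noise, unrelated to
# either regressor, and x1 and x2 are highly correlated. The "pretest"
# drops x2 when it looks insignificant and then reports the ordinary
# t-statistic on x1 from whichever model survives. That reported statistic
# rejects the true null (x1 doesn't matter) more often than the nominal 5%,
# i.e. it no longer follows the textbook t distribution.
import numpy as np

rng = np.random.default_rng(0)
n, rho, n_sims = 200, 0.9, 20_000

def ols_t(X, y):
    """OLS t-statistics for the regressors in the columns of X (no intercept)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta / se

rejections = 0
for _ in range(n_sims):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = rng.normal(size=n)                     # unrelated to x1 and x2

    t_full = ols_t(np.column_stack([x1, x2]), y)
    if abs(t_full[1]) < 1.96:                  # pretest: drop "insignificant" x2
        t1 = ols_t(x1[:, None], y)[0]
    else:
        t1 = t_full[0]
    rejections += abs(t1) > 1.96

print("rejection rate for the true null on x1:", rejections / n_sims)  # noticeably above 0.05
```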
[One more note. I wrote a paper on Friedman's Plucking Model, and had a
revise and resubmit at a pretty good journal. I satisfied all the referees'
objections, at least I thought I had, and it was all set to go. I had sent the
first version of the paper to Friedman, and he wrote back with a long,
multi-page letter that was very encouraging, and I incorporated his
suggestions into the revision (a reason I'll always have a soft spot for him: his time was valuable, yet he took the time to do this). But the final results weren't robust; they had come about through trying different specifications until one worked. The final specification worked well, very well in fact, but the results were pretty fragile. The paper had been completely redone and rewritten, but after thinking it over I decided it wasn't robust enough to publish, so I pulled it and did not resubmit. I find myself regretting that sometimes: the referees would
have probably taken the paper since the final version satisfied all their
objections, and it was a good journal - I told myself I had simply done what
everyone else does, and so on. But hard as it was for an assistant professor in
need of publications to pull a paper, especially one Friedman himself had endorsed - this was just before going up for tenure
so it could have mattered a lot - pulling the paper was the right thing to do. The only way to solve this problem - and data mining in economics is a problem - is for the people involved in the research to self-police the integrity of the process.]
Update: Seems like a good time to rerun this graph on publications in political science journals:
Lies, Damn Lies, and....: Via Kieran Healy,
...It is, at first glance, just what it says it is: a study of
publication bias, the tendency of academic journals to publish studies
that find positive results but not to publish studies that fail to find
results. ...
The
chart on the right shows Gerber and Malhotra's (G&M's) basic result. In statistics jargon,
a significant result is anything with a "z-score" higher than 1.96, and
if journals accepted articles based solely on the quality of the work,
with no regard to z-scores, you'd expect the z-scores of studies to
resemble a bell curve. But that's not what Gerber and Malhotra found. Above a z-score of 1.96, the results fit the bell curve pretty well, but below
a z-score of 1.96 there are far fewer studies than you'd expect.
Apparently, studies that fail to show significant results have a hard
time getting published.
So far, this is unsurprising. Publication bias is a well-known and widely studied effect, and it would be surprising if G&M hadn't
found evidence of it. But take a closer look at the graph. In
particular, take a look at the two bars directly adjacent to the magic
number of 1.96. That's kind of funny, isn't it? They should be roughly
the same height, but they aren't even close. There are a lot of studies that just barely show significant results, and there are hardly any
that fall just barely short of significance. There's a pretty obvious
conclusion here, and it has nothing to do with publication bias: data
is being massaged on a wide scale. A lot of researchers who almost find significant results are fiddling with the data to get themselves just over the line into significance. ... Message to political
science professors: you are being watched. And if you report results
just barely above the significance level, we want to see your work....
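As a footnote, the comparison around 1.96 is easy to mimic with simulated data (a toy model of my own, not Gerber and Malhotra's data or procedure): honest z-scores put similar numbers of studies just below and just above the line, while nudging near-misses over it produces exactly the lopsided pattern described above.

```python
# Toy version of the comparison around 1.96. Honest z-scores put roughly
# comparable numbers of studies just below and just above the 1.96 line;
# a simple "nudge the near-misses over the line" rule produces the
# lopsided pattern in the chart. This is an invented model, not Gerber
# and Malhotra's data or procedure.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical study z-scores: a mix of null effects and real effects.
z = np.concatenate([
    np.abs(rng.normal(0.0, 1.0, size=5000)),   # studies of null effects
    np.abs(rng.normal(2.0, 1.0, size=5000)),   # studies of real effects
])

# Massaging rule: near-misses (1.5 <= z < 1.96) get pushed just over the
# threshold half of the time.
massaged = z.copy()
push = (z >= 1.5) & (z < 1.96) & (rng.random(z.size) < 0.5)
massaged[push] = rng.uniform(1.96, 2.10, size=push.sum())

def caliper_counts(scores, width=0.2):
    """Number of z-scores just below and just above the 1.96 threshold."""
    below = int(np.sum((scores >= 1.96 - width) & (scores < 1.96)))
    above = int(np.sum((scores >= 1.96) & (scores < 1.96 + width)))
    return below, above

print("honest z-scores   (just below, just above 1.96):", caliper_counts(z))
print("massaged z-scores (just below, just above 1.96):", caliper_counts(massaged))
```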