Not too long ago, I posted this graph showing a (not so) curious gap in t-statistics just below the 5% significance level in published work in political science, and a similarly surprising pile-up just above the 5% cutoff. This implies that "scholars 'tweak' regression specifications and samples to move barely insignificant results above conventional thresholds," using data mining and other techniques to manipulate the results. Edward Glaeser discusses this and other issues associated with empirical research and gives ten recommendations for "a more economic approach to empirical methods" and "paths for methodological progress." Here's the introduction to the paper:
Researcher Incentives and Empirical Methods, by Edward L. Glaeser, NBER Technical Working Paper 329, October 2006 [open link]:

I. Introduction A central theme of economics is respect for the power of incentives. Economists explain behavior by pointing towards the financial gains that follow from that behavior, and we predict that a behavior will increase when the returns to that behavior rise. Our discipline has long emphasized the remarkable things – from great inventions to petty crimes – that human beings can do when they face the right incentives.
But while economists assiduously apply incentive theory to the outside world, we use research methods that rely on the assumption that social scientists are saintly automatons. No economist would ever write down a model using the assumptions about individual selflessness that lie behind our statistical methods.
A small but strong literature that dates back to Sterling (1959) and Tullock (1959) and that is most strongly associated with Edward Leamer, counters this tendency and examines the impact of researcher initiative on statistical results. But while many of the key points of this literature (especially those associated with data mining and specification searches as in Leamer, 1974 and Lovell, 1983) are both understood and accepted, the only obvious impact of this literature is a healthy skepticism about marginally significant results. Certainly, there is no sense in which standard techniques have been adjusted to respond to researcher incentives.
Indeed, the modest number of econometric papers in this area is probably best understood as the result of the very modest interest of empirical researchers in using techniques that correct for data mining and other forms of researcher initiative. After all, the impact of such techniques is invariably to reduce the significance levels of results. The same incentives that induce researchers to data mine will induce them to avoid techniques that appropriately correct for that data mining.
The importance of these issues increases during periods of rapid change in empirical techniques. Even if economists are not using formal Bayesian methods for dealing with data mining in standard regressions, experience-based rules of thumb proliferate.
Everyone knows to be skeptical about a new growth regression purporting to present a new variable that predicts growth (see Sala-I-Martin, 1997). However, new methods, such as the use of lab experiments or genetic information or instrumental variables techniques, give researchers new opportunities to use their initiative to enhance their results. During these periods of rapid change, it becomes particularly important to think about how to adjust statistical inference for researcher initiative.
In this essay, I make ten points about researcher incentives and statistical work. The first and central point is that we should accept researcher initiative as being the norm, and not the exception. It is wildly unrealistic to treat activity like data mining as being rare malfeasance; it is much more reasonable to assume that researchers will optimize and try to find high correlations. This requires not just a blanket downward adjustment of statistical significance estimates but more targeted statistical techniques that appropriately adjust across data sets and methodologies for the ability of researchers to impact results. Point estimates as well as t-statistics need to be appropriately corrected.
The second point is that the optimal amount of data mining is not zero, and that even if we could produce classical statisticians, we probably would not want to. Just as the incentives facing businessmen produce social value added, the data mining of researchers produces knowledge. The key is to adjust our statistical techniques to realistically react to researcher initiative, not to try to ban this initiative altogether.
The third point is that research occurs in a market where competition and replication matter greatly. Replication has the ability to significantly reduce some of the more extreme forms of researcher initiative (e.g. misrepresenting coefficients in tables), but much less ability to adjust for other activity, like data mining. Moreover, the ability of competition and replication to correct for researcher initiative differs from setting to setting. For example, data mining on a particular micro data set will be checked by researchers reproducing regressions on independent micro data sets. There is much less ability for replication to correct data mining in macro data sets, especially those that include from the start all of the available data points.
Fourth, changes in technology generally decrease the costs of running tests and increase the availability of potential explanatory variables. As a result, the ability of researchers to influence results must be increasing over time, and economists should respond with regular increases in skepticism. At the same time, however, improvements in technology also reduce the cost of competitors checking findings, so the impact of technology on overall bias is unclear.
Fifth, increasing methodological complexity will generally give the researcher more degrees of freedom and therefore increase the scope for researcher activity. Methodological complexity also increases the costs to competitors who would like to reproduce results. This suggests that the skepticism that is often applied to new, more complex techniques may be appropriate.
My sixth point is that data collection and cleaning offer particularly easy opportunities for improving statistical significance. One approach to this problem is to separate the tasks of data collection and analysis more completely. However, this has the detrimental effect of reducing the incentives for data collection, which may outweigh the benefits of specialization. At the least, we should be more skeptical of results produced by analysts who have created and cleaned their own data.
A seventh point is that experimental methods both restrict and enlarge the opportunities for researcher action and consequent researcher initiative bias. Experiments have the great virtue of forcing experimenters to specify hypotheses before running tests. However, they also give researchers tremendous influence over experimental design, and this influence increases the ability of researchers to impact results.
An eighth point is that the recent emphasis on causal inference seems to have led to the adoption of instrumental variables estimators, which can particularly augment researcher flexibility and increase researcher initiative bias. Since the universe of potential instruments is enormous, the opportunity to select instruments creates great possibilities for data mining. This problem is compounded when there are weak instruments, since the distribution of weak instrument t-statistics can have very fat tails. The ability to influence significance by choosing the estimator with the best fit increases as the weight in the extremes of the distribution of estimators increases.
A ninth point is that researcher initiative complements other statistical errors in creating significance. This means that spurious significance rises spectacularly when even modest overestimates of statistical significance are combined with researcher initiative. This complementarity also creates particularly strong incentives to avoid more stringent statistical techniques.
My tenth and final point is that model driven empirical work has an ambiguous impact on researcher initiative bias. One of the greatest values of specialization in theory and empirics is that empiricists end up being constrained to test theories proposed by others. This is obviously most valuable when theorists produce sharp predictions about empirical relationships. On the other hand, if empirical researchers become wedded to a particular theory, they will have an incentive to push their results to support that theory.
Glaeser goes on to look at these issues in more detail using a formal statistical framework.
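To see how quickly specification search inflates significance, here is a minimal simulation sketch (my own illustration, not from the paper): a hypothetical researcher regresses a pure-noise outcome on each of 20 unrelated candidate regressors and reports only the largest |t|-statistic. Even though every true coefficient is zero, "significant" results at the 5% level appear most of the time.

```python
# Monte Carlo sketch of the data-mining problem Glaeser describes:
# search over k unrelated candidate regressors on pure noise and
# keep the best |t|-statistic. All names here are illustrative.
import random, math

random.seed(0)

def best_t(n=50, k=20):
    """Max |t| over k candidate regressors, all independent of y."""
    y = [random.gauss(0, 1) for _ in range(n)]
    my = sum(y) / n
    best = 0.0
    for _ in range(k):
        x = [random.gauss(0, 1) for _ in range(n)]
        mx = sum(x) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        beta = sxy / sxx                       # OLS slope
        resid = [yi - my - beta * (xi - mx) for xi, yi in zip(x, y)]
        se = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)
        best = max(best, abs(beta / se))
    return best

trials = 500
hits = sum(best_t() > 1.96 for _ in range(trials))
print(f"share 'significant' at 5% after searching: {hits / trials:.0%}")
```

With 20 independent tries, the chance that at least one clears the 5% bar is roughly 1 - 0.95^20, about two-thirds, which is the sense in which a blanket skepticism adjustment understates the problem: the inflation depends on how many specifications the researcher could have searched.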