Tests of Statistical Significance in Economics
This one's for the "regression-heads". Andrew Gelman comments on the McCloskey, Ziliak, Hoover, and Siegler debate over the use of tests of statistical significance in economics:
Significance testing in economics: McCloskey, Ziliak, Hoover, and Siegler, by Andrew Gelman: Scott Cunningham writes,
Today I was rereading Deirdre McCloskey and Ziliak's JEL paper on statistical significance, and then reading for the first time their detailed response to a critic who challenged their original paper. I was wondering what opinion you had about this debate. Are statistical significance and Fisher tests of significance as maligned and problematic as McCloskey and Ziliak claim? In your professional opinion, what is the proper use of seeking to scientifically prove that a result is valid and important?
The relevant papers are:
- McCloskey and Ziliak, "The Standard Error of Regressions," Journal of Economic Literature 1996.
- Ziliak and McCloskey, "Size Matters: The Standard Error of Regressions in the American Economic Review," Journal of Socio-Economics 2004.
- Hoover and Siegler, "Sound and Fury: McCloskey and Significance Testing in Economics," Journal of Economic Methodology 2008.
- McCloskey and Ziliak, "Signifying Nothing: Reply to Hoover and Siegler."
My comments:
1. I think that McCloskey and Ziliak, and also Hoover and Siegler, would agree with me that the null hypothesis of zero coefficient is essentially always false. (The paradigmatic example in economics is program evaluation, and I think that just about every program being seriously considered will have effects--positive for some people, negative for others--but not averaging to exactly zero in the population.) From this perspective, the point of hypothesis testing (or, for that matter, of confidence intervals) is not to assess the null hypothesis but to give a sense of the uncertainty in the inference. As Hoover and Siegler put it, "while the economic significance of the coefficient does not depend on the statistical significance, our certainty about the accuracy of the measurement surely does. . . . Significance tests, properly used, are a tool for the assessment of signal strength and not measures of economic significance." Certainly, I'd rather see an estimate with an assessment of statistical significance than an estimate without such an assessment.
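To make this concrete, here's a toy simulation (my own sketch, not from any of the papers): the true coefficient is small but nonzero, so whether the test "rejects" is essentially a function of sample size--the t-statistic measures signal strength, not economic importance.

```python
# Illustrative only: a tiny but nonzero true effect becomes "significant"
# once n is large enough, which is why the test speaks to uncertainty
# rather than to practical importance.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.02  # small but nonzero, as point 1 argues is the typical case

for n in [100, 10_000, 1_000_000]:
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    xc, yc = x - x.mean(), y - y.mean()
    b_hat = (xc @ yc) / (xc @ xc)            # OLS slope (with intercept)
    resid = yc - b_hat * xc
    se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    print(f"n={n:>9,}  b_hat={b_hat:+.4f}  se={se:.4f}  t={b_hat / se:+.2f}")
```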
2. Hoover and Siegler's discussion of the logic of significance tests (section 2.1) is standard but, I believe, wrong. They talk all about Type 1 and Type 2 errors, which are irrelevant for the reasons described in point 1 above.
3. I agree with most of Hoover and Siegler's comments in their Section 2.4, in particular with the idea that the goal in statistical inference is often not to generalize from a sample to a specific population, but rather to learn about a hypothetical larger population, for example generalizing to other schools, other years, or whatever. Some of these concerns can best be handled using multilevel models, especially when considering different possible generalizations. This is most natural in time-series cross-sectional data (where you can generalize to new units, new time points, or both) but also arises in other settings. For example, in our analyses of electoral systems and redistricting plans, we were careful to set up the model so that our probability distribution generalized to other possible elections in existing congressional districts, not to hypothetical new districts drawn from a common population.
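To sketch the multilevel idea concretely (my construction, with hypothetical variable names, not Gelman's own code): a varying-intercept model estimates a between-group variance, and that component is what supports generalizing to groups outside the sample.

```python
# A minimal varying-intercept model in statsmodels, on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_schools, n_per = 30, 40
school = np.repeat(np.arange(n_schools), n_per)
alpha = rng.normal(0.0, 0.5, n_schools)          # school-level intercepts
x = rng.normal(size=n_schools * n_per)
y = alpha[school] + 0.3 * x + rng.normal(size=n_schools * n_per)
df = pd.DataFrame({"y": y, "x": x, "school": school})

# Random intercept for each school; the estimated group variance is the
# piece that lets the inference extend to schools not in the data.
fit = smf.mixedlm("y ~ x", df, groups=df["school"]).fit()
print(fit.summary())
```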
4. Hoover and Siegler's Section 2.5, while again standard, is I think mistaken in ignoring Bayesian approaches, which limits their "specification search" approach to the two extremes of least squares or setting coefficients to zero. They write, "Additional data are an unqualified good thing, which never mislead." I'm not sure if they're being sarcastic here or serious, but if they're being serious, I disagree. Data can indeed mislead on occasion.
Later Hoover and Siegler cite a theorem that states "as the sample size grows toward infinity and increasingly smaller test sizes are employed, the test battery will, with a probability approaching unity, select the correct specification from the set. . . . The theorem provides a deep justification for search methodologies . . . that emphasize rigorous testing of the statistical properties of the error terms." I'm afraid I disagree again--not about the mathematics, but about the relevance, since, realistically, the correct specification is not in the set, and the specification that is closest to the ultimate population distribution should end up including everything. A sieve-like approach seems more reasonable to me, where more complex models are considered as the sample size increases. But then, as McCloskey and Ziliak point out, you'll have to resort to substantive considerations to decide whether various terms are important enough to include in the model. Statistical significance or other purely data-based approaches won't do the trick.
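Here's a toy version of that sieve idea (my sketch; the data-generating process is deliberately outside the candidate set): choose a polynomial degree by BIC at each sample size, and the selected complexity grows with n.

```python
# Toy sieve: the "true" model (a sine curve) is not in the polynomial
# family, so the BIC-chosen degree keeps rising as the sample grows.
import numpy as np

rng = np.random.default_rng(2)

def bic_best_degree(x, y, max_degree=8):
    n = len(x)
    best_d, best_bic = 0, np.inf
    for d in range(max_degree + 1):
        X = np.vander(x, d + 1)                  # columns x^d, ..., x^0
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ coef) ** 2)
        bic = n * np.log(rss / n) + (d + 1) * np.log(n)
        if bic < best_bic:
            best_d, best_bic = d, bic
    return best_d

for n in [50, 500, 5_000, 50_000]:
    x = rng.uniform(-2, 2, n)
    y = np.sin(2 * x) + rng.normal(scale=0.5, size=n)
    print(f"n={n:>6}  BIC-chosen degree: {bic_best_degree(x, y)}")
```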
Although I disagree with Hoover and Siegler in their concerns about Type 1 error etc., I do agree with them that it doesn't pay to get too worked up about model selection and its distortion of results--at least in good analyses. I'm reminded of my own dictum that multiple comparisons adjustments can be important for bad analyses but are not so important when an appropriate model is fit. I agree with Hoover and Siegler that it's worth putting in some effort in constructing a good model, and not worrying if said model was not specified before the data were seen.
5. Unfortunately my copy of McCloskey and Ziliak's original article is not searchable, but if they really said, "all the usual econometric problems have been solved"--well, hey, that's putting me out of a job, almost! Seriously, there are lots of statistical (thus, I assume, econometric) problems that are still open, most notably in how to construct complex models on large datasets, as well as more specific technical issues such as adjustments for sample surveys and observational studies, diagnostics for missing-data imputations, models for time-series cross-sectional data, etc etc etc.
6. I'm not familiar enough with the economics to comment much on the examples, but the study of smoking seems pretty wacky to me. First there is a discussion of "rational addiction." Huh?? Then Ziliak and McCloskey say "cigarette smoking may be addictive." Umm, maybe. I guess the jury is still out on that one . . . .
OK, regarding "rational addiction," I'm sure some economists will bite my head off for mocking the concept, so let me just say that presumably different people are addicted in different ways. Some people are definitely addicted in the real sense that they want to quit but they can't, perhaps others are addicted rationally (whatever that means). I could imagine fitting some sort of mixture model or varying-parameter model. I could imagine some sort of rational addiction model as a null hypothesis or straw man. I can't imagine it as a serious model of smoking behavior.
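Mechanically, such a mixture might look like the following (entirely hypothetical: simulated data and an off-the-shelf two-component Gaussian mixture, not any model from the addiction literature).

```python
# Two latent groups of smokers with different price responsiveness,
# recovered by a two-component Gaussian mixture. Purely illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
addicted = rng.normal(loc=-0.1, scale=0.2, size=700)    # barely respond
responsive = rng.normal(loc=-1.0, scale=0.3, size=300)  # respond strongly
scores = np.concatenate([addicted, responsive]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(scores)
print("component means:  ", gm.means_.ravel())
print("component weights:", gm.weights_)
```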
7. Hoover and Siegler must be correct that economists overwhelmingly understand that statistical and practical significance are not the same thing. But Ziliak and McCloskey are undoubtedly also correct that most economists (and others) confuse these all the time. They have the following quote from a paper by Angrist: "The alternative tests are not significantly different in five out of nine comparisons (p<0.02), but the joint test of coefficient equality for the alternative estimates of θ_t leads to rejection of the null hypothesis of equality." This indeed does not look like good statistics.
Similar issues arise in the specific examples. For instance, Ziliak and McCloskey describe how Becker, Grossman, and Murphy summarize their results in terms of t-ratios of 5.06, 5.54, etc., which indeed misses the point a bit. But Hoover and Siegler point out that Becker et al. also present coefficient estimates and interpret them on relevant scales. So they make some mistakes but present some things reasonably.
8. People definitely don't understand that the difference between significant and not significant is not itself statistically significant.
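A quick numerical version of this point (the numbers are illustrative): one estimate looks clearly significant, another clearly not, yet the difference between them is nowhere near significant.

```python
# Estimate 1 has z = 2.5 (significant); estimate 2 has z = 1.0 (not).
# Their difference, 15 with standard error ~14.1, gives z ~ 1.06.
import numpy as np

b1, se1 = 25.0, 10.0
b2, se2 = 10.0, 10.0
se_diff = np.sqrt(se1**2 + se2**2)   # assumes independent estimates
print(f"z1 = {b1/se1:.2f}, z2 = {b2/se2:.2f}, "
      f"z_diff = {(b1 - b2)/se_diff:.2f}")
```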
9. Finally, what does this say about the practice of statistics (or econometrics)? Does it matter at all, or should we just be amused by the gradually escalating verbal fireworks of the McCloskey/Ziliak/Hoover/Siegler exchange? In answer to Scott's original questions, I do think that statistical significance is often misinterpreted but I agree with Hoover and Siegler's attitude that statistical significance tells you about your uncertainty of your inferences. The biggest problem I see in all this discussion is the restriction to simple methods such as least squares. When uncertainty is an issue, I think you can gain a lot from Bayesian inference and also from expanding models to include treatment interactions.
I'll disagree mildly with point one. I don't view the "paradigmatic example in economics" to be program evaluation. We do some of that, but much of what econometricians do is test the validity of alternative theories, and in those contexts the hypothesis of a zero coefficient can make sense. For example, New Classical models imply that expected changes in the money supply should not impact real variables. Thus, a test of a zero coefficient on expected money in an equation with a measure of real activity as the dependent variable is a test of the validity of the New Classical model's prediction. These tests require sharp distinctions between models, i.e., variables that can impact other variables in one theory but not another. Such distinctions are something we try hard to find, and when they exist I believe classical hypothesis tests have something useful to contribute.
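To illustrate the kind of test I have in mind (a sketch with simulated data, not an actual empirical exercise): regress real output growth on expected and unexpected money growth; the New Classical prediction is a zero coefficient on the expected component.

```python
# Simulated "New Classical" world: only money surprises move output.
# The t-test on expected_m is then a sharp test of the theory.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
expected_m = rng.normal(size=n)
unexpected_m = rng.normal(size=n)
output_growth = 0.0 * expected_m + 0.8 * unexpected_m + rng.normal(size=n)

X = sm.add_constant(np.column_stack([expected_m, unexpected_m]))
fit = sm.OLS(output_growth, X).fit()
print(fit.summary(xname=["const", "expected_m", "unexpected_m"]))
```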
Posted by Mark Thoma on Friday, October 5, 2007 at 12:51 PM in Economics, Methodology