Jennifer Castle and David Hendry on data mining
‘Data mining’ with more variables than observations: While ‘fool’s gold’ (iron pyrites) can be found by mining, most mining is a productive activity. Similarly, when properly conducted, so-called ‘data mining’ is no exception –despite many claims to the contrary. Early criticisms, such as the review of Tinbergen (1940) by Friedman (1940) for selecting his equations “because they yield high coefficients of correlation”, and by Lovell (1983) and Denton (1985) of data mining based on choosing ‘best fitting’ regressions, were clearly correct. It is also possible to undertake what Gilbert (1986) called ‘strong data mining’, whereby an investigator tries hundreds of empirical estimations, and reports the one she or he ‘prefers’ – even when such results are contradicted by others that were found. As Leamer (1983) expressed the matter: “The econometric art as it is practiced at the computer terminal involves fitting many, perhaps thousands, of statistical models. One or several that the researcher finds pleasing are selected for reporting purposes”. That an activity can be done badly does not entail that all approaches are bad, as stressed by Hoover and Perez (1999), Campos and Ericsson (1999), and Spanos (2000) – driving with your eyes closed is a bad idea, but most car journeys are safe.
Why is ‘data mining’ needed?
Econometric models need to handle many complexities if they are to have any hope of approximating the real world. There are many potentially relevant variables, dynamics, outliers, shifts, and non-linearities that characterise the data generating process. All of these must be modelled jointly to build a coherent empirical economic model, necessitating some form of data mining – see the approach described in Castle et al. (2011) and extensively analysed in Hendry and Doornik (2014).
Any omitted substantive feature will result in erroneous conclusions, as other aspects of the model attempt to proxy the missing information. At first sight, allowing for all these aspects jointly seems intractable, especially with more candidate variables (denoted N) than observations (T denotes the sample size). But help is at hand with the power of a computer. ...[gives technical details]...
Appropriately conducted, data mining can be a productive activity even with more candidate variables than observations. Omitting substantively relevant effects leads to mis-specified models, distorting inference, which large initial specifications should mitigate. Automatic model selection algorithms like Autometrics offer a viable approach to tackling more candidate variables than observations, controlling spurious significance.