Retail performance models have long used trade area demographics to explain differences in store sales, and for good reason: the relationship between the demographic characteristics of a store's trade area and its performance is both obvious and well understood. At its simplest, such a model takes the classic form of a linear regression in which sales, or sales per household, is estimated from a set of predictor variables such as income, education, and vehicle ownership.
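Written out, the trivial model might look like the following. This is only a sketch: it assumes income, education, and vehicles as the three predictors (the ones the discussion goes on to name), and the coefficient symbols are illustrative notation rather than anything taken from an actual fitted model.

```latex
\text{Sales}_i = \beta_0
  + \beta_1\,\text{Income}_i
  + \beta_2\,\text{Education}_i
  + \beta_3\,\text{Vehicles}_i
  + \varepsilon_i
```

Here $i$ indexes store trade areas, and $\varepsilon_i$ is the part of sales the predictors do not explain.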
Modeling approaches ranging from simple regression equations to spatial interaction models, decision trees and unsupervised AI models all in theory require (or at least strongly prefer) that the predictor variables, like income and education, be statistically independent of each other. Most good models incorporate other factors (site characteristics, for example), but even in our trivial model we know that higher income tends to accompany both more education and more vehicles, whether for individuals or for geographic aggregates of individuals (e.g. a trade area).
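One way to see how far a set of predictors is from independent is to look at their pairwise correlations and variance inflation factors (VIFs). The sketch below uses synthetic data, not real trade areas: the function `simulate_trade_areas` and its mixing weights are assumptions made up for illustration, chosen only so that education and vehicles both track income. It relies on the standard result that each predictor's VIF is the matching diagonal entry of the inverse correlation matrix.

```python
import random


def simulate_trade_areas(n, seed=0):
    """Synthetic (income, education, vehicles) rows; weights are illustrative only."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        income = rng.gauss(0, 1)
        education = 0.8 * income + 0.6 * rng.gauss(0, 1)  # tracks income
        vehicles = 0.7 * income + 0.7 * rng.gauss(0, 1)   # tracks income
        rows.append((income, education, vehicles))
    return rows


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def correlation_matrix(rows):
    """Square matrix of pairwise correlations between the columns of rows."""
    cols = list(zip(*rows))
    return [[pearson(a, b) for b in cols] for a in cols]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (adequate for a small matrix)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(rows):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(rows))
    return [inverse[i][i] for i in range(len(inverse))]


if __name__ == "__main__":
    rows = simulate_trade_areas(500)
    names = ("income", "education", "vehicles")
    for name, vif in zip(names, variance_inflation_factors(rows)):
        print(f"{name:10s} VIF = {vif:.2f}")
```

A VIF of 1 means a predictor is uncorrelated with the others; the further above 1 it climbs, the more of that predictor's variation is already explained by the rest of the set.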
The results of this correlation between “independent” predictors are well known – in simple terms, the models will almost always perform better on paper than in reality. This, of course, is not music to the ears of the user about to drop a couple of million dollars on a new location, so the topic is normally addressed obliquely in the footnotes of a report. Reports tend to focus on the variables that are easily defined, discussed, and explained in simple terms.
There is a better way: use a set of variables that encapsulates the essence of the demographics that differentiate neighborhoods, with each variable independent of all the others. The AGS Demographic Dimensions set consists of 26 carefully constructed variables which do exactly this. The scatter plot of Affluence by Family Status demonstrates the lack of correlation between the two variables.
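To make the idea of constructed, uncorrelated measures concrete, here is a deliberately tiny illustration. It is not how the Demographic Dimensions are built (that method is not described here); it only shows that two correlated inputs can be recombined into two measures with no linear correlation. The names `rotate_pair`, `level`, and `contrast` are hypothetical, the data are synthetic, and the check covers linear correlation only, which is weaker than full statistical independence.

```python
import random


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def rotate_pair(a, b):
    """Replace two variables with the sum and difference of their standardized values.

    Because both standardized inputs have the same variance, the sum and the
    difference have zero covariance with each other.
    """
    za, zb = standardize(a), standardize(b)
    level = [x + y for x, y in zip(za, zb)]
    contrast = [x - y for x, y in zip(za, zb)]
    return level, contrast


def make_correlated_pair(n, seed=0):
    """Synthetic income and education values; the weights are illustrative only."""
    rng = random.Random(seed)
    income = [rng.gauss(0, 1) for _ in range(n)]
    education = [0.8 * inc + 0.6 * rng.gauss(0, 1) for inc in income]
    return income, education


if __name__ == "__main__":
    income, education = make_correlated_pair(500)
    level, contrast = rotate_pair(income, education)
    print(f"raw inputs correlation:     {pearson(income, education):.3f}")
    print(f"rotated pair correlation:   {pearson(level, contrast):.3f}")
```

The price of the rotation is the one the article acknowledges: `level` and `contrast` are harder to describe in plain terms than income and education.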
Even with large samples, multicollinearity inflates the expected performance of the model (through a larger R²) and in some severe cases can result in biased predictions (consistently over- or under-predicting).
In the practical world of site sales forecasting, small sample sizes bring large problems. The actual range of the predictor values can be quite narrow, because the sites were chosen based on a few individuals’ knowledge of “what will work”, and those individuals stick to what they know. When such a model is “rolled out” into other markets, the predictor variables can often fall well outside the previous ranges and the results can be disastrous. When multicollinearity is present, the relative size and even the direction (sign) of the estimated predictor effects are highly sensitive to the particular sample used.
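The sensitivity of coefficient size and sign can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any real site model: the data are synthetic, both true coefficients are set to 1.0, each "study" has only 15 sites, and the function names are invented for the example. It repeatedly fits a two-predictor regression and compares how much the first coefficient varies when the predictors are uncorrelated versus highly correlated.

```python
import random
import statistics


def fit_two_predictor_ols(x1, x2, y):
    """Return the two slope estimates from least squares with an intercept."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12  # shrinks toward 0 as the predictors align
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def coefficient_spread(rho, n_sites=15, n_trials=500, seed=1):
    """Simulate many small samples; return (std dev of b1, share of trials with b1 < 0).

    rho is the target correlation between the two predictors.
    The true value of both coefficients is 1.0 in every trial.
    """
    rng = random.Random(seed)
    b1_estimates = []
    for _ in range(n_trials):
        x1, x2, y = [], [], []
        for _ in range(n_sites):
            a = rng.gauss(0, 1)
            b = rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
            x1.append(a)
            x2.append(b)
            y.append(1.0 * a + 1.0 * b + rng.gauss(0, 1))
        b1, _ = fit_two_predictor_ols(x1, x2, y)
        b1_estimates.append(b1)
    spread = statistics.stdev(b1_estimates)
    wrong_sign_share = sum(b < 0 for b in b1_estimates) / n_trials
    return spread, wrong_sign_share


if __name__ == "__main__":
    for rho in (0.0, 0.95):
        spread, wrong = coefficient_spread(rho)
        print(f"rho={rho:.2f}  std dev of b1={spread:.2f}  wrong-sign share={wrong:.1%}")
```

With highly correlated predictors, the estimates of a coefficient whose true value is positive scatter far more widely from sample to sample, and some of them come out negative, which is the "direction" problem described above.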
While we can’t overcome the lack of diversity in a small sample of sites (we will talk about sample diversity in a future article), we can at least minimize estimation errors and model sensitivity by using measures which have been explicitly defined for analytics. Models will be more stable and reliable, although this admittedly comes at the price of some clarity because the measures are no longer simple and readily understood. When models are used to guide capital investments in new sites, we favor better models over more easily explained models.