
Household Data: A Heretical View

Posted on Apr 25, 2016 in Blog

At the heart of geodemographic analysis lies the once revolutionary but now mundane site report – a simple aggregation of geographic data within a predefined geometric or arbitrary shape – a radius, a drive time, or perhaps a ZIP code polygon.  Simple, but it serves in one way or another as the basis for most site location research.

The earliest site reports were typically constructed from census tract level data by evaluating the location of the tract centroids to produce rather imprecise estimates, which were nevertheless an enormous leap forward in our ability to objectively evaluate retail site locations.  While the basic methodology of the site report remains largely unchanged, the precision of the underlying geographic data has improved by at least two orders of magnitude as the base unit has progressed from the census tract to the block group to the block.

So, it seems natural at first glance that if block-based site reports are better than block group based site reports, then surely if we could get to the real atomic level – the household – it would be, well, better.  The error band at the edge of our study area would be reduced to the scale of a house.  Is the house centroid, rather than the block centroid, within the study area?  Would it be perfect?  Of course not, but it would clearly be better.  Quite obviously.

Or would it?

At the risk of yet again being labelled as a heretic, it is worthwhile to at least challenge what seems like a foregone, obvious conclusion.   After all, the costs of going from geographic to household data are substantial and if we are to go down this road, there had better be some very, very clear advantages.

Household level data has now been readily available for several decades, and has been steadily improving in quality over time.   Geographic coverage, once at best spotty, can now be declared to be national in scope.   Household by household, we know or can guess names, sex, ages, ethnic origin, incomes, education levels, occupation, purchasing preferences, political affiliations, and so on.   At least that’s what the brochures say.

But as is often the case, one should examine beyond the marketing sound bites.   Break out the manual and actually read it.   Including the footnotes.   Footnotes are always where the good stuff is, or where the bodies are buried, depending on your outlook.

Until a few years ago, the household and individual lists of the messy direct marketing world rarely collided with our neat and tidy geographic realm.   If there was interaction, it was almost always initiated from the geography side – given a geographic selection area, give me the names and addresses of all the households in an area which have a specified set of characteristics.   This was usually the direct marketing extension to the geographic analysis.   Only rarely was the interaction in the other direction.

Two major shifts have begun to change all that.   First, the success of these databases in improving target marketing responses became so well known that everybody and his dog had a campaign going, so the deluged respondents began to approach their satiation point and subsequently, despite ever improving target delineation, response rates have generally been in decline.   Too much of a good thing.  Second, as with so many aspects of modern life, the Internet changed everything about how we shop and how marketers can connect us with products that we want.  Consumers now more often than not actually initiate the contact – visiting the web site of a crib manufacturer is a pretty good indication that they might be planning to buy one.  As a result, times became tough for the list business.   Having household data, why not venture into the geographic realm?

Armed with a list of households, complete with addresses and often including demographic information, why not geocode these lists and aggregate them to those same geographic areas?  Since the data comes from a more focused source, it ought to be an improvement over those block and block group based systems.  While there are many contributing factors as to why this is not really true (such as source material conflicts, temporal issues, etc.), there really is but one main reason why this is absolutely not true.

Enumerated Versus Positively Indicated Attributes

The household list emerges from a compilation of data obtained by matching key identifiers (name, address) from multiple sources, each of which may have been collected for different purposes, and most of it voluntarily provided.  Warranty cards, for example, have long been a fabulous source of household information – always including the name and address of the purchaser and often some key demographics such as size of family, race, age, and income.  And the subject of the card, the product itself, can often tell much about the individual.  A purchase of a baby monitor probably means that, depending on the age of the respondent, the adults in the household are either new parents or new grandparents.  For target marketing, the ability to set the "new baby" flag means that anyone marketing products in this space is immediately more likely to get a response from a household which positively indicates the presence of a baby than from one which was simply randomly targeted.  That the household next door may well harbor an infant who is not known to the system is unfortunate in that we can't add them to that target list, but not problematic.  If our goal is to maximize response rates, then a selection of those records containing the positively indicated attribute will be highly effective.  However, if our goal is to target the negative – say, households which do not have children – we will be much less successful.
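The asymmetry between targeting the positive and targeting the negative can be sketched in a few lines.  The households and the `new_baby_flag` field below are invented for illustration; the point is only that in a list-derived file the flag means "yes or unknown", not "yes or no".

```python
# Hypothetical household records: "new_baby_flag" is a positively indicated
# attribute. True means a positive signal was observed (e.g. a warranty card);
# False really means "unknown", not "no baby".
households = [
    {"id": 1, "new_baby_flag": True,  "actually_has_baby": True},
    {"id": 2, "new_baby_flag": False, "actually_has_baby": True},   # unknown to the list
    {"id": 3, "new_baby_flag": False, "actually_has_baby": False},
    {"id": 4, "new_baby_flag": False, "actually_has_baby": False},
]

# Positive targeting: every flagged household truly qualifies.
positives = [h for h in households if h["new_baby_flag"]]
precision = sum(h["actually_has_baby"] for h in positives) / len(positives)

# Negative targeting: "not flagged" silently sweeps in the unknowns.
negatives = [h for h in households if not h["new_baby_flag"]]
neg_precision = sum(not h["actually_has_baby"] for h in negatives) / len(negatives)

print(precision)      # 1.0  -- every flagged household has a baby
print(neg_precision)  # ~0.67 -- household 2 contaminates the "no baby" list
```

Selecting on the flag gives a clean target list; negating it does not, because the complement of "yes" is "unknown", not "no".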

Take a simple classification of the highest level of educational attainment as an example – where every adult is assigned to a particular group (e.g. some college, graduate degree).   If we were able to classify every individual correctly, we would have an enumerated attribute.   There might even be an “unknown” bucket in the classification, but every individual is assigned uniquely to a single class and the classification is of value only if the unknown bucket is relatively small.

However, in our list world, we obtain data over time from multiple streams which might tell us that a specific individual has a bachelor's degree.  Call this a positively indicated attribute.  The attribute is not a classification, in that the positive attribute of having a bachelor's degree does not mean that the individual doesn't also have a professional or graduate degree.  The result is that lists are maintained for each potential class separately, and the values are set not as "yes or no" but as "yes or unknown".  We might not have this information for the vast majority of records, but that is no matter if our goal is to target individuals with bachelor's degrees.
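The structural difference between the two representations is easy to show directly; the names and source streams below are hypothetical.

```python
# Enumerated attribute: one exclusive class per person (possibly "unknown").
enumerated = {"alice": "graduate degree", "bob": "some college", "carol": "unknown"}

# Positively indicated attributes: a separate yes/unknown list per class.
# Setting "bachelors" for alice does not preclude "graduate" arriving later
# from a different stream -- the classes are not mutually exclusive.
positive_lists = {
    "bachelors":    {"alice"},   # observed from, say, an alumni list
    "graduate":     {"alice"},   # observed from a separate stream
    "some_college": {"bob"},
}

# Exclusive classes partition the population (carol is an honest "unknown")...
classified = sum(v != "unknown" for v in enumerated.values())

# ...positive lists do not partition anything: alice sits in two classes,
# and carol's absence is indistinguishable from a genuine "none of these".
overlap = positive_lists["bachelors"] & positive_lists["graduate"]
print(overlap)  # {'alice'}
```

Summing the positive lists therefore never reproduces an enumerated distribution: the classes overlap, and the unlisted remainder is a mix of "no" and "unknown".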

Where this becomes unpleasant is when households are summarized over a geographic area and compared to enumerated sources.  That the two data systems are more or less incompatible quickly manifests itself in the radical differences between variables, especially those where the “unknown” is substantial – which is the majority of variables.

One would think that something as fundamental as the age distribution of an area’s population would be easily identified by summarizing the household level data.   It isn’t.   The distributions rarely look much like the published enumerated distributions.    Children, who typically don’t engage directly in the information gathering process, are grossly undercounted.   The ages of adults can be obtained from a wide range of sources, or even inferred from the given name under some circumstances, but the result is a hodge-podge of data which remains a mixture of unknown, inferred, and stated values.

The result of all this is that the very first thing that the list aggregator must do is standardize each variable or enumeration to known values for a recognizable geographic area.    Let’s assume that our target geography is the census block group, which is the lowest level for which most of the interesting attributes of the population are published.   This standardization takes place at two levels.   The first, and most prevalent, is that the unknown attributes of the household are filled in by assigning the average value for the geographic area using the known attributes of the household.

For example, if we wish to estimate the likely household income, we could use the median income for households of a certain age or race as an estimate.   The more known attributes we have, the closer our initial estimate will likely be to reality.   There is a perception that the primary list compilers, who are generally also credit reporting agencies, have superior data because they have access to the credit data.   Yes, they have it.  But no, they can’t use it, as the fines which would be imposed upon them would bankrupt both their list and credit scoring operations. So they resort to other means – using things like automobile ownership to predict income levels, then comparing the results to aggregate income numbers from their credit files (which themselves contain many unknowns) or to the enumerated data at a higher level of geography.
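The first level of standardization described above can be sketched as follows.  The median figures, block group identifiers, and the `estimate_income` helper are all invented for illustration of the infill logic, not taken from any actual product.

```python
# Hypothetical median household income by (age bracket, block group),
# drawn from enumerated geographic sources:
bg_median_income = {
    ("25-34", "bg_001"): 54000,
    ("35-44", "bg_001"): 71000,
    ("65+",   "bg_001"): 43000,
}

def estimate_income(household):
    """Return the stated income if known; otherwise infill with the
    block group median conditioned on the household's known attributes."""
    if household.get("income") is not None:
        return household["income"]
    key = (household["age_bracket"], household["block_group"])
    return bg_median_income[key]

known   = {"age_bracket": "35-44", "block_group": "bg_001", "income": 88000}
unknown = {"age_bracket": "35-44", "block_group": "bg_001", "income": None}

print(estimate_income(known))    # 88000 -- the stated value passes through
print(estimate_income(unknown))  # 71000 -- the geographic median, not the household
```

Note what the infilled record actually contains: not household-level information, but the geographic average handed back down to the household.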

If you were to take the four hundred or so essential demographic variables from the AGS core estimates and projections product, you would discover that for the majority of households, a majority of the variables would need to be inferred from geographic enumerated sources.

Worse, the second level of inference arises because of the need to align the list derived estimates with other sources – meaning that individual variables and enumeration sets are standardized to fit the higher order geographic data.   This is very much akin to creating a regression equation in which our dependent variable is also found, usually disguised via some transformation, on both sides of the equation.   Delighted by the apparent good fit of the model and unwilling to peruse the footnotes which accompany the documentation, we are then shocked and dismayed to discover that the model doesn’t really work that well.

As an aside here, you might ask a reasonable question.  Since the Census detail has largely been replaced by the American Community Survey (ACS), don't we have the same problem?  No, we don't.  The ACS is based on a structured, geographically stratified sample which affords the ability to make statistical inferences about the underlying population.  In the case of positively indicated attributes on household files, we have no means of inference because we have no means of distinguishing between "no" and "unknown".  If 8% of our households in a ZIP code are flagged as Hispanic based on last name and other sources, can we estimate the total percentage of households which are Hispanic?  No, actually we can't, because we have no way of discerning which of the remaining 92% are Hispanic but were never asked and never told us so.  The only way is to utilize the enumerated sources to fill in the gaps, in which case we have gained very, very little additional information because we have purged most of the variance introduced by the detailed level in our cleanup effort.
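A toy simulation makes the contrast concrete.  The 20% true rate and the 40% chance that a qualifying household gets flagged are invented numbers; the point is that the flagged share is a floor, not an estimate, while a random sample of the same size supports genuine inference.

```python
import random
random.seed(42)

N = 10_000
true_rate = 0.20                       # true share of qualifying households
truth = [random.random() < true_rate for _ in range(N)]

# Positively indicated attribute: a qualifying household is flagged only if
# some source happened to reveal it (assume 40% of the time).
flagged = [t and random.random() < 0.4 for t in truth]
observed_rate = sum(flagged) / N       # ~0.08: a lower bound, nothing more

# A random sample of the same size, asked directly: here "no" really means
# no, so the sample proportion is a defensible estimate of the truth.
sample = random.sample(truth, sum(flagged))
sample_rate = sum(sample) / len(sample)

print(round(observed_rate, 3))   # ~0.08 -- cannot recover 0.20 from this alone
print(round(sample_rate, 3))     # ~0.20 -- close to the true rate
```

No amount of arithmetic on the flagged 8% recovers the true 20%, because the unflagged remainder blends "no" with "unknown" in unknowable proportions.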

So let us now return to our site report, a simple five-mile radius around a location.  If we use the compiled list based data, we have undertaken a great deal of very expensive work to obtain the data, and we have attempted to correct its deficiencies by infilling and standardizing to higher levels of geography.  It should be readily apparent that if we are standardizing our household data to the block group, and then reporting data for groups of block groups, we have not actually added significantly to the information content of the result.  If, and only if, our site reports are for areas smaller than a block group would there be any potential for increasing the information content by using the household data.  The standardization function ensures that the larger the trade area, the less impact will be felt by the "improved" data.

This is shocking on its surface, but actually logical and simple if one works through the details.

Imagine if you will that the pixel-level display on your monitor is a block group map.  I come along and tell you that I have a better way of creating each pixel.  For each pixel, I have a greatly detailed 100 by 100 sub-pixel grid which I have painstakingly, and at great cost (which I shall gleefully pass on to you), obtained.  While I don't know all of these sub-pixel values, I have gone to great lengths to estimate the missing sub-pixels by considering the color of the pixel itself.  Having filled in my 100 by 100 matrix of sub-pixels, I find that I need to make adjustments because the blended sub-pixel color doesn't match the actual pixel color.  The result is that my "enhanced" pixel display looks exactly like the regular one, but it is quite obviously better because it was based on the more detailed but now standardized sub-pixels.

Silly example perhaps, but this is really what is going on here.  The enhancements, to the extent that they are truly meaningful, are lost because the resolution of the output is coarser than the resolution of the data itself, and the enhancements were but a sample – and not a stratified sample on which statistical inferences can actually be developed.  By the time we consider the overall character of a block of pixels (our five-mile ring), the value of any individual sub-pixel is essentially irrelevant unless the values of those sub-pixels are radically different from our original pixel view (in which case we would not have standardized them in the first place).  In our silly example, all would be well if we were colorizing an old black and white film and thus adding information.  But our film is already in color, so why spend the money to colorize it?  There is almost no perceptible increase in our information content.
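The pixel analogy can be put in numbers with a tiny simulation (all values arbitrary): once the household ("sub-pixel") values have been standardized back to the known block group ("pixel") means, any aggregate over several block groups is numerically identical with or without the household detail.

```python
from statistics import mean

# Known "enumerated" block group means (the pixels):
bg_means = [52.0, 61.0, 47.0]

# Invented household-level values per block group (the sub-pixels),
# incomplete and partly inferred:
households = [
    [50.0, 55.0, 58.0],
    [40.0, 70.0],
    [47.0, 44.0, 49.0, 52.0],
]

# Standardization: rescale each block group's households so that their
# mean matches the enumerated block group mean.
standardized = []
for values, target in zip(households, bg_means):
    factor = target / (sum(values) / len(values))
    standardized.append([v * factor for v in values])

# Aggregate over the whole study area (our "five-mile ring"):
agg_from_bg = mean(bg_means)                          # block group data only
agg_from_hh = mean(mean(bg) for bg in standardized)   # via household detail

print(agg_from_bg)  # ≈53.33
print(agg_from_hh)  # ≈53.33 -- identical: the household detail washed out
```

The standardization step guarantees each group's household mean equals the enumerated mean, so the expensive sub-block-group detail contributes nothing to any report built from whole block groups.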

Consider this equivalent to calculating the value of pi to fifty significant digits and then having no way of displaying the detail because our output device shows only whole numbers.  In that scenario, 3.1415 remains 3, and no amount of effort at getting the last digit correct will change the reported result.  And no amount of prattling on about how much better it is anyway will make it so.

At the end of the day, the problem here is that we must match the object of our analysis to the data by which we analyze it.   If our object is to target individuals, household data is a marvelous tool.   If it is to analyze geographic areas, it adds almost nothing to the analysis and it does so at great cost.   So the next time you are told that “we have household data and it is better and worth paying for”, just ask yourself if this is really true before you sign the contract.

But we have not addressed here the elephant who isn’t just in the room, but is resting comfortably on your La-Z-Boy chair, remote in hand, munching on your popcorn, drinking your last cold one, and watching NatGeoWild shark week on your sub-pixel enhanced television display.  His t-shirt simply says that awful four letter word “Error” and below it in a much smaller and fanciful font “Just ignore it, and it will go away”.   But we shall endeavor to deal with said shabbily attired elephant at another time, since no matter what data you use, he is not about to go away.

Gary Menger
Applied Geographic Solutions, Inc.
April 2016