
The Illusion of Precision

Posted on Dec 16, 2015 in Blog

Sometimes the increasingly prevalent marketing hype and oversell in this business simply annoy me to the point of arousing the normally dormant academic in me.

When the first computerized and geo-referenced censuses were released in the United States, Canada, and Great Britain, a small but interesting industry known as Geodemographics emerged from a limited number of geography departments where quantitative geography and the newly emerging computer cartography fields were rapidly converging.

In those early days, I abandoned my academic career to join Compusearch, a relatively new company in Toronto which eventually became part of MapInfo. With the massive computing power available, somewhat underpowered compared to a modern watch, the site report was born. It typically consisted of a few pages of horribly formatted census and estimated demographics for varying radii around a location, which was typically manually geocoded using, yes, paper topographic sheets.

We used block groups (enumeration areas in Canada) as the core data building block for these reports. Because we had only an area centroid for each block group, a particular ring report would include those block groups whose centroid fell within the ring. In or out, all or nothing. While errors could be significant, the service nevertheless offered analysts valuable data not previously available to them.
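The all-or-nothing rule can be sketched in a few lines. This is an illustration only, not the original Compusearch code: the block group records, coordinates, and function names are hypothetical, and distance is measured with the standard haversine formula.

```python
import math

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def ring_total_all_or_nothing(site, block_groups, radius_miles):
    """Sum the population of block groups whose centroid falls inside the ring."""
    lat0, lon0 = site
    return sum(
        bg["population"]
        for bg in block_groups
        if haversine_miles(lat0, lon0, bg["lat"], bg["lon"]) <= radius_miles
    )

# Hypothetical block groups (values invented for illustration).
block_groups = [
    {"id": "A", "lat": 35.470, "lon": -97.520, "population": 1200},
    {"id": "B", "lat": 35.480, "lon": -97.515, "population": 900},
    {"id": "C", "lat": 35.520, "lon": -97.520, "population": 1500},
]

# A and B have centroids within one mile of the site; C is in or out as a whole.
print(ring_total_all_or_nothing((35.470, -97.520), block_groups, 1.0))
```

A block group whose centroid sits just outside the ring contributes nothing, even if half of its residents live inside it, which is where the significant errors come from.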

Computing technology and digital block group boundaries enabled area proportioning of block groups, improving on the in/out style of retrieval, but with known issues resulting from the uneven distribution of population within block groups. Area apportionment was quickly replaced by block centroid methods once the necessary data became available. The 1990s saw other methods come and go, including block area apportionment (same problem as with block group apportionment, plus excessive computation times for little actual numerical change in the results). The other, thankfully long gone, method known as “circular fields” was a stroke of marketing genius which promised much but delivered nothing.

Population weighted block centroids remain the most common method of apportioning block group data, with most vendors, AGS included, updating the centroid weights between census releases.
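A minimal sketch of population weighted block centroid apportionment follows. It assumes planar coordinates in miles for simplicity and uses invented block weights; it is not the AGS production method, only the general idea of allocating a block group by the weights of the block centroids that fall inside the ring.

```python
import math

def apportion_block_group(site_xy, radius, bg_population, blocks):
    """Allocate a block group's population to a ring via weighted block centroids.

    blocks is a list of (x, y, weight) tuples, where weight is the block's
    share of the block group population (weights are assumed to sum to 1.0).
    """
    sx, sy = site_xy
    share_inside = sum(
        w for x, y, w in blocks if math.hypot(x - sx, y - sy) <= radius
    )
    return bg_population * share_inside

# Hypothetical block group of 1,000 people split across three blocks.
blocks = [(0.2, 0.1, 0.40), (0.8, 0.3, 0.35), (1.4, 0.2, 0.25)]

# Two of the three centroids fall inside a one-mile ring, so the ring
# receives their combined share (75%) of the block group.
print(apportion_block_group((0.0, 0.0), 1.0, 1000, blocks))
```

Updating the weights between census releases changes only the third element of each tuple; the retrieval logic stays the same.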
The latest challenger to the method utilizes ZIP+4 centroids and claims to deliver a better result because of the much greater density of points compared to block centroids. A recent blog post by Bill Dakai on the Trade Area Systems website appears at first blush to be a reasonably decent review of the strengths and weaknesses of the various methods of retrieval. On closer examination, the post (http://blog.tradeareasystems.com/blog/overview-of-demographic-retrieval-schemes) is really little more than a series of contrived hypothetical examples intended to lead the reader towards the author’s preferred conclusion.

We at AGS are primarily producers of demographic and related data rather than builders of delivery systems, and the blog post is actually on the website of an authorized AGS reseller. Even so, the topic merits a proper, thorough analysis rather than a series of contrived diagrams if the end user is to understand the ramifications of the methodology difference. We will confine our remarks to what Dakai refers to as the “postal/building based” and “block centroid” methods, as the rest are either no longer used or were never particularly popular choices to begin with.

A Side Note on Accuracy and Precision

Commonly used as if they were synonymous, these two terms actually mean something quite different. Accuracy refers to the closeness of a measured value to a standard, known, and true value. Precision, on the other hand, refers to the closeness of two measurements to each other. You can be very precise but inaccurate – imagine that you have a scale which reports your weight faithfully and consistently at one hundred and seventy-five pounds, but sadly your actual weight is two hundred pounds. Precise, but precisely wrong.

The best way to understand this is to imagine that you are to take one hundred shots at a target at the rifle range.  Your precision will be measured by the scatter of your shots – the less scatter, the more precise your shooting.   Taken together as a group, the accuracy of your shots is the extent to which the center of your scatter is at the center of the target.   You can be extremely precise (consistent) and always be five inches below the target.   Precise yes, but accurate no.
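The distinction is easy to put into numbers. The sketch below treats accuracy as bias (the mean of the measurements minus the true value) and precision as scatter (the standard deviation of the measurements). The function name and the readings are invented for illustration, and the scale example is used because the true value is assumed known there.

```python
import statistics

def bias_and_scatter(measurements, true_value):
    """Return (bias, scatter): the accuracy and precision of a set of measurements."""
    mean = statistics.fmean(measurements)
    bias = mean - true_value                  # accuracy: how far off-center we are
    scatter = statistics.pstdev(measurements)  # precision: how tightly grouped we are
    return bias, scatter

# Hypothetical readings from the scale that always says about 175 pounds.
scale_readings = [175.0, 175.1, 174.9, 175.0]
bias, scatter = bias_and_scatter(scale_readings, true_value=200.0)

print(f"bias = {bias:.1f} lb, scatter = {scatter:.2f} lb")
```

The scatter is tiny and the bias is large: precise, but precisely wrong. Note that the bias can only be computed because a true value was supplied.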

When we speak of accuracy of a site report, we are doing the reader a great disservice, as the assessment of accuracy requires that we are comparing to a known value. If there is any truth in geodemographics, it is that absolutely nothing is a known value. Everything is an estimate, including data produced by the census. What we should instead talk about in this case is the precision of the estimates. We cannot actually speak of accuracy, because the true value is unknown, and unknowable.

Unsubstantiated Claims

One of the lead statements of the article is that “the choice of retrieval scheme can often lead to larger differences in the demographic profile than if you switched demographic providers altogether!” This assertion is not actually empirically tested, as the examples are theoretical only and make assumptions which are not realistic. Almost a year after the post, the author’s promise of some real world examples has not yet materialized, although examples of problematic locations can be readily found.

Boundary Effects

Apart from the example for using block group centroids, a practice long abandoned, the diagrams all utilize a radius so small as to include only four block groups, none of them completely. This maximizes the boundary effect on the total trade area computation: in a typical urban area, the radius shown could easily be as small as a few hundred feet.

The boundary effects occur at the edge of a radius, where block groups are split. They are largest for small radii and diminish with increasing size. This is perfectly sensible, as the ratio between the area of a circle and its perimeter length follows a rather predictable formula (πr² / 2πr = r/2), so the interior grows faster than the edge as the radius increases.
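For reference, the ratio can be written out as a short derivation, with A the area of the ring, C its circumference, and r its radius:

```latex
\[
\frac{A}{C} \;=\; \frac{\pi r^{2}}{2\pi r} \;=\; \frac{r}{2}
\]
```

Doubling the radius therefore doubles the amount of interior area behind each unit of boundary length.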

The example given, shown below, maximizes the boundary effect and most certainly does not match real-world applications.  The diagram on the left shows a theoretical distribution of block centroids within the four block study area.  At right is the same base area showing the same area with theoretical ZIP+4 centroids.

The block centroid retrieval example shows block groups with only 4-6 block centroids, when the average block group contains around 28 populated blocks. This is deliberately misleading, as the figure on the right includes an appropriate number of ZIP+4 points for this area. The two diagrams clearly show that the ZIP+4 method would be better, if the comparison were true. The result is that the reader who is not fully aware of the nature of block centroids incorrectly concludes that the block centroid method is far coarser than it really is.

Nationwide, there are approximately 6 ¼ million populated blocks, or roughly fifty persons per block.   On average, a ZIP+4 contains about ten persons.   Theoretically, the ZIP+4 should provide more precision in our estimates, but this is true only if there is a net increase in information at this level.   We will come back to this point later, as it is critical and can only be appropriately determined analytically.

Simplified for Ease of Computation

The example which shows the rough method for accumulating portions of block groups is correct, but overly simplified.   The mathematics – resulting in a very small population total in relation to the ZIP+4 result – is nonsense, since it assumes that all blocks within the same block group have the same population.   This alone ensures that the results shown for block centroids are radically different than the postal results, as in reality, block sizes vary widely within most block groups.   By assuming that all blocks within a block group have equal size, and all postal points within a block group have equal size, the conclusion would clearly be that the postal system is more precise in an amount equal to the difference in the number of points.   Again, this must be determined numerically as the error margins at each level are fundamentally different.
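To see how much the equal-size assumption matters, consider a toy block group in which three small blocks fall inside the ring and one large block falls outside. The populations below are invented purely for illustration.

```python
def share_inside(weights, inside_flags):
    """Fraction of the total weight carried by the points that fall inside the ring."""
    total = sum(weights)
    return sum(w for w, inside in zip(weights, inside_flags) if inside) / total

# Hypothetical block group: four blocks, three of them inside the ring.
block_pops = [10, 15, 25, 150]
inside = [True, True, True, False]

equal_share = share_inside([1, 1, 1, 1], inside)  # every block assumed the same size
actual_share = share_inside(block_pops, inside)   # blocks weighted by population

print(equal_share, actual_share)
```

With equal weights the ring receives 75% of the block group; with population weights it receives 25%. The assumption, not the retrieval method, drives the result.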

The Illusion of Precision

In my view, this article presents the various methods, especially the block centroid method, in a manner which is sure to result in them being viewed as inferior to ZIP+4 centroid methods.   This, however, is simply not the case as can be readily demonstrated by using real world examples for three reasons:

  • boundary effects for any reasonable minimum radius are much lower than this example
  • we do not distort the number of block centroids per block group, and
  • we do not make simplifications.

We instead use real data, with real radius levels, and without bias.

An Analytical Approach

As useful as the diagrams are in simply explaining the types of retrieval, they are flawed and misleading. In our view this is deliberate, in order to suggest that the ZIP+4 centroid method used by the author’s company is somehow superior to the block centroid methods used by most software and/or data vendors.

Examples which show failures of either method can be readily found, and as such, one could be easily accused of cherry picking the examples to prove a point.   Instead, we generated 1, 3, and 5-mile radius reports showing dwellings, population, and households for each and every block group centroid nationwide using both the block centroid and ZIP+4 centroid retrieval methods.

The base demographics and block centroid proportions are from the AGS 2015B release and the ZIP+4 proportions were created and normalized within block boundaries in order to properly allocate 100% of the block group.   Most vendors using ZIP+4 methods normalize the allocation to the block group only, and our method ensures that large delivery ZIP+4 locations are as precisely located as possible.   The ZIP+4 centroid weights are then aggregated for ZIP+4’s with the same geographic coordinates in order to minimize processing.

The Results

For each of 1, 3, and 5-mile radii for each method, we extracted the base counts of dwellings, population, and households. The first table below provides summary statistics of each group.

Table 1

When reviewing this table, a couple of items immediately emerge. First, the differences between the two methods are truly significant only for the one-mile radius, primarily from a percentage viewpoint. An average absolute deviation of 10% at the one-mile radius level is indeed significant, although these results are highly skewed by a small number of sites where the methods produce substantially different results. By the five-mile radius, interestingly enough, the average actual differences in terms of raw counts are not substantially reduced, but the percentage differences are reduced to well under one percent.
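The kind of summary described here can be sketched as follows. The paired counts are invented, not AGS results; they are chosen only to show how a single outlier site can drag the average absolute percentage difference far above the typical value.

```python
import statistics

def compare_methods(block_counts, zip4_counts):
    """Summarize paired ring totals produced by two retrieval methods."""
    pairs = [(b, z) for b, z in zip(block_counts, zip4_counts) if b > 0]
    signed = [z - b for b, z in pairs]
    abs_pct = [abs(z - b) / b * 100 for b, z in pairs]
    return {
        "mean_abs_pct_diff": statistics.fmean(abs_pct),
        "median_abs_pct_diff": statistics.median(abs_pct),
        "median_signed_diff": statistics.median(signed),
    }

# Hypothetical one-mile ring populations for five sites under each method.
block = [1000, 2000, 500, 4000, 100]
zip4 = [1010, 1990, 503, 4004, 180]

summary = compare_methods(block, zip4)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this toy data the mean absolute difference is above 16% while the median is under 1%, because one sparsely populated site differs by 80%.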

This is not surprising, given that with the larger radius, the ratio of partial to complete block groups is greatly diminished. The boundary effects diminish steadily as the radius grows, as the table below shows.

Table 2

The ratio between area and circumference is probably the best overall measure of our boundary effect problem.   Assuming that block groups are fairly regularly shaped (in urban areas, this is generally true), then the larger the A/C ratio, the larger the ratio between complete and partial block groups.
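A rough way to picture this is to compute, for each radius, the A/C ratio and the share of the ring's area that lies within a fixed distance of its edge. The quarter-mile band used below is an assumed stand-in for the typical width of an urban block group, not a figure from our analysis, and the function names are illustrative.

```python
import math

def area_to_circumference(radius):
    """Ratio of a circle's area to its circumference (simplifies to radius / 2)."""
    return (math.pi * radius ** 2) / (2 * math.pi * radius)

def edge_band_share(radius, band_width):
    """Fraction of the circle's area lying within band_width of its edge."""
    if band_width >= radius:
        return 1.0
    return 1.0 - ((radius - band_width) / radius) ** 2

# Assumed band width of 0.25 miles, for illustration only.
for r in (1, 3, 5):
    ratio = area_to_circumference(r)
    share = edge_band_share(r, 0.25)
    print(f"{r}-mile ring: A/C = {ratio:.1f}, edge band share = {share:.1%}")
```

Under this assumption, a bit under half of a one-mile ring lies in the edge band, against roughly a tenth of a five-mile ring.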

There is a slight bias towards higher counts in the 1-mile ring using the ZIP+4 method, as the 50th percentile point occurs at plus 1 dwelling and household, and plus 3 persons. In other words, we cannot predict whether we will have a greater or lesser number based on the method employed. In very close to 50% of the cases, the number will be lower using block centroids, and in 50% of the cases, higher.

Table 3

The third table shows the absolute percentage difference for each of the three measures at each decile point. The table reads as follows. If we have, for example, a five-mile ring, we can expect that 80% of the time, our results will not differ by more than 0.98%. Given the average five-mile ring of approximately 232,000 people, 80% of the time the results will not differ by more than 2,274 people. In other words, the estimated population is somewhere between roughly 230,000 and 234,000 people.
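The arithmetic behind that reading is simple enough to check directly. The helper below is illustrative; it takes the average ring population and the percentage difference quoted above and returns the implied band.

```python
def difference_band(estimate, pct_diff):
    """Return (low, high, delta) for an estimate and a percentage difference."""
    delta = estimate * pct_diff / 100
    return estimate - delta, estimate + delta, delta

low, high, delta = difference_band(232_000, 0.98)
print(f"difference: {delta:,.0f} people, band: {low:,.0f} to {high:,.0f}")
```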

Discussion

At first glance, one would assume that since the ZIP+4 method uses roughly five times the number of centroids, it is to be preferred, even though the differences are not directionally biased. This would be true, all other things being equal. As usual, they aren’t. There are a number of reasons why the ZIP+4 method is, in our view, no better than the block centroid method.

Granularity and Positional Accuracy

There are approximately five postal centroids for every block centroid nationwide.   For small rings especially, it would initially seem that this would be the preferred method.   This might be the case if these points were randomly distributed within the block areas, but they are not.   In most urban areas, the block unit is simply that, a single city block.  Its centroid is truly central.   The postal geography will in most cases have at a minimum four points, one on each block face, set back (usually 25’) from the street centerline.

It is easy to see how a circle bisecting this block could get a range of values for the block simply by moving the center of the circle a few feet in either direction.   With college campuses especially, the particular location of the centroids (of any type) becomes arbitrary and subject to major error.

For practical purposes, most study areas have a radius of at least three miles, and the proportion of block groups which are split is relatively small. For larger trade areas, the arbitrary locations tend to balance out.

Apportionment Issues

With the census block method, the location of the block centroid and its proportion relative to its parent block group are set only once per decade.  These are, however, about as accurate as can be achieved and serve as an excellent starting point.   Over time, however, the block proportions will tend to become inaccurate if the current year estimates are based solely at the block group level.   Thus most, if not all, demographics suppliers undertake their main estimates at a sub block group level by obtaining and tracking postal delivery statistics at the block level.   Some, including AGS, will split blocks during the decade should the circumstances warrant it.  Furthermore, AGS annually adjusts block centroids if either street segment changes (total distance of streets within the block) or ZIP+4 delivery changes are significant, except in small square blocks where centroids are always assigned to the geographic center of the block.  This maintains the general integrity of the system over the decade.

It should be noted that blocks are defined for the specific purpose of enumerating the population.   Postal centroids are defined for the purposes of making sure Ralph gets his mail.   As such, they may lack spatial consistency, and in the case of post boxes, may have nothing to do with where Ralph actually lives.

A review of the USPS delivery counts for any particular area reveals several weaknesses which limit their utility:

  • They rarely sum to a plausible value for the census area, even if one uses the postal file closest to the census date as a base. There are many reasons for this, including that they count mailboxes, not actual households; that they are sticky downward (e.g. if we demolish houses on a block, the delivery count is slow to diminish, if it ever does); and that often a single delivery unit actually represents multiple households
  • For many areas, the location of the delivery centroid may not even be in the block where the population resides. This is true of many mobile home parks and condominium complexes, where the address of any particular unit is something like “88 Main St, Unit 18”.   All of the mailboxes are coded to the location on Main St., even though they may be several hundred feet distant.   One of the examples below demonstrates this problem.

What must be ensured, if postal centroids are to be used, is that they are weighted according to a population value and not just afforded equal weights.   This could be a modified delivery count which sums to the block group total, or better yet, a block total.   Equal weighting is sure to produce poor results, again, for small trade areas.
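One way to do this, sketched under simple assumptions, is to scale the delivery counts of the postal centroids in a block so that they sum to the block's population. The ZIP+4 codes and counts below are made up, and the equal-split fallback for blocks with no recorded deliveries is a choice made for this example rather than a recommendation.

```python
def postal_weights(delivery_counts, block_population):
    """Scale delivery counts so the postal centroids sum to the block population.

    delivery_counts maps a ZIP+4 code to its delivery count. If no deliveries
    are recorded, the population is split equally (an assumption of this sketch).
    """
    if not delivery_counts:
        return {}
    total = sum(delivery_counts.values())
    if total == 0:
        equal = block_population / len(delivery_counts)
        return {code: equal for code in delivery_counts}
    return {
        code: block_population * count / total
        for code, count in delivery_counts.items()
    }

# Hypothetical block of 40 people served by three ZIP+4 centroids.
weights = postal_weights(
    {"73102-1001": 12, "73102-1002": 4, "73102-1003": 0}, block_population=40
)
print(weights)
```

Normalizing to the block rather than the block group keeps a large delivery point from absorbing population that the census placed in a neighboring block.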

Problem Cases

Some relatively easy to spot cases emerge pretty quickly by panning around a map.  I chose Oklahoma City for some real world examples because it is fairly regular and does not have terrain which results in strangely shaped blocks.

In the first example, we have two blocks, one of which has 354 people (2010 census) and the other 340.  This is a university campus, which means that the residents live in dormitories which may be spread across campus.   The block centroids are not shown, but as can be seen, there is one postal address for the campus which is nowhere near the residences (which follow the street pattern).  A range of circles could be drawn that intersect the campus, and in all cases, the entire campus population would be either in or out.   This can lead to substantial percentage error on a small radius report.   Note that the block centroid approach does not eliminate the problem, as the centroid for the physically larger block is to the northeast of the residences.

Problem 1

A second example, actually a few hundred yards to the west of the first, shows a strange block which would normally be split into two. Here, the postal geography will likely yield better results if the block is bisected in almost any pattern. The strip mall (in purple) on the corner is lumped together with the adjacent residential properties.

Problem 2

Third, we have a frequently occurring problem with the postal geography, that of a residential complex which has a single street address. In this case, it is a mobile home park that covers a significant chunk of land, but the only postal point is at the entrance. The census block actually carves out this residential area separate from the one which encloses it. Towards the southwest, it is noted that there are no ZIP+4 centroids within several of the blocks, some of which have a significant population count. This population is unreachable in any partial apportionment of the block group.

Problem 3

The conclusion is that there are problems with both approaches.   In particular, almost all of the worst discrepancies between the methods occurred in one of the following cases:

  • A college campus with a single postal address, where the postal coordinate was nowhere near the actual dormitories.
  • Military bases where troops are barracked.
  • Mobile home and condominium complexes where all residential deliveries occur at a single street address, usually using terminology such as “Unit 29”
  • Mixed use blocks, where part of the block is used as residential and part is non-residential.
  • Blocks with particularly odd shapes, where the distribution of housing units is not uniform
  • Rural areas, where the density of centroids, postal or census, is sparse. Especially at a one-mile ring, the settlement pattern itself is too granular to work well with any method

Are We Just Splitting Hairs?

Ultimately, the entire issue is moot if the underlying data is not accurate enough to support the precision that we are claiming. Prior to modern mathematics teaching, the concept of significant digits was drilled into us until we understood that 12.13 x 5.3 does not equal 64.289 no matter what the calculator says. It equals 64.3, since the original measurement of the least precise data element is one significant digit after the decimal place. And this presumes that we have measured each element precisely. If, however, our instruments can only measure to two digits, our result should be 64. With demographic data for small areas, we tend to assume that because computers will happily spit out incomes to the nearest penny, the figures must be correct. They aren’t. There is a substantial amount of measurement error on any demographic variable, and this should be accounted for, at least in the mind of the analyst reviewing the numbers.
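For anyone who wants to apply the same discipline in code, here is a small helper that rounds a value to a given number of significant digits. The function name is illustrative, and the number of digits to keep remains a judgment about the inputs, not something the computer can decide.

```python
import math

def round_sig(value, digits):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)

product = 12.13 * 5.3

print(round_sig(product, 3))  # three significant digits
print(round_sig(product, 2))  # two significant digits
```

The calculator's 64.289 becomes 64.3 at three significant digits and 64.0 at two.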

Not to be the one to throw a spanner into the machinery, but the reality is that our source data is by no means so accurate that we can pretend that a difference of 1% on two population reports is actually real and that we can tell that one method or demographic source is “better” than another one.

If the reality is that the actual estimation error exceeds our 1% figure, we can use as much computing power as we like, utilize the best statistical algorithms available, and draw the sharpest high resolution graphics and maps to prove how accurate we are, and it is all for naught. Because 12.13 times 5.3 equals 64.3, not 64.289. We actually do a great disservice to the consumers of demographics by claiming, even implicitly, greater precision and accuracy than we know exists.

There are a whole host of errors which contribute to the degree of precision of the report, only one of which is error introduced via the specific means of data apportionment.   Indeed, this is probably the least of our worries.

Just how accurate is the actual data we are trying to report?   The census bureau recently published, quietly, a report on the accuracy of the 2010 census.   Overall, the results were quite good, in that nationwide they believe that the census overcounted the population by 0.51% (see https://www.census.gov/newsroom/releases/archives/2010_census/cb12-95.html).   This is a net figure, of course, so local results will be much more variable.   They give the following statistics which are helpful in putting the value into context:

  • The estimate is that 3.3 percent of the population were counted “erroneously”, of which 85% were duplicates; the rest include timing issues (birth or death near the census date).
  • There are an estimated 16 million “omissions”, which include people missed in the census whose records could not be verified in the test sample, and of these, 10 million people were simply not enumerated.
  • Enumeration rates vary by race, age, and tenure, with renters and minority ethnic groups tending to be undercounted. For example, renters are “undercounted” by an estimated 1.6%, and yet renter records were much more likely to be duplicates than those of owners. Both the black and Hispanic populations were undercounted, while the white non-Hispanic population was overcounted.
  • Men aged 18-49 were undercounted, and women aged 30-49 were overcounted by an unspecified amount.

In other words, many of the errors (undercounting and duplicate counting) tend to even out overall, but the fact that there are clear patterns in the detailed results indicates that we should expect the estimates for small areas to vary by much more than 0.51% from the actual population. Given the numbers presented in the press release, we would not be surprised to find the average deviation from the true population to be in the range of at least 1.5%.

Indeed, comparisons of the basic national numbers coming out of the Census, American Community Survey (ACS), and Current Population Survey (CPS) show significant, and perhaps, surprising differences.   The 2010 estimate of occupied housing units reported by the Census was 116,716,292, but for the CPS was 117,538,000, a difference of nearly one percent.   The ACS figure for vacant housing units for 2010 was 17,223,646 whereas the Census reported it as 14,988,438, a difference of nearly 15 percent!
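The two percentage differences can be reproduced from the figures quoted above, taking the Census count as the base in each case (the choice of base is an assumption of this sketch).

```python
def pct_difference(value, base):
    """Percentage difference of value relative to base."""
    return (value - base) / base * 100

# Occupied housing units, 2010: CPS versus Census.
occupied = pct_difference(117_538_000, 116_716_292)

# Vacant housing units, 2010: ACS versus Census.
vacant = pct_difference(17_223_646, 14_988_438)

print(f"occupied: {occupied:.2f}%  vacant: {vacant:.2f}%")
```

The occupied figure comes out at about 0.7% and the vacant figure at just under 15%.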

The American Community Survey (ACS) has replaced the sample long form from previous censuses, and it is based on a small sample of households (about 2.5%) annually.   For small areas of geography, the data are actually not for a single year, but rather for a five-year period.   Because these are sampled, estimates of the sampling error can be computed and are presented below.   The standard error is generally presented as a plus/minus, such as population = 1000 +/- 128, which effectively means that we are 68% confident that the actual value falls between 872 and 1128.
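As a quick illustration, the interval implied by a standard error can be computed as below. The helper is hypothetical; the multiplier z = 1.0 gives the roughly 68% band described above, and z = 1.645 is the usual multiplier for a 90% band under a normal approximation.

```python
def interval(estimate, standard_error, z=1.0):
    """Return (low, high) for estimate +/- z standard errors."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width

print(interval(1000, 128))           # about 68% confidence
print(interval(1000, 128, z=1.645))  # about 90% confidence
```

Asking for more confidence widens the band: at 90%, the same estimate of 1,000 spans roughly 789 to 1,211.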

Bear in mind that these are five-year estimates and that over that period, only 12.5% of the population is expected to actually be surveyed.   The standard errors are therefore, in my view, overly pessimistic.   The reason is that the block group level numbers can be readily normalized to higher order geographies where the error is lower, and this has the effect of limiting the actual error profile more than the statistical tests would suggest.

That said, the error estimates at even the county level are quite significant. For example, for households, the 2014 5-Year ACS estimates have an average standard error of 3.435%, and for counties with under 5000 households, the rate was over 6%.

Bear in mind that these are based only on the raw counts of dwellings, households, and population. Such potentially useful items as age, race, and income have significantly wider error bands at the block group level.

The bottom line is that any error introduced via the method of geographic selection is but one of many sources of error, and most probably not the largest one.   The fact that we found no particular directional bias (e.g. postal is always higher) and that the differences are effectively normally distributed suggests that we cannot favor either method.   The claims that one is better than another are unsupportable and little more than marketing over-sell.

Concluding Remarks

The real issue is that over time, with faster computing, detailed maps, and shameless marketing hype, users have come to believe that the data are extremely accurate when in fact they are not.   A report for a five-mile ring is a statistical approximation only.   I would prefer if we could indicate an error band on at least the main elements of reports, but this is not simple because we are not dealing with sample statistics in the traditional sense.   Perhaps we could at least be more honest and, rather than report the population of a five-mile radius as 228,785, we should at least round it to 230,000 and the median income we should report as $55,000 not $56,498.   This would surely alert users to the true nature of the data.
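Rounding of this kind is trivial to build into a report. The sketch below rounds to the nearest multiple of a chosen step; the steps of 10,000 for population and 5,000 for income are assumptions picked because they reproduce the two examples above, not a proposed standard.

```python
def round_to_step(value, step):
    """Round a non-negative integer to the nearest multiple of step (halves round up)."""
    return (value + step // 2) // step * step

population = round_to_step(228_785, 10_000)
median_income = round_to_step(56_498, 5_000)

print(f"population: {population:,}  median income: ${median_income:,}")
```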

We do not believe that postal centroids are sufficiently more precise to warrant the additional computational resources and apportionment estimation, as differences in the results are unbiased and major errors occur in both environments.

What we instead would recommend is the following:

  • Do A Flyover: For small radii especially, it is imperative that the user be aware of the presence of certain types of environments which cause major issues – group quarters populations and mobile home/condominium complexes.   With today’s mapping technologies, there is no excuse for not pulling up a Google hybrid street and satellite map and spotting potential problems such as colleges and large condo type developments.
  • Practice Sensitivity: For small radii values (say < 3 miles), we recommend a simple sensitivity exercise. If a two-mile radius is used, we suggest also looking at 1.8 and 2.2 mile radii, primarily to see if there are major boundary effects.
  • Know Your Boundaries: Remember that the results are approximate, and that no matter how accurate they may seem, the cumulative error from all sources can be significant. Regardless of the method of allocation, the smaller the radius the less precise the result.   The boundary effect on small radii is substantial regardless of methodology and cannot be ignored by the analyst.
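The sensitivity exercise suggested above can be automated in a few lines. This is a sketch under simple assumptions: centroids are given as planar coordinates in miles with a population weight, the data are invented, and the ten percent tolerance mirrors the 1.8 / 2.0 / 2.2 mile example.

```python
import math

def ring_population(site_xy, radius, centroids):
    """Sum the population of weighted centroids that fall inside the ring."""
    sx, sy = site_xy
    return sum(p for x, y, p in centroids if math.hypot(x - sx, y - sy) <= radius)

def sensitivity(site_xy, radius, centroids, tolerance=0.10):
    """Compare ring totals at the chosen radius and at +/- tolerance around it."""
    base = ring_population(site_xy, radius, centroids)
    results = {}
    for factor in (1 - tolerance, 1.0, 1 + tolerance):
        r = round(radius * factor, 3)
        pop = ring_population(site_xy, r, centroids)
        change = (pop - base) / base * 100 if base else float("nan")
        results[r] = (pop, change)
    return results

# Hypothetical weighted centroids: (x miles, y miles, population).
centroids = [(0.5, 0.0, 400), (1.0, 1.0, 300), (1.9, 0.0, 250),
             (0.0, 2.1, 600), (3.0, 3.0, 900)]

results = sensitivity((0.0, 0.0), 2.0, centroids)
for r, (pop, change) in sorted(results.items()):
    print(f"{r} miles: {pop} people ({change:+.1f}% vs. base)")
```

A large swing between neighboring radii, as in this toy data, is the signal to pull up the map and look for whatever is sitting on the boundary.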

I will leave you with a story – a true story – which occurred in the early 1990s while I was with the long departed Urban Decision Systems. Our business estimates were based on the County Business Patterns program of the Census Bureau. The detailed data were available to the county level only, and we had counts of establishments by SIC at the ZIP code level. From that, we had the audacity to create detailed business data at the block group level, and apportion that data within radius reports. We had a client who received a report on gasoline stations and had a particular interest in how many competitors were in the trade area. He was unduly upset that we had said that there were ten such competitors, and yet when he personally enumerated the area he found eleven. The sales staff had no reasonable answer, nor did my production staff. He finally landed on my phone line, and when he related the details to me, I said something similar to, but using stronger language than, “Darn, that’s pretty good”. This did nothing to assuage his growing anger. Even though I related the methodology to him, and repeatedly said that they were estimates only, he stated that we were wrong.

My answer is one which I have since used often. I asked him how much time it had taken to enumerate the trade area, and to compute, but not tell me, how much it had cost his employer to do so. I asked him how much we had charged him for the report. Finally, I asked him if the difference between 10 and 11 in the trade area would change his business decision. He said no, and I told him that we had therefore given him great value. He became one of our best regular customers thereafter.

The moral of the story is this:  If the amount of error in the report would not cause you to change your business decision, then the report served its purpose and you got good value for your money.   On the flip side, if your business rules for go/no-go require perfect information, you will probably not be in business for long in any event.

So, don’t get too hung up over the hype associated with methods which might sound “better”.   Error, like death and taxes, is always with us, and unless there is proof of reduced error, it is mere hype.  Really folks, as an industry, we are better than that.

Gary Menger
President
Applied Geographic Solutions, Inc.
