Why Does Similar Data Look So Different?

By Dr. Scott Hoke, Cedar Crest College

One of the difficult parts of evaluating any community-based problem or potential solution is making sense of the data that are collected for analysis. What makes this so challenging is that although a number of agencies might collect data on the same subject, the data might not be similar. Why is that? Why is there no universal definition of community-based problems, and why are the data different? It would be nice if we could expect uniformity, but that is simply not the reality we live in. We will try to make sense of this problem so that it is easier to understand why we often see different things even when we look through the same set of glasses.

 

The easiest way to explain this might be to use an example to guide the discussion. Figure 1 presents data from three different sources: the US Census Bureau, the Allentown Promise Neighborhood 2014 community survey, and the Allentown School District. The subject that the data are trying to evaluate is residential mobility. More specifically, we are trying to figure out how often people in a given community move in and out of that community. The point of this discussion is not to explain the theoretical reasons why mobility is an important issue, so, for the sake of this discussion, we will work from the understanding that too much mobility is not necessarily a good thing for the community.

As one can see from the table results, the estimates are certainly different from one another. The numbers represent estimates of the percentage of people who moved in or out of the community in a 12-month period. In some cases, the estimates are dramatically different. As an example, according to the Census Bureau, in the census block group containing the neighborhood identified as “North 1”, 5% of the heads of households moved in the past 12 months. Yet, according to elementary school estimates published by the Allentown School District, 59% of elementary school students in that same geography moved during the same 12-month period. The obvious question becomes, what causes the different results even though the sources are measuring the same concept? There are two important explanations for the discrepancy: different levels of geography and different definitions of the subject being studied.

 

Different Levels of Geography

 

The first issue that often causes discrepancies between data sets is the level of geography. All three of the sources identified in Figure 1 use different levels of geography to report information. The maps presented in this summary try to identify the different levels of geography in a way that allows you to see how different they actually are. It would be ideal to present the geographies in one map, but because the boundaries overlap one another, distinguishing one from another would be difficult. As a result, they are presented separately. In evaluating each map, keep in mind that each covers the same part of the city. The smallest level of geography presented in the maps is the one referred to in Figure 1 as the “Survey Estimate”. That level of geography refers to a 9-square-block neighborhood in center-city Allentown that has been further divided into six smaller units. Splitting the area into smaller segments is really a function of how neighborhoods are designed in any community. In this community, the housing is fairly dense, so neighborhood boundaries can be small. In other neighborhoods, housing can be less dense, causing the concept of a neighborhood to cover a much larger area.

Figure 2 – Neighborhood Geography (“Survey Estimate” Area)

The agency that collected the “survey estimate” data represented by Figure 2 was trying to identify areas of the city that had unique characteristics and unique community needs. Using a smaller level of geography benefited the organization because of its particular mission. Often, the level of geography one might want to use to collect data depends upon the mission of the organization. Agencies designed to serve neighborhoods want data collected at smaller levels of geography than agencies whose missions are to serve residents of the larger county. Some agencies that collect and publish data for the public to use are not concerned with “neighborhood” boundaries or measurements and use units of measure that do not easily fit that definition. The US Census Bureau is a perfect example of this. The boundaries that are created by the Census Bureau have more to do with population counts than neighborhood boundaries. Neighborhoods are not totally ignored, but it is possible that a number of different neighborhoods all fall within some larger geographic Census Bureau boundary. Inherent in the mixing of geographies is the possibility that the data do not reflect the geography to which they are being applied or compared. Figure 3 represents the Census Block Group geography for the same area in the city of Allentown that was presented in Figure 2.

Figure 3 – US Census Geography – Block Groups

As one can see, the geographies represented in Figures 2 and 3 are different, even though each covers the same basic geographic area. The lack of agreement between the geographic areas can cause obvious comparison problems. The areas identified as Central 1 and South 1 are separate entities according to the community survey results, but are part of the same census block group according to the Census Bureau. The same is true of the areas identified as Central 2 and South 2. It would be nice if we could break the data collected by the Census Bureau into smaller areas for a more direct comparison, but that is not possible.
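The one-way nature of this problem can be illustrated with a minimal Python sketch. The household and mover counts below are invented purely for illustration and are not taken from Figure 1, and the function names are hypothetical. The sketch assumes that two survey areas make up one block group exactly, which is a simplification of the real boundaries.

```python
# Hypothetical counts for illustration only; these are NOT the Figure 1 data.
survey_areas = {
    "Central 1": {"households": 200, "moved": 30},
    "South 1": {"households": 300, "moved": 120},
}


def mobility_rate(moved, households):
    """Percentage of households that moved in the past 12 months."""
    return 100.0 * moved / households


def combine(areas):
    """Roll several small areas up into one larger area by pooling counts."""
    total_households = sum(a["households"] for a in areas.values())
    total_moved = sum(a["moved"] for a in areas.values())
    return mobility_rate(total_moved, total_households)


# Rates at the small (survey) level of geography.
area_rates = {
    name: mobility_rate(a["moved"], a["households"])
    for name, a in survey_areas.items()
}

# Rate at the larger (block group) level of geography.
block_group_rate = combine(survey_areas)

print(area_rates)        # two quite different neighborhood rates
print(block_group_rate)  # one pooled rate that hides the difference
```

Going from small areas to a large one is simple arithmetic, but the reverse is not: a single pooled percentage is consistent with many different combinations of neighborhood rates, so it cannot be split back into its parts without the underlying counts.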

The final level of geography represented by the data in Figure 1 is a school district boundary. In this case, the boundaries that guide the data collection are elementary school boundaries. Although elementary school boundaries typically pay some attention to neighborhoods, they are also driven by population and can change from time to time. The school boundaries are quite different from both the survey geography and the Census geography.

Figure 4 – Elementary School Boundaries

The point of showing the levels of geography is to highlight the fact that they are all different. Because of that fact, it is not unusual for the data to be different. The lack of uniformity does not allow for an “apples-to-apples” comparison and, as a result, the differences one might see in the data are understandable. Often, organizations try to use data collected and published by others even though those data are not ideal for their purposes. That practice can mischaracterize the results or cause one to assume that the results are not accurate.

 

Lack of a Common Definition

 

The second basic problem with data comparisons comes from the lack of a clear, singular definition given to many social problems. Take the term poverty, as an example. There is no single, universal definition of poverty. When one compares data that evaluate those who live in poverty, it is important to understand how the term “poverty” is being defined. In some cases, poverty might be defined by income level, or by public assistance benefits, or by lack of education. None of these definitions of poverty is incorrect; each is simply different from the others. Think of it this way: if my income is $100 above what is designated by the federal government as the “poverty line”, does that mean I am not living in poverty?

The same definition problem exists with respect to the indicator being measured by the data presented in Figure 1. The term “residential mobility” can be defined differently by whoever collects the data. In this case, that is exactly what is happening.

The US Census Bureau and the organization that collected the community survey data defined mobility by asking the person who was considered the “head of household” whether or not he or she had moved in the past 12 months. The school district, on the other hand, defined mobility by whether or not the school-aged child had changed schools in the past 12 months. Those two data sources define mobility using different points of reference. Although one can assume that if a child changes schools then the “head of household” responsible for that child also changed addresses, that assumption may not be accurate. If, as an example, the child’s parent is not the defined “head of household”, the data collected from the school would not match those collected by the Census Bureau. If a school district is characterized by a large percentage of multi-generational households, the statistics representing residential mobility could look dramatically different from other standard measures of mobility. When the two factors (level of geography and definition) are both different, as they are in this case, there is likely to be some discrepancy in a straight comparison of the results.
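A small Python sketch can show how the two definitions diverge even when they are applied to exactly the same households. The five records below are invented for illustration and do not come from any of the three sources; the field names and the `percent` helper are hypothetical.

```python
# Hypothetical households for illustration only; not real survey or school data.
households = [
    # Child moved between relatives' homes; the head of household stayed put.
    {"head_moved": False, "child_changed_school": True},
    {"head_moved": False, "child_changed_school": True},
    # The whole household moved, so both definitions count it.
    {"head_moved": True, "child_changed_school": True},
    # No move under either definition.
    {"head_moved": False, "child_changed_school": False},
    {"head_moved": False, "child_changed_school": False},
]


def percent(records, key):
    """Percentage of records for which the given yes/no field is true."""
    return 100.0 * sum(1 for r in records if r[key]) / len(records)


# Definition used by the Census Bureau and the community survey.
head_rate = percent(households, "head_moved")

# Definition used by the school district.
school_rate = percent(households, "child_changed_school")

print(head_rate, school_rate)
```

Neither number is wrong. Each answers a different question about the same group of people, which is why the two should not be read as competing estimates of a single quantity.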

The moral of the story is to be aware of the level of geography and the definition of the concept being measured by the data. As nice as it would be to imagine, not all data can be directly compared. But the lack of a direct comparison does not mean that there is no value in understanding the story that different types of data are trying to tell. The broader our understanding of a subject, the more likely it is that we can make sense of the problem.