Identifying Bias When Sensitive Attribute Data Is Unavailable: Geolocation in Mortgage Data

In our last post, we explored data on mortgage applicants from 2017 released in accordance with the Home Mortgage Disclosure Act (HMDA). Because that data includes applicants’ self-reported race, we can use it to test how well race can be inferred from applicants’ geolocations, as part of our effort to better understand methods for inferring missing sensitive attribute data in bias and fairness analyses. This approach follows the procedures outlined by Chen et al. (2019) and the CFPB (2014) [1],[2].

Geolocating using HMDA data

For each applicant, the HMDA data lists the state, county, and census tract of residence. We can match this combination of geographic information to the U.S. Census Bureau’s 2010 estimates of the U.S. population to get the approximate percentage of people in each location who belong to each race/ethnicity group (Non-Hispanic White, Asian or Pacific Islander, Non-Hispanic Black or African American, and Hispanic/Latino). Following Chen et al., we treat the percentages of each race group in a particular location as the probabilistic inferred race vector for every individual from that location [1]. Figure 1 illustrates this process with a simple example for a hypothetical mortgage applicant.

Figure 1: Example of generating race probabilities from HMDA location data by matching with Census Bureau data. The hypothetical HMDA datapoint lists an applicant from Census Tract 210924 in St. Louis County, Missouri who is Black or African American. Matching this location with Census data gives the share of the population that each race group makes up. Here, we see the largest share of the population is Black or African American.
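The matching step can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the figures: the lookup table `CENSUS_COUNTS`, the function `race_vector`, the group labels, and every population count below are hypothetical stand-ins for the real Census tables.

```python
# Hypothetical census lookup: (state, county, tract) -> population counts by group.
# Every number here is invented for illustration; real counts come from Census data.
CENSUS_COUNTS = {
    ("MO", "St. Louis County", "210924"): {
        "white": 1400,
        "black": 2200,
        "hispanic": 200,
        "asian_pi": 200,
    },
}


def race_vector(state, county, tract, census=CENSUS_COUNTS):
    """Return each group's share of the location's population.

    Returns None when the location is not in the lookup table or has no
    recorded population, so the caller can decide how to handle unmatched
    applicants.
    """
    counts = census.get((state, county, tract))
    if counts is None:
        return None
    total = sum(counts.values())
    if total == 0:
        return None
    return {group: n / total for group, n in counts.items()}


# A hypothetical applicant record with the three HMDA location fields.
applicant = {"state": "MO", "county": "St. Louis County", "tract": "210924"}
vec = race_vector(applicant["state"], applicant["county"], applicant["tract"])
print(vec)
```

Every applicant from the same tract receives the same vector, which is why the method can only be as informative as the tract is homogeneous.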

After this matching process, we have the probabilistic race vector for each applicant. There are several ways we could assign each applicant an “inferred race” based on this vector, and the choice may affect our estimates of racial group disparities in application approval. For example, we can treat the group with the maximum entry in the vector as the applicant’s inferred race. In the example from Figure 1, this approach would mean that the inferred race for applicants from that location would be Black/African American.
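As a rough sketch, the maximum-entry rule amounts to picking the key with the largest value. The vector below is hypothetical (the shares are invented, not read off Figure 1), as is the function name `max_inferred_race`.

```python
def max_inferred_race(vec):
    """Return the group with the largest share in the probabilistic vector.

    If two groups tie, Python's max() returns whichever comes first in the
    dict, so a real analysis would need an explicit tie-breaking rule.
    """
    return max(vec, key=vec.get)


# Hypothetical race vector for one location (shares sum to 1).
vec = {"white": 0.35, "black": 0.55, "hispanic": 0.05, "asian_pi": 0.05}
inferred = max_inferred_race(vec)
print(inferred)
```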

Another approach is to set a threshold between 0.5 and 1, keep only applicants from locations where some race group’s share exceeds that threshold, and treat that group as the inferred race. Because the threshold is at least 0.5, at most one group can exceed it. With a threshold of 0.6, for instance, only applicants from locations whose race vector has an entry over 0.6 stay in the dataset. A threshold of 0.6 would exclude applicants from the location in Figure 1, while a threshold of 0.5 would include them and consider their “inferred race” to be Black/African American. In Figure 2, we compare the actual and inferred approval rates and disparities using a threshold of 0.6. The thresholded estimator tends to overestimate the racial disparities in approval rates, in line with what Chen et al. found in 2011-2012 data [1].

Figure 2: Comparison of inferred and actual mean group outcomes and demographic disparities with a threshold of 0.6. The thresholding technique overestimates the size of the disparity in approval outcomes for Black and Hispanic applicants relative to White and Asian/Pacific Islander applicants.
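A thresholded estimate of group approval rates might look like the sketch below. The function names, the record layout (`vec`, `approved`), and the five applicants are all assumptions made for illustration; they are not the HMDA sample or the code used to produce Figure 2.

```python
from collections import defaultdict


def threshold_inferred_race(vec, threshold=0.6):
    """Return the group whose share exceeds the threshold, else None."""
    if not 0.5 <= threshold <= 1:
        raise ValueError("threshold must be between 0.5 and 1")
    group = max(vec, key=vec.get)
    return group if vec[group] > threshold else None


def thresholded_approval_rates(applicants, threshold=0.6):
    """Approval rate per inferred group, dropping applicants below the threshold."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for a in applicants:
        group = threshold_inferred_race(a["vec"], threshold)
        if group is None:
            continue  # location too mixed: applicant is excluded
        total[group] += 1
        approved[group] += a["approved"]
    return {g: approved[g] / total[g] for g in total}


# Invented applicants: each has a race vector and a 0/1 approval outcome.
applicants = [
    {"vec": {"white": 0.80, "black": 0.10, "hispanic": 0.05, "asian_pi": 0.05}, "approved": 1},
    {"vec": {"white": 0.70, "black": 0.20, "hispanic": 0.05, "asian_pi": 0.05}, "approved": 1},
    {"vec": {"white": 0.25, "black": 0.65, "hispanic": 0.05, "asian_pi": 0.05}, "approved": 0},
    {"vec": {"white": 0.05, "black": 0.90, "hispanic": 0.03, "asian_pi": 0.02}, "approved": 1},
    {"vec": {"white": 0.35, "black": 0.55, "hispanic": 0.05, "asian_pi": 0.05}, "approved": 1},
]

rates = thresholded_approval_rates(applicants, threshold=0.6)
disparity = rates["white"] - rates["black"]
print(rates, disparity)
```

In this toy sample the last applicant is dropped at a threshold of 0.6, so the estimate rests only on applicants from the more homogeneous locations.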

One downside of the thresholding approach is that it throws away mortgage data: every applicant from a location where no group exceeds the threshold is dropped. Figure 3 below shows how the size of the sample remaining changes with the threshold.

Figure 3: Size of the sample remaining (in thousands) as a function of the threshold chosen
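The relationship in Figure 3 comes from counting how many applicants have a maximum share above each threshold. A minimal sketch, assuming a hypothetical list `max_shares` that holds the largest entry of each applicant's race vector (the six values are invented):

```python
def retained_count(max_shares, threshold):
    """Number of applicants whose largest group share exceeds the threshold."""
    return sum(1 for share in max_shares if share > threshold)


# Invented maximum shares, one per applicant.
max_shares = [0.55, 0.62, 0.71, 0.84, 0.93, 0.97]

sizes = [(t, retained_count(max_shares, t)) for t in (0.5, 0.6, 0.7, 0.8, 0.9)]
for threshold, n in sizes:
    print(threshold, n)
```

The count can only stay flat or fall as the threshold rises, so a stricter threshold always trades sample size for more confident labels.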

While this approach for inferring race using geolocation correctly identified the direction of the disparities in approval rates between Black/African American or Hispanic/Latino applicants and White or Asian/Pacific Islander applicants, it overestimated their size, so it was not perfect for this particular dataset and context. Besides thresholding, we could take other approaches to assign “inferred race” from the probabilistic vectors. One example would be to use a weighted estimator, as proposed by Chen et al. (2019), which doesn’t classify individuals as a particular race but instead uses the probabilistic vector to compute average outcomes for each group [1].
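One way to read the weighted estimator is as a probability-weighted average: each applicant contributes to every group's approval rate in proportion to that group's entry in their race vector, and nobody is dropped. The sketch below follows that reading; the function name, the record layout, and the two-group vectors are simplifying assumptions for illustration and should be checked against Chen et al. [1] before use.

```python
def weighted_approval_rates(applicants):
    """Probability-weighted approval rate for each group.

    For group g: sum(p_i[g] * y_i) / sum(p_i[g]), where p_i is applicant i's
    race vector and y_i is their 0/1 approval outcome.
    """
    groups = applicants[0]["vec"].keys()
    rates = {}
    for g in groups:
        weight = sum(a["vec"][g] for a in applicants)
        if weight > 0:  # skip groups with no probability mass in the sample
            rates[g] = sum(a["vec"][g] * a["approved"] for a in applicants) / weight
    return rates


# Invented applicants with two groups only, to keep the arithmetic visible.
applicants = [
    {"vec": {"white": 0.8, "black": 0.2}, "approved": 1},
    {"vec": {"white": 0.2, "black": 0.8}, "approved": 0},
]

rates = weighted_approval_rates(applicants)
print(rates)
```

Here the approved applicant counts mostly toward the White rate and partly toward the Black rate, giving estimates of roughly 0.8 and 0.2 rather than the 1.0 and 0.0 that hard classification would produce.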

Conclusion

In this series of posts, we explored how to understand bias along racial or gender lines in automated decision-making systems when the relevant protected attribute data is not available. We discussed why the data may be missing and why organizations may want to infer it, surveyed a prominent technique for inferring race using location and last name, and explored a surrogate method for inferring race using census location. We ultimately found that, on one sample of mortgage applicants, this method was imperfect at approximating racial disparities in loan approval decisions, which aligns with existing research on the shortcomings of some of the techniques used to infer protected characteristics for these kinds of analyses [1],[3]. Nonetheless, while methods to infer missing protected characteristics may be imperfect, it’s important to weigh their fidelity against the alternative, which may be not analyzing differences with regard to those protected characteristics at all.

——

References: 

[1]: Chen, J., Kallus, N., Mao, X., Svacha, G., & Udell, M. (2019, January). Fairness under unawareness: Assessing disparity when protected class is unobserved. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 339-348).

[2]: Consumer Financial Protection Bureau. (2014). Using publicly available information to proxy for unidentified race and ethnicity: A methodology and assessment. Washington, DC: CFPB, Summer.

[3]: Kallus, N., Mao, X., & Zhou, A. (2019). Assessing algorithmic fairness with unobserved protected class using data combination. arXiv preprint arXiv:1906.00285.