surgeo.models package

Submodules

surgeo.models.base_model module

Contains the base model for First Name, Surname, Geocode, BIFSG, and Surgeo models.

class surgeo.models.base_model.BaseModel

Bases: object

Base class for the first name, surname, geocode, bifsg, and surname-geocode models.

Class creation is greatly simplified by placing most of the functionality within a single base class and leaving only small areas of responsibility for the subclass. This base class performs the following operations:

  1. Creating functions to provide lookup dataframes; and,
  2. Housing normalization routines for dirty ZIP code and name data.

Note

Names are normalized in a manner consistent with Word et al. (2007) [1]. This includes removing all whitespace/punctuation/digits, making the strings upper case, and then removing elements such as “JR”, “SR”, “IV” from the tail of the string. An example would be “Dav 3idson” being translated to “DAVIDSON”.

ZCTAs, which serve as a proxy for ZIP codes, are normalized by converting them to strings and then left-padding them with zeros using .zfill(). An example would be “531” becoming “00531”.
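The normalization described above can be sketched roughly as follows. This is an illustration only, not the package's actual routine: the function names, the suffix list (limited to the three examples named above), and the restriction to the letters A-Z are assumptions.

```python
import re

# Hypothetical suffix list; the text above only names these three as examples.
SUFFIXES = ("JR", "SR", "IV")


def normalize_name(raw: str) -> str:
    """Upper-case a name, keep letters A-Z only, then strip one trailing suffix."""
    cleaned = re.sub(r"[^A-Z]", "", raw.upper())
    for suffix in SUFFIXES:
        # Strip only when something remains, so a bare "JR" is not emptied.
        if cleaned.endswith(suffix) and len(cleaned) > len(suffix):
            return cleaned[: -len(suffix)]
    return cleaned


def normalize_zcta(raw) -> str:
    """Convert a ZCTA to a string and left-pad it with zeros to five characters."""
    return str(raw).strip().zfill(5)


print(normalize_name("Dav 3idson"))
print(normalize_zcta(531))
```

Because whitespace is removed before the suffix check, a sketch this simple would also trim names that merely end in one of the suffix letters, so a real routine likely needs more care.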

References

[1] Word, David L., Charles D. Coleman, Robert Nunziata and Robert Kominski. 2007. Demographic Aspects of Surnames from Census 2000. http://www2.census.gov/topics/genealogy/2000surnames/surnames.pdf.

surgeo.models.bifsg_model module

Module containing Surgeo BIFSG class

class surgeo.models.bifsg_model.BIFSGModel

Bases: surgeo.models.base_model.BaseModel

Subclass for running a Bayesian Improved First Name Surname Geocode model.

This class:

  1. Loads the appropriate first name, surname, and geocode lookup dataframes upon instantiation;
  2. Exposes a public get_probabilities() function to compute race probabilities based on proxy data (namely first names, surnames and ZIP codes); and,
  3. Contains a number of helper functions for cleaning ZCTA/names, multiplying probabilities, checking input values, and obtaining ZCTA/name data components.

Notes

The surname probability dataframe for this model is identical to that used for the SurnameModel (prob_race_given_surname_2010.csv).

The first name probability dataframe for this model is not the same as that used for the FirstNameModel. This model uses the prob_first_name_given_race_harvard.csv file, which has the percentage of a particular race that uses that first name (e.g. 3% of all White US citizens have the first name AARON). The FirstNameModel uses the prob_race_given_first_name_harvard.csv file, which has the race percentages for a given first name (e.g. 92% of people with the first name AARON are White).

The geocode probability dataframe for this model is not the same as that used for the GeocodeModel. This model uses the prob_zcta_given_race_2010.csv file, which has the percentage of a particular race that falls within that ZCTA (e.g. .002% of all White US citizens live within this ZIP code). The GeocodeModel uses the prob_race_given_zcta_2010.csv file, which has the race percentages for a given ZCTA (e.g. 90% of ZCTA 63144 is White).

The manner in which the first name data file was created can be found in the “fetch_first_names” Jupyter notebook.

The manner in which the geography data file was created can be found in the “fetch_geography” Jupyter notebook.

This is based on the following general formula from Voicu [2].

\(q(r \mid s,f,g) = \Large \frac{u(r,s,f,g)}{u(1,s,f,g) \, + \, u(2,s,f,g) \, + \, u(3,s,f,g) \, + \, u(4,s,f,g) \, + \, u(5,s,f,g) \, + \, u(6,s,f,g)}\)

Where:
\(\hspace{25px} u(r,s,f,g) = p(r \mid s) \times p(g \mid r) \times p(f \mid r)\)

And where:
\(\hspace{25px} p(r \mid s)\) is the probability of a selected race given surname
\(\hspace{25px} p(g \mid r)\) is the probability of a selected census block of residence given race
\(\hspace{25px} p(f \mid r)\) is the probability of a selected first name given race
\(\hspace{25px} g\) is Census Block
\(\hspace{25px} f\) is First Name
\(\hspace{25px} s\) is Surname
\(\hspace{25px} r\) is Race

And where:
\(\hspace{25px} 1 \text{ is } r =\) Hispanic
\(\hspace{25px} 2 \text{ is } r =\) White
\(\hspace{25px} 3 \text{ is } r =\) Black
\(\hspace{25px} 4 \text{ is } r =\) Asian or Pacific Islander
\(\hspace{25px} 5 \text{ is } r =\) American Indian / Alaska Native
\(\hspace{25px} 6 \text{ is } r =\) Multi Racial
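As a rough illustration of the formula above, the following sketch computes q(r | s, f, g) for a single person from three lists of probabilities. The function name, the race labels, and all input numbers are made up for the example; it is not the package's own helper.

```python
# Hypothetical labels, in the same order as races 1-6 listed above.
RACES = ("hispanic", "white", "black", "api", "native", "multiple")


def bifsg_posterior(p_r_given_s, p_g_given_r, p_f_given_r):
    """Return q(r | s, f, g) for each race as a dict keyed by race label."""
    # u(r, s, f, g) = p(r | s) * p(g | r) * p(f | r)
    u = [ps * pg * pf for ps, pg, pf in zip(p_r_given_s, p_g_given_r, p_f_given_r)]
    total = sum(u)
    if total == 0:
        # No information for this combination; leave the result undefined.
        return {race: float("nan") for race in RACES}
    # Divide each u by the sum over all six races.
    return dict(zip(RACES, (value / total for value in u)))


# Made-up inputs, ordered as in RACES.
posterior = bifsg_posterior(
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],               # p(r | s)
    [0.002, 0.001, 0.001, 0.001, 0.001, 0.001],   # p(g | r)
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01],         # p(f | r)
)
print(posterior)
```

With these inputs the surname is equally likely to be Hispanic or White, the first name is uninformative, and the geography term is twice as large for Hispanic, so the posterior comes out at two thirds Hispanic and one third White.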

References

[2] Ioan Voicu “Using First Name Information to Improve Race and Ethnicity Classification”. Statistics and Public Policy (2018) 5:1, 1-13, https://www.tandfonline.com/doi/full/10.1080/2330443X.2018.1427012
get_probabilities(first_names, surnames, zctas)

Obtain a set of BIFSG probabilities for first_name/surname/ZCTA series

This method first takes the data and checks to see if the data is formatted appropriately. It triggers the _get_surname_probs(), _get_first_name_probs(), and _get_geocode_probs() helper functions to merge the probabilities for the inputs with their looked-up values. It then runs the _combined_probs() helper function to actually conduct the data calculation and obtain the BIFSG probabilities. It finally runs the _adjust_frame() method to concatenate the inputs and outputs in a single convenient frame.

Parameters:
  • first_names (pd.Series) – A series of first names to use for the BIFSG algorithm
  • surnames (pd.Series) – A series of surnames to use for the BIFSG algorithm
  • zctas (pd.Series) – A series of ZIP/ZCTA codes for the BIFSG algorithm
Returns:

Dataframe of BIFSG probability results

Return type:

pd.DataFrame

surgeo.models.first_name_model module

Module containing the FirstNameModel class.

class surgeo.models.first_name_model.FirstNameModel

Bases: surgeo.models.base_model.BaseModel

Provides a way to look up race percentages by first name.

This class uses a get_probabilities() method to provide a simple mechanism for obtaining race data. It is created using a simple join of a race data table and the first names that are input.

Notes

The manner in which the first name data file was created can be found in the “fetch_first_names” Jupyter notebook.

The first name probability dataframe for this model is generated from the prob_race_given_first_name_harvard.csv file.

get_probabilities(names)

Obtain race probabilities for a set of first names.

Parameters:names (pd.Series) – names to which to attach race probability data
Returns:Dataframe of race probability results
Return type:pd.DataFrame
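The "simple join" mentioned above can be pictured with plain pandas. This is a sketch under assumptions, not the model's code: the table below is a tiny stand-in for the real lookup file, the column names are hypothetical, and every value except the 92% figure for AARON quoted elsewhere in this documentation is invented.

```python
import pandas as pd

# Hypothetical stand-in for the lookup table; values are illustrative only.
lookup = pd.DataFrame(
    {
        "name": ["AARON", "JOSE"],
        "white": [0.92, 0.10],
        "black": [0.03, 0.05],
        "hispanic": [0.02, 0.80],
    }
)


def lookup_probabilities(names: pd.Series, table: pd.DataFrame) -> pd.DataFrame:
    """Left-join cleaned names against the table; unknown names yield NaN."""
    cleaned = names.str.strip().str.upper().rename("name")
    return cleaned.to_frame().merge(table, how="left", on="name")


result = lookup_probabilities(pd.Series(["aaron", "Zzyzx"]), lookup)
print(result)
```

A left join keeps one output row per input name, in input order, which is why names missing from the table come back as rows of NaN rather than being dropped.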

surgeo.models.geocode_model module

This module contains the GeocodeModel class

class surgeo.models.geocode_model.GeocodeModel(geo_level='ZCTA')

Bases: surgeo.models.base_model.BaseModel

Provides a way to look up race percentages by ZIP/ZCTA code

This class uses a get_probabilities() method to provide a simple mechanism for obtaining race data. It is created using a simple join of a race data table and the ZIPs/ZCTAs that are input.

Notes

ZIP Code Tabulation Areas (ZCTAs) are approximations for US Postal ZIP codes. While ZIP codes change, ZCTAs are static for a given census cycle. They are not identical.

The manner in which the geography data file was created can be found in the “fetch_geography” Jupyter notebook.

This does not use the same Geocode data as the Surgeo class. This model uses the prob_race_given_zcta_2010.csv file, which has the race percentages for a given ZCTA (e.g. 90% of ZCTA 63144 is White). The SurgeoModel uses the prob_zcta_given_race_2010.csv file, which has the percentage of a particular race that falls within that ZCTA (e.g. .002% of all White US citizens live within this ZIP code).
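The difference between the two files is the direction of conditioning. A small sketch with invented population counts shows how the same counts give both tables; the counts and labels are assumptions chosen so that one row matches the 90% example above, and the real files were built by the notebooks mentioned earlier.

```python
import pandas as pd

# Hypothetical population counts by ZCTA (rows) and race (columns).
counts = pd.DataFrame(
    {"white": [900, 2100], "black": [100, 900]},
    index=pd.Index(["63144", "00531"], name="zcta"),
)

# P(race | ZCTA): divide each row by its row total, so rows sum to 1.
prob_race_given_zcta = counts.div(counts.sum(axis=1), axis=0)

# P(ZCTA | race): divide each column by its column total, so columns sum to 1.
prob_zcta_given_race = counts.div(counts.sum(axis=0), axis=1)

print(prob_race_given_zcta)
print(prob_zcta_given_race)
```

In this toy data ZCTA 63144 is 90% White, yet it holds only 30% of the White population, which is why the two tables cannot be used interchangeably.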

get_probabilities(zctas)

Obtain race probabilities for a set of ZIP codes or ZCTAs.

Parameters:zctas (pd.Series) – ZIPs/ZCTAs to which to attach race probability data
Returns:Dataframe of race probability results
Return type:pd.DataFrame
get_probabilities_tract(geo_df)

Obtain race probabilities for a set of State, County, Tract.

Parameters:geo_df (pd.DataFrame) – DF of ['state', 'county', 'tract'] codes to return probabilities for
Returns:Dataframe of race probability results
Return type:pd.DataFrame

surgeo.models.surgeo_model module

Module containing Surgeo BISG class

class surgeo.models.surgeo_model.SurgeoModel(geo_level='ZCTA')

Bases: surgeo.models.base_model.BaseModel

Subclass for running a Bayesian Improved Surname Geocode model.

This class:

  1. Loads the appropriate surname and geocode lookup dataframes upon instantiation;
  2. Exposes a public get_probabilities() function to compute race probabilities based on proxy data (namely surnames and ZIP codes); and,
  3. Contains a number of helper functions for cleaning ZCTA/names, multiplying probabilities, checking input values, and obtaining ZCTA/name data components.

Notes

The surname probability dataframe for this model is identical to that used for the SurnameModel (prob_race_given_surname_2010.csv); the geocode probability dataframe for this model is not the same as that used for the GeocodeModel. This model uses the prob_zcta_given_race_2010.csv file, which has the percentage of a particular race that falls within that ZCTA (e.g. .002% of all White US citizens live within this ZIP code). The GeocodeModel uses the prob_race_given_zcta_2010.csv file, which has the race percentages for a given ZCTA (e.g. 90% of ZCTA 63144 is White).

The manner in which the geography data file was created can be found in the “fetch_geography” Jupyter notebook.

This is based on the following general formula from Elliott et al. [3].

\(q(i \mid j,k) = \Large \frac{u(i,j,k)}{u(1,j,k) \, + \, u(2,j,k) \, + \, u(3,j,k) \, + \, u(4,j,k) \, + \, u(5,j,k) \, + \, u(6,j,k)}\)

Where:
\(\hspace{25px} u(i,j,k) = P(i \mid j) \times r(k \mid i)\)

And where:
\(\hspace{25px} P(i \mid j)\) is the probability of a selected race given surname
\(\hspace{25px} r(k \mid i)\) is the probability of a selected census block of residence given race
\(\hspace{25px} k\) is Census Block
\(\hspace{25px} j\) is Surname
\(\hspace{25px} i\) is Race

And where:
\(\hspace{25px} 1 \text{ is } i =\) Hispanic
\(\hspace{25px} 2 \text{ is } i =\) White
\(\hspace{25px} 3 \text{ is } i =\) Black
\(\hspace{25px} 4 \text{ is } i =\) Asian or Pacific Islander
\(\hspace{25px} 5 \text{ is } i =\) American Indian / Alaska Native
\(\hspace{25px} 6 \text{ is } i =\) Multi Racial
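For a batch of records, the formula above amounts to an element-wise product followed by a row-wise normalization. The sketch below assumes two already-aligned dataframes, one row per person, with hypothetical column labels and made-up numbers; it is not the package's _combined_probs() helper.

```python
import pandas as pd

# Hypothetical column labels, in the same order as races 1-6 listed above.
RACES = ["hispanic", "white", "black", "api", "native", "multiple"]


def bisg(p_race_given_surname: pd.DataFrame, p_geo_given_race: pd.DataFrame) -> pd.DataFrame:
    """Compute q(i | j, k) row by row from two aligned probability frames."""
    # u(i, j, k) = P(i | j) * r(k | i), element-wise per person and race.
    u = p_race_given_surname[RACES] * p_geo_given_race[RACES]
    # Divide each row by its sum over the six races.
    return u.div(u.sum(axis=1), axis=0)


# Made-up inputs for a single person.
surname_probs = pd.DataFrame([[0.10, 0.60, 0.20, 0.05, 0.03, 0.02]], columns=RACES)
geo_probs = pd.DataFrame([[0.001, 0.002, 0.004, 0.001, 0.001, 0.001]], columns=RACES)

posterior = bisg(surname_probs, geo_probs)
print(posterior)
```

Rows whose products are all zero divide zero by zero and come out as NaN, which is one plausible way to signal that no estimate is available.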

References

[3] Elliott, M.N., Morrison, P.A., Fremont, A. et al. Using the Census Bureau’s surname list to improve estimates of race/ethnicity and associated disparities. Health Serv Outcomes Res Method (2009) 9: 69. https://link.springer.com/article/10.1007/s10742-009-0047-1
get_probabilities(names, geo_df)

Obtain a set of BISG probabilities for name/ZCTA series

This method first takes the data and checks to see if the data is formatted appropriately. It triggers the _get_surname_probs() and _get_geocode_probs() helper functions to merge the probabilities for the inputs with their looked-up values. It then runs the _combined_probs() helper function to actually conduct the data calculation and obtain the BISG probabilities. It finally runs the _adjust_frame() method to concatenate the inputs and outputs in a single convenient frame.

Parameters:
  • names (pd.Series) – A series of names to use for the BISG algorithm
  • geo_df (Union[pd.Series, pd.DataFrame]) – A series of target ZIP/ZCTA codes or State County Tract for the BISG algorithm
Returns:

Dataframe of BISG probability results

Return type:

pd.DataFrame

surgeo.models.surname_model module

Module containing the SurnameModel class.

class surgeo.models.surname_model.SurnameModel

Bases: surgeo.models.base_model.BaseModel

Provides a way to look up race percentages by surname.

This class uses a get_probabilities() method to provide a simple mechanism for obtaining race data. It is created using a simple join of a race data table and the surnames that are input.

Notes

The manner in which the surname data file was created can be found in the “fetch_surnames” Jupyter notebook.

The surname probability dataframe for this model is generated from the prob_race_given_surname_2010.csv file.

get_probabilities(names)

Obtain race probabilities for a set of surnames.

Parameters:names (pd.Series) – names to which to attach race probability data
Returns:Dataframe of race probability results
Return type:pd.DataFrame

Module contents

This contains the First Name, Geocode, Surname, Surname-Geocode, and First Name-Surname-Geocode models.