surgeo.models package¶
Submodules¶
surgeo.models.base_model module¶
Contains the base model for First Name, Surname, Geocode, BIFSG, and Surgeo models.
-
class
surgeo.models.base_model.
BaseModel
¶ Bases:
object
Base class for the first name, surname, geocode, bifsg, and surname-geocode models.
Class creation is greatly simplified by placing most of the functionality within a single base class and leaving only small areas of responsibility for the subclasses. This base class does the following:
- Creating functions to provide lookup dataframes; and,
- Housing normalization routines for dirty ZIP code and name data.
Note
Names are normalized in a manner consistent with Word et al. (2007) [1]. This includes removing all whitespace, punctuation, and digits, making the strings upper case, and then removing elements such as “JR”, “SR”, and “IV” from the tail of the string. An example would be “Dav 3idson” being translated to “DAVIDSON”.
ZCTAs, which serve as a proxy for ZIP codes, are normalized by converting them to strings and then left-padding them with zeros using .zfill(). An example would be “531” becoming “00531”.
References
[1] Word, David L., Charles D. Coleman, Robert Nunziata and Robert Kominski. 2007. Demographic Aspects of Surnames from Census 2000. http://www2.census.gov/topics/genealogy/2000surnames/surnames.pdf.
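The normalization steps described in the note above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function names `normalize_name` and `normalize_zcta`, the exact suffix list, and the five-character padding width are all hypothetical choices made for the example.

```python
import re

# Suffixes removed from the tail of a name. This list is an assumption made
# for illustration; the note above only names "JR", "SR", and "IV".
SUFFIX_PATTERN = re.compile(r"(JR|SR|III|II|IV)$")


def normalize_name(name: str) -> str:
    """Drop whitespace/punctuation/digits, upper-case, strip a tail suffix."""
    # [\W\d_] matches anything that is not a letter.
    letters_only = re.sub(r"[\W\d_]", "", name)
    return SUFFIX_PATTERN.sub("", letters_only.upper())


def normalize_zcta(zcta) -> str:
    """Convert a ZCTA to a string and left-pad it with zeros to 5 characters."""
    return str(zcta).strip().zfill(5)


print(normalize_name("Dav 3idson"))  # DAVIDSON
print(normalize_zcta("531"))         # 00531
```

Note that a suffix pattern anchored to the end of the joined string also trims names that merely end in those letters, so a real implementation may need to be more careful than this sketch.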
surgeo.models.bifsg_model module¶
Module containing Surgeo BIFSG class
-
class
surgeo.models.bifsg_model.
BIFSGModel
¶ Bases:
surgeo.models.base_model.BaseModel
Subclass for running a Bayesian Improved First Name Surname Geocode model.
This class:
- Loads the appropriate first name, surname, and geocode lookup dataframes upon instantiation;
- Exposes a public get_probabilities() function to compute race probabilities based on proxy data (namely first names, surnames and ZIP codes); and,
- Contains a number of helper functions for cleaning ZCTA/names, multiplying probabilities, checking input values, and obtaining ZCTA/name data components.
Notes
The surname probability dataframe for this model is identical to that used for the SurnameModel (prob_race_given_surname_2010.csv).
The first name probability dataframe for this model is not the same as that used for the FirstNameModel. This model uses the prob_first_name_given_race_harvard.csv file, which has the percentage of a particular race that uses that first name (e.g. 3% of all White US citizens have the first name AARON). The FirstNameModel uses the prob_race_given_first_name_harvard.csv file, which has the race percentages for a given first name (e.g. 92% of people with the first name AARON are White).
The geocode probability dataframe for this model is not the same as that used for the GeocodeModel. This model uses the prob_zcta_given_race_2010.csv file, which has the percentage of a particular race that falls within that ZCTA (e.g. .002% of all White US citizens live within this ZIP code). The GeocodeModel uses the prob_race_given_zcta_2010.csv file, which has the race percentages for a given ZCTA (e.g. 90% of ZCTA 63144 is White).
The manner in which the first name data file was created can be found in the “fetch_first_names” Jupyter notebook.
The manner in which the geography data file was created can be found in the “fetch_geography” Jupyter notebook.
This is based on the following general formula from Voicu [2].
\(q(r \mid s,f,g) = \Large \frac{u(r,s,f,g)}{u(1,s,f,g) \, + \, u(2,s,f,g) \, + \, u(3,s,f,g) \, + \, u(4,s,f,g) \, + \, u(5,s,f,g) \, + \, u(6,s,f,g)}\)
Where:
\(\hspace{25px} u(r,s,f,g) = p(r \mid s) \times p(g \mid r) \times p(f \mid r)\)
And where:
\(\hspace{25px} p(r \mid s)\) is the probability of a selected race given surname
\(\hspace{25px} p(g \mid r)\) is the probability of a selected census block of residence given race
\(\hspace{25px} p(f \mid r)\) is the probability of a selected first name given race
\(\hspace{25px} g\) is Census Block
\(\hspace{25px} f\) is First Name
\(\hspace{25px} s\) is Surname
\(\hspace{25px} r\) is Race
And where:
\(\hspace{25px} 1 \text{ is } r =\) Hispanic
\(\hspace{25px} 2 \text{ is } r =\) White
\(\hspace{25px} 3 \text{ is } r =\) Black
\(\hspace{25px} 4 \text{ is } r =\) Asian or Pacific Islander
\(\hspace{25px} 5 \text{ is } r =\) American Indian / Alaska Native
\(\hspace{25px} 6 \text{ is } r =\) Multi Racial
References
[2] Voicu, Ioan. Using First Name Information to Improve Race and Ethnicity Classification. Statistics and Public Policy (2018) 5:1, 1-13. https://www.tandfonline.com/doi/full/10.1080/2330443X.2018.1427012
-
get_probabilities
(first_names, surnames, zctas)¶ Obtain a set of BIFSG probabilities for first_name/surname/ZCTA series
This method first takes the data and checks to see if the data is formatted appropriately. It triggers the _get_surname_probs(), _get_first_name_probs(), and _get_geocode_probs() helper functions to merge the probabilities for the inputs with their looked-up values. It then runs the _combined_probs() helper function to actually conduct the data calculation and obtain the BIFSG probabilities. It finally runs the _adjust_frame() method to concatenate the inputs and outputs in a single convenient frame.
Parameters: - first_names (pd.Series) – A series of first names to use for the BIFSG algorithm
- surnames (pd.Series) – A series of surnames to use for the BIFSG algorithm
- zctas (pd.Series) – A series of ZIP/ZCTA codes for the BIFSG algorithm
Returns: Dataframe of BIFSG probability results
Return type: pd.DataFrame
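The BIFSG formula given in the class notes can be sketched for a single person in plain Python. This is an illustration of the arithmetic only, not the library's code: the function name `bifsg_posterior`, the race labels, and every probability below are hypothetical values chosen for the example rather than figures from the surgeo data files.

```python
# Race labels used as dictionary keys; the names are an assumption.
RACES = ("hispanic", "white", "black", "api", "native", "multiple")


def bifsg_posterior(p_race_given_surname, p_geo_given_race, p_first_given_race):
    """Return q(r | s, f, g) for every race r, per the formula above."""
    # u(r, s, f, g) = p(r | s) * p(g | r) * p(f | r)
    u = {
        race: p_race_given_surname[race]
        * p_geo_given_race[race]
        * p_first_given_race[race]
        for race in RACES
    }
    total = sum(u.values())
    if total == 0:
        raise ValueError("all joint terms are zero; cannot normalize")
    # Dividing by the sum over the six races makes the results sum to 1.
    return {race: value / total for race, value in u.items()}


# Hypothetical inputs for one surname, one geography, and one first name.
surname = {"hispanic": 0.05, "white": 0.70, "black": 0.15,
           "api": 0.05, "native": 0.01, "multiple": 0.04}
geography = {"hispanic": 0.0002, "white": 0.0001, "black": 0.0004,
             "api": 0.0001, "native": 0.0001, "multiple": 0.0002}
first_name = {"hispanic": 0.001, "white": 0.03, "black": 0.01,
              "api": 0.002, "native": 0.005, "multiple": 0.01}

posterior = bifsg_posterior(surname, geography, first_name)
for race, prob in posterior.items():
    print(f"{race:>9}: {prob:.4f}")
```

Only the surname term is conditioned on the name; the geography and first name terms are conditioned on race, which is why this model needs the "given race" data files rather than the "race given" files used by the single-factor models.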
surgeo.models.first_name_model module¶
Module containing the FirstNameModel class.
-
class
surgeo.models.first_name_model.
FirstNameModel
¶ Bases:
surgeo.models.base_model.BaseModel
Provides a way to look up race percentages by first name.
This class uses a get_probabilities() method to provide a simple mechanism for obtaining race data. It is created using a simple join of a race data table and the first names that are input.
Notes
The manner in which the first name data file was created can be found in the “fetch_first_names” Jupyter notebook.
The first name probability dataframe for this model is generated from the prob_race_given_first_name_harvard.csv file.
-
get_probabilities
(names)¶ Obtain race probabilities for a set of first names.
Parameters: names (pd.Series) – names to which to attach race probability data
Returns: Dataframe of race probability results
Return type: pd.DataFrame
-
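The "simple join" described for this class can be pictured as a left join between the input names and a lookup table. The sketch below uses plain dictionaries instead of dataframes to stay self-contained. The helper `attach_probabilities`, the column names, and all table values are hypothetical, except that the 0.92 figure echoes the AARON example given elsewhere in this documentation. How the library itself represents names that are missing from the table is not shown here.

```python
# Hypothetical lookup table keyed by normalized first name.
RACE_GIVEN_FIRST_NAME = {
    "AARON": {"white": 0.92, "black": 0.03, "hispanic": 0.03, "other": 0.02},
}
RACE_COLUMNS = ("white", "black", "hispanic", "other")


def attach_probabilities(names, lookup):
    """Left-join names against the lookup: one output row per input name."""
    rows = []
    for name in names:
        key = name.strip().upper()
        probs = lookup.get(key, {})
        row = {"name": key}
        for column in RACE_COLUMNS:
            # None marks a name that has no entry in the lookup table.
            row[column] = probs.get(column)
        rows.append(row)
    return rows


result = attach_probabilities(["Aaron", "Zzyzx"], RACE_GIVEN_FIRST_NAME)
for row in result:
    print(row)
```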
surgeo.models.geocode_model module¶
This module contains the GeocodeModel class
-
class
surgeo.models.geocode_model.
GeocodeModel
(geo_level='ZCTA')¶ Bases:
surgeo.models.base_model.BaseModel
Provides a way to look up race percentages by ZIP/ZCTA code
This class uses a get_probabilities() method to provide a simple mechanism for obtaining race data. It is created using a simple join of a race data table and the ZIPs/ZCTAs that are input.
Notes
ZIP Code Tabulation Areas (ZCTAs) are approximations for US Postal ZIP codes. While ZIP codes change, ZCTAs are static for a given census cycle. They are not identical.
The manner in which the geography data file was created can be found in the “fetch_geography” Jupyter notebook.
This does not use the same Geocode data as the Surgeo class. This model uses the prob_race_given_zcta_2010.csv file, which has the race percentages for a given ZCTA (e.g. 90% of ZCTA 63144 is White). The SurgeoModel uses the prob_zcta_given_race_2010.csv file, which has the percentage of a particular race that falls within that ZCTA (e.g. .002% of all White US citizens live within this ZIP code).
-
get_probabilities
(zctas)¶ Obtain race probabilities for a set of ZIP codes or ZCTAs.
Parameters: zctas (pd.Series) – ZIPs/ZCTAs to which to attach race probability data
Returns: Dataframe of race probability results
Return type: pd.DataFrame
-
get_probabilities_tract
(geo_df)¶ Obtain race probabilities for a set of State, County, Tract.
Parameters: geo_df (pd.DataFrame) – DF of [‘state’, ‘county’, ‘tract’] codes to return probabilities for
Returns: Dataframe of race probability results
Return type: pd.DataFrame
-
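The difference between the two geocode files mentioned in the GeocodeModel notes (race given ZCTA versus ZCTA given race) comes down to which direction a table of counts is normalized. The sketch below shows both directions on a toy table. The population counts and helper names are invented for illustration and are not taken from the census files.

```python
# Hypothetical population counts per ZCTA and race.
COUNTS = {
    "63144": {"white": 9000, "black": 1000},
    "00531": {"white": 21000, "black": 3000},
}


def prob_race_given_zcta(counts):
    """Normalize within each ZCTA: shares of each race inside that ZCTA."""
    result = {}
    for zcta, by_race in counts.items():
        total = sum(by_race.values())
        result[zcta] = {race: n / total for race, n in by_race.items()}
    return result


def prob_zcta_given_race(counts):
    """Normalize within each race: share of that race living in each ZCTA."""
    race_totals = {}
    for by_race in counts.values():
        for race, n in by_race.items():
            race_totals[race] = race_totals.get(race, 0) + n
    return {
        zcta: {race: n / race_totals[race] for race, n in by_race.items()}
        for zcta, by_race in counts.items()
    }


race_given_zcta = prob_race_given_zcta(COUNTS)
zcta_given_race = prob_zcta_given_race(COUNTS)
print(race_given_zcta["63144"]["white"])  # share of ZCTA 63144 that is White
print(zcta_given_race["63144"]["white"])  # share of White residents in 63144
```

In this toy table the first direction sums to 1 across races within a ZCTA, while the second sums to 1 across ZCTAs within a race, so the two cannot be swapped for one another.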
surgeo.models.surgeo_model module¶
Module containing Surgeo BISG class
-
class
surgeo.models.surgeo_model.
SurgeoModel
(geo_level='ZCTA')¶ Bases:
surgeo.models.base_model.BaseModel
Subclass for running a Bayesian Improved Surname Geocode model.
This class:
- Loads the appropriate surname and geocode lookup dataframes upon instantiation;
- Exposes a public get_probabilities() function to compute race probabilities based on proxy data (namely surnames and ZIP codes); and,
- Contains a number of helper functions for cleaning ZCTA/names, multiplying probabilities, checking input values, and obtaining ZCTA/name data components.
Notes
The surname probability dataframe for this model is identical to that used for the SurnameModel (prob_race_given_surname_2010.csv); the geocode probability dataframe for this model is not the same as that used for the GeocodeModel. This model uses the prob_zcta_given_race_2010.csv file, which has the percentage of a particular race that falls within that ZCTA (e.g. .002% of all White US citizens live within this ZIP code). The GeocodeModel uses the prob_race_given_zcta_2010.csv file, which has the race percentages for a given ZCTA (e.g. 90% of ZCTA 63144 is White).
The manner in which the geography data file was created can be found in the “fetch_geography” Jupyter notebook.
This is based on the following general formula from Elliott et al. [3].
\(q(i \mid j,k) = \Large \frac{u(i,j,k)}{u(1,j,k) \, + \, u(2,j,k) \, + \, u(3,j,k) \, + \, u(4,j,k) \, + \, u(5,j,k) \, + \, u(6,j,k)}\)
Where:
\(\hspace{25px} u(i,j,k) = P(i \mid j) \times r(k \mid i)\)
And where:
\(\hspace{25px} P(i \mid j)\) is the probability of a selected race given surname
\(\hspace{25px} r(k \mid i)\) is the probability of a selected census block of residence given race
\(\hspace{25px} k\) is Census Block
\(\hspace{25px} j\) is Surname
\(\hspace{25px} i\) is Race
And where:
\(\hspace{25px} 1 \text{ is } i =\) Hispanic
\(\hspace{25px} 2 \text{ is } i =\) White
\(\hspace{25px} 3 \text{ is } i =\) Black
\(\hspace{25px} 4 \text{ is } i =\) Asian or Pacific Islander
\(\hspace{25px} 5 \text{ is } i =\) American Indian / Alaska Native
\(\hspace{25px} 6 \text{ is } i =\) Multi Racial
References
[3] Elliott, M.N., Morrison, P.A., Fremont, A. et al. Using the Census Bureau’s surname list to improve estimates of race/ethnicity and associated disparities. Health Serv Outcomes Res Method (2009) 9: 69. https://link.springer.com/article/10.1007/s10742-009-0047-1
-
get_probabilities
(names, geo_df)¶ Obtain a set of BISG probabilities for name/ZCTA series
This method first takes the data and checks to see if the data is formatted appropriately. It triggers the _get_surname_probs() and _get_geocode_probs() helper functions to merge the probabilities for the inputs with their looked-up values. It then runs the _combined_probs() helper function to actually conduct the data calculation and obtain the BISG probabilities. It finally runs the _adjust_frame() method to concatenate the inputs and outputs in a single convenient frame.
Parameters: - names (pd.Series) – A series of names to use for the BISG algorithm
- geo_df (Union[pd.Series, pd.DataFrame]) – A series of target ZIP/ZCTA codes, or a dataframe of state, county, and tract codes, for the BISG algorithm
Returns: Dataframe of BISG probability results
Return type: pd.DataFrame
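As a worked illustration of the BISG formula in the class notes, assume the following inputs for one surname and one geography. These numbers are invented for the example and are not taken from the surgeo data files. Values are listed in the order Hispanic, White, Black, Asian or Pacific Islander, American Indian / Alaska Native, Multi Racial.
\(\hspace{25px} P(i \mid j) = (0.05,\ 0.70,\ 0.15,\ 0.05,\ 0.01,\ 0.04)\)
\(\hspace{25px} r(k \mid i) = (0.0002,\ 0.0001,\ 0.0004,\ 0.0001,\ 0.0001,\ 0.0002)\)
Multiplying term by term gives:
\(\hspace{25px} u(i,j,k) = (10,\ 70,\ 60,\ 5,\ 1,\ 8) \times 10^{-6}\)
\(\hspace{25px} u(1,j,k) + u(2,j,k) + u(3,j,k) + u(4,j,k) + u(5,j,k) + u(6,j,k) = 154 \times 10^{-6}\)
So, for example:
\(\hspace{25px} q(2 \mid j,k) = 70 / 154 \approx 0.455\)
\(\hspace{25px} q(3 \mid j,k) = 60 / 154 \approx 0.390\)
Under these assumed inputs the surname alone would suggest White with probability 0.70, but the geography term, which is four times larger for Black than for White, pulls the combined estimate for White down to about 0.455.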
surgeo.models.surname_model module¶
Module containing the SurnameModel class.
-
class
surgeo.models.surname_model.
SurnameModel
¶ Bases:
surgeo.models.base_model.BaseModel
Provides a way to look up race percentages by surname.
This class uses a get_probabilities() method to provide a simple mechanism for obtaining race data. It is created using a simple join of a race data table and the surnames that are input.
Notes
The manner in which the surname data file was created can be found in the “fetch_surnames” Jupyter notebook.
The surname probability dataframe for this model is generated from the prob_race_given_surname_2010.csv file.
-
get_probabilities
(names)¶ Obtain race probabilities for a set of surnames.
Parameters: names (pd.Series) – names to which to attach race probability data
Returns: Dataframe of race probability results
Return type: pd.DataFrame
-
Module contents¶
This package contains the First Name, Geocode, Surname, Surname-Geocode (BISG), and First Name-Surname-Geocode (BIFSG) models.