Title: | Predict the Race and Gender of a Given Name Using Census and Social Security Administration Data |
---|---|
Description: | Predicts the most common race of a surname based on U.S. Census data, and the most common gender of a first name based on U.S. Social Security Administration data. |
Authors: | Jacob Kaplan [aut, cre] |
Maintainer: | Jacob Kaplan <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.0.1 |
Built: | 2024-10-27 03:03:49 UTC |
Source: | https://github.com/jacobkap/predictrace |
A dataset containing almost 100,000 first names and the proportion of people with that first name that are female and male.
first_names_gender
A data frame with 99,444 rows and 4 variables:
The person's first name
Probability that the first name is male
Probability that the first name is female
The most likely gender based on the probability of each gender
...
https://www.ssa.gov/oact/babynames/limits.html
A dataset containing over 4,000 first names and the probability that a person with that first name is of each race. Citation for this data: Tzioumis, Konstantinos (2018). Demographic aspects of first names. Scientific Data, 5:180025 [dx.doi.org/10.1038/sdata.2018.25].
first_names_race
A data frame with 4,251 rows and 8 variables:
First name
The most likely race based on the probability of each race
Probability that the first name is American Indian
Probability that the first name is Asian
Probability that the first name is Black
Probability that the first name is Hispanic
Probability that the first name is White
Probability that the first name is two or more races
...
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TYJKEZ
The first name data comes from the United States Social Security Administration (SSA). For each name, the data gives the number of people identified as female and as male, so the probability of female (or male) is the proportion of all people with that name who are female (or male). SSA data is available annually from 1880 to 2019; this function aggregates all years together.
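The proportion described above can be sketched in base R. This is a minimal illustration, not the package's own code: the data frame `counts`, its column names, and every number in it are made up for the example.

```r
# Hypothetical yearly counts for one name; all values are invented.
counts <- data.frame(
  name = c("jordan", "jordan", "jordan", "jordan"),
  year = c(1990, 1990, 2000, 2000),
  sex  = c("F", "M", "F", "M"),
  n    = c(100, 300, 200, 400),
  stringsAsFactors = FALSE
)

# Aggregate all years together: one total per name and sex.
totals <- aggregate(n ~ name + sex, data = counts, FUN = sum)

# Probability = share of each sex among everyone with that name.
totals$probability <- totals$n / ave(totals$n, totals$name, FUN = sum)
totals
```

With these toy counts the female total is 300 of 1,000 people, so the female probability is 0.3 and the male probability is 0.7.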
predict_gender(name, probability = TRUE)
name | String or vector of strings of the first name that you want to know the gender of. |
probability | If TRUE (default), will provide a column for each gender with the probability that the first name is of that gender. If FALSE, will only return the name, the match-name from the SSA data, and the most likely gender. |
A data.frame with three or five columns: The first column has the name as inputted, the second column has the cleaned up name (no spaces or punctuation, all lowercase), the third column tells the likely gender of the first name (if there are multiple genders with the same probability of a match, it will be a string with each gender separated by a comma). If the parameter probability is false, these three columns are all that is returned. Otherwise, columns 4-5 tell the specific probability that the first name is female or male.
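The "cleaned up name" in the second column can be approximated as follows. This is a sketch under assumptions: `clean_name` is a hypothetical helper written for this example, and the package may clean names with different rules.

```r
# Hypothetical helper: drop whitespace and punctuation, then lowercase.
clean_name <- function(x) {
  tolower(gsub("[[:space:][:punct:]]", "", x))
}

clean_name(c("DEAN", " O'Neil ", "Cho"))
```

Cleaning the input this way lets differently formatted spellings of a name map to a single lookup key.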
predict_gender("tyrion")
predict_gender(c("harry", "ron", "hermione", "DEAN", "NEVILLE", "Cho"))
predict_gender("franklin", probability = FALSE)
predict_gender("jacob", probability = FALSE)
predict_gender("jacob", probability = TRUE)
The surname data comes from the United States Census. The first name data comes from Tzioumis (2018, <dx.doi.org/10.1038/sdata.2018.25>).
predict_race(name, probability = TRUE, surname = TRUE)
name | String or vector of strings of the surname or first name that you want to know the race of. |
probability | If TRUE (default), will provide a column for each race with the probability that the name is of that race. If FALSE, will only return the name, the match-name from the Census data, and the most likely race. |
surname | If TRUE (default), will return the race based on the inputted name being a surname. If FALSE, will return the race based on the inputted name being a first name. |
A data.frame with three or nine columns: The first column has the name as inputted, the second column has the cleaned up name (no spaces or punctuation, all lowercase), the third column tells the likely race of the surname or first name (if there are multiple races with the same probability of a match, it will be a string with each race separated by a comma). If the parameter probability is false, these three columns are all that is returned. Otherwise, columns 4-9 tell the specific probability that the surname or first name is each race.
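The tie rule for the third column can be illustrated with a small base R sketch. `likely_label` is a hypothetical helper, the probabilities are invented, and the exact separator the package uses is an assumption here.

```r
# Hypothetical helper: return the label(s) with the highest probability,
# joined by a comma when several labels are tied.
likely_label <- function(probs) {
  paste(names(probs)[probs == max(probs)], collapse = ", ")
}

likely_label(c(asian = 0.4, black = 0.4, white = 0.2))  # tie between two labels
likely_label(c(asian = 0.1, black = 0.2, white = 0.7))  # single most likely label
```

The first call returns both tied labels in one string, and the second returns only the top label.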
predict_race("franklin")
predict_race(c("franklin", "Washington", "Jefferson", "Sotomayor", "Liu"))
predict_race("franklin", probability = FALSE)
predict_race("jacob", probability = FALSE, surname = FALSE)
predict_race("jacob", probability = TRUE, surname = FALSE)
A dataset containing over 167,000 surnames and the proportion of people with that surname who are of each race.
surnames_race
A data frame with 167,408 rows and 8 variables:
Surname
The most likely race based on the probability of each race
Probability that the surname is American Indian
Probability that the surname is Asian
Probability that the surname is Black
Probability that the surname is Hispanic
Probability that the surname is White
Probability that the surname is two or more races
...
https://www.census.gov/topics/population/genealogy/data/2010_surnames.html
https://www.census.gov/topics/population/genealogy/data/2000_surnames.html
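A table of this shape lends itself to a lookup by cleaned name. The sketch below uses a tiny stand-in table rather than the real data: the column names `match_name` and `likely_race`, the rows, and the labels are all assumptions for illustration, and how the package reports unmatched names is not stated here.

```r
# Toy reference table standing in for the surname data (values invented).
reference <- data.frame(
  match_name  = c("garcia", "smith"),
  likely_race = c("hispanic", "white"),
  stringsAsFactors = FALSE
)

# Names as a user might type them, plus a lowercase key and the input order.
input <- data.frame(name = c("Smith", "GARCIA", "Zzyzx"), stringsAsFactors = FALSE)
input$match_name <- tolower(input$name)
input$order <- seq_len(nrow(input))

# Left join: all.x = TRUE keeps unmatched inputs and fills them with NA.
result <- merge(input, reference, by = "match_name", all.x = TRUE)

# merge() sorts by the key, so restore the original input order.
result <- result[order(result$order), c("name", "match_name", "likely_race")]
result
```

In this sketch the unmatched name stays in the result with `NA` in the `likely_race` column, so the output has one row per input name.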