Package 'predictrace'

Title: Predict the Race and Gender of a Given Name Using Census and Social Security Administration Data
Description: Predicts the most common race of a surname based on U.S. Census data, and the most common gender of a first name based on U.S. Social Security Administration data.
Authors: Jacob Kaplan [aut, cre]
Maintainer: Jacob Kaplan <[email protected]>
License: MIT + file LICENSE
Version: 2.0.1
Built: 2024-10-27 03:03:49 UTC
Source: https://github.com/jacobkap/predictrace

Help Index


First names and the proportion of people of each gender with that first name

Description

A dataset containing almost 100,000 first names and the proportion of people with that first name that are female and male.

Usage

first_names_gender

Format

A data frame with 99,444 rows and 4 variables:

name

The person's first name

probability_male

Probability that the first name is male

probability_female

Probability that the first name is female

likely_gender

The most likely gender based on the probability of each gender

...

Source

https://www.ssa.gov/oact/babynames/limits.html


First names and the probability of each race for that first name

Description

A dataset containing 4,251 first names and the probability that a person with that first name is of each race. Citation for this data: Tzioumis, Konstantinos (2018) Demographic aspects of first names, Scientific Data, 5:180025 [dx.doi.org/10.1038/sdata.2018.25].

Usage

first_names_race

Format

A data frame with 4,251 rows and 8 variables:

name

First name

likely_race

The most likely race based on the probability of each race

probability_american_indian

Probability that the first name is American Indian

probability_asian

Probability that the first name is Asian

probability_black

Probability that the first name is Black

probability_hispanic

Probability that the first name is Hispanic

probability_white

Probability that the first name is White

probability_2races

Probability that the first name is two or more races

...

Source

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TYJKEZ


Find the gender of a first name

Description

The first name data comes from the United States Social Security Administration (SSA). The data gives the number of people with each name who are identified as female or male, so the probability of female/male is the proportion of all people with that name who are female/male. SSA data is available annually from 1880 to 2019; this function aggregates all years together.
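The aggregation described above can be sketched in base R. This is only an illustration of the idea (sum counts over all years, then take row proportions), not the package's actual code; the `ssa_counts` data frame and its numbers are made up for the example.

```r
# Hypothetical toy counts in the spirit of the SSA files:
# one row per name, sex, and year. The numbers are invented.
ssa_counts <- data.frame(
  name  = c("jordan", "jordan", "jordan", "jordan", "mary", "mary"),
  sex   = c("M", "F", "M", "F", "F", "F"),
  year  = c(1990, 1990, 2000, 2000, 1990, 2000),
  count = c(300, 100, 450, 150, 800, 200)
)

# Sum counts over all years for each name/sex pair.
# Name/sex pairs that never occur get a count of 0.
totals <- xtabs(count ~ name + sex, data = ssa_counts)

# Proportion of all people with each name who are female or male.
props <- prop.table(totals, margin = 1)

gender_probs <- data.frame(
  name               = rownames(props),
  probability_male   = as.numeric(props[, "M"]),
  probability_female = as.numeric(props[, "F"])
)
gender_probs
```

Because all years are pooled, a name whose typical gender has shifted over time ends up with a single blended proportion.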

Usage

predict_gender(name, probability = TRUE)

Arguments

name

String or vector of strings of the first name that you want to know the gender of.

probability

If TRUE (default) will provide columns for each gender with the probability that the first name is of that gender. If FALSE, will only return the name, the match-name from the SSA data, and the most likely gender.

Value

A data.frame with three or five columns: The first column has the name as inputted, the second column has the cleaned up name (no spaces or punctuation, all lowercase), the third column tells the likely gender of the first name (if there are multiple genders with the same probability of a match, it will be a string with each gender separated by a comma). If the parameter probability is false, these three columns are all that is returned. Otherwise, columns 4-5 tell the specific probability that the first name is female or male.
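The "cleaned up name" in the second column can be approximated as follows. This is a minimal sketch, assuming cleaning means lowercasing and dropping spaces and punctuation; `clean_name` is a hypothetical helper written for this example and is not part of the package's exported functions.

```r
# Hypothetical helper (not a predictrace function): lowercase the input
# and remove all whitespace and punctuation characters.
clean_name <- function(x) {
  gsub("[[:space:][:punct:]]", "", tolower(x))
}

cleaned <- clean_name(c("DEAN", " O'Neil ", "Mary-Ann"))
cleaned
```

Under this assumption, inputs that differ only in case, spacing, or punctuation map to the same match-name.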

Examples

predict_gender("tyrion")

predict_gender(c("harry", "ron", "hermione", "DEAN", "NEVILLE", "Cho"))
predict_gender("franklin", probability = FALSE)
predict_gender("jacob", probability = FALSE)
predict_gender("jacob", probability = TRUE)

Find the race of a surname or first name

Description

The surname data comes from the United States Census. The first name data comes from Tzioumis (2018, <dx.doi.org/10.1038/sdata.2018.25>).

Usage

predict_race(name, probability = TRUE, surname = TRUE)

Arguments

name

String or vector of strings of surname or first name that you want to know the race of.

probability

If TRUE (default) will provide columns for each race with the probability that the name is of that race. If FALSE, will only return the name, the match-name from the source data, and the most likely race.

surname

If TRUE (default) will return the race based on the inputted name being a surname. If FALSE, will return the race based on the inputted name being a first name.

Value

A data.frame with three or nine columns: The first column has the name as inputted, the second column has the cleaned up name (no spaces or punctuation, all lowercase), the third column tells the likely race of the surname or first name (if there are multiple races with the same probability of a match, it will be a string with each race separated by a comma). If the parameter probability is false, these three columns are all that is returned. Otherwise, columns 4-9 tell the specific probability that the surname or first name is each race.
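The tie rule for the third column can be illustrated with a small base R sketch. The function `most_likely`, the category labels, the probabilities, and the ", " separator are all assumptions made for this example; the package's own labels and separator may differ.

```r
# Hypothetical helper (not a predictrace function): return the label with
# the highest probability, or all tied labels joined by a comma.
most_likely <- function(probs) {
  winners <- names(probs)[probs == max(probs)]
  paste(winners, collapse = ", ")
}

# Invented probabilities for one name; two categories tie for the maximum.
tied <- c(american_indian = 0.01, asian = 0.02, black = 0.40,
          hispanic = 0.05, white = 0.40, two_races = 0.12)
most_likely(tied)

# A single clear maximum yields a single label.
clear <- c(american_indian = 0.01, asian = 0.90, black = 0.02,
           hispanic = 0.03, white = 0.03, two_races = 0.01)
most_likely(clear)
```

Code that consumes the likely race column should therefore be prepared for values that contain more than one label.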

Examples

predict_race("franklin")

predict_race(c("franklin", "Washington", "Jefferson", "Sotomayor", "Liu"))
predict_race("franklin", probability = FALSE)
predict_race("jacob", probability = FALSE, surname = FALSE)
predict_race("jacob", probability = TRUE, surname = FALSE)

Surnames and number of people of each race with that surname.

Description

A dataset containing over 167 thousand surnames and the number of people of each race with that surname.

Usage

surnames_race

Format

A data frame with 167,408 rows and 8 variables:

name

Surname

likely_race

The most likely race based on the probability of each race

probability_american_indian

Probability that the surname is American Indian

probability_asian

Probability that the surname is Asian

probability_black

Probability that the surname is Black

probability_hispanic

Probability that the surname is Hispanic

probability_white

Probability that the surname is White

probability_2races

Probability that the surname is two or more races

...

Source

https://www.census.gov/topics/population/genealogy/data/2010_surnames.html

https://www.census.gov/topics/population/genealogy/data/2000_surnames.html