Mary and John: using first name to predict sex in the US works quite well
Posted on Mon 11 February 2019 in data science
Someone’s first name is a good clue to their sex. Mary is probably female. John is probably male. How good a clue exactly? The short story: in the US, first name is a very good clue of sex at birth, but how good varies, at least by name and year.
I use SSA US “national” data, which has name, sex, and number of births for the years 1880–2017 for the entire country; 348,120,517 births in total.
This data only has binary sex (F or M). I use the word “sex” deliberately. “Gender” refers to a cultural identity, while “sex” refers to biological characteristics, and more to the point, what someone put on an application for a social security number. I don’t want to exclude people with non-binary gender identity, but sex at birth is tracked in a binary way by the SSA, which is the data I could think of to use.
Results
Suppose we assume the more common sex for a name is always true. An example: the name Pat has 66,854 births in this dataset, of which 40,123 are female. Suppose we say Pat is always female. We’re right about 60% of the time. Not very accurate.
However, if we apply the same rule to every name (assume the more common sex for a name is always true), it is correct for 98% of births in the dataset. Near perfect for social data. If we look at the accuracy of that label by year, it trends down, but stays over 95%.
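The majority-label rule is easy to sketch in a few lines. This is a minimal illustration, not the code behind the numbers above: the Pat counts come from the post (26,731 male is 66,854 minus 40,123), while the Mary and John counts are invented to make the example run, and the function names are hypothetical.

```python
from collections import defaultdict

def count_by_name(rows):
    """Sum births per name and sex from rows of (name, sex, births)."""
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex, births in rows:
        counts[name][sex] += births
    return counts

def majority_labels(counts):
    """Map each name to its more common sex (ties go to 'F', an arbitrary choice)."""
    return {name: ("F" if c["F"] >= c["M"] else "M") for name, c in counts.items()}

def majority_accuracy(counts):
    """Fraction of all births whose sex matches their name's majority label."""
    correct = sum(max(c["F"], c["M"]) for c in counts.values())
    total = sum(c["F"] + c["M"] for c in counts.values())
    return correct / total

rows = [
    ("Pat", "F", 40123),   # from the post
    ("Pat", "M", 26731),   # 66,854 total minus 40,123 female
    ("Mary", "F", 9900),   # invented toy count
    ("Mary", "M", 100),    # invented toy count
    ("John", "M", 9900),   # invented toy count
    ("John", "F", 100),    # invented toy count
]

counts = count_by_name(rows)
labels = majority_labels(counts)
pat_accuracy = majority_accuracy({"Pat": counts["Pat"]})
overall_accuracy = majority_accuracy(counts)
print(labels, round(pat_accuracy, 3), round(overall_accuracy, 3))
```

The accuracy is weighted by births, not by names, so a handful of very common, strongly one-sex names can carry the overall number even when a name like Pat sits near 60%.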
Our previous example (Pat) shows that some names are more mixed-sex than others. Below is a graph by birth year for fraction of “Pat” births who were female, with only the years with at least 100 Pats.
Pretty fun. It started out more male (only 30% female in the 1910s and 1920s), went mostly female (up to 80% in 1940), then veered back towards male. The last point is 1973, after which “Pat” has fewer than 100 births in a year. Note that this is exactly “Pat” only; it doesn’t include “Patrick” or “Patricia”.
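The per-year series behind a graph like this can be sketched as follows, assuming rows of (year, name, sex, births). The yearly counts here are invented and only chosen to echo the shape described above; the function name and the `min_births` parameter are hypothetical.

```python
def fraction_female_by_year(rows, name, min_births=100):
    """Return {year: fraction female} for one exact name.

    Years where the name has fewer than min_births total births are dropped.
    """
    totals = {}
    for year, row_name, sex, births in rows:
        if row_name != name:   # exact match only: "Pat", not "Patricia"
            continue
        year_counts = totals.setdefault(year, {"F": 0, "M": 0})
        year_counts[sex] += births
    result = {}
    for year in sorted(totals):
        c = totals[year]
        total = c["F"] + c["M"]
        if total >= min_births:
            result[year] = c["F"] / total
    return result

# Invented toy counts, not real SSA numbers.
rows = [
    (1920, "Pat", "F", 30), (1920, "Pat", "M", 70),
    (1940, "Pat", "F", 80), (1940, "Pat", "M", 20),
    (1980, "Pat", "F", 20), (1980, "Pat", "M", 30),   # only 50 births: dropped
    (1940, "Patricia", "F", 5000),                    # different name: ignored
]

pat_series = fraction_female_by_year(rows, "Pat")
print(pat_series)
```

The threshold matters because a year with only a few births can swing wildly between 0% and 100% on noise alone.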
Below are the top 30 most mixed-sex names by year, with fraction of births that are female. A name near 100% is mostly female, and near 0% is mostly male. There are only dots for at least 100 births in that year. Also, it only shows data since 1920 because there are more interesting movements there.
Examples:
- Alva was always in the middle, and became rare after 1964
- Amari became popular in 1994
- “Baby” and “Unknown” are likely not real names. We’d expect them to be around 50%, and they are, mostly
- Some names started out more male, and became more female, such as Riley, Stevie, Blair, Quinn, Mckinley, Emery, Ivory, Jaylin. For Stevie, a friend guessed Stevie Nicks, who joined Fleetwood Mac in 1975. Blair really turned female in 1980–1982; that same friend guessed Blair in “The Facts of Life”, a television show that ran 1979–1988.
- Some names started out more female, and became more male, such as Robbie. Robbie is heading male in 1950, but moves that way faster in 1960, possibly due to Robbie in “My Three Sons”, yet another tv series, that started in 1960 (pointed out by yet another of my friends!).
- Some have stayed in the middle, or veered back and forth, such as Frankie, Jackie, Justice, Kerry, Remy
- Jaime is interesting. A friend guessed that it is a combination of a Spanish name for a boy that stuck around, and a newer variant for an English name “Jamie” that came charging in around 1975, then lost steam. Another friend guessed Jaime Sommers, the (fictional) bionic woman, who first appeared in “The Six Million Dollar Man”, a popular tv show, and got her own show 1976–1978.
I don’t have earth-shattering conclusions. (Popular culture influences names?) But, I enjoy looking at data. It does seem maybe names go towards female more often than towards male? I could count that more precisely with more work.
One more graph: the same one, but for the names that are most mixed-sex as of 1970.
I still find it fascinating that these names can have such different shapes. Riley, Raleigh, Harley, Emory, Leighton, Emerson went from male to female; Jan went from female to male (again, this might be a combination of multiple names with the same spelling), and Jody may be headed there.
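The post does not say how “most mixed-sex” is scored. One plausible score, assumed here purely for illustration, is how close a name's female fraction is to 50% in a given year, among names with at least 100 births that year. The counts below are invented and the function name is hypothetical.

```python
def most_mixed_names(rows, year, top_n=30, min_births=100):
    """Rank names in one year by closeness of female fraction to 50%.

    rows are (year, name, sex, births). This scoring rule is an assumption,
    not necessarily the one used for the graphs in the post.
    """
    totals = {}
    for row_year, name, sex, births in rows:
        if row_year != year:
            continue
        c = totals.setdefault(name, {"F": 0, "M": 0})
        c[sex] += births
    scored = []
    for name, c in totals.items():
        total = c["F"] + c["M"]
        if total < min_births:
            continue
        fraction_female = c["F"] / total
        scored.append((abs(fraction_female - 0.5), name, fraction_female))
    scored.sort()   # smallest distance from 50% first
    return [(name, frac) for _, name, frac in scored[:top_n]]

# Invented toy counts for 1970, not real SSA numbers.
rows = [
    (1970, "Jody", "F", 520), (1970, "Jody", "M", 480),
    (1970, "Jan", "F", 900), (1970, "Jan", "M", 100),
    (1970, "Riley", "F", 30), (1970, "Riley", "M", 170),
    (1970, "Rare", "F", 5), (1970, "Rare", "M", 5),     # under 100 births: skipped
    (1971, "Jody", "F", 1), (1971, "Jody", "M", 999),   # other year: ignored
]

top_two = most_mixed_names(rows, 1970, top_n=2)
print(top_two)
```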
In conclusion: overall, first names predict sex well, but there are also some mixed-sex names, and some that change over time.
Discussion
So why should you care? (The first and only question to answer for any presentation.)
First, I validated the idea, going from a guess (“first name likely tells us about sex”) to a strong indication with data (98% accuracy overall). Of such small steps is science made. Second, I described a method (grab SSA data, look at most common sex per name) that is easy to implement and easy to interpret. Third, I dug into the data to find some fun examples.
Is there danger in publishing this result? (A question every data scientist should ask.) The answer is rarely obvious. This method is a tool. It allows someone to take a dataset with first names and add sex with fairly high accuracy in the US. So, it may enable analyses that were not possible before.
Like any tool, it can be used for good or ill. It might help a researcher look at prevalence of men and women in the news to press for gender equality. Or, it might help a company spy on people. (I published related privacy risks in 2006 in “You are what you say: privacy risks of public mentions”.)
Or it might be misused. For example, this method may be less accurate for names from disadvantaged populations like women or minorities. Looking at a few examples it looks pretty accurate, but that’s not science. I have no systematic evidence.
I think this is interesting and may be a valuable tool for good. Moreover, the cat is out of the bag. If I search for “predicting gender from name” I see services, programming APIs, code repositories, and published results. It seems likely that 10 times as many people have already done this in private companies where it wouldn’t show up publicly.
For all these reasons, I’m willing to publish it.
Next steps
How could we make this model more accurate? We might be able to use birth year, though we might not have that information in every setting.
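One way this could look, as a rough sketch: use the majority sex for the name in that birth year when we have it, and fall back to the all-years majority when the year is missing or unseen. The counts are invented, the function names are hypothetical, and nothing here has been checked against the real data.

```python
from collections import defaultdict

def build_lookups(rows):
    """Build per-(name, year) and per-name counts from (year, name, sex, births)."""
    by_name_year = defaultdict(lambda: {"F": 0, "M": 0})
    by_name = defaultdict(lambda: {"F": 0, "M": 0})
    for year, name, sex, births in rows:
        by_name_year[(name, year)][sex] += births
        by_name[name][sex] += births
    return by_name_year, by_name

def predict_sex(name, year, by_name_year, by_name):
    """Prefer the year-specific majority; fall back to the all-years majority.

    Returns None for names never seen. Pass year=None when it is unknown.
    """
    counts = by_name_year.get((name, year)) or by_name.get(name)
    if counts is None:
        return None
    return "F" if counts["F"] >= counts["M"] else "M"

# Invented toy counts, not real SSA numbers.
rows = [
    (1920, "Pat", "F", 30), (1920, "Pat", "M", 70),
    (1940, "Pat", "F", 80), (1940, "Pat", "M", 20),
]
by_name_year, by_name = build_lookups(rows)

print(predict_sex("Pat", 1920, by_name_year, by_name))   # year-specific majority
print(predict_sex("Pat", None, by_name_year, by_name))   # all-years fallback
```

Using `.get` rather than indexing keeps the lookups from silently creating empty entries for names or years that were never seen.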
Better yet, we could use spelling. Michael is likely male, Michaela is likely female. The ending of the word might tell us something. Also, names like “Patrick” and “Patricia” are related: they both shorten to “Pat”. But, this requires work.
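The simplest version of the spelling idea is to count births by the last letter of the name and take the majority sex for each letter. The sketch below does only that, with invented counts and hypothetical function names; it says nothing about how well this would work on the real data, and it ignores the harder problem of linking names like “Patrick” and “Patricia”.

```python
from collections import defaultdict

def last_letter_model(rows):
    """Map each final letter to its majority sex, from rows of (name, sex, births)."""
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex, births in rows:
        counts[name[-1].lower()][sex] += births
    return {letter: ("F" if c["F"] >= c["M"] else "M") for letter, c in counts.items()}

def predict_by_ending(name, model):
    """Guess sex from the last letter only; None if that letter was never seen."""
    return model.get(name[-1].lower())

# Invented toy counts, not real SSA numbers.
rows = [
    ("Michael", "M", 1000), ("Michael", "F", 10),
    ("Michaela", "F", 200),
    ("Daniel", "M", 800),
    ("Daniela", "F", 150), ("Daniela", "M", 2),
]
model = last_letter_model(rows)

# Names that do not appear in the toy data at all.
print(predict_by_ending("Gabriel", model), predict_by_ending("Gabriela", model))
```

The appeal of a spelling feature is that it can say something about names that never appear in the training counts, which the plain per-name lookup cannot.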
Here is perhaps the real lesson: do the simple thing (usually counting), see there is a useful result, then move forward. Perfect is the enemy of good.
Maybe I’ll do another post using this data.
Thanks to Melissa, John, and Cole for great feedback and name examples.