In this project, we used nothing but a person’s name to predict their gender.
The database we use has no gender information for its customers, yet gender is a critical demographic feature for analysis, actionable segmentation, and marketing. Going back and collecting it was not an option because the user base was already too large. It was decided that this data point would be captured going forward, but what about the records already in the database? The obvious and straightforward solution was to label each name with a gender by hand. We even tried that for a few records and quickly realized that it wasn’t a workable solution: it was a lot of manual work, and the scope for human error was high. So, after exploring a couple of options, we decided to implement a seemingly ridiculous idea.
We would train an AI model to predict a person’s gender from just their name. For us humans, this is quite easy: give me a familiar name and I’ll tell you almost instantly, and almost always correctly, whether it’s male or female. But making a machine do the same thing is a bit more challenging.
With this article, we will take you through the process we followed in this project.
Gathering the data
We spent quite some time looking for the data. While data for English names was readily available, Indian names labelled with gender were hard to find. Eventually, we were fortunate enough to stumble upon a dataset that had just what we needed: Indian names, along with their gender.
Here is an excerpt from the dataset we’d use.
Cleaning the Data
Before extracting any features from the names, it makes sense to separate the first name from the full name, because the last name has no bearing on our target, gender. Now you might ask, “How do you know that?” Well, in India, as in much of the world, the last name is usually independent of a person’s gender: it is a family name shared by all members of the family (and just for the record, this is called domain knowledge!). So, we created a new feature called ‘firstname’ from the data. We also dropped the ‘race’ feature as it is irrelevant (all the names are labelled as Indian).
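A minimal sketch of this step with pandas is shown here. The column names ‘name’, ‘gender’, and ‘race’ follow the article, but the three rows are made up for illustration, and taking the first whitespace-separated token as the first name is our assumption about how the split could be done.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the article.
df = pd.DataFrame({
    "name": ["Priya Sharma", "Rahul Verma", "Anita"],
    "gender": ["f", "m", "f"],
    "race": ["indian", "indian", "indian"],
})

# Assumption: the first whitespace-separated token is the first name.
df["firstname"] = df["name"].str.strip().str.split().str[0]

# 'race' has a single value in this dataset, so it carries no signal.
df = df.drop(columns=["race"])

print(df)
```

A single-word name such as "Anita" is simply kept as the first name.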
We now have to make sure that the ‘firstname’ feature is free from any noise.
We will first get rid of punctuation, non-English letters, and digits, if any. We will also remove any whitespace that might have crept in by mistake.
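One way to sketch this cleaning step is a small regular-expression helper. The function name clean_firstname is hypothetical, and lowercasing the name first is our assumption (the article does not say whether case was normalized).

```python
import re

def clean_firstname(raw: str) -> str:
    """Lowercase the name and keep only the letters a-z.

    Punctuation, digits, whitespace, and non-English letters are all
    dropped, because none of them match the a-z character class.
    """
    return re.sub(r"[^a-z]", "", raw.lower())

print(clean_firstname("  Pri-ya1 "))
```

With a pandas data frame, the helper could be applied column-wise, for example with df["firstname"].map(clean_firstname).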
Exploring the data
As is famously known, Exploratory Data Analysis (EDA) takes up the majority of the time when building any ML/AI model. And the time spent on EDA is definitely worthwhile, as it helps you understand the relationships within your data and create new features where you need them. In our case, there are not many variables to find relationships among, so we must explore the data to find and extract useful features.
The first obvious feature we can think of from the above data is the length of the name. So, we calculated the length of every first name and added that as a new feature, and named it ‘NameLen’.
Now that we have added the Name length feature, it’s time that we see whether it actually has any impact on the target variable, in this case, Gender. So let’s visualize the average first name lengths of both genders.
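The length feature and the per-gender average can be sketched as follows. The four names are made up, so the numbers printed here say nothing about the real dataset; only the column names ‘firstname’, ‘gender’, and ‘NameLen’ come from the article.

```python
import pandas as pd

# Made-up sample; the real dataset is much larger.
df = pd.DataFrame({
    "firstname": ["priya", "anita", "rahul", "siddharth"],
    "gender": ["f", "f", "m", "m"],
})

# Length of every first name as a new numeric feature.
df["NameLen"] = df["firstname"].str.len()

# Average first-name length for each gender.
avg_len = df.groupby("gender")["NameLen"].mean()
print(avg_len)
```

If matplotlib is installed, avg_len.plot(kind="bar") would draw this as a bar chart.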
The bar chart above shows that male names are slightly longer than female names on average, though the difference is small. We got a feature, but it’s not that impactful. So our quest continues.
In general, most female names end with a vowel. This could be a useful feature, but we haven’t proved that yet. So let’s create a new feature called ‘isVowel’. This will be a binary feature indicating whether or not the first name ends with a vowel: 1 means the last letter is a vowel and 0 means it isn’t. We will create this feature from ‘firstname’.
Just to remind you, in case you forgot, there are five vowels in the English alphabet (a, e, i, o, u).
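A small sketch of how such a feature could be computed is given here. The helper name ends_with_vowel is hypothetical, and returning 0 for an empty string is our choice for names that were wiped out entirely by cleaning.

```python
VOWELS = set("aeiou")

def ends_with_vowel(firstname: str) -> int:
    """Return 1 if the last letter of the name is a vowel, else 0."""
    # An empty string has no last letter, so it maps to 0.
    return int(bool(firstname) and firstname[-1].lower() in VOWELS)

for name in ["priya", "rahul", "pooja"]:
    print(name, ends_with_vowel(name))
```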
Now that we have the feature ready, let’s go ahead and visualize this and see whether there is any relationship with the target variable.
(f,1) counts the female names whose last letter is a vowel, and (f,0) counts the female names whose last letter is not a vowel. The same goes for (m,1) and (m,0). It is clearly visible that the majority of female names end with a vowel, while the majority of male names do not.
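The four groups discussed above are simply the counts of each (gender, isVowel) pair. A sketch with pandas, using made-up rows, could look like this:

```python
import pandas as pd

# Made-up rows; only the column names follow the article.
df = pd.DataFrame({
    "gender":  ["f", "f", "f", "m", "m", "m"],
    "isVowel": [1, 1, 0, 0, 0, 1],
})

# One count per (gender, isVowel) combination.
counts = df.groupby(["gender", "isVowel"]).size()
print(counts)
```

Each entry of counts is indexed by a pair such as ("f", 1), which matches the labels used in the chart.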
The ‘isVowel’ feature hence is an impactful one. This indicates that a lot of information regarding gender can be extracted from the suffixes of Indian names. We will now create similar features, but this time, we will use the last three letters instead of just one last letter.
In natural language processing, contiguous sequences of n characters are called (character) n-grams, and because we are using three letters, ours are called tri-grams.
But why just three? Because with n-grams, the dimensionality grows very quickly as n increases: with 26 letters there are at most 26^3 = 17,576 possible tri-grams, but 26^4 = 456,976 possible 4-grams. So by considering only the tri-grams of the suffix, we keep the feature space far smaller!
For creating these tri-grams, we will make use of the CountVectorizer from the scikit-learn package.
Using a custom lambda function as an analyzer for the CountVectorizer, we can extract the tri-grams from just the suffixes.
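A minimal sketch of this idea is shown below. CountVectorizer accepts a callable as its analyzer, and here the callable returns a single token per name: its last three letters. The sample names are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Callable analyzer: every name produces exactly one "token",
# namely its last three letters (or the whole name if it is shorter).
vectorizer = CountVectorizer(analyzer=lambda name: [name[-3:]])

names = ["priya", "anita", "rahul", "supriya"]  # made-up sample
trigrams = vectorizer.fit_transform(names)

# One row per name, one column per distinct suffix seen during fitting.
print(trigrams.shape)
print(sorted(vectorizer.vocabulary_))
```

Because "priya" and "supriya" share the suffix "iya", the four names produce only three columns.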
Here is a snapshot of the tri-grams. They are stored as a sparse matrix; notice the dimensionality of these features at the end (30172 rows x 1592 columns).
We need to save this TriGram model so that we can re-use the same features with the test data. Now that the tri-grams are ready, we can join these features with the previous data.
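The sketch below shows why the fitted vectorizer has to be kept: calling transform on new names reuses the training vocabulary, so the test matrix gets the same columns. It also shows one possible way to join the tri-gram columns to the data frame. The data is made up, and we use a named function (last_three, a hypothetical name) instead of a lambda, because Python's standard pickle module cannot serialize lambdas if the vectorizer is later written to disk.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def last_three(name):
    """Analyzer: one token per name, its last three letters."""
    return [name[-3:]]

# Made-up training rows and test names.
train = pd.DataFrame({"firstname": ["priya", "anita", "rahul"],
                      "isVowel": [1, 1, 0]})
test_names = ["supriya", "kiran"]

vectorizer = CountVectorizer(analyzer=last_three)
train_tri = vectorizer.fit_transform(train["firstname"])

# Column order of the matrix follows the indices stored in vocabulary_.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
tri_df = pd.DataFrame(train_tri.toarray(), columns=columns, index=train.index)
train_full = pd.concat([train, tri_df], axis=1)

# transform (not fit_transform) keeps the vocabulary learned from training.
test_tri = vectorizer.transform(test_names)
print(train_full)
print(test_tri.toarray())
```

A suffix that never appeared in training, such as "ran" from "kiran", has no column of its own, so that name ends up as an all-zero row.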
Once we join the tri-grams with the previous data frame, we have all the features. Let us label encode the target variable, gender. This converts the values from (f, m) to (0, 1).
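One common way to do this is scikit-learn's LabelEncoder, which assigns integers to the classes in sorted order, so f becomes 0 and m becomes 1. Whether the original project used this exact class is an assumption on our part.

```python
from sklearn.preprocessing import LabelEncoder

genders = ["f", "m", "m", "f"]  # made-up sample of the target column

encoder = LabelEncoder()
y = encoder.fit_transform(genders)

print(list(encoder.classes_))  # classes in the order of their integer codes
print(list(y))
```

The fitted encoder can map predictions back to letters with encoder.inverse_transform.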
Once the tri-grams are joined with the data frame, it will look something like this.
We can now go ahead and remove the ‘name’ and ‘firstname’ features, as we have already extracted all the information from them. We won’t go into much detail about splitting the data into training and test sets, as it is the regular process that you’d probably already know.
Finally, we train a Support Vector Machine (SVM) classifier on the data.
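To make the overall flow concrete, here is a rough end-to-end sketch on twelve made-up names. The linear kernel, the 75/25 split, and the helper names last_three and handcrafted are all our assumptions; the article does not state the SVM settings that were used, and a toy sample like this says nothing about real accuracy.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Made-up sample: 0 = female, 1 = male.
names = ["priya", "anita", "kavita", "sunita", "pooja", "deepika",
         "rahul", "amit", "kiran", "rajesh", "suresh", "vikram"]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

def last_three(name):
    """Analyzer: one token per name, its last three letters."""
    return [name[-3:]]

def handcrafted(batch):
    """NameLen and isVowel for every name in the batch."""
    return np.array([[len(n), int(n[-1] in "aeiou")] for n in batch])

train_names, test_names, y_train, y_test = train_test_split(
    names, labels, test_size=0.25, random_state=0, stratify=labels)

# Fit the tri-gram vocabulary on the training names only.
vectorizer = CountVectorizer(analyzer=last_three)
X_train = hstack([vectorizer.fit_transform(train_names),
                  csr_matrix(handcrafted(train_names))]).tocsr()
X_test = hstack([vectorizer.transform(test_names),
                 csr_matrix(handcrafted(test_names))]).tocsr()

model = SVC(kernel="linear")  # kernel choice is an assumption
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(predictions, accuracy)
```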
We can then save the model as a pickle and use it on the test data to predict the gender.
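Saving and reloading a fitted classifier with pickle can be sketched as follows. The tiny numeric training set (a name length and a vowel flag per row), the linear kernel, and the file name gender_svm.pkl are made up for illustration.

```python
import os
import pickle
import tempfile

from sklearn.svm import SVC

# Made-up features: [NameLen, isVowel]; labels: 0 = female, 1 = male.
X = [[5, 1], [5, 1], [5, 0], [9, 0]]
y = [0, 0, 1, 1]
model = SVC(kernel="linear").fit(X, y)

# Write the fitted model to disk, then load it back.
path = os.path.join(tempfile.mkdtemp(), "gender_svm.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)
with open(path, "rb") as fh:
    restored = pickle.load(fh)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

The fitted vectorizer has to be stored alongside the model, since new names must be turned into the same columns before prediction.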
The model performed really well with an accuracy of 85%.
The accuracy could be improved further by increasing the size of the data, and 4-grams could be tried as features, given the necessary resources. Hosted as a web app, this model could shorten the registration process: it would take just the user’s name as input and fill in the gender via an API call to the model. One less pain point for the customer!
Authors: Sairam Rathod, Gaurav Shilimkar, Pankaj Shendurkar, Sudesh Raina, Atharv Tayde.
Guide: Prof. Rupali Tornekar.
We hope you found this blog interesting. Feel free to drop your queries in the comments below. Stay tuned for more!