Handling Categorical Data

10/15/2018

Some columns hold data that falls into categories, such as Male-Female, Red-Blue-Green, 0-1 etc. In our Titanic dataset, we have four columns that are categorical, including the target variable 'Survived'.

Such columns cannot be given to the model as they are, firstly because they are in string format. Now you would say that we could replace every unique value with an integer starting from 0. But even if they are converted to a numeric datatype, or are already in one, we still cannot use them as training data. The reason is that the model would read the different numbers as different magnitudes rather than as equal, unordered labels.

Let me explain with an example. Take the column Embarked and replace each entry with an integer starting from 0, say S:0, C:1, Q:2. If this were given to a model to train on, the model would treat the value 2 for Q as higher than S and C. The goal is to give every category an equal weight. A simple solution is to convert all values to 0's and 1's and make as many columns as there are unique values.
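To make the ordering problem concrete, here is a minimal sketch using the S:0, C:1, Q:2 mapping from the paragraph above. The four sample values are made up for illustration and are not the real Titanic rows.

```python
# Hypothetical sample of the Embarked column (not the real Titanic rows).
embarked = ["S", "C", "Q", "S"]

# The integer mapping used in the text above.
mapping = {"S": 0, "C": 1, "Q": 2}
encoded = [mapping[port] for port in embarked]

# Arithmetic on the codes now "works", but the results mean nothing:
# the gap between S and Q looks twice as large as the gap between S and C.
dist_s_c = abs(mapping["S"] - mapping["C"])
dist_s_q = abs(mapping["S"] - mapping["Q"])

print(encoded)             # [0, 1, 2, 0]
print(dist_s_c, dist_s_q)  # 1 2
```

Nothing about the ports justifies Q being "twice as far" from S as C is, yet that is what a model working with these numbers would see.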

So if we were to do that on the Embarked column, we would make three columns (Not really! Explained below). Each column would have either 1 or 0 as a value.

Three columns are made, named Embarked_C, Embarked_Q and Embarked_S, each containing 0's and 1's. You will notice that in any row only one of those three columns is 1. This is because the data is categorical and there can be only one value at a time.
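A minimal sketch of that layout, built by hand with pandas on a made-up four-row sample (the values are assumptions, not the real Titanic rows):

```python
import pandas as pd

# Hypothetical sample values, not the real Titanic rows.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 column per unique value, built by hand to make the idea explicit.
for port in sorted(df["Embarked"].unique()):
    df["Embarked_" + port] = (df["Embarked"] == port).astype(int)

dummy_cols = ["Embarked_C", "Embarked_Q", "Embarked_S"]

# Exactly one of the three columns is 1 in every row.
row_sums = df[dummy_cols].sum(axis=1)
print(df)
print(row_sums.tolist())  # [1, 1, 1, 1]
```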

Now one could ask whether there is really a need for three columns in this case. Well, the answer is NO. We only need two columns, as they are enough to determine what the value of the original column was. For example, if Embarked_C and Embarked_Q are both 0 in the first entry, the value of Embarked_S has to be 1, as there has to be a value present. (Note: We need to get rid of NaN values before this step.)
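The same reasoning in code, as a small sketch with made-up values for the two columns we keep:

```python
import numpy as np

# Hypothetical values for two of the three dummy columns.
embarked_c = np.array([0, 1, 0, 0])
embarked_q = np.array([0, 0, 1, 0])

# With no missing values, every row has exactly one 1 across the three
# columns, so the third column is fully determined by the other two.
embarked_s = 1 - embarked_c - embarked_q
print(embarked_s.tolist())  # [1, 0, 0, 1]
```

This only holds when every row has a value, which is why the NaN entries have to be handled first.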

The reason for dropping a column is that the n dummy columns always add up to 1 in every row, so any one of them can be computed from the rest. Many models, linear regression in particular, run into trouble with such redundant columns, so for a column with n unique values we keep only (n-1) dummy columns. We need to make sure that we remove one column so that we do not face this problem. It is called the Dummy Variable Trap, as the new columns that we are making are called dummy variables.
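One way to see the Dummy Variable Trap, as my own illustration rather than something from the original post: assume a linear model with an intercept column of ones, and check the rank of the input matrix with and without the extra dummy column. The four rows are made up.

```python
import numpy as np

# Hypothetical dummy columns for four passengers:
# Embarked_C, Embarked_Q, Embarked_S.
dummies = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])
intercept = np.ones((4, 1))

# All three dummies plus an intercept: 4 columns, but linearly dependent,
# because the three dummies add up to the intercept column.
full = np.hstack([intercept, dummies])

# Dropping one dummy removes the dependency.
reduced = np.hstack([intercept, dummies[:, 1:]])

print(np.linalg.matrix_rank(full), full.shape[1])        # 3 4
print(np.linalg.matrix_rank(reduced), reduced.shape[1])  # 3 3
```

With all three dummies the matrix has four columns but only rank 3, so one column carries no new information. After dropping one dummy, the rank matches the number of columns.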

There are two common methods for doing everything explained above.

1. Label Encoder and One Hot Encoder

In this method, the LabelEncoder class of the sklearn.preprocessing module is used to assign integers starting from 0, and OneHotEncoder from the same module is used to spread the values of a single column into one column per unique value.

With LabelEncoder, every value in the Embarked column is replaced by an integer.

On running the info function, you will find that 2 values in the Embarked column are NaN. So before applying LabelEncoder we need to either remove those entries (as there are only two of them) or replace them with the most frequent value, 'S'.
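A minimal sketch of this step, assuming a small made-up DataFrame in place of the Titanic data (the name `dataset` and its five rows are hypothetical). Note that LabelEncoder numbers the labels in sorted order, so the result is C:0, Q:1, S:2 rather than the S:0, C:1, Q:2 mapping used in the earlier example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with one missing value, standing in for the Titanic data.
dataset = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q", "S"]})

# Option 1 would be dataset.dropna(subset=["Embarked"]).
# Option 2, used here: replace NaN with the most frequent value.
most_frequent = dataset["Embarked"].mode()[0]
dataset["Embarked"] = dataset["Embarked"].fillna(most_frequent)

# LabelEncoder assigns integers in sorted order of the labels: C=0, Q=1, S=2.
le = LabelEncoder()
dataset["Embarked"] = le.fit_transform(dataset["Embarked"])

print(most_frequent)                 # S
print(dataset["Embarked"].tolist())  # [2, 0, 2, 1, 2]
```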

Now after Label Encoding, we have to convert the integer values to 1's and 0's in different columns. To do that we apply OneHotEncoder to the column.

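from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# categorical_features was removed from OneHotEncoder in scikit-learn 0.22,
# so ColumnTransformer picks the column (index 11, Embarked) instead.
ct = ColumnTransformer([('embarked', OneHotEncoder(), [11])], remainder='passthrough', sparse_threshold=0)
dataset1 = ct.fit_transform(dataset1)  # dense array, encoded columns come first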

After running this code we would get the individual columns for each unique entry.

2. pd.get_dummies method

This is an easier method than the above one as it directly gets us to the final step of removing the extra column.
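A short sketch of this method on a made-up DataFrame. The `Fare` column and all values are assumptions, included only to show where the new columns land. The `drop_first` parameter is an alternative to slicing the column off afterwards, and it removes the first dummy rather than the last.

```python
import pandas as pd

# Hypothetical sample; "Fare" is here only to show where the new columns land.
dataset = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 8.46, 8.05],
})

# The Embarked column is replaced by one dummy column per unique value,
# appended on the right of the remaining columns.
encoded = pd.get_dummies(dataset, columns=["Embarked"])
print(list(encoded.columns))
# ['Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']

# drop_first=True drops the first dummy (Embarked_C here) in the same call.
reduced = pd.get_dummies(dataset, columns=["Embarked"], drop_first=True)
print(list(reduced.columns))
# ['Fare', 'Embarked_Q', 'Embarked_S']
```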

The final step is removing the extra column. It takes only one line, but take care, as the line depends on which method you used to get there.

If you applied LabelEncoder and OneHotEncoder, you will get the dummy variables on the left side of the resulting array, and if you used the get_dummies method, you will get them on the right side of the dataframe.

So for the former method you need to run this code:

dataset1 = dataset1[:, 1:]

And for the latter (get_dummies returns a DataFrame, so positional slicing goes through .iloc):

dataset1 = dataset1.iloc[:, :-1]

In the above code, the first colon is for rows and the second for columns. So in our first example the 0th column is removed, and in our second example the last column. At this point, the encoded Embarked column is ready to be given to the model.
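Both slices can be tried on small made-up data. This is a sketch that assumes the first method leaves you with a NumPy array and the second with a pandas DataFrame, in which case positional slicing on the DataFrame goes through `.iloc`. All values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded data: dummies on the left, as after OneHotEncoder.
arr = np.array([
    [0, 0, 1, 7.25],
    [1, 0, 0, 71.28],
])
arr_reduced = arr[:, 1:]  # all rows, every column except column 0
print(arr_reduced.shape)  # (2, 3)

# Hypothetical encoded data: dummies on the right, as after get_dummies.
df = pd.DataFrame({
    "Fare": [7.25, 71.28],
    "Embarked_C": [0, 1],
    "Embarked_Q": [0, 0],
    "Embarked_S": [1, 0],
})
df_reduced = df.iloc[:, :-1]  # all rows, every column except the last
print(list(df_reduced.columns))  # ['Fare', 'Embarked_C', 'Embarked_Q']
```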

Hope you like this post and do tell if you find it useful. Everybody stay Awesome!
