Handling Missing Numerical Values

10/14/2018

Datasets often contain empty fields, and it happens far more often than we think. This mostly happens because, at the time of collecting the data, some fields that are not mandatory are left out. Take, for example, the Cabin values in the Titanic dataset. You will find a lot of empty values in that column, especially for class 3 passengers, perhaps because the poor were not given much importance at that time, or for some other such reason.

We cannot give this half-filled dataset to the model to train on, as most models accept only numerical inputs and an empty field is NaN (Not a Number). To get rid of these NaN values, we need to either delete the rows that contain them or fill them in with some value. The decision depends on the number of empty cells in the data, which can be checked easily with a single line of code that counts the NaN values in each column.
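A minimal sketch of such a check, assuming the data sits in a pandas DataFrame named dataset. The small frame below is made-up stand-in data, not the real Titanic file, and the name missing_share is only illustrative. With the real file loaded, the isnull().sum() call stays the same.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data (hypothetical values)
dataset = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Count the NaN values in each column
missing_counts = dataset.isnull().sum()

# Share of missing values per column, in percent
missing_share = dataset.isnull().mean() * 100

print(missing_counts)
print(missing_share)
```

The percentage view is often the more useful one for deciding between dropping and filling, since it does not depend on the size of the dataset.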

On the Titanic dataset, this check shows 177, 687 and 2 empty cells in the columns Age, Cabin and Embarked respectively. If there had been only a few empty cells, we would have chosen to just drop those rows, as it would not have made a difference. But as a big percentage of the values is missing, we will be filling them with appropriate values.


1. Deleting whole rows

This is the easy way to deal with the problem, as we just need to drop the rows that have NaN values. And yes, the DataFrame has a function for it that drops the rows containing them. This function has a lot of parameters that can be played with to salvage as much data as possible.

dataset.dropna()

This will delete all rows that have any number of NaN values, as by default the value of the parameter axis is 0 and the parameter how is 'any'. Note that it returns a new DataFrame rather than changing dataset itself.

By using the subset parameter, we can select the columns to check for NaN values, which is helpful when there are a lot of columns with empty fields. But as said, this is the easy way, so there are some drawbacks attached to it.

The main drawback of this method is that precious data is lost, as the whole entry is dropped. If a significant percentage of the data is missing, this method should not be chosen, as there will not be enough data left to train the model on.
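To make the trade-off concrete, here is a small sketch of how the parameters change the number of rows that survive. The frame is made-up stand-in data with hypothetical values, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data (hypothetical values)
dataset = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Default: axis=0, how="any" -> drop every row with at least one NaN
dropped_any = dataset.dropna()

# Only look at the Age column when deciding which rows to drop
dropped_age = dataset.dropna(subset=["Age"])

# how="all" keeps a row unless every one of its values is NaN
dropped_all = dataset.dropna(how="all")

print(len(dataset), len(dropped_any), len(dropped_age), len(dropped_all))
```

In this toy frame the default call keeps only one of the five rows, while restricting the check to Age keeps four, which shows how quickly data disappears when a sparsely filled column like Cabin is included.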

Thus comes the second method, where we try to fill the empty fields as well as possible.


2. Imputing the data


SimpleImputer is a class in the sklearn.impute module (older scikit-learn releases called it Imputer and kept it in sklearn.preprocessing, from where it has since been removed) which offers three statistical strategies to fill the empty values. We can fill them with the mean, the median or the mode of all the available values. Now as you would have noticed, the mean and the median need numerical values as calculations are performed on them, so string datatype columns cannot be imputed with those two strategies.

By default, the strategy is set to mean, which means that the empty values will be filled with the average of all the other values. We can give only selected columns to the imputer, which is helpful when we have non-numerical data with empty fields.

import numpy as np
from sklearn.impute import SimpleImputer

# Treat NaN as missing; the default strategy is the mean
im = SimpleImputer(missing_values=np.nan)

# Learn the mean of the Age column (column index 5)
im = im.fit(dataset.iloc[:, 5:6])

# Fill the missing ages with the learned mean
dataset.iloc[:, 5:6] = im.transform(dataset.iloc[:, 5:6])

This code initializes the im variable with the imputer, which is then fitted on column 5 (counting from 0), that is the Age column, and stores the calculated mean. The same variable is then used to transform that column, and the result is stored back in that very column.
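To see how the choice of strategy changes the filled value, here is a small sketch on made-up ages. It assumes a recent scikit-learn where the imputer is available as SimpleImputer in sklearn.impute. The numbers and the filled dictionary are hypothetical and only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages with two missing entries (hypothetical values)
ages = np.array([[20.0], [20.0], [np.nan], [30.0], [70.0], [np.nan]])

# For each strategy, record the value that replaces the first NaN (row 2)
filled = {}
for strategy in ("mean", "median", "most_frequent"):
    im = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = im.fit_transform(ages)[2, 0]

print(filled)
```

With the known ages 20, 20, 30 and 70, the mean fills in 35, the median 25 and the most frequent value 20. The single large age pulls the mean upwards, which is why the median can be the safer choice for skewed columns.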

Thus, if the missing values are numerical, it is not that hard to fill them in. Of course, the new values are just an approximation of the real values, but it is the best one can do without running any side models to predict them.

Still, there are a lot of barriers to cross before we train the model on the dataset. Next up is filling in missing categorical values.

Hope you like this post and do tell if you find it useful. Everybody stay Awesome! 

Jay Nankani - Blog
All rights reserved 2018