Feature Scaling

10/17/2018

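This is usually the last step of data preprocessing, because afterwards the data is no longer really legible and a human can hardly spot any pattern in it. The basic goal of this step is to bring all the features onto a comparable scale: with standardization, each column ends up centered around 0 with a standard deviation of 1, so most values land within a few units of 0.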

The original data is rarely on such a small, common scale. Numeric features are often spread over wide ranges, and scaling matters most for the features whose values are huge compared to the other features.

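The purpose of doing it: many machine learning models (k-nearest neighbors, k-means and SVMs, for example) rely on Euclidean distance. Euclidean distance is the straight-line distance between two points. In a two-dimensional plane with X and Y coordinates, it is the square root of the sum of the squared differences along each axis.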

Let's take the example of the Titanic dataset. In that data you will find the Age and Pclass columns. What we will do is compute the squared difference between two passengers' values in each column, since these squared differences are what go into the Euclidean distance.

square((Age1)-(Age2))=square(35-22)=169

square((Pclass1)-(Pclass2))=square(3-1)=4

As you can see, the two values differ a lot, and the greater one will dominate the smaller one. So the model will weigh Age more than the Pclass column, which is misleading, as Pclass is more strongly correlated with Survived than Age is. There is also a logical reason for this: first-class passengers were given priority over second and third class in getting on the lifeboats. One can say the same about Gender, as women were prioritized for the lifeboats.
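To put numbers on this, here is a small sketch in plain Python using the same two passengers (ages 35 and 22, classes 3 and 1). The helper name squared_diffs is only for illustration.

```python
import math

def squared_diffs(a, b):
    """Return the per-feature squared differences between two points."""
    return [(x - y) ** 2 for x, y in zip(a, b)]

# (Age, Pclass) for the two passengers used in the example above
passenger_1 = (35, 3)
passenger_2 = (22, 1)

parts = squared_diffs(passenger_1, passenger_2)  # [169, 4]
distance = math.sqrt(sum(parts))                 # Euclidean distance, sqrt(173)
age_share = parts[0] / sum(parts)                # fraction of the squared distance due to Age

print(parts, round(distance, 2), round(age_share, 3))
```

Age alone accounts for roughly 98% of the squared distance here, so on unscaled data Pclass barely influences which passengers look "close" to each other.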

Now that we have learned why we need to do this, implementing it in code is easy.

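from sklearn.preprocessing import StandardScaler

# Create the scaler (it standardizes each column to mean 0 and standard deviation 1)
sc_X = StandardScaler()

# Learn each column's mean and standard deviation, then rescale the data
dataset1 = sc_X.fit_transform(dataset1)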
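If you are curious what fit_transform does to each column, the sketch below reproduces the idea in plain Python: subtract the column mean and divide by the column's standard deviation. The ages are made-up sample values, not taken from the dataset, and standardize is a hypothetical helper, not part of scikit-learn.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

ages = [22, 35, 54, 2, 27]  # made-up sample values
scaled_ages = standardize(ages)

print([round(z, 2) for z in scaled_ages])
print(round(statistics.fmean(scaled_ages), 6), round(statistics.pstdev(scaled_ages), 6))
```

Note that the scaled values are not strictly confined to the range -1 to 1: a value far from the mean (54 in this sample) ends up above 1. What standardization ensures is that every column has the same center and spread.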

After running this code, every column is centered around 0 with a standard deviation of 1, so the values no longer resemble the original ages and classes.

As you can see, a human can no longer interpret anything from the data, but a machine learning model given the same data may well produce worthwhile results. Usually after this step, the data is ready to be given to the model for training.

Hope you like this post and do tell if you find it useful. Everybody stay Awesome! 

Jay Nankani - Blog
All rights reserved 2018