Splitting the dataset

10/16/2018

train_test_split is a function that used to live in the sklearn.cross_validation module. That module was deprecated in version 0.18 and removed in version 0.20, so the function should now be imported from sklearn.model_selection. It splits the dataset into two parts: one for training and one for testing.

You might be wondering why this is done. Why don't we just give the whole dataset to the model, since that would mean more data to train on? The split is necessary because, before we use the model to make predictions on unknown data, we need to check that its predictions are reasonably accurate. Since the dataset already contains the target values, we can hold back a small percentage of the whole dataset and use it to test the model.

You might also think that this could be done without splitting the dataset, by testing the model on the same data that we trained it on. That would not tell us how accurate the model really is, because the model has already seen that data and knows the answers. It would be like taking an exam made up of questions you have already seen. The model might also fail on data it has not seen, because it could be overfitted to the training data. (More on overfitting later.)
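To see the difference between the two ways of testing, here is a minimal sketch. The synthetic dataset from make_classification, the decision tree model, and the random_state values are my own assumptions for illustration; they are not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset: flip_y randomly flips about 20% of the labels
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)

# Hold back 20% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# An unrestricted decision tree can memorise its training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on data the model has not seen

print("accuracy on training data:", train_acc)
print("accuracy on test data:", test_acc)
```

The score on the training data is the "previously seen exam"; the score on the test data is the more honest estimate of how the model would do on new data.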

For all these reasons, it is recommended to train the model on one part of the data and test it on the other part.

You might have noticed the word cross_validation at the start. Cross-validation means that the training dataset (after the split) is further divided into parts. In each iteration a different part is held out for validation while the model trains on the remaining parts, so the model is trained and checked on different data every time.

This process is not required, but it is recommended because it helps us make sure the model is not learning the data and the results by heart and will be able to predict results on unseen data. Repeating the split several times also makes the accuracy we obtain more reliable.
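As a rough sketch of the cross-validation idea described above, scikit-learn's cross_val_score can do the repeated splitting for us. The iris dataset, the logistic regression model, the choice of 5 parts, and the random_state value are assumptions made for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Example dataset bundled with scikit-learn (an assumption for this sketch)
X, y = load_iris(return_X_y=True)

# First keep a test set aside, exactly as in the plain train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# Then divide the training data into 5 parts; each part is used once for
# validation while the model is trained on the other 4 parts
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)

print("accuracy on each part:", scores)
print("mean accuracy:", scores.mean())
```

The test set is not touched here, so it can still be used for a final check once a model has been chosen.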

Now that we understand how the splitting works, implementing it is easy. Here is the code to split the dataset.

from sklearn.model_selection import train_test_split

# X holds the features and y the target values, both loaded beforehand
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

Now our dataset is split into 80% training data and 20% test data. You can set the train/test ratio however you like, but make sure there is more training data than test data. A ratio of 80:20 is a common choice because it usually leaves enough data both to train on and to test on.
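If you want to check the sizes yourself, here is a small sketch. The toy arrays and the random_state argument are my own additions for illustration; random_state simply fixes the shuffle so that the same rows land in each part on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 50 rows with 2 features each, and 50 targets
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

print(X_train.shape, X_test.shape)  # 40 training rows and 10 test rows

# Splitting again with the same random_state gives the same rows
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, train_size=0.8, random_state=42
)
print(np.array_equal(X_test, X_test2))
```

Without random_state the rows are shuffled differently on each run, so your accuracy numbers can change slightly between runs.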

I hope you liked this post. Do tell me if you found it useful. Everybody stay awesome!
