Importing Dataset

10/13/2018

To make predictions or find patterns, we need data to feed to our models. Luckily, a lot of data is available online today that can be explored for interesting results. The question is how to get that data into your code.

The data that is available is usually stored in csv or json files. CSV stands for Comma Separated Values: a plain-text format with one row per line and commas between the values, which we can also view in Excel. But to manipulate the data and get results from it, we need it in a DataFrame variable. For that, the Pandas library has a function that reads a csv file into a DataFrame.
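To make the idea concrete, here is a minimal sketch of CSV text turning into a DataFrame. The names, ages and cities are invented for illustration, and the text is read from a string instead of a file so the example runs anywhere.

```python
import io
import pandas as pd

# A tiny, made-up CSV document held in a string (not real Titanic data).
# The first line is the header; every following line is one row.
csv_text = "Name,Age,City\nAsha,31,Pune\nRavi,27,Delhi\n"

# read_csv accepts a file path or a file-like object, so StringIO works here.
people = pd.read_csv(io.StringIO(csv_text))

print(people)
print(people.shape)  # (number of rows, number of columns)
```

Reading a real file works the same way; you pass the file name instead of the StringIO object.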

You might be wondering where to find such csv datasets online. Well, there is a website dedicated solely to this purpose. Kaggle is the place where you will find thousands of datasets ready to be downloaded. For starters, we will download the Titanic Survival Prediction dataset. This is the "Hello World" program of Machine Learning. Let us download the dataset from https://www.kaggle.com/c/titanic/data.

On downloading, you will find three files. The train.csv file will be given to our model so it can learn from the data and derive rules for prediction. The test.csv file is the one on which we will test the validity and accuracy of those derived rules.

To get train.csv into a DataFrame, we need to import the pandas library in our program, because pandas has a function read_csv which returns a DataFrame.

import pandas as pd
# Read train.csv from the current working directory into a DataFrame
dataframe = pd.read_csv('train.csv')

On running this code, you will get a variable named dataframe which holds the train dataset. Yes, it was as simple as that single line of code!

Note: read_csv('train.csv') looks for the file in the current working directory. If you are using Spyder, set the File explorer path to wherever the dataset is downloaded. If you are using Jupyter, store the dataset in the folder the notebook runs from, which is usually the folder where the notebook file itself is saved.
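If read_csv complains that it cannot find the file, a quick check like the following can help. It is only a sketch: it assumes the file is still named train.csv and simply reports where Python is looking.

```python
import os

# Assumption: the downloaded file kept its original name.
csv_name = "train.csv"

# Relative paths such as 'train.csv' are resolved against this directory.
working_dir = os.getcwd()
csv_path = os.path.join(working_dir, csv_name)

print("Working directory:", working_dir)
print("Dataset found:", os.path.exists(csv_path))
```

If the second line prints False, either move the file into the printed directory or pass the full path to read_csv.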

Now a DataFrame variable itself has a lot of functions which reveal properties of the dataset. Usually, to check whether the dataset was imported properly, the head() function is used, which will display the first 5 rows of data.
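Here is a small sketch of head() in action. The rows are made up and only borrow a few Titanic column names, so the example runs without the downloaded file; with the real data you would call dataframe.head() instead.

```python
import pandas as pd

# Made-up rows that borrow a few Titanic column names; not the real dataset.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 3, 1, 2],
    "Age":      [22.0, 38.0, 26.0, 35.0, None, 54.0, 2.0],
})

first_rows = sample.head()    # first 5 rows by default
first_three = sample.head(3)  # pass a number to change how many rows you get

print(first_rows)
```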

The info() function will list the names of the columns and their data types, along with the number of non-empty values in each column.
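A minimal sketch of info() on made-up rows (again, not the real dataset). The dtypes attribute and isnull().sum() are shown alongside it because they expose the same information as values you can inspect in code.

```python
import pandas as pd

# Made-up rows with one missing age; not the real dataset.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, 38.0, None, 35.0],
})

# Prints column names, non-null counts and data types.
sample.info()

# The same facts as values: the type of each column and the empty cells per column.
column_types = sample.dtypes
missing_per_column = sample.isnull().sum()

print(column_types)
print(missing_per_column)
```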

The corr() function will give you a table of values that tells us how correlated the features are. This table covers only the numerical features. The closer a value is to 1, the more strongly the two features are correlated, and the closer it is to -1, the more strongly they are inversely correlated. A value near 0 means there is little or no linear relationship between the two features.
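The sketch below computes a correlation table on made-up rows, so the numbers say nothing about the real Titanic data. It picks out the numeric columns explicitly before calling corr(), because newer pandas releases may raise an error when corr() is called on a DataFrame that still contains string columns.

```python
import pandas as pd

# Made-up rows that borrow a few Titanic column names; not the real dataset.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass":   [3, 1, 2, 3, 3, 1],
    "Fare":     [7.25, 71.28, 13.00, 8.05, 8.46, 51.86],
    "Sex":      ["male", "female", "female", "male", "male", "female"],
})

# Keep only the numeric columns; 'Sex' holds strings and is left out.
numeric_columns = sample.select_dtypes(include="number")

# Pairwise Pearson correlation between every pair of numeric columns.
corr_table = numeric_columns.corr()

print(corr_table)

# One column of the table, sorted, shows what moves with 'Survived'.
print(corr_table["Survived"].sort_values())
```

Every feature has a correlation of 1 with itself, which is why the diagonal of the table is all ones.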

As we can see, 'Pclass' is clearly related to 'Survived', which makes sense because first-class passengers were given priority over second- and third-class passengers in boarding the lifeboats.

So the dataset can be imported into our code this easily with the help of the Pandas library, but we cannot fit this data to our model just yet. Models do calculations on the dataset, so we need all values to be numerical. That means we will need to fill in the empty values and convert string values to float or int values.
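As a preview, here is one possible way to do both steps on made-up rows. Filling with the median and the particular 0/1 mapping are assumptions chosen for illustration; they are common choices, not the only ones.

```python
import pandas as pd

# Made-up rows with one missing age; not the real dataset.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 26.0, 36.0],
})

# Work on a copy so the original rows stay untouched.
cleaned = sample.copy()

# Fill the empty age with the median of the known ages.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Replace each string with an integer code (the codes themselves are arbitrary).
cleaned["Sex"] = cleaned["Sex"].map({"male": 0, "female": 1})

print(cleaned)
```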

Hope you like this post and do tell if you find it useful. Everybody stay Awesome!
