How to Split Data into Training and Testing Sets? With Examples

In machine learning, it is common practice to divide a dataset into two separate sets: a training set and a testing set. The model is fitted on the training set and evaluated on the testing set, so the two should be kept separate.

Why should we split our dataset?

If we don’t split the dataset into training and testing sets, we end up training and testing our model on the same data. A model tested on the data it was trained on tends to report good accuracy.

However, this doesn’t mean that the model will perform as well on unseen data. A model that does well on its training data but poorly on unseen data is said to be overfitting.

Overfitting is the case when your model fits the training dataset too closely: it picks up the noise and quirks of that particular data instead of the general pattern.

Overfitting is an undesirable phenomenon when training a model. So is underfitting.

Underfitting is the opposite case: the model is too simple to represent even the data points in the training dataset, so it performs poorly on both training and testing data.
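The gap between training and testing accuracy can be seen in a small experiment. The sketch below is separate from the Titanic walkthrough: it assumes a synthetic dataset from make_classification with deliberately noisy labels and an unpruned decision tree, both chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (illustrative only, not a real dataset)
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unpruned tree can memorise the training rows, noise included
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on held-out data

print(f"Training accuracy: {train_acc:.2f}")
print(f"Testing accuracy:  {test_acc:.2f}")
```

The training accuracy alone would suggest a perfect model; only the held-out rows reveal how much of that score comes from memorisation.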

How to split a dataset using sklearn?

Let’s see how we can use sklearn to split a dataset into training and testing sets. We will go over the process step by step.

1. Import the dataset

Let’s start by importing a dataset into our Python notebook. In this tutorial, we are going to use the Titanic dataset as the sample dataset. You can load it from the seaborn library in Python.
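A minimal sketch of this step, assuming seaborn and pandas are installed. load_dataset fetches the data over the network the first time it is called, so the snippet falls back to a tiny made-up frame (a hypothetical stand-in, not the real data) when that is not possible.

```python
import pandas as pd

try:
    import seaborn as sns
    # Downloads the dataset on first use and caches it locally
    titanic = sns.load_dataset("titanic")
except Exception:
    # Hypothetical stand-in with made-up values so the snippet still runs offline
    titanic = pd.DataFrame({
        "survived": [0, 1, 1, 0, 0],
        "pclass": [3, 1, 3, 2, 3],
        "sex": ["male", "female", "female", "male", "male"],
        "age": [22, 38, 26, 35, 54],
    })

# Preview the first five rows
print(titanic.head())
```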

Figure: Titanic dataset

2. Form input and output vectors from the dataset

Before we move on to splitting the dataset into training and testing sets, we need to prepare input and output vectors out of the dataset.

Let’s treat the ‘survived’ column as the output. This means that the model is going to be trained to predict whether a person survived or not.
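Selecting the output column could look like the sketch below. It assumes a small made-up stand-in for the Titanic frame so that it runs on its own; in the tutorial, titanic would be the frame loaded from seaborn.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame (made-up values)
titanic = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0],
    "pclass": [3, 1, 3, 2, 3],
    "age": [22, 38, 26, 35, 54],
})

# The output vector is the single column we want to predict
y = titanic["survived"]
print(y)
```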


We also need to remove the ‘survived’ column from the dataset to get the input vector.
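One way to do this is with the drop method of a pandas DataFrame. The sketch again assumes a small made-up stand-in for the Titanic frame.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame (made-up values)
titanic = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0],
    "pclass": [3, 1, 3, 2, 3],
    "age": [22, 38, 26, 35, 54],
})

# drop returns a new frame without the output column; titanic is unchanged
X = titanic.drop("survived", axis=1)
print(X)
```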


3. Deciding the split ratio

The split ratio represents what portion of the data will go to the training set and what portion of it will go to the testing set. The training set is almost always larger than the testing set.

A split ratio commonly used by data scientists is 80:20.

A split ratio of 80:20 means that 80% of the data will go to the training set and 20% of the dataset will go to the testing set.
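As a quick worked example, the row counts for an 80:20 split can be computed by hand. The dataset size of 1000 is a hypothetical figure, and rounding the test count up is a choice made for this sketch.

```python
import math

n_samples = 1000   # hypothetical dataset size
test_ratio = 0.2   # the "20" in 80:20

# Rows that go to the testing set, rounded up to a whole row
n_test = math.ceil(n_samples * test_ratio)

# Everything else goes to the training set
n_train = n_samples - n_test

print(n_train, n_test)  # 800 200
```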

4. Performing the split

To split the data, we are going to use train_test_split from the sklearn library.

train_test_split randomly distributes your data into training and testing sets according to the ratio provided.

We are going to use 80:20 as the split ratio.

We first need to import train_test_split from sklearn, where it lives in the sklearn.model_selection module.

To perform the split, use:
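A minimal sketch of the import and the call, assuming a small made-up stand-in for the Titanic frame so the snippet runs on its own. The random_state argument is an addition for this sketch; it fixes the shuffle so that the same rows land in each set on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Titanic frame (made-up values)
titanic = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    "pclass": [3, 1, 3, 2, 3, 1, 2, 3, 1, 3],
    "age": [22, 38, 26, 35, 54, 19, 40, 27, 31, 45],
})
y = titanic["survived"]
X = titanic.drop("survived", axis=1)

# test_size=0.2 sends 20% of the rows to the testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

The function returns the pieces in a fixed order: training inputs, testing inputs, training outputs, testing outputs.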

We have set test_size to 0.2, which means that the training size is 0.8, giving us our desired ratio.

5. Verify by printing the shapes of training and testing vectors

To verify the split, let’s print out the shapes of the training and testing vectors.
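The check could look like the sketch below. It assumes a made-up 10-row stand-in for the Titanic frame, so the shapes in the comments apply to that stand-in and not to the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row stand-in for the Titanic frame (made-up values)
titanic = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    "pclass": [3, 1, 3, 2, 3, 1, 2, 3, 1, 3],
    "sex": ["male", "female", "female", "male", "male",
            "female", "male", "female", "female", "male"],
    "age": [22, 38, 26, 35, 54, 19, 40, 27, 31, 45],
})
y = titanic["survived"]
X = titanic.drop("survived", axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)  # (8, 3): 8 rows, 3 input columns
print(X_test.shape)   # (2, 3)
print(y_train.shape)  # (8,)
print(y_test.shape)   # (2,)
```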


Complete code

The complete code for this tutorial is given below:
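The listing below is a reconstruction that puts the five steps together, not necessarily the exact code used to produce the original outputs. It assumes seaborn, pandas and sklearn are installed; the fallback frame and the random_state value are additions for this sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Import the dataset
try:
    import seaborn as sns
    titanic = sns.load_dataset("titanic")  # needs a connection on first use
except Exception:
    # Hypothetical stand-in with made-up values so the script still runs offline
    titanic = pd.DataFrame({
        "survived": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
        "pclass": [3, 1, 3, 2, 3, 1, 2, 3, 1, 3],
        "age": [22, 38, 26, 35, 54, 19, 40, 27, 31, 45],
    })

# 2. Form input and output vectors
y = titanic["survived"]
X = titanic.drop("survived", axis=1)

# 3 and 4. Split with an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Verify by printing the shapes
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```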

Conclusion

This tutorial was about splitting data into training and testing sets using sklearn in Python. We also discussed overfitting and underfitting to understand the need for splitting the data.

By admin
