One-Hot encoding is a technique of representing categorical data in the form of binary vectors. It is a common step in the processing of sequential data before performing classification.
One-Hot encoding is also one of the simplest ways to turn words into numbers: each word in the vocabulary gets its own binary vector. Word embedding refers more generally to the process of representing words as numeric vectors so that a machine can work with them.
It is common to convert a corpus into such vectors before feeding it to an LSTM model, because a numeric representation makes it easier for a computer to find relationships and patterns between words.
In this tutorial, we are going to understand what exactly One-Hot Encoding is and then use Sklearn to implement it.
Let’s start by taking an example.
Working of One-Hot Encoding in Python
Consider the following sequence of words.
    ['Python', 'Java', 'Python', 'Python', 'C++', 'C++', 'Java', 'Python', 'C++', 'Java']
This is sequential data with three categories.
The categories in the data above are as follows :
- Python
- Java
- C++
Let us try to understand the working behind One-Hot Encoding.
One-Hot Encoding is a two-step process.
- Conversion of Categories to Integers
- Conversion of Integers to Binary vectors
1. Conversion of Categories to Integers
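Let us convert the three categories in our example to integers. The categories are numbered in alphabetical order:
Category | Integer |
C++ | 0 |
Java | 1 |
Python | 2 |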
Now we can use these integers to represent our original data as follows :
    [2 1 2 2 0 0 1 2 0 1]
You can read this data with the conversion table above.
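The first step can be sketched in plain Python, without any library. This is only an illustration: the names `categories` and `category_to_int` are ours, and we assume the integers are assigned in alphabetical order of the category names, which matches the encoding shown above.

```python
data = ['Python', 'Java', 'Python', 'Python', 'C++',
        'C++', 'Java', 'Python', 'C++', 'Java']

# Collect the distinct categories and sort them alphabetically
categories = sorted(set(data))

# Build the conversion table: category -> integer
category_to_int = {category: index for index, category in enumerate(categories)}

# Replace every word in the sequence with its integer
integer_encoded = [category_to_int[word] for word in data]

print(category_to_int)   # {'C++': 0, 'Java': 1, 'Python': 2}
print(integer_encoded)   # [2, 1, 2, 2, 0, 0, 1, 2, 0, 1]
```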
Let’s move to the second step now.
2. Conversion of Integers to Binary vectors
This is not the usual integer-to-binary conversion. Instead, each integer becomes a vector with one entry per category: the entry at the index equal to the integer is set to one, and all the other entries are set to zero.
Let’s see what we mean by this :
Category | Integer | One-Hot vector |
C++ | 0 | [1, 0, 0] |
Java | 1 | [0, 1, 0] |
Python | 2 | [0, 0, 1] |
We can represent the data in our example as :
    [[0. 0. 1.]
     [0. 1. 0.]
     [0. 0. 1.]
     [0. 0. 1.]
     [1. 0. 0.]
     [1. 0. 0.]
     [0. 1. 0.]
     [0. 0. 1.]
     [1. 0. 0.]
     [0. 1. 0.]]
Our original sequence is now a 2-D matrix with one row per word and one column per category. This numeric form is something a machine learning model can work with directly.
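The second step can also be sketched by hand with NumPy. This is a minimal illustration, assuming the integer encoding from the first step and three categories. The trick is that row i of the identity matrix is exactly the one-hot vector for the integer i.

```python
import numpy as np

# Integer encoding of the example sequence (C++ = 0, Java = 1, Python = 2)
integer_encoded = np.array([2, 1, 2, 2, 0, 0, 1, 2, 0, 1])
num_categories = 3

# Pick one row of the 3x3 identity matrix for every integer in the sequence
one_hot = np.eye(num_categories)[integer_encoded]

print(one_hot.shape)   # (10, 3)
print(one_hot)
```

Every row contains a single one, and its position gives back the original integer.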
Python Code for Implementing One-Hot Encoding using Sklearn
Let’s move to the implementation part of One-Hot Encoding. We are going to use Sklearn for implementing the same.
We are going to follow the same two-step approach while implementing as well.
The steps are as follows:
- Use LabelEncoder to convert categories into integers.
- Use OneHotEncoder to convert the integers into One-Hot vectors (binary vectors).
Before we move further, let’s write the code for declaring the array with data in our example.
    import numpy as np

    data = ['Python', 'Java', 'Python', 'Python', 'C++', 'C++', 'Java', 'Python', 'C++', 'Java']
    vals = np.array(data)
1. Using LabelEncoder to convert Categories into Integers
We will first use LabelEncoder on the data. Let’s import it from Sklearn and then use it on the data.
The code for the same is as follows :
    from sklearn.preprocessing import LabelEncoder

    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(vals)
    print(integer_encoded)
Output :
    [2 1 2 2 0 0 1 2 0 1]
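To check which integer stands for which category, we can inspect the fitted encoder. The sketch below rebuilds the example so that it runs on its own. It uses the `classes_` attribute and the `inverse_transform` method of LabelEncoder. The `mapping` dictionary is just a helper name we introduce for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

vals = np.array(['Python', 'Java', 'Python', 'Python', 'C++',
                 'C++', 'Java', 'Python', 'C++', 'Java'])

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(vals)

# classes_ holds the sorted categories; the position of each one is its integer
mapping = {str(category): index
           for index, category in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform turns the integers back into the original labels
decoded = label_encoder.inverse_transform(integer_encoded)
print(decoded)
```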
2. Using OneHotEncoder to convert Integer Encoding into One-Hot Encoding
Now let’s convert the integer encoding to One-Hot encoding.
OneHotEncoder expects a 2-D array in which each row is a sample and each column is a feature. The integer encoding from LabelEncoder is 1-D, so we have to reshape it into a single column before providing it as an input to OneHotEncoder.
That can be done with the following lines of code :
    integer_encoded_reshape = integer_encoded.reshape(len(integer_encoded), 1)
    print(integer_encoded_reshape)
Output :
    [[2]
     [1]
     [2]
     [2]
     [0]
     [0]
     [1]
     [2]
     [0]
     [1]]
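A common shorthand for the same reshape is to pass -1 as the number of rows and let NumPy work it out from the length of the array. The short sketch below compares the shapes before and after. The variable name `column` is only for illustration.

```python
import numpy as np

integer_encoded = np.array([2, 1, 2, 2, 0, 0, 1, 2, 0, 1])

# -1 tells NumPy to infer the number of rows; 1 fixes a single column
column = integer_encoded.reshape(-1, 1)

print(integer_encoded.shape)   # (10,)
print(column.shape)            # (10, 1)
```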
Now we can use this data to make One-Hot vectors.
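    from sklearn.preprocessing import OneHotEncoder

    # sparse_output=False returns a dense NumPy array instead of a sparse matrix.
    # On scikit-learn versions older than 1.2, use sparse=False instead.
    onehot_encoder = OneHotEncoder(sparse_output=False)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded_reshape)
    print(onehot_encoded)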
Output :
    [[0. 0. 1.]
     [0. 1. 0.]
     [0. 0. 1.]
     [0. 0. 1.]
     [1. 0. 0.]
     [1. 0. 0.]
     [0. 1. 0.]
     [0. 0. 1.]
     [1. 0. 0.]
     [0. 1. 0.]]
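Both encoders can also be run in reverse, which is handy for turning a model's one-hot output back into readable labels. The following is a self-contained sketch of that round trip. To stay independent of the scikit-learn version, it leaves OneHotEncoder at its default sparse output and calls `toarray()` to get a dense array. The names `recovered_integers` and `recovered_labels` are ours.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

vals = np.array(['Python', 'Java', 'Python', 'Python', 'C++',
                 'C++', 'Java', 'Python', 'C++', 'Java'])

# Step 1: categories -> integers, reshaped into a single column
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(vals).reshape(-1, 1)

# Step 2: integers -> one-hot vectors (sparse by default, so make it dense)
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(integer_encoded).toarray()

# Undo step 2, then undo step 1
recovered_integers = onehot_encoder.inverse_transform(onehot_encoded)
recovered_labels = label_encoder.inverse_transform(recovered_integers.ravel())
print(recovered_labels)
```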
Complete Code
Here’s the complete code for this tutorial :
    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder

    # data
    data = ['Python', 'Java', 'Python', 'Python', 'C++', 'C++', 'Java', 'Python', 'C++', 'Java']
    vals = np.array(data)

    # Integer Encoding
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(vals)
    print(integer_encoded)

    # Reshaping into a single column for OneHotEncoder
    integer_encoded_reshape = integer_encoded.reshape(len(integer_encoded), 1)

    # One-Hot Encoding (on scikit-learn older than 1.2, use sparse=False instead)
    onehot_encoder = OneHotEncoder(sparse_output=False)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded_reshape)
    print(onehot_encoded)
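As a side note, OneHotEncoder in scikit-learn 0.20 and later can take string categories directly, so the LabelEncoder step is optional. The sketch below is an alternative to the two-step approach used in this tutorial, not a replacement for it. It again relies on the default sparse output plus `toarray()` so that it does not depend on the name of the sparse parameter.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = ['Python', 'Java', 'Python', 'Python', 'C++',
        'C++', 'Java', 'Python', 'C++', 'Java']

# OneHotEncoder still needs a 2-D input: one row per word, one column
vals = np.array(data).reshape(-1, 1)

onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(vals).toarray()

# categories_ lists the categories in the order of the output columns
print(onehot_encoder.categories_)
print(onehot_encoded)
```

The columns come out in the same alphabetical order (C++, Java, Python), so the resulting matrix matches the one produced by the two-step code.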
Conclusion
This tutorial was about One-Hot Encoding in Python. We understood how it works and used Sklearn to implement Label Encoding and One-Hot Encoding.