Previously, we discussed how chatbots work. In this part, we’ll start implementing a retrieval-based intent classification chatbot. We first look at what an intent is and how the classification works. In the upcoming tutorials, we’ll use the intent to respond to queries better.
So, first let’s start with what intent is.
What is Intent in Programming?
In programming, an intent is a framework-level mechanism in an operating system that helps coordinate the functions of different tasks to accomplish a goal.
- An intent is a messaging object that lets the software development environment perform late runtime binding between code in different applications.
- Its most significant use is in launching activities, where it can be seen as the glue between them.
- Intents provide a system of inter-application communication that facilitates coordination and reuse of components.
An intent is fundamentally a passive data structure that holds an abstract description of an action to be performed.
“The purpose is to turn on the lamp, and to do so, you execute the operation of flipping the switch to the position on.”
Steps to Create a Simple Chatbot
The dataset and code can be found on my GitHub: https://github.com/arkaprabha-majumdar/simple-chatbot/
1. Preparing the Dataset
First, let’s unzip the dataset folder and move into it. We also import the libraries we need, such as pandas.
!unzip "/content/MachineLearningContest.zip"
%cd /content/drive/MyDrive/MachineLearningContest
import pandas as pd
2. Read Input data
The input data is in JSON format, so let’s read it and display the first five rows using the head() method:
intents_data = pd.read_json("intents.json")
intents_data.head()
3. Creating Test Dataset
Then we will also read the testing data queries using pandas:
test_data = pd.read_excel("TestingData.xlsx")
test_data.head(10)
In the intents data, each cell of the 'variations' column holds a Python dictionary, and the number of entries varies from row to row.
5. Splitting Dataset for Better Intent Classification
So we’ll need to split each of these cells into multiple rows, one per dictionary entry. Let’s create four lists:
id_rows = []
keys = []
values = []
intent = []
Then we run two nested for loops to fill the corresponding lists. Study the loops to see the scope of each operation:
for row in range(intents_data.shape[0]):
    for key in intents_data['variations'][row].keys():
        id_rows.append(intents_data["id"][row])
        keys.append(key)
        values.append(intents_data['variations'][row][key])
        intent.append(intents_data['intent'][row])
Now we are ready to combine it back together into a dataframe:
df = pd.DataFrame({"id":id_rows,"query_key":keys,"query_val":values,"intent":intent})
If we view the dataset right now, we can see that the queries are separated into multiple rows that share a common “id” value:
df
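To see the reshaping on something small, here is a minimal sketch of the same idea on a made-up two-row table. The column names follow the tutorial, but the rows, the names `toy` and `toy_long`, and the list-comprehension form are assumptions for illustration, not the contents of intents.json.

```python
import pandas as pd

# Hypothetical stand-in for intents.json: same column names as the tutorial,
# but the rows are invented for this example.
toy = pd.DataFrame({
    "id": [1, 2],
    "variations": [
        {"q1": "hi", "q2": "hello there"},
        {"q1": "what is my balance"},
    ],
    "intent": ["greeting", "balance"],
})

# One output record per (row, dictionary entry) pair, like the nested loops.
records = [
    {"id": row.id, "query_key": key, "query_val": val, "intent": row.intent}
    for row in toy.itertuples(index=False)
    for key, val in row.variations.items()
]
toy_long = pd.DataFrame(records)
print(toy_long)
```

The first toy row has two variations and the second has one, so the result has three rows, and the two rows that came from the same original row keep the same id.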
6. Label Encoding
Now we’re ready to work on this dataset. So the first thing we’ll do is label encoding.
What is Label Encoding?
In machine learning, we typically deal with datasets in which one or more columns contain labels in the form of words (categorical data).
Label encoding means converting these labels into numeric form so that the computer can work with them.
Machine learning algorithms can then decide more effectively how those labels should be handled.
In supervised learning, it is an important pre-processing step for structured datasets.
7. Encoding Intent
For this we use sklearn:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
Then we add the label-encoded column to the dataframe:
df["intent_num"] = label_encoder.fit_transform(df['intent'])
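The integer codes are only useful if they can be turned back into intent names later. Here is a minimal sketch of that round trip, assuming a handful of made-up intent labels (`toy_intents` and `encoder` are hypothetical names, not part of the tutorial's dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration only.
toy_intents = ["greeting", "balance", "goodbye", "balance"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_intents)

# LabelEncoder stores the distinct labels in sorted order and uses each
# label's position in that order as its code.
print(encoder.classes_.tolist())                    # ['balance', 'goodbye', 'greeting']
print(codes.tolist())                               # [2, 0, 1, 0]

# inverse_transform maps codes (for example, model predictions) back to names.
print(encoder.inverse_transform([2, 0]).tolist())   # ['greeting', 'balance']
```

Because the codes follow sorted label order rather than any meaningful ranking, they are best treated as class identifiers for a classifier's target, not as quantities.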
8. TF-IDF Vectorization
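Once the intents have been encoded, we need to convert the query texts into word vectors.
Based on a vocabulary of N words, each query becomes a vector of length N. Where a word does not occur in the query, the entry is 0; where it does, the entry is a TF-IDF weight that grows with how often the word appears in the query and shrinks with how many queries contain it.
In NLP, transforming text into a meaningful vector (or array) of numbers is important.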
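from sklearn.feature_extraction.text import TfidfVectorizer

# Drop English stop words and any term that appears in more than 70% of the queries
Tfd = TfidfVectorizer(stop_words="english", max_df=0.7)
Tfd_train = Tfd.fit_transform(df['query_val'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(Tfd.get_feature_names_out().tolist())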
The full vocabulary, if you’re curious, is:
['aadhaar', 'aadhar', 'able', 'aboout', 'ac', 'accident', 'account', 'activate', 'activated', 'active', 'activity', 'add', 'address', 'advantages', 'allincall', 'allowed', 'allows', 'alowed', 'amazing', 'amb', 'annual', 'answer', 'app', 'application', 'applied', 'apply', 'applying', 'appointment', 'approved', 'approver', 'asking', 'atm', 'auto', 'autopay', 'autosweep', 'avail', 'available', 'average', 'away', 'awesome', 'bad', 'balance', 'bank', 'banking', 'benefits', 'billers', 'billpay', 'bills', 'birth', 'block', 'blocked', 'bond', 'bonds', 'book', 'booking', 'bot', 'bound', 'branch', 'browser', 'bye', 'byee', 'byeee', 'byeeee', 'came', 'cancel', 'card', 'carry', 'case', 'cash', 'cd', 'change', 'charge', 'charged', 'charges', 'cheat', 'check', 'checkboook', 'checker', 'checking', 'cheque', 'chequebook', 'clarify', 'close', 'closed', 'cnr', 'code', 'collateral', 'collect', 'coming', 'communication', 'complete', 'completed', 'completing', 'compulsory', 'confused', 'connection', 'consent', 'consumer', 'contact', 'cost', 'create', 'credit', 'credited', 'crn', 'current', 'cvv', 'cya', 'date', 'days', 'deactivate', 'debit', 'debited', 'debiting', 'debt', 'deception', 'declaration', 'deduct', 'deducted', 'define', 'demand', 'demat', 'deposit', 'deposited', 'detailed', 'details', 'did', 'didn', 'didnt', 'difference', 'different', 'difficulty', 'digital', 'disable', 'district', 'documents', 'does', 'doing', 'don', 'dont', 'download', 'draft', 'dreamdifferent', 'dropped', 'dth', 'duplicity', 'duration', 'earlier', 'edge', 'elaborate', 'electricity', 'email', 'employee', 'entities', 'error', 'estatement', 'excellent', 'experience', 'explain', 'expoan', 'facility', 'fantastic', 'fatca', 'fd', 'features', 'fee', 'fees', 'fetaures', 'finance', 'fixed', 'fkyc', 'folio', 'foreclose', 'forgot', 'forgotten', 'frame', 'fraud', 'fraudulent', 'free', 'freeze', 'frozen', 'fund', 'funding', 'funds', 'gets', 'getting', 'given', 'going', 'gold', 'good', 'goodbye', 'got', 'group', 
'guide', 'happened', 'haven', 'helful', 'hello', 'helloo', 'hellooo', 'help', 'helpful', 'hey', 'hi', 'hii', 'hiii', 'hiiii', 'history', 'hoax', 'home', 'horrible', 'id', 'ifsc', 'im', 'image', 'important', 'imps', 'imt', 'income', 'increase', 'india', 'information', 'initiate', 'installment', 'insurance', 'international', 'invest', 'investing', 'investment', 'investments', 'issues', 'joint', 'joke', 'journey', 'just', 'kidding', 'kind', 'kindly', 'know', 'kyc', 'larceny', 'legit', 'let', 'life', 'like', 'limit', 'limitations', 'limited', 'limits', 'link', 'list', 'lite', 'loan', 'loans', 'location', 'locked', 'login', 'long', 'lost', 'low', 'maintain', 'maintenance', 'make', 'maker', 'mandatory', 'marry', 'maturity', 'mb', 'mean', 'meaning', 'meant', 'medium', 'method', 'mf', 'middle', 'minimum', 'miserable', 'misplaced', 'mobile', 'money', 'monthly', 'mpin', 'mutual', 'nearest', 'necessary', 'necessay', 'necessity', 'need', 'needed', 'needs', 'neft', 'net', 'netflix', 'new', 'nice', 'nominee', 'normal', 'notice', 'number', 'numbers', 'offers', 'online', 'open', 'opened', 'opening', 'opt', 'optin', 'option', 'optout', 'outside', 'outstanding', 'oversees', 'package', 'paid', 'pan', 'passbook', 'password', 'pathetic', 'pay', 'payment', 'payments', 'pdc', 'pep', 'perfect', 'perform', 'performing', 'period', 'phone', 'physical', 'pin', 'pl', 'place', 'plan', 'poen', 'points', 'policy', 'possible', 'post', 'pre', 'premium', 'prepaid', 'prime', 'priority', 'problems', 'procedure', 'process', 'processing', 'proess', 'profile', 'proof', 'pros', 'protect', 'provide', 'provision', 'queries', 'query', 'rate', 'rates', 'ratio', 'rd', 'reach', 'receive', 'received', 'recent', 'recharge', 'recieve', 'recover', 'recurring', 'redeem', 'redeeming', 'reflect', 'related', 'replace', 'replacement', 'report', 'representative', 'request', 'require', 'required', 'requirement', 'reset', 'restart', 'restricted', 'robbed', 'saving', 'savings', 'saying', 'says', 'scam', 'score', 'secure', 
'set', 'share', 'shop', 'sip', 'smart', 'solution', 'soon', 'sovereign', 'specify', 'start', 'statement', 'statements', 'status', 'steal', 'steps', 'stole', 'stolen', 'stop', 'suggest', 'summary', 'sweep', 'sweepin', 'switched', 'systematic', 'tada', 'taken', 'takes', 'tell', 'tellme', 'term', 'thank', 'theft', 'things', 'throught', 'time', 'today', 'track', 'transaction', 'transactions', 'transfer', 'type', 'unable', 'unblock', 'understand', 'unhelpful', 'update', 'upgrade', 'upi', 'urgent', 'use', 'user', 'using', 'vary', 'vdc', 'video', 'vidoeo', 'view', 'virtual', 'visa', 'visit', 'visited', 'vpa', 'want', 'waste', 'ways', 'wish', 'withdraw', 'withdrawal', 'wonderful', 'work', 'working', 'ya', 'yo']
We’ll continue this implementation in the next part: Retrieval-based Intent Classification in Chatbots 3/4
Ending Note
If you liked reading this article and want to read more, follow me as an author. Until then, keep coding!