In this tutorial, we’ll move on to understanding factors in R programming. One operation we perform frequently in data science is the estimation of a variable based upon the model we built. We are sometimes required to estimate the price of a share or a house, and sometimes we need to estimate what color car is likely to be sold the fastest.
Variables in data science fall under two categories – continuous and categorical. Continuous variables are those that can take any numerical value, including fractional (floating-point) values. Prices of houses or shares and measurable quantities such as a person's age, weight or height are all continuous variables.
On the other hand, categorical variables take one of a fixed set of values that can be represented using a set of labels. Examples of this category are marital status, gender, the color of a vehicle, the highest educational degree of a person and so on.
Categorical variables are represented using factors in R.
Creating Factors in R
Factors can be created using the factor() function.
    factor(x = vector, levels, labels, ordered = TRUE/FALSE)
The first argument to the factor function is the vector x of values that you wish to factorize. Note that a factor is always one-dimensional: if you pass a matrix, its dimensions are dropped and the values are treated as a plain vector. In practice, x is usually a vector of character strings or integer values.
Secondly, you can supply the levels you need in the factor. Levels is a vector of the unique values that the factor is allowed to take. This is an optional argument; if you omit it, the levels are taken from the unique values of x.
The third argument is labels. Sometimes when you encode the variables as a vector of integers, you need to specify which integer represents which label. You could use 0 and 1 to represent male and female, but you then need to state that mapping using labels. In effect, labels are the key for looking up what each level means.
Finally, you have a Boolean-valued argument, ordered. Sometimes you may wish to retain the order among the levels used. For example, you may encode the month of joining using integers 1 to 12, to represent the months from January to December. In these cases, you need to set ordered to TRUE.
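As a quick illustration of how the levels and ordered arguments behave, here is a minimal sketch. The shirt-size data and the names sizes, size_levels and sizefact are hypothetical and used only for illustration. It shows that a declared level is kept even when it never occurs in the data, and that a value outside the declared levels becomes NA.

```r
# Hypothetical data: shirt sizes observed in a small sample.
sizes <- c("M", "S", "M", "XL")

# Declare every allowed level, including "L", which does not occur in the data.
size_levels <- c("S", "M", "L", "XL")
sizefact <- factor(sizes, levels = size_levels, ordered = TRUE)

levels(sizefact)   # all four declared levels, in the order given
table(sizefact)    # "L" is listed with a count of 0

# A value that is not among the declared levels becomes NA.
factor(c("M", "XXL"), levels = size_levels)
```

Declaring the levels explicitly is useful when a sample may not contain every category that you expect to see later.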
Let us look at examples of factors now.
    # Encode the genders of people into a vector first.
    # These would usually be extracted from a dataset.
    > genvector <- c("Male","Female","Female","Male","Male","Female")
    # Create a factor from this vector.
    > genfact <- factor(genvector)
    > genfact
    [1] Male   Female Female Male   Male   Female
    Levels: Female Male
Notice how the levels are automatically obtained from the vector’s unique values here. Let us try another example where we define male and female as 0 and 1 using labels.
    # Define a vector with 0 for Male and 1 for Female.
    > genvector2 <- c(0,1,1,0,0,1)
    # Assign labels Male and Female to 0 and 1 when creating a factor.
    > genfact2 <- factor(genvector2,levels=c("0","1"),labels=c("Male","Female"))
    > genfact2
    [1] Male   Female Female Male   Male   Female
    Levels: Male Female
Observe that the labels you have defined are displayed instead of the 0 and 1 values stored in the original vector.
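To see what the factor holds behind those labels, the following sketch rebuilds genfact2 and inspects it with a few base R functions. It is only an illustration, not part of the original example. Note that the internal integer codes are positions in the levels vector (1 and 2 here), not the original 0 and 1.

```r
# Rebuild the labelled factor from the example above.
genvector2 <- c(0, 1, 1, 0, 0, 1)
genfact2 <- factor(genvector2, levels = c(0, 1), labels = c("Male", "Female"))

levels(genfact2)        # the labels, in level order
as.integer(genfact2)    # internal codes: 1 for the first level, 2 for the second
as.character(genfact2)  # the labels as plain character strings
table(genfact2)         # number of observations per level
```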
Ordering in Factors in R Programming
Let us work through another example using the ordering of factor levels. Let us first define a vector representing the month of joining for 8 employees.
    > moj <- c("Jan","Jun","May","Jan","Apr","Dec","Nov","Sep")
Now, R has no way of knowing that May comes before Jun in the order of months. Character strings are compared alphabetically, and "Jun" sorts before "May", so the following code returns FALSE.
    > moj[2]>moj[3]
    [1] FALSE
To impose ordering, we need to define a vector with all the months in order first.
    > ordermonths <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
Now create a factor for our data using our moj vector, set the levels to ordermonths and set the argument ordered to TRUE.
    > factormoj <- factor(x=moj, levels=ordermonths, ordered=TRUE)
Now factormoj displays as follows.
    > factormoj
    [1] Jan Jun May Jan Apr Dec Nov Sep
    12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < ... < Dec
R now knows the ordering among the months. Let us check if it knows that May comes before June.
    > factormoj[2]>factormoj[3]
    [1] TRUE
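Once the levels are ordered, other order-aware operations follow the same calendar order. The sketch below rebuilds the factor from this section so that it runs on its own, then sorts it, filters it and finds its maximum. Treat it as an illustration rather than an exhaustive list of what ordered factors support.

```r
# Rebuild the ordered factor of joining months.
moj <- c("Jan", "Jun", "May", "Jan", "Apr", "Dec", "Nov", "Sep")
ordermonths <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
factormoj <- factor(moj, levels = ordermonths, ordered = TRUE)

sort(factormoj)               # sorted in calendar order, not alphabetically
factormoj[factormoj < "Jun"]  # months of joining that fall before June
max(factormoj)                # the latest month of joining
```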
Modifying Factors
Each element of a factor can be assigned a value individually using indexing, just like we index vectors. Let us modify a value in genfact, the factor we created earlier in the tutorial.
    > genfact
    [1] Male   Female Female Male   Male   Female
    Levels: Female Male
    > genfact[1]
    [1] Male
    Levels: Female Male
    > genfact[1]<-"Female"
    > genfact
    [1] Female Female Female Male   Male   Female
    Levels: Female Male
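Indexed assignment only accepts values that are already among the factor's levels. The following sketch works on a separate copy (the name genfact_copy is hypothetical, chosen so that genfact itself is left untouched) and shows what happens otherwise: R issues a warning and stores NA.

```r
# Work on a separate copy so the tutorial's genfact is not changed.
genfact_copy <- factor(c("Male", "Female", "Female", "Male", "Male", "Female"))

# "Other" is not one of the levels, so R warns and stores NA instead.
genfact_copy[3] <- "Other"

genfact_copy            # the third element is now <NA>
is.na(genfact_copy[3])  # TRUE
levels(genfact_copy)    # the levels are unchanged
```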
Adding New Levels to Factors
To add a new level that has not been defined earlier, you just need to extend the factor's levels vector in the following manner. Let’s try this on our existing genfact variable.
    > levels(genfact) <- c(levels(genfact),"Other")
    > genfact
    [1] Female Female Female Male   Male   Female
    Levels: Female Male Other
You can now assign the newly defined level “Other” to elements of the factor as well.
    > genfact[3] <- "Other"
    > genfact
    [1] Female Female Other  Male   Male   Female
    Levels: Female Male Other