Categorical Variable Encoding For Machine Learning
When working with real-world data, you will often find categorical data in your dataset. For example, a feature named sex may contain only the values male and female. Here, male and female are the two categories of that feature.
Most machine learning algorithms cannot work directly with this kind of string data; they expect numerical input. This is why categorical variables should be transformed into a numerical format. This process is known as categorical encoding, and it is a very important part of feature engineering.
Categorical encoding can be done in different ways, which can be grouped as follows:
Traditional Techniques: These are the techniques most commonly used for categorical variable encoding.
- One Hot Encoding
- Count or Frequency Encoding
- Ordinal or Label Encoding
Monotonic Relationship: These encoding techniques create a monotonic relationship between the categories and the target, hence the name. They include the following methods:
- Ordered Label Encoding
- Mean Encoding
- Weight of Evidence
Alternative Techniques: These include:
- Binary Encoding
- Feature Hashing
In this post, we’ll discuss the first two groups, the traditional and the monotonic relationship techniques, in detail.
One Hot Encoding:
This technique encodes each categorical variable as a set of Boolean variables that take the value 0 or 1, indicating whether a category is present for an observation. Most of the time, k-1 Boolean variables are used for the model, where k is the number of distinct categories of the variable, because the remaining category is implied when all the others are 0. In some special cases, such as tree-based algorithms or feature selection with recursive algorithms, all k variables can be used.
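As a minimal sketch of the idea, the pure-Python function below builds the 0/1 indicator columns by hand. The function name one_hot_encode, the drop_first flag, and the toy data are illustrative assumptions, not part of any particular library.

```python
def one_hot_encode(values, drop_first=False):
    """Return one dict of 0/1 indicator columns per observation."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    if drop_first:
        # Keep k-1 indicators; the dropped category is implied by all zeros.
        categories = categories[1:]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


sex = ["male", "female", "female", "male"]  # toy data, not a real dataset

full = one_hot_encode(sex)                      # k = 2 indicator columns
reduced = one_hot_encode(sex, drop_first=True)  # k-1 = 1 indicator column

print(full[0])     # {'is_female': 0, 'is_male': 1}
print(reduced[0])  # {'is_male': 1}
```

In the reduced version, a row with is_male equal to 0 can only be female, which is why k-1 columns carry the same information as k.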
Count or Frequency Encoding:
This is another useful technique for encoding categorical variables. Here, you count how many times each category appears in the feature and replace each category with that count, or sometimes with the fraction of observations it represents. For example, if a feature contains 5 categories and 100 observations in total, and category A appears 10 times, then every A is replaced by 10, or by its share of the observations, which is 0.1.
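The example above can be sketched in a few lines of Python. The helper name count_encode and the counts for categories B to E are made up for illustration; only the 10 occurrences of A out of 100 come from the example.

```python
from collections import Counter


def count_encode(values, normalize=False):
    """Replace each category by its count, or by its share if normalize=True."""
    counts = Counter(values)
    total = len(values)
    if normalize:
        return [counts[v] / total for v in values]
    return [counts[v] for v in values]


# Toy feature: 5 categories, 100 observations, A appears 10 times.
feature = ["A"] * 10 + ["B"] * 20 + ["C"] * 30 + ["D"] * 25 + ["E"] * 15

as_counts = count_encode(feature)
as_shares = count_encode(feature, normalize=True)

print(as_counts[0], as_shares[0])  # 10 0.1
```

One thing to keep in mind is that two categories with the same count receive the same encoded value, so the model can no longer tell them apart.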
Ordinal or Label Encoding:
In this technique, the categories are replaced by the integers 1 to n, where n is the number of distinct categories. Suppose you have five categories A, B, C, D, and E. For label encoding, you first assign a number from 1 to 5 to each category (A-1, B-2, C-3, D-4, E-5), then replace every A with 1, every B with 2, and so on.
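A minimal sketch of this mapping is shown below. The function name label_encode is hypothetical, and the alphabetical ordering is an assumption made only so that the result is reproducible; any fixed order would do.

```python
def label_encode(values):
    """Map each distinct category to an integer from 1 to n."""
    # Sorting fixes an arbitrary but reproducible order (alphabetical here).
    mapping = {c: i for i, c in enumerate(sorted(set(values)), start=1)}
    return [mapping[v] for v in values], mapping


feature = ["C", "A", "E", "B", "D", "A"]  # toy data
encoded, mapping = label_encode(feature)

print(mapping)  # {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}
print(encoded)  # [3, 1, 5, 2, 4, 1]
```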
Monotonic Relationship Technique
Ordered Integer Encoding:
In this technique, also called ordered label encoding, the categories are ordered according to the mean of the target within each category and then assigned a number from 1 to k, where k is the number of distinct categories. For example, if the target mean is 0.30 for A, 0.20 for B, and 0.15 for C, you replace A with 1, B with 2, and C with 3.
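The ordering step can be sketched as follows for a binary target. The helper names and the toy data are illustrative assumptions, and ranking from the highest target mean downwards is a choice made here to match the example; ranking upwards gives an equally monotonic encoding.

```python
from collections import defaultdict


def target_means(values, target):
    """Mean of the target within each category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(values, target):
        sums[v] += y
        counts[v] += 1
    return {c: sums[c] / counts[c] for c in sums}


def ordered_integer_encode(values, target):
    """Rank categories by target mean (highest first) and map them to 1..k."""
    means = target_means(values, target)
    ranked = sorted(means, key=means.get, reverse=True)
    mapping = {c: rank for rank, c in enumerate(ranked, start=1)}
    return [mapping[v] for v in values], mapping


# Toy data: target means are 0.75 for A, 0.50 for B, 0.25 for C.
feature = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
target = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

encoded, mapping = ordered_integer_encode(feature, target)
print(mapping)  # {'A': 1, 'B': 2, 'C': 3}
```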
Mean Encoding:
In this technique, the categories are replaced by the average target value of the category. For example, if 30% of the observations in category A have a target of 1, the target mean for A is 0.3, so every A is replaced by 0.3.
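A small sketch of mean encoding for a binary target follows. The function name mean_encode and the toy data are assumptions for illustration; category A is built so that 3 of its 10 observations have a target of 1.

```python
from collections import defaultdict


def mean_encode(values, target):
    """Replace each category by the mean of the target within that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(values, target):
        sums[v] += y
        counts[v] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[v] for v in values], means


# Toy data: 3 of 10 A rows and 4 of 5 B rows have target 1.
feature = ["A"] * 10 + ["B"] * 5
target = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 0]

encoded, means = mean_encode(feature, target)
print(means)  # {'A': 0.3, 'B': 0.8}
```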
Probability Ratio Encoding:
In this technique, for a binary target, the mean of the target is calculated for each category; this is the probability p(1) of the target being 1. The probability of the target being 0 is then p(0) = 1 - p(1). After that, the ratio of the two probabilities, p(1)/p(0), is calculated, and each category is replaced by its ratio. The ratio is undefined for a category in which p(0) is 0.
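The calculation can be sketched as below, assuming a binary 0/1 target. The function name probability_ratio_encode and the toy data are hypothetical, and raising an error when p(0) is 0 is just one possible way to handle that case.

```python
from collections import defaultdict


def probability_ratio_encode(values, target):
    """Replace each category by p(1) / p(0) computed within that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(values, target):
        sums[v] += y
        counts[v] += 1
    ratios = {}
    for c in sums:
        p1 = sums[c] / counts[c]  # probability of target == 1
        p0 = 1 - p1               # probability of target == 0
        if p0 == 0:
            raise ValueError(f"p(0) is 0 for category {c!r}; ratio undefined")
        ratios[c] = p1 / p0
    return [ratios[v] for v in values], ratios


# Toy data: p(1) is 0.75 for A, 0.50 for B, 0.25 for C.
feature = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
target = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

encoded, ratios = probability_ratio_encode(feature, target)
print(ratios["A"], ratios["B"])  # 3.0 1.0
```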
Weight of Evidence:
This technique is mostly used with financial or loan data. Weight of Evidence is the natural logarithm of the probability ratio, that is, the natural logarithm of the probability of the target being 1 divided by the probability of the target being 0. The equation is:
$$\mathrm{WoE} = \ln\left(\frac{p(1)}{p(0)}\right)$$
Then all the categorical values are replaced with the value obtained from the equation.
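As a rough sketch of Weight of Evidence as described here, the function below takes the natural logarithm of p(1)/p(0) within each category of a binary target. The name woe_encode and the toy data are assumptions, and rejecting categories where p(1) is 0 or 1 is one simple way to avoid an undefined logarithm.

```python
import math
from collections import defaultdict


def woe_encode(values, target):
    """Replace each category by ln(p(1) / p(0)) computed within that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(values, target):
        sums[v] += y
        counts[v] += 1
    woe = {}
    for c in sums:
        p1 = sums[c] / counts[c]  # probability of target == 1
        p0 = 1 - p1               # probability of target == 0
        if p1 == 0 or p0 == 0:
            raise ValueError(f"WoE is undefined for category {c!r}")
        woe[c] = math.log(p1 / p0)
    return [woe[v] for v in values], woe


# Toy data: p(1) is 0.75 for A, 0.50 for B, 0.25 for C.
feature = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
target = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

encoded, woe = woe_encode(feature, target)
print(round(woe["A"], 4), woe["B"])  # 1.0986 0.0
```

A positive value means the target is more likely to be 1 than 0 in that category, a negative value means the opposite, and 0 means both outcomes are equally likely.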
These are the basic techniques used to encode categorical variables during feature engineering.