Missing data imputation: Feature engineering

Machine learning algorithms work best with a complete dataset, but in real life you will rarely get one. Almost every dataset has some missing values: cells that are empty or marked NA. Many algorithms cannot use such cells directly, and the way they are handled affects the performance of the machine learning model.

So, how do we deal with this problem? This is where missing value imputation techniques come in: they produce a complete dataset by filling the missing data points with statistical estimates. There are several techniques for imputing missing values and getting a complete dataset for your model.

Most of the time a dataset has two types of variables: numerical and categorical. The missing value imputation techniques are a little different for these two cases.

For numerical variables we can use:

    • Mean/median imputation
    • Arbitrary value imputation
    • End of tail imputation

For categorical variables we can use:

    • Frequent category imputation or mode imputation
    • Adding a ‘missing’ category

There are some techniques that can be used for both numerical & categorical variables. Those techniques are:

    • Complete Case Analysis
    • Adding a ‘missing’ indicator
    • Random sample imputation

 

These techniques are very useful for getting a complete dataset for your machine learning model, but each of them has some limitations. We will now discuss each technique separately so that it is easier to understand, starting with the numerical variables.

 

Mean/Median imputation:

In this technique, you fill the NA values with the mean or the median of the observed values of the variable. This is one of the most commonly used techniques for getting a complete dataset. As a rule of thumb, you should not use it if more than about 5% of the values are missing, because the distortion grows with the share of imputed values and the model's performance suffers.

Remember that mean and median imputation affect the distribution of the data: every missing value is replaced by the same number, which reduces the variance. If the data is roughly normally distributed, the mean and the median are close to each other and either can be used. If the distribution is skewed, it is better to use median imputation, because the median is less affected by extreme values and distorts a skewed distribution less.
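As a minimal sketch of the idea, assuming pandas is used and using a made-up toy table (the column names `age` and `income` are hypothetical), the mean can fill a roughly symmetric variable and the median a skewed one:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'age' is roughly symmetric, 'income' has one extreme value.
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0, 30.0],
    "income": [20.0, 22.0, 25.0, np.nan, 200.0],
})

# pandas skips NaN by default, so the statistics use the observed values only.
age_mean = df["age"].mean()            # (25 + 30 + 35 + 30) / 4 = 30.0
income_median = df["income"].median()  # middle of 20, 22, 25, 200 = 23.5

# Work on a copy so the original data stays available for comparison.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(age_mean)
imputed["income"] = imputed["income"].fillna(income_median)

print(imputed)
```

In this toy data the mean of `income` would be 66.75, pulled up by the single value 200, which is why the median is the safer choice for the skewed column. When the data is split into training and test sets, the statistic is normally computed on the training set only and reused for the test set.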

Arbitrary Value Imputation:

This is another very easy way to impute the missing data points. You select an arbitrary value that is not close to the mean or median of the variable. Commonly used values are 0, 99, 999, 9999, -1, etc. Remember that although this technique is very easy, it distorts the distribution, and it may create outliers because the arbitrary value lies far above or below the typical values. It also distorts the variance and the covariance with other variables, so it is not always a good choice.
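A small sketch of arbitrary value imputation, assuming pandas and a made-up `age` column; the value -1 is only an illustrative choice that lies outside the observed range:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with observed values between 18 and 60.
ages = pd.Series([18.0, np.nan, 42.0, 60.0, np.nan], name="age")

# Assumed arbitrary value: -1 cannot be confused with a real age.
ARBITRARY_VALUE = -1

ages_imputed = ages.fillna(ARBITRARY_VALUE)

# Compare the spread before and after to see the distortion.
std_before = ages.std()          # computed on the 3 observed values
std_after = ages_imputed.std()   # computed on all 5 values, including the -1s

print(ages_imputed.tolist())
print(std_before, std_after)
```

The standard deviation grows after imputation because the two -1 values sit far away from the observed ages, which is the distortion of variance described above.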

End of Tail Imputation:

This technique is very useful because it uses statistical estimates instead of a hand-picked value. You calculate a value at the end of the tail of the variable's distribution and then impute all the missing points with it. The equation for this value depends on the distribution of the data:

For a normal distribution: value = mean ± 3 × standard deviation

For a skewed distribution: value = Q3 + 1.5 × IQR (upper tail) or Q1 − 1.5 × IQR (lower tail)

where Q1 and Q3 are the first and third quartiles and IQR = Q3 − Q1 is the interquartile range.
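The two rules can be sketched as follows, assuming pandas and a made-up numeric column. For the skewed case the sketch uses the common IQR proximity rule measured from the upper quartile, and it imputes at the upper tail only; both choices are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
values = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, np.nan, np.nan])

# Rule for a roughly normal variable: mean + 3 * standard deviation.
mean, std = values.mean(), values.std()
gaussian_tail = mean + 3 * std

# Rule for a skewed variable: upper quartile + 1.5 * interquartile range.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_tail = q3 + 1.5 * iqr

# Impute with one of the two tail values, here the IQR-based one.
values_imputed = values.fillna(iqr_tail)

print(gaussian_tail, iqr_tail)
print(values_imputed.tolist())
```

Both tail values lie beyond the largest observed value (18), so the imputed points are placed at the far end of the distribution rather than in its centre.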

 

Frequent Category Imputation:

This technique is commonly used for categorical variables, but it can also be used for numerical variables. You find the value that occurs most often in the variable (the mode) and fill all the missing values with it. This technique distorts the distribution, because the most frequent category becomes even more frequent, so you should not use it if more than about 5% of the values are missing. It is also called mode imputation.
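A minimal sketch of mode imputation, assuming pandas and a made-up `color` column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 'red' is the most frequent observed category.
colors = pd.Series(
    ["red", "blue", "red", np.nan, "green", "red", np.nan], name="color"
)

# mode() ignores missing values and returns a Series, because ties are possible;
# taking the first entry picks one of the tied values.
most_frequent = colors.mode().iloc[0]

colors_imputed = colors.fillna(most_frequent)

print(most_frequent)
print(colors_imputed.value_counts())
```

After imputation "red" accounts for 5 of the 7 rows instead of 3 of the 5 observed ones, which shows how the most frequent category gets over-represented.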

 

Missing Category Imputation:

This technique is very useful when the share of missing observations is large, for example around 40-50%. You create a new category, “Missing”, and assign it to all the missing values. This category is then treated like any other category in the variable. The advantage of this technique is that it leaves the existing categories untouched, so it distorts the distribution of the data less.
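As a short sketch, assuming pandas and a made-up `cabin` column in which half of the values are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 3 of the 6 values are missing.
cabin = pd.Series(["A", np.nan, "B", np.nan, np.nan, "A"], name="cabin")

# Give the missing entries their own explicit category.
cabin_imputed = cabin.fillna("Missing")

print(cabin_imputed.value_counts())
```

The counts of "A" and "B" stay exactly as they were, and the model can treat "Missing" as a category of its own.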

 

Random Sample Imputation:

This is one of the easiest techniques for imputing missing data. You draw random observations from the observed values of the variable and use them to fill the missing values. The problem with this technique is that every run draws different observations, so the imputed dataset changes and the model predicts differently each time. To avoid this, you can fix the seed of the random number generator (‘seeding’) so that the same values are drawn on every run.
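The sketch below shows one way to do this with pandas. The helper name `random_sample_impute` and the toy `height` column are hypothetical, and the seed value 42 is arbitrary:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
heights = pd.Series([150.0, np.nan, 165.0, 172.0, np.nan, 180.0], name="height")


def random_sample_impute(series, seed):
    """Fill NaN entries with values drawn at random from the observed entries."""
    missing = series.isna()
    observed = series.dropna()
    # A fixed random_state makes the draw repeatable across runs.
    drawn = observed.sample(n=int(missing.sum()), replace=True, random_state=seed)
    result = series.copy()
    # Use the raw values so the drawn rows' original index labels are ignored.
    result.loc[missing] = drawn.to_numpy()
    return result


first = random_sample_impute(heights, seed=42)
second = random_sample_impute(heights, seed=42)

print(first.tolist())
print(first.equals(second))
```

Calling the helper twice with the same seed gives the same imputed column, while a different seed may give different fills. Every imputed value is one of the observed heights, so the shape of the distribution is largely preserved.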

 

Complete Case Analysis:

I think this is the easiest way to get a complete dataset. It is also called list-wise deletion. You delete every row that has a missing observation. This gives a smaller dataset than the original, and if more than about 5% of the observations are missing the method is not recommended, because in many cases it throws away a lot of useful information. With less than 5% missing, this technique is acceptable.
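A minimal sketch with pandas, using a made-up table with the hypothetical columns `age` and `city`. It first measures how many rows would be lost and then drops them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: rows 1 and 2 each have one missing value.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Rome", np.nan, "Oslo"],
})

# Share of rows with at least one missing value; check this before deleting.
share_incomplete = df.isna().any(axis=1).mean()

# List-wise deletion: keep only the rows without any missing value.
complete_cases = df.dropna()

print(share_incomplete)
print(complete_cases)
```

In this toy table half of the rows are incomplete, far above the 5% guideline, so deleting them would discard a large part of the data. Checking `share_incomplete` first makes that cost visible.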

 

These techniques are the most useful ones for getting a complete dataset, but remember that most of them distort the distribution, which is not desirable for linear models. In that case, variable transformation can be used to bring the distribution into a more convenient shape. “End of tail imputation” and “Missing category imputation” are very useful for preserving the distribution: not exactly, but they work better than the others.

We hope you now have a basic understanding of each imputation technique. If you have any questions, let us know through a comment or the contact form so that we can help you.
