When collecting data, it is common for some entries to be absent. This can have a significant impact on any attempt to draw useful information from the data, so methods have been developed to make such data usable. The simplest is to discard any record that contains a missing entry. However, this can leave a sample too small to give reliable results. In addition, there may be reasons why certain groups of people do not want to supply certain information, so this approach can result in those groups being under-represented or ignored altogether.
To use data with absent entries, the absent entries are often filled in; this is called imputation. A variety of methods can be used for this, some of which are explained below.
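As a rough illustration of discarding incomplete records, the sketch below keeps only the records with no missing entries. The `survey` data and the use of `None` to mark a missing entry are assumptions made for the example.

```python
def complete_cases(records):
    """Keep only the records in which no entry is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]

# Hypothetical survey data; None marks a missing entry.
survey = [
    {"age": 25, "income": 30.0},
    {"age": 41, "income": None},
    {"age": None, "income": 55.0},
    {"age": 38, "income": 47.0},
]

kept = complete_cases(survey)
print(len(kept), "of", len(survey), "records remain")
```

Even in this tiny example, half of the records are lost, although each discarded record still contained one usable value.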
Unconditional mean imputation: The simplest method is to take the average of the observed values of a variable and use it in place of every missing value of that variable. Whilst convenient, this distorts the data: the filled-in values all sit at the centre, so the variable's spread is understated and its relationships with other variables are weakened.
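A minimal sketch of unconditional mean imputation follows, assuming missing entries are stored as `None`. The `incomes` list and the helper name `impute_mean` are made up for illustration. Comparing the variance before and after filling shows the kind of distortion involved.

```python
from statistics import mean, pvariance

def impute_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical incomes with two missing entries.
incomes = [20.0, 30.0, None, 50.0, None, 60.0]
filled = impute_mean(incomes)
observed = [v for v in incomes if v is not None]

# The mean is unchanged, but the variance shrinks because the
# imputed values all sit exactly at the centre of the data.
print("mean:", mean(observed), "->", mean(filled))
print("variance:", pvariance(observed), "->", pvariance(filled))
```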
Conditional mean imputation: Unconditional mean imputation can be improved upon by identifying a variable which seems to be related to the one with missing values and grouping the records according to this variable. The average of the variable with missing values is then calculated within each group and used to fill in the missing values in that group. Distortions are still present here, since every imputed value within a group is identical.
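The grouping idea can be sketched as follows. The `job` and `income` fields are hypothetical, and the sketch assumes that every group has at least one observed value (otherwise the lookup of the group mean would fail).

```python
from collections import defaultdict
from statistics import mean

def impute_group_mean(records, group_key, target_key):
    """Fill each missing target value with the mean of its record's group."""
    by_group = defaultdict(list)
    for r in records:
        if r[target_key] is not None:
            by_group[r[group_key]].append(r[target_key])
    # Assumes each group has at least one observed target value.
    group_means = {g: mean(vs) for g, vs in by_group.items()}
    return [
        {**r, target_key: group_means[r[group_key]]} if r[target_key] is None else dict(r)
        for r in records
    ]

# Hypothetical records grouped by occupation.
records = [
    {"job": "nurse", "income": 30.0},
    {"job": "nurse", "income": 34.0},
    {"job": "nurse", "income": None},
    {"job": "pilot", "income": 80.0},
    {"job": "pilot", "income": None},
    {"job": "pilot", "income": 90.0},
]

completed = impute_group_mean(records, "job", "income")
for row in completed:
    print(row)
```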
Regression imputation: This method involves identifying a variable which is related to the one with missing values, effectively plotting one against the other, and calculating a line of best fit. This line is then used to predict the missing values. Distortions are still present here, because every imputed value lies exactly on the line, which makes the relationship look stronger than it is.
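A sketch of regression imputation with a single predictor, using an ordinary least-squares line, might look like this. The `ages` and `incomes` data are invented for the example, and a straight-line relationship is assumed.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def impute_regression(xs, ys):
    """Fit the line on complete pairs, then predict each missing y from its x."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

# Hypothetical data: age is fully observed, one income is missing.
ages = [20, 30, 40, 50, 60]
incomes = [22.0, 31.0, None, 52.0, 59.0]

filled = impute_regression(ages, incomes)
print(filled)
```

The imputed income lies exactly on the fitted line, which is the source of the remaining distortion.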
Stochastic regression imputation: This method involves performing regression imputation as described above, but then shifting each imputed value by a random amount. This is intended to reflect the randomness in the data and reduce the distortion described above.
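One way to sketch the random shift is to draw it from a normal distribution whose spread matches the residuals of the fitted line. That choice of noise distribution is an assumption of this example, not something the description above prescribes, and the data are again invented.

```python
import random
from statistics import mean

def impute_stochastic_regression(xs, ys, rng):
    """Regression imputation plus normally distributed noise on each imputed value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    ox = [p[0] for p in pairs]
    oy = [p[1] for p in pairs]
    x_bar, y_bar = mean(ox), mean(oy)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
             / sum((x - x_bar) ** 2 for x in ox))
    intercept = y_bar - slope * x_bar
    # Residual standard deviation, with n - 2 degrees of freedom (assumed noise scale).
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    sigma = (sum(e * e for e in residuals) / (len(pairs) - 2)) ** 0.5
    return [
        intercept + slope * x + rng.gauss(0.0, sigma) if y is None else y
        for x, y in zip(xs, ys)
    ]

# Hypothetical data; a seeded generator keeps the example reproducible.
ages = [20, 30, 40, 50, 60]
incomes = [22.0, 31.0, None, 52.0, 59.0]

filled = impute_stochastic_regression(ages, incomes, random.Random(0))
print(filled)
```

Running it with a different seed gives a different imputed value, which is the property that multiple imputation relies on.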
To reflect the uncertainty in imputation, when using a method with some randomness (such as stochastic regression imputation) it can be useful to perform the imputation several times, producing several completed data sets. Each data set is analysed separately, and the results are then averaged, together with an estimate of how uncertain these averages are. This is called multiple imputation, and it is useful because it indicates how much uncertainty the missing data adds to the conclusions.
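The sketch below shows one possible version of this procedure for estimating a mean. It pools the results with the formulas commonly known as Rubin's rules, which is an assumption here since the text above does not name a pooling rule. It is also simplified: a fuller treatment would redraw the regression line itself for each imputation, not just the noise. The data and the choice of 20 imputations are illustrative.

```python
import random
from statistics import mean, variance

def impute_once(xs, ys, rng):
    """One stochastic regression imputation of the missing ys."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    x_bar = mean(p[0] for p in pairs)
    y_bar = mean(p[1] for p in pairs)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
             / sum((x - x_bar) ** 2 for x, _ in pairs))
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    sigma = (sse / (len(pairs) - 2)) ** 0.5
    return [
        intercept + slope * x + rng.gauss(0.0, sigma) if y is None else y
        for x, y in zip(xs, ys)
    ]

def pool(estimates, variances):
    """Combine per-data-set estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variation caused by the imputation
    return mean(estimates), within + (1 + 1 / m) * between

# Hypothetical data with two missing incomes.
ages = [20, 30, 40, 50, 60, 70]
incomes = [22.0, 31.0, None, 52.0, 59.0, None]

rng = random.Random(42)
estimates, variances = [], []
for _ in range(20):
    completed = impute_once(ages, incomes, rng)
    estimates.append(mean(completed))
    # Squared standard error of the mean within this completed data set.
    variances.append(variance(completed) / len(completed))

pooled_mean, total_var = pool(estimates, variances)
print("pooled mean:", pooled_mean, "standard error:", total_var ** 0.5)
```

The between-imputation term is what a single imputed data set cannot provide: it measures how much the answer moves when the missing values are filled in differently.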
When studying data, choosing which variables to study is important. There are well-established methods for this, but with missing data things are not quite as straightforward. Approaches range from simply applying a standard selection method to the imputed data, to altering the chance of each variable being selected according to how much of it is missing.
Overall, this is a wide area with a range of associated methods, only a few of which have been mentioned here. Continued research in this area is important in order to make collected data as useful as possible.