If you work with data, you have almost certainly encountered issues regarding missing data. The most commonly used statistical methods often only work with complete sets of observations and it can be a huge pain when you realize that you can only utilize a fraction of your data because of missing data points here and there. Indeed, just 5% random missing data in a huge dataset can dramatically reduce the number of complete observations to as low as 25% of the original number of observations! Scary! Thankfully, there are numerous algorithms that can help you tackle this issue, but often, the most commonly used methods, such as mean and median imputation, offer suboptimal solutions and sophisticated algorithms often need a lot of consideration and background knowledge. The authors wrote the missCompare package to help you through your missing data problem. The initial hypothesis of the authors was that datasets with different parameters (size, distributions, correlations, etc.) will benefit differently from various missing data imputation algorithms.
This tutorial will navigate you through the missCompare package and its functions. So… let’s jump right into it!
First, let’s begin by installing and loading the missCompare package!
To walk you through the missCompare missing data imputation pipeline,
the authors created the missCompare::clindata_miss
dataframe. As the name suggest, clindata_miss contains missing data. You
can load this dataframe into your workspace by:
The data is created by the authors with the purpose of generating a
dataset that resembles real-life data with variables generally available
in a simple, realistic clinical epidemiology dataset. This R dataframe
have realistic variable means, distributions, correlations and missing
data patterns. You can find further information on the dataframe using
?clindata_miss
.
…Some words on alternative imputation methods. Before you get yourself acquainted with missCompare and its functions, it is always a good idea to carefully observe, and do some thinking about your data, your variables/features and your missing data. Before trying fancy methods, there are some cases when deductive imputation can be applied. This is a form of imputation based on logical reasoning rather than statistical methods. Let’s consider four separate cases when this can be applied:
Constant variables: In this scenario, in a prospective dataset, sex information is available at a participant’s first visit in 1998 and third visit in 2005, but missing in 2000. With logical deduction, it can be concluded that biological sex is constant, therefore the missing data point is also Female
ID | Date | Sex | BMI |
---|---|---|---|
ID13 | 01-01-1998 | Female | 25.8 |
ID13 | 01-01-2000 | NA | 27.7 |
ID13 | 01-01-2005 | Female | 27.0 |
Prospective changes: In this scenario, in a prospective dataset, age information is available at a participant’s first visit in 1998, but missing in 2000 and 2005. With logical deduction, it can be concluded that missing age information is 25 and 30 for the missing two years
ID | Date | Age | BMI |
---|---|---|---|
ID13 | 01-01-1998 | 23 | 25.8 |
ID13 | 01-01-2000 | NA | 27.7 |
ID13 | 01-01-2005 | NA | 27.0 |
Cumulative variables: In this scenario, in a prospective dataset, there is information available on how many times the participant has been pregnant. This information is available at a participant’s last visit in 2005, but missing in 1998 and 2000. With logical deduction, it can be concluded that if the participant has never been pregnant in 2005, than she has never been pregnant earlier either, therefore the missing numbers are 0 and 0
ID | Date | Number of pregnancies | BMI |
---|---|---|---|
ID13 | 01-01-1998 | NA | 25.8 |
ID13 | 01-01-2000 | NA | 27.7 |
ID13 | 01-01-2005 | 0 | 27.0 |
Dependent variables / known formula: In some cases, there could be a known relationship between variables. Such case is when the variables height, weight and BMI are present in the data, and some values are randomly missing. As BMI is calculated as weight (kg) /height2 (m2), knowing the formula it can be deducted that the values below are 70 kg and 31.2 kg/m2. A similar scenario in clinical datasets would be missing laboratory lipid values, where LDL cholesterol values are calculated by the Friedewald formula from triglycerides, total cholesterol and HDL cholesterol values. Another classic scenario is linearly dependent variables, where the sum of two variables produce a third variable.
ID | Weight_kg | Height_meters | BMI |
---|---|---|---|
ID13 | NA | 1.65 | 25.8 |
ID15 | 74 | 1.78 | 23.4 |
ID16 | 89 | 1.69 | NA |
All of the above scenarios need prior knowledge of the data, which is “easy” for humans, but is very hard to compute into any single missing data imputation algorithm. Therefore, it is always suggested that the variables and their relationships are thoroughly understood before jumping into hardcore data imputation using a fancy algorithm.
Also, it is highly recommended to check data ranges. By running a
simple summary()
on your data you can check minimum and
maximum values for each numeric variables and browse through all
categories for factors. Oftentimes it happens that a variable - a blood
biomarker, for instance - has a minus value as its minimum. While this
can be OK in some cases (e.g. the variable is already normalized), but
sometimes this can be a data input error or missing data points can be
coded as minus values. In these cases, it is important that the minus
values are set to NA.
Another important aspect that should be considered are outliers. This tutorial is not meant to delve too deep into this, but it is suggested that outliers are visualized for each variables and are excluded if they are too extreme or represent non-sense values. E.g. an age value of 109 is definitely an outlier, but physiologically possible, but 309 is probably due to a data input error, therefore it should be set to NA.
The missCompare pipeline is comprised of the following steps:
missCompare::clean()
missCompare::get_data()
missCompare::simulated()
missCompare::all_patterns()
. These patterns are:
missCompare::MCAR()
- missing data occurrence randommissCompare::MAR()
- missing
data occurrence correlates with other variables’ valuesmissCompare::MNAR()
-
missing data occurrence correlates with variables’ own valuesmissCompare::MAP()
-
a combination of the previous three, where the user can define a pattern
per variablemissCompare::impute_simulated()
missCompare::impute_data()
missCompare::post_imp_diag()
.Now let’s look into these, step by step!
The first step before starting anything imputation related is getting to know your data and cleaning it if necessary. This is not an “automated” process, nor should it be. You always need to explore your data before missing data imputation - it is entirely up to you what to keep and what to drop from your dataset before the imputation process. After the cleaning steps, missCompare will essentially reconstruct your dataset with no missing data, and in order to achieve this, your dataframe cannot have variables of type character or factor (in a later stage you will be able to use your original factor variables, but to extract metadata at this step, a conversion is necessary).
Let’s take clindata_miss and run missCompare::clean()
.
In one step, you can achieve multiple cleaning steps:
cleaned <- missCompare::clean(clindata_miss,
var_removal_threshold = 0.5,
ind_removal_threshold = 0.8,
missingness_coding = -9)
#> Variable(s) sex, education converted to numeric.
#> Variable(s) PPG removed due to exceeding the pre-defined removal threshold (>50%) for missingness.
#> 1 individual(s) removed due to exceeding the pre-defined removal threshold (>80%) for missingness.
The cleaning function outputs a dataframe called
cleaned
. In the next step, where you extract important
metadata from your dataframe, you will continue working with this
object.
You will get to know important details of your cleaned dataframe using the following line:
If there are non-numeric variables in the data, a warning will appear: Warning! Variable(s) sex, education is/are not numeric. Convert this/these variables to numeric using missCompare::clean() and repeat the missCompare::get_data() function until no warnings are shown.
As we cleaned clindata_miss in the previous step, the function should
run without any warnings and errors. The output object,
metadata
is a list of the following elements.
metadata$Complete_cases
- the number of complete
observations with no missing data. In this case, there are 1349
observations from the original 2500 with no missing data. Note,
that with only ~7% missing data, the number of complete observations is
dramatically decreased to around half of the original data. As many
methods (e.g. regression, principal component analysis, etc.) require
complete sets of observations, this decrease in complete observations
presents a very strong case for missing data imputation.metadata$Rows
and metadata$Columns
give
you the dimensions of the dataframe, 2499 rows and 11
columns, in our case.metadata$Corr_matrix
- gives you the
correlation matrix of all variables in the dataframe.
The function assesses pairwise Pearson correlation coefficients using
complete sets of pairs between any two variables. For example there are
2312 complete BMI-waist pairs
(sum(!is.na(cleaned$BMI) & !is.na(cleaned$waist))
) - in
this case correlation will be calculated based on these 2312 value
pairs, even though the number of complete observations (no NAs for any
variables) is 1349. This way all of the data will be utilized to obtain
the most accurate correlation coefficients.metadata$Fraction_missingness
and
metadata$Fraction_missingness_per_variable
return the the
total fraction of missingness in the dataframe and the fraction of
missingness per variable, respectively. In our case, the total
fraction of missingness is 6.9% after removing one variable,
PPG with >50% missing. Inspecting the fraction of missingness per
variable it is also apparent that the true missing data fraction of TG
and HDL-C after converting -9s to NAs are 10.6% and 13.7%,
respectively.metadata$Total_NA
and
metadata$NA_per_variable
return the total number of missing
values in the dataframe and per variable.metadata$MD_Pattern
- The missing data pattern is
calculated using mice::md.pattern (?mice::md.pattern
).metadata$NA_Correlations
- Pairwise correlation
point-biserial correlation coefficients between variable pairs. Namely,
for each calculation, variables are converted to TRUE/FALSE based on
missing data status (missing or not missing) and these boolean vectors
and correlated to the native variables. The resulting correlation matrix
has rows with variable names + _is.na - the rows represent the converted
variables, while the columns are the original variables.metadata$NA_Correlation_plot
- A correlation plot based
on the previous object, metadata$NA_Correlations
.
metadata$min_PDM_thresholds
- This object is a small
dataframe that informs about the various min_PDM (minimum number of
observations per distinct missing data pattern) values. min_PDM will
need to be defined in further functions, so it is informative to get a
sense on what is the proportion of missing data patterns that should be
used in the simulation framework. In the table below we see that if this
number is too high (only those missing data patterns are considered that
describe 100 or more observations) than more than 80% of observations
with other missing data patterns will be ignored in the simulation
framework. On the other hand, if this number is set too low, e.g. below
5, than this may result in retaining almost all missing data patterns
that are applicable to only a few observations resulting in a very
complicated pattern that the algorithm might fail to apply to your
dataset. Therefore, this number will need to be set carefully later.
This table shows approximate percentages of observations with missing
data retained with setting min_PDM in later steps.#> min_PDM_threshold perc_obs_retained
#> 1 5 82
#> 2 10 78
#> 3 20 75
#> 4 50 53
#> 5 100 19
#> 6 200 0
#> 7 500 0
#> 8 1000 0
metadata$Vars_above_half
- In case there are variables
exceeding 50% missingness, this object will list these variables and the
function will output a warning: Warning! Missingness exceeds 50%
for variable(s) PPG. Although the pipeline will function with variables
with high missingness, consider excluding these variables using
missCompare::clean() and repeating function until no warnings are
shown.metadata$Matrix_plot
- Matrix plot (ggplot2 object)
visualizing missing data distribution. Missing data are
represented by gray color, while data is colored by value. There are
options to sort the dataframe by missingness or leave the dataframe
unsorted. This plot (hopefully) gives you a visual description of how
data is missing together and what fraction of the data is missing.metadata$Cluster_plot
- Dendrogram (ggplot2 object)
visualizing the hierarchical clustering of missing
data. On the plot, the clades (bifurcations) closer to Height 0
(the bottom of the plot) represent closer relationships in terms of
missing data - i.e. those variables in one group are more likely to be
missing together compared to the rest. In our case, blood pressures SBP
and DBP are missing together often and so as the three lipid traits, TC,
TG and HDL-C. With respect to the other variables, waist and BMI have a
closer relationship compared to the rest.Having learned the characteristics of your data, the next step is to
assemble a framework in which you can test the performance of various
missing data algorithms on simulated data. This can be achieved in one
step using missCompare::impute_simulated()
, but let’s look
under the hood and see how this function is operating in the background.
You can access these functions in case you would like to play with
parameters.
First, missCompare::simulate()
will create a dataset
that resembles your original dataframe, but with no missing data. The
dimensions of the simulated dataframe and all correlations between
variables will match the original data - essentially, the only major
difference between the simulated dataset and your original data is that
here, all variables are normally distributed with a custom set mean and
standard deviation. Let’s observe:
simulated <- missCompare::simulate(rownum = metadata$Rows,
colnum = metadata$Columns,
cormat = metadata$Corr_matrix,
meanval = 0,
sdval = 1)
Note how the first three arguments are elements of the output list
(metadata
) from the previous function
missCompare::get_data()
.
The output simulated
is a list, where the first item is
the simulated matrix and the next two objects are samples from the
correlation matrix of the original and the simulated data.
With this done, missing data can be spiked in in the simulated data using various patterns (MCAR, MAR, MNAR and MAP). This will be done with respect to the missing data fraction for each variable in the original dataframe.
MCAR missingness is simple, NAs are spiked in randomly. Please note
that in all functions below, min_PDM is set to 10. As it was shown
before when looking at outputs from
missCompare::get_data()
, this will result in the retention
of the most important missing data patterns in which approximately 78%
of all observations will be accounted for.
missCompare::MCAR(simulated$Simulated_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10)
MAR and MNAR missingness follow similar logic. With MNAR, NAs are spiked in based on the variable’s own values, while with MAR, NAs are spiked in based on all other variables’ values.
missCompare::MAR(simulated$Simulated_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10)
missCompare::MNAR(simulated$Simulated_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10)
MAP missingness is a possible combination of any of the previous three missing data pattern. You can define what pattern you assume per variable. In our case, let’s say that all variables are missing randomly, except for education (the last variable), which is missing not at random. This can be defined as:
missCompare::MAP(simulated$Simulated_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10,
assumed_pattern = c(rep("MCAR", 10), "MNAR"))
All patterns can be run with a single line using
missCompare::all_patterns()
.
Again, missCompare::impute_simulated()
does all these in
one step, and more:
1. it simulates the dataset with no missing values;
2. spikes in NAs in MCAR, MAR, MNAR and - if you define a custom pattern
- MAP patterns;
3. imputes missing data using a curated list of 16 algorithms;
4. compares computation time, evaluates imputation accuracy by
calculating RMSE, MAE and KS values between the imputed and the original
data points.
This all can be achieved by a single command! This means you DON’T
have to run missCompare::simulate()
,
missCompare::MCAR()
, missCompare::MAR()
,
missCompare::MNAR()
, missCompare::MAP()
or
missCompare::all_patterns()
. The command goes as (note that
we defined assumed_pattern = NA
- the default - so MAP
pattern will be skipped):
missCompare::impute_simulated(rownum = metadata$Rows,
colnum = metadata$Columns,
cormat = metadata$Corr_matrix,
MD_pattern = metadata$MD_Pattern,
NA_fraction = metadata$Fraction_missingness,
min_PDM = 10,
n.iter = 50,
assumed_pattern = NA)
The n.iter
argument controls how many times the process
will repeat itself (spiking in missing data, imputation and assessment
of imputation accuracy, that is).
When this is completed (depending on the size of you data, it can take some time!), you should obtain graphs that give you visual aid on which algorithms perform in the most desired fashion. The less the error, and the more the imputed values’ distributions match the original values’ distributions, the better - i.e. lower RMSE, MAE and KS D statistic values are better! Computation time of course can also play a role in your decision on what to use eventually!
Let’s look at some plots:
Please note that RMSE and MAE measures correlate strongly. It is interesting to consider that there are two different main imputation strategies - one group of methods focus more on prediction, while the others perform imputation per se. To give an intuitive explanation, while predictive algorithms set up models that try to predict missing data based on all the other data available and thereby often ignore the uncertainty of missing values, multiple imputation algorithms focus more on variables distributions.
Hopefully, by now you have a good idea on which method you would like
to use to impute missing data for your dataset - you will be able to do
this using missCompare::impute_data()
. Please note that
while some algorithms can safely handle factors (random replacement,
mice, mi, missForest, Hmisc, VIM), some do not (mean and median
imputation, missMDA algorithms; pcaMethods algorithms; Amelia II). If
you have factors and would like to use one of the methods that are able
to handle them, you can directly use your original dataset here (without
string variables). In case your choice is another method or you don’t
have factors in your original data, you can safely used the dataframe
output by missCompare::clean()
.
Let’s consider a scenario where you choose missForest to impute your data. You would like to consider the original factor variables, you do not want to scale your data and you would like to create 10 imputed dataframes in order to set up a multiple imputation framework. Note, that setting up a multiple imputation framework is only a viable option with probabilistic models (e.g. you can impute you data with Mean imputation a billion times, but the NAs will always be replaced with exactly the same mean values!).
imputed <- missCompare::impute_data(clindata_miss,
scale = F,
n.iter = 10,
sel_method = c(14)) # 14 is the code for missForest
The 10 imputed dataframes will be available under
imputed$missForest_imputation[[1]]
,
imputed$missForest_imputation[[2]]
,
imputed$missForest_imputation[[3]]
, etc…, while all the
other elements in the list will be empty (e.g. nothing in
imputed$median_imputation
).
Let’s see what happens when you similarly define 10 copies, but choose a non-probabilistic method - Mean imputation in this case.
imputed <- missCompare::impute_data(cleaned,
scale = T,
n.iter = 10,
sel_method = c(3)) # 3 is the code for mean imputation
#> [1] "Mean imputation - in progress"
You will see that the algorithm runs only 1 iteration for
Mean imputation, and there is only one object, under
imputed$mean_imputation[[1]]
. There are no
imputed$mean_imputation[[2]]
,
imputed$mean_imputation[[3]]
, etc… created at all.
Well done! You have imputed your data. Almost there, only one step remains - to assess how the imputation affected your dataset.
The function missCompare::post_imp_diag()
will help you
assess how the imputation affected important parameters of your
data.
Let’s run this with the results of the Mean
imputation algorithm
(imputed$mean_imputation[[1]]
):
The diag
list contains several interesting assessments.
The Histograms
and Boxplots
lists should give
an idea on the distributions of the original and the imputed values.
Perhaps the limitations of mean imputation are the most apparent on
these plots. For example, let’s inspect SBP levels (scaled to mean 0 in
the data!):
This, of course will look different for almost any other models, e.g. Amelia II:
The Barcharts
list only contains elements if there were
factors in the original data.
The Statistics
list shows the mean and SD values for the
original and imputed values per variables and calculates Welch’s t test
P values (testing for mean difference) and Kolmogorov-Smirnov test
statistic D (testing for unequal distributions) on the two groups of
values. Low Welch’s t test P values and high Kolmogorov-Smirnov test
statistics reflect worse results.
#> $age
#> Mean_original SD_original Mean_imputed SD_imputed Welch_ttest_P
#> 2.543198e-16 1.000000e+00 1.627856e-01 9.957892e-01 1.787111e-01
#> KS_test.D
#> 1.008829e-01
#>
#> $sex
#> Mean_original SD_original Mean_imputed SD_imputed Welch_ttest_P
#> 1.368959e-16 1.000000e+00 -7.973387e-02 9.996829e-01 5.068421e-01
#> KS_test.D
#> 4.349677e-01
The Variable_clusters_orig
and
Variable_clusters_imp
plots visualize the clusters of
variables before and after the imputation. The clusters should not
change too much!
Great, these two do look very similar with only very small differences.
Finally, the Correlation_plot
shows bootstrapped
correlation coefficients from the original data and the imputed data.
You can specify how many iterations you would like to run to obtain
bootstrapped correlation coefficients and 95% confidence intervals using
the n.boot
argument in the function. In this vignette this
is set to 5, which is far too low. The blue line has intercept 0 and
slope 1 and correlation coefficients (represented by the dots and the
red line) should align this line as much as possible. With mean and
median imputations, the line is often “turned clockwise”, with a lower
slope than 1 (regression to the mean).
#> `geom_smooth()` using formula = 'y ~ x'
Well done! You completed the tutorial and by now, you should ideally have one or more imputed datasets of your data. Awesome job! We hope that you achieved what you were set out to do and you are ready for the next steps in your statistical analysis! Please let us know if you have any issues here - missCompare GitHub - or contact the authors if you have any feedback!
– authors of missCompare