Mean Imputation in SAS Using the Hash Object

In the blog post Replace Missing Values With Mean in SAS, I demonstrate how to do mean imputation in SAS using Proc Stdize. This is the traditional way to do mean imputation in SAS. In this blog post, I will present an alternative using the SAS hash object. In the examples to come, I will use the simple example data set below.

data have;
input ID v;
datalines;
1 2
1 3
1 4
2 1
2 2
2 3
3 6
3 4
3 5
;

Calculate Means With the SAS Hash Object

Before we can impute missing values with group means, we must know how to compute the mean values in the data step using a hash object. Let us see a simple example. Consider the code below. First, I declare the hash object h. I use the ID variable as key because this is the variable that dictates the groups of interest. In the data variables, I specify ID, n, s, and m. Here, n is the number of non-missing values of v, s is the sum of v, and m is the mean of v.

Next, I use a DoW loop to read the entire data set. For each observation, I set n and s to missing and perform a lookup using the Find() method. If the current ID is already in the hash object, its current values of n, s, and m are now in the PDV. I do not want to consider missing values when I calculate the mean values. Therefore, I check whether v is missing. If it is not, I add 1 to n and v to s. Then I recalculate the mean m and write the values back to the hash object using the Replace() method. This takes care of both the case where the ID already exists in the hash object and the case where it does not. The logic of the Replace() method is this: if the key does not already exist in h, add it. If it does, replace its data.

Finally, I use the Output() method to write the means to a SAS data set.

data _null_;
   dcl hash h(ordered : "A");
   h.definekey("id");
   h.definedata("id", "n", "s", "m");
   h.definedone();
 
   do until (z);
      set have end = z;
      call missing(n, s);
      rc = h.find();
      if not missing(v) then do;  /* explicit missing check, so v = 0 is still counted */
         n + 1;
         s + v;
         m = divide(s, n);
         h.replace();
      end;
   end;
 
   h.output(dataset : "mean_hash(drop = n s)");
run;
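
The add-or-replace behavior of the Replace() method described above can be seen in isolation with a minimal sketch. The key k, the data variable d, and the values are made up for illustration and are not part of the example data.

```sas
data _null_;
   dcl hash h();
   h.definekey("k");
   h.definedata("k", "d");
   h.definedone();

   k = 1; d = 10; h.replace();   /* key 1 is not in h yet, so the item is added   */
   k = 1; d = 20; h.replace();   /* key 1 is already in h, so its data is replaced */

   n_items = h.num_items;        /* number of items currently stored in h */
   call missing(d);              /* clear d so the value printed below comes from h */
   rc = h.find();
   put n_items= d=;
run;
```

The log should show a single item with d equal to 20, because the second call overwrites the first instead of adding a second item.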

You can verify that the mean values are correct with the Proc Summary step below. The hash object step and the Proc Summary step create the same results.

proc summary data = have nway;
   class id;
   var v;
   output out=mean_summary(drop = _:) mean=;
run;
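
If you prefer to compare the two results programmatically rather than by eye, one option is Proc Compare. This is a small sketch that assumes both steps above have been run, so that mean_hash (with the mean in m) and mean_summary (with the mean in v) exist. The Var and With statements pair the two differently named variables.

```sas
/* Both data sets are in ascending ID order, so ID can be used to match rows */
proc compare base = mean_summary compare = mean_hash;
   id id;
   var v;    /* mean from Proc Summary      */
   with m;   /* mean from the hash object   */
run;
```

For means that are not whole numbers, tiny floating-point differences are possible, and the Criterion= option of Proc Compare can be used to allow a small tolerance.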

Mean Imputation Using the Hash Object

Now we know how to compute mean values using the SAS hash object. Next, let us take it a step further and do the mean imputation. I change the example data above in two ways. I add three observations where v is missing, one for each ID. Furthermore, I shuffle the observations to demonstrate that the method handles unsorted data.

data have;
input ID v;
datalines;
2 2
1 2
2 .
3 5
1 4
2 1
3 4
1 .
2 3
1 3
3 .
3 6
;

The code is not that different from the example above. Once we know how to compute the statistics, the imputation part is not difficult. The one change in the first loop is that I store the mean in the hash object under the name v instead of m. We then add another DoW loop that reads the input data again. If we encounter a missing value of v, a simple lookup in h with the Find() method overwrites v in the PDV with the mean for the current ID.

Run the code below and verify that the correct values are imputed in the data.

data want(drop = s n rc);
   dcl hash h(ordered : "A");
   h.definekey("id");
   h.definedata("id", "n", "s", "v");
   h.definedone();
 
   do until (z1);
      set have end = z1;
      call missing(n, s);
      rc = h.find();
      if not missing(v) then do;  /* explicit missing check, so v = 0 is still counted */
         n + 1;
         s + v;
         v = divide(s, n);
         h.replace();
      end;
   end;
 
   do until (z2);
      set have end = z2;
      if missing(v) then rc = h.find();  /* capturing rc avoids a log error when an ID has no non-missing v */
      output;
   end;
run;
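
For the example data, the group means are 3, 2, and 5 for IDs 1, 2, and 3, so these are the values that should replace the three missing values. As a cross-check, the sketch below imputes the same means with Proc Stdize, the traditional approach mentioned in the introduction, and compares the two results. The data set names have_sorted, want_stdize, and want_sorted are assumptions made for this illustration. Proc Stdize needs sorted input for By-group processing, so both data sets are sorted by ID first.

```sas
proc sort data = have out = have_sorted;
   by id;
run;

/* Reponly replaces only the missing values, here with the mean within each ID */
proc stdize data = have_sorted out = want_stdize method = mean reponly;
   by id;
   var v;
run;

/* Proc Sort keeps the original order within each ID by default (Equals option) */
proc sort data = want out = want_sorted;
   by id;
run;

proc compare base = want_stdize compare = want_sorted;
run;
```

Because both sorts keep the original order within each ID, the two data sets line up observation by observation and Proc Compare should report no unequal values.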

Summary

In this post, we explored how to do mean imputation in SAS using the hash object. Once the statistics are placed in the hash object, the imputation is a mere lookup away. It does not take a certified SAS programmer to see the flexibility of this approach. The technique is not limited to mean values, and I encourage you to calculate other statistics and impute them in the data.
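
As one possible variation, the sketch below imputes the group maximum instead of the group mean. It follows the same two-pass pattern as the mean imputation above. The names want_max and mx are chosen for this illustration only.

```sas
data want_max(drop = mx rc);
   dcl hash h();
   h.definekey("id");
   h.definedata("mx");
   h.definedone();

   /* Pass 1: keep the largest non-missing v seen so far for each ID */
   do until (z1);
      set have end = z1;
      if not missing(v) then do;
         rc = h.find();
         if rc ne 0 then mx = v;   /* first non-missing value for this ID */
         else mx = max(mx, v);
         h.replace();
      end;
   end;

   /* Pass 2: fill a missing v from the hash; leave it missing if the ID has no entry */
   do until (z2);
      set have end = z2;
      if missing(v) then do;
         rc = h.find();
         if rc = 0 then v = mx;
      end;
      output;
   end;
run;
```

With the example data, the missing values for IDs 1, 2, and 3 should be filled with 4, 3, and 6 respectively.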

If you want to learn more about computing statistics using the hash object, read the article Data Aggregation Using the SAS Hash Object and chapter 8 in the hash object bible Data Management Solutions Using SAS Hash Table Operations.

You can download the entire code from this post here.