Remove Duplicate Observations In SAS with the Hash Object
In the blog post Remove Duplicate Observations in SAS, I demonstrate a few common approaches to removing duplicate observation from a SAS data set. However, common to both approaches is that they require a sorting algorithm to invoke. As we have seen multiple times on the blog, sorting is time and CPU costly. Therefore, we should avoid sorting when possible. This post demonstrates an efficient alternative to remove duplicate values from a SAS data set with the SAS hash object.
In the examples to some, I will make use of the following made up data set.
data testdata(drop=i j); array vars var1-var5; do i=1 to 10e6; do j=1 to dim(vars); vars[j]=rand('integer', 1, 10); end; output; end; run;
PROC SORT Nodupkey Approach
The most common way to delete duplicates in SAS is to use PROC SORT with the Nodupkey Option. Below, I do so and specify the _ALL_ keyword in the By Statement. This means that I delete all observations that has exact duplicates in all variables.
proc sort data=testdata out=test1 nodupkey; by _ALL_; run;
However, even though I run the Sort Procedure in Multiple threads, the proces takes about 10 seconds. Let us see if we can beat that.
Hash Object Approach
Now, instead of a procedure, I use a data step. I first declare a hash object with all the relevant variables as key variables. Now, the trick is a very simple application of Using the Hash Object as a Dynamic Placeholder.
I use the Set Statement to read the input data. Next, I use the Check() Method to invoke a hash search and check if the key has been encountered before. If not then I simply add it to the hash object and output the observation. If it hash already been encountered, it will be present in the hash object. Consequently, the Check Method() will return a zero value and nothing happens.
data test2; if _N_ = 1 then do; declare hash h(hashexp:20); h.defineKey('var1', 'var2', 'var3', 'var4', 'var5'); h.defineDone(); end; set testdata; if h.check() ne 0 then do; output; h.add(); end; run;
The result from the above data step is an unduplicated version of the input data set. Furthermore, no sort algorithm is invoked. This means that the resulting data set is in the original order, which may be desirable. The above proces tool about 5 seconds to run. Half of the original elapsed time.
This post provided a hash object alternative to the usual approaches to removing duplicate values from a SAS data set. We saw that quite a lot of time was saved with the hash object approach. This is due to the fact that no sorting is required. This serves as a nice example of just how flexible the hash object is, when we do not only think of it as a mere lookup table.
You can download the entire code from this example here.