Count Distinct Values in SAS With the Hash Object

Count distinct values in SAS makes most programmers think of Proc SQL. However, there are other ways to count distinct values within groups in SAS. In this post, we investigate how to do so in the data step using the hash object.

In the examples to come, we use the example data below.

data have;
input id x;
datalines;
1 1
1 2
1 2
2 3
2 3
3 4
3 5
3 6
;

Count Distinct With Proc SQL

As a reference point, let us see what the usual count distinct looks like in Proc SQL. Simply select the group variable id and use the Count (distinct) function to get the desired result. Remember to use the Group By Clause and specify ID here as well.

proc sql noprint;
   create table want as
   select id, 
          count(distinct x) as unq
   from have
   group by id;
quit;

Count Distinct With the Hash Object in SAS

Next, let us see how to do the same thing with the SAS hash object. In this approach, I specify 2 hash objects, h and u. H is keyed by only id. U is keyed by id and x. Now, we simply read the entire input data set using a DoW Loop. First, I look up the input value in h with the Find() Method. If the ID is found, the unique count (unq) is read into the PDV. If not, we set unq to missing. Next, we check whether the id and x combination has been encountered before. If it has, nothing happens. If it has not, then the distinct count must go up by 1 and be added to u. Finally, we use the Replace() Method to update the hash object h.

With this approach, we store the results in the hash object h itself. Therefore, the last thing we do is to output the content of h to the result data set.

data _null_;
   dcl hash h (ordered: "A");
   h.definekey ("id");
   h.definedata ("id", "unq");
   h.definedone ();
 
   dcl hash u ();
   u.definekey ("id", "x");
   u.definedone ();
 
   do until (z);
      set have end = z;
      if h.find() ne 0 then call missing (unq);
      if u.check() ne 0 then do;
         unq = sum (unq, 1);
         u.add();
      end;
      h.replace();
   end;
 
   h.output (dataset: "count_distinct");
   stop;
run;

This approach is similar to the approach in the article Data Aggregation Using the SAS® Hash Object by Don Henderson and Paul Dorfman.

A Hash of Hashes Approach

Next, let us see an alternative to the approach above. Here, I use the Hash of Hashes Technique. The program snippet is split into 4 overall parts:

First, we declare and instantiate the hash object to contain (point to) other hash objects. We call this HoH. Furthermore, we declare an Iterator Object and link it to HoH.
Next, we declare the hash object h. Notice the absence of parentheses. This means that it is a declaration only. We do not create an instance of h yet. All we do is create a hash type variable in the PDV. You can read more in the blog post Declare and Instantiate the SAS Hash Object.
In part three, we first use a DoW Loop to read the input data sequentially. Next, we look up the current value of ID in HoH. If it is not found, we create an instance of h for that id and Add it to HoH. Notice that h is key variable in HoH. Consequently, we create a separate hash object instance of h for each distinct value of ID. Finally, I use the Ref() Method on h (which has id and x as key variables). This is important because the Ref() Method adds the item if not already found. And if it is already found in h, it is not unique and we do not want to count it.
Finally, we use the iterator i and traverse the entire HoH Object. I know that in each hash object that HoH points to, there are only as many items as there are unique values of x. Remeber, that we call the Ref() Method above. Therefore, I can use the Num_Items Attribue to five me the distinct counts. For each traversal step, I simply output to the result data set.

data want;
   dcl hash hoh (ordered : "Y");
   hoh.definekey("id");
   hoh.definedata("h", "id");
   hoh.definedone();
   dcl hiter i ("hoh");
 
   dcl hash h;
 
   do until (lr);
      set have end=lr;
      if hoh.find() ne 0 then do;
         h = _new_ hash (multidata : "Y");
         h.definekey ("id", "x");
         h.definedone();
         hoh.add();
      end;
      h.ref();
   end;
 
   do while (i.next() = 0);
      unq = h.num_items;
      output;
   end;
run;

Finally, you can see more examples of the Hash of Hashes technique in the blog post A few Hash of Hashes Examples in SAS.

Summary

I this post, we investigate how to count distinct values in the data step using the hash object in SAS. We see two different approaches. One using two objects and one using the Hash of Hashes technique. We verify that the results are the same as using the classic Proc SQL approach.

Also, read the related posts Find Min and Max Values with the SAS Hash Object, Mean Imputation in SAS Using the Hash Object and Calculate Sums With The Hash Object in SAS.

You can download the entire code from this post here.