In the blog post 3 Ways to Select Top N By Group in SAS, we learn a few ways to create top 10 lists by group in SAS. One of the approaches uses a hash object and creates the result in a single pass of the data. However, two requirements must be met for this to work: the data must be sorted (or at least grouped) by the by-variable, and we must have enough memory to hold all data for the largest by-group. Today, we relax both of those assumptions and see how the hash-of-hashes approach comes in handy in doing so. If you are not at all familiar with the hash of hashes technique, read the article A SAS Hash Object Of Hash Objects (Hash Of Hashes) first. The examples in this post rely heavily on the theory from there, and I will not explain the technique thoroughly in the examples to come.
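For context, here is a minimal sketch of a sort-based baseline (illustrative only, not the exact code from the original post; the data set names iris_sorted and top10_baseline are my own). It shows the sorted-input requirement we are about to relax: the data must arrive ordered by the by-variable before we can pick off the first 10 observations per group.

/* Illustrative sort-based baseline. Requires a full sort of the input. */
proc sort data = sashelp.iris out = iris_sorted;
   by species descending sepallength;
run;

data top10_baseline;
   set iris_sorted;
   by species;
   if first.species then n = 0;   /* reset counter at each new species */
   n + 1;                         /* sum statement: n is retained       */
   if n <= 10;                    /* keep the 10 largest SepalLengths   */
   drop n;
run;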

Like in the original post, I will use the sashelp.iris data set. However, sashelp.iris is already grouped by Species, so for demonstration purposes, we shuffle the data so that no sort or group order exists.

proc sql;
   create table iris as 
   select * from sashelp.iris
   order by rand('uniform');
quit;

Top 10 By Group – A Hash of Hashes Approach

First, let us see an example of how to find the data with the 10 largest sepal lengths for each species in the (unsorted) iris data set. The code below does exactly that. In short, we create as many instances of the hash object h as there are distinct species in the iris data. We use the HoH hash object to point to the instances, keyed by the Species variable. For each instance of h, we instantiate a hash iterator as well. When we read an observation from iris, we place it in the appropriate instance. Finally, we loop through all the instances of h. For each instance, we use the corresponding iterator to traverse the first 10 items and copy the data values into the PDV. Each time this happens, we output. This gives us the desired result.

data want1;
   dcl hash hoh(ordered : "Y");
   hoh.definekey("species");
   hoh.definedata("h", "hi", "species");
   hoh.definedone();
   dcl hiter i("hoh");
 
   do until(z);
      set iris end = z;
      if hoh.find() ne 0 then do;
         dcl hash h(dataset : "iris(obs=0)", multidata : "Y", ordered : "D");
         h.definekey("sepallength");
         h.definedata(all : "Y");
         h.definedone();
         dcl hiter hi("h");
         hoh.add();
      end;
      h.add();
   end;
 
   do while(i.next() = 0);
      do _N_ = 1 by 1 while(hi.next() = 0 & _N_ <= 10);
         output;
      end;
   end;
 
run;

The approach above works, and we have successfully relaxed the assumption of grouped input data. However, it is quite vulnerable memory-wise: eventually, all of the input data resides in memory. We can relax this as well, since we are only interested in 10 items for each species. In the code below, we add a handful of lines after the h.add() statement. Here, I check whether more than 10 items exist in the active instance of h. If so, I want to remove the item with the smallest value of SepalLength. I copy that value into the PDV with the hi.last() and hi.next() method calls. This works because the h instance is ordered descending by SepalLength, so the last item holds the smallest key value. I call the hi.next() method immediately afterwards so that the failed call releases the iterator and we avoid locked-by-iterator problems.

Now, a simple Remove() call will do, right? Unfortunately, no. For a hash object with multiple items per key, the Remove() method deletes all items with the specified key, and we want to remove only one. Furthermore, if we want the output to be identical to the one above, we must consider exactly which of the items tied on the smallest key to remove: the one that dwells last in the item list. Therefore, I call the Find() method, which has the side effect of setting the pointer to the first item in the item list for the key currently in the PDV. Next, I traverse the item list until the pointer dwells on the last item. I do so with the Has_Next() method, which returns a zero value to the variable specified in the result argument tag when a next item does not exist, i.e., when we dwell on the last item. Finally, a single RemoveDup() call deletes only the item the pointer dwells on.
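To see why a plain Remove() call falls short, consider this small standalone sketch (my own illustration, not part of the original example; the names h, k, and v are arbitrary):

data _null_;
   dcl hash h(multidata : "Y");
   h.definekey("k");
   h.definedata("k", "v");
   h.definedone();

   /* Add two items with the same key */
   k = 1; v = 10; rc = h.add();
   k = 1; v = 20; rc = h.add();

   /* Remove() deletes ALL items matching the key in the PDV */
   rc = h.remove();

   n = h.num_items;
   put n=;   /* n=0. Both items are gone, not just one. */
run;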

With this trimming in place, at most 11 items exist in each instance of h at any given time.

data want2;
   dcl hash hoh(ordered : "Y");
   hoh.definekey("species");
   hoh.definedata("h", "hi", "species");
   hoh.definedone();
   dcl hiter i("hoh");
 
   do until(z);
      set iris end = z;
      if hoh.find() ne 0 then do;
         dcl hash h(dataset : "iris(obs=0)", multidata : "Y", ordered : "D");
         h.definekey("sepallength");
         h.definedata(all : "Y");
         h.definedone();
         dcl hiter hi("h");
         hoh.add();
      end;
      h.add();
 
      if h.num_items > 10 then do;
         /* Copy the smallest SepalLength into the PDV: the last item,
            since h is ordered descending. The subsequent next() call
            fails past the end and releases the iterator lock. */
         _N_ = hi.last();
         _N_ = hi.next();

         /* Point to the first item with that key in the item list */
         _N_ = h.find();

         /* Walk forward until the pointer dwells on the last item */
         do while(1);
            _N_ = h.has_next(result : next);
            if not next then leave;
            _N_ = h.find_next();
         end;

         /* Remove only the single item the pointer dwells on */
         _N_ = h.removedup();
      end;
   end;
 
   do while(i.next() = 0);
      do _N_ = 1 by 1 while(hi.next() = 0 & _N_ <= 10);
         output;
      end;
   end;
 
   drop next;
 
run;
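
As a quick sanity check (my addition, not in the original flow), you can confirm that the trimmed version reproduces the first result exactly:

proc compare base = want1 comp = want2; run;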

Comparing Performance

For a small data set like iris, memory management is not an issue. However, for larger data it has an enormous impact. Let us put the two code snippets above to the test with the test data below. The data has 100 distinct groups k and 10 million observations.

data have;
   do _N_ = 1 to 1e7;
      k = rand('integer', 1, 100);
      d = rand('integer', 1, 1e6);
      output;
   end;
run;
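
If you want to verify the structure of the test data (a simple check I have added, not part of the original benchmark), a quick PROC SQL query confirms the 100 groups and 10 million observations:

proc sql;
   select count(distinct k) as n_groups,
          count(*)          as n_obs
   from have;
quit;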

First, let us use the initial approach without memory management. I use the technique presented in How Much Memory Does SAS Hash Object Occupy? to measure the memory impact.

data want1;
   dcl hash hoh(ordered : "Y");
   hoh.definekey("k");
   hoh.definedata("h", "hi", "k");
   hoh.definedone();
   dcl hiter i("hoh");
 
   before = input(getoption('xmrlmem'),20.);
 
   do until(z);
      set have end = z;
      if hoh.find() ne 0 then do;
         dcl hash h(multidata : "Y", ordered : "D");
         h.definekey("d");
         h.definedata("k", "d");
         h.definedone();
         dcl hiter hi("h");
         hoh.add();
      end;
      h.add();
   end;
 
   after = input(getoption('xmrlmem'),20.);
   hashsize = before-after;
   put hashsize= sizekmg10.2;
 
   do while(i.next() = 0);
      do _N_ = 1 by 1 while(hi.next() = 0 & _N_ <= 10);
         output;
      end;
   end;
 
   drop before after hashsize;
run;

This approach uses about 1GB of memory and takes about 50 seconds of wall clock time on my system.

Next, let us manage memory.

data want2;
   dcl hash hoh(ordered : "Y");
   hoh.definekey("k");
   hoh.definedata("h", "hi", "k");
   hoh.definedone();
   dcl hiter i("hoh");
 
   before = input(getoption('xmrlmem'),20.);
 
   do until(z);
      set have end = z;
      if hoh.find() ne 0 then do;
         dcl hash h(multidata : "Y", ordered : "D");
         h.definekey("d");
         h.definedata("k", "d");
         h.definedone();
         dcl hiter hi("h");
         hoh.add();
      end;
      h.add();
 
      if h.num_items > 10 then do;
         _N_ = hi.last();
         _N_ = hi.next();
         _N_ = h.find();
 
         do while(1);
            _N_ = h.has_next(result : next);
            if not next then leave;
            _N_ = h.find_next();
         end;
         _N_ = h.removedup();
      end;
   end;
 
   after = input(getoption('xmrlmem'),20.);
   hashsize = before-after;
   put hashsize= sizekmg10.2;
 
   do while(i.next() = 0);
      do _N_ = 1 by 1 while(hi.next() = 0 & _N_ <= 10);
         output;
      end;
   end;
 
   drop before after hashsize next;
run;

This uses a mere 6MB of memory and takes about 8 seconds of wall clock time: a colossal difference, especially in memory consumption. You can verify that the results are identical with the code below.

proc compare base = want1 comp = want2; run;

Summary

In this post, we learn how to create top N lists by group with the hash-of-hashes technique. The technique is handy here because it lets us manage memory consumption and relax the original assumption of grouped or sorted input data. Thus, we can create the desired top N lists by group in a single pass, with limited memory consumption, on unsorted data.

If you want to learn more about the hash of hashes technique, read chapter 9 of the hash object bible, Data Management Solutions Using SAS Hash Table Operations.

You can download the entire code from this post here.