In the blog post An Array Hashing Scheme in SAS, I walk through an array hashing method first presented by Paul Dorfman in the article direct addressing techniques of table look-up. Even though the article is more than 20 years old, it still has valid points. Also, the techniques in it outperform the modern canned hash and sort-merge techniques steadily. However, when it comes to hashing with SAS arrays, the syntax is tricky and long. I spent quite some time figuring it out. Luckily, Richard A. DeVenezia has done a tremendous amount of work putting together a set of macros, which does the work for us. The macro calls even resemble hash object methods well. In this article, I will demonstrate a few examples of how to use the macros.

In the examples to some, I will use the example data below. I create the example data with Proc Plan.

proc plan seed=123;
  factors k=1e6 / noprint;
  output out=k;
run;quit;
 
data small;
   set k;
   d=rand('integer', 1, 1e7);
run;
 
data large;
   call streaminit(123);
   do _N_=1 to 1e8;
      k=rand('integer', 1, 1e6);
      output;
   end;
run;

The Array Hashing SAS Macros

All the array hashing macros are available in the link here. If you are familiar with the canned hash methods, you will quickly see that the macro calls resemble them well. For example the DeclareHash macro resembles the Declare Hash Statement and so on. The HashSize Macro is different though. In the array hashing, we must choose a load factor to determine how much of the array we fill up. The load factor will in turn determine a fitting prime for the number of array elements.

Another distinction from the usual hash in SAS is that we need to choose between policy=LP / CL in the DeclareHash Macro. This argument determines how to deal with collisions. In other work, if a key hashes to the same array address, what to do with it? Shortly put, in the LP (Linear Probing) approach, if a key hashes to an array entry that is occupied, we simply step go to the left until we find an open slot. The CL (Coalesced Linking) approach is different. Here, we create a separate array of the same size as the key array. The array elements serve as pointers. If we encounter an occupied slot, we find an open slot somewhere in the array and insert a pointer to that slot in the chain array. This is all very well explained and illustrated in the article by Paul Dorfman at the top.

A Simple Lookup Example

Next, let us see an example of how to use the array hashing macros. I want to perform a simple lookup of the data variable d from the numeric key variable k. This example is similar to the one I present in my first array hashing post. You can see a link at the top. I divide the code into four sections. I will go through each section below.

  1. First, I determine the size of my hash array. I do so with the HashSize Macro. All this step does is finding the number of elements for the array.
  2. Next, I declare the hashing array. I do so with the DeclareHash Macro. Here I specify the name and size of the list. Furthermore, I specify the collision policy. Here, I set it to CL. Finally, I set the key variable to and the data variable to d. Notice how well the macros resemblance the usual hashing methods.
  3. Now, I read the small data set in a DoW Loop. Here, I use two macros. First, I use the hashAddKey macro to find the correct slot in the array to insert the data. Next, I use the hashSetData macro to actually insert the data value. The macros handle collisions internally. In this case, we use Coalesced Chaining to handle them.
  4. Here, I read the large data set, from which I want to look up the data values. I do so with the hashFetch_CL macro. This macro follows the trail in the chain array, which has been created in the declareHash macro because we choose policy=CL. The macro creates a return code rc which is zero if we find a slot, which holds the key we look for. Notice that this is in line with the hash Check() and Find() Methods. Finally, I use the hashGetData macro to copy the data from the array to the PDV and output the final data set.
%let hash_size = %hashSize (load=0.8, data=small);   /* 1 */
%put &hash_size.;
 
data arrayhash (keep=k d);
 
   %declareHash     (h, &hash_size, policy=CL);      /* 2 */
   %declareHashKey  (h, k);
   %declareHashData (h, d);
 
   do until (lr1);                                   /* 3 */
      set small end=lr1;
      %hashAddKey  (h, rc=rc);
      %hashSetData (h, d);
      *temp = catq('d', '|', of h_keys[*]);
      *put temp=;
   end;
 
   do until (lr2);                                   /* 4 */
      set large end=lr2;
      call missing (d);
      %hashFetch_CL(h, rc=rc);
      if rc=0 then d = %hashGetData (h, d);
      output;
   end;
 
run;

The SAS code above runs in less than 30 seconds on my system. For comparison, the canned hash approach below does the same job in about 90 seconds. Quite the performance improvement.

A Side Note

The SAS code above is mainly meant as a template. However, if you want to get a better grasp of the process, run the code on the smaller data sets below. Also, uncomment the two commented lines in step 3. This will print the content of the key array in the log. Finally, delete the keep= option in the data statement. Then notice the variables, that are created to insert data and links in the relevant arrays.

data small;
   do k=2, 4, 6, 8;
      d = k;
      output;
   end;
run;
 
data large;
   do k = 1 to 10;
      output;
   end;
run;

Summary

In this post, I demonstrate how to do array hashing in SAS using a set of macros. All the macros are extremely well written by Richard A. DeVenezia. I simply provide an example of how to use a few of them. I highly encourage you to play around with the macros and see how they perform compared to the usual hash object. Also, do not miss reading the article that first describes this method. The link is at the top.

For other array posts, read Why Array Should be Temporary if Possible and 4 Techniques to Sort an Array in SAS.

You can download the entire SAS code from this post here. You will see all the array definitions at the bottom of the link.