An Array Hashing Scheme in SAS

The SAS Hash Object was introduced in SAS 9.2. Before that, hashing techniques did exist. However, programmers has to implement them with custom code and temporary arrays. Today, I will demonstrate how to implement a hashing lookup in the Data Step using temporary arrays. We will see that in some circumstances, the array hashing technique outperforms the usual hash object in terms of speed.

Example Data

In the examples to come, I will use the following example data. I will lookup data values from a small data set Small. I will do so from a larger SAS data set Large. Small and Large share the key variable k. Both k and d are numeric.

/* Small data set */
proc plan seed=123;
  factors k=1e6 / noprint;
  output out=k;
run;quit;
 
data small;
   set k;
   d=rand('integer', 1, 1e7);
run;
 
/* Large data set */
data large;
   call streaminit(123);
   do _N_=1 to 1e8;
      k=rand('integer', 1, 1e6);
      output;
   end;
run;

Find the Right Array Size

Before we do the actual hashing, we have to determine the size of the arrays to hold keys and data. Remember, the SAS Hash Object is Dynamic. It grows and shrinks at run time. The temporary SAS array does not have this advantage. Therefore, we have to find some suiting number of elements. Without going too much into details about this, there are a few things we need to consider. Obviously, the array must have the ability to store all key values. Also, we want to emulate the hashing process of allocating keys somewhat equally to different locations. We do this by Universal Hasing, aka. the Modulo Hashing Technique. I will not get too theoretical, but the technique requires us to find a fitting prime number to use both for the array size and the Mod Function. You get an idea of why in this discussion on StackOverflow.

There are many approaches to how to find a “good prime”. Some may look it up in a table. Here, I use the following small program to find it. This is from the article at the bottom of the post.

%let load = 0.8;
data _null_;
   do p=ceil(p/&load) by 1 until (j = up + 1);
      up = ceil(sqrt(p));
      do j=2 to up until (not mod(p,j)); end;
   end;
   call symput('hsize',left(put(p,best.)));
   stop;
   set small nobs=p;
run;
 
%put &hsize.;

The Lookup

We structure the lookup as follows. First we define the arrays. Then I use a Double DoW Loop to load and lookup the data from Small. Finally, we have a traverse part that. I define three arrays. Key, link and data. The key and data arrays are self explanatory. We use the link array to handle collisions. When a key value hashes to an address in the key array that is already occupied, we use the link array to allocate the key to another deliberate location in key. We will see how later. Before I go into the DoW Loops, I define the three temporary arrays with the number of elements determined by the prime number above. In this case 1250003 elements with base zero. This is very important. We will see why below.

The First Dow Loop

  1. In the first SAS DoW Loop we read the Small data set. I use the Mod Function to determine the address to point to in the arrays. The Mod Function returns the remainder when dividing k by hsize. This method is a common hashing technique. Consequently, we consider Mod as a hashing function that allocates the key values somewhat equal in the temporary array.
  2. Now, we look in link. If this element is missing, no key has hashed to this address and we can go to (6). If not, we have to find another address for it.
  3. In the case of a collision, we traverse link to see if the key has already resulted in a collision before. Consequently, it will have been sent to another address in link. If this is the case, the traverse part will return found=1. See the Traverse Section below for details. If found=1 then the program goes to (6) because we do not need to find a new address for k.
  4. If found=0 then we need to find a new address for k. That means there has been a collision and does not already exist in the array. We do this by iterating though link from the right to left.
  5. When we encounter an empty address, we let the h‘th position in link point to to the r‘th element by setting link[h]=r. Now, the name link makes sense. In the case of a collision, we let the elements of link point to the new addresses of the key value positions. This process is very well illustrated in the appendix of the article at the bottom.
  6. Finally, we have found an appropriate hashing address h. Naturally, we insert d in data and k in key. More interestingly, we set link[h]=0. Consequently, a 0 element in link represents the end of a same-hash-address sequence of keys.

The Second DoW Loop

  1. The Second SAS DoW Loop represents the actual lookup. Step one in the second DoW Loop is identical to step 1 in the first DoW Loop.
  2. If link[h] is not missing, then we know that some key value from Small hashed to this address. However, we do not know if this particular from large is there. Therefore, we traverse link following the points inserted in (5) above. If the key value is found, then the traverse part returns found=1. See the traversal section for details.
  3. If the key value is found, then (2) returns its address. All that is left to do is access this address in the data array by data [h] and the lookup is complete.

The Traversal Of Link Array

Since we traverse the link array twice, once in each SAS DoW Loop, it is convenient to write this code only once. Then we access this code with a Link Statement. The traverse part is a simple If-Then block. If is found in the h‘th element of key then the lookup is successful and we set found=1. This is applicable in both the load part and the lookup part. If is not found in the h‘th element of key, we set h to the value of the h‘th element of link. We do this because we know that this points to the next potential lookup key in the key array. We then use yet another Link Statement and compare the next key value. This way, we keep traversing the path of potential keys with the directions given in the link array until link holds a zero. Remember, the zero represents the end of a same-hash-address sequence of keys.

data arrayhash(keep=k d);
   array key  {0 : &hsize.} _temporary_;
   array link {0 : &hsize.} _temporary_;
   array data {0 : &hsize.} _temporary_;
   r=&hsize.;
 
   do until (lr1);
      set small end=lr1;
      h=mod(k, &hsize.) + 1;                    /*  1  */
      found=0;
      if link [h] > . then do;                  /*  2  */
         link traverse;                         /*  3  */
 
         if found then continue;
 
         do r=r by -1 until (link [r]=.);       /*  4  */
         end;
 
         link [h] = r;                          /*  5  */
         h        = r;
      end;
 
      link [h] = 0;                             /*  6  */
      key  [h] = k;
      data [h] = d;
   end;
 
   do until (lr2);
      set large end=lr2;
 
      found=0;
      h=mod(k, &hsize.) + 1;                    /*  1  */
 
      if link[h] > . then link traverse;        /*  2  */
 
      if found then d = data[h];                /*  3  */
      else d=.;
      output;
   end;
 
   traverse:
 
   if k = key [h] then found=1;
   else if link [h] ne 0 then do;
      h=link [h];
      link traverse;
   end;
run;

Summary

In this post, I have demonstrated how to implement universal hashing in the SAS Data Step. I do so by using temporary arrays and the Mod Function with a well-chosen prime number as second argument. Performance-wise, this technique outperforms the usual hash object substantially measured in run-time. Not surprisingly, direct-addressing is faster. However, the hashing scheme above does not require numerical keys to fit directly as array references. You can see the entire code that produces similar results and time yourself in the code at the bottom.

In this post, I have tried to keep things simple. If you want to learn more about hashing with SAS arrays, see part 2 of the article Direct addressing techniques of table look-up by Paul Dorfman. I build most of the content in this post on the theory and code from there. Though the article is from 1998, it is worth a read. You will see alternatives to how to deal with collisions, illustrations that help you understand the technique and much more. In fact, the article is on my top 10 all-time favorites article list.

You can download the entire SAS code from this post here.