In the blog post An Array Hashing Scheme in SAS, we learn about hashing without the SAS hash object. Instead, we use temporary arrays to store keys and data. Furthermore, we use the Mod Function as a primitive hash function and design collision resolutions when two keys hash to the same address. In the original post, we assume that only a single, numeric, integer key variable and a single data variable is present. Today, we relax that assumption and design array hashing techniques with multiple keys and data variables.

The content in this post is largely based on the code in the original article Table Look-Up by Direct Addressing: Key-Indexing — Bitmapping — Hashing by Paul Dorfman. I encourage you to read the entire article. One of my absolute favorites.

In the examples to come, I will use the two data sets large and small below. There are 2 key variables k1, k2 and 2 data variables d1 and d2.

data large(keep=k1 k2) small(keep=k1 k2 d1 d2);
   call streaminit(123);
   array r {5000000} _temporary_ (1 : 5000000);
   h = 5000000;
   do _N_ = 1 to 5000000;  
      i  = rand ("integer", h);
      k1  = r [i];
      k2 = rand('integer', 1, 1e4);
      d1 = rand('integer', 1, 1e8);
      d2 = rand('integer', 1, 1e8);
      output large;
      if rand('uniform') < .5 then output small;
      r [i] = r [h];
      h = h-1;
   end;
   stop;
run;

Array Hashing With Multiple Data Variables

First, let us consider the case of multiple data variables. This is fairly easy to do with the SAS hash object. See the code below. Here, I look up d1 and d2 from small on the key variable k1 in large. The code snippet below takes about 9 seconds to run on my local machine.

data hash1(keep=k1 d1 d2);
 
   if _N_ = 1 then do;
      dcl hash h(dataset : 'small');
      h.definekey('k1');
      h.definedata('d1', 'd2');
      h.definedone();
   end;
 
   set large;
   d1 = .; d2 = .;
 
   rc = h.find();
run;

Next, let us do the same lookup using array hashing. First off, let us find a fitting prime. This is in line with the approach in the previous post and the original article.

%let load = 0.5;
data _null_;
   do p=ceil(p/&load) by 1 until (j = up + 1);
      up = ceil(sqrt(p));
      do j=2 to up until (not mod(p,j)); end;
   end;
   call symput('hsize',left(put(p,best.)));
   stop;
   set small nobs=p;
run;
 
%put &hsize.;

Now, I have the size of the arrays. Next, we do the actual lookup. The case of multiple data variables is not complicated. All we need to do is to define an extra array to hold the additional data values. Naturally, we create as many variables as there are data variables to look up. Now, all that is left is to insert/retrieve the data values into two arrays instead of one. See the code snippet below. This runs in about 2.5 seconds on my machine. A quite sizable reduction in run time.

data arrayhash1(keep=k1 d1 d2);
   array key   {0 : &hsize.} _temporary_;
   array data1 {0 : &hsize.} _temporary_;
   array data2 {0 : &hsize.} _temporary_;
   array link  {0 : &hsize.} _temporary_;
 
   r=&hsize.;
 
   do until (lr1);
      set small end=lr1;
      h=mod(k1, &hsize.) + 1;                 
      found=0;
      if link [h] > . then do;               
         link traverse;                      
         if found then continue;
         do r=r by -1 until (link [r]=.);    
         end;
         link [h] = r;                       
         h        = r;
      end;
 
      link  [h] = 0;                          
      key   [h] = k1;
      data1 [h] = d1;
      data2 [h] = d2;
   end;
 
   do until (lr2);
      set large end=lr2;
      found=0;
      h=mod(k1, &hsize.) + 1;                 
      if link[h] > . then link traverse;     
      if found then do;
         d1 = data1[h];  
         d2 = data2[h];
      end; 
      else call missing(d1, d2);
      output;
   end;
 
   stop;
   traverse:
 
   if k1 = key [h] then found=1;
   else if link [h] ne 0 then do;
      h=link [h];
      link traverse;
   end;
run;

Finally, verify that the two data steps above create the same result.

proc compare base=hash1 comp=arrayhash1;run;
%put &sysinfo.;

Array Hashing With Multiple Key Variables

Next, let us take a look at hashing with multiple key variables. First, let us do so with the SAS hash object. All, we need to do is to add the second key variable to the Definekey Method call. See the code below. This code takes about 8.5 seconds to run on my system.

data hash2(keep=k1 k2 d1 d2);
 
   if _N_ = 1 then do;
      dcl hash h(dataset : 'small');
      h.definekey('k1', 'k2');
      h.definedata('d1', 'd2');
      h.definedone();
   end;
 
 
   set large;
   d1 = .; d2 = .;
 
   rc = h.find();
run;

Next, let us create the same result with the array hashing technique. There are two major differences from the single-key example above. The most important thing is the way we hash the key value to an array position. Since we have more than one key value, we can not simply use the modulo method like above. Perhaps we could concatenate the character representation of the key values and hash that. However, there are two problems with that approach. First off, concatenation in SAS is very slow. And since the goal is speed, that is no good. Secondly, the key values may concatenate into an integer lying beyond SAS integer precision.

Instead, we hash the first key with the Mod Function. Then, we multiply by the second key, and use the Mod Function again. This is an application of Horner’s Algorithm. Also, it is the recommended approach in the original article. This modification is done both in the load – and lookup phase below.

Finally, when traversing the array at the bottom, we need to check if there is a match on both key variables. This is a simple modification. The code below runs in just over 2 seconds on my system. Again, a sizable run-time reduction.

data arrayhash2(keep=k1 k2 d1 d2) / stack = 100;
   array key1  {0 : &hsize.} _temporary_;
   array key2  {0 : &hsize.} _temporary_;
   array data1 {0 : &hsize.} _temporary_;
   array data2 {0 : &hsize.} _temporary_;
   array link  {0 : &hsize.} _temporary_;
   r=&hsize.;
 
   do until (lr1);
      set small end=lr1;
      h = mod(mod(k1, &hsize.) * 1e4 + k2, &hsize.);              
      found=0;
      if link [h] > . then do;               
         link traverse;                      
         if found then continue;
         do r=r by -1 until (link [r]=.);    
         end;
         link [h] = r;                       
         h        = r;
      end;
 
      link  [h] = 0;                          
      key1  [h] = k1;
      key2  [h] = k2;
      data1 [h] = d1;
      data2 [h] = d2;
   end;
 
   do until (lr2);
      set large end=lr2;
      found=0;
      h = mod(mod(k1, &hsize.) * 1e4 + k2, &hsize.);              
      if link[h] > . then link traverse;     
      if found then do;
         d1 = data1[h];  
         d2 = data2[h];
      end; 
      else call missing(d1, d2);
      output;
   end;
 
   stop;
   traverse:
 
   if k1 = key1 [h] and k2 = key2 [h] then found=1;
   else if link [h] ne 0 then do;
      h=link [h];
      link traverse;
   end;
run;

Again, we verify that the two results are identical.

proc compare base=hash2 comp=arrayhash2;run;
%put &sysinfo.;

Summary

In this post, we take another look at hashing with temporary arrays in the SAS data step. We expand the scope of array hashing to handle multiple key and data variables. Also, we compare the run times of the technique to the SAS hash object. The result is a quite sizable reduction in run time compared to the usual hash object. However, the technique is quite advanced. Therefore, the approach should be used only when run time is of the very essence. Also, we can only do this when all key variables are numeric and integer. In a future blog post, I will relax this assumption as well.

Besides the original paper (link above in the post), the technique is further elaborated in the two SUGI papers Hashing Rehashed and HASHING: GENERATIONS.

For alternative lookup methods using arrays, read Binary Searching an Array in SAS and Use Binary Search of Array in SAS Table Lookup.

You can download the entire code from this post here.