In the blog post An Array Hashing Scheme in SAS, I demonstrate a hashing technique in SAS using temporary arrays. The technique was developed in SAS by Paul Dorfman before SAS introduced the hash object. Evens so, under the right circumstances, the array hashing is quite a lot faster than the usual hash object. The ‘right circumstances’ means a single, integer key. In this post, we will explore how to relax this assumption and do array hashing with character keys.

In the examples to come, I will use the following example data. Large has 10 Mio observations. The data set small hash about 1 Mio observations.

data large(keep = k) small;
   call streaminit(123);
   length k $ 20;
   do _N_ = 1 to 1e7;
      k = uuidgen();
      d = rand('integer', 1, 1e6);
      output large;
      if rand('uniform') < .1 then output small;
   end;
run;

Array Hashing in SAS With Character Key Values

As usual with the Modulo Hashing approach, we begin by choosing a load factor. This is the portion of the array that will be filled. From this load factor, we calculate an appropriate prime.

%let load = 0.5;
data _null_;
   do p=ceil(p/&load) by 1 until (j = up + 1);
      up = ceil(sqrt(p));
      do j=2 to up until (not mod(p,j)); end;
   end;
   call symput('hsize',left(put(p,best.)));
   stop;
   set small nobs=p;
run;
 
%put &hsize.;

The case of general character keys is addressed in the last part of the original article Table Look-Up by Direct Addressing: Key-Indexing — Bitmapping — Hashing. The author lists a few approaches. However, none of them seem to be very fast in SAS but one. Using an appropriate informat, that will map general character keys into numeric integers. It turns out, the Pib Informat fulfills just that. A SAS-L discussion suggests that using the Pib6 Informat is the way to go.

This means that we have to edit the code in two places. Namely when determining the array slot h to insert into/retrieve from. We do exactly that in the code below. Compare to the original code and verify the changes.

The code below succesfully looks up the value based on the character key. It does so in about 5 seconds on my system.

data arrayhash(keep=k d) / stack = 100;
   array key  {0 : &hsize.} $20 _temporary_;
   array link {0 : &hsize.}     _temporary_;
   array data {0 : &hsize.}     _temporary_;
   r=&hsize.;
 
   do until (lr1);
      set small end = lr1;
      h = mod(input(k, pib6.), &hsize.) + 1; 
      found = 0;
 
      if link [h] > . then do;                  
         link traverse;                         
         if found then continue;
         do r=r by -1 until (link [r] = .);       
         end;
         link [h] = r;                          
         h        = r;
      end;
 
      link [h] = 0;                             
      key  [h] = k;
      data [h] = d;
   end;
 
   do until (lr2);
      set large end = lr2;
      found = 0;
      h = mod(input(k, pib6.), &hsize.) + 1;      
      if link[h] > . then link traverse;        
      if found then d = data[h];                
      else d = .;
      output;
   end;
 
   stop;
 
   traverse:
 
   if k = key [h] then found = 1;
   else if link [h] ne 0 then do;
      h = link [h];
      link traverse;
   end;
run;

Compare to the SAS Hash Object

Let us compare the run time above to the regular SAS hash object. The code below runs in about 6 seconds. Not much of a difference there.

data hash(drop=rc);
   if _N_=1 then do;
      declare hash h(dataset:'small', hashexp:20);
      h.definekey('k');
      h.definedata('d');
      h.definedone();
   end;
 
   set large;
   d=.;
 
   rc=h.find();
run;
 
/* Verify that the two results are identical */
proc compare base=arrayhash comp=hash; run;

Summary

In this post, we generalize the array hashing technique to handle character keys. The best way to do so is to map the character key to a numeric integer using the Pib6 Informat. We compare an example lookup to the regular SAS hash object with little difference in run time. The conclusion here is probably to use array hashing only when strictly needed. And not with character keys.

As related posts, read A SAS Macro Approach to Array Hashing and Another Word on Array Hashing.

For more information read the original article and Hashing Rehashed and HASHING: GENERATIONS by Paul Dorfman.

You can download the entire code from this post here.