In the blog post Binary Searching an Array in SAS, we explore how to perform a binary search of a temporary array in SAS. We compare different approaches and learn that the binary search is much faster than a simple sequential search. Today, let us use the binary search techniques in a table lookup setting. In the examples to come, the explanations will be superficial. For a more thorough explanation, read the blog post above.

In the examples to come, I will use the following example data sets. The data set big has 5 Mio observations and small has about half of that.

data big(keep=key) small(keep=key data);
   call streaminit(123);
   array r {5000000} _temporary_ (1 : 5000000);
   h = 5000000;
   do _N_ = 1 to 5000000;  
      i  = rand ("integer", h);
      key  = r [i];
      data = rand('integer', 1, 1e8);
      output big;
      if rand('uniform') < .5 then output small;
      r [i] = r [h];
      h = h-1;
   end;
   stop;
run;

A Match Merge Example Using Arrays and Binary Search

Let us perform a match merge using temporary arrays and binary search. The first thing we must know is the number of observations in the small input data set. We need this number to size the temporary arrays properly. This is a pretty simple task with the code snippet below. Also, the code is very fast because nobs is set at compile time and read from the descriptor portion of the data set. This, no data is read below.

data _null_;
   set small nobs=n;
   call symputx('n', n);
   stop;
run;
 
%put &n.;

Next, let us consider the actual lookup. First, we set up the temporary arrays using the number found above. Next, we read the small data set and read all key value into k and all data values into d. Now, for a binary search to work the arrays must be sorted by k and the data values accordingly. This is easily done with the QSort macro presented in the blog post 4 Techniques To Sort An Array in SAS.

Now, we read the big data set in the second DoW Loop. The lines of code that perform the binary search are exactly the same as in the original post at the top. Also, the logic is in line with most binary search code you will find online.

data want1(keep=key data);
   array k {&n} _temporary_;
   array d {&n} _temporary_;
 
   do x = 1 by 1 until (z1);
      set small end = z1;
      k[x] = key;
      d[x] = data;
   end;
 
   %qsort (Arr=k d, By=k); 
 
   do until (z2);
      set big end = z2;
 
      l = lbound(k);
      h = hbound(k);
      x = .; 
      data = .; 
 
      do until (l > h);
         m = floor((l + h) * .5);
         if      key < k[m] then h = m - 1;
         else if key > k[m] then l = m + 1;
         else do;
            data = d[m];
            leave;
         end;
      end;
 
      output;
 
   end;
 
run;

The code above runs in about 9 seconds on my system. Let us compare to the regular hash object approach.

data want2;
   if _N_ = 1 then do;
      dcl hash h(dataset : 'small');
      h.definekey('key');
      h.definedata('data');
      h.definedone();
   end;
 
   set big;
   data = .;
 
   _N_ = h.find();
run;

The hash object code above runs in about 12 seconds. Not much of a time saving. Especially considering the complexity of the binary search code. Finally, you can verify that the results are identical with the code below.

proc compare base=want1 comp=want2;run;

Summary

In this post we put the binary search technique to work in a practical sense. Namely, we use it to do match merging of two SAS data sets. We learn that while this has interesting properties, the time saving is minimal. Especially when we consider the complexity of the code compared to the regular hash approach.

You can download the entire code from this post here.