Run Time Effect Of The Hash Object HASHEXP Argument

The SAS Hash Object is a very efficient tool to look up data and much more. When we use a hash object, we expect it to be fast. Therefore we should tweak the different hash object options and argument tags to our advantage. This post presents the HASHEXP argument tag. I will briefly explain the theoretical purpose of the HASHEXP tag. Following, I will demonstrate how important it is by demonstrating how run time is decrease when we increase the HASHEXP value.

The HASHEXP Argument Tag

When a hash object is declared, a set of AVL trees are constructed behind the scenes. An AVL tree is a tree structure filled with key values that can be searched for a given key value using a binary search algorithm. The HASHEXP argument determines the number of AVL trees that the hash object creates. By the nature of a binary search, the more values in a tree, the slower the search. Therefore it is desirable to have as small a number of key values in a given tree.

The value of HASHEXP is set in the hash object arguments in the Declare Statement. It accepts an integer value between 0 and 20. The number of AVL trees is determined as 2^{\text{HASHEXP}}. This means that a hash object can contain between 1 and 1048576 trees for HASHEXP equal to 0 and 20 respectively. The default value is 8, which results in 256 trees. As we will see in a minute, the number of search trees has an enormous impact on run time when searching for a key value in a hash object.

When the hash object encounters a key value, it runs a hash function on the key value. This hash function determines which of the 2^{\text{HASHEXP}} trees the key is assigned to. The hash function is designed so that the same key value can not end up in different binary trees. This is the reason why we can Group Variable Values With The Hash Object In SAS. Two of the most famous hash functions are the MD5 Function and the class of SHA Functions. In a future post, I will discuss there in further detail.

Case Study: Increasing the HASHEXP Argument

As an example, let us see how increasing the HASHEXP tag affects run time. First, we create some example data. This consists of an arbitrary ID variable and a name. The example data is for demonstration purposes only.

data MyData(drop=i);
   length ID $50 first_name $20 last_name $20;
 
   array first_names{20}$20 _temporary_ ("Paul", "Allan", "Bob", "Michael", "Chris", "David", "John", "Jerry", "James", "Robert",
                                       "Karen", "Betty", "Helen", "Sandra", "Sharon", "Laura", "Michelle", "Angela", "Melissa", "Amanda");
   array last_names{20}$20 _temporary_ ("Smith", "Johnson", "Williams", "Jones", "Brown", "Miller", "Wilson", "Moore", "Taylor", "Hall",
                                      "Anderson", "Jackson", "White", "Harris", "Martin", "Thompson", "Robinson", "Lewis", "Walker", "Allen");
 
   call streaminit(123);
 
   do i=1 to 10e6;
      ID=uuidgen();
      first_name=first_names[ceil(rand("Uniform")*20)];
      last_name=last_names[ceil(rand("Uniform")*20)];
      output;
   end;
run;

Next, let us run several search algorithms for some arbitrary ID value. We run each algorithm on the same data set for a single value of the key variable ID. The only difference between the runs is the value of HASHEXP. I perform 21 runs, one for each possible value of HASHEXP. To do this, I use CALL EXECUTE logic. Finally, I use the LOGPARSE Macro to save the run time data in a SAS data set for later analysis.

options fullstimer;
options msglevel=i;
 
options nonotes nosource;
proc printto log="InsertPathHere\MyLog.log";
run;
options notes source;
 
%passinfo;
 
/* Hash Object Search */
data callstack;
   length string $500;
   do exp=0 to 20;
   string=compbl(cats(
      "data Test;
      if 0 then set MyData;
      if _N_=1 then do;
         declare hash h(dataset:'MyData', hashexp:", exp, ");
         h.definekey('ID');
         h.definedata(all:'Y');
         h.definedone();
      end;
 
      rc=h.find(key:uuidgen());
 
      run;"
      ));
   output;
   call execute(string);
   end;
run;
 
proc printto;
run;
 
%logparse(InsertPathHere\MyLog.log,PerfStat,,,append=NO);

SAS Hash Object HASHEXP argument Effect

I have plotted the results to the right. The graph indicates a sizable run time reduction due to the increase in the number of search trees. For the smallest possible value 0, which corresponds to a single search tree, the run time is about 19 seconds. The default value 8 results in a run time of 17 seconds. From HASHEXP=12 and up, we see a drastic drop in run time. For the largest possible value of search trees 20, we see a run time of merely 8 seconds.

So if the HASHEXP argument tag reduces run time, why don’t we simply set it to max by default? Well, as usual, there is a catch. On my platform (Windows 10, 64 bit), setting up the hash object with a single tree requires a little less than 1 MB memory. That is, the mere hash object with no data inserted. In contrast, setting up the hash object with the maximum number of trees requires 18 MB memory. So it depends on whether you want to spend the extra ‘fee’ when you set up the hash object. However, in most cases, this amount of memory is of little consequence. The maximum value of HASHEXP can usually be set safely.

Summary

In this post, I have demonstrated how we can reduce run time when searching for key values in a hash object. I introduce the HASHEXP argument tag and what it does. Next a case study shows that setting the value to max has an enormous effect on the elapsed time of the search algorithm. We cut the run time in less than half.

Next week, I write about how and why the hash search algorithm scales better than the index search algorithm in the post Comparing SAS Hash Object And Index Search Algorithm.

You can download the entire code from this example here.