When a SAS Hash Object encounters a key value, it runs the value through a Hash Function. The hash function assigns the key value to a bucket called an AVL Tree. This post demonstrates an example of how this distribution takes place and why it plays a crucial role in the performance of the hash object.

In the following, I will make use of the example data below. The data is for demonstration purposes only.

data MyData;
   array first_names{20} $20 _temporary_ ("Paul", "Allan", "Thomas", "Michael", "Chris", "David", "John", "Jerry", "James", "Robert",
                                          "William", "Richard", "Bob", "Daniel", "Paul", "George", "Larry", "Eric", "Charles", "Stephen");
   array last_names{20}$20 _temporary_ ("Smith", "Johnson", "Williams", "Jones", "Brown", "Miller", "Wilson", "Moore", "Taylor", "Hall",
                                        "Anderson", "Jackson", "White", "Harris", "Martin", "Thompson", "Robinson", "Lewis", "Walker", "Allen");
   call streaminit(123);
   do ID=1 to 10e6;
      first_name=first_names[ceil(rand("Uniform")*20)];
      last_name=last_names[ceil(rand("Uniform")*20)];
      output;
   end;
 
   format ID z7.;
run;

A Bucket Distribution Simulation in SAS

Shortly put, an AVL Tree is a self-balancing binary search tree. Each time a new key value is put into the tree, it balances itself so that during a search, it always has roughly the same distance to each value. You can see a visualization of the AVL principle here.

Let us look at an example of how the distribution into AVL trees could look. Be aware though, that the following is not the actual underlying process of the SAS hash object. Rather, it is an example of how a hash function can be used to distribute the key values (almost) evenly without knowing the values beforehand.

In the data below, I read the MyData data set from above. For each ID, I run the MD5 Hash Function. This gives me a 32 byte hexadecimal string. Next, I use the Rank Function to find the numerical position in the ASCII collating sequence of symbols of the first character of the MD5 output. Finally, I map this position to a number between 1 and 256. Remember, that 256 (2**8) is the default number of AVL trees in a hash object. The number of search trees is controlled with the HASHEXP Option.

Each time that a tree receives an ID, I increment the corresponding variable value in the num array by 1. I only output from the last iteration. Consequently, you can see in the AVL data set, that the MD5 hash function does a pretty good job of distributing the key values evenly on the binary search trees. The variation from the most frequent to the least frequent tree is less than 0.1%.

data AVL(keep=num:);
   set MyData end=eof;
   Tree=1+mod(rank(md5(ID)), 256);
   array num{256} (256*0);
   num[Tree]+1;
   if eof then do;
      diff=(max(of num:)-min(of num:))/1e6;
      output;
      put diff;
   end;
run;

Key Values Are Sorted Within Buckets

For a search tree to be able to perform a binary search, the key values must be sorted inside the tree. Consequently, the key values are sorted inside each AVL Tree in the hash object. This is fairly easy to show. In the code below, I simply create a hash object h with a single key value k. I set up 4 binary search trees in h, which I control by setting the HASHEXP Argument Tag to 2 (2**2=4).

I add the values k=1 to k=100 to the hash object and output the content to the SAS data set test. If you open the test data set, you will see 4 ascending sequences of values. This is prove of the 4 buckets in the hash object, which all contain sorted key values for the binary search to take place.

data _null_;
   declare hash h(hashexp:2);
   h.definekey('k');
   h.definedone();
   do k=1 to 100;
      h.add();
   end;
   h.output(dataset:'test');
run;

Summary

In this post, we have taken a peek under the hood of the hash object. We have seen a demonstration of how key values are distributed into binary search trees. Furthermore, we have learned that in each tree, the hash object sorts the key values in ascending order. This must take place for the binary search to be possible. I have previously written blog posts about the performance perks of the hash object in the posts Comparing SAS Hash Object And Index Search Algorithm and SAS Hash Object Lookup Example.

As usual, I want to plug the best learning material on the SAS Hash Object out there: the book Data Management Solutions Using SAS Hash Table Operations: A Business Intelligence Case Study by Don Henderson and Paul Dorfman.

You can download the entire code from this post here.