## An Array Hashing Scheme in SAS

The SAS Hash Object was introduced in SAS 9.2. Before that, hashing techniques did exist. However, programmers has to implement them with custom code and temporary arrays. Today, I will demonstrate how to implement a hashing lookup in the Data Step using temporary arrays. We will see that in some circumstances, the array hashing technique outperforms the usual hash object in terms of speed.

#### Example Data

In the examples to come, I will use the following example data. I will lookup data values **d **from a small data set **Small**. I will do so from a larger SAS data set **Large**. **Small** and **Large **share the key variable **k**. Both **k** and **d** are numeric.

/* Small data set */ proc plan seed=123; factors k=1e6 / noprint; output out=k; run;quit; data small; set k; d=rand('integer', 1, 1e7); run; /* Large data set */ data large; call streaminit(123); do _N_=1 to 1e8; k=rand('integer', 1, 1e6); output; end; run; |

#### Find the Right Array Size

Before we do the actual hashing, we have to determine the size of the arrays to hold keys and data. Remember, the SAS Hash Object is Dynamic. It grows and shrinks at run time. The temporary SAS array does not have this advantage. Therefore, we have to find some suiting number of elements. Without going too much into details about this, there are a few things we need to consider. Obviously, the array must have the ability to store all key values. Also, we want to emulate the hashing process of allocating keys somewhat equally to different locations. We do this by *Universal Hasing*, aka. the *Modulo Hashing Technique*. I will not get too theoretical, but the technique requires us to find a fitting prime number to use both for the array size and the Mod Function. You get an idea of why in this discussion on StackOverflow.

There are many approaches to how to find a “good prime”. Some may look it up in a table. Here, I use the following small program to find it. This is from the article at the bottom of the post.

%let load = 0.8; data _null_; do p=ceil(p/&load) by 1 until (j = up + 1); up = ceil(sqrt(p)); do j=2 to up until (not mod(p,j)); end; end; call symput('hsize',left(put(p,best.))); stop; set small nobs=p; run; %put &hsize.; |

#### The Lookup

We structure the lookup as follows. First we define the arrays. Then I use a Double DoW Loop to load and lookup the data from **Small**. Finally, we have a traverse part that. I define three arrays. **Key**, **link** and **data**. The key and data arrays are self explanatory. We use the link array to handle collisions. When a key value hashes to an address in the key array that is already occupied, we use the link array to allocate the key to another deliberate location in **key**. We will see how later. Before I go into the DoW Loops, I define the three temporary arrays with the number of elements determined by the prime number above. In this case 1250003 elements with base zero. This is very important. We will see why below.

#### The First Dow Loop

- In the first SAS DoW Loop we read the
**Small**data set. I use the Mod Function to determine the address to point to in the arrays. The Mod Function returns the remainder when dividing**k**by**hsize**. This method is a common hashing technique. Consequently, we consider Mod as a hashing function that allocates the key values somewhat equal in the temporary array. - Now, we look in
**link**. If this element is missing, no key has hashed to this address and we can go to (6). If not, we have to find another address for it. - In the case of a collision, we traverse
**link**to see if the key has already resulted in a collision before. Consequently, it will have been sent to another address in**link**. If this is the case, the traverse part will return found=1. See the Traverse Section below for details. If found=1 then the program goes to (6) because we do not need to find a new address for**k**. - If found=0 then we need to find a new address for
**k**. That means there has been a collision and**k**does not already exist in the array. We do this by iterating though link from the right to left. - When we encounter an empty address, we let the
**h**‘th position in**link**point to to the**r**‘th element by setting link[h]=r. Now, the namemakes sense. In the case of a collision, we let the elements of**link****link**point to the new addresses of the key value positions. This process is*very*well illustrated in the appendix of the article at the bottom. - Finally, we have found an appropriate hashing address
**h**. Naturally, we insert**d**in**data**and**k**in**key**. More interestingly, we set link[h]=0. Consequently, a 0 element in**link**represents the end of a same-hash-address sequence of keys.

#### The Second DoW Loop

- The Second SAS DoW Loop represents the actual lookup. Step one in the second DoW Loop is identical to step 1 in the first DoW Loop.
- If link[h] is not missing, then we know that some key value from
**Small**hashed to this address. However, we do not know if this particular**k**from large is there. Therefore, we traverse**link**following the points inserted in (5) above. If the key value is found, then the traverse part returns found=1. See the traversal section for details. - If the key value is found, then (2) returns its address. All that is left to do is access this address in the
**data**array by data [h] and the lookup is complete.

#### The Traversal Of Link Array

Since we traverse the link array twice, once in each SAS DoW Loop, it is convenient to write this code only once. Then we access this code with a Link Statement. The traverse part is a simple If-Then block. If **k **is found in the **h**‘th element of **key** then the lookup is successful and we set found=1. This is applicable in both the load part and the lookup part. If **k **is *not* found in the **h**‘th element of **key**, we set **h** to the value of the **h**‘th *element *of **link**. We do this because we know that this points to the next potential lookup key in the **key **array. We then use yet another Link Statement and compare the next key value. This way, we keep traversing the path of potential keys with the directions given in the **link** array until **link **holds a zero. Remember, the zero represents the end of a same-hash-address sequence of keys.

data arrayhash(keep=k d); array key {0 : &hsize.} _temporary_; array link {0 : &hsize.} _temporary_; array data {0 : &hsize.} _temporary_; r=&hsize.; do until (lr1); set small end=lr1; h=mod(k, &hsize.) + 1; /* 1 */ found=0; if link [h] > . then do; /* 2 */ link traverse; /* 3 */ if found then continue; do r=r by -1 until (link [r]=.); /* 4 */ end; link [h] = r; /* 5 */ h = r; end; link [h] = 0; /* 6 */ key [h] = k; data [h] = d; end; do until (lr2); set large end=lr2; found=0; h=mod(k, &hsize.) + 1; /* 1 */ if link[h] > . then link traverse; /* 2 */ if found then d = data[h]; /* 3 */ else d=.; output; end; traverse: if k = key [h] then found=1; else if link [h] ne 0 then do; h=link [h]; link traverse; end; run; |

## Summary

In this post, I have demonstrated how to implement universal hashing in the SAS Data Step. I do so by using temporary arrays and the Mod Function with a well-chosen prime number as second argument. Performance-wise, this technique outperforms the usual hash object substantially measured in run-time. Not surprisingly, direct-addressing is faster. However, the hashing scheme above does not require numerical keys to fit directly as array references. You can see the entire code that produces similar results and time yourself in the code at the bottom.

In this post, I have tried to keep things simple. If you want to learn more about hashing with SAS arrays, see part 2 of the article Direct addressing techniques of table look-up by Paul Dorfman. I build most of the content in this post on the theory and code from there. Though the article is from 2001, it is worth a read. You will see alternatives to how to deal with collisions, illustrations that helps you understand the technique and much more.

You can download the entire SAS code from this post here.