Recently, I stumbled upon an interesting discussion on SAS-L. The OP wants to simulate Set-By processing for a single key in a multi-key hash object. That is, we want to traverse a SAS hash object with multiple key variables and detect when the first key changes. This is not as trivial as it may seem. Behind the scenes, multiple keys are treated as one concatenated key, which SAS runs through its internal hashing function. This is why partial key lookups are not possible in the hash object: you cannot have a hash object with k1 and k2 as key variables and check whether any entry exists for just k1. Sure would be nice though.
data _null_;
   dcl hash h();
   h.definekey("k1", "k2");
   h.definedata("d");
   h.definedone();
   k1 = .; k2 = .; d = .;

   h.add(key : 1, key : 2, data : 3);
   rc = h.check(key : 1); /* Partial key lookup - Not allowed! */
run;
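If all you need is to know whether any entry exists for a given k1, one possible workaround is to keep a second hash object keyed on k1 alone and fill it alongside the main one. The sketch below is only an illustration of that idea, not something taken from the SAS-L discussion; the helper name hk1 and the variables rc_full, rc_k1 and rc_none are hypothetical.

```sas
data _null_;
   /* Main hash object with the full composite key */
   dcl hash h();
   h.definekey("k1", "k2");
   h.definedata("d");
   h.definedone();

   /* Hypothetical helper hash keyed on k1 only */
   dcl hash hk1();
   hk1.definekey("k1");
   hk1.definedone();

   k1 = 1; k2 = 2; d = 3;
   h.add();     /* store the full entry */
   hk1.ref();   /* store k1 if it is not there already */

   rc_full = h.check(key : 1, key : 2);  /* 0 when the full key exists */
   rc_k1   = hk1.check(key : 1);         /* 0 when some entry has k1 = 1 */
   rc_none = hk1.check(key : 9);         /* non-zero: no entry has k1 = 9 */
   put rc_full= rc_k1= rc_none=;
run;
```

The price is a second object that has to be kept in sync with the first on every add.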
In this post, I will demonstrate three approaches to accomplish this: two from the original discussion on SAS-L and a Hash of Hashes technique. In the examples to come, I will use the example data below.
data have;
input k1 k2 d $;
datalines;
2 2 b
2 3 c
2 1 a
3 1 a
3 2 b
1 3 c
1 1 a
1 2 b
;
Key Jumper Technique
First off, let us consider an approach that uses two other hash objects to help us. In the example below, the ‘original’ hash object is h. We then declare and instantiate two other hash objects, hh and hhh, each with only k1 as key variable. In the DoW Loop, I fill each hash object with data from have.
When I traverse h, I do so in a nested structure of three layers. First, I use a hash iterator to traverse hhh. Each element here holds a distinct value of k1. I use this value in the Do_Over method call to traverse hh for all values of k2 within the current value of k1. Now, I have both key values available. Therefore, a simple Find() call in the innermost loop is sufficient to find the correct value of d.
Before each Do_Over loop, I know that a new k1 group is about to begin. This technique was presented by yabwon.
data _null_;
   /* Original hash object with the composite key */
   dcl hash h(multidata:"Y");
   h.definekey("k1", "k2");
   h.definedata("k1", "k2", "d");
   h.definedone();

   /* All k2 values for each k1 */
   dcl hash hh(multidata:"Y");
   hh.definekey("k1");
   hh.definedata("k2");
   hh.definedone();

   /* Distinct k1 values */
   dcl hash hhh();
   hhh.definekey("k1");
   hhh.definedone();
   dcl hiter i("hhh");

   do until(eof);
      set have end=eof;
      h.add();
      hh.add();
      hhh.ref();
   end;

   do while(i.next()=0);          /* one pass per distinct k1 */
      put "** New key **";
      do while(hh.do_over()=0);   /* every k2 within the current k1 */
         rc = h.find();           /* full key is now known */
         output;
         put (k1 k2 d)(=);
      end;
   end;
   stop;
run;
DIF Function Technique
Next, let us see how we can use the DIF Function to accomplish the same goal. In the code below, I simply fill up the hash object with data from have. Here, I must specify both the multidata and ordered arguments, because the method relies on both. In the traversal part, I simply iterate through all elements in the SAS hash object. If the DIF Function returns a non-zero, non-missing value, k1 has changed. This method is simple and easy to understand. However, it is more prone to errors and requires more manual handling for more than two key variables.
data _null_;
   if 0 then set have;
   declare hash h (dataset : 'have', multidata : "Y", ordered : "Y");
   h.definekey('k1', 'k2');
   h.definedata(all : 'Y');
   h.definedone();
   declare hiter i ('h');

   do while (i.next() = 0);
      /* DIF returns missing on its first call; NE 0 also flags the first group */
      if dif(k1) ne 0 then put "** New key **";
      put (k1 k2 d)(=);
   end;
   stop;
run;
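To give an idea of the extra manual handling with more key variables, here is a rough sketch with three keys. The data set have3 and the variables d1 and d2 are hypothetical and only serve this illustration. Two details are my own choices rather than part of the original technique: I call DIF exactly once per key in every iteration so that each DIF queue stays in sync, and I compare against zero so that the missing value returned on the first call also counts as a change.

```sas
/* Hypothetical three-key data, not part of the original example */
data have3;
input k1 k2 k3 d $;
datalines;
1 1 1 a
1 1 2 b
1 2 1 c
2 1 1 d
;

data _null_;
   if 0 then set have3;
   declare hash h (dataset : 'have3', multidata : "Y", ordered : "Y");
   h.definekey('k1', 'k2', 'k3');
   h.definedata(all : 'Y');
   h.definedone();
   declare hiter i ('h');

   do while (i.next() = 0);
      /* One unconditional DIF call per key and iteration */
      d1 = dif(k1);
      d2 = dif(k2);
      if d1 ne 0 then put "** New k1 group **";
      /* A new k1 also starts a new (k1, k2) group, even if k2 repeats */
      if d1 ne 0 or d2 ne 0 then put "** New k1-k2 group **";
      put (k1 k2 k3 d)(=);
   end;
   stop;
run;
```

Every additional key level needs its own DIF call and its own combined condition, which is where mistakes tend to creep in.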
Hash of Hashes in SAS
Finally, let us see the most flexible of all the solutions, which in my opinion is also the most intuitive: the Hash of Hashes approach. For simplicity, I split the SAS code into four parts:
- First, I declare and instantiate the hoh hash object. This hash object will serve as a pointer to the hash objects it will contain. Note that I specify both h and hi in the data portion. Finally, I create an iterator object for later use.
- Next, I declare the hash object h. Notice that I do not instantiate it yet. This way, all I do is create a hash-type variable in the PDV named h.
- In part three, I read have sequentially. For each value of k1, I check whether it has been encountered before. If it has not, I create a new instance of h and add it to hoh. Either way, I add the current observation to h. This works because h always points to the correct instance: either the HOH.Find() Method retrieved it, or we just created it and added it to hoh.
- Finally, I traverse the hoh hash object. Since each entry in hoh represents a unique value of k1, I know that a new key group begins with each entry. Then, I simply use the iterator hi of the current hash object instance h to traverse all elements of that instance.
data _null_;
   dcl hash hoh ();                                /* 1 */
   hoh.definekey("k1");
   hoh.definedata("h", "hi", "k1");
   hoh.definedone();
   dcl hiter i ("hoh");

   dcl hash h;                                     /* 2 */

   do until (lr);                                  /* 3 */
      set have end=lr;
      if hoh.find() ne 0 then do;
         h = _new_ hash (multidata : "Y");
         h.definekey ("k1", "k2");
         h.definedata ("k1", "k2", "d");
         h.definedone();
         declare hiter hi ("h");
         hoh.add();
      end;
      h.add();
   end;

   do while (i.next() = 0);                        /* 4 */
      put "** New key **";
      do while (hi.next()=0);
         put (k1 k2 d)(=);
      end;
   end;
run;
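As a sketch of how the same pattern might carry over to three key variables, only the key and data lists of the inner hash object need to grow, assuming we still only want to detect changes in k1. The data set have3 is hypothetical and not part of the original example.

```sas
/* Hypothetical three-key data, not part of the original example */
data have3;
input k1 k2 k3 d $;
datalines;
2 1 1 a
1 1 2 b
2 1 2 c
1 2 1 d
;

data _null_;
   dcl hash hoh ();
   hoh.definekey("k1");
   hoh.definedata("h", "hi", "k1");
   hoh.definedone();
   dcl hiter i ("hoh");

   dcl hash h;

   do until (lr);
      set have3 end=lr;
      if hoh.find() ne 0 then do;
         h = _new_ hash (multidata : "Y");
         h.definekey ("k1", "k2", "k3");        /* only these lists change */
         h.definedata ("k1", "k2", "k3", "d");
         h.definedone();
         declare hiter hi ("h");
         hoh.add();
      end;
      h.add();
   end;

   do while (i.next() = 0);
      put "** New key **";
      do while (hi.next() = 0);
         put (k1 k2 k3 d)(=);
      end;
   end;
run;
```

Detecting changes in k2 within k1 as well would presumably need a further level of nesting, which this sketch does not attempt.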
In this post, I demonstrated three techniques to detect single key changes in SAS hash objects with multiple key variables. Of the three, my favorite is the hash of hashes approach. Though all three examples above yield the same result, the hash of hashes technique easily handles more than two key variables. Also, I do not have to use more hash objects than strictly necessary, which keeps memory usage to a minimum.
You can download the entire code from this post here.