Quite often, I see a question on the SAS Online Communities where some action must take place, based on previous occurrences of the current key variable value. The answers usually involve sorting, specifying a temporary array, or similar. A hash object is an ideal tool for a task like this. In this post, I will demonstrate how to use the hash object as a dynamic placeholder for variable values.
The reason that I like the hash object for a problem like this is due to its dynamic structure. Unlike a variable or array, the hash object can grow or shrink at run time. This means that we do not need to know beforehand how much space to allocate for the job. The hash object takes care of it along the way.
In the examples to come, I will use the following example data. The data is for demonstration purposes only.
data Example(drop=i); do i=1 to 100; date=today()-i*rand('integer', 1, 100); ID=put(rand('integer', 1, 10), z4.); var1=rand('integer', 1, 100); var2=rand('integer', 1, 100); output; end; format date date9.; run;
Flag 5th occurrence of a SAS variable value
This is a simple example of using the hash object to keep track of previously encountered variable values. Let us assume that we are interested in whether a key value has been encountered exactly five times. It may be tempting to sort the data and create some retained counter variable. However, we want to avoid reading/writing data too many times. And remember, sorting is CPU intense.
Instead, we create a hash object to keep track of the occurrences for us. I specify ID as the key variable of interest and c as the data variable in the hash object, that will keep track of the number of occurrences thus far. I then use the Check() Method to perform a Hash Search for the ID value in the hash object.
Finally, I use if-then logic like this. If the value is not found and therefore has not yet been encountered, I set c to 1 and add the value to the hash object. If the key value has already been encountered, it will appear in the hash object. Therefore, I use the Find() Method to copy the hash variables into the PDV, increase c by one and use the Replace() Method to update the state of the hash variables. Also, I do a check for whether the counter is equal to 5 after having increased. If yes, then I set the flag equal to 1.
data test1(drop=rc); if _N_ = 1 then do; declare hash h(); h.defineKey('ID'); h.defineData('c'); h.defineDone(); end; set Example; flag=0; rc=h.check(); if rc ne 0 then do; c=1; h.add(); end; if rc=0 then do; rc=h.find(); c=c+1; if c=5 then flag=1; rc=h.replace(); end; run;
The example above is simple but very effective. The logic can easily be modified to a variety of problems.
Output observations the sum of a variable exceeds 500
Now, I make things a bit more complicated. Not I want to output observations where the sum of some variable values hash just exceeded 500. The logic is somewhat the same as above. However, here I create a second object from which I output the final observations of interest. I do not want all observations where some variables values until now exceed 500. I want the observation that makes the sum exceed 500.
Basically, I achieve this by applying the same logic again. I create a second hash object ids. If the sum variable sum_var1 at some point exceed 500, I check the ids object if it exists in there. If not, I use the Add() Method to add it to the object along with the relevant data variables. Remember, the date that this occurred first might be important.
data _null_; if _N_ = 1 then do; declare hash h(); h.defineKey('ID'); h.defineData('sum_var1'); h.defineDone(); declare hash ids(); ids.defineKey('ID'); ids.defineData('date', 'var1', 'var2', 'sum_var1'); ids.defineDone(); end; set Example end=lr; rc=h.check(); if rc ne 0 then do; sum_var1=var1; h.add(); end; if rc=0 then do; rc=h.find(); sum_var1=sum_var1+var1; if sum_var1 ge 500 and ids.check() ne 0 then ids.add(); rc=h.replace(); end; if lr then ids.output(dataset:'IdsOver500'); run;
Again, the example is easily extendable to related situations. Therefore, I like to keep examples like this in abbreviations for easy retrieval.
In this post, we have seen examples of using the hash object as a dynamic data structure. This is a nice use of the object because it utilizes the unique feature of being able to grow or shrink dynamically at run time. This fact separates the hash object from the Temporary Array because we do not have to specify the size at compile time. Also, we avoid sorting the data, which is always desirable from a performance point of view. Classic uses of the hash object as a dynamic placeholder is further described in the blog posts Remove Duplicate Observation With the SAS Hash Object and Use the Hash Object as a Cache in PROC FCMP. Also, see a beautiful implementation of its dynamic nature in the post Identify Connected Components in SAS with PROC OPTNET.
For more tricks like this, I recommend the book Data Management Solutions Using SAS Hash Table Operations: A Business Intelligence Case Study.
You can download the entire program from this post here.