SAS Hash Object Lookup Example

I have previously written the two blog posts SAS Array Lookup Example and SAS Proc Format Lookup Example. I demonstrate examples of how to perform table lookups with arrays and formats. While these are both valid and efficient lookup techniques, they do have limitations. These limitations are eliminated if we use the extremely powerful SAS Hash Object. In this post, I will briefly demonstrate an example of a simple hash object lookup. Furthermore, I will discuss how the hash object overcomes the limitations of the lookup technique described in the two posts above.

In the examples to come, I will use almost the same example data as in the Format Lookup Example post, generated with the following code

data employees(drop=i);
   length empid $20;
   array first_names{20} $15 _temporary_ ("Paul", "Allan", "Thomas", "Michael", "Chris", "David", "John", "Jerry", "James", "Robert",
                                          "William", "Richard", "Bob", "Daniel", "Paul", "George", "Larry", "Eric", "Charles", "Stephen");
   array last_names{20} $15 _temporary_ ("Smith", "Johnson", "Williams", "Jones", "Brown", "Miller", "Wilson", "Moore", "Taylor", "Hall",
                                        "Anderson", "Jackson", "White", "Harris", "Martin", "Thompson", "Robinson", "Lewis", "Walker", "Allen");
   call streaminit(123);
   do i=1 to 50e6;
      first_name=first_names[ceil(rand("Uniform")*20)];
      last_name=last_names[ceil(rand("Uniform")*20)];
      empid=compress(uuidgen(), '-');
      output;
   end;
run;
 
proc surveyselect data=employees out=temp(keep=empid) seed=123 noprint                
     method=srs
     sampsize=500000;
run;
 
data emphours;
   set temp;
   hours=round(rand('Uniform', 10, 100), 0.01);
   sickdays=ceil(rand("Uniform")*10);
   seniority=ceil(rand("Uniform")*30);
run;
 
proc datasets lib=work nolist;
   delete temp;
run;quit;

The only difference is that the lookup table emphours contains three lookup variables instead of one.

A Simple Hash Object Lookup

The fact that the lookup table emphours contains more than one lookup variable outlines the shortcomings of the other lookup techniques well. They can only lookup one value for a given key (not entirely true, but is takes a lot of effort to lookup multiple values). For example, if we use the format lookup, we would have to create as many formats and as many CNTLIN Data Sets as the number of lookup variables. This is certainly not desirable.

However, we can easily lookup multiple variable values for a given key with the hash object. To do so, we simply declare the hash object in the first iteration of the data step (if _N_=1). I specify empid as the key variable and hours, sickdays and seniority as data variables in the object. Finally I set the data variables as missing. Otherwise they are retained from the previous observation in the data set. Finally, I set the Employees data set and use the FIND() Method to perform the lookup.

data HashLookup(drop=rc);
   if 0 then set work.emphours;
   if _N_ = 1 then do;
      declare hash h(dataset:'work.emphours');
      h.defineKey('empid');
      h.defineData('hours', 'sickdays', 'seniority');
      h.defineDone();
      call missing(hours, sickdays, seniority);
   end;
 
   set Employees;
 
   rc=h.find();
run;

The if 0 then set statement at the top of the program ensures that the relevant variables exist in the PDV. The HashLookup data set contains the variables from Employees and the three data variables from EmpHours. Thus, we successfully performed an efficient table lookup of multiple data values without creating any new data sets or catalogs (formats). This is the power of the hash object.

Summary

We have seen that the hash object easily overcomes the shortcomings of other lookup techniques in SAS. However, we have only seen the tip of the iceberg. The hash object is extremely flexible and efficient as it exists only in memory. I will leave this post as a simple teaser as many more posts on the hash object are to come. Stay tuned!

If you want to learn more about the rich capabilities of the SAS Hash object, I recommend the two books Data Management Solutions Using SASĀ® Hash Table Operations: A Business Intelligence Case Study and SASĀ® Hash Object Programming Made Easy.

I have previously blogged about how to use the SAS Hash Object to Split Datasets By Group.

You can download the entire code from this article here.