It is a common data science task to combine data from multiple data sets in SAS. There are numerous techniques available. The best choice depends on the particular problem and data sets. However, knowing the different approaches is essential if you want to make the best choice. Especially from a performance point of view. This post presents an example of a temporary array lookup technique.

In the example to come, I will use the example data sets Employees and EmpHours. Employees contain employee ids for 10Mio made up employees. EmpHours contains a subset of the employee IDs with an hours variable. The hours variable indicates the hours worked for the appropriate ID. In the following, I will look up the hours worked for each employee in the Employees data set.

Click here to get the code that generates the two example data sets. I encourage you to examine the two data sets before you continue.

The Usual Sort Merge Approach

First, let us look at the most common way to merge two tables like this. To use the MERGE Statement in the Data Step, the two input data sets must be sorted by the appropriate by-variable. Here, we use PROC SORT to sort the two input tables by the common variable empid. Next, I merge the two tables in the data step.

proc sort data=employees;
   by empid;
run;
 
proc sort data=emphours;
   by empid;
run;
 
data SortMerge;
   merge employees emphours;
   by empid;
run;

The SAS data set SortMerge is now enriched with the hours variable from the EmpHours data set. Effectively, this approach requires reading both data sets twice. Once in each Sort Procedure run and once in the data step.

SAS Array Lookup Approach

The sort-merge technique is valid and quite flexible in most contexts. However, we have to think about performance. Especially when we work with large data set. The approach above requires the input data to be sorted. This is CPU costly. If we can do the lookup and avoid the sort, it is to be preferred. This is exactly what an array lookup can do.

We can describe the successful array lookup in three steps:

  1. Define a temporary array large enough to hold all possible key lookup values.
  2. Sequentially access all values to be looked up from the lookup table and store them in the array so that their position matches the numerical value of the key lookup variable.
  3. Sequentially access the master table and exploit the fact that you can look up the value of interest with the numerical value of the key variable directly in the array.

In the context of our example, we perform the three steps in the data step below

  1. First, I define an array EmpHours, large enough to hold all possible values of empid. Empid is a 7 digit reference number. Therefore I need 9999999 entries to be sure to fit them all. I do not want to create new variables in the PDV. Therefore I use the _TEMPORARY_ keyword to make the array temporary.
  2. Next, I access the emphours data set sequentially and place the hours value in the numerical position of empid in the array.
  3. Finally, I access the employees data set sequentially. Here I use the array to look up the hours values. Remember, I wisely placed the hours values in the temporary array such that the numerical position matches their corresponding value of empid. Consequently, I can now use empid to access the appropriate hours value directly in the array.
data ArrayLookup;
   array EmpHours{9999999} _temporary_;
   do until(eof1);
      set emphours end=eof1;
      EmpHours[input(empid, 7.)]=hours;
   end;
   do until(eof2);
      set employees end=eof2;
      hours=EmpHours[input(empid, 7.)];
      output ArrayLookup;
   end;
   stop;
run;

Discussion

The SAS array lookup approach is very efficient for several reasons. Obviously, we save a lot of processing time because we do not sort the data. Also, an array exists in memory, which means no I/O operations of utility files. Finally, the number of I/O operations usually shrinks. In this example, we only read the data half as many times as in the sort-merge approach, which means half as many I/O operations.

However, the array lookup technique has its obvious limitations. Firstly, the common key variable has to be numerical representative because we must be able to access an entry with it. We can not access the 1X4G7D’th entry of an array. Secondly, we can only lookup a single value, because we can only assign one value to each position in an array. Lastly, it must be able to fit all possible key lookup values. Therefore, if your common key variable is a ten-digit value, the array lookup is no good.

In a future blog post, I demonstrate a Format Lookup In SAS. The format lookup overcomes a few of the obstacles in the array lookup. Even further, I demonstrate how to use the hash object to look up data in the blog post SAS Hash Object Lookup Example. As an alternative, check out the bitmap search technique in Bitmapping in SAS.

Summary

There are pros and cons to every lookup technique in SAS. The best technique depends largely on the individual situation. In this post I demonstrate the array lookup technique with an example. In the example, we cut the real time in half compared to the sort-merge technique. Lastly, I discuss the pros and cons of the technique. Unfortunately the speed comes with a cost, as it also has quite a few limitations, where other lookup techniques are much handier.

As a related post, see 8 SAS Array Function You Should Know, How to Store Lagged Values in Temporary Arrays and An Array Hashing Scheme in SAS. Also, I use the array lookup technique in the post about Danish Social Security Numbers.

You can download the entire code from this post here.