## SAS Array Lookup Example

It is a common data science task to combine data from multiple data sets in SAS. There are numerous techniques available. The best choice depends on the particular problem and data sets. However, knowing the different approaches is essential if you want to make the best choice. Especially from a performance point of view. This post presents an example of a temporary array lookup technique.

In the example to come, I will use the example data sets **Employees **and **EmpHours**. **Employees **contain employee ids for 10Mio made up employees. **EmpHours **contains a subset of the employee IDs with an **hours** variable. The **hours** variable indicates the hours worked for the appropriate ID. In the following, I will look up the hours worked for each employee in the **Employees ** data set.

Click here to get the code that generates the two example data sets. I encourage you to examine the two data sets before you continue.

##### The Usual Sort Merge Approach

First, let us look at the most common way to merge two tables like this. To use the MERGE Statement in the Data Step, the two input data sets must be sorted by the appropriate by-variable. Here, we use PROC SORT to sort the two input tables by the common variable **empid**. Next, I merge the two tables in the data step.

proc sort data=employees; by empid; run; proc sort data=emphours; by empid; run; data SortMerge; merge employees emphours; by empid; run; |

The SAS data set **SortMerge ** is now enriched with the **hours** variable from the **EmpHours** data set. Effectively, this approach requires reading both data sets **twice**. Once in each Sort Procedure run and once in the data step.

##### SAS Array Lookup Approach

The sort-merge technique **is** valid and quite flexible in most contexts. However, we have to think about performance. Especially when we work with large data set. The approach above requires the input data to be sorted. This is CPU costly. If we can do the lookup and avoid the sort, it is to be preferred. This is exactly what an array lookup can do.

We can describe the succesful array lookup in three steps:

- Define an array large enough to hold
**all**possible key lookup values. - Sequentially access all values to be looked up from the lookup table and store them in the array so that their position matches the numerical value of the key lookup variable.
- Sequentially access the master table and exploit the fact that you can look up the value of interest with the numerical value of the key variable directly in the array.

In the context of our example, we perform the three steps in the data step below

- First, I define an array
**EmpHours**, large enough to hold**all**possible values of**empid**.**Empid**is a 7 digit reference number. Therefore I need**9999999**entries to be sure to fit them all. I do not want to create new variables in the PDV. Therefore I use the _TEMPORARY_ keyword to make the array temporary. - Next, I access the
**emphours**data set sequentially and place the**hours**value in the numerical position of**empid**in the array. - Finally, I access the
**employees**data set sequentially. Here I use the array to look up the**hours**values. Remember, I wisely placed the hours values in the array such that the numerical position matches their corresponding value of**empid**. Consequently, I can now use**empid**to access the appropriate**hours**value directly in the array.

data ArrayLookup; array EmpHours{9999999} _temporary_; do until(eof1); set emphours end=eof1; EmpHours[input(empid, 7.)]=hours; end; do until(eof2); set employees end=eof2; hours=EmpHours[input(empid, 7.)]; output ArrayLookup; end; stop; run; |

The SAS array lookup approach is **very** efficient for several reasons. Obviously, we save a lot of processing time because we do not sort the data. Also, an array exists in memory, which means no I/O operations of utility files. Finally, the number of I/O operations usually shrinks. In this example we only read the data half as many times as in the sort-merge approach, which means half as many I/O operations.

However, the array lookup technique has its obvious limitations. Firstly, the common key variable has to be numerical representative because we must be able to access an entry with it. We can not access the **1X4G7D’th** entry of an array. Secondly, we can only lookup a single value, because we can only assign one value to each position in an array. Lastly, it must be able to fit **all **possible key lookup values. Therefore, if your common key variable is a ten digit value, the array lookup is no good.

In a future blog post, I demonstrate a Format Lookup In SAS. The format lookup overcomes a few of the obstacles in the array lookup. Even further, I demonstrate how to use the hash object to look up data in the blog post SAS Hash Object Lookup Example.

##### Summary

There are pros and cons to every lookup technique in SAS. The best technique depends largely on the individual situation. In this post I demonstrate the array lookup technique with an example. In the example, **we cut the real time in half** compared to the sort-merge technique. Lastly, I discuss the pros and cons of the technique. Unfortunately the speed comes with a cost, as it also has quite a few limitations, where other lookup techniques are much handier.

You can download the entire code from this post here.