Last week, I wrote about how to Use Arrays To Look Up Data In SAS. There are numerous ways to combine data sets and look up data from multiple sources in SAS. Therefore it is essential to know the different approaches and when to use them. Today, I demonstrate how to use formats as a lookup tool by example. The format lookup is a nice alternative to the array lookup technique because it overcomes some of the restrictions of the array lookup.

In the example to come, I will use the example data sets Employees and EmpHours. Employees contain employee ids for 50Mio made up employees. EmpHours contains a subset of the employee IDs with an hours variable. The hours variable indicates the hours worked for the appropriate ID. In the following, I will look up the hours worked for each employee in the Employees data set.

data employees(drop=i);
   length empid $20;
   array first_names{20} $15 _temporary_ ("Paul", "Allan", "Thomas", "Michael", "Chris", "David", "John", "Jerry", "James", "Robert",
                                          "William", "Richard", "Bob", "Daniel", "Paul", "George", "Larry", "Eric", "Charles", "Stephen");
   array last_names{20} $15 _temporary_ ("Smith", "Johnson", "Williams", "Jones", "Brown", "Miller", "Wilson", "Moore", "Taylor", "Hall",
                                        "Anderson", "Jackson", "White", "Harris", "Martin", "Thompson", "Robinson", "Lewis", "Walker", "Allen");
   call streaminit(123);
   do i=1 to 50e6;
      first_name=first_names[ceil(rand("Uniform")*20)];
      last_name=last_names[ceil(rand("Uniform")*20)];
      empid=compress(uuidgen(), '-');
      output;
   end;
run;
 
proc surveyselect data=employees out=temp(keep=empid) seed=123 noprint                
     method=srs
     sampsize=500000;
run;
 
data emphours;
   set temp;
   call streaminit(123);
   hours=round(rand('Uniform', 10, 100), 0.01);
run;
 
proc datasets lib=work nolist;
   delete temp;
run;quit;

The Sort Merge Approach

First, let us merge the two tables with a Merge Statement in the Data Step. For this approach to work, the input data must be sorted by empid. Therefore, I use PROC SORT to sort the two tables appropriately.

proc sort data=employees;
   by empid;
run;
 
proc sort data=emphours;
   by empid;
run;
 
data SortMerge;
   merge employees emphours;
   by empid;
run;

On my machine, this approach takes 1 minute and 45 to run. Over a minute of this is due to the PROC SORT steps.

The Format Lookup Approach

Now, let us see how we can use a user defined SAS format to look up the data. The first step to do so is to read the lookup data set into a format (or catalog). Therefore, we have to put the data into a format that PROC FORMAT can read. When we read a data set into a SAS format with PROC FORMAT, three variables must exist: start, label and fmtname. Consequently, I rename the input variables accordingly and use the retain statement to create a constant fmtname variable. Furthermore, I want this to be a character format. Therefore, I also specify the type variable. The type variable is not required, but unless explicitly specified, it has a default value of “n” (numeric format). Finally, I want a blank lookup value if the key is not found in the lookup table. Therefore I use the HLO=”O” with blank values for start and label in the last record of the data set.

Now the hard part is over. We have successfully put the data into a format for PROC FORMAT to read. Now we simply run the procedure and use the CNTLIN= Option to specify the input data set.

data fmt;
   set emphours(rename=(empid=start hours=label)) end=eof;
   retain fmtname "HourFmt" type "c";
   output;
   if eof then do;
      start=""; Label="";
      HLO="O"; output;
   end;
run;
 
proc format library=work cntlin=fmt;
run;

Performing The Lookup

Now to the actual lookup. We use the Put Function and specify the user created HourFmt to associate the input empid with the corresponding value of hours from the EmpHours data set. To complete the lookup, I use the Input Function to Convert From Character To Numeric because we want hours to be a numeric variable.

data want;
   set employees;
   hours=input(put(empid, $HourFmt.), 10.2);
run;

On my machine, the format lookup takes about 25 seconds total run time. Including the data step and PROC FORMAT run. That is well below half the run time of the sort-merge approach above.

The format lookup approach is very efficient. However, it is not very well known among SAS programmers. It overcomes the drawback of the array lookup, which can only lookup numeric values. In this post, we saw an example of looking up a mixed alphanumeric value. That is not possible in a direct array lookup. Furthermore, we are not restricted to some upper bound of key values. The format lookup can handle much wider ranges of key lookup values than the array lookup technique.

However, the format lookup has drawbacks of its own. We can still only look up a single value for each key. That is an issue, because in many situations, we want multiple lookup values for a single key value. In the blog post SAS Hash Object Lookup Example, I will demonstrate how to use the SAS Hash Object to overcome this obstacle. Also, see the blog post about Bitmapping in SAS.

Summary

The format lookup is a fast, efficient lookup method when used correctly. As we have seen, it can drastically reduce run times for merging and looking up data from multiple sources in SAS. It overcomes drawbacks of similar lookup techniques but is not without flaws itself.

This post presented an example of how to use the format lookup technique in SAS. However, it can do so much more than presented here. The book The Power Of PROC FORMAT is a great introduction to a procedure not many SAS programmers master. But those who do expand their SAS knowledge massively. I do not hesitate to call the Format Procedure the most underused procedure in the SAS programming language. See the related posts 5 SAS Picture Format Options You Should Know and Writing Multilabel Formats for a glimpse of what PROC FORMAT can do for you.

You can download the entire code from this post here.