By-group processing is part of the game when we work with data in SAS. The usual approach for most SAS programmers goes as follows: Sort the data. Use the By Statement in the Data Step or procedure. Perhaps use first – and last dot notation in the data step to control beginnings and ends of groups. However, as we have seen in the posts Three PROC SORT Options You Should Know and Three Alternatives To PROC SORT, sorting is CPU costly. In this post, I demonstrate how we can take advantage of the hash object to group (not sort) data values and by group process the data with the NOTSORTED Option.
In the example to come, I will use the following example data. It is simply a customer Id and a purchase value. The data is for demonstration purposes only.
data MyData(drop=i); length CustomerID $6 Purchase 8; call streaminit(123); do i=1 to 20e6; ID=put(rand("integer", 1, 1000), z6.); Purchase=rand("integer", 100, 5000); output; end; run;
The Usual Approach in SAS
First, let me present the usual by group approach. We sort the data by the CustomerID variable. Next, we use the By Statement in the data step to initialize the first.CustomerID and last.CustomerID variables. I use them to find the start and end of the relevant groups. Finally, I accumulate the sum variable with the help of the first. and last. variables and output at the end of each group.
proc sort data=MyData; by CustomerID; run; data Accumulate; set MyData; by CustomerID; if first.CustomerID then sum=0; sum=sum+Purchase; if last.CustomerID; run;
The PROC SORT step takes about 6 seconds to run and the data step takes about 2 seconds to run.
Group Data With Hash Object And Use NOTSORTED Option
Now, let us look at an alternative approach. Perhaps the data needs only to be grouped by the variable of interest, and not sorted. If the variable does not have any particular ranking, it may not matter. Does it matter whether CustomerID number 000100 appears before 001000? Probably not. We can take advantage of the grouping mechanism of the hash object to group variable values. Simply read the data set into a hash object and specify the CustomerID as the key. Remember to set the multidata option to Y. Finally, use the Output Method to output the grouped data set.
The GroupedData data set is not sorted. However, it is grouped by the CustomerID variable in a non-predictable manner. Finally, we can use the same data step as above to by group process the data. However, we have to set the NOTSORTED Option to tell the data set that the by variable is not in any chronological order.
data _null_; if 0 then set MyData; if _N_=1 then do; declare hash h(dataset:"MyData", multidata:"Y"); h.definekey("CustomerID"); h.definedata(all:"Y"); h.definedone(); end; h.output(dataset:"GroupedData"); run; data Accumulate; set GroupedData; by CustomerID notsorted; if first.CustomerID then sum=0; sum=sum+Purchase; if last.CustomerID; run;
The two steps above take 3 seconds and 2 seconds respectively. A sizable saving due to the fact that we do not waste CPU power on sort algorithms. Though we did not sort any data (not even inside the hash object), we created almost the same data as in the previous example. The only difference is the order. The answer to why this works is in the blog post AVL Tree Distribution in SAS Hash Objects Explained.
In this post, I demonstrate how to use the hash object along with the NOTSORTED Option to surpass sort algorithms and by group process data. Though we did save time and CPU power with the hash object approach, it does have limitations. Remember that the hash object exists in memory only. So we can not use the approach with a data set too big to fit in memory. The Sort Procedure does not have that problem since it can create temporary utility files on disk.
If you are interested in why/how this approach works, I recommend the book Data Management Solutions Using SAS Hash Table Operations: A Business Intelligence Case Study. The first few chapters contain well-written information about the underlying mechanisms of the Hash Object.
Also, read the related posts Complicated Fuzzy Grouping With The SAS Hash Object and Partial Key Lookup in SAS Hash Object.
You can download the entire program from this post here.