When we calculate descriptive statistics in SAS, we think of Proc Means or Proc Summary or similar procedures. However, at times it can be beneficial to calculate statistics in the SAS data step. For example, when the logic behind the statistic is not straight forward. Or when speed is of the essence. In this post, I will demonstrate how to use the hash object to calculate different types of sums. All in the data step.
In the examples below I will use the example data.
data have; input id $ x; datalines; 002 2 003 5 003 1 001 5 002 9 001 6 001 3 002 4 003 7 ;
Group Sums With the SAS Hash Object
First, let us see how to calculate group sums. Suppose, we want to keep all the original observations in the data. However, for each observation, we want to add the total group sum that the ID belongs to. First, I declare a hash object at the first iteration of the data step. In the same Do Group, I read the have data set. I use a simple DoW loop structure to load the group sums into the object. The process of this is pretty straight forward. If the ID is not in the object already, set s to zero. If it is there, add the currect PDV value of x to s. In any case, we replace and update the sum that currently resides in the object.
Now, the rest is easy. Simply read the data sequentially and lookup the group sums in h with a simple Find() Call.
data _null_; dcl hash h(); h.definekey('id'); h.definedata('id', 's'); h.definedone(); do until (z); set have end = z; if h.find() ne 0 then s = 0; s + x; h.replace(); end; h.output(dataset : 'want'); run;
Next, consider the case where we aggregate the data. Meaning we want one row for each it and the corresponding sum. The code below does just that. First, we declare the object h. I let s denote the group sum. Next, I read the entire data set with a DoW Loop. For each observation, I look up the current ID and sum. If not found, we set the sum to zero. If found, we add the current value of x. In either case, we use the Replace() Method and replace the value in h.
data _null_; dcl hash h(); h.definekey('id'); h.definedata('id', 's'); h.definedone(); do until (z); set have end = z; if h.find() ne 0 then s = 0; else s + x; h.replace(); end; h.output(dataset : 'want'); run;
The code above creates the same data as the Proc Summary step below. Besides the fact that Proc Summary sorts the output data by the Class Variable. You can create the exact same output in the hash code if you use the Ordered : “Y” argument in the Declare Statement. Furthermore, the hash approach to calculating statistics like this is not too far from the behind-the-scenes work of Proc Summary with a Class Statement. Proc Summary uses a similar structure with AVL trees to store and retrieve values.
proc summary data = have nway; class id; var x; output out=want(drop=_:) sum=; run;
Cumulative Sums With the SAS Hash Object
Next, let us see an example of how to create a cumulative sum with the SAS hash object. Most SAS programmers would instantly think of sorting, By-Group Processing and Retaining to create a cumulative sum. However, suppose we want to keeo the original (possibly unsorted) order of the data. We can do this in the single pass with the hash object.
First, we declare the hash object h in the first iteration of the data step. I let cs denote the cumulative sum. Next, I simply read the input data. Finally, I retrieve the cumulative sum from the hash object for each id. If it is not found, I set it directly to x. If it is found, I add x to the value. I keep track of the cumulative sum by replacing in h. Finally, the implicit output statement takes care of the rest.
data want; if _N_ = 1 then do; dcl hash h(); h.definekey('id'); h.definedata('cs'); h.definedone(); end; set have; if h.find() then cs = x; else cs = sum(cs, x); h.replace(); run;
Recently, I wrote a blog post about the rarely used Suminc and Keysum Arguments. It turns out that these can be used to create the same result as above with very little coding. A neat little peace of code.
data want; if _N_ = 1 then do; dcl hash h(suminc : 'x'); h.definekey('id'); h.definedone(); end; set have; h.ref(); h.sum(sum : cum_sum); run;
In this post, I demonstrate how to use the SAS hash object to calculate different types of sums in SAS. It may be easier and simpler to compute aggregates like these with pre-fabricated SAS procedures. However, when speed is of the essence, the hash object approach can come in handy. The reason is that the data step calculates only the statistics you want it to. SAS procedures have pre-defined statistics that will be calculated. Whether you want them or not.
You can download the entire code from this post here.