Last week, I blogged about extracting every n’th observation from a SAS data set. Today, I will demonstrate how to select the top N of a variable in a SAS data set by group. There are many ways this can be done. I will go through three widely different methods. Using PROC RANK, PROC SUMMARY and a Data Step Hash Object Method. In the example to come, I use sashelp.iris and select the top three SepalLength for each Species. You will see that the different methods yield slightly different results.
Proc Rank Approach
First, let us see how to use PROC RANK to select the top 3 Sepallength for each species. I specify the descending so that the highest values get the lowest rank. Next, I specify ties=low. This means that if two values are equal, they are both assigned the lowest of their corresponding ranks. In the Ranks Statement, I specify r as the variable to hold the ranks. Next, I use PROC SORT to sort the SAS data set by Species and r to get my final data set. Notice that the data set contains 11 observations, even though there are only 3 by-groups in the data. This is because for species=’Virginica’, there are 4 observations with the second highest value of sepallength.
The PROC RANK method is quite common. And it is quite flexible, even though it is very simple. You can easily choose how to deal with ties with a single procedure option. On the down-side, the method requires two steps. Consequently, there are two full reads of the data set and and entire sort algorithm.
proc rank data=sashelp.iris out=rank descending ties=low; by species; var SepalLength; ranks r; run; proc sort data=rank out=want; where r <= 3; by species r; run;
Proc Summary Approach
Next, let us see how to use PROC SUMMARY. Not surprisingly, I use Species in my Class Statement and Sepallength in the Var Statement. In the Output Statement, I first specify the name of the output data set. Next, I specify the IDGROUP Option. The IDGROUP Option is the key to this approach. It identifies multiple extremes. In this case, I want the maximum extremes of the Sepallength variable, and I want three of them. I specify this in the parentheses after the IDGROUP option. Run the code and check the results. You can see that the procedure creates three new variables with the extremes. Therefore, if we want the data in long form as above, we have to run the SAS PROC TRANSPOSE step as well.
I like this approach for several reasons. It utilizes an options, which has this very purpose. Also, if you want the final data in wide form, this is what the procedure provides. In terms of efficiency, obviously some kind of sort or search must take place behind the scenes. The documentation describes the search time in big O notation. I highly recommend that you familiarize yourself with the IDGROUP Option. As an interesting side note, the options can also be used to transpose data.
proc summary data=sashelp.iris nway noprint; class species; var SepalLength; output out=temp(drop=_type_ _freq_) idgroup (max(SepalLength) out (SepalLength) =) / autolabel autoname; run; proc transpose data=temp out=want(drop=_: rename=col1=SepalLength); by Species; run;
A Hash Object Approach
None of the approaches above utilizes the fact that sashelp.iris is already sorted and grouped by Species. Therefore, let us look at a data step hash object approach, that utilizes the shape of the data. I describe the steps of the approach in four steps below
- In the first step, I declare the hash object and specify the sashelp.iris data set with zero observations. This allows me to use the all:’Y’ notation in the DefineData Statement. I use the ordered:’D’ argument. This is very important because this is what makes me find the maximum values. Finally, I declare a Hash Iterator Object and link it to the hash object h.
- In step two, I use a DoW Loop to read all observations from the By-Group into the SAS hash object. I use the Add() Method to read in the By-Group data one observation at the time. Remember that when we add observations to the hash object, they are ordered descending by Sepallength.
- Next, I use the hash iterator method, next() to retrieve the first element (with the largest value of Sepallength) from the hash obejct. Then the second largest and so on. I continue this iteration until there are no more elements in the object or we have retrieved three values.
- At the end of step 3, I have retrieved and outputted 3 observations with the highest value of Sepallength. However, the iterator object still resides inside the hash object. This locks the By-Group on which it dwells. Therefore, I have to remove the iterator from h again. I do this by calling the first() and prev() methods. Finally, I call the Clear() Method to remove all elements from the object.
Run the code and verify the results. I like this method because it takes advantage of the shape of the data. Consequently, we retrieve the desired result in a single pass of the data. Furthermore, it requires no explicit sort. The hash object takes care of that part.
data want; if _N_ = 1 then do; declare hash h (dataset : 'sashelp.iris(obs=0)', ordered : 'd', multidata : 'Y'); /* 1 */ h.definekey ('SepalLength'); h.definedata (all : 'y'); h.definedone(); declare hiter hi ('h'); end; do until (last.Species); /* 2 */ set sashelp.iris; by Species; h.add(); end; do _N_ = 1 by 1 while (hi.next() = 0 & _N_ <= 3); /* 3 */ output; end; _N_ = hi.first(); /* 4 */ _N_ = hi.prev(); h.clear(); run;
In this post, I have presented three different methods to select Top N By Group in SAS. I use two different method using the Rank and Summary Procedures and a data step method utilizing the dynamic nature of the hash object. Furthermore, I discuss the pros and cons of each approach. The correct approach depends on the situation and how you want to deal with ties in the data. Also, you have to consider the initial shape of your data. Is it sorted already? Then the hash object approach may be the way to go.
You can download the entire code from this post here.