Select By Group Conditional On Observation
Usually, I prefer the Data Step over PROC SQL in SAS. I know exactly how the program compiles and executes and in what order. However, problems do arise where an SQL solution is simpler and more intuitive. In that case, I have to force myself out of my data step comfort zone. A classic example is to select an entire By Group of data in SAS based on a single variable value in an observation. This post shows two different ways to accomplish this using PROC SQL.
The data step is inherently iterative. It reads in observations one at the time, modifies it in the PDV and writes the observation to a new data set. But what if you only want to write the observation to the data set if some criterion is fulfilled? Easy, just use a Subsetting If Statement right? Well, what if you want to write all observations within a By Group to the data set if just one observation within the By Group fulfills the criterion? Then the Data Step is not the ideal tool (though it can be done).
Instead, we use the SQL Procedure, which lets us think in terms of data groups instead of iterating through single observations. In the examples to come, I use the Sashelp.Stocks data set, which contains trading data fro three different stocks (IBM, Microsoft and Intel). Now, I am interested only in the stocks that at some point in time has a trading volume above 100M. Let us look at two different ways of handling this problem.
Example 1: Subquery
The first approach involves a subquery. If we look at the program from the inside out, the Select Clause within parenthesis creates a list of stock names, that at some point has a volume above 100M. The outer Select Clause select all variables for the stocks in the list from the subquery.
proc sql; create table stocks_subquery as select * from sashelp.stocks where stock in (select distinct stock from sashelp.stocks where volume>10e7) order by stock, date; quit;
If we look at the created data set stocks_subquery, we can see that only Intel and Microsoft has a trading volume above 100M at some point. Furthermore, we see that we effectively select the relevant By Groups, conditional on observation values.
Example 2: Use Having Clause
Next, let us look at another approach. The syntax is almost similar to Example 1. However, instead of a subquery, we use a Group By Statement along with a Having Clause. The key to this approach is the Having Clause. Here, I create implicit indicator variables for each observation with Volume>10e7 inside the SUM Function. Next, I use the SUM function to identify if the indicator variable is different from zero for any observation inside the By Group. If the indicator variable is different from zero, the sum will be greater than zero and we select the entire By Group.
proc sql; create table stocks_havingclause as select * from sashelp.stocks group by stock having sum(Volume>10e7) > 0 order by stock, date; quit;
We can run a PROC COMPARE to confirm that the Two SAS Data Sets Are Identical.
proc compare data=stocks_subquery compare=stocks_havingclause; run;
This post demonstrates two different approaches to how you can select entire By Groups based on single observations in SAS. The two approaches in the examples above are quite similar. Therefore, the approach you choose relies on the context of your problem and what approach makes most sense to you.
If you liked this trick, check out my Examples page for more Data Science Examples.
You can download the entire program from this post here.