The SAS Lag Function Explained By Example
The SAS Lag Function is a common function to most SAS programmers. Yet, it is one of the most misunderstood functions in the entire SAS language. In this post, I will explain by example how the function works. Furthermore, I will demonstrate how to make the function produce the desired result. Even when handling By Groups.
In the examples to come, I use the simple data set here
/* Example data */ data have; input id x; datalines; 1 1 1 2 1 3 2 4 2 5 2 6 ;
Think of a Queue, Not a Lookback
When programmers first encounter the Lag Function, they visualize a lookback. Which makes sense based on the name. However, this is a crucial mistake. Lag Function is not a lookback. Rather it is a queue. When you use the Lag2 Function, SAS sets up a queue with two elements. Let us think of it like this [ . | . ]. Both elements are missing at the start of execution. Each time it executes, SAS returns the right-most value from the queue. Furthermore, the present value of the value we want to lag, is inserted into the queue from the left. The documentation describes a queue from bottom to top. However, I find it easier to present here with left to right. The principle is the same.
Let us look at an example. In the code below, I read the example data above. Then, I use the Lag2 Function and return it to the variable lag_x.
data want; set have; lag_x = lag2(x); run;
Let us take a look at how SAS processes this code. Below, you can see what the queue contains at each observation. Also, you can see what the function returns. Before SAS call the function the first time, the queue contains [ . | . ]. Therefore, it makes sense that it returns missing values in the two first calls.
/* id x Queue content 1 1 [ 1 | . ] Returned value: . 1 2 [ 2 | 1 ] Returned value: . 1 3 [ 3 | 2 ] Returned value: 1 2 4 [ 4 | 3 ] Returned value: 2 2 5 [ 5 | 4 ] Returned value: 3 2 6 [ 6 | 5 ] Returned value: 4 */
The Lag Function often produce undesirable results when we use it in conditional logic. This is a direct consequence of the queue construct and the fact that the queue is only affected when the function is actually executed. Consider the code example below. Here, the line if mod(_N_, 2) = 0 means “at every second observation”. Consequently, we execute the Lag2 Function at every second observation.
data want; set have; if mod(_N_, 2) = 0 then y = lag2(x); run;
Let us look at the results. If it was a sample lookback, we expect the second observation to me missing, the fourth observation to hold the value of x from the second observation and the sixth observation to hold the value of x from the fourth observation. However, this is not the case. The function is clearly only executed on every second observation. This causes results that we do not want.
/* id x Queue content 1 1 [ . | . ] 1 2 [ 2 | . ] Returned value: . 1 3 [ 2 | . ] 2 4 [ 4 | 2 ] Returned value: . 2 5 [ 4 | 2 ] 2 6 [ 6 | 4 ] Returned value: 2 */
There is a way to make the Lag Function execute regardless of whether the condition is true or not. We can use the Ifn Function like below. The second and third argument executes regardless of whether the first argument i true or false. A small feature with an enormous impact in this situation. Run the code below and verify that you get the results you expect if we consider the lag a lookback function.
data want; set have; y = ifn(mod(_N_, 2) = 0, lag2(x), .); run;
Finally, let us use what we learn above to handle a common situation where the Lag Function is misused. Handling By Groups. I want to lag the variable x. However, on the first observation of each By Group, I want a missing value. I do not want an x value from another By Group. Using what we learn above, does the code below yield the desired result?
data want; set have; by id; if first.id=0 then lagx=lag(x); run;
The answer is no. SAS executes the function only when the condition is true. Therefore, you can not rely on the code above. Instead, use the Ifn Function like this whenever you want to lag across by Groups.
data want; set have; by id; lagx = ifn(first.id, ., lag(x)); run;
In this post, I explain how the SAS Lag Function works. We learn to think of the lag function as a queue, not a lookback. Also, we see how to handle conditional logic and By Groups when lagging. There are dozens of learning material out there on the topic. However, the best article that I encountered was Leads and Lags: Static and Dynamic Queues in the SAS DATA STEP by Mark Keintz. I highly recommend reading it.
In a previous post, I write about how to Simulate a Lead Function in SAS. Also, see the posts Dynamic Lags in SAS with the Hash Object and Use Temporary Arrays to Store Lagged Values in SAS.
You can download the entire code from this post here.