A common task among data scientists and SAS programmers is to identify consecutive events. The term consecutive is not always clear though. In some cases, it means that the state of some variable should be the same over for example 3 observations in a row. In other cases it means that there should be some specific increase between observations for eg 3 observations in a row. Today, I will investigate the first of them. Namely, the case where some state should be constant over some number of observations. Needless to say, these problems come in all types and shapes. You can find tons of examples of how to handle these problems online. Here, I will try to present a few approaches that are not very common.
Flag Consecutive Events
This problem is a classic and often posted on online fora. Say we want to flag the observations where the event is part of a consecutive line of events of more than three appearances of the same event. Consider the data below. This occurs for id=1 and 3 for the event=20 and 40 respectively.
In the data step that follows, I use a Double DoW Loop to achieve this. In the first DoW loop, I reach either the end of the By-Group (id) or the sequence of consecutive events (event). This way, I am sure to stay within the sequence of consecutive events and within the By-Group. I increase the _N_ Variable by 1 for each iteration. If _N_ increases to 3 or above, I know that a consecutive sequence of 3 or more has been identified and I set the flag variable to 1.
In the second DoW Loop, I read as many observations as in the first DoW Loop. I control this with the _N_ Variable from above. Finally, I use an explicit Output Statement to get as many output observations as in my input data set.
data have; input id $ event; datalines; 1 50 1 20 1 20 1 20 1 30 2 40 2 40 2 20 2 30 3 30 3 30 3 40 3 40 3 40 3 40 3 10 ; /* Flag consequitive events occuring at least 3 times within the same by-group */ data want; do _N_=1 by 1 until (last.event | last.id); set obs; by id event notsorted; flag=ifn(_N_ ge 3, 1, 0); end; do _N_=1 to _N_; set obs; output; end; run;
Select Consecutive Events with the Hash Object
Next, let us take a look at another problem. Consider the data below. I want to find the last 2 ids where a consecutive sequence of ‘High’ values of more than 3 occurs. This one is a bit more complicated. One way to attack the problem is to read the data set backwards and stop when I have found 2 ids. However, this would either require sorting the entire data set or using Point= logic, which is not desirable.
Instead, I utilize the dynamic nature of the hash object. In the data step below, I solve the problem in the 4 steps:
- First, I declare the hash object h and use the Ordered:’A’ argument tag. This is important because in step 4, I want to know which element in the hash object entered last. I use the _N_ variable as the key because it ascends by 1 for each sequence that we encounter. This is because _N_ ascends by 1 each time it passes the Data Statement. I use the id, value _N_ itself and a counter variable c as the data variables. Finally, I create a SAS Hash Iterator Object hi and link it to h.
- Next, I use the same logic as above in the DoW loop. I read the input data set until either the end of the ID By-Group or the end of the same-value sequence of value. For each iteration I add 1 to c.
- In step three, I check whether c is higher or equal to three and if value is equal to ‘high’. If so, I know that I have just read a sequence of ‘high’ values longer than or equal to three. If this is the case, I add the observation to the hash object.
- The final step is carried out when SAS reaches the end of the input data set (end=lr). Now, I use a simple Do Loop and loop as many times as I want observations . In this case, I want two observations or as many as the hash object allows me to. Remember, all the hash object elements now represent the desired ‘high’ sequences and how long they are, represented by c. I call the Last() Method on the iterator object. This will give me the element that entered h latest due to the fact that h is ordered by _N_ and the latest added sequence has the highest value of _N_ by construction. Next, I output the observation. Finally, I remove the iterator object entirely from the hash object again. I can do this simply by calling the Next() Method because the iterator dwells on the last element in the object. Finally, I call the Remove() Method on the hash object. This way, there is a new ‘last’ element in the object, which I find by iterating the while loop again. I keep doing this until the object is empty or I have as many observations as I want. In this simple case two.
data have; input id $ value $; datalines; 1 high 1 medium 1 high 2 medium 2 high 2 high 2 high 2 medium 2 medium 2 low 3 high 3 high 3 medium 3 medium 3 low 3 low 4 low 4 high 4 high 4 high 4 high 5 medium 5 medium 5 medium 5 low 5 low 5 medium 5 medium 5 low 6 low 6 high 6 high 6 high 6 medium 6 medium 7 medium 7 medium 7 high 7 high 7 high 7 high 7 high 7 high 7 high ; data want(drop=rc); if _N_=1 then do; /* 1 */ declare hash h(ordered: 'A'); h.definekey('_N_'); h.definedata('_N_', 'id', 'value', 'c'); h.definedone(); declare hiter hi ('h'); end; do until (last.id | last.value); /* 2 */ set have end=lr; by id value notsorted; c=sum(c, 1); end; if value='high' & c ge 3 then h.add(); /* 3 */ if lr; do _iorc_=1 to 2 while (hi.last()=0); /* 4 */ output; rc=hi.next(); h.remove(); end; run;
In this post, we have seen two examples of how to handle consecutive sequences of same-value events in SAS. I present two problems and solution approaches thereof. Obviously, these kind of problems come in all kind of forms and this is not an exhausting list of solutions at all.
In this post, I consider the term consecutive as sequences of same-value events. This is not always the case. In some situations, consecutive is meant as values that increase by some specified value. A good example is consecutive dates or years. I Investigate these problems in the post Working With Consecutive Dates and Years in SAS.
The hash object in the second example can be considered a stack. The elements that enter last are withdrawn first. If you want to learn more about Stacks and Queues in SAS, read page 270-279 of Data Management Solutions Using SAS Hash Table Operations.
You can download the entire code from this example here.