Random Sampling in SAS Without Replacement

Random sampling lets you select a random subset of observations from a data set. Statisticians use it to draw inferences about a population from a subset of that population. The subject is vast, and different techniques produce samples with different properties. In this post, I will demonstrate how to do random sampling without replacement in SAS. This is probably the simplest random sampling technique to imagine. The term “Without Replacement” means that once an observation is chosen, it cannot be chosen again.

In the examples to come, I will demonstrate random sampling using the Data Step and PROC SURVEYSELECT. I will do so with the example data below.

data have;
input id x;
datalines;
1 1 
1 2 
1 3 
2 4 
2 5 
2 6 
2 7 
3 8 
3 9 
3 10
3 11
3 12
;

A Data Step Approach

The standard way of randomly selecting k observations from a data set with n observations goes like this. Initialize a retained variable k with the number of observations you want to choose, in this case 5. Next, read the data set with the Set Statement and save the number of observations in a variable n with the Nobs= Option. For each observation, generate a random uniform variate, which is a random number between 0 and 1. If this number is less than k/n, I do two things. First, I output the current observation. Second, I subtract 1 from k, because there is now one observation fewer left to choose. Finally, I subtract 1 from n at the bottom of the step, regardless of whether the current observation was chosen or not. This way, k/n is always the number of observations still to be chosen divided by the number of observations still to be read, so exactly k observations are output.

data want(keep = id x);
   retain k 5;
   set have nobs = n;
   if rand ("uniform") < k/n then do;
      output;
      k = k-1;
   end;
   n = n-1;
run;
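
If you want the draw to be repeatable and the step to stop reading as soon as the sample is complete, the step above can be adjusted slightly. The sketch below is a variation of my own, not the code from the SAS note: the seed value 123 and the variable names total and remaining are arbitrary choices made for illustration.

```sas
data want(keep = id x);
   retain k 5 remaining;
   set have nobs = total;
   if _n_ = 1 then do;
      call streaminit(123);   /* arbitrary seed, makes the draw repeatable */
      remaining = total;      /* observations not yet considered           */
   end;
   if rand("uniform") < k / remaining then do;
      output;
      k = k - 1;
      if k = 0 then stop;     /* sample complete, skip the rest of have    */
   end;
   remaining = remaining - 1;
run;
```

Copying the Nobs= value into a separate retained variable keeps the original count available in total, and the Stop Statement saves reading the tail of a large data set once all 5 observations have been found.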

You can read more about this technique in the SAS note here.

Another Data Step Approach

Next, I will demonstrate another data step method. This method is not as widely used as the one above. First, I declare a temporary array. The array must have at least as many entries as there are observations in the input data set. In the example below, I declare the array with 15 entries and initialize the elements with the values 1 to 15. I set the variable h to the number of observations in the input data set have. Next, I use the _N_ Variable to loop as many times as the number of observations I want to pick, in this case 5.

Within the loop, I start by simulating a random integer i between 1 and h. Next, I pick the i’th element of s and save the value in the variable p. In the following statement, I read the p’th observation with the Point= Option in the Set Statement and output it. Next, I replace the i’th element of the s array with the h’th element. I do this because the value in the i’th element of s is the number of the observation that was just picked. Therefore, we cannot pick it again. We can, however, still pick the observation whose number is stored in the h’th element. Finally, I subtract 1 from h, because the value in the h’th element has now been copied to the i’th entry, so later draws only need to cover the first h-1 entries of s.

data want (keep = id x);
   array s {15} _temporary_ (1:15);
   call streaminit (123);
   h = n;
   do _n_ = 1 to 5;
      i = rand ("integer", h);
      p = s [i];
      set have point=p nobs=n;
      output;
      s [i] = s [h];
      h = h-1;
   end;
   stop;
run;

This technique may be harder to grasp at first. However, take a look at the log output below. It shows the contents of the s array before sampling and after each iteration of the loop, so you can see exactly which values are replaced. Hopefully, the algorithm makes more sense when you see this.

BEFORE SAMPLING:
array_s=1|2|3|4|5|6|7|8|9|10|11|12|13|14|15
 
 
h=12 Replaces x=2 in s.
array_s=1|12|3|4|5|6|7|8|9|10|11|12|13|14|15
 
h=11 Replaces x=1 in s.
array_s=11|12|3|4|5|6|7|8|9|10|11|12|13|14|15
 
h=10 Replaces x=6 in s.
array_s=11|12|3|4|5|10|7|8|9|10|11|12|13|14|15
 
h=9 Replaces x=12 in s.
array_s=11|9|3|4|5|10|7|8|9|10|11|12|13|14|15
 
h=8 Replaces x=7 in s.
array_s=11|9|3|4|5|10|8|8|9|10|11|12|13|14|15
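
A trace of this kind can be written with PUT statements. The step below is a minimal sketch of one way to do it, not necessarily the code that produced the log in this post: the variable array_s, its length and the message wording are assumptions of mine. It relies on the CATX function accepting a temporary array through the OF s[*] syntax.

```sas
data _null_;
   array s {15} _temporary_ (1:15);
   length array_s $ 60;                 /* holds the array printed as text */
   call streaminit (123);
   h = n;

   array_s = catx ("|", of s[*]);
   put "BEFORE SAMPLING:" / array_s= /;

   do _n_ = 1 to 5;
      i = rand ("integer", h);          /* position drawn among the first h entries */
      p = s [i];                        /* observation number stored there          */
      set have point=p nobs=n;
      put h= "replaces the picked value " p "stored in s[" i +(-1) "].";
      s [i] = s [h];                    /* overwrite the picked entry               */
      h = h-1;
      array_s = catx ("|", of s[*]);
      put array_s= /;
   end;
   stop;
run;
```

Because i is always drawn between 1 and h, only the first h entries of s can ever be overwritten. Entries beyond h keep whatever value they last held and are never drawn again.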

I learned this technique from Paul Dorfman at the SAS Community.

PROC SURVEYSELECT

Finally, I will demonstrate how to do random sampling in SAS with PROC SURVEYSELECT. The Surveyselect Procedure is an out-of-the-box procedure designed to do random sampling in SAS. PROC SURVEYSELECT lets you do much more complicated random sampling than simple random sampling without replacement. This is controlled mainly by the Method= Option. In this example, I set Method=SRS (Simple Random Sampling). Also, I set Sampsize=5 to specify that I want 5 observations from the input data set.

proc surveyselect data=have out=want noprint
     method=srs
     sampsize=5;
run;
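
Two small extensions are often useful in practice. The sketch below assumes the same have data set: the Seed= Option makes the sample reproducible, and the Strata Statement draws a separate simple random sample within each value of id. The seed value 123 and the output data set names want_seeded and want_strata are arbitrary choices for illustration.

```sas
/* Reproducible simple random sample: Seed= fixes the random number stream */
proc surveyselect data=have out=want_seeded noprint
     method=srs
     sampsize=5
     seed=123;
run;

/* Stratified variant: 2 observations from each value of id.         */
/* The input must be sorted by the Strata variable; have already is. */
proc surveyselect data=have out=want_strata noprint
     method=srs
     sampsize=2
     seed=123;
   strata id;
run;
```

With the Strata Statement, Sampsize= applies to each stratum separately, so the second step returns 6 observations in total for the three id groups.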

Summary

In this post, I demonstrate different methods to do random sampling without replacement in SAS. I present two data step techniques and a PROC SURVEYSELECT method. The data step methods work well and are sometimes more efficient than PROC SURVEYSELECT. In particular, if you want to draw a very small subset of the input data set, the second data step method is extremely efficient, because it reads only the sampled observations. However, PROC SURVEYSELECT is a procedure designed for the very purpose of random sampling. Thus, it can handle much more complicated designs, such as stratified random sampling, probability proportional to size sampling and so on. There are tons of examples to find online, and the procedure is well documented.

You can read a related post in Sampling With Replacement in SAS.

You can download the entire code from this post here.