Random Sampling in SAS With Replacement
Last week, I demonstrated how to do Random Sampling in SAS Without Replacement. The natural next topic is random sampling with replacement. The difference between the two approaches is that in the latter, we are allowed to pick the same observation more than once. This is also known as resampling. The technique is widely used in statistical bootstrap methods and simulation. I will provide links to further reading in these fields at the end of the post. In this post, I will demonstrate how to sample with replacement using the Data Step and PROC SURVEYSELECT.
In the examples to come, I will use the simple data set below.
data have;
input id x;
datalines;
1 1
1 2
1 3
2 4
2 5
2 6
2 7
3 8
3 9
3 10
3 11
3 12
;
First, let us see how to do random sampling in SAS with the Data Step. The technique is simpler than in the case without replacement. This is because we do not have to keep track of whether the observation has already been picked or not. Therefore, we simply use the _N_ Variable to loop as many times as the number of observations we want to sample. Next, I simulate a random integer between 1 and the total number of observations in the input data set have. We determine this number with the Nobs= Option in the Set Statement. In the same statement, I use the Point= Option to read only the observation with the number simulated in the variable p.
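data want;
   do _n_ = 1 to 5;                     /* draw 5 observations */
      p = rand("integer", n);           /* random row number between 1 and n */
      set have point = p nobs = n;      /* read only row p; n = total number of rows */
      output;
   end;
   stop;                                /* required, since point= never hits end of file */
run;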
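The step above draws a different sample each time it runs. If you want a reproducible sample, one option is to seed the random number stream before the first call to the Rand Function. The sketch below is a minimal variation on the step above. The seed value 12345 and the data set name want_seeded are arbitrary choices of mine, not part of the original example.

```sas
data want_seeded;
   call streaminit(12345);              /* arbitrary seed (assumption); fixes the random stream */
   do _n_ = 1 to 5;
      p = rand("integer", 1, n);        /* random row number between 1 and n */
      set have point = p nobs = n;
      output;
   end;
   stop;
run;
```

Running this step twice with the same seed should give the same five observations.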
This approach is the standard way in the Data Step. When the data set or the number of observations to be picked is very large, you may want to consider reading the entire data set into memory with the Sasfile Statement. This is done in the blog post Sample with replacement in SAS at The Do Loop Blog.
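As a rough sketch of the Sasfile idea mentioned above, the data set can be loaded into memory before the sampling step and released afterwards. This is my own minimal illustration, not the code from the linked post, and the output name want_mem is an assumption. For a data set as small as have, it will make no noticeable difference.

```sas
sasfile have load;                      /* hold work.have in memory */

data want_mem;
   do _n_ = 1 to 5;
      p = rand("integer", 1, n);
      set have point = p nobs = n;      /* random reads are now served from memory */
      output;
   end;
   stop;
run;

sasfile have close;                     /* release the memory again */
```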
Next, let us see how to do sampling with replacement in SAS using PROC SURVEYSELECT. I do this with Method=URS, which is short for Unrestricted Random Sampling. I set Sampsize=5 to tell SAS that we want to pick 5 observations from the input data set, some of which may be the same. I use the Outhits Option to request a separate output observation for each time an observation is selected, so that the output data set has as many observations as specified in the Sampsize= option. If I omit it, each selected observation appears only once in the output data set, even if it was chosen more than once. In that case, the number of times the observation was picked is stored in the NumberHits variable.
proc surveyselect data=have out=want(drop=Numberhits) noprint
                  method=urs outhits sampsize=5;
run;
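To see the effect of the Outhits Option, it can help to run the same procedure without it and keep the NumberHits variable. The sketch below does that and then expands the result back to one row per hit in a Data Step. The names want_hits and want_expanded, the loop variable i, and the seed value are my own choices for illustration.

```sas
/* Without outhits: one row per distinct selected observation */
proc surveyselect data=have out=want_hits noprint
                  method=urs sampsize=5 seed=12345;
run;

/* Expand to one row per hit, similar to what outhits produces */
data want_expanded(drop=i NumberHits);
   set want_hits;
   do i = 1 to NumberHits;
      output;
   end;
run;
```

The values of NumberHits in want_hits sum to 5, so want_expanded should again contain 5 observations.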
This post demonstrates how to do random sampling in SAS with replacement. We see how to do so with the Data Step and PROC SURVEYSELECT. The topic of random sampling is way bigger than I present in this (and last week's) blog post. The Data Step is a nice tool for simple random sampling techniques. However, for more complicated sampling methods, PROC SURVEYSELECT is the way to go.
In the field of statistics, random sampling techniques are very important, especially random sampling with replacement, which is often referred to as resampling. If this interests you, you should read the blog post Sampling with replacement: Now easier than ever in the SAS/IML language. Furthermore, read chapter 15 of the masterpiece book Simulating Data with SAS, on resampling and bootstrap methods.
You can download the entire code from this post here.