In previous articles, I wrote about random sampling in SAS with and without replacement. The approaches presented there are fairly basic: out-of-the-box SAS procedures such as Proc Surveyselect, plus a few variations on the data step. However, each of those approaches has at least one of the following three shortcomings. 1: We have to check for duplicate random picks. 2: We have to scan the full range of possible values. 3: We have to call a random number function more often than necessary. In this post, we will see an approach that lets us overcome all three obstacles.

This article is mostly based on the article Efficient DATA Step Random Sampling Out Of Thin Air by Paul Dorfman.

In the examples to come, I will use the simple data set below.

data have;
   do x = 1 to 1e6;
      y = x * 2;
      output;
   end;
run;

The usual approach – Proc Surveyselect

As a benchmark, let us recall the usual way to do random sampling without replacement in SAS. In the code below, we use Proc Surveyselect with method=srs (simple random sampling) and select 100k observations from the above data set.

proc surveyselect data=have out=want noprint
     method=srs
     sampsize=100000;
run;
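
If the benchmark has to return the same sample on every run, the SEED= option of Proc Surveyselect can be added. The snippet below is a small sketch of that variation. It assumes the have data set created above, and both the seed value 12345 and the output name want_seeded are arbitrary choices for illustration.

```sas
/* Same benchmark as above, with a fixed seed so the sample is repeatable. */
/* The seed value and the output data set name are illustrative only.      */
proc surveyselect data=have out=want_seeded noprint
     method=srs
     sampsize=100000
     seed=12345;
run;
```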

Random Sampling Using the Hash Object – Method 1

Let us see a naive approach to random sampling with the hash object. In the code below, I start by declaring the hash object h with a single key variable r. Next, I loop until q = 100k, where q counts the observations sampled so far. In each iteration, I simulate a random integer r between 1 and n (the number of observations in the input data) and attempt to insert r into h. The Add Method returns a non-zero value if the key already exists, so if the insert fails, we know that r has been encountered before. In that case, I use the Continue Statement to move on to the next iteration. If r has not been encountered before, I use the Set Statement with the Point=r option to read the r’th observation from the input data.

Lastly, we output and add 1 to q. At the bottom of the data step, I use the Stop Statement to prevent the data step from running forever, because a data step that reads with Point= never detects the end of the input data.

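data want(keep=x y);
   /* h stores every observation number that has already been picked */
   dcl hash h(hashexp : 20);
   h.definekey("r");
   h.definedone();
 
   /* hit counts draws, q counts accepted picks */
   do hit = 1 by 1 until (q = 100000);
      r = ceil(rand('uniform') * n);     /* random observation number in 1..n */
      if h.add() then continue;          /* non-zero rc: r is a duplicate     */
      set have point = r nobs = n;       /* read the r'th observation         */
      output;
      q + 1;
   end;
   stop;                                 /* point= never reaches end of file  */
run;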

This approach is nice and simple and performs reasonably well. However, when we sample k of n elements and k is almost as big as n, most of the values in the range have already been picked towards the end of the loop, so we spend a lot of time rejecting r. Let us see how to avoid this.
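
To make the cost of rejection concrete, here is a small sketch that runs the same logic on a deliberately tiny input and writes the number of draws and rejections to the log. The data set names small and want_small, the sizes 1000 and 990, and the variable rejected are assumptions made for illustration. They are not part of the original code.

```sas
/* Illustrative input: 1000 observations instead of 1e6 */
data small;
   do x = 1 to 1000;
      y = x * 2;
      output;
   end;
run;

data want_small(keep=x y);
   dcl hash h();
   h.definekey("r");
   h.definedone();

   k = 990;                               /* sample size, close to n = 1000  */
   do hit = 1 by 1 until (q = k);
      r = ceil(rand('uniform') * n);      /* candidate observation number    */
      if h.add() then continue;           /* non-zero rc: r already picked   */
      set small point = r nobs = n;
      output;
      q + 1;                              /* count accepted picks            */
   end;

   rejected = hit - k;                    /* draws that hit a duplicate      */
   put hit= k= rejected=;
   stop;
run;
```

The exact numbers vary from run to run, but the closer k is to n, the larger the share of rejected draws.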

Random Sampling Using the Hash Object – Method 2

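Next, let us see how to avoid spending time rejecting simulated values that we have already encountered. In the code below, I once again declare the hash object h, this time with p as the key and r as the data.

Now, I loop from 1 to 100k. In each iteration, I set s to the upper end of the remaining range, n - k + 1, and simulate a random integer p between 1 and s. Next, I look up p in h. If the lookup fails, p has not been remapped yet, so I set r = p. Otherwise, the Find Method loads the stored value into r. In either case, I read the r’th observation from the input data and output it.

Next, I look up s in h. If the Find Method returns a non-zero value, s is not in h, so I set r = s. In either case, I use the Replace Method to store r under the key p. This cleverly keeps track of the range of values to simulate from and of which values have already been encountered: the value that occupied the end of the range takes the place of the value just picked, and the range shrinks by one in the next iteration.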

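data want(keep=x y);
   /* h maps a slot p in the shrinking range to the observation number r it holds */
   dcl hash h(hashexp : 20);
   h.definekey("p");
   h.definedata("r");
   h.definedone();
 
   do k = 1 to 100000;
      s = n - k + 1;                        /* upper end of the remaining range */
      p = ceil(rand("uniform")*s);          /* random slot in 1..s              */
 
      if h.find (key : p) ne 0 then r = p ; /* slot not remapped: it holds p    */
      set have point = r nobs = n;          /* read the r'th observation        */
      output;
 
      if h.find(key : s) ne 0 then r = s;   /* value held by the last slot s    */
      h.replace(key : p, data : r);         /* move that value into slot p      */
   end;
   stop;                                    /* point= never reaches end of file */
run;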
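
The bookkeeping is easier to follow on a tiny input. The sketch below applies the same steps to ten observations and writes one trace line per iteration to the log. The data set names ten and want_ten, the seed 2024, and the helper variables picked, moved_in, and items are assumptions added for the trace. They do not appear in the original code.

```sas
/* Illustrative input: 10 observations */
data ten;
   do x = 1 to 10;
      y = x * 2;
      output;
   end;
run;

data want_ten(keep=x y);
   dcl hash h();
   h.definekey("p");
   h.definedata("r");
   h.definedone();

   call streaminit(2024);                   /* arbitrary seed, repeatable trace */
   do k = 1 to 5;
      s = n - k + 1;                        /* upper end of the remaining range */
      p = ceil(rand("uniform") * s);        /* random slot in 1..s              */

      if h.find(key : p) ne 0 then r = p;   /* slot not remapped: it holds p    */
      set ten point = r nobs = n;
      output;
      picked = r;                           /* remember the pick for the trace  */

      if h.find(key : s) ne 0 then r = s;   /* value held by the last slot s    */
      h.replace(key : p, data : r);         /* move that value into slot p      */
      moved_in = r;

      items = h.num_items;                  /* current number of hash entries   */
      put k= s= p= picked= moved_in= items=;
   end;
   stop;
run;
```

In the log, items never exceeds k, because each iteration performs exactly one Replace call. That is the memory bound in action: the hash object grows with the sample size, not with the size of the input.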

Summary

In this post, we explored how to do random sampling using the hash object in SAS. As it turns out, the second hash object method overcomes all three challenges presented at the top. We avoid checking for duplicate picks. We do not have to read the entire input data. Lastly, we do not have to call any random function unnecessarily. Furthermore, the memory needed is bounded by the number of elements we want to sample, not by the full range of possible elements to sample from.
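
A quick way to convince yourself that a sample contains no duplicates is to compare the row count with the number of distinct values of x. The query below is a minimal sketch that assumes the want data set from one of the steps above. The column aliases n_rows and n_distinct are illustrative.

```sas
/* The two counts should match if every observation was picked at most once */
proc sql;
   select count(*)          as n_rows,
          count(distinct x) as n_distinct
   from want;
quit;
```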

The points made in this post are from the article at the top. In turn, the inspiration for the article is from this SAS-L thread. I encourage you to dive into both.

Also, read the related post A Seven of Nine Fuzzy Matching Problem.

You can download the entire code from this post here.