Bitmap Search Technique in the SAS Data Step

There are many ways to search for a key value in SAS. Binary index searches, hash searches and so on. Today, I will investigate the bitmap search in the Data Step. Not many SAS programmers know about this technique. However, it is massively efficient when applicable. In the examples and theory to come, I will keep things simple. In the end of the post, I will link to material that explains more about the theory behind bitmapping and more general examples.

The Idea

When we search for a key value, we may want to return some data given that key. However, maybe the search is merely to verify the presence or absence of the key value. The presence or absence of a value can be represented by 0 or 1. In turn,SAS can represent 0 and 1 with a single bit. Yet, when we allocate numeric variables in SAS, we allocate 64 bit (8 byte) by default. Under most operating systems, SAS allocate 53 bit to the mantissa and 11 to the exponent. The bitmap lets us utilize the mantissa part of a numeric value to keep track of the of multiple key values with a single numeric value. One key for each bit in the mantissa (Actually in the version below, I only use 32 bit. You will see why).

For simplicity, let us assume that we have a 10 bit numeric variable with the value 0000000000. I want this to keep track of the presence/absence of a few key variables 2 and 7. I can do this by setting the value to 0001000010 (read right to left). With a little binary logic, this is fairly easy. If I want to add 1 to the second binary digit from the right, I add 2^(2-1)=2. 2 in this context is 0000000010 in binary. The same goes for 7. We add 2^(7-1)=64=0001000000. Naturally we do not want to add it twice, so we have to handle duplicates, but we will get to that. The most important thing to understand is that we can keep track of the presence of many (32) keys within the same 8 byte value. Which does not only reduce memory significantly, but also lets us utilize direct addressing in our search.

Verify the logic above with the small code snippet

data _null_;
   a=0;
   do k=2, 7;
      a+2**(k-1);
   end;
   put a= binary10.;
run;

Test Data Set

Now, let us expand the logic above and see a real word example. Assume that we have 2 SAS data sets. One Large. One Small. Large contains variable k and d. Small is a constructed subset of Large that contains only k. In the example below, I want to extract all observations in Large where k is present in Small. Neither data sets are sorted.

/* Large data set */
proc plan seed=0;
  factors k=1e7 / noprint;
  output out=Large;
run;quit;
 
data Large;
   set Large;
   d=rand('integer', 1, 100);
run;
 
/* Small data set */
proc surveyselect data=Large out=Small seed=123 noprint
     method=srs
     sampsize=5000000;
run;

A Bitmap Search

Let us see how how we can perform a bitmap search in the SAS Data Step. I divide the code below into 7 sections and describe each step below. First, I initialize two temporary arrays and fill them with values. Next, I perform a mapping operation and a search operation in two separate DoW Loops. I describe each step below.

Prepare the Arrays

  1. First, I create two temporary arrays. One to hold the key value representations in each bit (bm). And one to hold the binary powers to be added to each element in bm (bi). I set the number of elements in bm to the maximum key value possible divided by 32 plus 1 (to be sure). This way, I reduce the memory footprint substantially.
  2. Next, I set the initial values of bm and bi. Each element of bm is set to 0. I set the elements in bi to the binary powers that are added later to turn individual bits on in bm. I could also omit bi entirely, and compute the value each time, but it is faster to retrieve the value from memory than re-compute it each time.

The First DoW Loop

  1. In the first DoW Loop, I start by reading the Small data set. For each key value k, I determine the element in bm in which we wil turn on a single bit, which represents exactly this value of k. I do this by dividing by 32 and take the integer value. Consequently, k values from 1 to 32 will turn our attention to bm[1], from 33 to 64 bm[2] and so on.
  2. Next, I determine which bit within the found element of bm to turn on. I do this by finding the remainder of dividing k by 32. With our example from above, if k=34, then we look at the second bit of the second element of bm.
  3. Finally, I turn on the relevant bit of the determined element. I do not want to turn on a bit that is already on. Therefor, I use the BOR Function to do this. The BOR Function returns the binary or of two binary input. So if the bit is 1, it stays 1. If it is 0 it is set to 1.

The Second DoW Loop

  1. In the second DoW Loop, I read the Large Data Set. Step 6 and the following line does exactly the same as step 3 and 4. It turns our attention to the relevant bit within the relevant element of bm.
  2. Finally, I check whether the bit is on or off. I use the Band Function to return the bitwise logical and of bm[x] and bi[P]. The bitwise logical and operator returns zero if either bit considered is zero. Consequently, if the returned value is not 0, then the bit is turned on. That means, that the key value exists in Small. Therefore, I output the observation.
data bmsearch(keep=k d);
   array bm {0 : 312501} _temporary_;                           /*  1  */
   array bi {0 : 32}     _temporary_;
 
   do i=0 to 312500; bm[i]=0;      end;                         /*  2  */
   do j=0 to 32;     bi[j] = 2**j; end;
 
   do until (lr1);
      set small end=lr1;
      x = int(k / 32);                                          /*  1  */
      P = mod(k, 32);                                           /*  2  */
      bm[x] = bor(bm[x], bi[P]);                                /*  3  */
   end;
 
   do until (lr2);
      set large end=lr2;
      x = int(k / 32);                                          /*  1  */
      P = mod(k, 32);                                       
      if band(bm[x], bi[P]) ne 0 then output;                   /*  2  */
   end;
run;

The bitmap search in the data step above runs in about 2 seconds. We can compare that to other search methods that produce the same result. A hash object approach runs in about 20 seconds and a common SQL approach runs in about 30 seconds on my laptop. You can see the code at the link in the bottom. Furthermore, the bitmap search is extremely memory efficient as it uses about 30 times less memory than both alternatives.

Summary and Further Reading

In this post, I have demonstrated how to perform a bitmap search in the SAS Data Step. I briefly introduce the idea and theory. Then I demonstrate the code in a simple setting. Finally, we see that we obtain the desired result at speeds comparable to direct addressing. Furthermore we utilize memory more efficiently than any alternatives out there. As a related post, see An Array Hashing Scheme in SAS.

Confused? I sure was then I first encountered the bitmap. In this post, I intentionally keep things very simple. For example, I use only numeric, integer keys. However, there is much more explanation and detail present in the articles Re-Mapping A Bitmap and Table Look-Up by Direct Addressing: Key-Indexing — Bitmapping — Hashing by Paul M. Dorfman and Lessia S. Shajenko. Especially the first article is a must-read is you are interested in learning more about bitmaps in SAS. Here, the theory and code is vastly expanded from the simple example in this post. You will see bitmap searches of general character keys. Furthermore, you will see how to utilize memory even further by exploiting the entire mantissa, rather than just the 32 bit that the binary logical functions can handle in SAS.

You can download the entire code from this post.