Closest Value Match in SAS Using the Hash Object

Last week, I blogged about a Nearest Neighbor Match in SAS. We demonstrated that we can use the Setcur Method to perform very fast lookups based on the closest value of a key. Furthermore, we saw that this approach vastly outperforms the approach in presented in the blog post Fuzzy Merge in SAS with the Hash Object. Today, we expand the logic to be able to handle by-groups. In many lookup situtation, we do not want to cross group boundaries.

In the examples to come, I will use the sample data below.

/* Handling By-Groups */
data one;
input id n1;
datalines;
1 3.4 
1 7.9 
1 6.1 
2 4.7 
2 9.2 
3 6.3 
3 1.1 
3 8.3 
;
 
proc plan;
   factors id = 3 n2 = 5 / noprint;
   output out=two;
run;quit;

Closest Value Match By Group in SAS

The goal is this. Match the closest value of n1 in one to n2 in two within each ID. Meaning that 7.9 is matched to 8 for ID = 1 because no other number is closer within that ID. So the desired result looks like this

Obs  id  n2  closest 
1    1   1   3.4 
2    1   3   3.4 
3    1   4   3.4 
4    1   2   3.4 
5    1   5   6.1 
6    2   5   4.7 
7    2   1   4.7 
8    2   3   4.7 
9    2   2   4.7 
10   2   4   4.7 
11   3   4   6.3 
12   3   2   1.1 
13   3   1   1.1 
14   3   3   1.1 
15   3   5   6.3

First, let us recall the Do_Over method from the previous post about fuzzy merging. Ne nice thing about this approach is that I do not have to add any special handling of crossing by-groups. The Do_Over Method takes care of it.

The approach here is simple. We create a hash object h with id as key and n1 as data. I intialize n1 to missing and c to a veru big numeric value. Next, I use the Do_Over Method to iterate over all matching ID’s in h. I measure the distance for each pair of n1 and n2 and set the variable closest to the minimum distance.

data want;
 
   if _N_ = 1 then do;
      dcl hash h(dataset : "one", multidata : "Y");
      h.definekey("id");
      h.definedata("n1");
      h.definedone();
   end;
 
   set two;
   n1 = .; c = constant("big");
 
   do while (h.do_over() = 0);
      if abs(n2 - n1) < c then do;
         c = abs(n2 - n1);
         nn = n1;
      end;
   end;
 
   closest = nn;
run;

Closest Value Match Using the Setcur Method

While the approach above is nice and simple, it does require us to traverse the entire range of n2 values for each ID in the input data. Instead, let us utilize the fact that the Setcur Method lets us jump to a specific point in h and traverse forward or backward from there.

In the code below, I create the hash object h. Importantly, we use the Ordered : “Y” argument. This ensures that we traverse h in ascending order. We create the hash iterator i and link it to h. Next, We check if an exact match is present with the Check() Method. Obviously, there can be no closer match.

If no exact match is present, we insert the key value into h using the Add() Method. Now, we can use the Setcur Method to jump into h between the two possible closest values. Namely, the next and previous value. The Ordered : “Y” argument ensures this. We retrieve the previous and next values and check that we do not cross group boundaries. We want to stay within the same ID.

Finally, we find out what value is closest to n2 and assign it to the closest variable. Needless to say, we remove the n2 value from h at the end of the code.

data want(keep = id n2 closest);
   if _N_ = 1 then do;
      if 0 then set one;
      dcl hash h(dataset : "one(rename = (id = _id)", ordered : "Y");
      h.definekey("_id", "n1");
      h.definedone();
      dcl hiter i("h");
   end;
 
   set two;
   _id = .;
 
   if h.check(key : id, key : n2) = 0 then closest = n2;
 
   else do;
      h.add(key : id, key : n2, data : id, data : n2);
 
      if i.setcur(key : id, key : n2) = 0 then
         if i.prev() = 0 & id = _id then pn1 = n1;
 
      if i.setcur(key : id, key : n2) = 0 then
         if i.next() = 0 & id = _id then nn1 = n1;
 
      if      nmiss(nn1) then idx = 1;
      else if nmiss(pn1) then idx = 2;
      else idx = 1 + (n2 - pn1 > nn1 - n2);
 
      closest = choosen(idx, pn1, nn1);
 
      h.remove(key : id, key : n2);
   end;
run;

Larger Data

Let us compare the two approaches with larger sample data. Below, we create the example data sets one and two.

data one;
   do id = 1 to 10;
      do _N_ = 1 to 10;
         n1 = .1 * ceil(rand('uniform') * 1000);
         output;
      end;
   end;
run;
 
proc plan;
   factors id = 10 n2 = 100 / noprint;
   output out=two;
run;quit;

First, let us use the Do_Over approach. This takes about 30 seconds to run on my system.

data want1(keep = id n2 closest);
 
   if _N_ = 1 then do;
      dcl hash h(dataset : "one", multidata : "Y");
      h.definekey("id");
      h.definedata("n1");
      h.definedone();
   end;
 
   set two;
   n1 = .; c = constant("big");
 
   do while (h.do_over() = 0);
      if abs(n2 - n1) < c then do;
         c = abs(n2 - n1);
         nn = n1;
      end;
   end;
 
   closest = nn;
run;

Next, let us run the Setcur approach. This takes about 3 seconds to run on my system. A quite sizable reduction in run-time.

data want2(keep = id n2 closest);
   if _N_ = 1 then do;
      if 0 then set one;
      dcl hash h(dataset : "one(rename = (id = _id)", ordered : "Y");
      h.definekey("_id", "n1");
      h.definedone();
      dcl hiter i("h");
   end;
 
   set two;
   _id = .;
 
   if h.check(key : id, key : n2) = 0 then closest = n2;
 
   else do;
      h.add(key : id, key : n2, data : id, data : n2);
 
      if i.setcur(key : id, key : n2) = 0 then 
         if i.prev() = 0 & id = _id then pn1 = n1;
 
      if i.setcur(key : id, key : n2) = 0 then 
         if i.next() = 0 & id = _id then nn1 = n1;
 
      if      nmiss(nn1) then idx = 1;
      else if nmiss(pn1) then idx = 2;
      else idx = 1 + (n2 - pn1 > nn1 - n2);
 
      closest = choosen(idx, pn1, nn1);
 
      h.remove(key : id, key : n2);
   end;
run;

Finally, you can verify that the results are identical with the snippet here.

proc compare base = want1 comp = want2;run;

Summary

In this post, we investigate how to perform closest-value matching between two data sets. Even when the data contains groups. This makes the process a bit more complicated, but not much. We see that the Setcur method scales quite well compared to other approaches because we utilize direct addressing, instead of traversing the data unnecessarily.

You can download the entire code from this post here.

Closest Value Match in SAS Using the Hash Object – Part 2

Closest Value Match By Group in SAS

Closest Value Match Using the Setcur Method

Larger Data

Summary